sokrypton / ColabFold

Making Protein folding accessible to all!
MIT License

Invalid database read error in colabfold_search #276

Open aaronkollasch opened 2 years ago

aaronkollasch commented 2 years ago

Expected Behavior

Hello, I am trying to run batch searches against ColabFoldDB on a SLURM cluster, following the MSA instructions in the README.

Current Behavior

colabfold_search fails at the expandaln step with the error:

Invalid database read for database data file=[db_folder]/uniref30_2103_db.idx, database index=[db_folder]/uniref30_2103_db.idx.index
getData: local id (4294967295) >= db size (22)

Full log file: colabfold_search_output.txt
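For what it's worth, the local id in that message looks like a sentinel rather than a real sequence id: 4294967295 is 2^32 - 1, i.e. UINT32_MAX, the usual "not found" value for an unsigned 32-bit lookup (interpreting it as a failed index lookup in MMseqs2 is my assumption, but it would fit a corrupt or truncated .idx). A quick arithmetic check of the value:

```shell
# 4294967295 == 2^32 - 1 (UINT32_MAX), the typical unsigned-32-bit
# "lookup failed" sentinel -- so the id itself is not a real entry id.
printf 'UINT32_MAX = %s\n' "$(( (1 << 32) - 1 ))"
# prints: UINT32_MAX = 4294967295
```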

Steps to Reproduce (for bugs)

  1. bash setup_databases.sh [db_folder]
     Note: mmseqs createindex was run with --split-memory-limit 128G, as mmseqs doesn't otherwise detect the SLURM job's memory limit.
  2. colabfold_search --db-load-mode 0 --mmseqs mmseqs_5185d3c/bin/mmseqs batch_1/input_sequences.fa [db_folder] batch_1/result_s8
     Input sequences: input_sequences.fa
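As a sketch, step 2 can be wrapped in a SLURM job script along these lines. The job name, the --mem value, and the DB path are placeholders, not values from the actual run; RUN=echo makes it a dry run so the command is only printed:

```shell
#!/bin/bash
#SBATCH --job-name=colabfold_search   # placeholder job name
#SBATCH --mem=250G                    # placeholder; match your cluster's limit

# Dry run by default: RUN=echo prints the command instead of executing it.
# Set RUN= (empty) to actually run the search.
RUN=${RUN:-echo}
DB=db_folder                          # placeholder for the database path

$RUN colabfold_search --db-load-mode 0 \
    --mmseqs mmseqs_5185d3c/bin/mmseqs \
    batch_1/input_sequences.fa "$DB" batch_1/result_s8
```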

It looks like colabfold_search uses --split-memory-limit 0 in the prefilter steps, and possibly in later steps as well. I don't think this caused the issue, since the job only reached 53 GB of memory usage before it errored, but it would be nice to be able to set this option to keep the job from being killed.
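If colabfold_search does hard-code --split-memory-limit 0 there, one workaround (a sketch, untested) would be to run the memory-hungry mmseqs stage by hand with an explicit cap. --split-memory-limit and --db-load-mode are real mmseqs options, but the query/result/tmp database names below are placeholders:

```shell
# Dry run by default: RUN=echo only prints the command; clear RUN to execute.
RUN=${RUN:-echo}
MMSEQS=mmseqs_5185d3c/bin/mmseqs

# Placeholder DB names; cap split memory safely below the job's limit.
$RUN "$MMSEQS" search query_db uniref30_2103_db result_db tmp \
    --split-memory-limit 200G --db-load-mode 0
```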

Context

I'm looking to perform a batch search, and the cluster jobs have a 250 GiB memory limit, so I'm using --db-load-mode 0; let me know if that isn't the best option.

Your Environment

@thomashopf

aaronkollasch commented 2 years ago

I recreated the index on a different machine without --split-memory-limit 128G, and the error went away. Perhaps it was a one-off corruption of the index, an issue with specifying --split-memory-limit, or something specific to the cluster.
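For anyone hitting the same error, the rebuild that made it go away amounts to re-running createindex without a split limit (a sketch; whether the split limit itself caused the corruption is still a guess, and the tmp directory name is a placeholder):

```shell
# Dry run by default: RUN=echo only prints the command; clear RUN to execute.
RUN=${RUN:-echo}
MMSEQS=mmseqs_5185d3c/bin/mmseqs

# Rebuild the index without --split-memory-limit, as on the second machine.
$RUN "$MMSEQS" createindex uniref30_2103_db tmp
```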