soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com

Creating index for ColabDB failed on cluster. #583

Open IlyesAbdelhamid opened 2 years ago

IlyesAbdelhamid commented 2 years ago

Hello,

I've been encountering an issue when creating the index for the ColabFoldDB. It looks like a memory consumption issue. Could you help me with this matter, please? Thank you in advance for your help.

Sincerely, Ilyes

Expected Behavior

An index for colabfold_envdb_202108_db is computed so the database can be read in quickly.

Current Behavior

Error: indexdb died
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=27501792.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

Steps to Reproduce (for bugs)

I am using the following commands to build the databases, as described at https://colabfold.mmseqs.com/. The UniRef30 database was set up successfully, but the ColabFoldDB was not.

wget https://raw.githubusercontent.com/sokrypton/ColabFold/main/setup_databases.sh
chmod +x setup_databases.sh
./setup_databases.sh database/

MMseqs Output (for bugs)

MMseqs Version:                 3b9cf88179737563acfdb83b516c0b5219cc531e
Seed substitution matrix        aa:VTML80.out,nucl:nucleotide.out
k-mer length                    0
Alphabet size                   aa:21,nucl:5
Compositional bias              1
Compositional bias scale        1
Max sequence length             65535
Max results per query           300
Mask residues                   1
Mask residues probability       0.9
Mask lower case residues        0
Spaced k-mers                   1
Spaced k-mer pattern
Sensitivity                     7.5
k-score                         seq:0,prof:0
Check compatible                0
Search type                     0
Split database                  1
Split memory limit              0
Verbosity                       3
Threads                         56
Min codons in orf               30
Max codons in length            32734
Max orf gaps                    2147483647
Contig start mode               2
Contig end mode                 2
Orf start mode                  1
Forward frames                  1,2,3
Reverse frames                  1,2,3
Translation table               1
Translate orf                   0
Use all table starts            false
Offset of numeric ids           0
Create lookup                   0
Compressed                      0
Add orf stop                    false
Overlap between sequences       0
Sequence split mode             1
Header split mode               0
Strand selection                1
Remove temporary files          true

createindex colabfold_envdb_202108_db tmp2 --remove-tmp-files 1 --split 1

MMseqs Version:                 3b9cf88179737563acfdb83b516c0b5219cc531e
Seed substitution matrix        aa:VTML80.out,nucl:nucleotide.out
k-mer length                    0
Alphabet size                   aa:21,nucl:5
Compositional bias              1
Compositional bias scale        1
Max sequence length             65535
Max results per query           300
Mask residues                   1
Mask residues probability       0.9
Mask lower case residues        0
Spaced k-mers                   1
Spaced k-mer pattern
Sensitivity                     7.5
k-score                         seq:0,prof:0
Check compatible                0
Search type                     0
Split database                  1
Split memory limit              0
Verbosity                       3
Threads                         56
Min codons in orf               30
Max codons in length            32734
Max orf gaps                    2147483647
Contig start mode               2
Contig end mode                 2
Orf start mode                  1
Forward frames                  1,2,3
Reverse frames                  1,2,3
Translation table               1
Translate orf                   0
Use all table starts            false
Offset of numeric ids           0
Create lookup                   0
Compressed                      0
Add orf stop                    false
Overlap between sequences       0
Sequence split mode             1
Header split mode               0
Strand selection                1
Remove temporary files          true

indexdb colabfold_envdb_202108_db colabfold_envdb_202108_db --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -k 0 --alph-size aa:21,nucl:5 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-seq-len 65535 --max-seqs 300 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --spaced-kmer-mode 1 -s 7.5 --k-score seq:0,prof:0 --check-compatible 0 --search-type 0 --split 1 --split-memory-limit 0 -v 3 --threads 56

Estimated memory consumption: 780G
Write VERSION (0)
Write META (1)
Write SCOREMATRIX3MER (4)
Write SCOREMATRIX2MER (3)
Write SCOREMATRIXNAME (2)
Write SPACEDPATTERN (23)
Write GENERATOR (22)
Write DBR1INDEX (5)
Write DBR1DATA (6)
Write DBR2INDEX (7)
Write DBR2DATA (8)
Write HDR1INDEX (18)
Write HDR1DATA (19)
Write ALNINDEX (24)
Write ALNDATA (25)
Index table: counting k-mers
[=================================================================
tmp2/7152678087979496025/createindex.sh: line 56: 37309 Killed "$MMSEQS" $INDEXER "$INPUT" "$INPUT" ${INDEX_PAR}
Error: indexdb died
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=27501792.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
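
(For reference, the indexdb call above runs with Split memory limit 0, i.e. no cap. If an index build were still attempted on a node this size, a sketch along the following lines would cap its memory via --split-memory-limit; the 200G value, and whether a split index is acceptable for the downstream ColabFold searches, are assumptions rather than tested settings.)

# Sketch only: cap the index build's memory so MMseqs2 splits the index instead of
# building it in one ~780G pass; 200G is an assumed value, not a tested recommendation.
# Dropping the explicit --split 1 lets MMseqs2 pick the number of splits automatically.
mmseqs createindex colabfold_envdb_202108_db tmp2 --remove-tmp-files 1 --split-memory-limit 200G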

Your Environment

I am running the script on a cluster. You will find below the batch script parameters:

#!/bin/bash
#SBATCH --job-name Install_ColabFold_DB
#SBATCH --account=def-someuser
#SBATCH --time 24:00:00       ### (HH:MM:SS) the job will expire after this time, the maximum is 168:00:00
#SBATCH -N 1                  ### number of nodes (1 node -> several CPUs)
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 24
#SBATCH --mem-per-cpu 10000
#SBATCH -A p_linkpredic
#SBATCH -e %j.err             ### redirects stderr to this file
#SBATCH -o %j.out             ### redirects standard output stdout to this file
#SBATCH -p haswell            ### types of nodes on taurus: west, dandy, smp, gpu
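
For context, the allocation above gives roughly 24 x 10000 MB ≈ 240 GB, well below the ~780 GB the indexdb log estimates. A header along the following lines would be needed to keep the index build on SLURM; whether such a large-memory node and partition exist in this cluster is an assumption.

#!/bin/bash
# Hypothetical header for the index-building job; partition and node availability are assumptions.
#SBATCH --job-name Install_ColabFold_DB
#SBATCH --time 24:00:00
#SBATCH -N 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 24
#SBATCH --mem 1000G           # request ~1 TB in total instead of --mem-per-cpu 10000 (≈240 GB)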

milot-mirdita commented 2 years ago

You need a machine with 1 TB of RAM to create a precomputed index for the ColabFoldDB.

Are you actually planning to run a lot of small queries (like the ColabFold server does)? Or are you just planning to run colabfold_search/colabfold_batch with a bunch of proteins at the same time?

If it's the second, I recommend not creating an index at all. A search without a precomputed index creates the index on the fly and has much lower resource requirements.

Precomputing the index only makes sense for something like our API server, where we repeatedly serve many small queries and want to pay the indexing cost only once.
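
A minimal sketch of the index-free route, assuming the databases were downloaded into database/ with setup_databases.sh and that the query and output names are placeholders:

# Sketch: with no .idx files present, the k-mer index is built on the fly during the
# search, which needs far less RAM than the ~780G precomputed index.
# queries.fasta and msas/ are placeholder names; --db-load-mode 2 (mmap) is one
# reasonable setting here, not a required one.
colabfold_search --db-load-mode 2 --threads 24 queries.fasta database/ msas/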

IlyesAbdelhamid commented 2 years ago

Thank you for the prompt reply! OK, I see. The idea is to run colabfold_search/colabfold_batch with a bunch of proteins at the same time. I've been using the API server, but some of my jobs hit rate limits. To avoid this, I decided to build the databases and search against them locally.

Sincerely, Ilyes

milot-mirdita commented 2 years ago

Then I would recommend deleting the already created precomputed index (rm *.idx*) and just using colabfold_search without the precomputed index.
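
Concretely, something like the following, with the directory name assumed from the setup step above:

# Assumes the databases live in database/ as created by setup_databases.sh.
rm database/*.idx*        # remove the partially built precomputed index files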

IlyesAbdelhamid commented 2 years ago

I wanted to compare the running time of the MSA search against the local databases with that of the API server. So I gave colabfold_search a FASTA file containing two protein sequences. It has been running for over two hours now with the option --db-load-mode 3, while the ColabFold server managed a time of 45 min. Is there any way to make the local MSA search as fast as the remote server?

Sincerely, Ilyes