Clustering stuck after merging splits with message about main memory

cef61 commented 1 year ago

Expected Behaviour

Unknown

Current Behaviour

I am trying to re-create the clustered nr database currently featured on the BLAST site. The cluster step appears to stall after merging the split files, and I get the message "Can not touch 215222407074 into main memory". I have 188G of RAM and 63 cores available. I have tried to reduce memory usage with the --split-memory-limit 70G, --split-mode 2, --split 2, and --compressed 1 options, but this does not appear to have helped. This is my first time using MMseqs, so any help would be much appreciated.
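As a rough sanity check on these numbers, the size in the message can be compared with the available RAM. This is only a sketch under two assumptions that the log does not confirm: that 215222407074 is a size in bytes, and that "188G" means GiB. The names used are illustrative.

```python
# Assumption: the number in "Can not touch ... into main memory" is a byte count.
TOUCH_BYTES = 215_222_407_074
# Assumption: the reported "188G of RAM" means 188 GiB.
RAM_GIB = 188


def bytes_to_gib(n: int) -> float:
    """Convert a byte count to GiB (1 GiB = 2**30 bytes)."""
    return n / 2**30


needed_gib = bytes_to_gib(TOUCH_BYTES)
print(f"requested: {needed_gib:.1f} GiB, available: {RAM_GIB} GiB")
print("fits in RAM:", needed_gib <= RAM_GIB)
```

Under those assumptions the requested size is about 200 GiB, which is larger than the 188G of RAM reported above.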

Steps to Reproduce (for bugs)

mmseqs cluster --min-seq-id 0.9 --cov-mode 0 -c 0.9 DB DB_clu tmp --remove-tmp-files --threads 40 --split-memory-limit 70G --split-mode 2 --split 4 --compressed 1

MMseqs Output (for bugs)

Create directory tmp
cluster --min-seq-id 0.9 --cov-mode 0 -c 0.9 DB DB_clu tmp --remove-tmp-files --threads 40 --split-memory-limit 70G --split-mode 2 --split 4 --compressed 1

MMseqs Version: bdd169b3e285299cab792e62d60eb1f4e4e434d2
Substitution matrix aa:blosum62.out,nucl:nucleotide.out
Seed substitution matrix aa:VTML80.out,nucl:nucleotide.out
Sensitivity 4
k-mer length 0
k-score seq:2147483647,prof:2147483647
Alphabet size aa:21,nucl:5
Max sequence length 65535
Max results per query 20
Split database 4
Split mode 2
Split memory limit 70G
Coverage threshold 0.9
Coverage mode 0
Compositional bias 1
Compositional bias 1
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask residues probability 0.9
Mask lower case residues 0
Minimum diagonal score 15
Selected taxa
Include identical seq. id. false
Spaced k-mers 1
Preload mode 0
Pseudo count a substitution:1.100,context:1.400
Pseudo count b substitution:4.100,context:5.800
Spaced k-mer pattern
Local temporary path
Threads 40
Compressed 1
Verbosity 3
Add backtrace false
Alignment mode 3
Alignment mode 0
Allow wrapped scoring false
E-value threshold 0.001
Seq. id. threshold 0.9
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Max reject 2147483647
Max accept 2147483647
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Correlation score weight 0
Gap open cost aa:11,nucl:5
Gap extension cost aa:1,nucl:2
Zdrop 40
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Cluster mode 0
Max connected component depth 1000
Similarity type 2
Weight file name
Cluster Weight threshold 0.9
Single step clustering false
Cascaded clustering steps 3
Cluster reassign false
Remove temporary files true
Force restart with latest tmp false
MPI runner
k-mers per sequence 21
Scale k-mers per sequence aa:0.000,nucl:0.200
Adjust k-mer length false
Shift hash 67
Include only extendable false
Skip repeating k-mers false

Set cluster sensitivity to -s 1.000000
Set cluster mode SET COVER
Set cluster iterations to 1
linclust DB tmp/10260956542131223380/clu_redundancy tmp/10260956542131223380/linclust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 40 --compressed 1 -v 3 --cluster-weight-threshold 0.9 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.9 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --alph-size aa:13,nucl:5 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 -k 0 --hash-shift 67 --split-memory-limit 70G --include-only-extendable 0 --ignore-multi-kmer 0 --rescore-mode 0 --filter-hits 0 --sort-results 0 --remove-tmp-files 1 --force-reuse 0

kmermatcher DB tmp/10260956542131223380/linclust/4311072182387952617/pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --alph-size aa:13,nucl:5 --min-seq-id 0.9 --kmer-per-seq 21 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.9 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 70G --include-only-extendable 0 --ignore-multi-kmer 0 --threads 40 --compressed 1 -v 3 --cluster-weight-threshold 0.9

Database size: 541124045 type: Aminoacid
Reduced amino acid alphabet: (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X)

Not enough memory to process at once need to split
[=================================================================] 100.00% 541.12M 9m 42s 360ms
Process file into 4 parts
Generate k-mers list for 1 split
[=================================================================] 100.00% 541.12M 11m 11s 8ms
Sort kmer 0h 0m 27s 593ms
Sort by rep. sequence 0h 0m 10s 91ms
Generate k-mers list for 2 split
[=================================================================] 100.00% 541.12M 11m 10s 926ms
Sort kmer 0h 0m 25s 859ms
Sort by rep. sequence 0h 0m 10s 403ms
Generate k-mers list for 3 split
[=================================================================] 100.00% 541.12M 11m 3s 10ms
Sort kmer 0h 0m 24s 363ms
Sort by rep. sequence 0h 0m 9s 647ms
Generate k-mers list for 4 split
[=================================================================] 100.00% 541.12M 11m 6s 122ms
Sort kmer 0h 0m 14s 827ms
Sort by rep. sequence 0h 0m 3s 410ms
Merge splits ...
Time for fill: 0h 14m 22s 381ms
Time for merging to pref: 0h 0m 0s 0ms
Time for processing: 1h 16m 36s 224ms
rescorediagonal DB DB tmp/10260956542131223380/linclust/4311072182387952617/pref tmp/10260956542131223380/linclust/4311072182387952617/pref_rescore1 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.9 -a 0 --cov-mode 0 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 40 --compressed 1 -v 3

Can not touch 215222407074 into main memory
[> ] 0.00% 1 eta -

LittletreeZou commented 8 months ago

I ran into exactly the same issue. Could you please share your solution, or your experience in solving it? My program has been stuck for 12 hours without printing anything after the "Can not touch 215222407074 into main memory" message.

xtj87515 commented 3 weeks ago

> I ran into exactly the same issue. Could you please share your solution, or your experience in solving it? My program has been stuck for 12 hours without printing anything after the "Can not touch 215222407074 into main memory" message.

Did you ever fix that? I'm having similar issues too, but with easy-search.

soedinglab / MMseqs2