Heavy slowdown with no output while running mmseqs linclust on a large database with insufficient RAM

Alvaro-Nostrum commented 3 months ago

Expected Behavior

Summary: Running linclust or clust on a very large database leads to a heavy slowdown in the rescorediagonal step. I expected the job to proceed much faster. It emits the warning "Can not touch X into main memory" and then keeps running.

Current Behavior

The job has been stuck at rescorediagonal with no output for several hours. It is, however, still accessing the index files inside the temporary folder. Is there any way to fix this or speed it up?
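One way to tell a stalled job from a very slow one is to check whether the files under the temporary folder are still growing. The sketch below is only an illustration: the helper names `total_size` and `growth` are made up for this example, and it assumes that rescorediagonal writes its results incrementally into the temporary folder, which the log does not confirm. Reads alone (for example, paging the indexes in from disk) will not change the reported size.

```python
import os
import sys
import time


def total_size(directory):
    """Return the summed size in bytes of all regular files under directory."""
    total = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # The file disappeared between listing and stat; skip it.
                pass
    return total


def growth(directory, interval_s=60.0):
    """Return how many bytes the directory grew by during interval_s seconds."""
    before = total_size(directory)
    time.sleep(interval_s)
    return total_size(directory) - before


# Only runs when a directory is passed on the command line,
# e.g. `python check_growth.py tmp` (script name is hypothetical).
if __name__ == "__main__" and len(sys.argv) > 1:
    delta = growth(sys.argv[1])
    print(f"{sys.argv[1]} grew by {delta} bytes in the last minute")
```

A growth of zero over several intervals would be consistent with a stall or with a phase that only reads; a steady nonzero growth suggests the job is progressing, just slowly.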

MMSeqs Output

linclust JGI JGI_nr tmp --cluster-mode 2 --cov-mode 1 -c 0.99 --min-seq-id 0.95 --split-memory-limit 300G

MMseqs Version: c498f51053e2f550a4ab4bee534b0ef80033a2b3
Cluster mode 2
Max connected component depth 1000
Similarity type 2
Threads 96
Compressed 0
Verbosity 3
Weight file name
Cluster Weight threshold 0.9
Substitution matrix aa:blosum62.out,nucl:nucleotide.out
Add backtrace false
Alignment mode 2
Alignment mode 0
Allow wrapped scoring false
E-value threshold 0.001
Seq. id. threshold 0.95
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Coverage threshold 0.99
Coverage mode 1
Max sequence length 65535
Compositional bias 1
Compositional bias 1
Max reject 2147483647
Max accept 2147483647
Include identical seq. id. false
Preload mode 0
Pseudo count a substitution:1.100,context:1.400
Pseudo count b substitution:4.100,context:5.800
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Correlation score weight 0
Gap open cost aa:11,nucl:5
Gap extension cost aa:1,nucl:2
Zdrop 40
Alphabet size aa:21,nucl:5
k-mers per sequence 21
Spaced k-mers 0
Spaced k-mer pattern
Scale k-mers per sequence aa:0.000,nucl:0.200
Adjust k-mer length false
Mask residues 0
Mask residues probability 0.9
Mask lower case residues 0
k-mer length 0
Shift hash 67
Split memory limit 300G
Include only extendable false
Skip repeating k-mers false
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Remove temporary files false
Force restart with latest tmp false
MPI runner

kmermatcher JGI tmp/14756877054557405347/pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --alph-size aa:13,nucl:5 --min-seq-id 0.95 --kmer-per-seq 21 --spaced-kmer-mode 0 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 0 --cov-mode 1 -k 0 -c 0.99 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 300G --include-only-extendable 0 --ignore-multi-kmer 0 --threads 96 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Database size: 1311052782 type: Aminoacid
Reduced amino acid alphabet: (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X)

Not enough memory to process at once need to split
[=================================================================] 1.31B 2h 26m 20s 97ms
Process file into 2 parts
Generate k-mers list for 1 split
[=================================================================] 1.31B 2h 34m 42s 85ms
Sort kmer 0h 0m 52s 653ms
Sort by rep. sequence 0h 0m 31s 645ms
Generate k-mers list for 2 split
[=================================================================] 1.31B 2h 36m 22s 543ms
Sort kmer 0h 0m 44s 690ms
Sort by rep. sequence 0h 0m 26s 121ms
Merge splits ...
Time for fill: 1h 31m 44s 960ms
Time for merging to pref: 0h 0m 0s 6ms
Time for processing: 10h 13m 54s 576ms

rescorediagonal JGI JGI tmp/14756877054557405347/pref tmp/14756877054557405347/pref_rescore1 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.99 -a 0 --cov-mode 1 --min-seq-id 0.95 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 96 --compressed 0 -v 3

Can not touch 407600133816 into main memory
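To put the number in the warning into perspective, here is a small back-of-the-envelope calculation using only values from the log above. It rests on two assumptions that the log does not confirm: that the number in the warning is a size in bytes, and that "300G" is interpreted as 300 GiB (binary units). It also does not establish that `--split-memory-limit` applies to the rescorediagonal step at all.

```python
# Values copied from the log above.
touch_request = 407_600_133_816  # number in the "Can not touch ..." warning
num_sequences = 1_311_052_782    # "Database size" reported by kmermatcher
split_limit_gib = 300            # from --split-memory-limit 300G

GIB = 1024 ** 3

# Assumption: the warning reports a size in bytes.
request_gib = touch_request / GIB

# Assumption: "300G" means 300 GiB rather than 300 * 10**9 bytes.
excess_gib = request_gib - split_limit_gib

# Average number of bytes per database entry, if the whole
# request corresponds to the sequence data of the database.
bytes_per_sequence = touch_request / num_sequences

print(f"requested:        {request_gib:.1f} GiB")
print(f"over the limit:   {excess_gib:.1f} GiB")
print(f"bytes per entry:  {bytes_per_sequence:.1f}")
```

Under these assumptions the request is roughly 380 GiB, about 80 GiB more than the configured limit, and works out to a little over 300 bytes per database entry.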

Your Environment

Latest precompiled AVX2 version Release 15-6f452

xtj87515 commented 3 months ago

Did you fix the problem? I'm having similar issues, but with easy-search.

Alvaro-Nostrum commented 3 months ago

Nope :( If you manage to fix it, please tell me.
