ucabuk commented 1 year ago


I am using MMseqs2 for taxonomy assignment against the NR database. However, the estimated memory consumption is 2T. Is that normal? Note that my input is already protein. My other question is about speed: is there any way to speed it up?

Create directory tmp_BH193L-2_S20/16497043801801069335/tmp_hsp1
search BH193L-2_S20/BH193L-2_S20_database NR tmp_BH193L-2_S20/16497043801801069335/first tmp_BH193L-2_S20/16497043801801069335/tmp_hsp1 --alignment-mode 1 -e 0.0001 --max-rejected 5 --max-accept 30 --threads 36 -s 3 --spaced-kmer-mode 1 --min-length 30 --max-length 32734 --orf-start-mode 1 --lca-search 1

prefilter BH193L-2_S20/BH193L-2_S20_database NR tmp_BH193L-2_S20/16497043801801069335/tmp_hsp1/10054445979770264072/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 36 --compressed 0 -v 3 -s 3.0

Query database size: 355695 type: Aminoacid
Estimated memory consumption: 2T
Target database size: 532633656 type: Aminoacid
Index table k-mer threshold: 152 at k-mer size 7
Index table: counting k-mers

Thank you. Best,

milot-mirdita commented 1 year ago

I don't think there is much left to speed up in NR searches. The NR is just extremely large.

We were thinking of implementing clustered searches, similar to our ColabFold search, as a more general search strategy in MMseqs2. But that's a longer-term project. These would speed up searches against the NR significantly.

The memory estimate is not very accurate and it also doesn't take database chunking into account. If you use a machine with less RAM, it will just split the target database into smaller chunks (at a small runtime cost).
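
As a rough illustration of the chunking, here is a back-of-the-envelope sketch in Python. It assumes the 2T estimate from the log is taken at face value and that the target is simply split until each chunk fits into the available RAM. The helpers `parse_size` and `min_chunks` are hypothetical names for this example only; this is not how MMseqs2 computes its splits internally, and since the estimate itself is inaccurate the numbers are only indicative.

```python
import math

# Binary size suffixes, used only for this illustration.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a size such as '2T' or '128G' into bytes."""
    text = text.strip().upper()
    if text[-1] in UNITS:
        return int(float(text[:-1]) * UNITS[text[-1]])
    return int(text)


def min_chunks(estimated, available):
    """Smallest number of equal chunks so that each fits into `available`.

    Assumption: memory scales linearly with chunk size, which is a
    simplification and not the actual MMseqs2 logic.
    """
    return max(1, math.ceil(parse_size(estimated) / parse_size(available)))


if __name__ == "__main__":
    # The 2T figure is the estimate printed in the log above.
    for ram in ("128G", "256G", "512G"):
        print(ram, min_chunks("2T", ram))
```

The prefilter call in the log shows `--split 0` and `--split-memory-limit 0`, which appear to leave the split decision to the program, so no manual calculation like this should be needed for the search to run.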

ucabuk commented 1 year ago

Thank you for your answer. I understand, and I agree it would be good to see clustered searches in MMseqs2. Is there any benchmark against DIAMOND? Maybe I just could not find it.

Best, Ugur