soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

memory consumption and speed #678

Open ucabuk opened 1 year ago

ucabuk commented 1 year ago

Hi,

I am using MMseqs2 for taxonomy assignment against the NR database. However, the estimated memory consumption is 2T. Is that normal? Note that my input is already protein. My other question is about speed: is there any way to speed it up?

MMseqs Version:         14.7e284
Database type           0
Shuffle input database  true
Createdb mode           0
Write lookup file       1
Offset of numeric ids   0
Compressed              0
Verbosity               3

Converting sequences
[===================================
Time for merging to BH193L-2_S20_database_h: 0h 0m 0s 80ms
Time for merging to BH193L-2_S20_database: 0h 0m 0s 85ms
Database type: Aminoacid
Time for processing: 0h 0m 17s 880ms
Create directory tmp_BH193L-2_S20
taxonomy --lca-mode 3 --threads 36 -e 0.0001 --tax-lineage 1 -s 3 --lca-ranks species,genus,family,order,class,phylum,kingdom,superkingdom BH193L-2_S20_database NR BH193L-2_S20.result tmp_BH193L-2_S20

MMseqs Version:                         14.7e284
ORF filter                              1
ORF filter e-value                      100
ORF filter sensitivity                  2
LCA mode                                3
Taxonomy output mode                    0
Majority threshold                      0.5
Vote mode                               1
LCA ranks                               species,genus,family,order,class,phylum,kingdom,superkingdom
Column with taxonomic lineage           1
Compressed                              0
Threads                                 36
Verbosity                               3
Taxon blacklist                         12908:unclassified sequences,28384:other sequences
Substitution matrix                     aa:blosum62.out,nucl:nucleotide.out
Add backtrace                           false
Alignment mode                          1
Alignment mode                          0
Allow wrapped scoring                   false
E-value threshold                       0.0001
Seq. id. threshold                      0
Min alignment length                    0
Seq. id. mode                           0
Alternative alignments                  0
Coverage threshold                      0
Coverage mode                           0
Max sequence length                     65535
Compositional bias                      1
Compositional bias                      1
Max reject                              5
Max accept                              30
Include identical seq. id.              false
Preload mode                            0
Pseudo count a                          substitution:1.100,context:1.400
Pseudo count b                          substitution:4.100,context:5.800
Score bias                              0
Realign hits                            false
Realign score bias                      -0.2
Realign max seqs                        2147483647
Correlation score weight                0
Gap open cost                           aa:11,nucl:5
Gap extension cost                      aa:1,nucl:2
Zdrop                                   40
Seed substitution matrix                aa:VTML80.out,nucl:nucleotide.out
Sensitivity                             3
k-mer length                            0
k-score                                 seq:2147483647,prof:2147483647
Alphabet size                           aa:21,nucl:5
Max results per query                   300
Split database                          0
Split mode                              2
Split memory limit                      0
Diagonal scoring                        true
Exact k-mer matching                    0
Mask residues                           1
Mask residues probability               0.9
Mask lower case residues                0
Minimum diagonal score                  15
Selected taxa
Spaced k-mers                           1
Spaced k-mer pattern
Local temporary path
Rescore mode                            0
Remove hits by seq. id. and coverage    false
Sort results                            0
Mask profile                            1
Profile E-value threshold               0.001
Global sequence weighting               false
Allow deletions                         false
Filter MSA                              1
Use filter only at N seqs               0
Maximum seq. id. threshold              0.9
Minimum seq. id.                        0.0
Minimum score per column                -20
Minimum coverage                        0
Select N most diverse seqs              1000
Pseudo count mode                       0
Gap pseudo count                        10
Min codons in orf                       30
Max codons in length                    32734
Max orf gaps                            2147483647
Contig start mode                       2
Contig end mode                         2
Orf start mode                          1
Forward frames                          1,2,3
Reverse frames                          1,2,3
Translation table                       1
Translate orf                           0
Use all table starts                    false
Offset of numeric ids                   0
Create lookup                           0
Add orf stop                            false
Overlap between sequences               0
Sequence split mode                     1
Header split mode                       0
Chain overlapping alignments            0
Merge query                             1
Search type                             0
Exhaustive search mode                  false
Filter results during exhaustive search 0
Strand selection                        1
LCA search mode                         false
Disk space limit                        0
MPI runner
Force restart with latest tmp           false
Remove temporary files                  false

Create directory tmp_BH193L-2_S20/16497043801801069335/tmp_hsp1
search BH193L-2_S20/BH193L-2_S20_database NR tmp_BH193L-2_S20/16497043801801069335/first tmp_BH193L-2_S20/16497043801801069335/tmp_hsp1 --alignment-mode 1 -e 0.0001 --max-rejected 5 --max-accept 30 --threads 36 -s 3 --spaced-kmer-mode 1 --min-length 30 --max-length 32734 --orf-start-mode 1 --lca-search 1

prefilter BH193L-2_S20/BH193L-2_S20_database NR tmp_BH193L-2_S20/16497043801801069335/tmp_hsp1/10054445979770264072/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 36 --compressed 0 -v 3 -s 3.0

Query database size: 355695 type: Aminoacid
Estimated memory consumption: 2T
Target database size: 532633656 type: Aminoacid
Index table k-mer threshold: 152 at k-mer size 7
Index table: counting k-mers

Thank you. Best,

milot-mirdita commented 1 year ago

I don't think that there is a lot left to speed up NR searches. The NR is just extremely large.

We were thinking of implementing clustered searches, similar to our ColabFold search, as a more general search strategy in MMseqs2. But that's a longer-term project. Clustered searches would speed up searches against the NR significantly.

The memory estimate is not very accurate, and it also doesn't take database chunking into account. If you use a machine with less RAM, MMseqs2 will just split the target database into smaller chunks (at a small runtime cost).
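To make the chunking explicit rather than relying on automatic detection, one option is to pass a memory cap yourself. The sketch below reuses the taxonomy command from the log and adds `--split-memory-limit`, which appears in the parameter dump above as "Split memory limit 0". The value `200G` is only an assumed example (choose something below the RAM of your machine), and the `MEM_LIMIT` and `RUN` variables are hypothetical helpers for this sketch, not part of MMseqs2. By default it only prints the command.

```shell
#!/bin/sh
# Sketch: rerun the taxonomy search from the log with an explicit memory cap.
# MEM_LIMIT is an assumed example value, not a recommendation.
MEM_LIMIT="${MEM_LIMIT:-200G}"

# Same arguments as the original run, plus --split-memory-limit.
set -- mmseqs taxonomy \
    --lca-mode 3 --threads 36 -e 0.0001 --tax-lineage 1 -s 3 \
    --lca-ranks species,genus,family,order,class,phylum,kingdom,superkingdom \
    --split-memory-limit "$MEM_LIMIT" \
    BH193L-2_S20_database NR BH193L-2_S20.result tmp_BH193L-2_S20

# Dry run by default: print the command. Set RUN=1 to actually execute it.
if [ "${RUN:-0}" = "1" ]; then
    "$@"
else
    echo "$*"
fi
```

With a cap in place, the target database should be processed in several chunks instead of all at once, trading some runtime for a smaller memory footprint, as described above.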

ucabuk commented 1 year ago

Thank you for your answer. I understand, and I agree it would be good to see clustered searches in MMseqs2. Is there any benchmark against the DIAMOND tool? Maybe I just could not find it.

Best, Ugur