soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

slow easy-taxonomy #577

Closed: mmpust closed 2 years ago

mmpust commented 2 years ago

Expected Behavior

I want to taxonomically annotate my nucleotide database (97 GB) against UniProtKB.

Current Behavior

The first pre-filtering step of easy-taxonomy has now been running for about 12 hours. Is that expected? Is there a way to speed up the pre-filtering? Since the pre-filtering is split into 6 parts, should I expect each part to take about 12 hours?

Steps to Reproduce (for bugs)

mmseqs databases UniProtKB databases/uniprotkb tmp
mmseqs easy-taxonomy input.fna databases/uniprotkb taxdb tmp --dbtype 2 --lca-mode 4 --orf-filter 0 --tax-lineage 1 --split-memory-limit 200G --threads 32

MMseqs Output (for bugs)

Create directory tmp

MMseqs Version:                         13.45111
ORF filter                              0
ORF filter e-value                      100
ORF filter sensitivity                  2
LCA mode                                4
Majority threshold                      0.5
Vote mode                               1
LCA ranks                               
Column with taxonomic lineage           1
Compressed                              0
Threads                                 32
Verbosity                               3
Taxon blacklist                         12908:unclassified sequences,28384:other sequences
Substitution matrix                     nucl:nucleotide.out,aa:blosum62.out
Add backtrace                           false
Alignment mode                          0
Alignment mode                          0
Allow wrapped scoring                   false
E-value threshold                       0.001
Seq. id. threshold                      0
Min alignment length                    0
Seq. id. mode                           0
Alternative alignments                  0
Coverage threshold                      0
Coverage mode                           0
Max sequence length                     65535
Compositional bias                      1
Max reject                              2147483647
Max accept                              2147483647
Include identical seq. id.              false
Preload mode                            0
Pseudo count a                          1
Pseudo count b                          1.5
Score bias                              0
Realign hits                            false
Realign score bias                      -0.2
Realign max seqs                        2147483647
Gap open cost                           nucl:5,aa:11
Gap extension cost                      nucl:2,aa:1
Zdrop                                   40
Seed substitution matrix                nucl:nucleotide.out,aa:VTML80.out
Sensitivity                             4
k-mer length                            0
k-score                                 2147483647
Alphabet size                           nucl:5,aa:21
Max results per query                   300
Split database                          0
Split mode                              2
Split memory limit                      200G
Diagonal scoring                        true
Exact k-mer matching                    0
Mask residues                           1
Mask lower case residues                0
Minimum diagonal score                  15
Spaced k-mers                           1
Spaced k-mer pattern                    
Local temporary path                    
Rescore mode                            0
Remove hits by seq. id. and coverage    false
Sort results                            0
Mask profile                            1
Profile E-value threshold               0.001
Global sequence weighting               false
Allow deletions                         false
Filter MSA                              1
Maximum seq. id. threshold              0.9
Minimum seq. id.                        0
Minimum score per column                -20
Minimum coverage                        0
Select N most diverse seqs              1000
Min codons in orf                       30
Max codons in length                    32734
Max orf gaps                            2147483647
Contig start mode                       2
Contig end mode                         2
Orf start mode                          1
Forward frames                          1,2,3
Reverse frames                          1,2,3
Translation table                       1
Translate orf                           0
Use all table starts                    false
Offset of numeric ids                   0
Create lookup                           0
Add orf stop                            false
Overlap between sequences               0
Sequence split mode                     1
Header split mode                       0
Chain overlapping alignments            0
Merge query                             1
Search type                             0
Search iterations                       1
Start sensitivity                       4
Search steps                            1
Exhaustive search mode                  false
Filter results during exhaustive search 0
Strand selection                        1
LCA search mode                         false
Disk space limit                        0
MPI runner                              
Force restart with latest tmp           false
Remove temporary files                  true
Report mode                             0
Alignment format                        0
Format alignment output                 query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits
Database output                         false
First sequence as representative        false
Target column                           1
Add full header                         false
Sequence source                         0
Database type                           2
Shuffle input database                  true
Createdb mode                           1
Write lookup file                       0

createdb input.fna tmp/6713332935333060100/query --dbtype 2 --shuffle 1 --createdb-mode 1 --write-lookup 0 --id-offset 0 --compressed 0 -v 3 

Shuffle database cannot be combined with --createdb-mode 0
We recompute with --shuffle 0
Converting sequences
[===================================================================================================    1 Mio. sequences processed
===================================================================================================     2 Mio. sequences processed
===================================================================================================     3 Mio. sequences processed
===================================================================================================     4 Mio. sequences processed
===================================================================================================     5 Mio. sequences processed
===================================================================================================     6 Mio. sequences processed
===================================================================================================     7 Mio. sequences processed
===================================================================================================     8 Mio. sequences processed
===================================================================================================     9 Mio. sequences processed
===================================================================================================     10 Mio. sequences processed
===================================================================================================     11 Mio. sequences processed
===================================================================================================     12 Mio. sequences processed
===================================================================================================     13 Mio. sequences processed
===================================================================================================     14 Mio. sequences processed
===================================================================================================     15 Mio. sequences processed
===================================================================================================     16 Mio. sequences processed
===================================================================================================     17 Mio. sequences processed
===================================================================================================     18 Mio. sequences processed
===================================================================================================     19 Mio. sequences processed
===================================================================================================     20 Mio. sequences processed
===================================================================================================     21 Mio. sequences processed
===================================================================================================     22 Mio. sequences processed
===================================================================================================     23 Mio. sequences processed
===================================================================================================     24 Mio. sequences processed
===================================================================================================     25 Mio. sequences processed
===================================================================================================     26 Mio. sequences processed
===================================================================================================     27 Mio. sequences processed
===================================================================================================     28 Mio. sequences processed
===================================================================================================     29 Mio. sequences processed
===================================================================================================     30 Mio. sequences processed
===================================================================================================     31 Mio. sequences processed
===================================================================================================     32 Mio. sequences processed
===================================================================================================     33 Mio. sequences processed
===================================================================================================     34 Mio. sequences processed
===================================================================================================     35 Mio. sequences processed
===================================================================================================     36 Mio. sequences processed
===================================================================================================     37 Mio. sequences processed
===================================================================================================     38 Mio. sequences processed
===================================================================================================     39 Mio. sequences processed
===================================================================================================     40 Mio. sequences processed
===================================================================================================     41 Mio. sequences processed
===================================================================================================     42 Mio. sequences processed
===================================================================================================     43 Mio. sequences processed
===================================================================================================     44 Mio. sequences processed
===================================================================================================     45 Mio. sequences processed
===================================================================================================     46 Mio. sequences processed
===================================================================================================     47 Mio. sequences processed
===================================================================================================     48 Mio. sequences processed
===================================================================================================     49 Mio. sequences processed
===================================================================================================     50 Mio. sequences processed
===================================================================================================     51 Mio. sequences processed
===================================================================================================     52 Mio. sequences processed
===================================================================================================     53 Mio. sequences processed
===================================================================================================     54 Mio. sequences processed
===================================================================================================     55 Mio. sequences processed
===================================================================================================     56 Mio. sequences processed
===================================================================================================     57 Mio. sequences processed
===================================================================================================     58 Mio. sequences processed
===================================================================================================     59 Mio. sequences processed
===================================================================================================     60 Mio. sequences processed
===================================================================================================     61 Mio. sequences processed
===================================================================================================     62 Mio. sequences processed
===================================================================================================     63 Mio. sequences processed
===================================================================================================     64 Mio. sequences processed
===================================================================================================     65 Mio. sequences processed
===================================================================================================     66 Mio. sequences processed
===================================================================================================     67 Mio. sequences processed
===================================================================================================     68 Mio. sequences processed
===================================================================================================     69 Mio. sequences processed
===================================================================================================     70 Mio. sequences processed
===================================================================================================     71 Mio. sequences processed
===================================================================================================     72 Mio. sequences processed
===================================================================================================     73 Mio. sequences processed
===================================================================================================     74 Mio. sequences processed
===================================================================================================     75 Mio. sequences processed
===================================================================================================     76 Mio. sequences processed
===================================================================================================     77 Mio. sequences processed
===================================================================================================     78 Mio. sequences processed
===================================================================================================     79 Mio. sequences processed
===================================================================================================     80 Mio. sequences processed
===================================================================================================     81 Mio. sequences processed
===================================================================================================     82 Mio. sequences processed
===================================================================================================     83 Mio. sequences processed
===================================================================================================     84 Mio. sequences processed
===================================================================================================     85 Mio. sequences processed
===================================================================================================     86 Mio. sequences processed
===================================================================================================     87 Mio. sequences processed
===================================================================================================     88 Mio. sequences processed
===================================================================================================     89 Mio. sequences processed
===================================================================================================     90 Mio. sequences processed
===================================================================================================     91 Mio. sequences processed
===================================================================================================     92 Mio. sequences processed
===================================================================================================     93 Mio. sequences processed
===================================================================================================     94 Mio. sequences processed
===================================================================================================     95 Mio. sequences processed
===================================================================================================     96 Mio. sequences processed
===================================================================================================     97 Mio. sequences processed
===================================================================================================     98 Mio. sequences processed
===================================================================================================     99 Mio. sequences processed
===================================================================================================     100 Mio. sequences processed
===================================================================================================     101 Mio. sequences processed
===================================================================================================     102 Mio. sequences processed
===================================================================================================     103 Mio. sequences processed
===================================================================================================     104 Mio. sequences processed
===================================================================================================     105 Mio. sequences processed
===================================================================================================     106 Mio. sequences processed
===================================================================================================     107 Mio. sequences processed
===================================================================================================     108 Mio. sequences processed
===================================================================================================     109 Mio. sequences processed
===================================================================================================     110 Mio. sequences processed
===================================================================================================     111 Mio. sequences processed
===================================================================================================     112 Mio. sequences processed
===================================================================================================     113 Mio. sequences processed
===================================================================================================     114 Mio. sequences processed
===================================================================================================     115 Mio. sequences processed
===================================================================================================     116 Mio. sequences processed
===================================================================================================     117 Mio. sequences processed
===========
Time for merging to query_h: 0h 0m 0s 0ms
Time for merging to query: 0h 0m 0s 0ms
Database type: Nucleotide
Time for processing: 0h 1m 58s 419ms
Create directory tmp/6713332935333060100/taxonomy_tmp
taxonomy tmp/6713332935333060100/query databases/uniprotkb tmp/6713332935333060100/result tmp/6713332935333060100/taxonomy_tmp --orf-filter 0 --lca-mode 4 --tax-output-mode 2 --tax-lineage 1 --threads 32 --split-memory-limit 200G --remove-tmp-files 1

extractorfs tmp/6713332935333060100/query tmp/6713332935333060100/taxonomy_tmp/9923875229524867748/orfs_aa --min-length 30 --max-length 32734 --max-gaps 2147483647 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 1 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --translate 1 --use-all-table-starts 0 --id-offset 0 --create-lookup 0 --threads 32 --compressed 0 -v 3

[=================================================================] 117.12M 6m 0s 613ms
Time for merging to orfs_aa_h: 0h 8m 37s 570ms
Time for merging to orfs_aa: 0h 12m 15s 84ms
Time for processing: 0h 35m 2s 862ms
Create directory tmp/6713332935333060100/taxonomy_tmp/9923875229524867748/tmp_taxonomy
taxonomy tmp/6713332935333060100/taxonomy_tmp/9923875229524867748/orfs_aa databases/uniprotkb tmp/6713332935333060100/taxonomy_tmp/9923875229524867748/orfs_tax tmp/6713332935333060100/taxonomy_tmp/9923875229524867748/tmp_taxonomy --orf-filter 0 --lca-mode 4 --tax-output-mode 2 --tax-lineage 0 --threads 32 --alignment-mode 1 -e 1 --max-rejected 5 --max-accept 30 -s 2 --split-memory-limit 200G --spaced-kmer-mode 1 --min-length 30 --max-length 32734 --orf-start-mode 1 --remove-tmp-files 1

Create directory tmp/6713332935333060100/taxonomy_tmp/9923875229524867748/tmp_taxonomy/15848989983316803073/tmp_hsp1
search tmp/6713332935333060100/taxonomy_tmp/9923875229524867748/orfs_aa databases/uniprotkb tmp/6713332935333060100/taxonomy_tmp/9923875229524867748/tmp_taxonomy/15848989983316803073/first tmp/6713332935333060100/taxonomy_tmp/9923875229524867748/tmp_taxonomy/15848989983316803073/tmp_hsp1 --alignment-mode 1 -e 1 --max-rejected 5 --max-accept 30 --threads 32 -s 2 --split-memory-limit 200G --spaced-kmer-mode 1 --min-length 30 --max-length 32734 --orf-start-mode 1 --remove-tmp-files 1

prefilter tmp/6713332935333060100/taxonomy_tmp/9923875229524867748/orfs_aa databases/uniprotkb tmp/6713332935333060100/taxonomy_tmp/9923875229524867748/tmp_taxonomy/15848989983316803073/tmp_hsp1/6853721603621777485/pref_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 200G -c 0 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 32 --compressed 0 -v 3 -s 2.0

Query database size: 1599064123 type: Aminoacid
Target split mode. Searching through 6 splits
Estimated memory consumption: 163G
Target database size: 231921744 type: Aminoacid
Process prefiltering step 1 of 6
Index table k-mer threshold: 163 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 38.41M 2m 12s 304ms
Index table: Masked residues: 221272222
Index table: fill
[=================================================================] 38.41M 3m 45s 787ms
Index statistics
Entries:          11399442350
DB size:          74993 MB
Avg k-mer size:   8.905814
Top 10 k-mers
    FSHAGSI     272598
    AFMFFMP     260790
    AFRNNFW     259163
    RMNSFLP     218177
    NNSWLPS     215496
    AHFMIMV     211691
    MPMGGNW     204521
    TMLDRNT     168603
    TGTYPSS     159040
    GDQYNVT     148373
Time for index table init: 0h 6m 20s 599ms
k-mer similarity threshold: 163
Starting prefiltering scores calculation (step 1 of 6)
Query db start 1 to 1599064123
Target db start 1 to 38411731
[=================================================================] 1.60B 12h 46m 35s 370ms

22.315418 k-mers per position
10964 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
27 sequences passed prefiltering per query sequence
16 median result list length
125561700 sequences with 0 size result lists
Time for merging to pref_0_tmp_0: 0h 20m 11s 194ms
Time for merging to pref_0_tmp_0_tmp: 0h 53m 2s 600ms
Process prefiltering step 2 of 6

Index table k-mer threshold: 163 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 38.92M 2m 23s 524ms
Index table: Masked residues: 206230655
Index table: fill
[=====================

Context

free -h
              total        used        free      shared  buff/cache   available
Mem:           409G        135G         86G        1.4M        186G        270G
Swap:            0B          0B          0B

Google Cloud Platform (64 vCPUs, 425984 MiB RAM), boot disk: 6000 GB, Ubuntu 18.04
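
As a rough answer to the "6 parts" question, the timings the log reports for split 1 can be added up and scaled. This is only a back-of-the-envelope sketch: it assumes every split costs about the same as the first one, which the log does not confirm because only step 1 had finished. The variable names are made up for illustration.

```shell
#!/bin/sh
# Timings for prefiltering step 1 of 6, copied from the log above (in seconds).
index_init=$((6*60 + 20))               # "Time for index table init: 0h 6m 20s"
prefilter=$((12*3600 + 46*60 + 35))     # prefiltering scores: 12h 46m 35s
merge=$((20*60 + 11 + 53*60 + 2))       # two merge steps: 0h 20m 11s + 0h 53m 2s

per_split=$((index_init + prefilter + merge))

# ASSUMPTION: all 6 splits take about as long as the first one.
splits=6
total=$((per_split * splits))

printf 'per split: %dh %dm\n' $((per_split / 3600)) $((per_split % 3600 / 60))
printf 'all %d splits: %dh %dm\n' "$splits" $((total / 3600)) $((total % 3600 / 60))
```

Under that assumption each split comes to roughly 14 hours including index construction and merging, so the prefilter alone would need around 85 hours with these settings.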

milot-mirdita commented 2 years ago

Setting --orf-filter 0 disabled an important speed optimization. With the filter enabled, it should run quite a bit faster.

mmpust commented 2 years ago

Perfect, setting --orf-filter 1 more than doubled the speed. Thanks!
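
For reference, a sketch of what the adjusted call could look like, assuming everything except --orf-filter stays exactly as in the original report (paths and the remaining flags are copied from there, not re-verified). It is written as a dry run that only prints the command.

```shell
#!/bin/sh
# Dry run: the invocation from the report with only --orf-filter changed from 0 to 1.
# Paths and all other flags are copied unchanged from the report.
cmd="mmseqs easy-taxonomy input.fna databases/uniprotkb taxdb tmp \
--dbtype 2 --lca-mode 4 --orf-filter 1 --tax-lineage 1 \
--split-memory-limit 200G --threads 32"

# Print the command for review; to actually run it, use: eval "$cmd"
echo "$cmd"
```

Printing first makes it easy to check the flags before committing to a multi-hour run.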