Open LuukvDamme opened 3 years ago
Could you please include the parts that you cut too? They are important to understand what exactly is going on.
Certainly. I just had to edit some paths because some of the data is private.
easy-taxonomy /sample.fastq.gz /nr /result /tmp -s 0.5
MMseqs Version: 13.45111
ORF filter 0
ORF filter e-value 100
ORF filter sensitivity 2
LCA mode 3
Majority threshold 0.5
Vote mode 1
LCA ranks
Column with taxonomic lineage 0
Compressed 0
Threads 26
Verbosity 3
Taxon blacklist 12908:unclassified sequences,28384:other sequences
Substitution matrix nucl:nucleotide.out,aa:blosum62.out
Add backtrace false
Alignment mode 0
Alignment mode 0
Allow wrapped scoring false
E-value threshold 0.001
Seq. id. threshold 0
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Coverage threshold 0
Coverage mode 0
Max sequence length 65535
Compositional bias 1
Max reject 2147483647
Max accept 2147483647
Include identical seq. id. false
Preload mode 0
Pseudo count a 1
Pseudo count b 1.5
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Gap open cost nucl:5,aa:11
Gap extension cost nucl:2,aa:1
Zdrop 40
Seed substitution matrix nucl:nucleotide.out,aa:VTML80.out
Sensitivity 0.5
k-mer length 0
k-score 2147483647
Alphabet size nucl:5,aa:21
Max results per query 300
Split database 0
Split mode 0
Split memory limit 0
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask lower case residues 0
Minimum diagonal score 15
Spaced k-mers 1
Spaced k-mer pattern
Local temporary path
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Mask profile 1
Profile E-value threshold 0.001
Global sequence weighting false
Allow deletions false
Filter MSA 1
Maximum seq. id. threshold 0.9
Minimum seq. id. 0
Minimum score per column -20
Minimum coverage 0
Select N most diverse seqs 1000
Min codons in orf 30
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Add orf stop false
Overlap between sequences 0
Sequence split mode 1
Header split mode 0
Chain overlapping alignments 0
Merge query 1
Search type 0
Search iterations 1
Start sensitivity 4
Search steps 1
Exhaustive search mode false
Filter results during exhaustive search 0
Strand selection 1
LCA search mode false
Disk space limit 0
MPI runner
Force restart with latest tmp false
Remove temporary files true
Report mode 0
Alignment format 0
Format alignment output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits
Database output false
First sequence as representative false
Target column 1
Add full header false
Sequence source 0
Database type 0
Shuffle input database true
Createdb mode 1
Write lookup file 0
createdb /sample.fastq.gz /tmp/7059426268546109220/query --dbtype 0 --shuffle 1 --createdb-mode 1 --write-lookup 0 --id-offset 0 --compressed 0 -v 3
Shuffle database cannot be combined with --createdb-mode 0
We recompute with --shuffle 0
Converting sequences
Only uncompressed fasta files can be used with --createdb-mode 0.
We recompute with --createdb-mode 1.
Time for merging to query_h: 0h 0m 0s 26ms
Time for merging to query: 0h 0m 0s 24ms
[=================================================================================================== 1 Mio. sequences processed
[... identical progress lines for 2 to 191 Mio. sequences processed omitted ...]
=================================================================================================== 192 Mio. sequences processed
====================================================
Time for merging to query_h: 0h 0m 0s 212ms
Time for merging to query: 0h 0m 0s 27ms
Database type: Nucleotide
Time for processing: 0h 8m 12s 710ms
Create directory /tmp/7059426268546109220/taxonomy_tmp
taxonomy /tmp/7059426268546109220/query /nr /tmp/7059426268546109220/result /tmp/7059426268546109220/taxonomy_tmp --tax-output-mode 2 -s 0.5 --split-mode 0 --remove-tmp-files 1
extractorfs /tmp/7059426268546109220/query /tmp/7059426268546109220/taxonomy_tmp/13812531703396435525/orfs_aa --min-length 30 --max-length 32734 --max-gaps 2147483647 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 1 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --translate 1 --use-all-table-starts 0 --id-offset 0 --create-lookup 0 --threads 26 --compressed 0 -v 3
[=================================================================] 192.52M 22m 33s 393ms
Time for merging to orfs_aa_h: 0h 7m 19s 213ms
Time for merging to orfs_aa: 0h 8m 4s 740ms
Time for processing: 0h 47m 10s 767ms
Create directory /tmp/7059426268546109220/taxonomy_tmp/13812531703396435525/tmp_taxonomy
taxonomy /tmp/7059426268546109220/taxonomy_tmp/13812531703396435525/orfs_aa /nr /tmp/7059426268546109220/taxonomy_tmp/13812531703396435525/orfs_tax /tmp/7059426268546109220/taxonomy_tmp/13812531703396435525/tmp_taxonomy --tax-output-mode 2 --tax-lineage 0 --alignment-mode 1 -e 1 --max-rejected 5 --max-accept 30 -s 0.5 --split-mode 0 --spaced-kmer-mode 1 --min-length 30 --max-length 32734 --orf-start-mode 1 --remove-tmp-files 1
Create directory /tmp/7059426268546109220/taxonomy_tmp/13812531703396435525/tmp_taxonomy/8588819485854123580/tmp_hsp1
search /tmp/7059426268546109220/taxonomy_tmp/13812531703396435525/orfs_aa /nr /tmp/7059426268546109220/taxonomy_tmp/13812531703396435525/tmp_taxonomy/8588819485854123580/first /tmp/7059426268546109220/taxonomy_tmp/13812531703396435525/tmp_taxonomy/8588819485854123580/tmp_hsp1 --alignment-mode 1 -e 1 --max-rejected 5 --max-accept 30 -s 0.5 --split-mode 0 --spaced-kmer-mode 1 --min-length 30 --max-length 32734 --orf-start-mode 1 --lca-search 1 --remove-tmp-files 1
prefilter /tmp/7059426268546109220/taxonomy_tmp/13812531703396435525/orfs_aa /nr /tmp/7059426268546109220/taxonomy_tmp/13812531703396435525/tmp_taxonomy/8588819485854123580/tmp_hsp1/1723886274502240713/pref_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 0 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 26 --compressed 0 -v 3 -s 0.5
Query database size: 695256546 type: Aminoacid
Target split mode. Searching through 6 splits
Estimated memory consumption: 232G
Target database size: 353572663 type: Aminoacid
Process prefiltering step 1 of 6
Index table k-mer threshold: 180 at k-mer size 7
Index table: counting k-mers
[=================================================================] 58.92M 1h 27m 43s 365ms
Index table: Masked residues: 338212106
Index table: fill
[=================================================================] 58.92M 2h 48m 44s 23ms
Index statistics
Entries: 10047647313
DB size: 67258 MB
Avg k-mer size: 7.849724
Top 10 k-mers
FSHAGSI 169128
AFRNNFW 161115
APMFPNN 145858
GGGWLLM 137963
NNSWLPS 137460
AHFMIMV 126820
MPMGGNW 126274
TMLDRNT 108816
TGTYPSS 94201
GDQYNVT 84229
Time for index table init: 4h 18m 41s 415ms
k-mer similarity threshold: 180
Starting prefiltering scores calculation (step 1 of 6)
Query db start 1 to 695256546
Target db start 1 to 58919300
[=================================================================] 695.26M 61h 14m 42s 623ms
2.307739 k-mers per position
1254 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
11 sequences passed prefiltering per query sequence
1 median result list length
275899073 sequences with 0 size result lists
Time for merging to pref_0_tmp_0: 0h 16m 3s 814ms
Time for merging to pref_0_tmp_0_tmp: 0h 26m 19s 322ms
Process prefiltering step 2 of 6
Index table k-mer threshold: 180 at k-mer size 7
Index table: counting k-mers
[=================================================================] 58.92M 1h 18m 46s 598ms
Index table: Masked residues: 338371908
Index table: fill
[===========================Terminated
It seems like you accidentally defeated a speed-up mechanism by setting -s 0.5. Setting -s to a value less than or equal to --orf-filter-s (2 in your log) deactivates this optimization.
When the optimization is active, we first run a very low sensitivity search to see whether an extracted ORF can find anything at all in the target database, so we can reject many fragments that would not contribute anything later.
You can try setting --orf-filter-s 1 instead and leaving -s at its default sensitivity.
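To make the condition concrete, here is a minimal Python sketch of the check as described in this reply. It is not MMseqs2 code: the function names and the illustrative sensitivity value are assumptions, and only the -s and --orf-filter-s flags and the values 0.5, 1 and 2 come from this thread.

```python
from typing import List


def s_allows_orf_prefilter(sensitivity: float, orf_filter_s: float) -> bool:
    """Mirror the rule stated above: -s <= --orf-filter-s deactivates
    the low-sensitivity ORF pre-search. Hypothetical helper, not MMseqs2 API."""
    return sensitivity > orf_filter_s


def build_command(query: str, target: str, result: str, tmp: str,
                  orf_filter_s: int) -> List[str]:
    """Assemble the suggested call: lower --orf-filter-s, leave -s unset."""
    return ["mmseqs", "easy-taxonomy", query, target, result, tmp,
            "--orf-filter-s", str(orf_filter_s)]


# Values from the log: -s 0.5 with "ORF filter sensitivity 2".
print(s_allows_orf_prefilter(0.5, 2.0))

# Suggested setting: --orf-filter-s 1 with -s left at its default.
# 4.0 is only an illustrative stand-in for "some value above 1"; check the
# help output of your MMseqs2 version for the real default.
illustrative_s = 4.0
print(s_allows_orf_prefilter(illustrative_s, 1.0))

print(" ".join(build_command("./sample.fastq.gz", "./nr", "./result", "./tmp", 1)))
```

The first check comes out false for the run in the log and the second comes out true, which is the difference the suggested change is meant to make.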
Thank you for the quick response. I will test this soon and post the results when it is done.
Hello,
First of all, thank you for making such an amazing program. Secondly, I was wondering if you could provide some advice on how to handle a very large query database. I have several terabytes of data that I would like to search against nr. Currently I am using the easy-taxonomy workflow and have loaded about 1/15th of my data as a proof of concept. However, as you will see in the log below, this will take quite some time. My main questions are: is this expected behaviour, and how can I speed this up?
Current Behavior
LSBATCH: User input mmseqs easy-taxonomy ./sample.fastq.gz ./nr ./result ./tmp -s 0.5
MMseqs Version: 13.45111
ORF filter 0
ORF filter e-value 100
ORF filter sensitivity 2
LCA mode 3
Majority threshold 0.5
Vote mode 1
LCA ranks
Column with taxonomic lineage 0
Compressed 0
Threads 26
Verbosity 3
Taxon blacklist 12908:unclassified sequences,28384:other sequences
Substitution matrix nucl:nucleotide.out,aa:blosum62.out
Add backtrace false
Alignment mode 0
Alignment mode 0
Allow wrapped scoring false
E-value threshold 0.001
Seq. id. threshold 0
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Coverage threshold 0
Coverage mode 0
Max sequence length 65535
Compositional bias 1
Max reject 2147483647
Max accept 2147483647
Include identical seq. id. false
Preload mode 0
Pseudo count a 1
Pseudo count b 1.5
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Gap open cost nucl:5,aa:11
Gap extension cost nucl:2,aa:1
Zdrop 40
Seed substitution matrix nucl:nucleotide.out,aa:VTML80.out
Sensitivity 0.5
k-mer length 0
k-score 2147483647
Alphabet size nucl:5,aa:21
Max results per query 300
Split database 0
Split mode 0
Split memory limit 0
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask lower case residues 0
Minimum diagonal score 15
Spaced k-mers 1
Spaced k-mer pattern
Local temporary path
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Mask profile 1
Profile E-value threshold 0.001
Global sequence weighting false
Allow deletions false
Filter MSA 1
Maximum seq. id. threshold 0.9
Minimum seq. id. 0
Minimum score per column -20
Minimum coverage 0
Select N most diverse seqs 1000
Min codons in orf 30
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Add orf stop false
Overlap between sequences 0
Sequence split mode 1
Header split mode 0
Chain overlapping alignments 0
Merge query 1
Search type 0
Search iterations 1
Start sensitivity 4
Search steps 1
Exhaustive search mode false
Filter results during exhaustive search 0
Strand selection 1
LCA search mode false
Disk space limit 0
MPI runner
Force restart with latest tmp false
Remove temporary files true
Report mode 0
Alignment format 0
Format alignment output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits
Database output false
First sequence as representative false
Target column 1
Add full header false
Sequence source 0
Database type 0
Shuffle input database true
Createdb mode 1
Write lookup file 0
(I skipped some parts of the log that took very little time.)
Query database size: 695256546 type: Aminoacid
Target split mode. Searching through 6 splits
Estimated memory consumption: 232G
Target database size: 353572663 type: Aminoacid
Process prefiltering step 1 of 6
Index table k-mer threshold: 180 at k-mer size 7
Index table: counting k-mers
[=================================================================] 58.92M 1h 27m 43s 365ms
Index table: Masked residues: 338212106
Index table: fill
[=================================================================] 58.92M 2h 48m 44s 23ms
Index statistics
Entries: 10047647313
DB size: 67258 MB
Avg k-mer size: 7.849724
Top 10 k-mers
FSHAGSI 169128
AFRNNFW 161115
APMFPNN 145858
GGGWLLM 137963
NNSWLPS 137460
AHFMIMV 126820
MPMGGNW 126274
TMLDRNT 108816
TGTYPSS 94201
GDQYNVT 84229
Time for index table init: 4h 18m 41s 415ms
k-mer similarity threshold: 180
Starting prefiltering scores calculation (step 1 of 6)
Query db start 1 to 695256546
Target db start 1 to 58919300
[=================================================================] 695.26M 61h 14m 42s 623ms
2.307739 k-mers per position
1254 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
11 sequences passed prefiltering per query sequence
1 median result list length
275899073 sequences with 0 size result lists
Time for merging to pref_0_tmp_0: 0h 16m 3s 814ms
Time for merging to pref_0_tmp_0_tmp: 0h 26m 19s 322ms
Process prefiltering step 2 of 6
Index table k-mer threshold: 180 at k-mer size 7
Index table: counting k-mers
[=================================================================] 58.92M 1h 18m 46s 598ms
Index table: Masked residues: 338371908
Index table: fill
[===========================Terminated
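For a rough sense of scale, the timings of the first split can be extrapolated to all six. The Python sketch below only does arithmetic on durations copied from the log above. It assumes every split takes about as long as the first one and ignores the alignment and LCA stages that follow prefiltering, so treat the result as a lower-bound guess, not a measurement. The helper name parse_duration is made up for this example.

```python
import re


def parse_duration(text: str) -> float:
    """Convert a log duration such as '4h 18m 41s 415ms' into seconds."""
    match = re.fullmatch(r"(\d+)h (\d+)m (\d+)s (\d+)ms", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


# Durations copied from prefiltering step 1 of 6 in the log.
step_one = {
    "index table init": "4h 18m 41s 415ms",
    "prefiltering scores": "61h 14m 42s 623ms",
    "merge pref_0_tmp_0": "0h 16m 3s 814ms",
    "merge pref_0_tmp_0_tmp": "0h 26m 19s 322ms",
}

splits = 6  # "Searching through 6 splits"

per_split_s = sum(parse_duration(d) for d in step_one.values())
total_s = per_split_s * splits  # assumption: all splits cost the same

print(f"per split: {per_split_s / 3600:.1f} h")
print(f"all {splits} splits: {total_s / 86400:.1f} days")
```

Under these assumptions the prefilter alone comes to about 66 hours per split, or roughly 16 to 17 days for all six splits, for the 1/15th of the data loaded so far.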