soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

mmseq taxonomy with Uniref100 #278

Open MrOlm opened 4 years ago

MrOlm commented 4 years ago

Hello,

I downloaded and set up the UniRef100 database for the taxonomy pipeline according to the instructions, but when I run the taxonomy command, the output says "Computed index is too large. Avoid using the index." It also says "split was set to 5 but at least 8 are required. Please run with default paramerters", even though I never changed the defaults. Are these things I should be worried about, and could I be doing something differently to make this search more efficient? I know it is a huge database. The full commands and output are below.

Thank you in advance, -Matt

Commands to set up the taxonomy database

mmseqs databases UniRef100 uniref100.mmseqs tmp

mmseqs createtaxdb uniref100.mmseqs tmp --threads 8
createtaxdb uniref100.mmseqs tmp --threads 8

mmseqs createindex uniref100.mmseqs tmp --threads 8
createindex uniref100.mmseqs tmp --threads 8

Search commands

mmseqs createdb N4_005_008G1_Pseudomonas_aeruginosa_66_425.proteins.faa N4_005_008G1_Pseudomonas_aeruginosa_66_425.proteins.db

mmseqs taxonomy ../N4_005_008G1_Pseudomonas_aeruginosa_66_425.proteins.db /LAB_DATA/DATABASES/UniRef100/uniref100.mmseqs N4_005_008G1_Pseudomonas_aeruginosa_66_425.proteins.taxonomy temp --threads 8

Full traceback of search command

mmseqs taxonomy ../N4_005_008G1_Pseudomonas_aeruginosa_66_425.proteins.db /LAB_DATA/DATABASES/UniRef100/uniref100.mmseqs N4_005_008G1_Pseudomonas_aeruginosa_66_425.proteins.taxonomy temp --threads 8
Tmp temp folder does not exist or is not a directory.
Create dir temp
taxonomy ../N4_005_008G1_Pseudomonas_aeruginosa_66_425.proteins.db /LAB_DATA/DATABASES/UniRef100/uniref100.mmseqs N4_005_008G1_Pseudomonas_aeruginosa_66_425.proteins.taxonomy temp --threads 8

MMseqs Version:                         ca58693979f95537016a0454affcfd529dbde24d
Substitution matrix                     nucl:nucleotide.out,aa:blosum62.out
Add backtrace                           false
Alignment mode                          2
Allow wrapped scoring                   false
E-value threshold                       1
Seq. id. threshold                      0
Min alignment length                    0
Seq. id. mode                           0
Alternative alignments                  0
Coverage threshold                      0
Coverage mode                           0
Max sequence length                     65535
Compositional bias                      1
Realign hits                            false
Max reject                              2147483647
Max accept                              2147483647
Include identical seq. id.              false
Preload mode                            0
Pseudo count a                          1
Pseudo count b                          1.5
Score bias                              0
Gap open cost                           11
Gap extension cost                      1
zdrop                                   40
Threads                                 8
Compressed                              0
Verbosity                               3
Seed substitution matrix                nucl:nucleotide.out,aa:VTML80.out
Sensitivity                             5.7
k-mer length                            0
k-score                                 2147483647
Alphabet size                           nucl:5,aa:21
Max results per query                   300
Split database                          0
Split mode                              2
Split memory limit                      0
Diagonal scoring                        true
Exact k-mer matching                    0
Mask residues                           1
Mask lower case residues                0
Minimum diagonal score                  15
Spaced k-mers                           1
Spaced k-mer pattern
Local temporary path
Rescore mode                            0
Remove hits by seq. id. and coverage    false
Sort results                            0
Mask profile                            1
Profile e-value threshold               0.001
Global sequence weighting               false
Allow deletions                         false
Filter MSA                              1
Maximum seq. id. threshold              0.9
Minimum seq. id.                        0
Minimum score per column                -20
Minimum coverage                        0
Select N most diverse seqs              1000
Omit consensus                          false
Min codons in orf                       30
Max codons in length                    32734
Max orf gaps                            2147483647
Disk space limit                        0
MPI runner
Force restart with latest tmp           false
Remove temporary files                  false
LCA ranks
Taxon blacklist                         12908,28384
Show taxonomic lineage                  false
LCA mode                                4
Taxonomy output mode                    0

search ../N4_005_008G1_Pseudomonas_aeruginosa_66_425.proteins.db /LAB_DATA/DATABASES/UniRef100/uniref100.mmseqs temp/9118733262229857306/first temp/9118733262229857306/tmp_hsp1 --alignment-mode 2 -e 1 --threads 8 -s 5.7 --spaced-kmer-mode 1 --min-length 30 --max-length 32734 --orf-start-mode 1

search ../N4_005_008G1_Pseudomonas_aeruginosa_66_425.proteins.db /LAB_DATA/DATABASES/UniRef100/uniref100.mmseqs temp/9118733262229857306/first temp/9118733262229857306/tmp_hsp1 --alignment-mode 2 -e 1 --threads 8 -s 5.7 --spaced-kmer-mode 1 --min-length 30 --max-length 32734 --orf-start-mode 1

prefilter ../N4_005_008G1_Pseudomonas_aeruginosa_66_425.proteins.db /LAB_DATA/DATABASES/UniRef100/uniref100.mmseqs.idx temp/9118733262229857306/tmp_hsp1/5064659849361391999/pref_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 8 --compressed 0 -v 3 -s 5.7

Index version: 16
Generated by:  98c37f3c32b222632ada6011504380e91351276b
ScoreMatrix:  VTML80.out
Query database size: 6282 type: Aminoacid
split was set to 5 but at least 8 are required. Please run with default paramerters
Target split mode. Searching through 5 splits
Estimated memory consumption: 138G
Process needs more than 113G main memory.
Increase the size of --split or set it to 0 to automatically optimize target database split.
Computed index is too large. Avoid using the index.
Target database size: 213522593 type: Aminoacid
Process prefiltering step 1 of 5

k-mer similarity threshold: 122
Starting prefiltering scores calculation (step 1 of 5)
Query db start 1 to 6282
Target db start 1 to 42795128

milot-mirdita commented 4 years ago

Can you rerun createindex and manually pass it --split 8, so that the index is recreated with more subsets? Right now it probably has a very small RAM safety margin and could crash with a larger query sequence set. The warnings look odd and buggy; we will have to take a look at that.
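A minimal sketch of that suggestion, assuming the same database path, tmp directory, and thread count as in the setup commands above. The variable names and the RUN_INDEX guard are only illustrative and not part of MMseqs2; the script prints the command first so it can be checked before starting a long indexing run.

```shell
#!/bin/sh
# Sketch only: rebuild the UniRef100 index with 8 target splits.
# Paths and thread count are copied from the setup commands in this thread.
DB_PATH="uniref100.mmseqs"
TMP_DIR="tmp"
SPLITS=8
N_THREADS=8

INDEX_CMD="mmseqs createindex $DB_PATH $TMP_DIR --split $SPLITS --threads $N_THREADS"

# Print the command so it can be reviewed before running it.
echo "$INDEX_CMD"

# Run it only if mmseqs is installed and RUN_INDEX=1 is set
# (RUN_INDEX is a hypothetical guard used for this example).
if [ "${RUN_INDEX:-0}" = "1" ] && command -v mmseqs >/dev/null 2>&1; then
    $INDEX_CMD
fi
```

Once the index has been rebuilt, the taxonomy command from the original post should be rerunnable unchanged. Whether the warnings disappear is not guaranteed, since they may be partly caused by the bug mentioned above.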

By the way, if you want a set of stickers (see https://twitter.com/thesteinegger/status/1201076220957315074), send me your address to milot at mirdita de.