Segmentation fault mmseqs2 taxonomy

Expected Behavior

I would like to query a transcriptome against the NT database and retrieve taxonomy. I generated the NT database according to your docs (with compression enabled). I then converted my transcriptome to an MMseqs2 database and tried to query it with:

mmseqs taxonomy --search-type 3 Transcripts_mmseqs2 nt.fnaDB MyTaxonomyResult tmp

But I get a segfault...

UPDATE: I also get a segfault when running search or taxonomy against a pre-built database downloaded via the databases command.

UPDATE 2: This also happens with the latest Docker image.

UPDATE 3: I tried a very small toy FASTA. It also segfaults.

Current Behavior

Execution of mmseqs taxonomy fails with segfault.

I tried several versions of the mmseqs2 binary:

Latest provided AVX2 build
Latest provided SSE4 build
Self-compiled AVX2
Older version (Release 12-113e3 - AVX2)

-> All fail

Steps to Reproduce (for bugs)

Create DB for query:
mmseqs createdb ../transcripts.fasta Transcripts_mmseqs2

Get taxonomy:
mmseqs taxonomy --search-type 3 Transcripts_mmseqs2 nt.fnaDB MyTaxonomyResult tmp
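For a minimal reproduction along the lines of the toy FASTA mentioned in the updates above, a small deterministic input can be generated with a few lines of Python. This is only a sketch: the function name write_toy_fasta, the file name toy.fasta, and the sequence count and length are all made up for illustration and are not the values used in this report.

```python
import random


def write_toy_fasta(path, n_seqs=5, length=200, seed=0):
    """Write n_seqs random nucleotide sequences of the given length to path."""
    rng = random.Random(seed)  # fixed seed so the file is reproducible
    with open(path, "w") as fh:
        for i in range(n_seqs):
            seq = "".join(rng.choice("ACGT") for _ in range(length))
            fh.write(f">toy_{i}\n")
            # Wrap sequence lines at 60 characters, a common FASTA convention.
            for start in range(0, length, 60):
                fh.write(seq[start:start + 60] + "\n")
    return path
```

Calling write_toy_fasta("toy.fasta") produces a file that can be passed to mmseqs createdb in place of ../transcripts.fasta in the steps above.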

These are the files I generated from NT as the target database (does anything look off?):

-rw-rw-r-- 1 user user 129522020819 Apr 14 17:03 nt.fnaDB
-rw-rw-r-- 1 user user 4 Apr 14 17:03 nt.fnaDB.dbtype
-rw-rw-r-- 1 user user 1766255879 Apr 14 17:03 nt.fnaDB.index
-rw-rw-r-- 1 user user 1657557037 Apr 14 17:05 nt.fnaDB.lookup
-rw-rw-r-- 1 user user 9 Apr 14 16:58 nt.fnaDB.source
-rw-rw-r-- 1 user user 7644438631 Apr 14 16:58 nt.fnaDB_h
-rw-rw-r-- 1 user user 4 Apr 14 16:58 nt.fnaDB_h.dbtype
-rw-rw-r-- 1 user user 1609915648 Apr 14 17:03 nt.fnaDB_h.index
-rw-rw-r-- 1 user user 1043159832 Apr 14 17:21 nt.fnaDB_mapping
-rw-rw-r-- 1 user user 640718438 Apr 14 17:17 nt.fnaDB_taxonomy
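File sizes alone do not show whether the database is internally consistent. One further check, as a minimal sketch: verify that every entry in the .index file points inside the data file. This assumes each .index line holds three whitespace-separated integers (key, offset, length), which is how MMseqs2 database indexes are usually described. The function name check_index is hypothetical, and passing this check does not rule out other kinds of corruption.

```python
import os


def check_index(db_path):
    """Return a list of problems found in db_path + '.index'.

    Assumes each index line is 'key offset length' (whitespace separated).
    """
    data_size = os.path.getsize(db_path)
    problems = []
    with open(db_path + ".index") as fh:
        for line_no, line in enumerate(fh, 1):
            fields = line.split()
            if len(fields) != 3:
                problems.append(f"line {line_no}: expected 3 fields, got {len(fields)}")
                continue
            try:
                key, offset, length = (int(f) for f in fields)
            except ValueError:
                problems.append(f"line {line_no}: non-integer field")
                continue
            # An entry must lie entirely within the data file.
            if offset < 0 or length < 0 or offset + length > data_size:
                problems.append(
                    f"line {line_no}: key {key} spans {offset}..{offset + length}, "
                    f"data file is {data_size} bytes"
                )
    return problems
```

For the listing above this would be called as check_index("nt.fnaDB") and check_index("nt.fnaDB_h"); an empty list means no entry falls outside its data file.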

MMseqs Output (some paths & filenames redacted)

Create directory tmp
taxonomy --search-type 3 Transcripts_mmseqs2 nt.fnaDB MyTaxonomyResult tmp 

MMseqs Version:                         19064f27c8d86fcdcd3daad60f6db70f6360f30b
ORF filter                              1
ORF filter e-value                      100
ORF filter sensitivity                  2
LCA mode                                3
Taxonomy output mode                    0
Majority threshold                      0.5
Vote mode                               1
LCA ranks                               
Column with taxonomic lineage           0
Compressed                              0
Threads                                 64
Verbosity                               3
Taxon blacklist                         12908:unclassified sequences,28384:other sequences
Substitution matrix                     nucl:nucleotide.out,aa:blosum62.out
Add backtrace                           false
Alignment mode                          1
Alignment mode                          0
Allow wrapped scoring                   false
E-value threshold                       1
Seq. id. threshold                      0
Min alignment length                    0
Seq. id. mode                           0
Alternative alignments                  0
Coverage threshold                      0
Coverage mode                           0
Max sequence length                     65535
Compositional bias                      1
Max reject                              5
Max accept                              30
Include identical seq. id.              false
Preload mode                            0
Pseudo count a                          1
Pseudo count b                          1.5
Score bias                              0
Realign hits                            false
Realign score bias                      -0.2
Realign max seqs                        2147483647
Gap open cost                           nucl:5,aa:11
Gap extension cost                      nucl:2,aa:1
Zdrop                                   40
Seed substitution matrix                nucl:nucleotide.out,aa:VTML80.out
Sensitivity                             2
k-mer length                            0
k-score                                 2147483647
Alphabet size                           nucl:5,aa:21
Max results per query                   300
Split database                          0
Split mode                              2
Split memory limit                      0
Diagonal scoring                        true
Exact k-mer matching                    0
Mask residues                           1
Mask lower case residues                0
Minimum diagonal score                  15
Spaced k-mers                           1
Spaced k-mer pattern                    
Local temporary path                    
Rescore mode                            0
Remove hits by seq. id. and coverage    false
Sort results                            0
Mask profile                            1
Profile E-value threshold               0.001
Global sequence weighting               false
Allow deletions                         false
Filter MSA                              1
Maximum seq. id. threshold              0.9
Minimum seq. id.                        0
Minimum score per column                -20
Minimum coverage                        0
Select N most diverse seqs              1000
Min codons in orf                       30
Max codons in length                    32734
Max orf gaps                            2147483647
Contig start mode                       2
Contig end mode                         2
Orf start mode                          1
Forward frames                          1,2,3
Reverse frames                          1,2,3
Translation table                       1
Translate orf                           0
Use all table starts                    false
Offset of numeric ids                   0
Create lookup                           0
Add orf stop                            false
Overlap between sequences               0
Sequence split mode                     1
Header split mode                       0
Chain overlapping alignments            0
Merge query                             1
Search type                             3
Exhaustive search mode                  false
Filter results during exhaustive search 0
Strand selection                        1
LCA search mode                         false
Disk space limit                        0
MPI runner                              
Force restart with latest tmp           false
Remove temporary files                  false

Accel. 2bLCA cannot be used with nucl-nucl taxonomy, using top-hit instead
Create directory tmp/11485603906739492364/tmp_hsp1
search Transcripts_mmseqs2 nt.fnaDB tmp/11485603906739492364/first tmp/11485603906739492364/tmp_hsp1 --alignment-mode 1 -e 1 --max-rejected 5 --max-accept 30 -s 2 --spaced-kmer-mode 1 --min-length 30 --max-length 32734 --orf-start-mode 1 --search-type 3 

splitsequence nt.fnaDB tmp/11485603906739492364/tmp_hsp1/7610357885614778610/target_seqs_split --max-seq-len 10000 --sequence-overlap 0 --sequence-split-mode 1 --headers-split-mode 0 --create-lookup 0 --threads 64 --compressed 0 -v 3 

[=================================================================] 100.00% 69.18M 4s 219ms     
Time for merging to target_seqs_split_h: 0h 0m 22s 610ms
Time for merging to target_seqs_split: 0h 0m 24s 544ms
Time for processing: 0h 1m 21s 366ms
extractframes Transcripts_mmseqs2 tmp/11485603906739492364/tmp_hsp1/7610357885614778610/query_seqs --forward-frames 1 --reverse-frames 1 --create-lookup 0 --threads 64 --compressed 0 -v 3 

[=================================================================] 100.00% 1.32M 0s 321ms      
Time for merging to query_seqs_h: 0h 0m 0s 490ms
Time for merging to query_seqs: 0h 0m 2s 39ms
Time for processing: 0h 0m 3s 738ms
splitsequence tmp/11485603906739492364/tmp_hsp1/7610357885614778610/query_seqs tmp/11485603906739492364/tmp_hsp1/7610357885614778610/query_seqs_split --max-seq-len 10000 --sequence-overlap 0 --sequence-split-mode 1 --headers-split-mode 0 --create-lookup 0 --threads 64 --compressed 0 -v 3 

[=================================================================] 100.00% 2.64M 0s 246ms      
Time for merging to query_seqs_split_h: 0h 0m 0s 507ms
Time for merging to query_seqs_split: 0h 0m 0s 573ms
Time for processing: 0h 0m 2s 178ms
prefilter tmp/11485603906739492364/tmp_hsp1/7610357885614778610/query_seqs_split tmp/11485603906739492364/tmp_hsp1/7610357885614778610/target_seqs_split tmp/11485603906739492364/tmp_hsp1/7610357885614778610/search/pref_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -k 15 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 10000 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 1 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 64 --compressed 0 -v 3 -s 2.0 

Query database size: 2644526 type: Nucleotide
Target split mode. Searching through 13 splits
Estimated memory consumption: 247G
Target database size: 99637107 type: Nucleotide
Process prefiltering step 1 of 13

Index table k-mer threshold: 0 at k-mer size 15 
Index table: counting k-mers
Segmentation fault                                                ] 0.00% 1 eta -
Error: Prefilter died
Error: Search step died
Error: First search died

Environment

AMD EPYC 7502P 32-Core Processor
320GB memory
OS: Ubuntu 20.04.2 LTS Kernel: 5.4.0-70-generic

I'm getting a similar segmentation fault with a tblastn-style search against a taxonomy-annotated target database derived from BLAST NT. Interestingly, the prefilter step estimates the memory consumption at 60T but goes straight into prefiltering instead of splitting the database to stay within the ~620G memory limit. I also used the --compressed flag, and will check whether removing it fixes the problem for me too.

@milot-mirdita It may be worth re-opening this issue.

search query_db/db target_db/db result_db/db /fsx/scratch/mmseqs/mmseqs-nf/d3d8e6be-a51b-4707-b105-d650f029c7be/MMSEQS/BLAST_DB_SEARCH/mmseqs_search -s 6 -a --num-iterations 1 --use-all-table-starts 1 --compressed 1 --split-memory-limit 618475290624 --threads 96 

MMseqs Version:                         45111b641859ed0ddd875b94d6fd1aef1a675b7e
Substitution matrix                     nucl:nucleotide.out,aa:blosum62.out
Add backtrace                           true
Alignment mode                          2
Alignment mode                          0
Allow wrapped scoring                   false
E-value threshold                       0.001
Seq. id. threshold                      0
Min alignment length                    0
Seq. id. mode                           0
Alternative alignments                  0
Coverage threshold                      0
Coverage mode                           0
Max sequence length                     65535
Compositional bias                      1
Max reject                              2147483647
Max accept                              2147483647
Include identical seq. id.              false
Preload mode                            0
Pseudo count a                          1
Pseudo count b                          1.5
Score bias                              0
Realign hits                            false
Realign score bias                      -0.2
Realign max seqs                        2147483647
Gap open cost                           nucl:5,aa:11
Gap extension cost                      nucl:2,aa:1
Zdrop                                   40
Threads                                 96
Compressed                              1
Verbosity                               3
Seed substitution matrix                nucl:nucleotide.out,aa:VTML80.out
Sensitivity                             6
k-mer length                            0
k-score                                 2147483647
Alphabet size                           nucl:5,aa:21
Max results per query                   300
Split database                          0
Split mode                              2
Split memory limit                      589824T
Diagonal scoring                        true
Exact k-mer matching                    0
Mask residues                           1
Mask lower case residues                0
Minimum diagonal score                  15
Spaced k-mers                           1
Spaced k-mer pattern                   
Local temporary path                   
Rescore mode                            0
Remove hits by seq. id. and coverage    false
Sort results                            0
Mask profile                            1
Profile E-value threshold               0.1
Global sequence weighting               false
Allow deletions                         false
Filter MSA                              1
Maximum seq. id. threshold              0.9
Minimum seq. id.                        0
Minimum score per column                -20
Minimum coverage                        0
Select N most diverse seqs              1000
Min codons in orf                       30
Max codons in length                    32734
Max orf gaps                            2147483647
Contig start mode                       2
Contig end mode                         2
Orf start mode                          1
Forward frames                          1,2,3
Reverse frames                          1,2,3
Translation table                       1
Translate orf                           0
Use all table starts                    true
Offset of numeric ids                   0
Create lookup                           0
Add orf stop                            false
Overlap between sequences               0
Sequence split mode                     1
Header split mode                       0
Chain overlapping alignments            0
Merge query                             1
Search type                             0
Search iterations                       1
Start sensitivity                       4
Search steps                            1
Exhaustive search mode                  false
Filter results during exhaustive search 0
Strand selection                        1
LCA search mode                         false
Disk space limit                        0
MPI runner                             
Force restart with latest tmp           false
Remove temporary files                  false

prefilter query_db/db /fsx/scratch/mmseqs/mmseqs-nf/d3d8e6be-a51b-4707-b105-d650f029c7be/MMSEQS/BLAST_DB_SEARCH/mmseqs_search/340477856621524793/t_orfs_aa /fsx/scratch/mmseqs/mmseqs-nf/d3d8e6be-a51b-4707-b105-d650f029c7be/MMSEQS/BLAST_DB_SEARCH/mmseqs_search/340477856621524793/search/pref_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 589824T -c 0 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 96 --compressed 1 -v 3 -s 6.0 

Query database size: 727664 type: Aminoacid
Estimated memory consumption: 60T
Target database size: 13319670203 type: Aminoacid
Index table k-mer threshold: 118 at k-mer size 7 
Index table: counting k-mers
Error: Prefilter died
Error: Search step died
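A quick arithmetic check on the numbers in this log may explain why no split happened. The command passes --split-memory-limit 618475290624, but the parameter dump reports "Split memory limit 589824T". The sketch below assumes binary units (1K = 1024) and only shows that the logged value matches the bare number being read as mebibytes rather than bytes. That is an inference from the log, not something confirmed against the MMseqs2 source.

```python
import math

GIB = 1024 ** 3
TIB = 1024 ** 4

requested = 618475290624   # value given to --split-memory-limit on the command line
logged_limit_tib = 589824  # "Split memory limit 589824T" in the parameter dump
estimated_tib = 60         # "Estimated memory consumption: 60T"

# Read as bytes, the requested value is a round number of GiB.
requested_gib = requested / GIB

# Read as MiB instead, the same number converts to this many TiB.
as_mib_in_tib = requested * 1024 ** 2 / TIB

# Splits that would be needed if the limit had been applied as bytes.
splits_needed = math.ceil(estimated_tib * TIB / requested)

# With the logged limit, the estimate never exceeds the limit.
limit_exceeds_estimate = logged_limit_tib > estimated_tib

print(f"requested as bytes: {requested_gib:g} GiB")
print(f"requested as MiB:   {as_mib_in_tib:g} TiB (log shows {logged_limit_tib}T)")
print(f"splits needed at the intended limit: {splits_needed}")
print(f"logged limit above estimate: {limit_exceeds_estimate}")
```

If that reading is right, the effective limit of 589824T is far above the 60T estimate, so the prefilter would see no reason to split the target. The prefilter call in the log passes the limit with a unit suffix (589824T), so giving the limit with an explicit suffix may be worth trying, though that is untested here.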

soedinglab / MMseqs2