mmseqs createindex --split not generating correct number of splits

nick-youngblut commented 3 years ago

Expected Behavior

I expect --split 16 for mmseqs createindex to generate 16 *.idx files. Instead, I'm getting 18:

mmseqs_tax_target/mmseqs_tax.db.idx.0
mmseqs_tax_target/mmseqs_tax.db.idx.1
mmseqs_tax_target/mmseqs_tax.db.idx.10
mmseqs_tax_target/mmseqs_tax.db.idx.11
mmseqs_tax_target/mmseqs_tax.db.idx.12
mmseqs_tax_target/mmseqs_tax.db.idx.13
mmseqs_tax_target/mmseqs_tax.db.idx.14
mmseqs_tax_target/mmseqs_tax.db.idx.15
mmseqs_tax_target/mmseqs_tax.db.idx.16
mmseqs_tax_target/mmseqs_tax.db.idx.17
mmseqs_tax_target/mmseqs_tax.db.idx.2
mmseqs_tax_target/mmseqs_tax.db.idx.3
mmseqs_tax_target/mmseqs_tax.db.idx.4
mmseqs_tax_target/mmseqs_tax.db.idx.5
mmseqs_tax_target/mmseqs_tax.db.idx.6
mmseqs_tax_target/mmseqs_tax.db.idx.7
mmseqs_tax_target/mmseqs_tax.db.idx.8
mmseqs_tax_target/mmseqs_tax.db.idx.9

Pipeline software (e.g., Snakemake) generally requires keeping track of all (important) output files produced; otherwise, untracked output files can accidentally be deleted, which is causing some downstream problems (e.g., seg-fault errors for mmseqs taxonomy).
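One way to avoid hard-coding the number of splits in a pipeline is to collect whatever numbered index files exist once createindex has finished. The following is a minimal sketch, not something MMseqs2 provides: the function name `list_idx_splits` is hypothetical, and it assumes the split files follow the `<db>.idx.<number>` naming shown in the listing above. The demonstration uses empty placeholder files rather than a real index.

```shell
# Hypothetical helper: print every numbered index split for a database
# prefix, without assuming how many splits were written.
list_idx_splits() {
    local db="$1"
    local f
    for f in "${db}".idx.[0-9]*; do
        # Skip the unexpanded pattern when no split files are present.
        [ -e "$f" ] || continue
        printf '%s\n' "$f"
    done
}

# Demonstration with empty placeholder files in a temporary directory.
demo_dir="$(mktemp -d)"
for i in $(seq 0 17); do
    : > "${demo_dir}/mmseqs_tax.db.idx.${i}"
done
# Non-numbered companions; the pattern above does not match them.
: > "${demo_dir}/mmseqs_tax.db.idx.index"
: > "${demo_dir}/mmseqs_tax.db.idx.dbtype"

# Arithmetic expansion strips any padding that wc may add.
n_splits=$(( $(list_idx_splits "${demo_dir}/mmseqs_tax.db" | wc -l) ))
echo "found ${n_splits} split files"
rm -rf "${demo_dir}"
```

The pattern matches only suffixes that start with a digit, so `.idx.index` and `.idx.dbtype` would still need to be tracked separately.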

Steps to Reproduce (for bugs)

mmseqs createindex --threads 8   --split 16 mmseqs_tax.db mmseqs_tax_target/tmp/

Your Environment

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
ca-certificates           2021.1.19            h06a4308_1
gawk                      5.1.0                h516909a_0    conda-forge
gettext                   0.19.8.1          h0b5b191_1005    conda-forge
libffi                    3.3                  h58526e2_2    conda-forge
libgcc-ng                 9.3.0               h2828fa1_18    conda-forge
libgomp                   9.3.0               h2828fa1_18    conda-forge
libidn2                   2.3.0                h516909a_0    conda-forge
libstdcxx-ng              9.3.0               h6de172a_18    conda-forge
libunistring              0.9.10               h14c3975_0    conda-forge
mmseqs2                   13.45111             h95f258a_1    bioconda
openssl                   1.1.1k               h7f98852_0    conda-forge
pigz                      2.6                  h27826a3_0    conda-forge
seqkit                    0.15.0                        0    bioconda
seqtk                     1.3                  h5bf99c6_3    bioconda
wget                      1.20.1               h22169c7_0    conda-forge
zlib                      1.2.11            h516909a_1010    conda-forge

OS: Ubuntu 18.04.5

nick-youngblut commented 3 years ago

The seg-fault errors that I'm getting with mmseqs taxonomy don't appear to be due to the 2 extra split files. Even when tracking all *.idx files so they don't accidentally get deleted, I get the following error:

taxonomy -e 1e-5 --max-seqs 200 --num-iterations 2 --start-sens 1 --sens-steps 3 -s 6 --lca-ranks superkingdom,kingdom,phylum,class,order,family,genus,species --threads 8 /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_query/06/seqs_db /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax/06/seqs_tax_db /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax/06/TMP

MMseqs Version:                         13.45111
ORF filter                              1
ORF filter e-value                      100
ORF filter sensitivity                  2
LCA mode                                3
Taxonomy output mode                    0
Majority threshold                      0.5
Vote mode                               1
LCA ranks                               superkingdom,kingdom,phylum,class,order,family,genus,species
Column with taxonomic lineage           0
Compressed                              0
Threads                                 8
Verbosity                               3
Taxon blacklist                         12908:unclassified sequences,28384:other sequences
Substitution matrix                     nucl:nucleotide.out,aa:blosum62.out
Add backtrace                           false
Alignment mode                          1
Alignment mode                          0
Allow wrapped scoring                   false
E-value threshold                       1e-05
Seq. id. threshold                      0
Min alignment length                    0
Seq. id. mode                           0
Alternative alignments                  0
Coverage threshold                      0
Coverage mode                           0
Max sequence length                     65535
Compositional bias                      1
Max reject                              5
Max accept                              30
Include identical seq. id.              false
Preload mode                            0
Pseudo count a                          1
Pseudo count b                          1.5
Score bias                              0
Realign hits                            false
Realign score bias                      -0.2
Realign max seqs                        2147483647
Gap open cost                           nucl:5,aa:11
Gap extension cost                      nucl:2,aa:1
Zdrop                                   40
Seed substitution matrix                nucl:nucleotide.out,aa:VTML80.out
Sensitivity                             6
k-mer length                            0
k-score                                 2147483647
Alphabet size                           nucl:5,aa:21
Max results per query                   200
Split database                          0
Split mode                              2
Split memory limit                      0
Diagonal scoring                        true
Exact k-mer matching                    0
Mask residues                           1
Mask lower case residues                0
Minimum diagonal score                  15
Spaced k-mers                           1
Spaced k-mer pattern
Local temporary path
Rescore mode                            0
Remove hits by seq. id. and coverage    false
Sort results                            0
Mask profile                            1
Profile E-value threshold               0.001
Global sequence weighting               false
Allow deletions                         false
Filter MSA                              1
Maximum seq. id. threshold              0.9
Minimum seq. id.                        0
Minimum score per column                -20
Minimum coverage                        0
Select N most diverse seqs              1000
Min codons in orf                       30
Max codons in length                    32734
Max orf gaps                            2147483647
Contig start mode                       2
Contig end mode                         2
Orf start mode                          1
Forward frames                          1,2,3
Reverse frames                          1,2,3
Translation table                       1
Translate orf                           0
Use all table starts                    false
Offset of numeric ids                   0
Create lookup                           0
Add orf stop                            false
Overlap between sequences               0
Sequence split mode                     1
Header split mode                       0
Chain overlapping alignments            0
Merge query                             1
Search type                             0
Search iterations                       2
Start sensitivity                       1
Search steps                            3
Exhaustive search mode                  false
Filter results during exhaustive search 0
Strand selection                        1
LCA search mode                         false
Disk space limit                        0
MPI runner
Force restart with latest tmp           false
Remove temporary files                  false

Create directory /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax/06/TMP/14652724320229658153/tmp_hsp1
search /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_query/06/seqs_db /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax/06/TMP/14652724320229658153/first /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax/06/TMP/14652724320229658153/tmp_hsp1 --alignment-mode 1 -e 1e-05 --max-rejected 5 --max-accept 30 --threads 8 -s 6 --max-seqs 200 --spaced-kmer-mode 1 --min-length 30 --max-length 32734 --orf-start-mode 1 --num-iterations 2 --start-sens 1 --sens-steps 3 --lca-search 1

prefilter /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_query/06/seqs_db /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax/06/TMP/14652724320229658153/tmp_hsp1/11598483508011826746/pref_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 6 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 200 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 8 --compressed 0 -v 3

Index version: 16
Generated by:  13.45111
ScoreMatrix:  VTML80.out
Query database size: 1075 type: Aminoacid
Target split mode. Searching through 16 splits
Estimated memory consumption: 8G
Target database size: 41195879 type: Aminoacid
Process prefiltering step 1 of 16

k-mer similarity threshold: 109
Starting prefiltering scores calculation (step 1 of 16)
Query db start 1 to 1075
Target db start 1 to 2572505
[=================================================================] 1.08K 2s 989ms

390.206187 k-mers per position
423278 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
25 sequences passed prefiltering per query sequence
26 median result list length
0 sequences with 0 size result lists
Time for merging to pref_0_tmp_0: 0h 0m 0s 8ms
Time for merging to pref_0_tmp_0_tmp: 0h 0m 0s 10ms
Process prefiltering step 2 of 16

k-mer similarity threshold: 109
Starting prefiltering scores calculation (step 2 of 16)
Query db start 1 to 1075
Target db start 2572506 to 5147039
[=================================================================] 1.08K 3s 152ms

390.206187 k-mers per position
423330 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
25 sequences passed prefiltering per query sequence
26 median result list length
1 sequences with 0 size result lists
Time for merging to pref_0_tmp_1: 0h 0m 0s 8ms
Time for merging to pref_0_tmp_1_tmp: 0h 0m 0s 36ms
Process prefiltering step 3 of 16

k-mer similarity threshold: 109
Starting prefiltering scores calculation (step 3 of 16)
Query db start 1 to 1075
Target db start 5147040 to 7717242
[=================================================================] 1.08K 2s 825ms

390.206187 k-mers per position
423389 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
25 sequences passed prefiltering per query sequence
26 median result list length
0 sequences with 0 size result lists
Time for merging to pref_0_tmp_2: 0h 0m 0s 43ms
Time for merging to pref_0_tmp_2_tmp: 0h 0m 0s 57ms
Process prefiltering step 4 of 16

k-mer similarity threshold: 109
Starting prefiltering scores calculation (step 4 of 16)
Query db start 1 to 1075
Target db start 7717243 to 10294414
[=================================================================] 1.08K 3s 10ms

390.206187 k-mers per position
423306 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
25 sequences passed prefiltering per query sequence
26 median result list length
1 sequences with 0 size result lists
Time for merging to pref_0_tmp_3: 0h 0m 0s 23ms
Time for merging to pref_0_tmp_3_tmp: 0h 0m 0s 55ms
Process prefiltering step 5 of 16

k-mer similarity threshold: 109
Starting prefiltering scores calculation (step 5 of 16)
Query db start 1 to 1075
Target db start 10294415 to 12871105
[=================================================================] 1.08K 2s 902ms

390.206187 k-mers per position
423264 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
25 sequences passed prefiltering per query sequence
26 median result list length
1 sequences with 0 size result lists
Time for merging to pref_0_tmp_4: 0h 0m 0s 8ms
Time for merging to pref_0_tmp_4_tmp: 0h 0m 0s 11ms
Process prefiltering step 6 of 16

k-mer similarity threshold: 109
Starting prefiltering scores calculation (step 6 of 16)
Query db start 1 to 1075
Target db start 12871106 to 15442705
[=================================================================] 1.08K 2s 907ms

390.206187 k-mers per position
423514 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
25 sequences passed prefiltering per query sequence
26 median result list length
1 sequences with 0 size result lists
Time for merging to pref_0_tmp_5: 0h 0m 0s 9ms
Time for merging to pref_0_tmp_5_tmp: 0h 0m 0s 9ms
Process prefiltering step 7 of 16

k-mer similarity threshold: 109
Starting prefiltering scores calculation (step 7 of 16)
Query db start 1 to 1075
Target db start 15442706 to 18017124
[=================================================================] 1.08K 2s 795ms

390.206187 k-mers per position
423292 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
25 sequences passed prefiltering per query sequence
26 median result list length
1 sequences with 0 size result lists
Time for merging to pref_0_tmp_6: 0h 0m 0s 7ms
Time for merging to pref_0_tmp_6_tmp: 0h 0m 0s 9ms
Process prefiltering step 8 of 16

k-mer similarity threshold: 109
Starting prefiltering scores calculation (step 8 of 16)
Query db start 1 to 1075
Target db start 18017125 to 20593148
[=================================================================] 1.08K 2s 843ms

390.206187 k-mers per position
423223 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
25 sequences passed prefiltering per query sequence
26 median result list length
0 sequences with 0 size result lists
Time for merging to pref_0_tmp_7: 0h 0m 0s 9ms
Time for merging to pref_0_tmp_7_tmp: 0h 0m 0s 10ms
Process prefiltering step 9 of 16

k-mer similarity threshold: 109
Starting prefiltering scores calculation (step 9 of 16)
Query db start 1 to 1075
Target db start 20593149 to 23168610
[=================================================================] 1.08K 3s 92ms

390.206187 k-mers per position
423365 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
25 sequences passed prefiltering per query sequence
26 median result list length
0 sequences with 0 size result lists
Time for merging to pref_0_tmp_8: 0h 0m 0s 7ms
Time for merging to pref_0_tmp_8_tmp: 0h 0m 0s 11ms
Process prefiltering step 10 of 16

k-mer similarity threshold: 109
Starting prefiltering scores calculation (step 10 of 16)
Query db start 1 to 1075
Target db start 23168611 to 25746437
[=================================================================] 1.08K 2s 946ms

390.206187 k-mers per position
423353 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
25 sequences passed prefiltering per query sequence
26 median result list length
1 sequences with 0 size result lists
Time for merging to pref_0_tmp_9: 0h 0m 0s 11ms
Time for merging to pref_0_tmp_9_tmp: 0h 0m 0s 15ms
Process prefiltering step 11 of 16

k-mer similarity threshold: 109
Starting prefiltering scores calculation (step 11 of 16)
Query db start 1 to 1075
Target db start 25746438 to 28318851
[=================================================================] 1.08K 2s 418ms

390.206187 k-mers per position
423304 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
25 sequences passed prefiltering per query sequence
26 median result list length
0 sequences with 0 size result lists
Time for merging to pref_0_tmp_10: 0h 0m 0s 8ms
Time for merging to pref_0_tmp_10_tmp: 0h 0m 0s 14ms
Process prefiltering step 12 of 16

k-mer similarity threshold: 109
Starting prefiltering scores calculation (step 12 of 16)
Query db start 1 to 1075
Target db start 28318852 to 30895702
[=================================================================] 1.08K 3s 701ms

390.206187 k-mers per position
423306 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
25 sequences passed prefiltering per query sequence
26 median result list length
0 sequences with 0 size result lists
Time for merging to pref_0_tmp_11: 0h 0m 0s 61ms
Time for merging to pref_0_tmp_11_tmp: 0h 0m 0s 71ms
Process prefiltering step 13 of 16

k-mer similarity threshold: 109
Starting prefiltering scores calculation (step 13 of 16)
Query db start 1 to 1075
Target db start 30895703 to 33469145
[=================================================================] 1.08K 3s 180ms

390.206187 k-mers per position
423354 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
25 sequences passed prefiltering per query sequence
26 median result list length
1 sequences with 0 size result lists
Time for merging to pref_0_tmp_12: 0h 0m 0s 10ms
Time for merging to pref_0_tmp_12_tmp: 0h 0m 0s 14ms
Process prefiltering step 14 of 16

k-mer similarity threshold: 109
Starting prefiltering scores calculation (step 14 of 16)
Query db start 1 to 1075
Target db start 33469146 to 36042326
[=================================================================] 1.08K 3s 458ms

390.206187 k-mers per position
423372 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
25 sequences passed prefiltering per query sequence
26 median result list length
2 sequences with 0 size result lists
Time for merging to pref_0_tmp_13: 0h 0m 0s 34ms
Time for merging to pref_0_tmp_13_tmp: 0h 0m 0s 44ms
Process prefiltering step 15 of 16

k-mer similarity threshold: 109
Starting prefiltering scores calculation (step 15 of 16)
Query db start 1 to 1075
Target db start 36042327 to 38619947
[=================================================================] 1.08K 3s 98ms

390.206187 k-mers per position
423325 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
25 sequences passed prefiltering per query sequence
26 median result list length
1 sequences with 0 size result lists
Time for merging to pref_0_tmp_14: 0h 0m 0s 29ms
Time for merging to pref_0_tmp_14_tmp: 0h 0m 0s 31ms
Process prefiltering step 16 of 16

k-mer similarity threshold: 109
Starting prefiltering scores calculation (step 16 of 16)
Query db start 1 to 1075
Target db start 38619948 to 41195879
[=================================================================] 1.08K 2s 904ms

390.206187 k-mers per position
423266 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
25 sequences passed prefiltering per query sequence
26 median result list length
0 sequences with 0 size result lists
Time for merging to pref_0_tmp_15: 0h 0m 0s 24ms
Time for merging to pref_0_tmp_15_tmp: 0h 0m 0s 20ms
Merging 16 target splits to pref_0
Preparing offsets for merging: 0h 0m 0s 53ms
[=================================================================] 1.08K 0s 37ms
Time for merging to pref_0: 0h 0m 0s 23ms
Time for merging target splits: 0h 0m 0s 174ms
Time for merging to pref_0_tmp: 0h 0m 0s 45ms
Time for processing: 0h 6m 46s 299ms
lcaalign /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_query/06/seqs_db /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax/06/TMP/14652724320229658153/tmp_hsp1/11598483508011826746/pref_0 /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax/06/TMP/14652724320229658153/tmp_hsp1/11598483508011826746/aln_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 1 --alignment-mode 1 --alignment-output-mode 0 --wrapped-scoring 0 -e 1e-05 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --max-rejected 5 --max-accept 30 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 1 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 8 --compressed 0 -v 3

Index version: 16
Generated by:  13.45111
ScoreMatrix:  VTML80.out
Compute score and coverage
Query database size: 1075 type: Aminoacid
Target database size: 41195879 type: Aminoacid
[=================================================================] 1.08K 0s 508ms
Time for merging to aln_0: 0h 0m 0s 8ms
19048 alignments calculated
15817 sequence pairs passed the thresholds (0.830376 of overall calculated)
14.713489 hits per query sequence
Time for processing: 0h 0m 54s 194ms
result2profile /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_query/06/seqs_db /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax/06/TMP/14652724320229658153/tmp_hsp1/11598483508011826746/aln_0 /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax/06/TMP/14652724320229658153/tmp_hsp1/11598483508011826746/profile_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out -e 1e-05 --mask-profile 1 --e-profile 0.1 --comp-bias-corr 1 --wg 0 --allow-deletion 0 --filter-msa 1 --max-seq-id 0.9 --qid 0 --qsc -20 --cov 0 --diff 1000 --pca 0 --pcb 1.5 --db-load-mode 0 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --threads 8 --compressed 0 -v 3

Index version: 16
Generated by:  13.45111
ScoreMatrix:  VTML80.out
Query database size: 1075 type: Aminoacid
Target database size: 41195879 type: Aminoacid
[========================================Segmentation fault
Error: Create profile died
Error: First search died

Note that sometimes when I re-run the command, I instead get the error:

Index version: 16
Generated by:  13.45111
ScoreMatrix:  VTML80.out
Query database size: 1075 type: Aminoacid
Target database size: 41195879 type: Aminoacid
[=======================================================]
free(): invalid next size (normal)
Aborted
Error: Create profile died
Error: First search died

System memory should not be the cause; I've got ~800 GB free.

Maybe I'm missing a "hidden" input file (i.e., one of the files associated with the main input files, which are generally not mentioned in any of the docs). The input files that are present:

/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_query/06/seqs_db
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_query/06/seqs_db.dbtype
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_query/06/seqs_db.index
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_query/06/seqs_db_h
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_query/06/seqs_db_h.dbtype
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_query/06/seqs_db_h.index
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.dbtype
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.index
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db_h
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db_h.dbtype
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db_h.index
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db_mapping
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db_delnodes.dmp
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db_merged.dmp
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db_names.dmp
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db_nodes.dmp
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx.0
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx.1
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx.2
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx.3
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx.4
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx.5
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx.6
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx.7
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx.8
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx.9
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx.10
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx.11
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx.12
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx.13
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx.14
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx.15
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx.16
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx.17
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx.dbtype
/ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx.index

If I had to guess, there's probably something wrong with the *.idx files.

milot-mirdita commented 3 years ago

I introduced the two additional splits because of https://github.com/soedinglab/MMseqs2/issues/338, though that wasn't very effective at reducing peak memory use.

The error looks like a memory corruption though. I am not really sure how to reproduce the issue locally. Do you still have the tmp files? Could you try rerunning only the last step without the index:

mmseqs result2profile /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_query/06/seqs_db /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax/06/TMP/14652724320229658153/tmp_hsp1/11598483508011826746/aln_0 /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax/06/TMP/14652724320229658153/tmp_hsp1/11598483508011826746/profile_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out -e 1e-05 --mask-profile 1 --e-profile 0.1 --comp-bias-corr 1 --wg 0 --allow-deletion 0 --filter-msa 1 --max-seq-id 0.9 --qid 0 --qsc -20 --cov 0 --diff 1000 --pca 0 --pcb 1.5 --db-load-mode 0 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --threads 8 --compressed 0 -v 3

The only change was to remove the .idx suffix after mmseqs_tax.db.
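For scripted reruns, the same one-word difference can be expressed with shell parameter expansion. This is only an illustration: the variable names are hypothetical, and the path is a shortened, relative form of the target path in the command above.

```shell
# Target argument as passed to the failing step (precomputed index).
target_idx="mmseqs_tax_target/mmseqs_tax.db.idx"

# Drop a trailing ".idx" so the plain sequence database is used instead.
target_db="${target_idx%.idx}"

echo "${target_db}"
```

The `%` form of the expansion removes the suffix only when it is present at the end, so a value that is already the plain database path is left unchanged.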

milot-mirdita commented 3 years ago

The next step would be to try an MMseqs2 build instrumented with ASan. Sadly, ASan doesn't support static builds, so you would have to compile MMseqs2 yourself:

git clone https://github.com/soedinglab/MMseqs2.git
cd MMseqs2
mkdir build
cd build
cmake -DHAVE_SANITIZER=1 -DCMAKE_BUILD_TYPE=ASan ..
make -j $(nproc --all)

The new binary in src/mmseqs would then hopefully be able to tell what is going wrong:

Path-To-Where-You-Git-Clone/MMseqs2/build/src/mmseqs result2profile /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_query/06/seqs_db /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db.idx /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax/06/TMP/14652724320229658153/tmp_hsp1/11598483508011826746/aln_0 /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax/06/TMP/14652724320229658153/tmp_hsp1/11598483508011826746/profile_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out -e 1e-05 --mask-profile 1 --e-profile 0.1 --comp-bias-corr 1 --wg 0 --allow-deletion 0 --filter-msa 1 --max-seq-id 0.9 --qid 0 --qsc -20 --cov 0 --diff 1000 --pca 0 --pcb 1.5 --db-load-mode 0 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --threads 8 --compressed 0 -v 3

nick-youngblut commented 3 years ago

Removing the *.idx suffix for mmseqs result2profile did not fix the issue. I'll try the ASan build next.

nick-youngblut commented 3 years ago

Here's the output from the ASan run:

 ./build/src/mmseqs result2profile \
>   /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_query/09/seqs_db \
>   /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db \
>   /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax/09/TMP/1355100225373504351/tmp_hsp1/9650299475897910544/aln_0 \
>   /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax/09/TMP/1355100225373504351/tmp_hsp1/9650299475897910544/profile_0 \
>   --sub-mat nucl:nucleotide.out,aa:blosum62.out -e 1e-05 --mask-profile 1 --e-profile 0.1 --comp-bias-corr 1 --wg 0 --allow-deletion 0 \
>   --filter-msa 1 --max-seq-id 0.9 --qid 0 --qsc -20 --cov 0 --diff 1000 --pca 0 --pcb 1.5 --db-load-mode 0 --gap-open nucl:5,aa:11 \
>   --gap-extend nucl:2,aa:1 --threads 8 --compressed 0 -v 3
result2profile /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_query/09/seqs_db /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax_target/mmseqs_tax.db /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax/09/TMP/1355100225373504351/tmp_hsp1/9650299475897910544/aln_0 /ebio/abt3_scratch/nyoungblut/LLCDS_126702996474/mmseqs_tax/09/TMP/1355100225373504351/tmp_hsp1/9650299475897910544/profile_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out -e 1e-05 --mask-profile 1 --e-profile 0.1 --comp-bias-corr 1 --wg 0 --allow-deletion 0 --filter-msa 1 --max-seq-id 0.9 --qid 0 --qsc -20 --cov 0 --diff 1000 --pca 0 --pcb 1.5 --db-load-mode 0 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --threads 8 --compressed 0 -v 3

MMseqs Version:             a6cab565c98376623e82c3a04d186219d4c2f10c
Substitution matrix         nucl:nucleotide.out,aa:blosum62.out
E-value threshold           1e-05
Mask profile                1
Profile E-value threshold   1e-05
Compositional bias          1
Global sequence weighting   false
Allow deletions             false
Filter MSA                  1
Maximum seq. id. threshold  0.9
Minimum seq. id.            0
Minimum score per column    -20
Minimum coverage            0
Select N most diverse seqs  1000
Pseudo count a              0
Pseudo count b              1.5
Preload mode                0
Gap open cost               nucl:5,aa:11
Gap extension cost          nucl:2,aa:1
Threads                     8
Compressed                  0
Verbosity                   3

Query database size: 1151 type: Aminoacid
Target database size: 41195879 type: Aminoacid
================================================================= ] 46.43% 535 eta 0s
==71239==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61a0000233e0 at pc 0x55c61d242cd7 bp 0x7fc0f27db1b0 sp 0x7fc0f27db1a0
WRITE of size 1 at 0x61a0000233e0 thread T3
==71239==AddressSanitizer: while reporting a bug found another one. Ignoring.08K eta 0s
    #0 0x55c61d242cd6 in MultipleAlignment::updateGapsInSequenceSet(char**, unsigned long, std::vector<std::vector<unsigned char, std::allocator<unsigned char> >, std::allocator<std::vector<unsigned char, std::allocator<unsigned char> > > > const&, std::vector<Matcher::result_t, std::allocator<Matcher::result_t> > const&, unsigned int*, bool) /ebio/abt3_projects/software/dev/ll_pipelines/llcds/tmp/mmseqs_taxonomy/MMseqs2/src/alignment/MultipleAlignment.cpp:168
    #1 0x55c61d2432cc in MultipleAlignment::computeMSA(Sequence*, std::vector<std::vector<unsigned char, std::allocator<unsigned char> >, std::allocator<std::vector<unsigned char, std::allocator<unsigned char> > > > const&, std::vector<Matcher::result_t, std::allocator<Matcher::result_t> > const&, bool) /ebio/abt3_projects/software/dev/ll_pipelines/llcds/tmp/mmseqs_taxonomy/MMseqs2/src/alignment/MultipleAlignment.cpp:208
    #2 0x55c61d180e7b in result2profile(int, char const**, Command const&, bool) [clone ._omp_fn.0] /ebio/abt3_projects/software/dev/ll_pipelines/llcds/tmp/mmseqs_taxonomy/MMseqs2/src/util/result2profile.cpp:203
    #3 0x7fc0f70d796d  (/usr/lib/x86_64-linux-gnu/libgomp.so.1+0x1696d)
    #4 0x7fc0f6c916da in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x76da)
    #5 0x7fc0f69ba71e in __clone (/lib/x86_64-linux-gnu/libc.so.6+0x12171e)

0x61a0000233e0 is located 0 bytes to the right of 1376-byte region [0x61a000022e80,0x61a0000233e0)
allocated by thread T3 here:
    #0 0x7fc0f812b790 in posix_memalign (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xdf790)
    #1 0x55c61cd2e5c3 in mem_align(unsigned long, unsigned long) /ebio/abt3_projects/software/dev/ll_pipelines/llcds/tmp/mmseqs_taxonomy/MMseqs2/lib/simd/simd.h:463
    #2 0x55c61cee071f in malloc_simd_int(unsigned long) /ebio/abt3_projects/software/dev/ll_pipelines/llcds/tmp/mmseqs_taxonomy/MMseqs2/lib/simd/simd.h:483
    #3 0x55c61d2410c9 in MultipleAlignment::initX(int) /ebio/abt3_projects/software/dev/ll_pipelines/llcds/tmp/mmseqs_taxonomy/MMseqs2/src/alignment/MultipleAlignment.cpp:19
    #4 0x55c61d243175 in MultipleAlignment::computeMSA(Sequence*, std::vector<std::vector<unsigned char, std::allocator<unsigned char> >, std::allocator<std::vector<unsigned char, std::allocator<unsigned char> > > > const&, std::vector<Matcher::result_t, std::allocator<Matcher::result_t> > const&, bool) /ebio/abt3_projects/software/dev/ll_pipelines/llcds/tmp/mmseqs_taxonomy/MMseqs2/src/alignment/MultipleAlignment.cpp:198
    #5 0x55c61d180e7b in result2profile(int, char const**, Command const&, bool) [clone ._omp_fn.0] /ebio/abt3_projects/software/dev/ll_pipelines/llcds/tmp/mmseqs_taxonomy/MMseqs2/src/util/result2profile.cpp:203
    #6 0x7fc0f70d796d  (/usr/lib/x86_64-linux-gnu/libgomp.so.1+0x1696d)

Thread T3 created by T0 here:
    #0 0x7fc0f8083d2f in __interceptor_pthread_create (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x37d2f)
    #1 0x7fc0f70d7f5f  (/usr/lib/x86_64-linux-gnu/libgomp.so.1+0x16f5f)
    #2 0x7fc0f70ceed9 in GOMP_parallel (/usr/lib/x86_64-linux-gnu/libgomp.so.1+0xded9)
    #3 0x7ffc996a2d2f  (<unknown module>)

SUMMARY: AddressSanitizer: heap-buffer-overflow /ebio/abt3_projects/software/dev/ll_pipelines/llcds/tmp/mmseqs_taxonomy/MMseqs2/src/alignment/MultipleAlignment.cpp:168 in MultipleAlignment::updateGapsInSequenceSet(char**, unsigned long, std::vector<std::vector<unsigned char, std::allocator<unsigned char> >, std::allocator<std::vector<unsigned char, std::allocator<unsigned char> > > > const&, std::vector<Matcher::result_t, std::allocator<Matcher::result_t> > const&, unsigned int*, bool)
Shadow bytes around the buggy address:
  0x0c347fffc620: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c347fffc630: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c347fffc640: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c347fffc650: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c347fffc660: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c347fffc670: 00 00 00 00 00 00 00 00 00 00 00 00[fa]fa fa fa
  0x0c347fffc680: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c347fffc690: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c347fffc6a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c347fffc6b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c347fffc6c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==71239==ABORTING

milot-mirdita commented 3 years ago

Thanks, I suspected that this might have been the problem. I'll update you once we figure out how to fix this.

milot-mirdita commented 3 years ago

Ah sorry, it makes a lot of sense that this doesn't work. Iterative-profile searches currently don't work together with the taxonomy workflow, since the alignment positions computed in the taxonomy workflow don't refer to the same things that the iterative-profile-search workflow expects. I am not sure this type of search makes sense. Could you explain your use case for combining these two?

I am not sure if it's fixable with the current protocol; we might just disallow taxonomy in combination with iterative-profile searches instead.

nick-youngblut commented 3 years ago

Thanks for looking more into the issue.

I carried over the iterative search parameters from some other mmseqs search jobs. If iterative search parameters don't make sense for mmseqs taxonomy, then it would be good to remove them from the script docs.
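
For anyone hitting the same crash from a pipeline, a guard along these lines could catch the combination before the job is launched. This is only a minimal sketch and not part of MMseqs2: the helper name `uses_iterative_search` is hypothetical, and it assumes the iterative parameter being carried over is `--num-iterations`, passed as a separate space-delimited value.

```python
from typing import Sequence


def uses_iterative_search(args: Sequence[str]) -> bool:
    """Return True if the argument list asks for more than one search iteration.

    Assumption: the iteration count is given as `--num-iterations N`
    (flag and value as two separate list items).
    """
    for i, arg in enumerate(args):
        if arg == "--num-iterations" and i + 1 < len(args):
            try:
                return int(args[i + 1]) > 1
            except ValueError:
                # Value is not an integer; leave it to mmseqs to report.
                return False
    return False


def check_taxonomy_args(args: Sequence[str]) -> None:
    """Raise ValueError if iterative search is requested for a taxonomy run."""
    if uses_iterative_search(args):
        raise ValueError(
            "iterative-profile search parameters are not supported "
            "together with the taxonomy workflow; drop --num-iterations"
        )
```

In a Snakemake rule this could be called on the extra-parameter list before the shell command is built, so the job fails early with a readable message instead of a segfault in `result2profile`.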
