soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

`mmseqs expandaln` error: Invalid database read | getData: local id >= db size #695

Open bchodkowski-vir opened 1 year ago

bchodkowski-vir commented 1 year ago

Expected Behavior

`mmseqs expandaln` should complete successfully.

Current Behavior

mmseqs expandaln throws this error:

Invalid database read for database data file=/home/user/project/target_DB/target_DB.idx, database index=/home/user/project/target_DB/target_DB.idx.index
getData: local id (4294967295) >= db size (22)
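A side note on the numbers in this message: the reported local id, 4294967295, equals 2^32 - 1, the largest value of an unsigned 32-bit integer. Reading it as a "key not found" marker rather than a real position is an assumption, not something the error text states. A minimal sketch of the check, with the two values copied from the message and variable names chosen only for illustration:

```shell
# Values copied from the error message above.
local_id=4294967295
db_size=22

# Largest value of an unsigned 32-bit integer: 2^32 - 1.
uint32_max=$(( (1 << 32) - 1 ))

# Classify the id. Treating uint32_max as a "not found" sentinel is an assumption.
if [ "$local_id" -eq "$uint32_max" ]; then
    verdict="sentinel"
elif [ "$local_id" -ge "$db_size" ]; then
    verdict="out-of-range"
else
    verdict="valid"
fi
echo "$local_id: $verdict"
```

If that reading is right, the lookup failed before any data was read, which would point at a key that is missing from the database being opened rather than at a corrupted offset.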

Steps to Reproduce (for bugs)

All of these commands are executed when I run colabfold_search, which fails at the expandaln step.

createdb result_20230419_115721/query.fas result_20230419_115721/qdb --shuffle 0

search result_20230419_115721/qdb /home/user/project/target_DB/target_DB result_20230419_115721/res result_20230419_115721/tmp --threads 96 --num-iterations 3 --db-load-mode 2 -a -s 8 -e 0.1 --max-seqs 10000

prefilter result_20230419_115721/qdb /home/user/project/target_DB/target_DB.idx result_20230419_115721/tmp/16464230693756166324/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 8 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 10000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 96 --compressed 0 -v 3

align result_20230419_115721/qdb /home/user/project/target_DB/target_DB.idx result_20230419_115721/tmp/16464230693756166324/pref_0 result_20230419_115721/tmp/16464230693756166324/aln_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 1 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.1 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 1 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 96 --compressed 0 -v 3

result2profile result_20230419_115721/qdb /home/user/project/target_DB/target_DB.idx result_20230419_115721/tmp/16464230693756166324/aln_0 result_20230419_115721/tmp/16464230693756166324/profile_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -e 0.1 --mask-profile 1 --e-profile 0.1 --comp-bias-corr 1 --comp-bias-corr-scale 1 --wg 0 --allow-deletion 0 --filter-msa 1 --filter-min-enable 0 --max-seq-id 0.9 --qid '0.0' --qsc -20 --cov 0 --diff 1000 --pseudo-cnt-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --db-load-mode 2 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --gap-pc 10 --threads 96 --compressed 0 -v 3

prefilter result_20230419_115721/tmp/16464230693756166324/profile_0 /home/user/project/target_DB/target_DB.idx result_20230419_115721/tmp/16464230693756166324/pref_tmp_1 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 8 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 10000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 96 --compressed 0 -v 3

subtractdbs result_20230419_115721/tmp/16464230693756166324/pref_tmp_1 result_20230419_115721/tmp/16464230693756166324/aln_0 result_20230419_115721/tmp/16464230693756166324/pref_1 --threads 96 --e-profile 0.1 -e 0.1 --compressed 0 -v 3

rmdb result_20230419_115721/tmp/16464230693756166324/pref_tmp_1

align result_20230419_115721/tmp/16464230693756166324/profile_0 /home/user/project/target_DB/target_DB.idx result_20230419_115721/tmp/16464230693756166324/pref_1 result_20230419_115721/tmp/16464230693756166324/aln_tmp_1 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 1 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.1 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 96 --compressed 0 -v 3

mergedbs result_20230419_115721/tmp/16464230693756166324/profile_0 result_20230419_115721/tmp/16464230693756166324/aln_1 result_20230419_115721/tmp/16464230693756166324/aln_0 result_20230419_115721/tmp/16464230693756166324/aln_tmp_1

rmdb result_20230419_115721/tmp/16464230693756166324/aln_0
rmdb result_20230419_115721/tmp/16464230693756166324/aln_tmp_1

result2profile result_20230419_115721/tmp/16464230693756166324/profile_0 /home/user/project/target_DB/target_DB.idx result_20230419_115721/tmp/16464230693756166324/aln_1 result_20230419_115721/tmp/16464230693756166324/profile_1 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -e 0.1 --mask-profile 1 --e-profile 0.1 --comp-bias-corr 1 --comp-bias-corr-scale 1 --wg 0 --allow-deletion 0 --filter-msa 1 --filter-min-enable 0 --max-seq-id 0.9 --qid '0.0' --qsc -20 --cov 0 --diff 1000 --pseudo-cnt-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --db-load-mode 2 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --gap-pc 10 --threads 96 --compressed 0 -v 3

prefilter result_20230419_115721/tmp/16464230693756166324/profile_1 /home/user/project/target_DB/target_DB.idx result_20230419_115721/tmp/16464230693756166324/pref_tmp_2 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 8 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 10000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 96 --compressed 0 -v 3

subtractdbs result_20230419_115721/tmp/16464230693756166324/pref_tmp_2 result_20230419_115721/tmp/16464230693756166324/aln_1 result_20230419_115721/tmp/16464230693756166324/pref_2 --threads 96 --e-profile 0.1 -e 0.1 --compressed 0 -v 3

rmdb result_20230419_115721/tmp/16464230693756166324/pref_tmp_2

align result_20230419_115721/tmp/16464230693756166324/profile_1 /home/user/project/target_DB/target_DB.idx result_20230419_115721/tmp/16464230693756166324/pref_2 result_20230419_115721/tmp/16464230693756166324/aln_tmp_2 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 1 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.1 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 96 --compressed 0 -v 3

mergedbs result_20230419_115721/tmp/16464230693756166324/profile_1 result_20230419_115721/res result_20230419_115721/tmp/16464230693756166324/aln_1 result_20230419_115721/tmp/16464230693756166324/aln_tmp_2

rmdb result_20230419_115721/tmp/16464230693756166324/aln_tmp_2

expandaln result_20230419_115721/qdb /home/user/project/target_DB/target_DB.idx result_20230419_115721/res /home/user/project/target_DB/target_DB.idx result_20230419_115721/res_exp --db-load-mode 2 --threads 96 --expansion-mode 0 -e 1.7976931348623157e+308 --expand-filter-clusters 1 --max-seq-id 0.95

Invalid database read for database data file=/home/user/project/target_DB/target_DB.idx, database index=/home/user/project/target_DB/target_DB.idx.index
getData: local id (4294967295) >= db size (22)

MMseqs Output (for bugs)

MMseqs output

Context

I wish to run colabfold_search on my own database via --db1 'target_DB'. colabfold_search works fine with --db1 'uniref30_2103_db'.

Number of sequences in query.fasta: 1

egrep -c '^>' query.fasta
1

wc -l result_20230419_115721/qdb
1 result_20230419_115721/qdb

Number of sequences in target_DB.fasta: 104664

egrep -c '^>' target_DB.fasta
104664

wc -l  target_DB
104664 target_DB

Number of sequences in resulting database res: 1011

wc -l result_20230419_115721/res
1011  result_20230419_115721/res

Number of sequences in intermediate databases:

wc -l result_20230419_115721/tmp/latest/pref_0
2455  result_20230419_115721/tmp/latest/pref_0

wc -l result_20230419_115721/tmp/latest/profile_0
28    result_20230419_115721/tmp/latest/profile_0

wc -l result_20230419_115721/tmp/latest/profile_1
34    result_20230419_115721/tmp/latest/profile_1
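A caveat on the counts above: `wc -l` on a database data file counts newline characters, not entries. In a result database such as res, a single query entry can span many lines (one per hit), so the number of entries is better read from the accompanying .index file. The sketch below assumes the .index file holds one line per entry; `db_entries` is a hypothetical helper name, not an mmseqs command.

```shell
# db_entries DB: print the number of entries in a database,
# taken from DB.index (assumed: one line per entry).
db_entries() {
    if [ -f "$1.index" ]; then
        # Read via stdin so wc prints only the count; strip padding some wc builds add.
        wc -l < "$1.index" | tr -d ' '
    else
        echo "missing index: $1.index" >&2
        return 1
    fi
}
```

For example, `db_entries result_20230419_115721/res` can be compared against `db_entries result_20230419_115721/qdb`; with a single query sequence both would be expected to report one entry.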

In another issue, I saw a request to report what these awk commands return for the database index files:

awk 'BEGIN { min = 2^32; } $3 < min { min = $3 }; $3 > max { max = $3 } { sum = sum + $3; n = n + 1; } END { print sum/n,min,max;  }' $out_DB
awk 'BEGIN { min = 2^32; } $2 < min { min = $2 }; $2 > max { max = $2 } { sum = sum + $2; n = n + 1; } END { print sum/n,min,max;  }' $out_DB

out_DB                             | col $3 (sum/n min max)   | col $2 (sum/n min max)
-----------------------------------+--------------------------+-----------------------
target_DB/target_DB.index          | 412.665 2 8110           | 2.15005e+07 0 43190597
target_DB/target_DB.idx.index      | 6.04213e+07 1 512000009  | 5.54188e+08 0 1261572096
result_20230419_115721/qdb.index   | 114 114 114              | 0 0
result_20230419_115721/qdb_h.index | 190 190 190              | 0 0
result_20230419_115721/res.index   | 58682 58682 58682        | 0 0
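The two awk one-liners above differ only in the column they read, so they can be folded into one small function. This is a sketch, with `col_stats` as a hypothetical name. It seeds min and max from the first row instead of from 2^32 and an unset variable. In the original commands max stays unset when every value is 0, which is presumably why the single-entry rows in the table show only two numbers for col $2.

```shell
# col_stats COLUMN FILE: print "mean min max" of a numeric column.
col_stats() {
    awk -v col="$1" '
        NR == 1 { min = $col + 0; max = $col + 0 }  # seed from the first row
        {
            v = $col + 0                            # force numeric comparison
            if (v < min) min = v
            if (v > max) max = v
            sum += v
            n++
        }
        END { if (n > 0) print sum / n, min, max }  # no output for an empty file
    ' "$2"
}
```

Usage would look like `col_stats 3 target_DB/target_DB.index` and `col_stats 2 target_DB/target_DB.index`.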

I can run these sequences via mmseqs easy-search (which does not call expandaln):

easy-search query.fasta /home/user/project/target_DB/target_DB result_DB tmp_easy_search --db-output 1 --max-seqs 10000

wc -l result_DB
606   result_DB

# awk sum/n,min,max
out_DB                             | col $3                   | col $2
-----------------------------------+--------------------------+-----------------------
result_DB.index                    | 104112 104112 104112     | 0 0

Your Environment

Please let me know if there is any other information I can share to help debug this.

Kind regards.

semal commented 1 year ago

Same problem here.