soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

'Invalid database read for database data' on running expandaln #616

Open samirelanduk opened 1 year ago

samirelanduk commented 1 year ago

The expandaln command fails to read the index properly and produces an 'Invalid database read for database data' error.

Expected Behavior

The command runs without error messages.

Current Behavior

The command fails instantly with the following error message:

Invalid database read for database data file=db/human.idx, database index=db/human.idx.index
getData: local id (4294967295) >= db size (22)
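
The local id in this message is worth decoding. 4294967295 equals 2^32 - 1, the largest unsigned 32-bit value, which is also what -1 becomes when it is stored in an unsigned 32-bit integer. One plausible reading, which is an assumption and not confirmed anywhere in this thread, is that a lookup inside the index returned a "not found" sentinel instead of a real entry number below the db size of 22. The sketch below only checks that arithmetic; `looks_like_not_found` is a hypothetical helper and not part of MMseqs2.

```python
import ctypes

# Largest value an unsigned 32-bit integer can hold.
UINT32_MAX = 2**32 - 1


def looks_like_not_found(local_id: int) -> bool:
    """Return True if the id equals the unsigned 32-bit maximum.

    Assumption: such a value is being used as a 'not found' sentinel.
    The issue text does not confirm this for MMseqs2.
    """
    return local_id == UINT32_MAX


reported_id = 4294967295  # copied from the error message
db_size = 22              # copied from the error message

# The reported id is exactly the unsigned 32-bit maximum ...
print(reported_id == UINT32_MAX)
# ... which is also what -1 turns into as an unsigned 32-bit integer.
print(ctypes.c_uint32(-1).value == reported_id)
# It can therefore never be a valid position in a database of 22 entries.
print(reported_id >= db_size, looks_like_not_found(reported_id))
```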

Steps to Reproduce (for bugs)

mkdir db
mkdir job
mmseqs createdb uniprotkb_human.fasta db/human
mmseqs createindex db/human db/tmp --remove-tmp-files 1 --check-compatible 1
mmseqs createdb query.fasta job/qdb

mmseqs search job/qdb db/human job/res job/tmp1 --num-iterations 3 --db-load-mode 2 -a --k-score 'seq:96,prof:80' -e 0.1 --max-seqs 10000

mmseqs mvdb job/tmp1/latest/profile_1 job/prof_res
mmseqs lndb job/qdb_h job/prof_res_h

# Command which fails:
mmseqs expandaln job/qdb db/human.idx job/res db/human.idx job/res_exp --db-load-mode 1 --expansion-mode 0 -e inf --expand-filter-clusters 1 --max-seq-id 0.95

MMseqs Output (for bugs)

createdb:

MMseqs Version:         8799829d213f31b647fc69e0572a0c828c5aaf63
Database type           0
Shuffle input database  true
Createdb mode           0
Write lookup file       1
Offset of numeric ids   0
Compressed              0
Verbosity               3

Converting sequences
[79690] 0s 233ms
Time for merging to human_h: 0h 0m 0s 24ms
Time for merging to human: 0h 0m 0s 53ms
Database type: Aminoacid
Time for processing: 0h 0m 0s 472ms

createindex:

MMseqs Version:                 8799829d213f31b647fc69e0572a0c828c5aaf63
Seed substitution matrix        aa:VTML80.out,nucl:nucleotide.out
k-mer length                    0
Alphabet size                   aa:21,nucl:5
Compositional bias              1
Compositional bias              1
Max sequence length             65535
Max results per query           300
Mask residues                   1
Mask residues probability       0.9
Mask lower case residues        0
Spaced k-mers                   1
Spaced k-mer pattern     
Sensitivity                     7.5
k-score                         seq:0,prof:0
Check compatible                1
Search type                     0
Split database                  0
Split memory limit              0
Verbosity                       3
Threads                         4
Min codons in orf               30
Max codons in length            32734
Max orf gaps                    2147483647
Contig start mode               2
Contig end mode                 2
Orf start mode                  1
Forward frames                  1,2,3
Reverse frames                  1,2,3
Translation table               1
Translate orf                   0
Use all table starts            false
Offset of numeric ids           0
Create lookup                   0
Compressed                      0
Add orf stop                    false
Overlap between sequences       0
Sequence split mode             1
Header split mode               0
Strand selection                1
Remove temporary files          true

createindex db/human db/tmp --remove-tmp-files 1 --check-compatible 1 

MMseqs Version:                 8799829d213f31b647fc69e0572a0c828c5aaf63
Seed substitution matrix        aa:VTML80.out,nucl:nucleotide.out
k-mer length                    0
Alphabet size                   aa:21,nucl:5
Compositional bias              1
Compositional bias              1
Max sequence length             65535
Max results per query           300
Mask residues                   1
Mask residues probability       0.9
Mask lower case residues        0
Spaced k-mers                   1
Spaced k-mer pattern     
Sensitivity                     7.5
k-score                         seq:0,prof:0
Check compatible                1
Search type                     0
Split database                  0
Split memory limit              0
Verbosity                       3
Threads                         4
Min codons in orf               30
Max codons in length            32734
Max orf gaps                    2147483647
Contig start mode               2
Contig end mode                 2
Orf start mode                  1
Forward frames                  1,2,3
Reverse frames                  1,2,3
Translation table               1
Translate orf                   0
Use all table starts            false
Offset of numeric ids           0
Create lookup                   0
Compressed                      0
Add orf stop                    false
Overlap between sequences       0
Sequence split mode             1
Header split mode               0
Strand selection                1
Remove temporary files          true

indexdb db/human db/human --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -k 0 --alph-size aa:21,nucl:5 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-seq-len 65535 --max-seqs 300 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --spaced-kmer-mode 1 -s 7.5 --k-score seq:0,prof:0 --check-compatible 1 --search-type 0 --split 0 --split-memory-limit 0 -v 3 --threads 4 

Estimated memory consumption: 1G
Write VERSION (0)
Write META (1)
Write SCOREMATRIX3MER (4)
Write SCOREMATRIX2MER (3)
Write SCOREMATRIXNAME (2)
Write SPACEDPATTERN (23)
Write GENERATOR (22)
Write DBR1INDEX (5)
Write DBR1DATA (6)
Write HDR1INDEX (18)
Write HDR1DATA (19)
Index table: counting k-mers
[=================================================================] 100.00% 79.74K 2s 947ms    
Index table: Masked residues: 1262029
Index table: fill
[=================================================================] 100.00% 79.74K 4s 125ms    
Index statistics
Entries:          25991856
DB size:          637 MB
Avg k-mer size:   0.406123
Top 10 k-mers
    VMEYLV      439
    QRLRML      421
    LYDMNY      403
    TFDAFS      367
    YRVLYR      257
    VAESEW      236
    TGYKLS      202
    GEVLSS      200
    VTSSSS      199
    TFDAFT      194
Write ENTRIES (9)
Write ENTRIESOFFSETS (10)
Write SEQINDEXDATASIZE (15)
Write SEQINDEXSEQOFFSET (16)
Write SEQINDEXDATA (14)
Write ENTRIESNUM (12)
Write SEQCOUNT (13)
Time for merging to human.idx: 0h 0m 0s 0ms
Time for processing: 0h 0m 11s 156ms

expandaln:

expandaln job/qdb db/human.idx job/res db/human.idx job/res_exp --db-load-mode 1 --expansion-mode 0 -e inf --expand-filter-clusters 1 --max-seq-id 0.95 

MMseqs Version:                 8799829d213f31b647fc69e0572a0c828c5aaf63
Expansion mode                  0
Substitution matrix             aa:blosum62.out,nucl:nucleotide.out
Gap open cost                   aa:11,nucl:5
Gap extension cost              aa:1,nucl:2
Max sequence length             65535
Score bias                      0
Compositional bias              1
Compositional bias              1
E-value threshold               inf
Seq. id. threshold              0
Coverage threshold              0
Coverage mode                   0
Pseudo count mode               0
Pseudo count a                  substitution:1.100,context:1.400
Pseudo count b                  substitution:4.100,context:5.800
Expand filter clusters          1
Use filter only at N seqs       0
Maximum seq. id. threshold      0.95
Minimum seq. id.                0.0
Minimum score per column        -20
Minimum coverage                0
Select N most diverse seqs      1000
Preload mode                    1
Compressed                      0
Threads                         4
Verbosity                       3

Index version: 16
Generated by:  8799829d213f31b647fc69e0572a0c828c5aaf63
ScoreMatrix:  VTML80.out
Index version: 16
Generated by:  8799829d213f31b647fc69e0572a0c828c5aaf63
ScoreMatrix:  VTML80.out
Invalid database read for database data file=db/human.idx, database index=db/human.idx.index
getData: local id (4294967295) >= db size (22)
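
To see which entries the index actually contains, one option is to read `db/human.idx.index` directly. The sketch below assumes the usual MMseqs2 index layout of one entry per line with three whitespace-separated columns (key, offset, length). `LOGGED_KEYS` is copied from the numbers in parentheses on the "Write ..." lines of the createindex log above; `read_index` and `missing_keys` are hypothetical helper names, and the two-line sample file is made up for the demo.

```python
import tempfile
from pathlib import Path

# Keys printed in parentheses by the "Write ..." lines of the createindex log.
LOGGED_KEYS = [0, 1, 4, 3, 2, 23, 22, 5, 6, 18, 19, 9, 10, 15, 16, 14, 12, 13]


def read_index(index_path):
    """Parse an index file into a list of (key, offset, length) tuples.

    Assumption: each line holds three whitespace-separated integers.
    Lines with fewer than three fields are skipped.
    """
    entries = []
    for line in Path(index_path).read_text().splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        entries.append((int(fields[0]), int(fields[1]), int(fields[2])))
    return entries


def missing_keys(entries, expected):
    """Return the expected keys that do not appear in the parsed index."""
    present = {key for key, _, _ in entries}
    return sorted(set(expected) - present)


# Demo on a made-up two-entry index file (not real MMseqs2 output).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.idx.index"
    sample.write_text("0\t0\t3\n1\t3\t41\n")
    entries = read_index(sample)
    print(len(entries), "entries; logged keys not present:",
          missing_keys(entries, LOGGED_KEYS))
```

Calling `read_index("db/human.idx.index")` on the real file should report the number of entries, which can be compared with the db size of 22 in the error message.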

Context

I am attempting to recreate the functionality in https://github.com/soedinglab/MMseqs2-App/blob/master/backend/worker.go

Your Environment

Include as many relevant details about the environment you experienced the bug in.

bchodkowski-vir commented 1 year ago

I also get the "Invalid database read for database data file" error from expandaln when called by colabfold_search.

(I originally posted this on Issue 64 before I realized that that Issue was closed.)

Invalid database read for database data file=/home/username/project/my_local_DB/target_DB.idx, database index=/home/username/project/my_local_DB/target_DB.idx.index
getData: local id (4294967295) >= db size (22)

I created target_DB from target.fasta, which contains 142 records:

pwd
  # /home/username/project/my_local_DB

mmseqs createdb target.fasta target_DB
mmseqs createindex target_DB tmp_createindex --threads 96

indexdb target_DB target_DB --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -k 0 --alph-size aa:21,nucl:5 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-seq-len 65535 --max-seqs 300 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --spaced-kmer-mode 1 -s 7.5 --k-score seq:0,prof:0 --check-compatible 0 --search-type 0 --split 0 --split-memory-limit 0 -v 3 --threads 96

Then I ran colabfold_search. Output is below.

CUDA_VISIBLE_DEVICES='0' colabfold_search
    -s '8'
    --db1 'target_DB'
    --use-templates '0'
    --db2 ''
    --use-env '0'
    --db3 ''
    --filter '1'
    --mmseqs 'mmseqs'
    --expand-eval '1.7e+308'
    --align-eval '10'
    --diff '3000'
    --qsc '-20.0'
    --max-accept '1000000'
    --db-load-mode '2'
    --threads '96'
        query.fasta
        /home/username/project/my_local_DB
        result_query_20230412_142303

createdb result_query_20230412_142303/query.fas result_query_20230412_142303/qdb --shuffle 0

search result_query_20230412_142303/qdb /home/username/project/my_local_DB/target_DB result_query_20230412_142303/res result_query_20230412_142303/tmp --threads 96 --num-iterations 3 --db-load-mode 2 -a -s 8 -e 0.1 --max-seqs 10000

prefilter result_query_20230412_142303/qdb /home/username/project/my_local_DB/target_DB.idx result_query_20230412_142303/tmp/18292001434761310910/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 8 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 10000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 96 --compressed 0 -v 3

align result_query_20230412_142303/qdb /home/username/project/my_local_DB/target_DB.idx result_query_20230412_142303/tmp/18292001434761310910/pref_0 result_query_20230412_142303/tmp/18292001434761310910/aln_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 1 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.1 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 1 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 96 --compressed 0 -v 3

result2profile result_query_20230412_142303/qdb /home/username/project/my_local_DB/target_DB.idx result_query_20230412_142303/tmp/18292001434761310910/aln_0 result_query_20230412_142303/tmp/18292001434761310910/profile_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -e 0.1 --mask-profile 1 --e-profile 0.1 --comp-bias-corr 1 --comp-bias-corr-scale 1 --wg 0 --allow-deletion 0 --filter-msa 1 --filter-min-enable 0 --max-seq-id 0.9 --qid '0.0' --qsc -20 --cov 0 --diff 1000 --pseudo-cnt-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --db-load-mode 2 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --gap-pc 10 --threads 96 --compressed 0 -v 3

subtractdbs result_query_20230412_142303/tmp/18292001434761310910/pref_tmp_1 result_query_20230412_142303/tmp/18292001434761310910/aln_0 result_query_20230412_142303/tmp/18292001434761310910/pref_1 --threads 96 --e-profile 0.1 -e 0.1 --compressed 0 -v 3

subtractdbs result_query_20230412_142303/tmp/18292001434761310910/pref_tmp_1 result_query_20230412_142303/tmp/18292001434761310910/aln_0 result_query_20230412_142303/tmp/18292001434761310910/pref_1 --threads 96 --e-profile 0.1 -e 0.1 --compressed 0 -v 3

align result_query_20230412_142303/tmp/18292001434761310910/profile_0 /home/username/project/my_local_DB/target_DB.idx result_query_20230412_142303/tmp/18292001434761310910/pref_1 result_query_20230412_142303/tmp/18292001434761310910/aln_tmp_1 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 1 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.1 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 96 --compressed 0 -v 3

mergedbs result_query_20230412_142303/tmp/18292001434761310910/profile_0 result_query_20230412_142303/tmp/18292001434761310910/aln_1 result_query_20230412_142303/tmp/18292001434761310910/aln_0 result_query_20230412_142303/tmp/18292001434761310910/aln_tmp_1

rmdb result_query_20230412_142303/tmp/18292001434761310910/aln_0

rmdb result_query_20230412_142303/tmp/18292001434761310910/aln_tmp_1

result2profile result_query_20230412_142303/tmp/18292001434761310910/profile_0 /home/username/project/my_local_DB/target_DB.idx result_query_20230412_142303/tmp/18292001434761310910/aln_1 result_query_20230412_142303/tmp/18292001434761310910/profile_1 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -e 0.1 --mask-profile 1 --e-profile 0.1 --comp-bias-corr 1 --comp-bias-corr-scale 1 --wg 0 --allow-deletion 0 --filter-msa 1 --filter-min-enable 0 --max-seq-id 0.9 --qid '0.0' --qsc -20 --cov 0 --diff 1000 --pseudo-cnt-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --db-load-mode 2 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --gap-pc 10 --threads 96 --compressed 0 -v 3

prefilter result_query_20230412_142303/tmp/18292001434761310910/profile_1 /home/username/project/my_local_DB/target_DB.idx result_query_20230412_142303/tmp/18292001434761310910/pref_tmp_2 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 8 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 10000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 96 --compressed 0 -v 3

subtractdbs result_query_20230412_142303/tmp/18292001434761310910/pref_tmp_2 result_query_20230412_142303/tmp/18292001434761310910/aln_1 result_query_20230412_142303/tmp/18292001434761310910/pref_2 --threads 96 --e-profile 0.1 -e 0.1 --compressed 0 -v 3

subtractdbs result_query_20230412_142303/tmp/18292001434761310910/pref_tmp_2 result_query_20230412_142303/tmp/18292001434761310910/aln_1 result_query_20230412_142303/tmp/18292001434761310910/pref_2 --threads 96 --e-profile 0.1 -e 0.1 --compressed 0 -v 3

align result_query_20230412_142303/tmp/18292001434761310910/profile_1 /home/username/project/my_local_DB/target_DB.idx result_query_20230412_142303/tmp/18292001434761310910/pref_2 result_query_20230412_142303/tmp/18292001434761310910/aln_tmp_2 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 1 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.1 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 96 --compressed 0 -v 3

mergedbs result_query_20230412_142303/tmp/18292001434761310910/profile_1 result_query_20230412_142303/res result_query_20230412_142303/tmp/18292001434761310910/aln_1 result_query_20230412_142303/tmp/18292001434761310910/aln_tmp_2

rmdb result_query_20230412_142303/tmp/18292001434761310910/aln_1

rmdb result_query_20230412_142303/tmp/18292001434761310910/aln_tmp_2

expandaln result_query_20230412_142303/qdb /home/username/project/my_local_DB/target_DB.idx result_query_20230412_142303/res /home/username/project/my_local_DB/target_DB.idx result_query_20230412_142303/res_exp --db-load-mode 2 --threads 96 --expansion-mode 0 -e 1.7976931348623157e+308 --expand-filter-clusters 1 --max-seq-id 0.95

  MMseqs Version:             67949d702dbfc6e5d54fdd0f14a9ab6740f11c32
  Expansion mode              0
  Substitution matrix         aa:blosum62.out,nucl:nucleotide.out
  Gap open cost               aa:11,nucl:5
  Gap extension cost          aa:1,nucl:2
  Max sequence length         65535
  Score bias                  0
  Compositional bias          1
  Compositional bias          1
  E-value threshold           1.79769e+308
  Seq. id. threshold          0
  Coverage threshold          0
  Coverage mode               0
  Pseudo count mode           0
  Pseudo count a              substitution:1.100,context:1.400
  Pseudo count b              substitution:4.100,context:5.800
  Expand filter clusters      1
  Use filter only at N seqs   0
  Maximum seq. id. threshold  0.95
  Minimum seq. id.            0.0
  Minimum score per column    -20
  Minimum coverage            0
  Select N most diverse seqs  1000
  Preload mode                2
  Compressed                  0
  Threads                     96
  Verbosity                   3

Index version: 16
Generated by:  67949d702dbfc6e5d54fdd0f14a9ab6740f11c32
ScoreMatrix:  VTML80.out
Index version: 16
Generated by:  67949d702dbfc6e5d54fdd0f14a9ab6740f11c32
ScoreMatrix:  VTML80.out
Invalid database read for database data file=/home/username/project/my_local_DB/target_DB.idx, database index=/home/username/project/my_local_DB/target_DB.idx.index
getData: local id (4294967295) >= db size (22)
Traceback (most recent call last):
  File "/home/username/project/colabfold_batch/colabfold-conda/bin/colabfold_search", line 8, in <module>
    sys.exit(main())
  File "/home/username/project/colabfold_batch/colabfold-conda/lib/python3.7/site-packages/colabfold/mmseqs/search.py", line 444, in main
    threads=args.threads,
  File "/home/username/project/colabfold_batch/colabfold-conda/lib/python3.7/site-packages/colabfold/mmseqs/search.py", line 86, in mmseqs_search_monomer
    run_mmseqs(mmseqs, ["expandaln", base.joinpath("qdb"), dbbase.joinpath(f"{uniref_db}{dbSuffix1}"), base.joinpath("res"), dbbase.joinpath(f"{uniref_db}{dbSuffix2}"), base.joinpath("res_exp"), "--db-load-mode", str(db_load_mode), "--threads", str(threads)] + expand_param)

  File "/home/username/project/colabfold_batch/colabfold-conda/lib/python3.7/site-packages/colabfold/mmseqs/search.py", line 23, in run_mmseqs
    subprocess.check_call([mmseqs] + params)
  File "/home/username/project/colabfold_batch/colabfold-conda/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)

subprocess.CalledProcessError: Command '[PosixPath('mmseqs'), 'expandaln', PosixPath('result_query_20230412_142303/qdb'), PosixPath('/home/username/project/my_local_DB/target_DB.idx'), PosixPath('result_query_20230412_142303/res'), PosixPath('/home/username/project/my_local_DB/target_DB.idx'), PosixPath('result_query_20230412_142303/res_exp'), '--db-load-mode', '2', '--threads', '96', '--expansion-mode', '0', '-e', '1.7976931348623157e+308', '--expand-filter-clusters', '1', '--max-seq-id', '0.95']' returned non-zero exit status 1.
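
The traceback shows that `subprocess.check_call` reports only the exit status, so the actual MMseqs2 message has to be found further up in the output. A minimal sketch of a wrapper that captures the output and puts its last lines into the exception is given below. `run_and_report` is a hypothetical helper and not ColabFold's `run_mmseqs`; the demo uses the Python interpreter as a stand-in for the failing command so that it runs without MMseqs2 installed.

```python
import subprocess
import sys


def run_and_report(cmd, tail_lines=5):
    """Run a command and raise an error that includes its last output lines.

    Hypothetical helper for illustration; stdout and stderr are merged so
    that messages written to either stream end up in the error text.
    """
    argv = [str(part) for part in cmd]  # accept Path objects as well as strings
    result = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    if result.returncode != 0:
        tail = "\n".join(result.stdout.splitlines()[-tail_lines:])
        raise RuntimeError(
            f"{' '.join(argv[:2])} failed with exit status "
            f"{result.returncode}:\n{tail}"
        )
    return result.stdout


# Stand-in for a failing tool: prints a message and exits with status 1.
fake_failure = [
    sys.executable,
    "-c",
    "import sys; print('getData: local id (4294967295) >= db size (22)'); sys.exit(1)",
]

try:
    run_and_report(fake_failure)
except RuntimeError as err:
    print(err)
```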

target_DB is a brand-new database; I have neither added nor deleted records since its creation.

I am working on a Lambda server running Ubuntu:

Linux xyz-lambda02 5.4.0-144-generic #161-Ubuntu SMP Fri Feb 3 14:49:04 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Please let me know if I can help with debugging.

Thank you. And thanks for mmseqs.

bzhousd commented 1 year ago

I got the same error, but in a different place: I was running a local ColabFold API server. The error message is:

Invalid database read for database data file=/data/colabFold/MsaServer/databases/uniref30_2202_db.idx, database index=/data/colabFold/MsaServer/databases/uniref30_2202_db.idx.index
getData: local id (4294967295) >= db size (22)

Thanks