steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

--cluster-search 1 throws error #216

Closed: DS-ribo closed this issue 9 months ago

DS-ribo commented 9 months ago

Hello, I have a .pdb file, A.pdb, and I am trying to use easy-search against the PDB database to find structural homologs. I downloaded the database with:

foldseek databases PDB PDB tmp

The files look like this:

-rw-rw-r--+ 1 aaaa b  68499224 Nov 20 12:39 PDB
-rw-rw-r--+ 1 aaaa b 430218116 Nov 20 12:39 PDB_ca
-rw-rw-r--+ 1 aaaa b         4 Nov 20 12:39 PDB_ca.dbtype
-rw-rw-r--+ 1 aaaa b   7221686 Nov 20 12:39 PDB_ca.index
-rw-r--r--+ 1 aaaa b   7613603 Nov 20 12:39 PDB_clu
-rw-r--r--+ 1 aaaa b         4 Nov 20 12:39 PDB_clu.dbtype
-rw-r--r--+ 1 aaaa b   5888160 Nov 20 12:39 PDB_clu.index
-rw-rw-r--+ 1 aaaa b         4 Nov 20 12:39 PDB.dbtype
-rw-rw-r--+ 1 aaaa b  32302155 Nov 20 12:39 PDB_h
-rw-rw-r--+ 1 aaaa b        4 Nov 20 12:39 PDB_h.dbtype
-rw-rw-r--+ 1 aaaa b   6521240 Nov 20 12:39 PDB_h.index
-rw-rw-r--+ 1 aaaa b   6646100 Nov 20 12:39 PDB.index
-rw-r--r--+ 1 aaaa b  24153280 Nov 20 12:39 PDB.lookup
-rw-r--r--+ 1 aaaa b  13049570 Nov 20 12:39 PDB_mapping
lrwxrwxrwx  1 aaaa b         3 Nov 20 12:39 PDB_seq.0 -> pdb
-rw-rw-r--+ 1 aaaa b 152709324 Nov 20 12:39 PDB_seq.1
lrwxrwxrwx  1 aaaa b        6 Nov 20 12:39 PDB_seq_ca.0 -> pdb_ca
-rw-rw-r--+ 1 aaaa b 944432719 Nov 20 12:39 PDB_seq_ca.1
-rw-rw-r--+ 1 aaaa b        4 Nov 20 12:39 PDB_seq_ca.dbtype
-rw-rw-r--+ 1 aaaa b  22572398 Nov 20 12:39 PDB_seq_ca.index
-rw-rw-r--+ 1 aaaa b         4 Nov 20 12:39 PDB_seq.dbtype
lrwxrwxrwx  1 aaaa b         5 Nov 20 12:39 PDB_seq_h.0 -> pdb_h
-rw-rw-r--+ 1 aaaa b 68518179 Nov 20 12:39 PDB_seq_h.1
-rw-rw-r--+ 1 aaaa b         4 Nov 20 12:39 PDB_seq_h.dbtype
-rw-rw-r--+ 1 aaaa b  20156655 Nov 20 12:39 PDB_seq_h.index
-rw-rw-r--+ 1 aaaa b 21081446 Nov 20 12:39 PDB_seq.index
lrwxrwxrwx  1 aaaa b        10 Nov 20 12:39 PDB_seq.lookup -> pdb.lookup
lrwxrwxrwx  1 aaaa b        11 Nov 20 12:39 PDB_seq_mapping -> pdb_mapping
lrwxrwxrwx  1 aaaa b        10 Nov 20 12:39 PDB_seq.source -> pdb.source
lrwxrwxrwx  1 aaaa b         6 Nov 20 12:39 PDB_seq_ss.0 -> pdb_ss
-rw-rw-r--+ 1 aaaa b 152709324 Nov 20 12:39 PDB_seq_ss.1
-rw-rw-r--+ 1 aaaa b         4 Nov 20 12:39 PDB_seq_ss.dbtype
-rw-rw-r--+ 1 aaaa b  21073299 Nov 20 12:39 PDB_seq_ss.index
lrwxrwxrwx  1 aaaa b        12 Nov 20 12:39 PDB_seq_taxonomy -> pdb_taxonomy
-rw-r--r--+ 1 aaaa b  8542120 Nov 20 12:39 PDB.source
-rw-rw-r--+ 1 aaaa b  68499224 Nov 20 12:39 PDB_ss
-rw-rw-r--+ 1 aaaa b        4 Nov 20 12:39 PDB_ss.dbtype
-rw-rw-r--+ 1 aaaa b   6644637 Nov 20 12:39 PDB_ss.index
-rw-r--r--+ 1 aaaa b 701534552 Nov 20 12:39 PDB_taxonomy
-rw-rw-r--+ 1 aaaa b       125 Dec  8 21:41 PDB.version

Current Behavior

When I run easy-search with default parameters, it works: foldseek easy-search A.pdb PDB_db_folder/PDB output.txt tmp

MMseqs Version:                 8.ef4e960
Seq. id. threshold              0
Coverage threshold              0
Coverage mode                   0
Max reject                      2147483647
Max accept                      2147483647
Add backtrace                   false
TMscore threshold               0
TMalign hit order               0
TMalign fast                    1
Preload mode                    0
Threads                         128
Verbosity                       3
LDDT threshold                  0
Sort by structure bit score     1
Alignment type                  2
Substitution matrix             aa:3di.out,nucl:3di.out
Alignment mode                  3
Alignment mode                  0
E-value threshold               10
Min alignment length            0
Seq. id. mode                   0
Alternative alignments          0
Max sequence length             65535
Compositional bias              1
Compositional bias              1
Gap open cost                   aa:10,nucl:10
Gap extension cost              aa:1,nucl:1
Compressed                      0
Seed substitution matrix        aa:3di.out,nucl:3di.out
Sensitivity                     9.5
k-mer length                    6
Target search mode              0
k-score                         seq:2147483647,prof:2147483647
Max results per query           1000
Split database                  0
Split mode                      2
Split memory limit              0
Diagonal scoring                true
Exact k-mer matching            0
Mask residues                   0
Mask residues probability       0.99995
Mask lower case residues        1
Minimum diagonal score          30
Selected taxa                   
Spaced k-mers                   1
Spaced k-mer pattern            
Local temporary path            
Exhaustive search mode          false
Prefilter mode                  0
Search iterations               1
Remove temporary files          true
MPI runner                      
Force restart with latest tmp   false
Cluster search                  0
Chain name mode                 0
Write mapping file              0
Mask b-factor threshold         0
Coord store mode                2
Write lookup file               1
Tar Inclusion Regex             .*
Tar Exclusion Regex             ^$
File Inclusion Regex            .*
File Exclusion Regex            ^$
Alignment format                0
Format alignment output         query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits
Database output                 false
Greedy best hits                false

createdb A.pdb tmp/5124259549608294352/query --chain-name-mode 0 --write-mapping 0 --mask-bfactor-threshold 0 --coord-store-mode 2 --write-lookup 1 --tar-include '.*' --tar-exclude '^$' --file-include '.*' --file-exclude '^$' --threads 128 -v 3 

Output file: tmp/5124259549608294352/query
[=================================================================] 100.00% 1 eta -
Time for merging to query_ss: 0h 0m 0s 4ms
Time for merging to query_h: 0h 0m 0s 2ms
Time for merging to query_ca: 0h 0m 0s 2ms
Time for merging to query: 0h 0m 0s 2ms
Ignore 0 out of 1.
Too short: 0, incorrect: 0, not proteins: 0.
Time for processing: 0h 0m 0s 50ms
Create directory tmp/5124259549608294352/search_tmp
search tmp/5124259549608294352/query PDB_db_folder/PDB tmp/5124259549608294352/result tmp/5124259549608294352/search_tmp --alignment-mode 3 --comp-bias-corr 1 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 -s 9.5 -k 6 --mask 0 --mask-prob 0.99995 --remove-tmp-files 1 

prefilter tmp/5124259549608294352/query_ss PDB_db_folder/PDB_ss tmp/5124259549608294352/search_tmp/3320063337752281300/pref --sub-mat 'aa:3di.out,nucl:3di.out' --seed-sub-mat 'aa:3di.out,nucl:3di.out' -s 9.5 -k 6 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 1000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 0.15 --diag-score 1 --exact-kmer-matching 0 --mask 0 --mask-prob 0.99995 --mask-lower-case 1 --min-ungapped-score 30 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 128 --compressed 0 -v 3 

Query database size: 1 type: Aminoacid
Estimated memory consumption: 3G
Target database size: 343785 type: Aminoacid
Index table k-mer threshold: 78 at k-mer size 6 
Index table: counting k-mers
[=================================================================] 100.00% 343.79K 0s 196ms    
Index table: Masked residues: 4286
Index table: fill
[=================================================================] 100.00% 343.79K 0s 236ms    
Index statistics
Entries:          62346553
DB size:          845 MB
Avg k-mer size:   0.974165
Top 10 k-mers
    LVLVVV  66465
    VVLVVV  60690
    SVSVVV  57022
    VVSVVV  51031
    LVVVVV  47617
    SVVVVV  47491
    VSVVVV  30326
    VVNVVV  24596
    VVQVVV  22202
    VLVVVV  20767
Time for index table init: 0h 0m 0s 902ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 78
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 1
Target db start 1 to 343785
[=================================================================] 100.00% 1 eta -

5831.973105 k-mers per position
20059473 DB matches per sequence
1 overflows
1000 sequences passed prefiltering per query sequence
1000 median result list length
0 sequences with 0 size result lists
Time for merging to pref: 0h 0m 0s 0ms
Time for processing: 0h 0m 1s 503ms
structurealign tmp/5124259549608294352/query PDB_db_folder/PDB tmp/5124259549608294352/search_tmp/3320063337752281300/pref tmp/5124259549608294352/search_tmp/3320063337752281300/strualn --tmscore-threshold 0 --lddt-threshold 0 --sort-by-structure-bits 1 --alignment-type 2 --sub-mat 'aa:3di.out,nucl:3di.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 10 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 0.5 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --zdrop 40 --threads 128 --compressed 0 -v 3 

[=================================================================] 100.00% 1 eta -
Time for merging to strualn: 0h 0m 0s 4ms
Time for processing: 0h 0m 2s 867ms
mvdb tmp/5124259549608294352/search_tmp/3320063337752281300/strualn tmp/5124259549608294352/search_tmp/3320063337752281300/aln 

Time for processing: 0h 0m 0s 2ms
mvdb tmp/5124259549608294352/search_tmp/3320063337752281300/aln tmp/5124259549608294352/result -v 3 

Time for processing: 0h 0m 0s 1ms
Removing temporary files
rmdb tmp/5124259549608294352/search_tmp/3320063337752281300/pref -v 3 

Time for processing: 0h 0m 0s 0ms
convertalis tmp/5124259549608294352/query PDB_db_folder/PDB tmp/5124259549608294352/result output.txt --sub-mat 'aa:3di.out,nucl:3di.out' --format-mode 0 --format-output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits --translation-table 1 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --db-output 0 --db-load-mode 0 --search-type 0 --threads 128 --compressed 0 -v 3 

[=================================================================] 100.00% 1 eta -
Time for merging to output.txt: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 34ms
rmdb tmp/5124259549608294352/result -v 3 

Time for processing: 0h 0m 0s 1ms
rmdb tmp/5124259549608294352/query -v 3 

Time for processing: 0h 0m 0s 0ms
rmdb tmp/5124259549608294352/query_h -v 3 

Time for processing: 0h 0m 0s 0ms
rmdb tmp/5124259549608294352/query_ca -v 3 

Time for processing: 0h 0m 0s 0ms
rmdb tmp/5124259549608294352/query_ss -v 3 

Time for processing: 0h 0m 0s 0ms

The output:

A.pdb   4mr7.cif.gz_A   1.000   407 0   0   1   407 1   407 2.862E-84   3384
A.pdb   7c7s.cif.gz_A   0.992   407 3   0   1   405 2   408 4.897E-77   3026
...
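
The 12 columns of this output follow the "Format alignment output" line in the log above (query, target, fident, alnlen, mismatch, gapopen, qstart, qend, tstart, tend, evalue, bits). As a minimal sketch, assuming whitespace-separated columns in that default order, the hits can be filtered by E-value with awk. The function name `filter_hits` is hypothetical and not part of Foldseek.

```shell
# Hypothetical helper (not part of Foldseek): keep hits whose E-value
# (column 11 in the default output format) is at most a given threshold.
# Prints target, E-value and bit score, separated by tabs.
# Usage: filter_hits RESULTS_FILE MAX_EVALUE
filter_hits() {
    awk -v max="$2" '($11 + 0) <= (max + 0) { print $2 "\t" $11 "\t" $12 }' "$1"
}
```

For example, `filter_hits output.txt 1e-50` would keep only the hits with an E-value of 1e-50 or lower.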

However, when I run it with --cluster-search 1, it fails with the error "No datafile could be found for PDB_db_folder/PDB_seq!": foldseek easy-search A.pdb PDB_db_folder/PDB output.txt tmp --cluster-search 1

Create directory tmp
easy-search A.pdb PDB_db_folder/PDB output.txt tmp --cluster-search 1 

MMseqs Version:                 8.ef4e960
Seq. id. threshold              0
Coverage threshold              0
Coverage mode                   0
Max reject                      2147483647
Max accept                      2147483647
Add backtrace                   false
TMscore threshold               0
TMalign hit order               0
TMalign fast                    1
Preload mode                    0
Threads                         128
Verbosity                       3
LDDT threshold                  0
Sort by structure bit score     1
Alignment type                  2
Substitution matrix             aa:3di.out,nucl:3di.out
Alignment mode                  3
Alignment mode                  0
E-value threshold               10
Min alignment length            0
Seq. id. mode                   0
Alternative alignments          0
Max sequence length             65535
Compositional bias              1
Compositional bias              1
Gap open cost                   aa:10,nucl:10
Gap extension cost              aa:1,nucl:1
Compressed                      0
Seed substitution matrix        aa:3di.out,nucl:3di.out
Sensitivity                     9.5
k-mer length                    6
Target search mode              0
k-score                         seq:2147483647,prof:2147483647
Max results per query           1000
Split database                  0
Split mode                      2
Split memory limit              0
Diagonal scoring                true
Exact k-mer matching            0
Mask residues                   0
Mask residues probability       0.99995
Mask lower case residues        1
Minimum diagonal score          30
Selected taxa                   
Spaced k-mers                   1
Spaced k-mer pattern            
Local temporary path            
Exhaustive search mode          false
Prefilter mode                  0
Search iterations               1
Remove temporary files          true
MPI runner                      
Force restart with latest tmp   false
Cluster search                  1
Chain name mode                 0
Write mapping file              0
Mask b-factor threshold         0
Coord store mode                2
Write lookup file               1
Tar Inclusion Regex             .*
Tar Exclusion Regex             ^$
File Inclusion Regex            .*
File Exclusion Regex            ^$
Alignment format                0
Format alignment output         query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits
Database output                 false
Greedy best hits                false

createdb A.pdb tmp/12283752656240867025/query --chain-name-mode 0 --write-mapping 0 --mask-bfactor-threshold 0 --coord-store-mode 2 --write-lookup 1 --tar-include '.*' --tar-exclude '^$' --file-include '.*' --file-exclude '^$' --threads 128 -v 3 

Output file: tmp/12283752656240867025/query
[=================================================================] 100.00% 1 eta -
Time for merging to query_ss: 0h 0m 0s 4ms
Time for merging to query_h: 0h 0m 0s 2ms
Time for merging to query_ca: 0h 0m 0s 2ms
Time for merging to query: 0h 0m 0s 2ms
Ignore 0 out of 1.
Too short: 0, incorrect: 0, not proteins: 0.
Time for processing: 0h 0m 0s 40ms
Create directory tmp/12283752656240867025/search_tmp
search tmp/12283752656240867025/query PDB_db_folder/PDB tmp/12283752656240867025/result tmp/12283752656240867025/search_tmp --alignment-mode 3 --comp-bias-corr 1 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 -s 9.5 -k 6 --mask 0 --mask-prob 0.99995 --remove-tmp-files 1 --cluster-search 1 

prefilter tmp/12283752656240867025/query_ss PDB_db_folder/PDB_ss tmp/12283752656240867025/search_tmp/4204095793800578284/pref --sub-mat 'aa:3di.out,nucl:3di.out' --seed-sub-mat 'aa:3di.out,nucl:3di.out' -s 9.5 -k 6 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 1000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 0.15 --diag-score 1 --exact-kmer-matching 0 --mask 0 --mask-prob 0.99995 --mask-lower-case 1 --min-ungapped-score 30 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 128 --compressed 0 -v 3 

Query database size: 1 type: Aminoacid
Estimated memory consumption: 3G
Target database size: 343785 type: Aminoacid
Index table k-mer threshold: 78 at k-mer size 6 
Index table: counting k-mers
[=================================================================] 100.00% 343.79K 0s 192ms     
Index table: Masked residues: 4286
Index table: fill
[=================================================================] 100.00% 343.79K 0s 226ms    
Index statistics
Entries:          62346553
DB size:          845 MB
Avg k-mer size:   0.974165
Top 10 k-mers
    LVLVVV  66465
    VVLVVV  60690
    SVSVVV  57022
    VVSVVV  51031
    LVVVVV  47617
    SVVVVV  47491
    VSVVVV  30326
    VVNVVV  24596
    VVQVVV  22202
    VLVVVV  20767
Time for index table init: 0h 0m 0s 885ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 78
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 1
Target db start 1 to 343785
[=================================================================] 100.00% 1 eta -

5831.973105 k-mers per position
20059473 DB matches per sequence
1 overflows
1000 sequences passed prefiltering per query sequence
1000 median result list length
0 sequences with 0 size result lists
Time for merging to pref: 0h 0m 0s 0ms
Time for processing: 0h 0m 1s 401ms
structurealign tmp/12283752656240867025/query PDB_db_folder/PDB tmp/12283752656240867025/search_tmp/4204095793800578284/pref tmp/12283752656240867025/search_tmp/4204095793800578284/strualn --tmscore-threshold 0 --lddt-threshold 0 --sort-by-structure-bits 1 --alignment-type 2 --sub-mat 'aa:3di.out,nucl:3di.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 10 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 0.5 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --zdrop 40 --threads 128 --compressed 0 -v 3 

[=================================================================] 100.00% 1 eta -
Time for merging to strualn: 0h 0m 0s 4ms
Time for processing: 0h 0m 3s 371ms
mergeresultsbyset tmp/12283752656240867025/search_tmp/4204095793800578284/strualn PDB_db_folder/PDB tmp/12283752656240867025/search_tmp/4204095793800578284/strualn_expanded --threads 128 --compressed 0 -v 3 

Time for merging to strualn_expanded: 0h 0m 0s 3ms
Time for processing: 0h 0m 0s 75ms
setextendeddbtype tmp/12283752656240867025/search_tmp/4204095793800578284/strualn_expanded --extended-dbtype 2 

Time for processing: 0h 0m 0s 0ms
structurealign tmp/12283752656240867025/query PDB_db_folder/PDB tmp/12283752656240867025/search_tmp/4204095793800578284/strualn_expanded tmp/12283752656240867025/search_tmp/4204095793800578284/aln --tmscore-threshold 0 --lddt-threshold 0 --sort-by-structure-bits 1 --alignment-type 2 --sub-mat 'aa:3di.out,nucl:3di.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 10 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 0.5 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --zdrop 40 --threads 128 --compressed 0 -v 3 

No datafile could be found for PDB_db_folder/PDB/PDB_seq!
Error: Alignment step died
Error: Search died

Could you please help me resolve this issue?

Expected Behavior

Output with matches, something like this:

A.pdb   4mr7.cif.gz_A   1.000   407 0   0   1   407 1   407 2.862E-84   3384
A.pdb   7c7s.cif.gz_A   0.992   407 3   0   1   405 2   408 4.897E-77   3026
A.pdb   6w2y.cif.gz_A   1.000   403 0   0   2   404 1   403 6.945E-77   2978

Context

Foldseek does not report all relevant matches by default, and I would like to include more high-scoring hits by using --cluster-search 1.


milot-mirdita commented 9 months ago

I found out what's going on. The uploaded database contains a few symlinks that break if the output database name is not exactly the same as the one we used to create it.
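
The file listing in the original report is consistent with this: entries such as PDB_seq.0 point to lowercase names (pdb, pdb_ca, pdb_h), which do not exist when the database was downloaded as PDB on a case-sensitive filesystem. A minimal sketch for checking a database folder for such dangling links is below. The function name `find_broken_symlinks` is hypothetical and not part of Foldseek.

```shell
# Hypothetical helper (not part of Foldseek): list symlinks directly inside
# a directory whose targets cannot be resolved.
# Usage: find_broken_symlinks DIR
find_broken_symlinks() {
    dir="${1:-.}"
    for f in "$dir"/*; do
        # -L: the entry is a symlink; ! -e: its target does not exist
        if [ -L "$f" ] && [ ! -e "$f" ]; then
            printf '%s -> %s\n' "$f" "$(readlink "$f")"
        fi
    done
}
```

Running `find_broken_symlinks PDB_db_folder` on the folder from the report would be expected to list the affected `PDB_seq*` links, if the diagnosis above applies.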

As a workaround for now, please download the PDB with:

foldseek databases PDB pdb tmp

Then use pdb as the database name, and it should work.
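
Put together with the search command from the report, the workaround might look like the sketch below. The `run` and `cluster_search_workaround` helpers are hypothetical, and placing the database under a folder prefix (here `PDB_db_folder/pdb`) is an assumption; only the lowercase name `pdb` comes from the advice above. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
# Hypothetical dry-run wrapper: print the command (DRY_RUN=1, the default)
# or execute it (DRY_RUN=0).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Re-download the PDB under the lowercase name "pdb", then repeat the search
# with --cluster-search 1 against that database.
# Arguments: query file, database folder, output file, temporary folder.
cluster_search_workaround() {
    query="$1"; db_dir="$2"; out="$3"; tmp="$4"
    # Assumption: the database may live in a subfolder as long as its
    # own name is exactly "pdb".
    run foldseek databases PDB "$db_dir/pdb" "$tmp" &&
    run foldseek easy-search "$query" "$db_dir/pdb" "$out" "$tmp" --cluster-search 1
}
```

For the files in this report, the call would be `cluster_search_workaround A.pdb PDB_db_folder output.txt tmp`.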

We will fix the uploaded PDB as soon as possible.

DS-ribo commented 9 months ago

Hi @milot-mirdita, thank you for your help - it works now!