steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

How is the PDB100 database prepared? #203

Closed BinhongLiu closed 7 months ago

BinhongLiu commented 11 months ago

Expected Behavior

If I understand correctly, the PDB100 database was built from the PDB clustered at 100% sequence identity. I checked the pdb.lookup file, which supposedly contains all the pdb_chain IDs, and found that it includes some unexpected chain IDs, such as 1a0n_MODEL_1_B, 1a0n_MODEL_2_B and 1a0n_MODEL_3_B. I could not find chains with these names in the 1a0n entry from the PDB. What is the difference between these 1a0n_MODEL_*_B chains?
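To make the question concrete, the lookup names can be grouped by entry and chain to see which chains appear under several MODEL numbers. This is only a minimal sketch: it assumes pdb.lookup is tab-separated with the name in the second column, and that names follow the pattern `<entry>_MODEL_<n>_<chain>` (inferred solely from the three IDs quoted above). The helper names and the sample lines are hypothetical.

```python
import re
from collections import defaultdict

# Assumed naming scheme, inferred only from the IDs quoted in this issue:
#   <entry>_MODEL_<n>_<chain>  for entries stored per model
#   <entry>_<chain>            otherwise
NAME_RE = re.compile(r"^(?P<entry>[^_]+)(?:_MODEL_(?P<model>\d+))?_(?P<chain>.+)$")


def parse_name(name):
    """Split a lookup name into (entry, model or None, chain); None if it does not match."""
    m = NAME_RE.match(name)
    if m is None:
        return None
    model = int(m.group("model")) if m.group("model") else None
    return m.group("entry"), model, m.group("chain")


def models_per_chain(lookup_lines):
    """Collect model numbers per (entry, chain).

    Assumes each line is tab-separated with the name in the second column.
    """
    grouped = defaultdict(list)
    for line in lookup_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip lines that do not have the assumed layout
        parsed = parse_name(fields[1])
        if parsed is None:
            continue
        entry, model, chain = parsed
        if model is not None:
            grouped[(entry, chain)].append(model)
    return dict(grouped)


# Hypothetical sample lines; the numeric columns are made up for illustration.
sample = [
    "0\t1a0n_MODEL_1_B\t0",
    "1\t1a0n_MODEL_2_B\t0",
    "2\t1a0n_MODEL_3_B\t0",
]
print(models_per_chain(sample))
```

On the real file, the same function could be fed `open("PDB100/pdb.lookup")` instead of the sample list, provided the layout assumption holds.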

Current Behavior

My other question concerns --cluster-search: I'm not sure when this option should be used. The error below occurs when I add --cluster-search 1. Without the option, the search finishes without the error:

Steps to Reproduce (for bugs)

foldseek easy-search 1a0n_B.pdb PDB100/pdb 1a0n_B.m8 tmp --alignment-type 1 --tmscore-threshold 0.3 --max-seqs 1000 -e 10 -s 9.5 --prefilter-mode 1 --cluster-search 1

Foldseek Output (for bugs)

`1a0n_B.m8 exists and will be overwritten easy-search 1a0n_B.pdb PDB100/pdb 1a0n_B.m8 tmp --alignment-type 1 --tmscore-threshold 0.3 --max-seqs 1000 -e 10 -s 9.5 --prefilter-mode 1 --cluster-search 1

MMseqs Version: 8e68e86fc16f50be3c5df09fc96c6495bceeeb20 Seq. id. threshold 0 Coverage threshold 0 Coverage mode 0 Max reject 2147483647 Max accept 2147483647 Add backtrace false TMscore threshold 0.3 TMalign hit order 0 TMalign fast 1 Preload mode 0 Threads 8 Verbosity 3 LDDT threshold 0 Sort by structure bit score 1 Alignment type 1 Substitution matrix aa:3di.out,nucl:3di.out Alignment mode 3 Alignment mode 0 E-value threshold 10 Min alignment length 0 Seq. id. mode 0 Alternative alignments 0 Max sequence length 65535 Compositional bias 1 Compositional bias 1 Gap open cost aa:10,nucl:10 Gap extension cost aa:1,nucl:1 Compressed 0 Seed substitution matrix aa:3di.out,nucl:3di.out Sensitivity 9.5 k-mer length 6 Target search mode 0 k-score seq:2147483647,prof:2147483647 Max results per query 1000 Split database 0 Split mode 2 Split memory limit 0 Diagonal scoring true Exact k-mer matching 0 Mask residues 0 Mask residues probability 0.99995 Mask lower case residues 1 Minimum diagonal score 30 Selected taxa
Spaced k-mers 1 Spaced k-mer pattern
Local temporary path
Exhaustive search mode false Prefilter mode 1 Search iterations 1 Remove temporary files true MPI runner
Force restart with latest tmp false Cluster search 1 Chain name mode 0 Write mapping file 0 Mask b-factor threshold 0 Coord store mode 2 Write lookup file 1 Tar Inclusion Regex . Tar Exclusion Regex ^$ File Inclusion Regex . File Exclusion Regex ^$ Alignment format 0 Format alignment output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits Database output false Greedy best hits false

createdb 1a0n_B.pdb tmp/6818599433398252441/query --chain-name-mode 0 --write-mapping 0 --mask-bfactor-threshold 0 --coord-store-mode 2 --write-lookup 1 --tar-include '.' --tar-exclude '^$' --file-include '.' --file-exclude '^$' --threads 8 -v 3

Output file: tmp/6818599433398252441/query [=================================================================] 100.00% 1 eta - Time for merging to query_ss: 0h 0m 0s 143ms Time for merging to query_h: 0h 0m 0s 159ms Time for merging to query_ca: 0h 0m 0s 105ms Time for merging to query: 0h 0m 0s 68ms Ignore 0 out of 1. Too short: 0, incorrect: 0, not proteins: 0. Time for processing: 0h 0m 2s 785ms Create directory tmp/6818599433398252441/search_tmp search tmp/6818599433398252441/query PDB100/pdb tmp/6818599433398252441/result tmp/6818599433398252441/search_tmp --tmscore-threshold 0.3 --alignment-type 1 --alignment-mode 3 -e 10 --comp-bias-corr 1 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 -s 9.5 -k 6 --max-seqs 1000 --mask 0 --mask-prob 0.99995 --prefilter-mode 1 --remove-tmp-files 1 --cluster-search 1

ungappedprefilter tmp/6818599433398252441/query_ss PDB100/pdb_ss tmp/6818599433398252441/search_tmp/10652644971345159255/pref --sub-mat 'aa:3di.out,nucl:3di.out' -c 0 -e 1.79769e+308 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 0.15 --min-ungapped-score 30 --max-seqs 1000 --db-load-mode 0 --threads 8 --compressed 0 -v 3

[=================================================================] 100.00% 1 eta - Time for merging to pref: 0h 0m 0s 27ms Time for processing: 0h 0m 0s 296ms structurealign tmp/6818599433398252441/query PDB100/pdb tmp/6818599433398252441/search_tmp/10652644971345159255/pref tmp/6818599433398252441/search_tmp/10652644971345159255/strualn --tmscore-threshold 0.3 --lddt-threshold 0 --sort-by-structure-bits 0 --alignment-type 1 --sub-mat 'aa:3di.out,nucl:3di.out' -a 0 --alignment-mode 1 --alignment-output-mode 0 --wrapped-scoring 0 -e 10 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 0.5 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --zdrop 40 --threads 8 --compressed 0 -v 3

[=================================================================] 100.00% 1 eta - Time for merging to strualn: 0h 0m 0s 57ms Time for processing: 0h 0m 3s 346ms mergeresultsbyset tmp/6818599433398252441/search_tmp/10652644971345159255/strualn PDB100/pdb_clu tmp/6818599433398252441/search_tmp/10652644971345159255/strualn_expanded --threads 8 --compressed 0 -v 3

Time for merging to strualn_expanded: 0h 0m 0s 33ms Time for processing: 0h 0m 0s 382ms tmalign tmp/6818599433398252441/query PDB100/pdb tmp/6818599433398252441/search_tmp/10652644971345159255/strualn_expanded tmp/6818599433398252441/search_tmp/10652644971345159255/aln --min-seq-id 0 -c 0 --cov-mode 0 --max-rejected 2147483647 --max-accept 2147483647 -a 0 --add-self-matches 0 --tmscore-threshold 0.3 --tmalign-hit-order 0 --tmalign-fast 1 --db-load-mode 0 --threads 8 -v 3

Query database: tmp/6818599433398252441/query Target database: PDB100/pdb [=================================================================] 100.00% 1 eta - Time for merging to aln: 0h 0m 0s 70ms Time for processing: 0h 0m 24s 611ms Removing temporary files rmdb tmp/6818599433398252441/search_tmp/10652644971345159255/strualn -v 3

Time for processing: 0h 0m 0s 16ms mvdb tmp/6818599433398252441/search_tmp/10652644971345159255/aln tmp/6818599433398252441/result -v 3

Time for processing: 0h 0m 0s 159ms Removing temporary files rmdb tmp/6818599433398252441/search_tmp/10652644971345159255/strualn -v 3

Time for processing: 0h 0m 0s 3ms rmdb tmp/6818599433398252441/search_tmp/10652644971345159255/strualn_expanded -v 3

Time for processing: 0h 0m 0s 16ms rmdb tmp/6818599433398252441/search_tmp/10652644971345159255/pref -v 3

Time for processing: 0h 0m 0s 7ms 1a0n_B.m8 exists and will be overwritten convertalis tmp/6818599433398252441/query PDB100/pdb tmp/6818599433398252441/result 1a0n_B.m8 --sub-mat 'aa:3di.out,nucl:3di.out' --format-mode 0 --format-output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits --translation-table 1 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --db-output 0 --db-load-mode 0 --search-type 0 --threads 8 --compressed 0 -v 3

[=================================================================] 100.00% 1 eta -
Invalid database read for database data file=PDB100/pdb_h, database index=PDB100/pdb_h.index
getData: local id (4294967295) >= db size (341829)
Error: Convert Alignments died`
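One detail in the error stands out: the reported local id, 4294967295, is exactly 2^32 - 1, the largest unsigned 32-bit value and the bit pattern of -1 stored as an unsigned 32-bit integer. C and C++ programs often use this value as a "not found" marker, so the id may come from a failed key lookup rather than a real database entry. That interpretation is an assumption and has not been checked against the Foldseek source. The arithmetic itself can be verified with a short sketch:

```python
import struct

# Values copied from the error message above.
reported_id = 4294967295
db_size = 341829

# Largest value an unsigned 32-bit integer can hold.
UINT32_MAX = 2**32 - 1

# Reinterpret a signed 32-bit -1 as an unsigned 32-bit integer.
minus_one_as_unsigned = struct.unpack("<I", struct.pack("<i", -1))[0]

print(reported_id == UINT32_MAX)             # id equals the 32-bit maximum
print(reported_id == minus_one_as_unsigned)  # same bit pattern as -1
print(reported_id >= db_size)                # the condition the error reports
```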

Context

foldseek version: 8e68e86fc16f50be3c5df09fc96c6495bceeeb20 pdb100 version: 05d31a32d9acf1c5165b17cc35bf4186

Your Environment

Linux

BinhongLiu commented 9 months ago

Hi, I really need your help. Could you help me with my problem? Many thanks.