steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

How to obtain a complete PDB through foldseek? #172

Open jiaweiguan opened 10 months ago

jiaweiguan commented 10 months ago

foldseek easy-search ./query/ {database_path} tmp --format-mode 5

Result: image

The returned results contain only Cα atoms. Is there any other way for me to obtain a complete PDB?

martin-steinegger commented 10 months ago

We do not store the full PDB in our databases, just the Cα atoms, to keep the databases small. To get the full PDB files you would need to use our compressed Foldcomp databases, accessible through a Python interface, or download them from the EBI directly.

If you want to superpose x,y,z coordinates of the target structure, you would need to:

  1. Print out u and t using the --format-output parameter of the easy-search workflow.
  2. Apply the following transformation using the original coordinates:
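x' = t[0] + x * u[0][0] + y * u[0][1] + z * u[0][2]
y' = t[1] + x * u[1][0] + y * u[1][1] + z * u[1][2]
z' = t[2] + x * u[2][0] + y * u[2][1] + z * u[2][2]
where x, y, z are the original coordinates and x', y', z' the superposed ones (all three lines must read the original x, y, z).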
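
The transformation above can be sketched in Python as follows. This is a minimal illustration, not Foldseek code: the helper names `parse_matrix`, `parse_vector` and `superpose` are made up here, and the parsers assume that the `u` and `t` columns arrive as comma-separated text (nine row-major values for `u`, three for `t`). Check that assumption against your own output before relying on it.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def parse_matrix(field: str) -> List[List[float]]:
    """Parse a rotation matrix from text.

    ASSUMPTION: the field holds 9 comma-separated floats in row-major order.
    """
    vals = [float(v) for v in field.split(",")]
    if len(vals) != 9:
        raise ValueError(f"expected 9 values for u, got {len(vals)}")
    return [vals[0:3], vals[3:6], vals[6:9]]


def parse_vector(field: str) -> List[float]:
    """Parse a translation vector from text.

    ASSUMPTION: the field holds 3 comma-separated floats.
    """
    vals = [float(v) for v in field.split(",")]
    if len(vals) != 3:
        raise ValueError(f"expected 3 values for t, got {len(vals)}")
    return vals


def superpose(
    coords: Sequence[Vec3],
    u: Sequence[Sequence[float]],
    t: Sequence[float],
) -> List[Vec3]:
    """Apply x' = t + u * x to every (x, y, z) coordinate.

    Each output component is computed from the original x, y, z,
    so no value is overwritten before it has been used.
    """
    out: List[Vec3] = []
    for x, y, z in coords:
        out.append((
            t[0] + x * u[0][0] + y * u[0][1] + z * u[0][2],
            t[1] + x * u[1][0] + y * u[1][1] + z * u[1][2],
            t[2] + x * u[2][0] + y * u[2][1] + z * u[2][2],
        ))
    return out
```

To use it, read the `u` and `t` fields of one hit, parse them, and pass the target structure's atom coordinates through `superpose`. Returning a new list rather than updating in place avoids the pitfall of reusing an already transformed `x` when computing `y'` and `z'`.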

jiaweiguan commented 10 months ago

Thank you for your help!

martin-steinegger commented 10 months ago

Neither of these databases is clustered by foldseek easy-cluster. We only provide databases clustered by amino acid sequence. The only preclustered databases are AlphaFold/UniProt50, PDB and ESMAtlas30, and they were clustered with MMseqs2.

However, we did cluster the whole AlphaFold/UniProt as part of our clustering work. If you want to use these structurally clustered proteins, you can download the representatives through Foldcomp; the database is called afdb_rep_v4.
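
As a rough sketch of what fetching full-atom representatives through the Foldcomp Python package might look like: the function names `save_structures` and `fetch_representatives` are hypothetical helpers written for this example, and the code assumes the `foldcomp` package is installed and that `foldcomp.setup` / `foldcomp.open` behave as in the Foldcomp README (download a named database, then iterate over `(name, pdb_text)` pairs). The database name `afdb_rep_v4` is taken from the comment above.

```python
import os
from typing import Iterable, List, Sequence, Tuple


def save_structures(entries: Iterable[Tuple[str, str]], out_dir: str) -> List[str]:
    """Write (name, pdb_text) pairs to out_dir as .pdb files; return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths: List[str] = []
    for name, pdb_text in entries:
        fname = os.path.basename(name)
        if not fname.endswith(".pdb"):
            fname += ".pdb"
        path = os.path.join(out_dir, fname)
        with open(path, "w") as fh:
            fh.write(pdb_text)
        paths.append(path)
    return paths


def fetch_representatives(
    ids: Sequence[str],
    out_dir: str,
    db_name: str = "afdb_rep_v4",
) -> List[str]:
    """Download db_name with Foldcomp and save the requested entries.

    ASSUMPTION: foldcomp.setup() downloads the database into the current
    directory and foldcomp.open(..., ids=...) yields (name, pdb_text) pairs.
    """
    import foldcomp  # third-party package, imported lazily

    foldcomp.setup(db_name)
    with foldcomp.open(db_name, ids=list(ids)) as db:
        return save_structures(db, out_dir)
```

Calling `fetch_representatives` with the target identifiers from a search result would then write one full PDB file per hit. Note that the database download can be large, so it is worth running this in a directory with enough free space.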

jiaweiguan commented 10 months ago

"We only provide databases clustered by structure." Does this mean that structural clustering is performed before creating the database?

martin-steinegger commented 10 months ago

Sorry for the confusion. I meant "We only provide databases clustered by amino acid sequence." For example, UniProt50 is clustered using MMseqs2:

mmseqs cluster afdb afdb50 tmp --min-seq-id 0.5 -c 0.9 --cluster-reassign 1

jiaweiguan commented 10 months ago

Got it! Thanks!

jiaweiguan commented 10 months ago

@martin-steinegger When I execute the following command, I find a warning in the log:

foldseek easy-search ./1QYS.pdb ./afdb/afdb res.8m tmp --format-mode 4

Can not touch 415722228266 into main memory

I still got search results, but I don't know whether they are complete. Can this warning be ignored?

milot-mirdita commented 10 months ago

You can ignore that warning. It's something that we have to fix at some point, but it doesn't affect anything.

jiaweiguan commented 10 months ago

@martin-steinegger Hi! Does 'tstart' start from 0 or 1? Also, 'tstart' seems to be related to chains. If I want to extract the residues from 'tstart' to 'tend', I need to know the chain.