How to obtain a complete PDB through foldseek?

steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.

https://foldseek.com

GNU General Public License v3.0


How to obtain a complete PDB through foldseek? #172

Open jiaweiguan opened 10 months ago

jiaweiguan commented 10 months ago

foldseek easy-search ./query/ {database_path} tmp --format-mode 5

Result:

The returned results contain only Cα atoms. Is there any other way for me to obtain a complete PDB?

martin-steinegger commented 10 months ago

We do not store the full PDB in our databases but just Cα to keep the databases small. In order to get the full PDB files you would need to use our compressed Foldcomp databases, accessible through a Python interface, or download them from the EBI directly.

If you want to superpose the x,y,z coordinates of the target structure, you would need to:

1. Print out u and t using the --format-output parameter of the easy-search workflow.
2. Apply the following transformation to the original coordinates (x, y, z) to obtain the superposed coordinates (x', y', z'):

x' = t[0] + x * u[0][0] + y * u[0][1] + z * u[0][2]
y' = t[1] + x * u[1][0] + y * u[1][1] + z * u[1][2]
z' = t[2] + x * u[2][0] + y * u[2][1] + z * u[2][2]
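
The transformation above can be sketched in Python as follows. This is a minimal illustration, not code from foldseek: the helper names `parse_u_t` and `superpose` are hypothetical, and it assumes that u is written as nine comma-separated values in row-major order and t as three comma-separated values, which you should verify against your own output. All three new coordinates are computed from the original x, y, z, so no value is overwritten before it is used.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def parse_u_t(u_field: str, t_field: str) -> Tuple[List[List[float]], List[float]]:
    """Parse the u and t columns of one result line.

    Assumption: u holds 9 comma-separated values (row-major 3x3 rotation
    matrix) and t holds 3 comma-separated values (translation vector).
    """
    u_vals = [float(v) for v in u_field.split(",")]
    t_vals = [float(v) for v in t_field.split(",")]
    if len(u_vals) != 9 or len(t_vals) != 3:
        raise ValueError("expected 9 values for u and 3 values for t")
    u = [u_vals[0:3], u_vals[3:6], u_vals[6:9]]
    return u, t_vals


def superpose(coords: Sequence[Vec3],
              u: Sequence[Sequence[float]],
              t: Sequence[float]) -> List[Vec3]:
    """Apply x' = t + u * x to every (x, y, z) coordinate."""
    out: List[Vec3] = []
    for x, y, z in coords:
        # Each new coordinate uses the ORIGINAL x, y, z.
        out.append((
            t[0] + x * u[0][0] + y * u[0][1] + z * u[0][2],
            t[1] + x * u[1][0] + y * u[1][1] + z * u[1][2],
            t[2] + x * u[2][0] + y * u[2][1] + z * u[2][2],
        ))
    return out


if __name__ == "__main__":
    # Made-up values for illustration: a 90-degree rotation about z plus a shift.
    u, t = parse_u_t("0,-1,0,1,0,0,0,0,1", "1,2,3")
    print(superpose([(1.0, 0.0, 0.0)], u, t))
```

The same function can be applied to every atom of the full target structure, not only to the Cα atoms, because the rotation and translation are defined for the whole coordinate frame.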

jiaweiguan commented 10 months ago

Thank you for your help!

martin-steinegger commented 10 months ago

Neither of these databases is clustered by foldseek easy-cluster. We only provide databases clustered by amino acid sequence. The only preclustered databases are Alphafold/UniProt50, PDB and ESMAtlas30, which were clustered with MMseqs2.

However, we did cluster the whole Alphafold/UniProt as part of our cluster work. If you want to use these structurally clustered proteins, you can download the representatives through foldcomp; the db is called afdb_rep_v4.

jiaweiguan commented 10 months ago

"We only provide databases clustered by structure." Does this mean that structural clustering is performed before creating the database?

martin-steinegger commented 10 months ago

Sorry for the confusion. I meant "We only provide databases clustered by amino acid sequence." E.g. the UniProt50 is clustered using MMseqs2:

mmseqs cluster afdb afdb50 tmp --min-seq-id 0.5 -c 0.9 --cluster-reassign 1

jiaweiguan commented 10 months ago

Got it! Thanks!

jiaweiguan commented 10 months ago

@martin-steinegger When I execute this command, I find a warning in the log:

foldseek easy-search ./1QYS.pdb ./afdb/afdb res.8m tmp --format-mode 4

Can not touch 415722228266 into main memory

But I still got the search results. I don't know if this result is complete. Can this warning be ignored?

milot-mirdita commented 10 months ago

You can ignore that warning. It's something that we have to fix at some point, but it doesn't affect anything.

jiaweiguan commented 10 months ago

@martin-steinegger Hi! Does 'tstart' start from 0 or 1? Also, 'tstart' seems to be related to chains. If I want to extract the residues from 'tstart' to 'tend', I need the chain.