steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

How to conduct clusters based on global structures instead of local substructures? #265

Open krinoz opened 7 months ago

krinoz commented 7 months ago

Thanks for your work. When I try to cluster protein structures, the program always returns clusters based on the proteins' substructures. How can I get clusters based on the global structures instead?

--alignment-type 1 seems to give global results based on TM-align. Here is the command that I used:

./foldseek/bin/foldseek easy-cluster ./protein res tmp --lddt-threshold 0.5 --c 1 --alignment-type 1

However, the results I get look like:

1ai6.pdb_B 1ai6.pdb_B
1ai6.pdb_B 1ai7.pdb_B
1ai6.pdb_B 1ajn.pdb_B
1ai6.pdb_B 1ajp.pdb_B
1ai6.pdb_B 1ajq.pdb_B
1ai7.pdb_A 1ai7.pdb_A
1ai7.pdb_A 1ai6.pdb_A
1ai7.pdb_A 1ajn.pdb_A
1ai7.pdb_A 1ajp.pdb_A
1ai7.pdb_A 1ajq.pdb_A

And I would like to get the format of the clusters similar to:

1ai6.pdb 1ai6.pdb
1ai6.pdb 1ajn.pdb
1ai6.pdb 1ajq.pdb
1ai7.pdb 1ai7.pdb
1ai7.pdb 1ajp.pdb

Thanks.

milot-mirdita commented 6 months ago

We are working on a new method that covers multimer clustering. It's not quite ready yet, but we should have something out soon.

sirius777coder commented 5 months ago

Can we just select the number of chains to increase the clustering efficiency?

jstrobaek commented 3 months ago

A workaround for this is to use a tool like pdb-tools and give all your chains the same name (e.g. using pdb_chain -A [PDB FILE]). It doesn't scale well, but it's very doable if you don't have tens of thousands of files to cluster.
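
For anyone without pdb-tools installed, the relabeling step in the workaround above can be sketched in plain Python. This is only an illustration of the idea, not the pdb-tools implementation: the function names set_chain_id and relabel_directory are hypothetical, and it assumes fixed-column PDB files where the chain ID sits in column 22 of ATOM, HETATM, ANISOU and TER records. Whether Foldseek then treats the relabeled file as one structure is the claim from the comment above and is not verified here.

```python
from pathlib import Path

# Record types that carry a chain ID in column 22 of the fixed-width PDB format.
CHAIN_RECORDS = ("ATOM", "HETATM", "ANISOU", "TER")


def set_chain_id(pdb_text: str, chain_id: str = "A") -> str:
    """Return pdb_text with the chain ID of every coordinate record set to chain_id."""
    if len(chain_id) != 1:
        raise ValueError("PDB chain IDs are a single character")
    out = []
    for line in pdb_text.splitlines():
        # Short lines (e.g. a bare "TER") have no chain column and are left alone.
        if line.startswith(CHAIN_RECORDS) and len(line) >= 22:
            line = line[:21] + chain_id + line[22:]
        out.append(line)
    return "\n".join(out) + "\n"


def relabel_directory(src: str, dst: str, chain_id: str = "A") -> int:
    """Write a relabeled copy of every *.pdb file in src to dst; return the file count."""
    dst_path = Path(dst)
    dst_path.mkdir(parents=True, exist_ok=True)
    count = 0
    for pdb_file in sorted(Path(src).glob("*.pdb")):
        relabeled = set_chain_id(pdb_file.read_text(), chain_id)
        (dst_path / pdb_file.name).write_text(relabeled)
        count += 1
    return count
```

Calling relabel_directory("./protein", "./protein_single_chain") would leave the originals untouched and write the relabeled copies to a new folder (both paths are placeholders), which can then be passed to easy-cluster in place of the original directory. Note that residue numbers from different chains may overlap after relabeling; this sketch does not renumber them.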