steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

Increasing cluster stringency? #156

Closed: richardshuai closed this issue 1 year ago

richardshuai commented 1 year ago

I'm trying to cluster a dataset of thousands of highly similar protein structures using foldseek cluster, but foldseek gives me very few clusters (maybe 30-40) even with very strict structural alignment cutoffs (-c 0.999, -e 0.001). The structures within a given cluster still look different when viewed in PyMOL. How would I go about increasing cluster stringency beyond these cutoffs?

martin-steinegger commented 1 year ago

Could you please provide more specific details regarding the application you have in mind for clustering?

One thing that comes to mind is to use the TM-score. We have added a feature that allows clustering based on the TM-score, so you might want to explore the --tmscore-threshold option for your needs.
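For readers who want to try this, here is a minimal sketch of how such a call could be assembled from Python. The flags -c, -e and --tmscore-threshold are the ones mentioned in this thread; the easy-cluster subcommand with its input, output-prefix and tmp arguments follows the Foldseek README. The helper name build_cluster_command, the directory names, and the threshold values are hypothetical and only serve as an illustration; check `foldseek easy-cluster -h` for the options your installed version accepts.

```python
import shlex


def build_cluster_command(input_dir, out_prefix, tmp_dir,
                          tmscore_threshold, coverage=None, evalue=None):
    """Assemble a foldseek clustering command as an argument list.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    cmd = [
        "foldseek", "easy-cluster", input_dir, out_prefix, tmp_dir,
        "--tmscore-threshold", str(tmscore_threshold),
    ]
    # Optional criteria from the thread; they are added only when requested.
    if coverage is not None:
        cmd += ["-c", str(coverage)]
    if evalue is not None:
        cmd += ["-e", str(evalue)]
    return cmd


# Example values are assumptions, not recommendations.
cmd = build_cluster_command("antibody_pdbs/", "clusters", "tmp",
                            tmscore_threshold=0.9, coverage=0.999)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes it easy to sweep several thresholds and compare the resulting cluster counts.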

richardshuai commented 1 year ago

Thank you! It does seem like clustering based on TM-score gives me results closer to what I want. I'm trying to cluster antibody structures (so they will be highly similar except in their hypervariable CDRs), so I wanted a way to cluster such that antibodies with similar CDR orientations will be placed in the same cluster.

Is there an easy interpretation of --tmscore-threshold in terms of what it means for the members of each individual cluster? Also, are other options such as --min-seq-id and -c still applied in the Foldseek-TM mode?

martin-steinegger commented 1 year ago

Interesting, I have never tried to cluster highly similar structures. Yes, you can combine clustering criteria. Increasing the coverage does make sense for your use case.
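On the interpretation question raised above, it may help to look at how a TM-score is computed. The sketch below uses the standard TM-score definition (sum over aligned residue pairs of 1 / (1 + (d_i / d0)^2), divided by the normalization length, with d0 = 1.24 * (L - 15)^(1/3) - 1.8). It is not Foldseek's implementation, and which length Foldseek normalizes by is not stated in this thread. The function names and the toy distances are hypothetical.

```python
def d0_for_length(length):
    """Distance scale d0 (in angstroms) of the standard TM-score formula."""
    if length <= 21:
        # Simplification for this sketch: very short chains are not handled.
        raise ValueError("this sketch only covers lengths above 21 residues")
    return 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, norm_length):
    """TM-score from distances (angstroms) between aligned residue pairs.

    distances: one value per aligned pair, measured after superposition.
    norm_length: chain length used for normalization.
    """
    d0 = d0_for_length(norm_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / norm_length


# Toy antibody-like case (made-up numbers): 200 framework residues that
# superpose perfectly and 30 CDR residues that are each displaced by 5 angstroms.
distances = [0.0] * 200 + [5.0] * 30
print(round(tm_score(distances, 230), 3))
```

In this toy case the score stays above 0.9 even though every CDR residue has moved by 5 angstroms, because the identical framework dominates the sum. That suggests a threshold close to 1 may be needed before CDR differences separate clusters, though the right value would have to be found empirically.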