steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

What exactly does each class in the foldseek_cluster result represent? #207

Open hasaiki136 opened 10 months ago

hasaiki136 commented 10 months ago

@milot-mirdita @martin-steinegger

Hi, thank you for developing such an amazing tool. I have run into a couple of questions while using Foldseek.

1. I often cluster protein structure datasets with the following command:

foldseek easy-cluster ./*pdb res tmp -c 0.9 

However, I found that the RMSD (obtained from PyMOL alignment) between protein structures within a single cluster varies greatly. Can I assume that the foldseek_cluster result is clustered according to the structural similarity of the proteins?

2. Here are the commands I used to calculate TM-scores for a large number of protein structures:

foldseek createdb ./*pdb targetDB
foldseek createindex targetDB tmp  
foldseek easy-search ./*pdb targetDB aln.m8 tmpFolder --format-output "query,target,alntmscore"

When I ran pairwise comparisons of a large number of protein structures, the TM-score for the same two proteins differed depending on which one was the query, so the resulting TM-scores are not symmetric. Moreover, the TM-scores do not follow the same trend as the RMSD values. Is this expected, or is my understanding of these two metrics flawed?

Looking forward to your reply!

milot-mirdita commented 10 months ago

1) You can cluster with the --tmscore-threshold parameter. That will probably give clusters that are a bit more consistent in RMSD.

2) Our alignments are not necessarily 100% symmetrical, so the alignment passed to TMalign might already differ between the two directions.
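
To see how large the asymmetry is in practice, the aln.m8 file from the easy-search command above can be checked directly. The following is a minimal Python sketch and not part of Foldseek. It assumes aln.m8 is tab-separated with exactly the three columns requested by --format-output (query, target, alntmscore). The function names tm_asymmetry and symmetrize and the example file names are made up for illustration.

```python
import csv
import io


def tm_asymmetry(m8_text):
    """Collect both directions of every pair found in a 3-column m8 table.

    Returns {(a, b): (score_a_vs_b, score_b_vs_a)} with a < b, for pairs
    that were reported in both directions. Self-hits are skipped.
    """
    scores = {}
    for row in csv.reader(io.StringIO(m8_text), delimiter="\t"):
        if len(row) < 3:
            continue  # skip empty or malformed lines
        query, target, tm = row[0], row[1], float(row[2])
        key = (query, target)
        # If a pair is listed more than once, keep its best score.
        scores[key] = max(tm, scores.get(key, 0.0))

    pairs = {}
    for (query, target), score in scores.items():
        if query < target and (target, query) in scores:
            pairs[(query, target)] = (score, scores[(target, query)])
    return pairs


def symmetrize(pairs, how=max):
    """Reduce the two directions to one value per pair (max by default)."""
    return {pair: how(both) for pair, both in pairs.items()}


# Hypothetical example content; in practice use open("aln.m8").read().
example = "a.pdb\tb.pdb\t0.82\nb.pdb\ta.pdb\t0.79\na.pdb\ta.pdb\t1.0\n"
pairs = tm_asymmetry(example)
sym = symmetrize(pairs)
for (a, b), (ab, ba) in pairs.items():
    print(a, b, ab, ba, "diff:", round(abs(ab - ba), 3))
```

Taking the maximum (or the mean) of the two directions is only one way to obtain a symmetric score matrix; which choice is appropriate depends on the downstream analysis.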

I think we should add an RMSD threshold for clustering, similar to --tmscore-threshold, to handle cases like yours. We don't have anything like that currently.
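
On why TM-score and RMSD need not show the same trend: the two metrics weight residue deviations differently. RMSD averages squared distances, so a few badly placed residues dominate it, whereas TM-score sums 1/(1 + (d/d0)^2) over the aligned residues, which caps the contribution of outliers. Below is a small Python illustration using the standard TM-score definition with d0 = 1.24 * (L - 15)^(1/3) - 1.8. The per-residue distances are invented toy values, and this is not how Foldseek computes its scores internally (no superposition or alignment search is done here).

```python
import math


def d0(length):
    """Standard TM-score distance scale in Angstrom (meaningful for length > 15)."""
    return 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, norm_length):
    """TM-score for fixed per-residue distances, normalized by norm_length."""
    scale = d0(norm_length)
    return sum(1.0 / (1.0 + (d / scale) ** 2) for d in distances) / norm_length


def rmsd(distances):
    """Root-mean-square deviation over the same per-residue distances."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


# Toy case A: 100 aligned residues, all uniformly 2.0 A apart.
uniform = [2.0] * 100
# Toy case B: 90 residues fit tightly, 10 residues (e.g. a flexible tail) are far off.
core_plus_tail = [0.5] * 90 + [15.0] * 10

tm_a, rmsd_a = tm_score(uniform, 100), rmsd(uniform)
tm_b, rmsd_b = tm_score(core_plus_tail, 100), rmsd(core_plus_tail)

print("uniform:        TM=%.3f RMSD=%.2f" % (tm_a, rmsd_a))
print("core plus tail: TM=%.3f RMSD=%.2f" % (tm_b, rmsd_b))
```

In this toy setup case B has the higher TM-score and also the higher RMSD, so ranking pairs by one metric does not reproduce the ranking by the other. The same effect can make clusters that look consistent by TM-score appear spread out by RMSD.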