Unable to cluster large set of similar proteins

steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.

https://foldseek.com

GNU General Public License v3.0


Unable to cluster large set of similar proteins #205

Open Raul-araya-secchi opened 8 months ago

Raul-araya-secchi commented 8 months ago

I'm trying to cluster a set of ~500 conformations of the same protein but no matter what I try I only get 1 cluster.

Do I need to do something to the set of structures before clustering them?

I'm a bit lost and would appreciate any help.

Thanks.

martin-steinegger commented 8 months ago

@Raul-araya-secchi Foldseek was designed to cluster huge collections of diverse protein structures. However, you could try the --tmscore-threshold parameter.

ZiyaoLi commented 6 months ago

I'm in the same position and I tried adding the --tmscore-threshold argument. However, no matter how high I set the threshold (even 0.999), the clustering algorithm cannot distinguish 4ake from 1ake. Do you have any suggestions?

The commands I tried are:

foldseek easy-cluster data/ res tmp --tmscore-threshold 0.99
foldseek easy-cluster data/ res tmp --tmscore-threshold 0.99 --alignment-type 1

plus some other attempts that modified -c and -e.

ZiyaoLi commented 6 months ago

I also printed the calculated TM-scores, and they seem to be correct. The following results were produced with foldseek easy-search data/ data/ aln tmp --format-output query,target,alntmscore,prob (columns: query, target, alntmscore, prob).

4ake.cif.gz_A   4ake.cif.gz_A   1.000E+00       1.000
4ake.cif.gz_A   4ake.cif.gz_B   9.908E-01       1.000
4ake.cif.gz_A   2eck.cif.gz_A   6.868E-01       1.000
4ake.cif.gz_A   2eck.cif.gz_B   6.876E-01       1.000
4ake.cif.gz_A   1ake.cif.gz_A   6.854E-01       1.000
4ake.cif.gz_A   4jzk.cif.gz_B   6.869E-01       1.000
4ake.cif.gz_A   1e4v.cif.gz_A   6.854E-01       1.000
4ake.cif.gz_A   7apu.cif.gz_A   6.858E-01       1.000
4ake.cif.gz_A   1e4v.cif.gz_B   6.870E-01       1.000
4ake.cif.gz_A   7apu.cif.gz_B   6.858E-01       1.000
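As a quick sanity check on the numbers above, the rows can be split by a TM-score cutoff to see which pairs one would expect to stay together. This is only a minimal sketch: it assumes whitespace-separated columns in the order given by --format-output, and it assumes the threshold is compared against the alntmscore column, which may not match what Foldseek does internally. The function names are hypothetical.

```python
# Rows copied from the easy-search output in this comment
# (columns: query, target, alntmscore, prob).
SAMPLE = """\
4ake.cif.gz_A 4ake.cif.gz_A 1.000E+00 1.000
4ake.cif.gz_A 4ake.cif.gz_B 9.908E-01 1.000
4ake.cif.gz_A 2eck.cif.gz_A 6.868E-01 1.000
4ake.cif.gz_A 2eck.cif.gz_B 6.876E-01 1.000
4ake.cif.gz_A 1ake.cif.gz_A 6.854E-01 1.000
4ake.cif.gz_A 4jzk.cif.gz_B 6.869E-01 1.000
4ake.cif.gz_A 1e4v.cif.gz_A 6.854E-01 1.000
4ake.cif.gz_A 7apu.cif.gz_A 6.858E-01 1.000
4ake.cif.gz_A 1e4v.cif.gz_B 6.870E-01 1.000
4ake.cif.gz_A 7apu.cif.gz_B 6.858E-01 1.000
"""


def parse_hits(text):
    """Return (query, target, tmscore) tuples from whitespace-separated rows."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        hits.append((fields[0], fields[1], float(fields[2])))
    return hits


def split_by_threshold(hits, threshold):
    """Separate hits at or above the threshold from those below it."""
    same, different = [], []
    for hit in hits:
        (same if hit[2] >= threshold else different).append(hit)
    return same, different


same, different = split_by_threshold(parse_hits(SAMPLE), 0.99)
print("at or above 0.99:", [target for _, target, _ in same])
print("below 0.99:", [target for _, target, _ in different])
```

Under these assumptions only the two 4ake chains pass a 0.99 cutoff, so 1ake ending up in the same cluster is unexpected.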

martin-steinegger commented 6 months ago

@ZiyaoLi could you please share the PDB files of the data folder?

martin-steinegger commented 6 months ago

Thank you for reporting this. This was a bug in the structurerescorediagonal code: it did not respect the TM-score threshold properly. I have fixed it, and the following command now yields two clusters:

foldseek easy-cluster 4ake.pdb 1ake.pdb clu tmp  --tmscore-threshold 0.99

ZiyaoLi commented 6 months ago

You are so fast. I was still preparing a minimal demo. The demo is here anyway.

NatureGeorge commented 6 months ago

@martin-steinegger I encountered a related issue. The clustering process does not seem to be robust on large datasets.

To be clear, I was using foldseek to reduce redundancy in the PDB and expected nearly identical PDB structures to be clustered into the same group. So I ran

foldseek easy-cluster the_folder_to_be_cluster custom_name_for_output tmp -c 0.8 --tmscore-threshold 0.9 --sort-by-structure-bits 1 --min-seq-id 0.9 --cov-mode 0

(foldseek Version: 035edc185941bbab615083aab80c63725f5a48f6)

When the_folder_to_be_cluster contains only the following PDB files:

3CYO,3CP1,101K,1AKE,4AKE,865K,2M06,2LHF,1YO7,2HZ8,738K,2LFD

it works well: 3CYO and 3CP1 are clustered into the same group.

However, if the_folder_to_be_cluster contains thousands of files, 3CYO and 3CP1 end up in separate clusters.

The clustering process actually works pretty well in most cases; I only found out by accident that 3CYO and 3CP1 are in separate clusters.
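To check pairs like this across the small and large runs, the cluster assignment file can be compared directly. The sketch below assumes easy-cluster writes a two-column, tab-separated file (representative, member) named after the output prefix with a _cluster.tsv suffix, and that member names start with the PDB ID; both are assumptions to verify against your own output. The example rows and function names are hypothetical, not results from this issue.

```python
import io


def load_clusters(handle):
    """Map each member name to its cluster representative."""
    member_to_rep = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        rep, member = fields
        member_to_rep[member] = rep
    return member_to_rep


def representatives_of(member_to_rep, entry_id):
    """Representatives of all members whose name starts with entry_id."""
    prefix = entry_id.lower()
    return {
        rep
        for member, rep in member_to_rep.items()
        if member.lower().startswith(prefix)
    }


def share_cluster(member_to_rep, entry_a, entry_b):
    """True if any chain of entry_a shares a cluster with any chain of entry_b."""
    reps_a = representatives_of(member_to_rep, entry_a)
    reps_b = representatives_of(member_to_rep, entry_b)
    return bool(reps_a & reps_b)


# Hypothetical rows in the assumed representative<TAB>member layout.
EXAMPLE_TSV = (
    "3cyo.pdb_A\t3cyo.pdb_A\n"
    "3cyo.pdb_A\t3cp1.pdb_A\n"
    "1ake.pdb_A\t1ake.pdb_A\n"
    "4ake.pdb_A\t4ake.pdb_A\n"
)

clusters = load_clusters(io.StringIO(EXAMPLE_TSV))
print("3CYO/3CP1 together:", share_cluster(clusters, "3CYO", "3CP1"))
print("1AKE/4AKE together:", share_cluster(clusters, "1AKE", "4AKE"))
```

For a real run, replace the io.StringIO example with an open() call on the cluster file from each run and compare the two answers.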