Status: Open. Opened by Raul-araya-secchi 8 months ago.
@Raul-araya-secchi Foldseek was meant to cluster huge collections of diverse protein structures. However, you could try the parameter --tmscore-threshold.
I'm in the same position and tried adding the --tmscore-threshold argument. However, no matter how high the threshold is set (even 0.999), the clustering algorithm cannot distinguish 4ake from 1ake. Do you have any suggestions?
The commands I've tried are:
foldseek easy-cluster data/ res tmp --tmscore-threshold 0.99
foldseek easy-cluster data/ res tmp --tmscore-threshold 0.99 --alignment-type 1
and some other attempts at modifying -c and -e.
I also printed the calculated TM-scores, and the results seem correct. The following results were produced with foldseek easy-search data/ data/ aln tmp --format-output query,target,alntmscore,prob.
4ake.cif.gz_A 4ake.cif.gz_A 1.000E+00 1.000
4ake.cif.gz_A 4ake.cif.gz_B 9.908E-01 1.000
4ake.cif.gz_A 2eck.cif.gz_A 6.868E-01 1.000
4ake.cif.gz_A 2eck.cif.gz_B 6.876E-01 1.000
4ake.cif.gz_A 1ake.cif.gz_A 6.854E-01 1.000
4ake.cif.gz_A 4jzk.cif.gz_B 6.869E-01 1.000
4ake.cif.gz_A 1e4v.cif.gz_A 6.854E-01 1.000
4ake.cif.gz_A 7apu.cif.gz_A 6.858E-01 1.000
4ake.cif.gz_A 1e4v.cif.gz_B 6.870E-01 1.000
4ake.cif.gz_A 7apu.cif.gz_B 6.858E-01 1.000
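The scores above can be compared against the threshold directly. This is a minimal sketch, assuming the easy-search output has whitespace-separated columns in the order requested by --format-output (query, target, alntmscore, prob). The function name pairs_below and the temporary sample file are illustrative only and are not part of Foldseek.

```shell
# List query/target pairs whose alignment TM-score is below a threshold.
# Assumes columns: query, target, alntmscore, prob (whitespace-separated).
pairs_below() {
    # $1 = threshold, $2 = alignment file
    awk -v t="$1" '($3 + 0) < (t + 0) { print $1 "\t" $2 "\t" $3 }' "$2"
}

# Three rows copied from the output above, written to a temporary file.
sample=$(mktemp)
printf '4ake.cif.gz_A\t4ake.cif.gz_A\t1.000E+00\t1.000\n' >  "$sample"
printf '4ake.cif.gz_A\t4ake.cif.gz_B\t9.908E-01\t1.000\n' >> "$sample"
printf '4ake.cif.gz_A\t1ake.cif.gz_A\t6.854E-01\t1.000\n' >> "$sample"

# Only the 4ake/1ake pair falls below 0.99.
pairs_below 0.99 "$sample"
```

Any pair listed here has a TM-score below the threshold, so one would expect it not to be merged when --tmscore-threshold is set to the same value.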
@ZiyaoLi could you please share the PDB files of the data folder?
Thank you for reporting this. This was a bug in the structurerescorediagonal code: it did not respect the TM-score threshold properly. I have fixed it, and the following command now results in two clusters:
foldseek easy-cluster 4ake.pdb 1ake.pdb clu tmp --tmscore-threshold 0.99
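To confirm the number of clusters after a run, the cluster table can be counted. This is a sketch that assumes easy-cluster writes a two-column file (representative, then member) named after the output prefix, for example clu_cluster.tsv. That file name and layout are assumptions here, and the helper name count_clusters and the demo rows are made up for illustration.

```shell
# Count distinct cluster representatives (column 1) in a cluster table.
# Assumes one "representative member" pair per line.
count_clusters() {
    # $1 = cluster table
    awk '!seen[$1]++ { n++ } END { print n + 0 }' "$1"
}

# Made-up demo table: two representatives, three members in total.
demo=$(mktemp)
printf 'repA\trepA\n'    >  "$demo"
printf 'repA\tmemberX\n' >> "$demo"
printf 'repB\trepB\n'    >> "$demo"

count_clusters "$demo"
```

With real output this would be called on the cluster table itself, for example count_clusters clu_cluster.tsv, if the file is named as assumed.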
You are so fast. I was preparing a minimal demo. The demo is here anyway.
@martin-steinegger I encountered a related issue. The clustering process does not seem robust on large datasets.
To be clear, I was using foldseek to reduce redundancy in the PDB and would like nearly identical PDB structures to be clustered into the same group. So I ran
foldseek easy-cluster the_folder_to_be_cluster custom_name_for_output tmp -c 0.8 --tmscore-threshold 0.9 --sort-by-structure-bits 1 --min-seq-id 0.9 --cov-mode 0
(foldseek Version: 035edc185941bbab615083aab80c63725f5a48f6)
When the_folder_to_be_cluster contains only the following PDB files:
3CYO,3CP1,101K,1AKE,4AKE,865K,2M06,2LHF,1YO7,2HZ8,738K,2LFD
it works well: 3CYO and 3CP1 are clustered into the same group.
However, if the_folder_to_be_cluster contains thousands of files, 3CYO and 3CP1 end up in separate clusters.
The clustering process actually works well in most cases; I only found out by accident that 3CYO and 3CP1 are in separate clusters.
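A check like this can be scripted against the cluster table. This is a rough sketch, assuming a two-column file (representative, then member) such as custom_name_for_output_cluster.tsv. The file name and layout are assumptions, and the helper same_cluster and the demo rows are hypothetical. Member names must be given exactly as they appear in the second column.

```shell
# Report whether two members share a cluster representative.
# Assumes one "representative member" pair per line.
same_cluster() {
    # $1 = cluster table, $2 and $3 = member names as written in column 2
    awk -v a="$2" -v b="$3" '
        $2 == a { ra = $1 }
        $2 == b { rb = $1 }
        END {
            if (ra == "" || rb == "") { print "missing"; exit }
            print (ra == rb ? "same" : "different")
        }' "$1"
}

# Made-up demo table: structA and structB share repX, structC is alone.
clusters=$(mktemp)
printf 'repX\tstructA\n'    >  "$clusters"
printf 'repX\tstructB\n'    >> "$clusters"
printf 'structC\tstructC\n' >> "$clusters"

same_cluster "$clusters" structA structB
same_cluster "$clusters" structA structC
```

Running the same check on the small and the large input folder would show whether a given pair changes cluster assignment between the two runs.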
I'm trying to cluster a set of ~500 conformations of the same protein, but no matter what I try I only get one cluster.
Do I need to do something to the set of structures before clustering them?
I'm a bit lost and would appreciate any help.
Thanks.