steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

Multi-segmented CATH members #286

Closed Pooryamb closed 1 week ago

Pooryamb commented 3 weeks ago

Hi,

Some CATH domains are defined using separate segments of a PDB structure. For example, according to the CATH domain coordinates, 1a0iA01 has two segments, covering residues 1-35 and 192-239 of 1a0i.

I am wondering how you handle such cases. I initially thought I needed to use multimer mode for searching a protein against these multi-segmented domains, but it appears that cath50 is not available in multimer searching mode. I converted the cath50 database to the FASTA format and realized that for 1a0iA01, the FASTA sequence is the concatenation of residues 1-35 and 192-239. Does this mean that you consider the separate segments of the PDB protein as contiguous, simply neglecting the intervening region (residues 36-191)?

Thank you in advance for your assistance.
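To make the concatenation described in the question concrete, here is a minimal Python sketch. It assumes a chain is available as a mapping from residue number to one-letter code; the function name `concat_segments` and the toy chain are hypothetical and are not part of Foldseek or CATH tooling. The segment boundaries (1-35 and 192-239) are the ones quoted above for 1a0iA01.

```python
def concat_segments(residues, segments):
    """Join the residues of several inclusive (start, end) segments into one string.

    residues: dict mapping residue number -> one-letter amino acid code
    segments: list of (start, end) tuples, in the order they should be joined
    """
    out = []
    for start, end in segments:
        for num in range(start, end + 1):
            # Residues missing from the structure are skipped silently.
            if num in residues:
                out.append(residues[num])
    return "".join(out)


# Toy chain of 239 residues (NOT the real 1a0i sequence).
alphabet = "ACDEFGHIKLMNPQRSTVWY"
toy_chain = {i: alphabet[(i - 1) % 20] for i in range(1, 240)}

# The intervening residues 36-191 are simply left out.
domain_seq = concat_segments(toy_chain, [(1, 35), (192, 239)])
print(len(domain_seq))
```

Real PDB numbering can contain insertion codes and gaps, which this sketch does not handle.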

martin-steinegger commented 3 weeks ago

You can just use --alt-ali to find alternative alignments. By default we set it to 1, so you will only find the highest-scoring domain, but if you increase it you will find alternatives.
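As a rough illustration of the suggestion above, the following Python sketch only assembles a `foldseek easy-search` command line with a larger `--alt-ali` value; it does not run it. The helper name `build_alt_ali_search`, the file names, and the value 5 are placeholders chosen for the example, not values recommended in this thread.

```python
import shlex


def build_alt_ali_search(query, target_db, out_file, tmp_dir, alt_ali):
    """Return the argument list for a foldseek easy-search call with --alt-ali set."""
    return [
        "foldseek", "easy-search",
        query, target_db, out_file, tmp_dir,
        "--alt-ali", str(alt_ali),
    ]


# Placeholder paths; replace with your own query structure and target database.
search_cmd = build_alt_ali_search("query.pdb", "cath50", "aln.m8", "tmp", 5)
print(shlex.join(search_cmd))
```

The list can be passed to `subprocess.run(search_cmd, check=True)` once Foldseek is installed and the paths exist.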

Pooryamb commented 3 weeks ago

Thanks for your reply!

I am curious about the best way to search against the multi-segmented domains. For example, 1a0iA01 has two segments, covering residues 1-35 and 192-239 of 1a0i. Should it be stored as a multi-chain structure, or does it have to be stored as a single-chain one, with the intervening part (36-191) ignored?

I also have another question: what is the meaning of "50" at the end of "cath50"? I thought it meant that CATH sequences had been clustered by sequence identity, so that sequences with more than 50% sequence identity fall in the same cluster. However, I see some redundant sequences such as 8icwA02 and 8icbA02 in cath50, which suggests it is probably not a clustered version of CATH.

martin-steinegger commented 2 weeks ago

I think separate entries are probably better than keeping them as one file. We clustered all domains with foldseek cluster, using --min-seq-id 0.5 -c 0.9.
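For reference, a call with the options quoted above might be assembled as in this Python sketch. The database names `cathDB` and `cathClu` and the helper `build_cluster_cmd` are hypothetical; the thread does not give the exact command line, and the input is assumed to be a Foldseek database created beforehand.

```python
def build_cluster_cmd(input_db, cluster_db, tmp_dir, min_seq_id=0.5, coverage=0.9):
    """Return the argument list for a foldseek cluster call with the quoted thresholds."""
    return [
        "foldseek", "cluster",
        input_db, cluster_db, tmp_dir,
        "--min-seq-id", str(min_seq_id),  # minimum sequence identity of the alignment
        "-c", str(coverage),              # minimum alignment coverage
    ]


# Placeholder database names; the command is only built here, not executed.
cluster_cmd = build_cluster_cmd("cathDB", "cathClu", "tmp")
print(" ".join(cluster_cmd))
```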

Pooryamb commented 2 weeks ago

Thanks,

If cath50 is a clustered version of cath, why does it contain redundant sequences? For example, 8icbA02, 8icsA02, and 8ictA02 all have identical sequences.

martin-steinegger commented 2 weeks ago

We structurally clustered it. Is the structure different?

Pooryamb commented 1 week ago

What is the measure of structural similarity, and what is the cutoff for putting two proteins in the same cluster? And how would the clustering have changed if you had used MMseqs with the same options rather than Foldseek? I thought that using --min-seq-id with Foldseek would cluster merely based on sequence.

martin-steinegger commented 1 week ago

The similarity is determined by structure through the 3Di alignments. Once we have the alignment, we can infer the sequence identity of the structural alignment.
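The idea of reading a sequence identity off an existing alignment can be sketched in Python as follows. The sketch assumes identity is the fraction of identical residues among aligned (non-gap) columns; the thread does not state which denominator Foldseek uses, so treat that choice, the function name `alignment_identity`, and the toy strings as assumptions.

```python
def alignment_identity(query_aln, target_aln):
    """Fraction of identical residues over aligned columns of a pairwise alignment.

    Both strings are gapped amino acid sequences of equal length, with '-' as the
    gap character. Columns containing a gap on either side are not counted.
    """
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned strings must have the same length")
    aligned = 0
    identical = 0
    for q, t in zip(query_aln, target_aln):
        if q == "-" or t == "-":
            continue
        aligned += 1
        if q == t:
            identical += 1
    return identical / aligned if aligned else 0.0


# Toy alignment: the residue pairing would come from the structural alignment;
# identity is then computed on the amino acids placed in each column.
identity = alignment_identity("MKV-LAT", "MRVQLA-")
print(identity)
```

Under this definition, the alignment itself is decided by structure, and the identity threshold acts as a filter applied afterwards to the amino acids in the aligned columns.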

Pooryamb commented 1 week ago

Thanks!