maovshao / PLMAlign

PLMAlign utilizes per-residue embeddings as input to obtain specific alignments and more refined similarity
https://dmiip.sjtu.edu.cn/PLMAlign

How should I interpret the PLMSearch similarity value #2

Closed Rohit-Satyam closed 6 days ago

Rohit-Satyam commented 1 month ago

Hi @maovshao

I was using the PLMSearch database to find remote homologs of some of my proteins, and I was wondering how to interpret the similarity value of 0.995 that I got against a target. I am aware that the sequence similarity will be less than 25%, but what exactly does this 99.5% similarity say? Does it mean that the query protein, when folded, will have 99.5% of its fold similar to the target identified by PLMSearch (structure-wise)? Kindly help.

maovshao commented 1 month ago

If you want to learn more about PLMSearch's similarity, you can refer to the "Method --- Similarity prediction" section of the PLMSearch paper.

In short, the similarity provided by PLMSearch is a "structural similarity predicted by a deep learning model with sequences as input". Its relationship with the actual structural similarity can be seen in "Fig. 6: Reference similarity" and "Supplementary Table 12" of the PLMSearch paper.

In this case, since 0.995 >> 0.7, PLMSearch's similarity prediction model (SS-predictor) believes that the two proteins are structurally similar (or, more strictly speaking, belong to the same fold). In addition, I think the "PLMAlign score" can more accurately reflect the similarity between the two protein structures. It is a more refined similarity calculated by PLMAlign.

All of the similarities mentioned above are structural similarities predicted from sequences. If you are interested in the actual structural similarity, please prepare your real or predicted structures and use TM-align to calculate it.
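
As a rough illustration of the TM-align step, here is a minimal Python sketch. It assumes a `TMalign` executable is on your `PATH` and that its text output contains lines starting with `TM-score=` (one per normalization). The function names and the file names in the usage comment are hypothetical and not part of PLMSearch or PLMAlign.

```python
import re
import subprocess


def parse_tm_scores(tmalign_stdout):
    """Extract every 'TM-score= <value>' number from TM-align's text output."""
    return [float(v) for v in re.findall(r"TM-score=\s*([0-9.]+)", tmalign_stdout)]


def run_tmalign(pdb_a, pdb_b, exe="TMalign"):
    """Run TM-align on two structure files and return the reported TM-scores.

    Assumption: `exe` is a TM-align binary installed separately.
    """
    result = subprocess.run(
        [exe, pdb_a, pdb_b], capture_output=True, text=True, check=True
    )
    return parse_tm_scores(result.stdout)


# Hypothetical usage (file names are placeholders):
# scores = run_tmalign("query.pdb", "target.pdb")
# print(scores)  # one TM-score per normalization length
```

TM-align reports one score per chain used for normalization, so for proteins of very different lengths it is worth looking at both values.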

Hope the above answer is helpful to you.

Best.

Rohit-Satyam commented 1 month ago

I have two Plasmodium proteins, O77324 and Q8I341 (AF structures available), for which PLMSearch gives me ~0.9958 similarity and remote homology to Q8I3Z1 (no AF structure). Q8I3Z1 is 10K amino acids long and has no structure, so I am not sure how PLMSearch can say that O77324 and Q8I341 will have the same fold as Q8I3Z1! Just thinking.

Also, if O77324 and Q8I341 are predicted to have the same fold as Q8I3Z1, they must also have the same fold when aligned with each other, right? But their TM-align score is also not good; see here.

maovshao commented 1 month ago

Hi, thank you for sharing. We also tested the protein pair O77324-Q8I341.

Consistent with your findings, the structures of this protein pair are not similar (TM-score < 0.2), yet PLMSearch reports a similarity as high as 0.9952. This is mainly because, when the cosine (COS) similarity between two protein embeddings is particularly high (> 0.995), PLMSearch uses the COS similarity directly as the similarity (explained in the "Method --- Similarity prediction" section of the PLMSearch paper). In other words, the pre-trained model we used (ESM-1b) considers this protein pair similar. But as you can see, the COS similarity between embeddings generated by ESM-1b is not always reliable.
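
The shortcut described above can be sketched roughly as follows. This is only an illustration, not PLMSearch's actual code: the 0.995 threshold comes from this thread, while the function names, the `ss_predictor` callable, and the assumption that the predictor is used whenever the threshold is not exceeded are hypothetical.

```python
import numpy as np

COS_THRESHOLD = 0.995  # threshold quoted in this thread


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def reported_similarity(emb_query, emb_target, ss_predictor):
    """Return the cosine similarity when it exceeds the threshold.

    Assumption: otherwise fall back to a learned predictor, passed in here
    as the hypothetical callable `ss_predictor(emb_query, emb_target)`.
    """
    cos = cosine_similarity(emb_query, emb_target)
    if cos > COS_THRESHOLD:
        return cos  # the predictor is bypassed entirely
    return ss_predictor(emb_query, emb_target)
```

Under this reading, a pair whose embeddings point in almost the same direction never reaches the predictor, which is why a misleading cosine similarity can surface directly as the final score.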

PLMSearch has many ways to avoid such Wrong Pairs, but the O77324-Q8I341 protein pair has the following special features:

  1. No Pfam domain can be detected in either O77324 or Q8I341, which means that the PfamClan module in PLMSearch does not apply to this pair.
  2. The COS Similarity between them is only slightly higher than the threshold (0.995).
  3. The length difference between the two proteins is large (337 vs. 1165 residues, roughly 3.5 times). For speed and memory reasons, PLMSearch uses a protein embedding obtained by averaging the per-residue embeddings, so length information is lost in the process (explained in the "Discussion" section of the PLMSearch paper).
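
To make the last point concrete, here is a small toy sketch of why averaging per-residue embeddings discards length. The data are random numbers, not ESM-1b output, and the embedding dimension and the tiling construction are assumptions chosen purely for illustration.

```python
import numpy as np


def mean_pool(per_residue):
    """Average an (L, D) per-residue embedding matrix into one (D,) vector."""
    return np.asarray(per_residue, dtype=float).mean(axis=0)


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
short = rng.normal(size=(337, 8))    # toy "protein" with 337 residues, D = 8
long = np.tile(short, (3, 1))        # toy "protein" three times as long

pooled_short = mean_pool(short)
pooled_long = mean_pool(long)

# Both pooled vectors have shape (8,), whatever the sequence length was,
# and here they are numerically almost identical.
similarity = cosine_similarity(pooled_short, pooled_long)
```

In this toy case the two pooled vectors are essentially the same even though one input is three times longer, so a similarity computed on them cannot see the length difference.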

In general, the O77324-Q8I341 protein pair is hard to predict. We are conducting further research and will refer to this case. Our new method will do better in these hard cases.

Best.