psipred / Merizo

Fast and accurate protein domain segmentation using Invariant Point Attention
GNU General Public License v3.0

Using Merizo to segment proteins with pure 3D coordinates information #6

Closed Immortals-33 closed 3 months ago

Immortals-33 commented 3 months ago

Dear authors:

Thanks for bringing up this amazing tool and making the code available!

I'm trying to use Merizo to segment some AI-generated proteins. However, some of them lack sequence information, i.e. they carry a dummy sequence filled with all "A"s or "E"s. If I'm not mistaken, the Merizo architecture uses both sequence and structure information, in particular when the single representation, the pairwise information, and the frames are integrated using IPA. I'm therefore a little concerned that the missing sequence information could reduce the precision of the results.

Do you have any experience with such a scenario, e.g. using solely 3D coordinate information to perform domain segmentation with Merizo, or is there a specialized mode for handling such a case?

Thanks in advance!

Best, Zhuoqi

andymlau commented 3 months ago

Hi Zhuoqi,

There isn't a specialised mode to handle that situation, but if I recall correctly, early testing we did showed that both structure and sequence gave the best performance, but structure was more important than sequence features. I think the best thing to do would be to test Merizo out on your generated models to see what happens - even better if you could generate single-AA analogues of known folds. I'd be interested to hear about the outcome as well.

Thanks,

Andy
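
The single-AA analogue test suggested above could be set up with a sketch like the following. It is a minimal illustration, not part of Merizo: the function name `to_poly_ala` is hypothetical, it assumes standard fixed-column PDB `ATOM` records, and it assumes the sequence is taken from the residue names in the file. Side-chain atoms are dropped so that the remaining atoms stay consistent with the new residue name.

```python
# Backbone atoms kept in the analogue; side chains are removed (assumption).
BACKBONE = {"N", "CA", "C", "O"}


def to_poly_ala(pdb_text: str, resname: str = "ALA") -> str:
    """Return PDB text with backbone atoms only and every residue renamed.

    Relies on the fixed-column PDB layout: atom name in columns 13-16,
    residue name in columns 18-20 (0-based slices 12:16 and 17:20).
    Non-ATOM lines (e.g. HETATM, TER, END) are passed through unchanged.
    """
    out = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            atom_name = line[12:16].strip()
            if atom_name not in BACKBONE:
                continue  # drop side-chain atoms
            line = line[:17] + resname.rjust(3) + line[20:]
        out.append(line)
    return "\n".join(out) + "\n"
```

Running Merizo on both the original file and the converted file for a fold with a known domain assignment would then isolate the effect of the sequence while keeping the backbone identical.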

Immortals-33 commented 3 months ago

Hi Andy,

Thanks for the quick response!

Actually, I have already run a few tests on structures with "dummy" sequences. Some of my observations are below:

  1. Structures with dummy sequences. Most of the residues within them are classified as NDRs, and most of the pIoU values are below 0.4. ndom is mostly $0$ or $1$.
  2. AA analogues of these structures. This set shows a significantly higher pIoU as well as higher nres_dom and ndom. I think this is probably the outcome Merizo would be expected to give for normal protein structures, e.g. from PDB or AFDB.

My guess is that this behavior originates from the training process of Merizo, which used both sequence and structure information, making the model sensitive to the lack of either one. Encountering a dummy sequence might confuse the model, which is partially reflected in the low pIoU values, since pIoU mostly makes use of the single representation. Yet for the same reason, I would like to think the actual performance might be better than the pIoU suggests. The fusion of sequence and structure might also help the performance of Merizo compared to purely coordinate-based domain segmentation methods. Does that sound reasonable to you?

Hope these help.

Best, Zhuoqi
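
The comparison between the two sets described above can be made more systematic with a small summary like the one sketched here. It assumes the Merizo results have already been parsed into dictionaries with the keys `pIoU` and `ndom`; the parsing step is omitted because the output layout is not shown in this thread, and the function name `summarise` and the default cutoff argument are illustrative choices (the 0.4 value is the threshold quoted in the observations).

```python
from statistics import mean


def summarise(results, piou_cutoff=0.4):
    """Summarise a non-empty list of per-structure results.

    Each item is assumed to be a dict with a float "pIoU" and an int "ndom".
    """
    n = len(results)
    return {
        "n": n,
        "mean_pIoU": mean(r["pIoU"] for r in results),
        # fraction of structures whose pIoU falls below the cutoff
        "frac_low_pIoU": sum(r["pIoU"] < piou_cutoff for r in results) / n,
        # fraction of structures predicted to have zero or one domain
        "frac_ndom_le_1": sum(r["ndom"] <= 1 for r in results) / n,
    }
```

Calling `summarise` once on the dummy-sequence set and once on the analogue set gives two small tables that can be compared side by side.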

andymlau commented 3 months ago

Hi Zhuoqi,

That's interesting, thanks for sharing.

Do your structures in the first test look like they have domains? Are there any structures with obvious domain-like regions that Merizo fails on? I might be mistaken but it sounds like the difference between the first and second experiment is that in set 2, the models are based on known structures, while that's not the case for set 1. If that's indeed the case, I wonder if performance is bad because the models in set 1 are inherently different to those in set 2, e.g. backbone distance distribution is different, mirror topologies, etc. As you say, because Merizo was trained on natural sequences and crystal structures, it's not surprising that performance takes a hit or does something weird when the inputs are out of distribution. The pIoU being so low is probably a hint that Merizo is very unfamiliar with those models.

The pIoU is actually predicted from the combined single + pair + frames embeddings that come out of the mask decoder. In Figure 1b of the paper, s' is labelled as an updated single representation only because it retains the same shape; it is not just the single representation that was fed into the initial IPA module. Apologies if this wasn't clear from the figure.

Thanks,

Andy
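
One way to look at the backbone distance distribution mentioned above is to measure the distance between consecutive C-alpha atoms, which is close to 3.8 Å for ordinary trans peptide bonds. The sketch below is a rough check only: the function names are hypothetical, it assumes fixed-column PDB `ATOM` records, and it treats the file as a single unbroken chain, so chain breaks or multiple chains would show up as large distances.

```python
import math


def ca_coords(pdb_text):
    """Collect C-alpha coordinates from fixed-column PDB ATOM records."""
    coords = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            # x, y, z occupy columns 31-38, 39-46, 47-54 (0-based 30:38, 38:46, 46:54)
            coords.append(
                (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            )
    return coords


def consecutive_ca_distances(coords):
    """Distances between each C-alpha and the next one in file order."""
    return [math.dist(a, b) for a, b in zip(coords, coords[1:])]
```

Comparing these distances between generated models and natural structures is only a first-pass geometry check; it does not detect mirror topologies.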

Immortals-33 commented 3 months ago

Hi Andy,

Thanks for these responses.

The structures in set 1 mostly consist of structured regions (mainly $\alpha$-helix and $\beta$-sheet, with few loops) and look like proteins with one or two large domains. The structures in set 2 were generated with a combination of a fixed-backbone sequence design method and a structure prediction method, to find a mirror sequence-structure pair for each structure in set 1 (although the similarity could not always be guaranteed). Their sequences are therefore likely artificial rather than natural, i.e. different from the training set of Merizo. However, the performance of Merizo looks pretty reasonable on set 2, which makes me think that it was the absence of sequence information that resulted in the different performance between set 1 and set 2.

As for the domain confidence, I was mistaken about it using "solely the single representation", sorry 'bout that. Given that, I think the pIoU could be a consistent indicator of the quality of the results.

And thanks again for developing this tool. Your work on domain segmentation and discovery is quite intriguing to me. I'll stay tuned for your future work!

Best, Zhuoqi

andymlau commented 3 months ago

Hi Zhuoqi,

Thanks for the clarification. In that case, yes, I think the performance difference you see between set 1 and set 2 makes sense if the set 2 targets have richer sequence information than just poly-A, etc. It's an interesting result, so thank you for sharing this with me.

Best of luck with your experiments! Andy