mheinzinger / ProstT5

Bilingual Language Model for Protein Sequence and Structure
MIT License

ProstT5 for tasks like AMP classification or protein localization. Also protT5 xxl results. #8

Closed IliasGewr closed 8 months ago

IliasGewr commented 10 months ago

First of all, I would like to express my appreciation for your work. I use your models for protein embedding extraction and then use the embeddings in downstream tasks like AMP classification, protein localization, etc. We are currently writing a comparative study of PFMs on different tasks, and ProtT5 seems to be the best among the ones under examination (except in some cases where ESM2t45 outperforms it by a little, but ProtT5 is substantially faster).
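The usual recipe for turning these encoders' per-residue outputs into a fixed-size per-protein feature vector for such downstream classifiers is mean-pooling. A minimal sketch with a dummy array; the 1024-dimensional size matches the public ProtT5-XL encoder, and the function name is purely illustrative:

```python
import numpy as np

def mean_pool(residue_emb: np.ndarray, length: int) -> np.ndarray:
    """Average per-residue embeddings (L x D) over the true sequence
    length, yielding one fixed-size vector per protein that can feed
    a lightweight downstream classifier (AMP, localization, ...)."""
    return residue_emb[:length].mean(axis=0)

# toy per-residue embedding: 50 residues, 1024-d (ProtT5-XL hidden size)
emb = np.random.rand(50, 1024).astype(np.float32)
protein_vec = mean_pool(emb, 50)
print(protein_vec.shape)  # (1024,)
```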

I would like to ask whether you expect the ProstT5 model to perform better than ProtT5-XL on tasks like protein localization. I am going to run some experiments and can share the results if you are interested.

I would also like to ask your opinion on why ProtT5-XXL seems to perform worse than ProtT5-XL. Is it a tuning issue, a lack of sufficient training data, or something else? What do you think?

Best regards,

Elias G. University of Crete - FORTH

mheinzinger commented 10 months ago

Hi Elias, thanks a lot for your detailed explanation and your interest in our work!

On your question: from what I saw, I expect ProstT5 to outperform the original ProtT5 on tasks that are heavily related to structure, such as CATH/SCOPe topology prediction or secondary/3D structure prediction. For tasks where structure is less crucial, such as subcellular localization prediction, I do not expect ProstT5 to outperform ProtT5. In general, ProstT5 embeddings become uninformative once there is no fixed/rigid structure (either from PDB/ColabFold or our 3Di predictor).

That being said, what I observed (and what might tie in to your statement about ProtT5's speed) is that you can easily get "the best of both worlds" (sequence and structure embeddings), and potentially compensate for a lack of rigid structure, simply by concatenating the embeddings from ProtT5 and ProstT5. This doubles the runtime, but depending on your use case it might pay off.
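The concatenation idea above can be sketched as follows. This is a toy illustration with random arrays, not code from the repository; the 1024-dimensional per-protein vectors are an assumption based on the public XL encoders:

```python
import numpy as np

def concat_embeddings(seq_emb: np.ndarray, struct_emb: np.ndarray) -> np.ndarray:
    """Concatenate per-protein embeddings from a sequence model (ProtT5)
    and a structure-aware model (ProstT5) along the feature axis.

    Both inputs are mean-pooled per-protein matrices of shape
    (n_proteins, D); the result has shape (n_proteins, 2 * D) and can
    be fed to any downstream classifier."""
    if seq_emb.shape[0] != struct_emb.shape[0]:
        raise ValueError("embeddings must describe the same set of proteins")
    return np.concatenate([seq_emb, struct_emb], axis=-1)

# toy example: 3 proteins, one 1024-d vector from each model
prot_t5 = np.random.rand(3, 1024).astype(np.float32)
prost_t5 = np.random.rand(3, 1024).astype(np.float32)
combined = concat_embeddings(prot_t5, prost_t5)
print(combined.shape)  # (3, 2048)
```

The doubled runtime mentioned above comes from running both encoders; the concatenation itself is negligible.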

Regarding ProtT5-XXL: our assumption is that this model did not see enough samples given its number of parameters. This relationship has been shown multiple times, e.g. here: https://arxiv.org/pdf/2001.08361.pdf . Making a definitive statement would probably require some (super expensive) experiments, so we simply accepted this as a working hypothesis.
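For readers following the linked reference: Kaplan et al. model test loss as a joint power law in parameter count $N$ and dataset size $D$ (the symbols below come from that paper, not from this thread):

```latex
L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}
```

When $D$ is small relative to $N$, the $D_c/D$ term dominates and the larger model's capacity goes unused, which is consistent with the working hypothesis above.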

Really curious to see your final benchmark, feel free to ping me once you have something that you can share :)