mheinzinger / ProstT5

Bilingual Language Model for Protein Sequence and Structure

Strange results for AA2fold translation #30

CalvinRusley opened this issue 3 months ago

CalvinRusley commented 3 months ago

Hi! First of all, thank you for making a fantastic tool!

Second, after working exclusively with the embedding outputs for a while, I've started playing around with the translation feature of ProstT5. However, the results are somewhat confusing. For example, translating the protein GCF_000019165_1_000000000001_1122_cterminaldomain, which is 298 residues long, yields a 3Di ("fold") output much longer than the input sequence:

dvvvvcpdpvnvvvvvvvvvcvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvppddpvvcvvvvddddpvvvvvvvvvvvvvvvvvvvvvvvvvpdpddpvvvvvvvvcvvvvvddfpdppppgddqpdpvsvvvvvvvvvcvvvvvvvvvvvvvvvvvvcvvcvvvvvvvvvvvvvvcvvpvcpvvdpvvvvvvvvvlvvllvvlvvpddpvlvvllvvlvvllvvlvvlvpddqvvqqvvlvvvqvvqvvvdadpvggddsggdgsvvsnvvsvvsnvvsvvsnsvsssvdrdddppdddddddpvvvvvvvvvvpdpvvvpdpvvvvvvvvvvvvvvvvvvvpddpvvvvvvvvvvvvvvvvvvvvvddpvvvvvvvvvvvvvcvvvdppppdddddddddpvrvvvvvvvvvvvvvvvvvvvvvvdGdddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd

Also of note: the output contains an uppercase character (the lone 'G' near the end), even though 3Di strings should be entirely lowercase.

Is this output normal, and if so, could you offer some guidance as to its interpretation?
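(For context, a minimal sketch of an AA→3Di translation call through the Hugging Face generate API; the generation settings and variable names here are illustrative, not the repo's exact script. If `min_length`/`max_length` are not pinned to the input length, the decoder is free to emit an output longer than the input, which is one plausible source of the length mismatch above:)

```python
import re
import torch
from transformers import T5Tokenizer, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load ProstT5 as a seq2seq model from the Hugging Face Hub.
tokenizer = T5Tokenizer.from_pretrained("Rostlab/ProstT5", do_lower_case=False)
model = AutoModelForSeq2SeqLM.from_pretrained("Rostlab/ProstT5").to(device)

aa_seq = "MKTAYIAKQR"  # placeholder amino-acid sequence
# ProstT5 expects rare residues mapped to X, residues space-separated,
# and the "<AA2fold>" prefix for the AA -> 3Di direction.
prepared = "<AA2fold> " + " ".join(re.sub(r"[UZOB]", "X", aa_seq))

ids = tokenizer(prepared, return_tensors="pt").to(device)
seq_len = len(aa_seq)

# Pinning min/max length to the input length keeps the 3Di string the same
# length as the protein; without this, generation can run long (illustrative).
out = model.generate(
    ids.input_ids,
    attention_mask=ids.attention_mask,
    min_length=seq_len,
    max_length=seq_len + 1,
    do_sample=False,
)
three_di = tokenizer.decode(out[0], skip_special_tokens=True).replace(" ", "")
print(three_di)
```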

CalvinRusley commented 3 months ago

Update: I've tried the same set of sequences with translate.py, which returns output of the correct length and with no uppercase characters, but the long stretches of "d" and "v" persist:

dvvvvcpppvnvvvvvvvvvcvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvcvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvpddddddddddddddddddddddddddddddddpvvvvvvvvvvvcvvvvvvvvvvvvvvvvvvcvvcvvvvvvvvvvvvvvcvvpvvvvddpvvvvvvvvvvvvvvvvvvvpddpvlvvllvvlvvllvvlvvlvpddqvvqqvvqvvvlvvcvvvvnnvvsvvddggdgsvvsnvvsvvsnvvsvvsnvvsvvvd
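(To put numbers on the d/v dominance, one can tally the predicted 3Di states; a quick illustrative check, not part of the repo's scripts:)

```python
from collections import Counter

pred = "dvvvvcpppvn"  # the translate.py output above, truncated here
counts = Counter(pred)
total = len(pred)
# Print each 3Di state's share of the prediction, most common first.
for state, n in counts.most_common():
    print(f"{state}: {n} ({n / total:.1%})")
```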

CalvinRusley commented 3 months ago

After some more digging, I've realized that your answer in this issue is pertinent here. Would balancing the 3Di classes and retraining the CNN be feasible?

mheinzinger commented 3 months ago

Thanks for digging, and sorry for the delayed response. Yes, balancing the 3Di classes and retraining is feasible, but I have to admit it would probably take quite some time before we get to it on our end (sorry).

Depending on how urgent this is, you could also try a quick hack that does not require fine-tuning the actual pLM. My colleague Joa put some nice documentation here on how to train a CNN on top of ProstT5 for predicting 3Di from AA-only: https://github.com/mheinzinger/ProstT5/issues/31

You could take the same dataset, invert the input/output (going from AA to 3Di via a CNN trained on top of the encoder), and apply the balancing there (some weighting of the loss function, or some upsampling of rare cases). A sketch of the idea follows below. Hope this helps; good luck!
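(A minimal PyTorch sketch of that idea. The names and dimensions are assumptions: per-residue ProstT5 encoder embeddings of size 1024 as input, 20 lowercase 3Di states as targets, and a two-layer CNN head that only loosely mirrors the one in the linked issue; the inverse-frequency weighting is one common way to do the balancing mentioned above:)

```python
import torch
import torch.nn as nn

NUM_3DI_STATES = 20  # assumption: 20 lowercase 3Di letters
EMB_DIM = 1024       # assumption: ProstT5 encoder hidden size

class ThreeDiCNN(nn.Module):
    """Small per-residue CNN head mapping encoder embeddings to 3Di logits."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # Convolve over the length dimension; kernel 7 mixes local context.
            nn.Conv1d(EMB_DIM, 32, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Conv1d(32, NUM_3DI_STATES, kernel_size=7, padding=3),
        )

    def forward(self, emb):           # emb: (batch, length, EMB_DIM)
        x = emb.permute(0, 2, 1)      # -> (batch, EMB_DIM, length) for Conv1d
        return self.net(x)            # -> (batch, NUM_3DI_STATES, length)

# Class balancing: weight each 3Di state by inverse frequency in the
# training set (class_counts is a hypothetical per-state count tensor).
class_counts = torch.ones(NUM_3DI_STATES)  # replace with real counts
weights = class_counts.sum() / (NUM_3DI_STATES * class_counts)
loss_fn = nn.CrossEntropyLoss(weight=weights)

model = ThreeDiCNN()
emb = torch.randn(2, 298, EMB_DIM)                   # dummy embedding batch
labels = torch.randint(0, NUM_3DI_STATES, (2, 298))  # dummy 3Di labels
loss = loss_fn(model(emb), labels)  # logits (B, C, L) vs. labels (B, L)
loss.backward()
```

(Upsampling rare classes, the other option mentioned, would instead happen at the dataset level, e.g. by drawing proteins rich in rare 3Di states more often; the weighted loss above is usually the simpler first attempt.)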