Open CalvinRusley opened 3 months ago
Update: I've tried the same set of sequences with translate.py, which returns output of the correct length and without any uppercase characters, but the long stretches of "d" and "v" persist:
dvvvvcpppvnvvvvvvvvvcvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvcvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvpddddddddddddddddddddddddddddddddpvvvvvvvvvvvcvvvvvvvvvvvvvvvvvvcvvcvvvvvvvvvvvvvvcvvpvvvvddpvvvvvvvvvvvvvvvvvvvpddpvlvvllvvlvvllvvlvvlvpddqvvqqvvqvvvlvvcvvvvnnvvsvvddggdgsvvsnvvsvvsnvvsvvsnvvsvvvd
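To quantify the skew, a quick frequency count over the translated string (plain Python, nothing ProstT5-specific; paste in the full output to reproduce):

```python
# Count how often each 3Di state appears in the translated sequence,
# to quantify the over-representation of "d" and "v".
from collections import Counter

tdi = "dvvvvcpppvnvvvvvvvvvc"  # replace with the full translate.py output above
counts = Counter(tdi)
total = len(tdi)
for state, n in counts.most_common():
    print(f"{state}: {n:4d} ({n / total:.1%})")
```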
After some more digging, I've realized that your answer in this issue is pertinent here. Would balancing the 3Di classes and retraining the CNN be feasible?
Thanks for digging, and sorry for the delayed response. Yes, balancing the 3Di classes and retraining is feasible, but I have to admit that it would probably take quite some time before we get to it on our end (sorry). Depending on how urgent it is, you could also try a quick hack that does not require finetuning the actual pLM: my colleague Joa put some nice documentation on how to train a CNN on top of ProstT5 for predicting 3Di from AA-only here: https://github.com/mheinzinger/ProstT5/issues/31 You could take the same dataset, invert the input/output (going from AA to 3Di via a CNN trained on top of the encoder), and apply the balancing there (some weighting of the loss function or some upsampling of rare cases). Hope this helps; good luck!
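For the loss-weighting variant, a minimal sketch (assuming a PyTorch CNN head over per-residue ProstT5 encoder embeddings; the number of 3Di classes, class counts, kernel sizes, and shapes here are illustrative, not the values from issue #31):

```python
# Minimal sketch: class-weighted cross-entropy for a CNN head mapping
# per-residue ProstT5 encoder embeddings to 3Di states. The class counts
# below are placeholders; compute them from your own training set.
import torch
import torch.nn as nn

NUM_3DI_CLASSES = 20
EMB_DIM = 1024  # ProstT5 encoder hidden size

# Per-class counts from the training labels (placeholder values).
class_counts = torch.ones(NUM_3DI_CLASSES)
# Inverse-frequency weights: rare states get larger weights.
weights = class_counts.sum() / (NUM_3DI_CLASSES * class_counts)

cnn_head = nn.Sequential(
    nn.Conv1d(EMB_DIM, 256, kernel_size=7, padding=3),
    nn.ReLU(),
    nn.Conv1d(256, NUM_3DI_CLASSES, kernel_size=7, padding=3),
)

# Up-weight rare 3Di states so training is not dominated by "d"/"v".
loss_fn = nn.CrossEntropyLoss(weight=weights, ignore_index=-100)

# emb: (batch, seq_len, EMB_DIM) from the frozen encoder; labels: (batch, seq_len)
emb = torch.randn(2, 298, EMB_DIM)
labels = torch.randint(0, NUM_3DI_CLASSES, (2, 298))
logits = cnn_head(emb.transpose(1, 2))  # (batch, classes, seq_len)
loss = loss_fn(logits, labels)
```

Upsampling sequences rich in rare states before training achieves a similar effect without touching the loss function.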
Hi! First of all, thank you for making a fantastic tool!
Second, after working exclusively with the embedding outputs for a while, I've started playing around with the translation feature of ProstT5. However, the results are somewhat confusing. For example, embedding the protein GCF_000019165_1_000000000001_1122_cterminaldomain, of length 298, yields a fold (3Di) result much longer than the input sequence:
Also of note:
Is this output normal, and if so, could you offer some guidance as to its interpretation?
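For reference, this is roughly the kind of call I mean, a sketch based on my reading of the Rostlab/ProstT5 HuggingFace example (the sampling parameters are illustrative; constraining min_length/max_length to the input length is what I would expect to keep the 3Di output the same length as the input):

```python
# Sketch of AA -> 3Di translation with the HuggingFace ProstT5 checkpoint.
# The "<AA2fold>" prefix and space-separated residues follow the model card;
# sampling parameters are illustrative, not necessarily the repo defaults.
import re
import torch
from transformers import T5Tokenizer, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = T5Tokenizer.from_pretrained("Rostlab/ProstT5", do_lower_case=False)
model = AutoModelForSeq2SeqLM.from_pretrained("Rostlab/ProstT5").to(device)

seq = "MKT"  # replace with the full 298-residue amino-acid sequence
seq = " ".join(re.sub(r"[UZOB]", "X", seq))  # map rare residues to X
batch = tokenizer("<AA2fold> " + seq, return_tensors="pt").to(device)

# Tie generation length to the input length so the 3Di string cannot
# grow past the number of residues.
seq_len = len(seq.split())
with torch.no_grad():
    out = model.generate(
        **batch,
        min_length=seq_len,
        max_length=seq_len + 1,
        do_sample=True,
        top_p=0.95,
        temperature=1.2,
        repetition_penalty=1.2,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True).replace(" ", ""))
```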