Open CalvinRusley opened 3 months ago
Update: I've tried the same set of sequences with translate.py, which returns output of the correct length and without any uppercase characters, but the long stretches of "d" and "v" persist:
dvvvvcpppvnvvvvvvvvvcvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvcvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvpddddddddddddddddddddddddddddddddpvvvvvvvvvvvcvvvvvvvvvvvvvvvvvvcvvcvvvvvvvvvvvvvvcvvpvvvvddpvvvvvvvvvvvvvvvvvvvpddpvlvvllvvlvvllvvlvvlvpddqvvqqvvqvvvlvvcvvvvnnvvsvvddggdgsvvsnvvsvvsnvvsvvsnvvsvvvd
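To quantify the skew, a quick frequency count over the translated string (plain Python, nothing ProstT5-specific; paste in the full output to reproduce):

```python
# Count how often each 3Di state appears in the translated sequence,
# to quantify the over-representation of "d" and "v".
from collections import Counter

tdi = "dvvvvcpppvnvvvvvvvvvc"  # replace with the full translate.py output above
counts = Counter(tdi)
total = len(tdi)
for state, n in counts.most_common():
    print(f"{state}: {n:4d} ({n / total:.1%})")
```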
After some more digging, I've realized that your answer in this issue is pertinent here. Would balancing the 3Di classes and retraining the CNN be feasible?
Thanks for digging, and sorry for the delayed response. Yes, balancing the 3Di classes and retraining is feasible, but I have to admit that it would probably take quite some time before we get to it on our end (sorry). Depending on how urgent it is, you could also try a quick hack that does not require finetuning the actual pLM: my colleague Joa put some nice documentation on how to train a CNN on top of ProstT5 for predicting 3Di from AA-only here: https://github.com/mheinzinger/ProstT5/issues/31 You could take the same dataset, invert the input/output (going from AA to 3Di via a CNN trained on top of the encoder), and apply the balancing there (some weighting of the loss function or some upsampling of rare cases). Hope this helps; good luck!
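For the loss-weighting variant, a minimal sketch (assuming a PyTorch CNN head over per-residue ProstT5 encoder embeddings; the number of 3Di classes, class counts, kernel sizes, and shapes here are illustrative, not the values from issue #31):

```python
# Minimal sketch: class-weighted cross-entropy for a CNN head mapping
# per-residue ProstT5 encoder embeddings to 3Di states. The class counts
# below are placeholders; compute them from your own training set.
import torch
import torch.nn as nn

NUM_3DI_CLASSES = 20
EMB_DIM = 1024  # ProstT5 encoder hidden size

# Per-class counts from the training labels (placeholder values).
class_counts = torch.ones(NUM_3DI_CLASSES)
# Inverse-frequency weights: rare states get larger weights.
weights = class_counts.sum() / (NUM_3DI_CLASSES * class_counts)

cnn_head = nn.Sequential(
    nn.Conv1d(EMB_DIM, 256, kernel_size=7, padding=3),
    nn.ReLU(),
    nn.Conv1d(256, NUM_3DI_CLASSES, kernel_size=7, padding=3),
)

# Up-weight rare 3Di states so training is not dominated by "d"/"v".
loss_fn = nn.CrossEntropyLoss(weight=weights, ignore_index=-100)

# emb: (batch, seq_len, EMB_DIM) from the frozen encoder; labels: (batch, seq_len)
emb = torch.randn(2, 298, EMB_DIM)
labels = torch.randint(0, NUM_3DI_CLASSES, (2, 298))
logits = cnn_head(emb.transpose(1, 2))  # (batch, classes, seq_len)
loss = loss_fn(logits, labels)
```

Upsampling sequences rich in rare states before training achieves a similar effect without touching the loss function.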
Hi! First of all, thank you for making a fantastic tool!
Second, after working exclusively with the embedding outputs for a while, I've started playing around with the translation feature of ProstT5. However, the results are somewhat confusing. For example, embedding the protein GCF_000019165_1_000000000001_1122_cterminaldomain, of length 298, yields a fold (3Di) result much longer than the input sequence:
Also of note:
Is this output normal, and if so, could you offer some guidance as to its interpretation?
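For reference, this is roughly the kind of call I mean, a sketch based on my reading of the Rostlab/ProstT5 HuggingFace example (the sampling parameters are illustrative; constraining min_length/max_length to the input length is what I would expect to keep the 3Di output the same length as the input):

```python
# Sketch of AA -> 3Di translation with the HuggingFace ProstT5 checkpoint.
# The "<AA2fold>" prefix and space-separated residues follow the model card;
# sampling parameters are illustrative, not necessarily the repo defaults.
import re
import torch
from transformers import T5Tokenizer, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = T5Tokenizer.from_pretrained("Rostlab/ProstT5", do_lower_case=False)
model = AutoModelForSeq2SeqLM.from_pretrained("Rostlab/ProstT5").to(device)

seq = "MKT"  # replace with the full 298-residue amino-acid sequence
seq = " ".join(re.sub(r"[UZOB]", "X", seq))  # map rare residues to X
batch = tokenizer("<AA2fold> " + seq, return_tensors="pt").to(device)

# Tie generation length to the input length so the 3Di string cannot
# grow past the number of residues.
seq_len = len(seq.split())
with torch.no_grad():
    out = model.generate(
        **batch,
        min_length=seq_len,
        max_length=seq_len + 1,
        do_sample=True,
        top_p=0.95,
        temperature=1.2,
        repetition_penalty=1.2,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True).replace(" ", ""))
```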