mheinzinger / ProstT5

Bilingual Language Model for Protein Sequence and Structure
MIT License
147 stars 13 forks

Translation result is unexpected! #13

Closed (darkcorvushhh closed this issue 1 month ago)

darkcorvushhh commented 3 months ago

[Screenshot 2024-03-09 121846]

Hello, I used the pre-trained ProstT5 model to convert a protein sequence into 3Di, but the result is full of "d" and "p". If the sequence is short, the output contains only d and p; if the sequence is long, the beginning of the output is all d and p. What is the reason for this?

mheinzinger commented 2 months ago

Hi, yes, this is unfortunately the case. There is quite some class imbalance within the original 3Di states, with "d" and "p" being the 2nd and 3rd most abundant classes (see Fig. S1D here: https://www.biorxiv.org/content/10.1101/2023.07.23.550085v2.supplementary-material). Those 3Di states seem to encode mostly loopy/other regions without helix/strand content (see Fig. S1B here: https://www.biorxiv.org/content/10.1101/2023.07.23.550085v2.supplementary-material). So ProstT5 picked up the underlying bias in the 3Di states and might even make it a bit worse by always predicting stretches of d or p whenever it cannot assign a residue to either helix or sheet.
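To see how strongly a given prediction is skewed towards "d" and "p", one could count the 3Di states in the output string. The following is only a minimal sketch and not part of ProstT5: it assumes the prediction is available as a plain string of 3Di letters, and the helper names `state_frequencies` and `leading_run_length` as well as the example string are made up for illustration.

```python
from collections import Counter


def state_frequencies(three_di: str) -> dict[str, float]:
    """Relative frequency of each 3Di state, most abundant first."""
    seq = three_di.strip().lower()
    if not seq:
        return {}
    counts = Counter(seq)
    return {state: n / len(seq) for state, n in counts.most_common()}


def leading_run_length(three_di: str, states: str = "dp") -> int:
    """Length of the initial stretch that consists only of `states`."""
    run = 0
    for ch in three_di.strip().lower():
        if ch not in states:
            break
        run += 1
    return run


if __name__ == "__main__":
    # Hypothetical prediction, not real model output.
    predicted = "ddddpppddvlvvlcvvsvd"
    freqs = state_frequencies(predicted)
    dp_share = freqs.get("d", 0.0) + freqs.get("p", 0.0)
    print(f"share of d/p: {dp_share:.2f}")
    print(f"leading d/p run: {leading_run_length(predicted)} residues")
```

A high d/p share combined with a long leading run would match the behaviour described above, where regions without clear helix/strand content collapse into stretches of d or p.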