Translation result is unexpected!

mheinzinger / ProstT5

Bilingual Language Model for Protein Sequence and Structure

MIT License

147 stars, 13 forks

Hi, yes, unfortunately there is quite some class imbalance within the original 3Di states, with "d" and "p" being the 2nd and 3rd most abundant classes (see Fig. S1D here: https://www.biorxiv.org/content/10.1101/2023.07.23.550085v2.supplementary-material). Those 3Di states seem to encode mostly loopy/other regions without helix/strand content (see Fig. S1B here: https://www.biorxiv.org/content/10.1101/2023.07.23.550085v2.supplementary-material). So ProstT5 picked up the underlying bias in the 3Di states and might even make it a bit worse by always predicting stretches of d or p when it cannot assign a residue to either helix or sheet.
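If you want to check how strongly this bias shows up in your own predictions, something like the following rough sketch may help. It is not part of ProstT5; the function names `state_frequencies` and `long_runs`, the default `min_len`, and the assumption that a predicted 3Di sequence is available as a plain string of one-letter states are all mine, for illustration only.

```python
from collections import Counter
from itertools import groupby


def state_frequencies(seq_3di):
    """Return (state, relative frequency) pairs, most common state first."""
    counts = Counter(seq_3di.lower())
    total = sum(counts.values())
    return [(state, n / total) for state, n in counts.most_common()]


def long_runs(seq_3di, states=("d", "p"), min_len=10):
    """Return (start, length, state) for each run of one of `states`
    that is at least `min_len` residues long (0-based start index)."""
    runs = []
    pos = 0
    for state, group in groupby(seq_3di.lower()):
        length = len(list(group))
        if state in states and length >= min_len:
            runs.append((pos, length, state))
        pos += length
    return runs


if __name__ == "__main__":
    # Hypothetical predicted 3Di string, made up for this example.
    predicted = "vvlvc" + "d" * 12 + "lvvc" + "p" * 3
    print(state_frequencies(predicted))
    print(long_runs(predicted))
```

A high share of d/p together with long uninterrupted runs would be consistent with the behaviour described above, although it cannot tell you on its own whether the region is genuinely loopy or just poorly predicted.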
