mheinzinger / ProstT5

Bilingual Language Model for Protein Sequence and Structure
MIT License

translate.py with default settings generates long stretch of simple repeats #27

Open yzhang-github-pub opened 4 days ago

yzhang-github-pub commented 4 days ago

The query protein is 125 aa long. Following the README, I ran translate.py to convert the amino acid sequence to 3Di first, and then to convert the 3Di back to an amino acid sequence. The amino acid counts are shown below:

[image: amino acid counts]

The query is a natural protein sequence and it does look natural. The generated sequence is dominated by a long stretch of "AP" repeats.

Please advise which parameters to adjust. Thanks.
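For anyone reproducing this report, the composition check and the "AP" stretch described above can be quantified with a few lines of plain Python. This is only a minimal sketch and not part of ProstT5 or translate.py: the function names `composition` and `longest_tandem_repeat` and the `max_unit` cutoff are made up for illustration.

```python
from collections import Counter


def composition(seq: str) -> dict[str, float]:
    """Return the fraction of each residue, most frequent first."""
    counts = Counter(seq.upper())
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.most_common()}


def longest_tandem_repeat(seq: str, max_unit: int = 4) -> tuple[str, int, int]:
    """Find the longest run of a back-to-back repeated unit of 1..max_unit residues.

    Returns (unit, start_index, number_of_copies), or ("", 0, 0) if no unit
    occurs at least twice in a row. max_unit is an illustrative cutoff.
    """
    seq = seq.upper()
    best = ("", 0, 0)
    best_len = 0
    for unit_len in range(1, max_unit + 1):
        for start in range(len(seq) - unit_len + 1):
            unit = seq[start:start + unit_len]
            copies = 1
            pos = start + unit_len
            # Extend the run while the next window equals the unit.
            while seq[pos:pos + unit_len] == unit:
                copies += 1
                pos += unit_len
            # Keep the run covering the most residues (first one wins ties).
            if copies >= 2 and copies * unit_len > best_len:
                best = (unit, start, copies)
                best_len = copies * unit_len
    return best


if __name__ == "__main__":
    generated = "MKTAPAPAPAPAPGL"  # toy sequence, not the one from this issue
    print(composition(generated))
    print(longest_tandem_repeat(generated))
```

Running the same two functions on the query and on the generated sequence gives a side-by-side view of how much of the output is taken up by a single repeated unit.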

mheinzinger commented 4 days ago

Hi, would you mind going back one step: generate the 3Di string for your query from a predicted structure (e.g. AF2) and use that to generate an amino acid sequence? In the past, I have observed that predicted 3Di can be very repetitive. There is some class imbalance in the 3Di alphabet, which leads our predictor to predict mostly the few frequent 3Di tokens. This might not be that problematic if you use the resulting 3Di string for remote homology detection, since multiple 3Di tokens code for e.g. helical sub-structures, but it can hurt in more fine-grained tasks such as inverse folding. So by inputting 3Di derived from a 3D structure, you already remove one potential source of error.

Other than that, I have put in some config (https://github.com/mheinzinger/ProstT5/blob/main/scripts/translate.py#L39) that proved useful when I benchmarked the generation. While it was useful on average for my test set, it might behave differently on yours, so this is something you simply need to play around with a bit.

Finally, you can always cross-check whether the sequence (even if it looks unrealistic to you) can be folded by a 3D structure predictor.
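One way to check whether the predicted 3Di string is the repetitive step is to compare how evenly it uses the 3Di alphabet against a 3Di string derived from a structure. The sketch below is an illustration only and not part of ProstT5: the helper names `shannon_entropy` and `top_fraction` are invented here, and no threshold for "too repetitive" is implied by the thread.

```python
import math
from collections import Counter


def shannon_entropy(tokens: str) -> float:
    """Shannon entropy (in bits) of the token distribution in a string."""
    counts = Counter(tokens.lower())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def top_fraction(tokens: str, k: int = 3) -> float:
    """Fraction of positions covered by the k most frequent tokens."""
    counts = Counter(tokens.lower())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(k)) / total


if __name__ == "__main__":
    # Toy strings for illustration; replace them with the predicted 3Di and
    # the 3Di derived from a structure (e.g. an AF2 model) for the same query.
    predicted_3di = "dddddddddvvvvvvvvvdddddddd"
    structure_3di = "dpvlvvccvvsnpqadfwikgaelrd"
    for name, s in [("predicted", predicted_3di), ("structure", structure_3di)]:
        print(name, round(shannon_entropy(s), 2), round(top_fraction(s), 2))
```

For a 20-letter alphabet the entropy is at most log2(20), about 4.32 bits. A predicted string with much lower entropy and a much higher top-token fraction than the structure-derived one would be consistent with the class-imbalance effect described above.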