translate.py with default settings generates long stretches of simple repeats

Hi, would you mind going back one step: derive the 3Di string for your query from a predicted structure (e.g. AF2) and use that as input to generate the amino-acid sequence? In the past, I have observed that predicted 3Di can be very repetitive. There is some class imbalance in the 3Di alphabet, which leads our predictor to output mostly the few most frequent 3Di tokens. This is probably not much of a problem if you use the resulting 3Di string for remote homology detection, since multiple 3Di tokens code for, e.g., helical sub-structures, but it can hurt in more fine-grained tasks such as inverse folding. By inputting 3Di derived from an actual 3D structure, you remove one potential source of error.

Other than that, I have put in some config (https://github.com/mheinzinger/ProstT5/blob/main/scripts/translate.py#L39) that proved useful when I benchmarked the generation. While it was useful on average for my test set, it might behave differently on yours, so this is something you simply need to play around with a bit.

Finally, you can always cross-check whether the generated sequence (even if it looks unrealistic to you) can be folded by a 3D structure predictor.
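If it helps to put a number on "repetitive" while comparing inputs or settings, here is a minimal sketch in plain Python. It is not part of ProstT5: the function name `repeat_stats` and the toy strings are made up for illustration. It reports the Shannon entropy of the token distribution, the share of the most frequent token, and the longest run of a single token, and it works on a 3Di string or an amino-acid sequence alike.

```python
from collections import Counter
from itertools import groupby
from math import log2


def repeat_stats(seq: str) -> dict:
    """Summarise how repetitive a 3Di or amino-acid string is.

    Hypothetical helper for illustration only; not part of ProstT5.
    """
    if not seq:
        raise ValueError("sequence must not be empty")

    counts = Counter(seq)
    n = len(seq)

    # Shannon entropy in bits: low values mean a few tokens dominate.
    entropy = -sum((c / n) * log2(c / n) for c in counts.values())

    # Longest run of one identical token, e.g. "dvvvvq" -> ("v", 4).
    runs = ((token, sum(1 for _ in group)) for token, group in groupby(seq))
    run_token, run_length = max(runs, key=lambda item: item[1])

    top_token, top_count = counts.most_common(1)[0]
    return {
        "length": n,
        "entropy_bits": entropy,
        "top_token": top_token,
        "top_fraction": top_count / n,
        "longest_run_token": run_token,
        "longest_run": run_length,
    }


if __name__ == "__main__":
    # Toy strings, invented for this example (not real model output).
    predicted_3di = "dddddddddvvvvvvvvvvvvvvvvvvvvvdddd"
    structure_3di = "dpqvlcvvlvvvcvvpdhdpvnsvvsnvvcvvvd"
    for name, s in [("predicted", predicted_3di), ("structure", structure_3di)]:
        print(name, repeat_stats(s))
```

Running the same function on the 3Di you feed in and on the amino-acid sequence you get out for each setting you try gives a quick, if crude, way to see whether a change actually reduces the repeats. It does not replace the folding cross-check.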

