mheinzinger / ProstT5

Bilingual Language Model for Protein Sequence and Structure
MIT License
179 stars, 15 forks

Limit AA alphabet when generating from 3Di rep #28

Open neuwirtter opened 3 months ago

neuwirtter commented 3 months ago

Hi,

I would like to use your tool to generate versions of existing proteins that lack one amino acid in their sequence (i.e. using an alphabet of 19 amino acids). Do you think it is possible to limit the alphabet somehow when generating a sequence from the 3Di representation?

Thank you in advance,

Tereza

mheinzinger commented 2 months ago

Hi Tereza,

thanks a lot for your interest in our method. That is absolutely doable, and I have already had good results using this to avoid generating e.g. glycine or alanine. You can simply expand the token_ids passed via bad_words_ids here: https://github.com/mheinzinger/ProstT5/blob/main/scripts/translate.py#L196 (just add the AAs you want to exclude, and the model should no longer produce them).

Best,

Michael
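For readers following along, here is a minimal sketch of how the extra entries could be built. It is not taken from translate.py: the helper name build_bad_words_ids and the toy vocabulary are hypothetical, and the way residues are spelled as tokens in the real tokenizer is an assumption that should be checked against tokenizer.get_vocab() before use. The only library behaviour relied on is that Hugging Face generate() accepts bad_words_ids as a list of token-ID lists.

```python
from typing import Iterable, List, Mapping


def build_bad_words_ids(
    vocab: Mapping[str, int], banned_tokens: Iterable[str]
) -> List[List[int]]:
    """Map each banned token to a one-element ID list, skipping duplicates.

    Hypothetical helper: the list-of-lists shape matches what Hugging Face
    `generate(bad_words_ids=...)` expects, one inner list per banned token.
    """
    bad_words_ids: List[List[int]] = []
    seen = set()
    for token in banned_tokens:
        if token not in vocab:
            # Fail loudly rather than silently ignoring a misspelled token.
            raise KeyError(f"token {token!r} is not in the vocabulary")
        token_id = vocab[token]
        if token_id not in seen:
            seen.add(token_id)
            bad_words_ids.append([token_id])
    return bad_words_ids


# Toy vocabulary for illustration only; real IDs come from the tokenizer,
# and real token strings may carry a prefix (inspect the vocabulary first).
toy_vocab = {"A": 3, "L": 4, "G": 5, "V": 6}
extra_bad_words = build_bad_words_ids(toy_vocab, ["G", "A"])
print(extra_bad_words)

# Assumed usage with a loaded model and tokenizer (not executed here):
#   vocab = tokenizer.get_vocab()
#   extra = build_bad_words_ids(vocab, [<tokens for the residues to avoid>])
#   model.generate(..., bad_words_ids=existing_bad_words_ids + extra)
```

Appending to the existing list, rather than replacing it, keeps whatever tokens the script already blocks. It is also worth checking the generated sequences afterwards to confirm the excluded residues really are absent.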