novel protein sequence generation

hugohrban / ProGen2-finetuning

Finetuning ProGen2 protein language model for generation of protein sequences from selected protein families.

BSD 3-Clause "New" or "Revised" License

Hi, thanks for your question!

Finding the closest Pfam family to your dataset would be one way to go, but it is definitely possible to train on arbitrary sets of protein sequences as well. You just need to preprocess your train and test data files so that each line starts with a special token, such as <|YOUR_TAG|>. Other than that, I think finetuning and sampling should work as expected. I updated the scripts to accept any special token (still just one special token per sequence). See commit f089a3a and the similar issue #3.
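For illustration, here is a minimal sketch of that preprocessing step. It assumes your sequences are already in a plain text file with one sequence per line; the function names `tag_sequences` and `tag_file` and the file names in the usage note are hypothetical and not part of the repo's scripts. It only prepends the token, so check the repo's own data preparation script for any other formatting the training code may expect.

```python
from pathlib import Path


def tag_sequences(lines, tag="<|YOUR_TAG|>"):
    """Return the non-empty lines, each starting with exactly one copy of `tag`."""
    tagged = []
    for line in lines:
        seq = line.strip()
        if not seq:
            # Skip blank lines so every output line is a tagged sequence.
            continue
        if seq.startswith(tag):
            # Already tagged: keep as-is to avoid a doubled token.
            tagged.append(seq)
        else:
            tagged.append(tag + seq)
    return tagged


def tag_file(src, dst, tag="<|YOUR_TAG|>"):
    """Read sequences from `src` (one per line) and write tagged lines to `dst`."""
    src_lines = Path(src).read_text().splitlines()
    Path(dst).write_text("\n".join(tag_sequences(src_lines, tag)) + "\n")
```

You would run something like `tag_file("my_train_raw.txt", "my_train.txt")` once for the train file and once for the test file, using the same token for both.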

Let me know if I can help in any other way.

novel protein sequence generation #4