hugohrban / ProGen2-finetuning

Finetuning ProGen2 protein language model for generation of protein sequences from selected protein families.
BSD 3-Clause "New" or "Revised" License

novel protein sequence generation #4

Open hanfengming2004 opened 1 week ago

hanfengming2004 commented 1 week ago

If my protein does not belong to an explicit Pfam family, how can I generate new sequences? My guess is to take the closest Pfam family (or several families) and fine-tune the pretrained ProGen2 model on it, right? More generally, my question is whether I can define the training protein set myself (not limited to a downloaded Pfam family) and fine-tune according to different design goals.

Thanks!

fengming han

hugohrban commented 4 days ago

Hi, thanks for your question!

Finding the closest Pfam family to your dataset would be one way to go. But it is definitely possible to train on arbitrary sets of protein sequences as well. You just need to preprocess your train and test data files so that each line starts with a special token, such as <|YOUR_TAG|>. Other than that, I think finetuning and sampling should work as expected. I updated the scripts to accept any special token (still just one special token per sequence). See commit f089a3a and the similar issue #3.
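As a rough illustration of that preprocessing step, here is a minimal sketch that reads FASTA-formatted sequences, prefixes each one with a single special token, and splits the result into train and test sets. The FASTA input, the helper names (read_fasta, tag_sequences, split_train_test) and the 10% test fraction are assumptions made for this example, not part of the repo; the exact line format the finetuning scripts expect should be checked against the repo's own preprocessing code.

```python
import random


def read_fasta(lines):
    """Yield one concatenated sequence per FASTA record (headers are dropped)."""
    seq = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seq:
                yield "".join(seq)
                seq = []
        else:
            seq.append(line)
    if seq:
        yield "".join(seq)


def tag_sequences(sequences, tag):
    """Prefix every sequence with the same special token, one sequence per item."""
    return [f"{tag}{s}" for s in sequences]


def split_train_test(items, test_fraction=0.1, seed=0):
    """Shuffle reproducibly and split into (train, test).

    The 10% default is an arbitrary choice for this sketch.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = max(1, int(len(items) * test_fraction)) if len(items) > 1 else 0
    return items[n_test:], items[:n_test]


def write_lines(path, lines):
    """Write one tagged sequence per line."""
    with open(path, "w") as handle:
        for line in lines:
            handle.write(line + "\n")
```

To use it, one could open a FASTA file, pass the file object to read_fasta, call tag_sequences with a tag such as "<|YOUR_TAG|>", split the result with split_train_test, and write each part with write_lines to whatever train and test paths the finetuning script is pointed at.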

Let me know if I can help in any other way.