Open hanfengming2004 opened 1 week ago
Hi, thanks for your question!
Finding the closest Pfam family to your dataset would be one way to go, but it is definitely possible to train on arbitrary sets of protein sequences as well. You just need to preprocess your train and test data files so that each line starts with a special token, such as <|YOUR_TAG|>. Other than that, I think finetuning and sampling should work as expected. I updated the scripts to accept any special token (still just one special token per sequence). See commit f089a3a and the similar issue #3.
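A minimal sketch of that preprocessing step, in case it helps. It assumes the input has one raw amino-acid sequence per line and that the tag is prepended directly, with no separator. The function names `tag_sequences` and `tag_file` are hypothetical and are not part of the repository's scripts, so please check the exact expected line format against the scripts from the commit above.

```python
from pathlib import Path
from typing import Iterable, List


def tag_sequences(sequences: Iterable[str], tag: str) -> List[str]:
    """Prefix every non-empty sequence with `tag` (hypothetical helper)."""
    tagged = []
    for seq in sequences:
        seq = seq.strip()
        if not seq:
            continue  # skip blank lines
        if seq.startswith(tag):
            tagged.append(seq)  # already tagged, leave unchanged
        else:
            # Assumption: the tag is directly followed by the sequence.
            tagged.append(f"{tag}{seq}")
    return tagged


def tag_file(src: str, dst: str, tag: str) -> int:
    """Read sequences from `src`, write tagged lines to `dst`, return the count."""
    lines = Path(src).read_text().splitlines()
    tagged = tag_sequences(lines, tag)
    Path(dst).write_text("\n".join(tagged) + "\n")
    return len(tagged)
```

The same tag would be applied to both the train and the test file, for example `tag_file("train_raw.txt", "train.txt", "<|YOUR_TAG|>")`, where the file names are placeholders.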
Let me know if I can help in any other way.
If my protein does not belong to an explicit Pfam family, how can I generate a new sequence? My guess is to extract the closest Pfam family (or several families) and use it to fine-tune the pretrained ProGen2 model, right? In other words, can I define the training protein sets myself (not limited to the downloaded Pfam families) for fine-tuning according to different design goals?
Thanks!
fengming han