hugohrban / ProGen2-finetuning

Finetuning ProGen2 protein language model for generation of protein sequences from selected protein families.
BSD 3-Clause "New" or "Revised" License

Question about control tags #3

Closed: martinez-zacharya closed this issue 2 weeks ago

martinez-zacharya commented 3 weeks ago

Awesome work! I was wondering about specifics regarding the control tags:

Do they need to be Pfam IDs, or can they be any arbitrary string between two "|" characters at the beginning of the sequence? For example, what if I had a group of proteins that don't have a Pfam ID? Thank you!

hugohrban commented 2 weeks ago

Hi, thanks for your question!

You are right, the special prefix tokens can be anything of the form <|YOUR_TOKEN|>. The token should typically be followed by a "1" or a "2", depending on whether the sequence is stored in the traditional N -> C terminal direction or the reverse, same as in the original ProGen2 models. Also keep in mind that currently only one prefix token per sequence is supported. I updated the scripts to allow arbitrary tokens, not just Pfam tokens, in commit f089a3a. (See also the similar issue #4.)
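For illustration, here is a minimal sketch of how a sequence could be prefixed with a custom control tag. The helper name `format_tagged_sequence` and the tag `MY_FAMILY` are hypothetical and not part of the repository. The sketch assumes that "1" marks the N -> C direction and "2" the reverse, and that a reversed sequence is stored back to front. Whether the repository scripts expect anything further (for example a closing terminal marker) should be checked against the scripts themselves.

```python
def format_tagged_sequence(sequence: str, tag: str, reverse: bool = False) -> str:
    """Prefix a protein sequence with one control tag and a direction marker.

    Hypothetical helper, not part of the repository. Assumes "1" marks the
    N -> C direction and "2" marks the reverse direction.
    """
    # Only one prefix token per sequence is supported, so reject tags that
    # could be read as several tokens or that would break the <|...|> wrapper.
    if not tag or "|" in tag or any(ch.isspace() for ch in tag):
        raise ValueError(f"invalid control tag: {tag!r}")

    if reverse:
        # Assumption: the reverse direction stores the residues back to front.
        return f"<|{tag}|>2{sequence[::-1]}"
    return f"<|{tag}|>1{sequence}"


if __name__ == "__main__":
    # Toy sequence, for illustration only.
    print(format_tagged_sequence("MKVLA", "MY_FAMILY"))
    print(format_tagged_sequence("MKVLA", "MY_FAMILY", reverse=True))
```

The tag validation here is a design choice of the sketch, not a documented requirement of the scripts; it only guards against accidentally producing more than one prefix token.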

Let me know if I can help in any other way!

martinez-zacharya commented 2 weeks ago

Awesome, thank you for the helpful response!