songlab-cal / tape

Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology.
https://www.biorxiv.org/content/10.1101/676825v1
BSD 3-Clause "New" or "Revised" License
658 stars, 129 forks

Can I make simple predictions (e.g., stability on my own dataset) directly from the sequence embedding results? #44

Closed: goes0n closed this issue 4 years ago

goes0n commented 4 years ago

I want to embed the protein sequences in my own dataset and use the embedding vectors to make stability predictions. Can I use the Extracting Embeddings section directly to get these results? I'm new to this area, so I would appreciate any guidance.

rmrao commented 4 years ago

There are two potential workflows:

  1. You can extract and save embeddings, then use them as input to a model that you write yourself for your dataset. If you would like to do this, I'd recommend using the babbler-1900 model, since it produces good single-vector embeddings of a protein. Extracting embeddings returns an npz file, so you can write the downstream model however you want, even in something like scikit-learn.
  2. You can fully fine-tune an existing model, either using our training code or by writing your own. This will probably give the best results, but it does require writing some PyTorch code.
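
As a rough illustration of the first workflow, here is a minimal sketch of a downstream regressor trained on saved embeddings. It assumes the npz file stores, for each protein id, a dict with a fixed-length vector under a key such as `avg`; check the actual layout of your file before relying on this. The helper names (`load_embeddings`, `fit_ridge`, `predict`) are hypothetical and not part of TAPE, and the closed-form ridge regression is only a stand-in for whatever model you prefer.

```python
import numpy as np


def load_embeddings(npz_path, key="avg"):
    """Return (ids, X) with one fixed-length vector per protein.

    Assumption: each entry in the .npz file is a dict holding a per-protein
    vector under `key`; adjust this if your file is laid out differently.
    """
    arrays = np.load(npz_path, allow_pickle=True)
    ids = list(arrays.files)
    vectors = []
    for protein_id in ids:
        entry = arrays[protein_id]
        # Dicts saved through np.savez come back as 0-d object arrays; unwrap them.
        if isinstance(entry, np.ndarray) and entry.dtype == object:
            entry = entry.item()
        vectors.append(np.asarray(entry[key], dtype=float))
    return ids, np.stack(vectors)


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    X_centered = X - x_mean
    gram = X_centered.T @ X_centered + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(gram, X_centered.T @ (y - y_mean))
    b = y_mean - x_mean @ w
    return w, b


def predict(X, w, b):
    """Predict one scalar (e.g. a stability score) per embedding row."""
    return X @ w + b
```

To use it, call `load_embeddings` on your own embeddings file, build a label array `y` in the same order as the returned ids, then call `fit_ridge(X, y)` and `predict`. A scikit-learn estimator such as `sklearn.linear_model.Ridge` could replace `fit_ridge` once `X` and `y` are in hand, and you would normally evaluate on held-out proteins rather than on the training set.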

Hope this helps!

rmrao commented 4 years ago

Since there's been no follow-up, I'm going to assume this is resolved and can be closed.