tbepler / protein-sequence-embedding-iclr2019

Source code for "Learning protein sequence embeddings using information from structure" - ICLR 2019

How to use the protein encoder? #1

Closed · datduong closed this 5 years ago

datduong commented 5 years ago

Would you be able to provide instructions for running the encoder? For example, is there a function like model.encode('MKVKK'), where 'MKVKK' is some amino acid sequence?

Which of the pre-trained models should I use?

Thanks.

tbepler commented 5 years ago

There is not a single function to do what you ask. You need to first encode the amino acid sequence into bytes with alphabets.Uniprot21, convert this to a pytorch tensor, and then embed the sequence with the trained model.
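A rough sketch of those steps (untested; it assumes Uniprot21 exposes an encode method that maps a byte string to integer indices, which is how the eval scripts use it):

```python
import torch
from src.alphabets import Uniprot21

# 1) encode the amino acid sequence into integer token indices
alphabet = Uniprot21()
x = alphabet.encode(b'MKVKK')   # sequences are byte strings; returns a numpy array

# 2) convert to a pytorch LongTensor and add a batch dimension
x = torch.from_numpy(x).long().unsqueeze(0)   # shape: (1, length)
```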

I suggest taking a look at eval_secstr and/or eval_transmembrane for an idea of how this works (see specifically "encode_sequence" and "TorchModel.__call__" in eval_secstr).

Regarding the models, "pfam_lm_lstm2x1024_tied_mb64.sav" is the bidirectional language model trained on Pfam. Of the structure-based embedding models, I would suggest "ssa_L1_100d_lstm3x512_lm_i512_mb64_tau0.5_lambda0.1_p0.05_epoch100.sav", which is the SSA model trained with both the structural similarity and contact prediction tasks on the full training set (the SSA (full) model in the manuscript).
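Loading a checkpoint and embedding the tensor from the sketch above would then look something like this (again untested; it assumes the .sav files are whole pytorch modules saved with torch.save, so torch.load returns the model directly, and that the forward pass accepts a batch of index tensors as in TorchModel.__call__):

```python
import torch

# load the pretrained SSA (full) model; map_location lets this run without a GPU
model = torch.load(
    'ssa_L1_100d_lstm3x512_lm_i512_mb64_tau0.5_lambda0.1_p0.05_epoch100.sav',
    map_location='cpu',
)
model.eval()

with torch.no_grad():
    z = model(x)   # x from the encoding step above; z holds the per-residue embeddings
```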

datduong commented 5 years ago

Thanks for your help. I was able to figure out how to use the Uniprot21 conversion.