tbepler / protein-sequence-embedding-iclr2019

Source code for "Learning protein sequence embeddings using information from structure" - ICLR 2019

How to use the protein encoder? #1

Closed · datduong closed this 5 years ago

datduong commented 5 years ago

Would you be able to provide instructions for running the encoder? For example, is there a function like model.encode('MKVKK'), where 'MKVKK' is some amino acid sequence?

Which of the pre-trained models should I use?

Thanks.

tbepler commented 5 years ago

There is not a single function to do what you ask. You need to first encode the amino acid sequence into bytes with alphabets.Uniprot21, convert this to a pytorch tensor, and then embed the sequence with the trained model.
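A rough sketch of those steps (untested; it assumes Uniprot21 exposes an encode method that maps a byte string to integer indices, which is how the eval scripts use it):

```python
import torch
from src.alphabets import Uniprot21

# 1) encode the amino acid sequence into integer token indices
alphabet = Uniprot21()
x = alphabet.encode(b'MKVKK')   # sequences are byte strings; returns a numpy array

# 2) convert to a pytorch LongTensor and add a batch dimension
x = torch.from_numpy(x).long().unsqueeze(0)   # shape: (1, length)
```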

I suggest taking a look at eval_secstr and/or eval_transmembrane for an idea of how this works (see specifically "encode_sequence" and "TorchModel.__call__" in eval_secstr).

Regarding the models, "pfam_lm_lstm2x1024_tied_mb64.sav" is the bidirectional language model trained on Pfam. Of the structure-based embedding models, I would suggest "ssa_L1_100d_lstm3x512_lm_i512_mb64_tau0.5_lambda0.1_p0.05_epoch100.sav", which is the SSA model trained with both the structural similarity and contact prediction tasks on the full training set (the SSA (full) model in the manuscript).
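Loading a checkpoint and embedding the tensor from the sketch above would then look something like this (again untested; it assumes the .sav files are whole pytorch modules saved with torch.save, so torch.load returns the model directly, and that the forward pass accepts a batch of index tensors as in TorchModel.__call__):

```python
import torch

# load the pretrained SSA (full) model; map_location lets this run without a GPU
model = torch.load(
    'ssa_L1_100d_lstm3x512_lm_i512_mb64_tau0.5_lambda0.1_p0.05_epoch100.sav',
    map_location='cpu',
)
model.eval()

with torch.no_grad():
    z = model(x)   # x from the encoding step above; z holds the per-residue embeddings
```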

datduong commented 5 years ago

Thanks for your help. I was able to figure out how to use the Uniprot21 conversion.