songlab-cal / tape-neurips2019

Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. (DEPRECATED)
https://arxiv.org/abs/1906.08230
MIT License

amino acid mapping #16

madani-sf closed this issue 4 years ago

madani-sf commented 4 years ago

Can the authors provide the mapping from the index numbers in the raw data to three-letter amino acid names?

I'm assuming it is alphabetical, starting from 'A' -> 4 and skipping the letter 'J'. In addition to the ordering, please clarify the full name of each amino acid.

thomas-a-neil commented 4 years ago

Sure! The mapping from amino acid to index is given by PFAM_VOCAB:

https://github.com/songlab-cal/tape/blob/master/tape/data_utils/vocabs.py#L1

The one-letter codes follow the standard IUPAC convention: https://www.bioinformatics.org/sms2/iupac.html
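For anyone who wants to go from index straight to three-letter code and full name, here is a minimal sketch. The one-letter to three-letter table is the standard IUPAC set, including the ambiguity codes B, X, Z and the rarer O and U. The vocabulary is NOT copied from PFAM_VOCAB: it is built from the assumption in the question (alphabetical, 'A' -> 4, skipping 'J'), so check it against vocabs.py before relying on it. `build_assumed_vocab` and `index_to_names` are hypothetical helper names, not part of TAPE.

```python
import string

# Standard IUPAC one-letter codes -> (three-letter code, full name).
IUPAC_NAMES = {
    "A": ("Ala", "Alanine"),
    "B": ("Asx", "Aspartic acid or Asparagine"),
    "C": ("Cys", "Cysteine"),
    "D": ("Asp", "Aspartic acid"),
    "E": ("Glu", "Glutamic acid"),
    "F": ("Phe", "Phenylalanine"),
    "G": ("Gly", "Glycine"),
    "H": ("His", "Histidine"),
    "I": ("Ile", "Isoleucine"),
    "K": ("Lys", "Lysine"),
    "L": ("Leu", "Leucine"),
    "M": ("Met", "Methionine"),
    "N": ("Asn", "Asparagine"),
    "O": ("Pyl", "Pyrrolysine"),
    "P": ("Pro", "Proline"),
    "Q": ("Gln", "Glutamine"),
    "R": ("Arg", "Arginine"),
    "S": ("Ser", "Serine"),
    "T": ("Thr", "Threonine"),
    "U": ("Sec", "Selenocysteine"),
    "V": ("Val", "Valine"),
    "W": ("Trp", "Tryptophan"),
    "X": ("Xaa", "Any amino acid"),
    "Y": ("Tyr", "Tyrosine"),
    "Z": ("Glx", "Glutamic acid or Glutamine"),
}


def build_assumed_vocab(first_index=4, skipped="J"):
    """ASSUMPTION: alphabetical letters from first_index, skipping `skipped`.

    This mirrors the guess in the question; replace it with the real
    PFAM_VOCAB mapping once you have verified it.
    """
    letters = [c for c in string.ascii_uppercase if c not in skipped]
    return {letter: i for i, letter in enumerate(letters, start=first_index)}


def index_to_names(vocab):
    """Invert a token -> index vocab into index -> (three-letter, full name).

    Tokens without an IUPAC entry (e.g. special tokens) are left out.
    """
    return {idx: IUPAC_NAMES[tok] for tok, idx in vocab.items() if tok in IUPAC_NAMES}


assumed_vocab = build_assumed_vocab()
lookup = index_to_names(assumed_vocab)
print(lookup[4])
```

To use the real vocabulary, pass the actual PFAM_VOCAB mapping to `index_to_names` in place of `assumed_vocab`; any non-amino-acid tokens it contains are skipped by the filter.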

rmrao commented 4 years ago

One thing we could add as a note: if you Google IUPAC codes, the first result (for me at least) is https://www.bioinformatics.org/sms/iupac.html, which is version 1 of the site, not the updated version 2 (which has different codes). Version 1 only shows codes for the 20 standard amino acids.