sacdallago / bio_embeddings

Get protein embeddings from protein sequences
http://docs.bioembeddings.com
MIT License

Reverse the embedding execution and get the original AA sequence? #147

Closed kandoigaurav closed 3 years ago

kandoigaurav commented 3 years ago

Hi

I'm using some of the ProtTrans embedders (for example: ProtTransBertBFDEmbedder, ProtTransAlbertBFDEmbedder, ProtTransXLNetUniRef100Embedder) for protein property prediction. Most of my target protein sequences are small (<100 amino acids), but all the ProtTrans embedders/models were trained on much longer sequences.

So, as a housekeeping step, I would like to use the embeddings to regenerate the original protein sequence. This would help me evaluate how well the embeddings work for my smaller proteins.

Is there a way I can do this? I tried searching for a decoder for the ProtTrans models/embedders but couldn't find one.

konstin commented 3 years ago

Hi,

Predicting the original residue from a per-residue embedding should be trivial, but generating sequences from a per-protein representation is mostly uncharted territory.

From personal experience, the ProtTrans models work well for short sequences, especially ProtT5. I generally just run the predictions and then plot accuracy (or whatever score I have) against sequence length to check the influence of length.
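A minimal sketch of that length check, assuming you already have one score per protein from your own downstream predictor. The function name, the bin width, and the example data are illustrative and not part of bio_embeddings:

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_length(records, bin_width=25):
    """Group (sequence, score) pairs into length bins and average the score per bin."""
    bins = defaultdict(list)
    for sequence, score in records:
        # Integer division maps lengths 0-24 to bin 0, 25-49 to bin 25, and so on.
        bin_start = (len(sequence) // bin_width) * bin_width
        bins[bin_start].append(score)
    return {start: mean(scores) for start, scores in sorted(bins.items())}


# Hypothetical per-protein results: (sequence, score of the downstream predictor).
example = [
    ("M" * 18, 0.5),
    ("M" * 20, 0.7),
    ("M" * 60, 0.8),
    ("M" * 70, 0.9),
]
for start, avg in mean_score_by_length(example).items():
    print(f"length bin starting at {start}: mean score {avg:.2f}")
```

If the mean score in the shortest bins is clearly lower than in the longer ones, that would hint at a length effect. The resulting dictionary can also be passed to any plotting library.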


One option to check how well the language model understands your sequences could be our mutagenesis protocol. It gives ProtBert the sequence with one residue masked out, and ProtBert predicts how likely each amino acid is at that position. This is repeated for each residue in the protein, so you get a prediction for each amino acid in the entire sequence. This should come closest to generating the original sequence from the embedding.

It's a very recent addition so there isn't a proper example yet, but in short you can use it like this:

global:
  sequences_file: short_sequences.fasta
  prefix: mutagenesis
mutagenesis:
  type: mutagenesis
  protocol: protbert_bfd_mutagenesis
plot_mutagenesis:
  type: visualize
  protocol: plot_mutagenesis
  depends_on: mutagenesis

For each protein, this will create an HTML file with a plotly plot that looks like this:

[Image: example mutagenesis plot]

The red dot is the original amino acid, the color is the likelihood for each amino acid at this position. You could create a "generated sequence" by only taking the amino acid predicted as most likely.
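A minimal sketch of that "generated sequence" idea, assuming the per-position probabilities are already loaded as one dictionary per residue (amino acid to probability). This input format and both function names are assumptions for illustration; the actual output files of the mutagenesis protocol may be laid out differently:

```python
def most_likely_sequence(position_probabilities):
    """Pick the highest-probability amino acid at each position."""
    return "".join(max(probs, key=probs.get) for probs in position_probabilities)


def recovery_rate(original, reconstructed):
    """Fraction of positions where the reconstruction matches the original."""
    if len(original) != len(reconstructed):
        raise ValueError("sequences must have the same length")
    if not original:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(original, reconstructed))
    return matches / len(original)


# Hypothetical probabilities for a 3-residue protein (only a few amino acids shown).
probabilities = [
    {"M": 0.7, "L": 0.2, "K": 0.1},
    {"M": 0.1, "L": 0.3, "K": 0.6},
    {"M": 0.2, "L": 0.5, "K": 0.3},
]
generated = most_likely_sequence(probabilities)
print(generated, recovery_rate("MKV", generated))
```

The recovery rate gives a single number per protein that can be compared between short and long sequences.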

I recommend trying this with only a single protein or a handful of proteins, since it is a lot slower than normal embedding computation: we have to mask each residue separately, so we do one computation per residue instead of one per protein.
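To illustrate why the cost grows with sequence length, here is a rough sketch of the masking step alone. It is not the code of the mutagenesis protocol; the function name is hypothetical, and the space-separated residues with a `[MASK]` token are an assumption about the input format ProtBert expects:

```python
def masked_variants(sequence, mask_token="[MASK]"):
    """Yield (position, model_input) pairs with exactly one residue masked."""
    residues = list(sequence)
    for position in range(len(residues)):
        masked = residues.copy()
        masked[position] = mask_token
        # Assumed input format: residues separated by single spaces.
        yield position, " ".join(masked)


# A protein of length n produces n separate model inputs.
for position, model_input in masked_variants("MKV"):
    print(position, model_input)
```

Each yielded input would need its own forward pass through the model, so a 100-residue protein requires 100 passes where a plain embedding needs one.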

sacdallago commented 3 years ago

Hi @kandoigaurav ,

what @konstin suggested is the best bet for you. The package doesn't include a reconstruction tool yet, as it's mostly aimed at feature extraction.

On a personal note: I don't see a reason why the embedders shouldn't work well on short sequences! Mind, though, that during the training of ProtTrans models, sequences <20 AA were dropped from BFD (the first training set), while UniRef50 (the fine-tuning set for some models) was kept as-is. I would only be concerned if you were using chopped sequences, as the models were trained on whole sequences, which do have intrinsic structure.

If you have any other questions, feel free to re-open and ask :)