sacdallago / bio_embeddings

Get protein embeddings from protein sequences
http://docs.bioembeddings.com
MIT License

Handling X characters #174

Closed GiovannaNicora closed 2 years ago

GiovannaNicora commented 2 years ago

Hi, thanks for your useful work. I was just wondering how the different embedders deal with unknown amino acids (the X character).

konstin commented 2 years ago

If I remember correctly, all of them have X as a character in their alphabet, just like a normal amino acid. The representation will of course be mostly meaningless. The embedders do vary, though, when it comes to rare AAs: the ProtTrans models, for example, map rare residues such as U, Z and O to X.
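To make the rare-residue mapping concrete, here is a minimal sketch of that kind of preprocessing. It is not taken from the bio_embeddings code base: the helper name `normalize_sequence` and the constant `RARE_AAS` are hypothetical, and the set of letters is an assumption based only on the ones named above, so adjust it to whatever the model you use expects.

```python
import re

# Assumption: the rare letters to collapse into 'X'. Only the letters named in
# this thread are listed; the real set may differ per model.
RARE_AAS = "UZO"


def normalize_sequence(sequence: str, rare: str = RARE_AAS) -> str:
    """Upper-case a protein sequence and replace rare residues with 'X'."""
    pattern = "[" + re.escape(rare) + "]"
    return re.sub(pattern, "X", sequence.upper())


print(normalize_sequence("MKUZOAX"))  # prints MKXXXAX
```

An 'X' that is already in the input passes through unchanged, which matches the point above that 'X' is treated as an ordinary character of the alphabet.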

mheinzinger commented 2 years ago

In the case of ProtTrans, we tested the effect of 'X' by using t-SNE to project the uncontextualized embeddings (no attention yet) of the 20 standard AAs and 'X' down to 2D. In the resulting plots, we saw a trend towards 'X' clustering with the hydrophobic amino acids. One thing that stood out was that, across the different transformers, 'X' was always very close to 'C'. So I assume this gives you at least some hint of how the protein language models 'read' an 'X' in a protein sequence. You can check this visually in the t-SNE plots in the ProtTrans paper (mostly in the SOM; only the ProtT5 plot is in the main text).
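If you want to run a similar check yourself without plotting, one simpler alternative to t-SNE is to rank the other residues by cosine similarity to 'X' in the original embedding space. The sketch below does that, and it is a swapped-in technique, not the analysis from the paper. The function names are hypothetical, and the `toy` vectors are made-up placeholders chosen only to show the mechanics. They are not real model weights, so you would replace them with the uncontextualized token embeddings from your model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def neighbours_of(token, embeddings):
    """Return (residue, similarity) pairs, most similar to `token` first."""
    query = embeddings[token]
    scores = [
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != token
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up placeholder vectors (NOT real embeddings), for illustration only.
toy = {
    "X": [1.0, 0.9, 0.1],
    "C": [1.0, 1.0, 0.0],
    "L": [0.8, 0.2, 0.5],
    "D": [-1.0, 0.1, 0.3],
}

for residue, score in neighbours_of("X", toy):
    print(residue, round(score, 3))
```

Unlike a t-SNE projection, this ranking does not depend on perplexity or random initialisation, but it also gives no picture of the overall cluster structure.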