michaelbarton opened this issue 7 years ago
Hi @michaelbarton! By all means please go ahead :) I haven't yet settled on a 'story' for this investigation, but checking 3-mers and codons should definitely yield some interesting insights!
Just like you I don't have much time to look at this, but I'll have a look over the weekend.
If you have any other ideas feel free to share!
I've started to look into using LSTMs to classify some of my datasets, but it's slow going since the k-mer counting is so slow. For now I'm using k-mers predicted by HAWK together with random noise, since there I know the k-mers are different; I'll switch to full datasets later.
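For reference, a minimal sketch of what the k-mer counting step looks like (a naive pure-Python version, not HAWK's implementation, and much slower than dedicated tools):

```python
from collections import Counter

def count_kmers(seq, k):
    """Count all overlapping k-mers in a DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = count_kmers("ATGATGCGA", 3)
# counts["ATG"] == 2
```

These counts (or HAWK's differential k-mers) would then be the feature vectors fed to the classifier.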
Hey @philippbayer, I don't know how much you want me chipping in with these suggestions? Feel free to ignore this, or tell me to mind my own business :). At least on a github issue, I can post a longer response than on twitter.
I suggest a quick test: compare embedding distances between 3-mers and codons. My hypothesis would be that embedding distances between 3-mers encoding the same amino acid differ from those between 3-mers encoding different amino acids. This should also be easily falsifiable if it is not the case. You could try this with a single reading frame versus all reading frames. My suggested linear models to test for this would be:
With the following variables:
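The distance comparison itself could be sketched like this. The codon table entries are standard, but the embeddings here are random placeholders standing in for whatever trained 3-mer embeddings you have, so the printed means are only illustrative:

```python
import itertools
import numpy as np

# A few entries from the standard codon table (DNA alphabet assumed).
CODON_TO_AA = {
    "TTA": "Leu", "TTG": "Leu", "CTT": "Leu", "CTC": "Leu",
    "GCT": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "ATG": "Met", "TGG": "Trp",
}

rng = np.random.default_rng(0)
# Placeholder 8-dimensional vectors; swap in real embeddings here.
embeddings = {c: rng.normal(size=8) for c in CODON_TO_AA}

same_aa, diff_aa = [], []
for a, b in itertools.combinations(CODON_TO_AA, 2):
    d = np.linalg.norm(embeddings[a] - embeddings[b])
    (same_aa if CODON_TO_AA[a] == CODON_TO_AA[b] else diff_aa).append(d)

print(f"mean distance, same amino acid:      {np.mean(same_aa):.3f}")
print(f"mean distance, different amino acid: {np.mean(diff_aa):.3f}")
```

With real embeddings, a clearly smaller same-amino-acid mean would support the hypothesis; the two distance groups could also feed directly into the linear models above.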
I don't currently have time to work on this, otherwise I would try it out. These are simple suggestions; feel free to ignore them if they're not useful to you.