michaelbarton opened this issue 7 years ago
Hi @michaelbarton! By all means please go ahead :) I haven't yet settled on a 'story' for this investigation, but checking 3-mers and codons should definitely yield some interesting insights!
Just like you I don't have much time to look at this, but I'll have a look over the weekend.
If you have any other ideas feel free to share!
I've started to look into using LSTMs to classify some of my datasets, but it's slow going since the k-mer counting is so slow. For now I'm using k-mers predicted by HAWK together with random noise, since there I know the k-mers are different; I'll switch to full datasets later.
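For reference, a minimal sketch of what the k-mer counting step looks like (a naive pure-Python version, not HAWK's implementation, and much slower than dedicated tools):

```python
from collections import Counter

def count_kmers(seq, k):
    """Count all overlapping k-mers in a DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = count_kmers("ATGATGCGA", 3)
# counts["ATG"] == 2
```

These counts (or HAWK's differential k-mers) would then be the feature vectors fed to the classifier.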
Hey @philippbayer, I don't know how much you want me chipping in with these suggestions? Feel free to ignore this, or tell me to mind my own business :). At least on a github issue, I can post a longer response than on twitter.
I suggest a quick test: compare embedding distances between 3-mers and codons. My hypothesis would be that embedding distances between 3-mers encoding the same amino acid differ from those between 3-mers encoding different amino acids. This should also be easily falsifiable if it is not the case. You could try this with a single reading frame versus all reading frames. My suggested linear models to test for this would be:
With the following variables:
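The distance comparison itself could be sketched like this. The codon table entries are standard, but the embeddings here are random placeholders standing in for whatever trained 3-mer embeddings you have, so the printed means are only illustrative:

```python
import itertools
import numpy as np

# A few entries from the standard codon table (DNA alphabet assumed).
CODON_TO_AA = {
    "TTA": "Leu", "TTG": "Leu", "CTT": "Leu", "CTC": "Leu",
    "GCT": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "ATG": "Met", "TGG": "Trp",
}

rng = np.random.default_rng(0)
# Placeholder 8-dimensional vectors; swap in real embeddings here.
embeddings = {c: rng.normal(size=8) for c in CODON_TO_AA}

same_aa, diff_aa = [], []
for a, b in itertools.combinations(CODON_TO_AA, 2):
    d = np.linalg.norm(embeddings[a] - embeddings[b])
    (same_aa if CODON_TO_AA[a] == CODON_TO_AA[b] else diff_aa).append(d)

print(f"mean distance, same amino acid:      {np.mean(same_aa):.3f}")
print(f"mean distance, different amino acid: {np.mean(diff_aa):.3f}")
```

With real embeddings, a clearly smaller same-amino-acid mean would support the hypothesis; the two distance groups could also feed directly into the linear models above.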
I don't currently have time to work on this, otherwise I would try it out. These are simple suggestions; feel free to ignore them if they're not useful to you.