How to select optimal k-mer (lists)?

alexdonath commented 6 years ago

Hi @tderrien, I am trying to annotate lncRNAs in the genome of a non-model organism using the shuffle strategy. I was wondering how to determine which k-mer frequencies should be preserved in order to maximize the classification accuracy. Can this be judged from the mean accuracy value in the *RF_statsLearn_CrossValidation.txt file? Or do I need to calculate MCC values for each k-mer (list)? If so, can I use the mean TP/TN/FP/FN from the above-mentioned file for this?

Thanks for your help!

tderrien commented 6 years ago

Hi @alexdonath, Briefly, the more k-mer sizes you include, the better the classification accuracy, although the gain becomes moderate once many k-mer scores are combined (see supplementary figure S2 of the paper). In general, we use the option --kmer="1,2,3,6,9,12" to include {1,2,3,6,9,12}-mer frequencies, which gives relatively good performance in a reasonable computational time. You are right that you could test different k-mer lists and compare the performance metrics in the *RF_statsLearn_CrossValidation.txt file, but I expect you would end up with results similar to figure S2 of the paper. Hope this helps. All the best,

Thomas
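
For readers who want to compare k-mer lists by MCC as asked above, here is a minimal Python sketch. It is not part of FEELnc: the function name `mcc`, the dictionary layout, and all counts are hypothetical placeholders, and the TP/TN/FP/FN values are assumed to have been copied by hand from each run's *RF_statsLearn_CrossValidation.txt file rather than parsed from it.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Accepts ints or floats, so mean counts across folds also work.
    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts (TP, TN, FP, FN), one entry per --kmer list tested.
# These numbers are made up for illustration only.
results = {
    "1,2,3,6,9,12": (90, 85, 15, 10),
    "3,6": (80, 80, 20, 20),
}

scores = {kmers: mcc(*counts) for kmers, counts in results.items()}
best = max(scores, key=scores.get)

for kmers, score in scores.items():
    print(f"--kmer={kmers!r}: MCC = {score:.3f}")
print(f"Highest MCC: {best}")
```

Note that MCC is not linear in the counts, so an MCC computed from mean TP/TN/FP/FN will generally differ somewhat from the mean of per-fold MCC values. Either can be used to rank k-mer lists, as long as the same procedure is applied to every list.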

alexdonath commented 6 years ago

Hi Thomas, Thanks a lot! I have tested various k-mer lists and found a couple that gave better results on my data than the list you proposed. However, the improvements were only marginal. So yes, the k-mer list you suggested seems to work quite well. Thanks again, Alex

tderrien commented 6 years ago

Happy it helped! Best,

Thomas

tderrien / FEELnc
