Closed alexdonath closed 6 years ago
Hi @alexdonath,
Briefly, the more different k-mer frequencies you include, the better the classification accuracy, although the gain becomes moderate once many k-mer scores are combined (see supplementary figure S2 of the paper). In general, we use the option --kmer="1,2,3,6,9,12" to include {1,2,3,6,9,12}-mer frequencies, which gives relatively good performance in a reasonable computation time.
You are right that you could test different k-mer lists and compare the performance metrics in the *RF_statsLearn_CrossValidation.txt file, but I expect you will end up with results similar to figure S2 of the paper.
Hope this helps
All the best,
Thomas
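For anyone who wants to try the comparison of k-mer lists mentioned above, here is a minimal sketch that only enumerates candidate values for the --kmer option. The helper name `kmer_option_strings` and the `min_len` parameter are hypothetical and not part of the tool; the sketch does not run the classifier, it just builds the option strings to paste into your own command line.

```python
from itertools import combinations


def kmer_option_strings(sizes, min_len=2):
    """Build every --kmer="..." string for subsets of `sizes`.

    `sizes` are candidate k values; `min_len` is the smallest number of
    k values allowed in one list (both names are illustrative only).
    """
    sizes = sorted(set(sizes))  # remove duplicates, keep ascending order
    options = []
    for r in range(min_len, len(sizes) + 1):
        for combo in combinations(sizes, r):
            options.append('--kmer="' + ",".join(str(k) for k in combo) + '"')
    return options


if __name__ == "__main__":
    # Candidate lists drawn from the k values suggested in this thread
    for opt in kmer_option_strings([1, 2, 3, 6, 9, 12], min_len=4):
        print(opt)
```

Each printed string can then be used in a separate run, and the resulting cross-validation files compared. Note that the number of subsets grows quickly, so restricting `min_len` keeps the number of runs manageable.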
Hi Thomas,
Thanks a lot! I have tested various k-mer lists and found a couple that gave better results on my data than the list you proposed. However, the improvements were only marginal. So yes, the k-mer list you suggested seems to work quite well.
Thanks again,
Alex
Happy it helped! Best,
Thomas
Hi @tderrien,
I am trying to annotate lncRNAs in the genome of a non-model organism using the shuffle strategy. I was wondering how to determine which k-mer frequencies should be preserved in order to maximize classification accuracy. Can this be judged from the mean accuracy value in the *RF_statsLearn_CrossValidation.txt file? Or do I need to calculate MCC values for each k-mer list? If so, can I use the mean TP/TN/FP/FN values from the above-mentioned file for this?
Thanks for your help!
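
For reference, the MCC (Matthews correlation coefficient) asked about here can be computed from the four confusion-matrix counts. The sketch below assumes the TP/TN/FP/FN values have been copied by hand from the cross-validation file; it does not parse that file, and the function name `mcc` and the example counts are illustrative only.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention when
    one row or column of the confusion matrix is empty).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Made-up counts, for illustration only
    print(round(mcc(tp=90, tn=80, fp=20, fn=10), 4))
```

One caveat: MCC is not linear in the counts, so an MCC computed from mean TP/TN/FP/FN values will generally differ somewhat from the mean of per-fold MCC values. Either can be used to rank k-mer lists as long as the same convention is applied to all of them.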