rdpstaff / classifier

RDP extensible sequence classifier for fungal lsu, bacterial and archaeal 16s
GNU General Public License v2.0

Is there a way to define a different k-mer? What way to chop up the reads? #13

Open yingeddi2008 opened 8 years ago

yingeddi2008 commented 8 years ago

Hi,

The RDP classifier paper says the word size is 8. (Just to make sure I understand it correctly: the size here is the length of the word, such as ATCTGGTC, right?) According to the paper, 8 is the optimal size because, in preliminary experiments, word sizes of 6, 7, or 9 were less accurate than size 8.

- Is there an option to pick a different word size when I train my own classifier with a customized database?

Also, I would like to know how you chop up the sequences in the database. As I understand it, all the words should be non-overlapping, in order to satisfy the assumption behind Bayes' rule that all features are independent (correct me if I have misunderstood). Say I have a sequence in the database:

SeqA: AAAAAAAA TTTTTTTT GGGGGGGG TTTTTTTT

If I chop it up starting from the very first nt, I should get these words of size 8:

AAAAAAAA x1, TTTTTTTT x2, and GGGGGGGG x1, and these will be recorded as the features for this particular genus.

But what if I have a test sequence:

SeqB: ATTTTTTT TGG

Clearly this is a subsequence of SeqA (the last A, the eight T's that follow it, and the first two G's). But if I chop it up starting from the very first nt, it won't give me any of the feature words I got from SeqA. I will get ATTTTTTT plus the leftover, TGG. I am curious: what do you do with the leftover nt? Just throw them away?

- I think I need a little insight into how you chop up the database into k-mers, and how you define the features.
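To make the SeqA / SeqB example concrete, here is a minimal Python sketch of the chopping I described. The function names `chop_non_overlapping` and `sliding_words` are my own for illustration and are not taken from the RDP classifier code. The sliding-window version is included only as a contrast, under the assumption that overlapping words are an alternative worth comparing; I am not claiming either scheme is what the classifier actually does.

```python
from collections import Counter

def chop_non_overlapping(seq, k=8):
    """Cut seq into consecutive, non-overlapping words of length k.

    Returns (words, leftover), where leftover is the trailing part
    shorter than k. Spaces are only visual separators and are removed.
    """
    seq = seq.replace(" ", "")
    cut = len(seq) - len(seq) % k
    words = [seq[i:i + k] for i in range(0, cut, k)]
    return words, seq[cut:]

def sliding_words(seq, k=8):
    """Hypothetical alternative: every overlapping word of length k."""
    seq = seq.replace(" ", "")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

seq_a = "AAAAAAAA TTTTTTTT GGGGGGGG TTTTTTTT"
seq_b = "ATTTTTTT TGG"

words_a, leftover_a = chop_non_overlapping(seq_a)
words_b, leftover_b = chop_non_overlapping(seq_b)

# Word counts for SeqA when chopped from the first nt
counts_a = Counter(words_a)

# Words shared between SeqA and SeqB under each scheme
shared_non_overlapping = set(words_a) & set(words_b)
shared_sliding = set(sliding_words(seq_a)) & set(sliding_words(seq_b))

print(counts_a)
print(words_b, leftover_b)
print(shared_non_overlapping)
print(sorted(shared_sliding))
```

With non-overlapping chopping, SeqB shares no word with SeqA and leaves TGG over, which is the problem I am asking about. With the sliding window, every word of SeqB also occurs in SeqA.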

I am a beginner in machine learning algorithms and am still trying to learn more about the RDP classifier. If my understanding is wrong, I welcome any suggestions.

Thanks a lot!

Eddi