Hi,
According to the RDP classifier paper, the word size is 8. (Just to check my understanding: the size here is the length of a word, such as ATCTGGTC, right?) The paper says 8 is optimal because, in preliminary experiments, word sizes of 6, 7, or 9 were less accurate.
- Is there an option to pick a different word size when I train my own classifier on a customized database?
I would also like to know how you chop up the sequences in the database. As I read it, the words should all be non-overlapping, to satisfy the assumption behind Bayes' rule that all features are independent (please correct me if I have misunderstood). Say I have a sequence in the database:
SeqA: AAAAAAAA TTTTTTTT GGGGGGGG TTTTTTTT
If I chop from the very first nt, I should get these 8-nt words:
AAAAAAAA x1, TTTTTTTT x2, and GGGGGGGG x1, and these will be recorded as the features for this particular genus.
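To make my assumption concrete, here is a minimal Python sketch of the non-overlapping chopping I have in mind. The function name `chop_non_overlapping` is my own invention for illustration, and I do not know whether the classifier actually works this way.

```python
from collections import Counter


def chop_non_overlapping(seq, k=8):
    """Split seq into consecutive, non-overlapping words of length k,
    starting at the first nucleotide.

    Returns (words, leftover), where leftover is the trailing piece
    shorter than k. This is my assumed scheme, not the classifier's code.
    """
    seq = seq.replace(" ", "").upper()  # spaces are only for readability
    n_full = len(seq) // k
    words = [seq[i * k:(i + 1) * k] for i in range(n_full)]
    leftover = seq[n_full * k:]
    return words, leftover


seq_a = "AAAAAAAA TTTTTTTT GGGGGGGG TTTTTTTT"
words_a, leftover_a = chop_non_overlapping(seq_a)
counts_a = Counter(words_a)  # word -> number of occurrences in SeqA
print(counts_a)
print(repr(leftover_a))
```

With SeqA this gives one AAAAAAAA, two TTTTTTTT, one GGGGGGGG, and no leftover.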
But what if I have a test sequence:
SeqB: ATTTTTTT TGG. Clearly it is a subsequence of SeqA (the last A, the eight Ts that follow, and the first two Gs), but if I chop from the very first nt, I do not get the same feature words as I got from SeqA. I get ATTTTTTT plus the leftover TGG. What do you do with the leftover nt? Just throw them away?
- I think I need a little insight into how you chop the database into k-mers and how you define the features.
I am a beginner in machine learning algorithms and am still learning about the RDP classifier. If my understanding is wrong, I welcome any suggestions.
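Here is a small Python sketch of the mismatch I am worried about. Both helper functions are hypothetical names of my own. The sliding-window version is included only for contrast; I am not claiming it is what the classifier does.

```python
def chop_non_overlapping(seq, k=8):
    """Consecutive, non-overlapping k-nt words from the first nt.
    Returns (words, leftover). Assumed scheme, for illustration only."""
    seq = seq.replace(" ", "").upper()
    n_full = len(seq) // k
    words = [seq[i * k:(i + 1) * k] for i in range(n_full)]
    return words, seq[n_full * k:]


def sliding_words(seq, k=8):
    """Every k-nt window, shifting by one nt (overlapping words).
    Shown only for contrast with the non-overlapping scheme."""
    seq = seq.replace(" ", "").upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


seq_a = "AAAAAAAA TTTTTTTT GGGGGGGG TTTTTTTT"
seq_b = "ATTTTTTT TGG"

words_a, _ = chop_non_overlapping(seq_a)
words_b, leftover_b = chop_non_overlapping(seq_b)

# Non-overlapping: SeqB's only full word never appears among SeqA's words.
shared = set(words_a) & set(words_b)
print(words_b, leftover_b, shared)

# Overlapping: every window of SeqB is also a window of SeqA.
all_found = set(sliding_words(seq_b)) <= set(sliding_words(seq_a))
print(all_found)
```

With non-overlapping words, SeqB shares no feature with SeqA even though it is contained in it, and 3 nt are left over. With overlapping windows, every word of SeqB is found in SeqA.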
Thanks a lot!
Eddi