Closed mtmcgowan closed 3 years ago
The k-mer count is a count variable, and the count variable usually follows Poisson distribution. The standard deviation then is the square root of the mean (alpha*t in this case). So mean+6sqrt(x) is kind of a conservative estimation of the upper bound for the count for the erroneous kmer. Does this help?
Yes, this mostly makes sense. I'm still working through the advantage of using a constant multiplier for your added standard deviation rather than simply relaxing your alpha to a lower percent. Do you by chance have a histogram of this multiplicity count distribution for a sample RNA-seq dataset? I'm preparing a short review presentation of your algorithm for a bioinformatics group and I think that would really help get the point across.
Sorry, I don't have a such figure. You can find the multiplicity count distribution easily by using jellyfish.
For the coefficient, to be more precisely, since when t is large, Poisson distribution with parameter t can be approximated by a normal distribution N(t, t). And the probability of >t+6sqrt(t) for the normal distribution is about 1e^-9, which is fairly conservative.
Hi,
I was reading through the Rcorrector paper and I have a question about the section describing the k-mer level threshold. I understand the general idea for determining alpha, but I am unsure about the second term in the g(t) function.
What is the purpose of the second term with the square root. In particular, why is the root multiplied by 6?
Thanks, Matthew