Question about k-mer level threshold

mourisl / Rcorrector

Error correction for Illumina RNA-seq reads

GNU General Public License v3.0

63 stars 18 forks source link

Question about k-mer level threshold #29

Closed mtmcgowan closed 3 years ago

mtmcgowan commented 3 years ago

Hi,

I was reading through the Rcorrector paper and I have a question about the section describing the k-mer level threshold. I understand the general idea for determining alpha, but I am unsure about the second term in the g(t) function.

What is the purpose of the second term with the square root. In particular, why is the root multiplied by 6?

Thanks, Matthew

mourisl commented 3 years ago

The k-mer count is a count variable, and the count variable usually follows Poisson distribution. The standard deviation then is the square root of the mean (alpha*t in this case). So mean+6sqrt(x) is kind of a conservative estimation of the upper bound for the count for the erroneous kmer. Does this help?

mtmcgowan commented 3 years ago

Yes, this mostly makes sense. I'm still working through the advantage of using a constant multiplier for your added standard deviation rather than simply relaxing your alpha to a lower percent. Do you by chance have a histogram of this multiplicity count distribution for a sample RNA-seq dataset? I'm preparing a short review presentation of your algorithm for a bioinformatics group and I think that would really help get the point across.

mourisl commented 3 years ago

Sorry, I don't have a such figure. You can find the multiplicity count distribution easily by using jellyfish.

For the coefficient, to be more precisely, since when t is large, Poisson distribution with parameter t can be approximated by a normal distribution N(t, t). And the probability of >t+6sqrt(t) for the normal distribution is about 1e^-9, which is fairly conservative.