torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
643 stars 123 forks source link

EE: document the meaning of expected error values #518

Closed frederic-mahe closed 1 year ago

frederic-mahe commented 1 year ago

I was told once that an expected error of 1.0 for a given sequence means that there is a 50% chance of having zero error and 50% chance of having one or more errors. If that's correct, then how to interpret an expected error value of 2.0, or 4.0?

frederic-mahe commented 1 year ago

I could not find any reliable information on the correct way to interpret expected error values, so I've decided to do the math and to put the results here (for now).

It is true that for large expected error values (EE), there is a 50% risk of having less than EE errors, and 50% of risk of having more than EE errors.

For smaller EE values (0 to 10), this is not true.

According to Wikipedia:

The Bernoulli distribution is the discrete probability distribution of a random variable which takes the value 1 with probability p, and the value 0 with probability q = 1 − p. Less formally, it can be thought of as a model for the set of possible outcomes of any single experiment that asks a yes–no question.

In the case of a fastq read, for each position we have an error probability, and the question is: is there an error at this position? The same question can be asked for the whole read. By summing all error probabilities, we obtain the expected error (EE).

For any (positive non-null) expected error value, the distribution of outcomes (what is the probability of observing k errors?) is approximated by a Poisson distribution.

According to Wikipedia:

The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event.

In the Poisson distribution, lambda represents our expected error value, and k is a number of errors (zero, one, two, three, etc). We use the floored expected error value as our pivot value (in R):

library(tidyverse)

EE_values <- c(0.01, 0.1, 0.5, 1.0, 1.5, 2.0,
               2.5, 3.0, 3.5, 4.0, 5.0, 10.0, 50.0)

probability_to_see_0_to_EE_errors <- function(expected_error) {
      seq(0, floor(expected_error)) %>%
          map(~ dpois(x = .x, lambda = expected_error)) %>%
          reduce(sum)
}

EE_values %>%
      map_dbl(probability_to_see_0_to_EE_errors) %>%
      as_tibble() %>%
      rename(zero_to_k = "value") %>%
      mutate(EE = EE_values,
             k = floor(EE_values),
             more_than_k = 1 - zero_to_k) %>%
      relocate(EE, k, .before = zero_to_k)
        EE     k zero_to_k more_than_k
     <dbl> <dbl>     <dbl>       <dbl>
   1  0.01     0     0.990     0.00995
   2  0.1      0     0.905     0.0952 
   3  0.5      0     0.607     0.393  
   4  1        1     0.736     0.264  
   5  1.5      1     0.558     0.442  
   6  2        2     0.677     0.323  
   7  2.5      2     0.544     0.456  
   8  3        3     0.647     0.353  
   9  3.5      3     0.537     0.463  
  10  4        4     0.629     0.371  
  11  5        5     0.616     0.384  
  12 10       10     0.583     0.417  
  13 50       50     0.538     0.462

For instance, a read with an expected error of 0.5 has a 60.7% chance of having zero error, and a 39.3% chance of having one or more errors. Specifically:

A read with an expected error of 1.0 has a 73.6% chance of having zero or one error, and a 26.4% chance of having more than one error. Specifically:

A read with an expected error of 4.0 has a 62.9% chance of having zero, one, two, three or four errors, and a 37.1% chance of having more than four errors. Specifically:

For large expected error values, the error distribution converges to a 50% chance of having at most k errors, and 50% of chance of having more than k errors.

frederic-mahe commented 1 year ago

commit https://github.com/torognes/vsearch/commit/828af40ab1f63a961741fdd3e2a2140af6421a3a adds some of the above information to the vsearch manpage.