torognes / vsearch

Versatile open-source tool for microbiome analysis

How is the expected error rate calculated? #485

Closed Statistic-Qin closed 2 years ago

Statistic-Qin commented 2 years ago

Hi, thanks to all the authors first! In my project, I use the --fastx_filter command with --fastq_maxee_rate set to 0.01, and I find that some sequences have an expected error > 1. How does that happen? Every sequence's error should be between 0 and 1. Is the expected error also between 0 and 1?

Statistic-Qin commented 2 years ago

In the attached image, the first sequence's EE is 1.2539. Hoping for your answer!

frederic-mahe commented 2 years ago

@Statistic-Qin the expected error (EE) is defined as the sum of the error probabilities for all positions in a sequence: $EE = \sum_i p_i = \sum_i 10^{-Q_i/10}$, where $Q_i$ is the quality score at position $i$. It is the sum of many values in [0, 1], so it can be greater than one.

--fastq_maxee_rate 0.01 means that vsearch will discard sequences whose average error probability per position is greater than 0.01 (an error probability of 0.01 corresponds to a quality of Q20).
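
For illustration, the formula above can be sketched in Python. This is only a sketch, not vsearch's actual implementation: the function name `expected_error` is hypothetical, and the quality string is assumed to use the Phred+33 encoding.

```python
def expected_error(quality: str, offset: int = 33) -> float:
    """Sum of per-position error probabilities, p_i = 10 ** (-Q_i / 10).

    `quality` is the quality line of a FASTQ record; `offset` is the
    ASCII offset of the encoding (33 is assumed here).
    """
    return sum(10 ** (-(ord(char) - offset) / 10) for char in quality)


# 150 positions, all at Q20 ('5' in Phred+33), so each p_i is 0.01.
# Every term is far below 1, yet the sum exceeds 1.
q20_read = "5" * 150
print(round(expected_error(q20_read), 4))  # 1.5
```

Each term lies in [0, 1], but a long enough read accumulates an EE above one even when every position is of reasonable quality.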

torognes commented 2 years ago

Please note that when you use the --fastq_maxee_rate option it applies to the average expected error across the sequence, which will be a number between 0 and 1. When you use the --fastq_maxee option, it applies to the total expected error for the sequence, which will vary between 0 and the length of the sequence. It is common to use --fastq_maxee 1.0 or a number of that magnitude. It allows for up to one expected wrong base per sequence. This is equivalent to using --fastq_maxee_rate 0.01 if the sequences are 100 bp long.
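
To make the difference between the two options concrete, here is a minimal Python sketch. It is not vsearch code: the helper names `expected_error`, `passes_maxee` and `passes_maxee_rate` are hypothetical, Phred+33 encoding is assumed, and the quality string is assumed to be non-empty.

```python
def expected_error(quality, offset=33):
    # Total expected error: sum of 10 ** (-Q / 10) over all positions.
    return sum(10 ** (-(ord(char) - offset) / 10) for char in quality)


def passes_maxee(quality, maxee):
    # Mirrors --fastq_maxee: keep the read unless its total EE
    # is greater than the threshold.
    return expected_error(quality) <= maxee


def passes_maxee_rate(quality, maxee_rate):
    # Mirrors --fastq_maxee_rate: keep the read unless its average EE
    # per position (EE divided by length) is greater than the threshold.
    return expected_error(quality) / len(quality) <= maxee_rate


# 250 positions at Q23 ('8' in Phred+33): EE is about 1.25,
# while the average EE per position is about 0.005.
read = "8" * 250
print(passes_maxee(read, 1.0))        # False
print(passes_maxee_rate(read, 0.01))  # True
```

A read like this one is kept by a rate threshold of 0.01 even though its total EE is above 1, which is the situation described in the original question. For reads of exactly 100 positions, the two thresholds 1.0 and 0.01 select the same reads.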

Statistic-Qin commented 2 years ago

I think I understand the meaning of EE now. The EE in the picture is the sum over each sequence, which corresponds to --fastq_maxee. The --fastq_maxee_rate applies to EE/N, where N is the length of the sequence, i.e. the mean over all positions of the sequence.

frederic-mahe commented 2 years ago

I've tried to improve the manpage entries for maxEE and maxEE_rate (see bae03fca37150b3fa4501446fdfe418f379b5143). Entries now read as such:

--fastq_maxee real
         When using --fastq_filter, --fastq_mergepairs or
         --fastx_filter, discard sequences with an expected error
         greater than the specified number (value ranging from 0.0
         to infinity). For a given sequence, the expected error is
         the sum of error probabilities for all the positions in
         the sequence. In practice, the expected error is greater
         than zero (error probabilities can be small but not null),
         and at most equal to the length of the sequence (when all
         positions have an error probability of 1.0).

--fastq_maxee_rate real
         When using --fastq_filter or --fastx_filter, discard
         sequences with an average expected error greater than the
         specified number (value ranging from 0.0 to 1.0 included).
         For a given sequence, the average expected error is the
         sum of error probabilities for all the positions in the
         sequence, divided by the length of the sequence.