torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
643 stars 123 forks source link

--fastq_ascii and --fastq_qmax must be no more than 126 #564

Open goodcouscous opened 2 weeks ago

goodcouscous commented 2 weeks ago

https://github.com/torognes/vsearch/blob/e79adfd5c77e4a18c218b5aafa430344127e4a9f/src/vsearch.cc#L4744

what happens to reads with a q20 score when i set fastq_max to 90?

using this command vsearch --fastq_filter file_q20.fastq --fastq_qmax 90 -fastaout file_q20.fasta

torognes commented 2 weeks ago

Hi

i'll try to explain:

In a FASTQ file with sequencing reads the quality of each base is represented by a value that indicates the probability that the base is wrong.

The quality (Q) of a base is calculated by the formula Q = - 10 log10( p ), where p is the error probability. For example, if the probability that the base is wrong is 0.01 (or 1%), the quality value Q = - 10 log10( 0.01) = - 10 * -2 = 20.

This quality value is represented in the FASTQ file by a character (on the fourth line in each entry). For this character to be printable and visible in an ordinary plain text file, the character must be in the range from the character "!" (exclamation point) (ascii value 33) to "~" (tilde) (ascii value 126). To convert from the quality value Q to the character shown in the FASTQ file, the value 33 is usually added to the quality value. A quality value of 20 will therefore be represented by character number 53 in the ascii table, which is the character "5".

Wikipedia has a nice page about this: https://en.wikipedia.org/wiki/FASTQ_format

The "magic" number 33 that is added to the quality value is the same as the value that is specified with the "--fastq_ascii" option in VSEARCH. (In some old FASTQ files the value may be 64.)

To keep the character that represents the quality within the usable range (33-126), the qualty value (Q) must be kept within the range 0 to 93. Usually Q values above 41 are not used with Illumina sequencers, and some software will not handle higher Q values. However, some new sequencing technologies (Nanopore, PacBio) or merging of reads may give higher quality sequences and corresponding higher Q values. The maximum Q value that will be used by VSEARCH can be limited with the "--fastq_qmax" and "--fastq_qmax_out" options. The default is 41, but you may increase it to 93 if you want to, usually without problems. However, this is rarely necessary. We have considered increasing the default to 93, but have so far left it as it is.

Using the "--fastq_qmax_out" option in VSEARCH will adjust down any quality values at the upper bound specified, a Q value of 50 will be adjusted down to 41 by default. If a higher Q value than specified with "--fastx_qmax" is found in the input file, I think VSEARCH will issue a warning.

When you run "vsearch --fastx_filter" with ordinary reads you usually don't have to specify neither the "--fastq_ascii " nor the "--fastq_qmax" options. If you have unusually high quality reads, you should specify "--fastq_qmax 93".

The "--fastx_filter" in VSEARCH does not filter directly on Q values, but you can truncate sequences (strip off bad quality sequence in the 3' end) with the "--fastq_truncqual" option. However, you can discard sequences with an overall low quality using the "--fastq_maxee" or "--fastq_maxee_rate" options. For instance "--fastq_maxee 1" will discard sequences with a total expected error above 1. The "--fastq_maxee_rate 0.01" will discard sequences with an average of 1% error per base. This roughly corresponds to an average Q value of 20.