torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
643 stars 123 forks source link

fastq --eeout: report more precise expected error values #500

Closed frederic-mahe closed 1 year ago

frederic-mahe commented 1 year ago

when using the --eeout option, vsearch reports expected error values with a precision of 1E-4:

https://github.com/torognes/vsearch/blob/3ab323616159ea39598d7ddd1dc9fa38864ee3a1/src/fastq.cc#L572

This is fine when working with quality values ranging from 0 to 40. With such a range, the smallest possible expected error is indeed 10E-(Q/10) = 1E-4.

When extending the possible range of quality values to 41 with Illumina 1.8+, or to 93 (!) with PacBio's HiFi reads, then the current precision is not sufficient anymore:

# I = 40, ee=0.0001
printf "@s1\nA\n+\nI\n" | \
    vsearch --fastq_filter - --quiet --eeout --fastqout -

# J = 41, ee=0.0001
printf "@s1\nA\n+\nJ\n" | \
    vsearch --fastq_filter - --quiet --fastq_qmax 41 --eeout --fastqout -

# ~ = 93, ee=0.0000
printf "@s1\nA\n+\n~\n" | \
    vsearch --fastq_filter - --quiet --fastq_qmax 93 --eeout --fastqout -

Maybe the precision should be increased to cover at least Q = 41 (ee = 0.000079433)? and possibly Q = 93 (ee ~ 0.0000000005)?

torognes commented 1 year ago

In commit 001e25d, the precision has been increased to always show at least 4 significant digits, but adjusting the number of digits shown automatically:

$ printf "@s1\nA\n+\n~\n" | vsearch --fastq_filter - --quiet --eeout --fastqout - --fastq_qmax 93
@s1;ee=0.0000000005012
A
+
~

$ printf "@s1\nA\n+\n\"\n" | bin/vsearch --fastq_filter - --quiet --eeout --fastqout - --fastq_qmax 93
@s1;ee=0.7943
A
+
"
torognes commented 1 year ago

Also added for FASTA files as well, with commit f7d08bd9579d2a2666354bd5ab982d0a2d2c18f1.

frederic-mahe commented 1 year ago

Excellent! I've added tests (https://github.com/frederic-mahe/vsearch-tests/commit/7d4e97a13b0d0dd6070e5ae6fecebf4bfe31cee4 and https://github.com/frederic-mahe/vsearch-tests/commit/1a3c6951eca49afff723003216369f7b86660d28) to cover all the branches in 001e25d

Same thing for the --fastaout case (f7d08bd).