s4hts / HTStream

A high throughput sequence read toolset using a streaming approach facilitated by Linux pipes
https://s4hts.github.io/HTStream/
Apache License 2.0
49 stars 9 forks source link

quality range not checked leading to undefined behavior #250

Closed joe-angell closed 1 year ago

joe-angell commented 2 years ago

Quality scores according to the fastq wiki can be up to 93, so updating stats to allow that. Also check to make sure we aren't writing out of bounds.

Some of the stats classes were missing virtual destructors which also leads to undefined behavior.

joe-angell commented 2 years ago

@msettles @samhunter @dstreett someone give a quick review please, i'm not sure how you want to handle quality scores outside the range you had set. According to wikipedia "Sanger format can encode a Phred quality score from 0 to 93 using ASCII 33 to 126" so I raised it for stats.

samhunter commented 2 years ago

I think the correct fix for this is to add a "-q64" (or equivalent) switch to support this old Illumina format:

"Starting with Illumina 1.3 and before Illumina 1.8, the format encoded a Phred quality score from 0 to 62 using ASCII 64 to 126 (although in raw read data Phred scores from 0 to 40 only are expected)." --https://en.wikipedia.org/wiki/FASTQ_format#Encoding

@msettles with various new data types becoming available and potentially larger range of valid quality values, I wonder if the 0-40 range should be expanded to support 0-93 according to the Sanger format as @joe-angell rightly points out above. What do you think?

From my side, I don't see any problems with this, except that currently when we overlap PE reads we cap the q-score for matches at 40 to fit with Illumina encoding. I think we could probably still do this since the q-score is an approximation at that point anyway.

samhunter commented 1 year ago

I think we should merge it.

On Thu, Jun 22, 2023, 10:51 PM David Streett @.***> wrote:

@.**** approved this pull request.

I think it looks great @joe-angell https://github.com/joe-angell - and if it is causing a crash, we should merge it. (But let's give the other guys 24 hours ;) )

— Reply to this email directly, view it on GitHub https://github.com/s4hts/HTStream/pull/250#pullrequestreview-1494433559, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAE7ENSXJJFBMYVMAFS7YU3XMUVHVANCNFSM5Q2GUP3A . You are receiving this because you were mentioned.Message ID: @.***>