samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

Slice MD5 check allows accepts an MD5 match for a partial slice span #1426

Closed cmnbroad closed 2 years ago

cmnbroad commented 5 years ago

The current implementation of slice md5 validation proceeds in two steps; it first tries to validate the md5 for the entire slice span. If that fails, it falls back to checking the md5 for a smaller span (terminating one base earlier), and accepts the md5 as a match if either succeeds. Its unclear if this is actually necessary, or is just a leftover artifact from old code, but my reading of the spec would have me think only the first test is valid.

cmnbroad commented 5 years ago

@jkbonfield Any insight on this ?

jkbonfield commented 5 years ago

I don't understand why it would check for one base fewer, unless it is to compensate for some historic (or worse, current!) bug that accidentally wrote the wrong md5sums before. It's not a case I was aware of.

It seems to have appeared here: https://github.com/enasequence/cramtools/commit/7597847f#diff-93aa6398ff85742d5ddb208750100499R89 but the commit message doesn't give any clues.

There is also issue https://github.com/samtools/hts-specs/issues/357 to consider as this is a difference between the implementations. The specification states that in the case of an embedded reference the MD5 sum can either be zero (don't check) or the MD5 of the embedded sequence, so I believe the Java implementation to be in error here.

In cases of strong heterozygosity it can be beneficial to embed a local consensus (albeit in lock-step with ref for coordinate purposes) instead of the actual reference. [Sadly though this isn't the panacea I believed it to be as we then waste space elsewhere encoding MD and NM tags verbatim as they'd be regenerated wrongly unless we still choose to generate them from external ref instead of embedded one.

cmnbroad commented 2 years ago

Closing as this is obsolete (I removed the mystery code referenced in the description some time back).