samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/

CRAM files have zero-length blocks with a compression method associated with them (so cannot be decompressed). #1633

Open jkbonfield opened 1 year ago

jkbonfield commented 1 year ago

Description of the issue:

htsjdk produces CRAM files containing zero-length blocks that have a compression codec listed. A block that is empty after compression is valid. An empty block in RAW (uncompressed) is also valid, but if the block states that it is compressed by a specific codec, then its contents should be a valid byte stream for that codec (even if it decodes to zero bytes).

An example from SAMEA3302751.alt_bwamem_GRCh38DH.20200922.Finnish.simons.cram, viewed using cram_dump:

        Block 4/36
            Size:         0 comp / 0 uncomp
            Method:       RANS0 (4)
            Content type: EXTERNAL
            Content id:   3
            Keys:         RI 
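
To make the condition concrete, here is a minimal sketch of the check a reader could apply to a block header. The class and method names are hypothetical and are not part of htsjdk or any other library; the only facts assumed are that method id 0 is RAW in the CRAM specification and that RANS0 is method 4, as in the dump above.

```java
// Hypothetical helper, not an htsjdk API: flags blocks like the one in the dump.
final class ZeroLengthBlockCheck {
    // CRAM block compression method id for RAW (uncompressed).
    static final int RAW = 0;

    // True when a block stores zero bytes yet names a codec, so there is
    // no codec byte stream for a decoder to work on.
    static boolean isSuspect(int methodId, int compressedSize) {
        return compressedSize == 0 && methodId != RAW;
    }

    public static void main(String[] args) {
        // Values from the cram_dump output: Method RANS0 (4), Size 0 comp.
        System.out.println(isSuspect(4, 0));   // prints "true"
        // An empty RAW block is fine.
        System.out.println(isSuspect(RAW, 0)); // prints "false"
    }
}
```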

Your environment:

This was tested using htsjdk 2.26.11 with Java build 11.0.16+8-post-Ubuntu-0ubuntu118.04.

Steps to reproduce

The easiest way to reproduce this is to convert the above file back to CRAM again using SamFormatConverter. I did this to confirm that it still happens and isn't just a historic problem.

Expected behaviour

Zero-length blocks should ideally just not be stored, as they're not required. If it's easier code-wise to keep them, then the method field should be RAW, so that no attempt is made to uncompress them.
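
On the writer side, the second option could look roughly like the sketch below. This is only an illustration under stated assumptions: `BlockMethodChoice` and `methodFor` are hypothetical names, not htsjdk code, and the method ids (RAW = 0, rANS = 4) are the CRAM specification values.

```java
// Hypothetical sketch of the writer-side rule, not htsjdk's implementation.
final class BlockMethodChoice {
    static final int RAW = 0;   // CRAM method id for uncompressed data
    static final int RANS = 4;  // CRAM method id for rANS, as shown by cram_dump

    // Returns the method id to record in the block header: RAW when there
    // is nothing to compress, otherwise the codec the writer asked for.
    static int methodFor(byte[] uncompressed, int requestedMethod) {
        return uncompressed.length == 0 ? RAW : requestedMethod;
    }

    public static void main(String[] args) {
        System.out.println(methodFor(new byte[0], RANS));       // prints "0"
        System.out.println(methodFor(new byte[] {1, 2}, RANS)); // prints "4"
    }
}
```

With this rule an empty block is still written, but a decoder sees RAW and never hands zero bytes to a codec.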

See https://github.com/zaeleus/noodles/issues/131 for a case where this triggered an error in a spec-compliant decoder.

lbergelson commented 1 year ago

@jkbonfield Thanks for reporting this.