samtools / htslib

C library for high-throughput sequencing data formats

Protection against oversized aux tags in CRAM #1613

Closed jkbonfield closed 1 year ago

jkbonfield commented 1 year ago

CRAM encoding can produce containers whose aux tags overflow the concatenated block of BAM aux tags used in the decoder, once the combined size of those aux tags becomes excessive.

This can happen in real-world data in some bizarre circumstances. For example, very long ONT records with per-base aux tags left intact, passed through an aligner that records secondary alignments with SEQ "*" but leaves the aux tags in place. Such records contribute no bases, so the limit on the number of bases per container is never triggered. The result is excessively large containers, and specifically aux tags that combine to more than 2GB.
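To make the arithmetic concrete, here is a minimal sketch in C. It is not htslib code: the `rec_summary` type, the function name, and any record sizes fed to it are hypothetical. It shows that records with SEQ "*" add nothing to the base count while their aux bytes can still sum past what a signed 32-bit size can hold.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-record summary; this is NOT an htslib type. */
typedef struct {
    int64_t seq_len;  /* number of bases; 0 when SEQ is "*" */
    int64_t aux_len;  /* bytes of BAM-encoded aux tags */
} rec_summary;

/*
 * Sum bases and aux bytes over n records using 64-bit counters.
 * Returns 1 if the combined aux size no longer fits in a signed
 * 32-bit integer, 0 otherwise.
 */
static int aux_total_exceeds_int32(const rec_summary *r, size_t n,
                                   int64_t *bases_out, int64_t *aux_out) {
    int64_t bases = 0, aux = 0;

    assert(r != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        bases += r[i].seq_len;
        aux   += r[i].aux_len;
    }
    *bases_out = bases;
    *aux_out   = aux;
    return aux > INT32_MAX;
}
```

With, say, 3000 such records of 1,000,000 aux bytes each and no sequence, the base count stays at zero while the aux total reaches 3,000,000,000 bytes, which is above `INT32_MAX` (2,147,483,647).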

We fix it in two ways.

1) Protect against existing files with this problem by detecting that the overflow may happen and simply bailing out. This is perhaps overly harsh, but previously this would simply have core dumped, and to date we've only ever had one report of this happening (yesterday), so I expect it's vanishingly rare.

2) Change the encoder so that it starts new containers based on the base+aux count rather than just the base count (in addition to the existing record number count).
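The two fixes above can be sketched roughly as follows. This is an illustration only, not the actual patch: every name here (`aux_append_fits`, `container_fill`, `container_limits`, `container_should_flush`) is hypothetical, and the limits are parameters rather than htslib's real defaults. The first function is the decoder-side guard, the second is the encoder-side decision to start a new container.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decoder side: may `add` more bytes be appended to an aux block that
 * already holds `used` bytes without exceeding a signed 32-bit size?
 * Returns 1 if it fits, 0 if the caller should bail out with an error.
 */
static int aux_append_fits(int64_t used, int64_t add) {
    if (used < 0 || add < 0)
        return 0;
    /* Compare by subtraction so the check itself cannot overflow. */
    return add <= (int64_t)INT32_MAX - used;
}

/* Hypothetical running totals for the container being built. */
typedef struct {
    int64_t nrec;    /* records added so far */
    int64_t nbases;  /* bases added so far */
    int64_t naux;    /* aux bytes added so far */
} container_fill;

/* Hypothetical limits; the values are up to the caller. */
typedef struct {
    int64_t max_rec;
    int64_t max_bases_plus_aux;
} container_limits;

/*
 * Encoder side: should the current container be closed before adding a
 * record with seq_len bases and aux_len aux bytes? Counting aux bytes
 * alongside bases means records with SEQ "*" still advance the limit.
 */
static int container_should_flush(const container_fill *c,
                                  const container_limits *lim,
                                  int64_t seq_len, int64_t aux_len) {
    if (c->nrec == 0)
        return 0;  /* an empty container always accepts one record */
    if (c->nrec + 1 > lim->max_rec)
        return 1;  /* existing record-count limit */
    return c->nbases + c->naux + seq_len + aux_len
           > lim->max_bases_plus_aux;
}
```

In this sketch the encoder check keeps new files well away from the 32-bit boundary, while the decoder check only matters for files that were already written with oversized containers.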