samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/

CRAM Multi-ref slice indexing is inefficient #1347

Open cmnbroad opened 5 years ago

cmnbroad commented 5 years ago

Indexing a CRAM file generally requires only the container headers; the records in a container don't need to be decoded/hydrated. The exception is MULTI_REF slices, which span multiple reference IDs that are not explicitly stored in the container header. The current indexing implementation uses the MultiRefSliceAlignmentSpanReader class to generate CRAI entries for MULTI_REF slices, but that class decodes the entire container. It should be possible to decode only the RI data series to get the reference IDs needed to generate the index.
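For illustration, the per-reference grouping that indexing a MULTI_REF slice needs could be sketched roughly as follows. This is a minimal sketch with hypothetical types (`RecordSpan`, `RefSpan`, `spansByReference`), not htsjdk's actual classes, and it assumes each record's reference ID, alignment start, and span are already available:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MultiRefSpanSketch {
    /** Hypothetical per-record view: reference ID, 1-based alignment start, span in reference bases. */
    static final class RecordSpan {
        final int refId;
        final int start;
        final int span;

        RecordSpan(int refId, int start, int span) {
            this.refId = refId;
            this.start = start;
            this.span = span;
        }
    }

    /** Hypothetical per-reference summary from which one index entry could be built. */
    static final class RefSpan {
        final int start;
        final int span;
        final int count;

        RefSpan(int start, int span, int count) {
            this.start = start;
            this.span = span;
            this.count = count;
        }

        /** Union of two spans on the same reference: smallest start, largest end. */
        RefSpan combine(RefSpan other) {
            int newStart = Math.min(start, other.start);
            int newEnd = Math.max(start + span, other.start + other.span);
            return new RefSpan(newStart, newEnd - newStart, count + other.count);
        }
    }

    /** Groups the records of one slice by reference ID, accumulating a covering span per reference. */
    static Map<Integer, RefSpan> spansByReference(List<RecordSpan> records) {
        Map<Integer, RefSpan> byRef = new TreeMap<>();
        for (RecordSpan r : records) {
            byRef.merge(r.refId, new RefSpan(r.start, r.span, 1), RefSpan::combine);
        }
        return byRef;
    }
}
```

With records `(ref 0, start 100, span 50)`, `(ref 1, start 10, span 20)`, and `(ref 0, start 120, span 60)`, this yields start 100 / span 80 for reference 0 and start 10 / span 20 for reference 1.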

cmnbroad commented 5 years ago

@jmthibault79 Does this ticket/description sound accurate to you?

jmthibault79 commented 5 years ago

The problem description looks right to me.

However, we would also need the Alignment Start (AP series) and Alignment Span values. Alignment Span is derived from the combination of Read Length (RL series) and the Read Features. Read Features involve quite a few data series, so at that point it's not clear to me that we'd gain a lot by decoding ~half of the series instead of the full set.
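To make the dependency on Read Features concrete, here is a rough sketch of how an alignment span could be derived from the read length and the features. The `Kind` enum, `Feature` class, and `alignmentSpan` method are hypothetical simplifications, not htsjdk's API. The sketch assumes the usual convention that insertions and soft clips consume read bases only, while deletions and reference skips consume reference bases only:

```java
import java.util.List;

final class AlignmentSpanSketch {
    /** Hypothetical, simplified subset of read feature kinds. */
    enum Kind { INSERTION, SOFT_CLIP, DELETION, REF_SKIP, SUBSTITUTION }

    /** Hypothetical read feature: its kind and the number of bases it covers. */
    static final class Feature {
        final Kind kind;
        final int length;

        Feature(Kind kind, int length) {
            this.kind = kind;
            this.length = length;
        }
    }

    /** Number of reference bases covered by a read, given its length and features. */
    static int alignmentSpan(int readLength, List<Feature> features) {
        int span = readLength;
        for (Feature f : features) {
            switch (f.kind) {
                case INSERTION:
                case SOFT_CLIP:
                    span -= f.length; // read bases with no reference counterpart
                    break;
                case DELETION:
                case REF_SKIP:
                    span += f.length; // reference bases with no read counterpart
                    break;
                default:
                    break; // e.g. substitutions leave the span unchanged
            }
        }
        return span;
    }
}
```

For example, a 100-base read with a 5-base soft clip, a 2-base insertion, and a 3-base deletion covers 100 - 5 - 2 + 3 = 96 reference bases. Each feature kind that changes the span is backed by its own data series, which is why decoding only RI would not be enough.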