Open cmnbroad opened 5 years ago
@jmthibault79 Does this ticket/description sound accurate to you?
The problem description looks right to me.
However: we would also need Alignment Start (AP series) and Alignment Span values. Alignment Span derives from the combination of Read Length (RL series) and Read Features. Read Features involves quite a few data series, so at that point it's not clear to me that we'd gain a lot by decoding ~half of the series instead of the full set.
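To make the dependency concrete, here is a rough Python sketch of the per-reference bookkeeping an index entry for a MULTI_REF slice needs. It is not the htsjdk implementation: the `Record` type, the single-letter feature kinds, and the function names are all hypothetical, and the span rule is a simplified model (insertions and soft clips shorten the reference span, deletions and reference skips lengthen it). Unmapped reads are not treated specially here.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple

# Hypothetical feature kinds for illustration only (not CRAM's actual encoding).
READ_ONLY_KINDS = {"I", "S"}  # insertion, soft clip: consume read bases only
REF_ONLY_KINDS = {"D", "N"}   # deletion, reference skip: consume reference only


@dataclass
class Record:
    """Hypothetical decoded record holding only the fields the index needs."""
    ref_id: int                 # reference ID (RI series)
    alignment_start: int        # 1-based alignment start (AP series)
    read_length: int            # read length (RL series)
    features: List[Tuple[str, int]] = field(default_factory=list)  # (kind, length)


def alignment_span(rec: Record) -> int:
    """Reference bases covered by the read, under the simplified model above."""
    span = rec.read_length
    for kind, length in rec.features:
        if kind in READ_ONLY_KINDS:
            span -= length
        elif kind in REF_ONLY_KINDS:
            span += length
    return span


def spans_by_reference(records: Iterable[Record]) -> Dict[int, Tuple[int, int]]:
    """Map each reference ID to (min alignment start, total span covered)."""
    bounds: Dict[int, Tuple[int, int]] = {}
    for rec in records:
        start = rec.alignment_start
        end = start + alignment_span(rec) - 1
        if rec.ref_id in bounds:
            lo, hi = bounds[rec.ref_id]
            bounds[rec.ref_id] = (min(lo, start), max(hi, end))
        else:
            bounds[rec.ref_id] = (start, end)
    return {ref: (lo, hi - lo + 1) for ref, (lo, hi) in bounds.items()}


records = [
    Record(0, 100, 50),
    Record(0, 120, 50, [("D", 10)]),
    Record(1, 5, 30, [("S", 5), ("I", 2)]),
]
print(spans_by_reference(records))
```

Even in this toy version, the reference ID alone only tells you which keys exist in the result. The start and span for each key still need the alignment start, the read length, and every feature that changes the span.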
Indexing a CRAM file generally only involves consulting the container headers, and generally doesn't require decoding/hydrating all of the records in a container. The exception is MULTI_REF slices, which have multiple reference IDs that are not explicitly stored in the container header. The current indexing implementation uses the MultiRefSliceAlignmentSpanReader class to generate CRAI entries for MULTI_REF slices, but that decodes the entire container. It should be possible to decode only the RI data series to get the reference IDs to generate the index.