samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
648 stars 174 forks source link

CRAM: clarification of slice and container headers #370

Closed jmthibault79 closed 5 years ago

jmthibault79 commented 5 years ago

@cmnbroad and I were talking last week about our continuing HTSJDK refactoring work, and we have some questions regarding the Slice Header and Container Header in the CRAM 3 spec, particularly regarding which combinations of read types are valid to combine in a slice or container, and how these impact the header values.

Read Types

First, we should clarify how different read types are treated in 8.5 Slice header block. Mapped is clear, but unmapped reads can be placed at a location or not. It can make sense to include placed unmapped reads in slices with mapped reads on the associated reference, but I would think unplaced unmapped reads should be strictly assigned to their own slice. Also, what is the definition of an "unsorted read" here? Is it a read from a CRAM file which isn't coordinate-sorted? It's not clear to me that this implies anything in particular about its mapping status. Or is "unsorted" used here to mean "unmapped unplaced"? We should clarify, either way.

Slice Types

Certain combinations of reads and references make sense as logical slices, but others do not. We should make this explicit, and document conventions for how to encode these types in the slice headers. We've come up with the following set, which may require some iteration:

  1. All reads are either mapped to or unmapped but placed on the same external reference
  2. All reads are either mapped to or unmapped but placed on the same embedded reference
  3. All reads are unmapped and unplaced
  4. Multi-Reference. Anything can go here except embedded-reference reads because we have no way of specifying multiple embedded references.

The Slice Header fields for each Slice Type are:

Columns = Slice Type Rows = Header Fields

types -> Single External Ref Single Embedded Ref Unmapped Unplaced Multiple External Ref
Ref Seq ID [A real Ref ID] Q: Do embedded refs have an ID? -1 UNMAPPED -2 MULTI
Align Start min(read starts) min(read starts) -1 UNMAPPED -1 UNMAPPED
Align Span max(read ends) -min(read starts) max(read ends) -min(read starts) 0 UNMAPPED 0 UNMAPPED
Embed Ref Block Content ID -1 NONE [A real Block ID] -1 NONE -1 NONE
Ref MD5 [MD5 of the subset of the Ref Sequence which covers the span] [MD5 of the subset of the Ref Sequence which covers the span] 16 0-bytes 16 0-bytes

Container Header vs. Slice Header

7 Container structure conflicts with 8.5 Slice header block.

The sentinel value -2 MULTI is available for containers with multiple references but this is not given as a possibility for slices. There are references to multi-ref slices elsewhere in the spec (like 10.2 and 11.4) so I'm assuming this is an oversight for slice. -1 UNMAPPED is only for "unmapped" for containers but is also for "unsorted" for slices. Which is correct?

Alignment Start for unmapped reads is 0 for container and -1 for slice. Is there a reason for this discrepancy?

Storing reference ID and alignment span at the container level implies to me that there are "Container Types" as well, with restrictions on which slice types may be contained within. We should make that explicit here:

  1. A Single Ref Container may contain N Slices of reads on that specific Ref (no external vs. embedded distinction needed here)
  2. An Unmapped Container may contain N Unmapped Slices
  3. A Multi-Ref Container may contain N Slices of any type
jkbonfield commented 5 years ago

Regarding slice types, yes it makes sense to put unmapped but placed reads in the same slice as their mapped brethren. Indeed this is what happens. Section 8.1 is a bit misleading here as it has MAPPED_SLICE_HEADER listed as a block content type which is used everywhere. CRAM v1.0 also had UNMAPPED_SLICE as a type, but it was dropped in v2.0. I forget the precise reasons why, but the C code still looks for MAPPED_SLICE_HEADER and only decodes ref_id, ref_start and ref_end in that case (they were absent for UNMAPPED_SLICE). The java code reads these fields regardless of the block content type. Perhaps it was always this way and this is just one of the discrepancies between original implementation and specification that were tidied up in v2.0 (and why it got removed?). We should probably just change section 8.1 to state "SLICEHEADER" and omit the MAPPED portion.

For section 8.5, we're talking about the entire slice and it only has one set of fields, so "unmapped" here is indicating the entire slice is unmapped. It could perhaps be explained better though. Unsorted is deeply confusing too - agreed!

If using an embedded reference, the ref sequence ID is still used for purposes of filling out the BAM "refID" field. It's ignored as far as reference comparison goes, but the numeric value should be matching one of the @SQ lines. It's a shame we can't cope with more than one embedded reference, but that's for CRAM4, CRAM5 etc one day I guess!

Span I think is rightmost - leftmost + 1. Ie a one base read at position 55 will have start 55 and span 1.

I'm not sure on the MD5sum when using embedded references. In this case I think I'd prefer the flexibility of the MD5sum matching the embedded reference as a sanity check or zero, rather than the actual reference. This provides us with a future mechanism to produce fake references that do not appear in the SQ headers. This was always a proposal right back to the Fritz paper itself, but never implemented. One plan was for small denovo assemblies (or just hashing tweaks) to do consensus bases compression instead of reference based compression for the totally unmapped / unplaced reads. This can help compression ratios where there are lots of large insertions or contaminations giving portions of high depth sequence data that hasn't been mapped to a reference.

It looks like I fill out MD5sum for embedded references (it'll match the actual reference too, but I was exploring making them different), but don't check it during decode if the reference has been embedded.

Align start I differ with for unmapped / multi-ref (see below).

I agree with the rest of your table.

For the container vs slice part, agreed on the lack of -2 being listed in the slice header: it is permitted there and should be listed. My CRAM spec printout has that pencilled in so I clearly meant to fix it. Indeed I thought I had! Oops.

The unmapped vs unsorted is just confusing. It should be unmapped only. Unsorted here means non-position sorted, which data can be stored but it'll just lead to multi-ref mode (or LOTS of tiny tiny containers), so it's not -1 anyway. Realistically the efficient way of doing unsorted data is just to give up and store the sequence verbatim without reference compression, as it avoids thrashing the memory, but that is a design decision and not a specification requirement. For clarity we should just cull any mention of unsorted.

I've no idea why alignment start is discrepant between slice vs container for unmapped data. We don't explain what it should be for multi-ref either. I set all of the start / span fields to zero for both unmapped and multi-ref data. It has no meaning for unmapped or muiti-ref files anyway, which is why the EOF block is basically a container header stating unmapped data and start pos of 0x454f46 ("EOF") with no records contained within it. It's the CRAM equivalent to the empty bgzf block.

Agreed regarding container types too, but it needs wording in the description rather than adding a new block content type. Generally we don't use multi-slice containers. They've been there since cram1 I think, but I always put one slice in one container (unless requested otherwise on the command line). It's an interesting point - what if we have a 4 slice container, each with 1 reference only in each. The container is multi-ref but the slices are not. I've no idea if this would break things! I think I'd prefer to gloss over it by simply disallowing it. :-)

jmthibault79 commented 5 years ago

Deferring the MD5sum work to PR #248 (for issues #201 and #357)

PR for remaining changes is #372

jmthibault79 commented 5 years ago

Summarizing some recent discussions elsewhere:

Q1: How should we divide slice types? 4 ways as in the table above, collapse down to 3 by combining Single External and Single Embedded, or add a 5th type for non-reference-compressed?

It's clear to me that we do need at least 3: a SINGLE REF type where we know all reads within are mapped (or unmapped-placed) to the same reference, and can therefore use delta-alignment to encode/decode the reads, a MULTI REF type where we know that we have to look at the reads themselves for more details, and an UNMAPPED type to collect all unplaced reads together.

Is this categorization sufficient for handling embedded-reference reads and non-reference-compressed reads in the same way, or do we need to treat them differently? This isn't clear to me since I don't have experience with these types of reads. If the only differences are how we handle reference MD5s (and the embedded ref block ID) then my question becomes: what do we gain from making these into types?

Q2: How do container types interact with slice types?

As discussed above, the simplest thing would be to require containers to only consist of slices of the same type and sequence ID (if single-ref). Is there a real benefit that "mixed" containers can give us that multi-ref slices don't? If not, then we can define container type = the common slice type for all of the container's slices.

jkbonfield commented 5 years ago

Although we deferred the MD5 sum description to another PR, having seen your extra table and the improved clarification it brings to MD5s, I think the table below it needs more work too.

I commented on the PR with my thoughts, but I am rather ambivalent over whether the new table would bring much given the (much needed) updates to the table below. I don't overly object to it, but I also think the other improvements to some extent remove the need for the clarification table.

The questions over how many more types of slice we want to show, 3, 4 or even 5 columns, do highlight that it's hard to convey the right information in this manner. So perhaps getting the descriptions nailed down more in the main table is sufficient.

Regarding container types, that's basically a tricky one. What happens if a container has multiple slices in it and those multiple slices cover different references, or some are mapped and others are entirely unmapped (-1)? I think all we can do is make the outer container use -2 for multi-ref, but that's confusing and I don't know without checking the code if we check the correct thing! We should check the slice, but maybe we check the container? The alternative is to state that the container must be compatible with the slice. Ie we can't have a multi-ref container containing single-ref (but different) slices (this is what my implementation does). This is certainly a weakness with the current specification.

Right now by default no one is using multiple slices per container. However it is possible to create these via the API or via setting specific options, eg samtools view -O cram,slices_per_container=10. Does htsjdk offer any way of building files with multiple slices per container?

I tested my implementation and it appears that it'll terminate the container early if the slices reach the end of a chromosome, so that the other container matches the reference of all slices contained within it. In the above example command, I got a lot of containers with 10 slices, one with 2 slices, and then a load more (chr2 now) with 10 slices again.

As soon as it got to doing multi-ref slices (ref -2) due to lots of tiny chromosomes with just a few reads in each, the outer container also switched to ref -2 and once again it had 10 slices per container. Thus the implementation appears to be written to ensure outer container reference information matches the inner slices.

jmthibault79 commented 5 years ago

In the absence of a compelling reason to allow differences, my preference is to require agreement between a Container's ref ID and those of its Slice(s).

HTSJDK internals do allow the creation of multiple slices in a container, but it's not clear to me whether the ability to do is exposed to callers.

jkbonfield commented 5 years ago

I'm happy with that decision. So for clarity, we want it to state that multi-ref containers must contain multi-ref slices, and single reference contains must only contain slices that use the same single reference? Or words to that effect.

jmthibault79 commented 5 years ago

Yes

jmthibault79 commented 5 years ago

@jkbonfield #401 for the "Container Type" question, and then we're done.

cmnbroad commented 5 years ago

Closing this as I think it was resolved by #401; we can take up any residual issues separately.