cram and bam footers for storing key/value pairs

sb10 commented 8 years ago

It would be very useful to be able to store arbitrary key/value pairs in the footer of both cram and bam files.

My use-case would be storing sequence checksums. The desire to have these in the footer and not the header is so that I can store these on a streamed in alignment file.

Backwards compatibility with tools that trip up on the footer could be achieved with a special argument to samtools view that emits the alignment file as-is, except for the footer.

I propose implementing a reference implementation of this in samtools, once a way of implementing a footer is suggested by responders.

sb10 commented 8 years ago

samtools meeting discussion: cram-only would be easiest. footer exclusion idea not so good? Stenography solution with storing data in penultimate block would be better? Storing in fake unaligned sequence with a name starting with ~?

sb10 commented 8 years ago

cram-only isn't ideal for my use case, because I want my seqchksums to verify the numerous bams we make in our pipelines. We don't currently deal with crams except as initial input. If we manage to re-engineer our pipelines to be all-cram, we probably won't have a backwards compatibility concern.

jkbonfield commented 8 years ago

We worked out one way of doing this in BAM as a transparent block of data - meaning existing BAM readers wouldn't spot it and zlib would produce zero additional output during decoding.

There are, no doubt, many ways to achieve it. Ideally we would extend the info field in gzip stream which is permitted, but not actually coded correct in many BAM implementations. More transparent would be to generate a zlib huffman table where the presence of a symbol acts as a bit of information (more if we wish to play with symbol code lengths). The data stream itself is empty, so it's pure data encoding via metadata. At a minimum that means maybe 9 bytes of information in the huffman table. Multiple blocks can be concatenated together to form arbitrary footer lengths. It's extreme solution, but it does appear to be technically possible.

For CRAM we already have the slice (or container?) key-type-value tuples in the same structure as aux blocks on reads. These were expressly designed for checksumming. Hence it's trivial to add a whole-file checksum using the same technique.

jrandall commented 8 years ago

Agree this would be useful, and I agree with Sendu the ability to have this in SAM/BAM as well as CRAM would be ideal. I had also been thinking about adding it to the bgzip fields, but actually it would be much better if uncompressing data did not result in losing the footer information, so I think I am strongly in favor of in-band signalling that would persist through all of the various formats. I suspect the best way to achieve might be to use some form of special read record (with no sequence and a special flag) that would tend to get ignored in the context of anything important.

jkbonfield commented 8 years ago

We discussed this in the office after last weeks meeting and some were against in-band methods. David (I think) raised the point that anything that modifies SAM without knowing about the special ~~~whatever fake reads (sorted to end of file by both coord and name order) would invalidate the footer, so the footer would have to somehow indicate when it is out of date (tricky as there's no generation number or similar in the header).

There is something to be said for having a system where the footer inherently gets invalidated by any command that modifies data but doesn't understand the footer. This is key for data checksums. Any bright ideas?

In the future this sort of thing can be achieved by the same type of technology for keeping .bam and .bam.bai in sync (random generated uuids that have to match), but as they don't yet exist in the header we cannot exploit them for the footer.

sb10 commented 8 years ago

My implementation of seqchksum in samtools would calculate the checksums as it streamed through the file, and then if a footer existed, compare the checksums, and die on a (desired kind of) mismatch.

So the footer being out of date will behave the same as the sequencing data becoming corrupted, and in both cases the user must investigate and update the footer in the former case. They will immediately learn not to use the tool that isn't handling the footer correctly (or to always post-process it's output with the samtools command that force-regenerates a new footer without aborting), so I'm not sure this is an issue?

jkbonfield commented 8 years ago

Musings on the mechanisms.

See eg https://partners.adobe.com/public/developer/en/ps/5002.EPSF_Spec.pdf page 14.

PostScript header has items like %%BoundingBox which needs computation of all item locations. This can be complex for streams to compute upfront, so it has the option of specifying "(atend)" as the value and then include a second %%BoundingBox in a %%Trailer section.

I would like to suggest that any items we have in the footer must also appear in the header with some form of deferred or atend token to indicate their presence in the footer. This tells us which checksums we'll need to compute, if any, allowing us to be efficient where appropriate.

Consider a worked example, running fixmates on an input file.

Open input and output files.
Read header from input, write header to output. Input header has either @CS seq deadbeef (example; CS=checksum) or @CS seq atend.
The usual while (seq=read(in)) {do_stuff(seq); write(out,seq);} style loop. During this we'll be computing the seq checksum, but not any other checksums as they're not in the header.
EOF on input; fetch footer and compare checksums; set expected output checksum.
EOF on output; automatically checks output checksum from expected value. If we didn't set an expected value then this wouldn't be done.
Close input stream.
Close output stream; this also automatically writes the footer from the generated checksums, although we can add an explicit footer writing API too.

The above is orthogonal to how the footer is encoded (in-band or out-of-band).

Edit: the presence of an atend header item should not mandate that it exists in the footer; the file may have been produced by something that didn't honour the footer. It is simply a request that the data may be in the footer and so must be computed.

vadimzalunin commented 8 years ago

@jkbonfield sorry if I misunderstood, are you suggesting to use FLG.FEXTRA? Also is huffman table for saving space? The key/value metadata should have negligible effect on file size any way. A secret layer sounds hack-ish to me but I can't think of anything better than extra block after EOF for footer. Given new types of SAM file header records that should be enough imo.

yfarjoun commented 8 years ago

It sounds like BAM and CRAM will have data that is not representable in SAM. this will make round-trip checks very difficult and as people start using this to put "real" data in, this will become a problem. I think that the data should be format agnostic, so that whatever solution we think of should have an implementation in all current data-formats, including SAM...but that's just MHO.

dbolser commented 8 years ago

Why can't this be done with the checksum in the header? On 14 Mar 2016 4:08 pm, "Sendu Bala" notifications@github.com wrote:

My implementation of seqchksum in samtools would calculate the checksums as it streamed through the file, and then if a footer existed, compare the checksums, and die on a (desired kind of) mismatch.

So the footer being out of date will behave the same as the sequencing data becoming corrupted, and in both cases the user must investigate and update the footer in the former case. They will immediately learn not to use the tool that isn't handling the footer correctly (or to always post-process it's output with the samtools command that force-regenerates a new footer without aborting), so I'm not sure this is an issue?

— Reply to this email directly or view it on GitHub https://github.com/samtools/hts-specs/issues/136#issuecomment-196387158.

sb10 commented 8 years ago

@dbolser Because you can't then store new checksums in a streaming context.

dbolser commented 8 years ago

OK, but why can't you put it in when you first bake the SAM/BAM/CRAM? Is this an issue with legacy files? On 19 Mar 2016 8:12 am, "Sendu Bala" notifications@github.com wrote:

@dbolser https://github.com/dbolser Because you can't then store new checksums in a streaming context.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/samtools/hts-specs/issues/136#issuecomment-198666385

jmarshall commented 8 years ago

When you are writing out a file, there is some information that you often don't know until you're finished: checksums, number of records,… There's basically three techniques for dealing with this:

Leave a gap in the headers and go back and fill it in later. But this is a pain when those headers are compressed, and it's a non-starter when you're streaming (to standard output for instance).
Have footers in your format, and write it out at the end of the stream. (Particularly good for information like checksums that the readers are only really going to need when they get to the end of the stream too!)
Put the data in a separate file alongside, as we do with .bai files. Logistically a pain, and again incompatible with streaming.

jkbonfield commented 8 years ago

@vadimzalunin I was thinking of a few mechanisms. FLG.FEXTRA is one and the SAM specification explicitly permits us to add more than just BGZF size ("Additional RFC1952 extra subfelds if present"), but in practice implementations choke. Maybe we just have to suffer it. There is also FLG.FCOMMENT, but the worked example in the SAM specification has that set to 0 via its explicit FLG==4 value, maybe in error, or maybe it's just an example and not intended to state that it must not be used?

However there is an additional approach, which is harder to implement but totally invisible. This is to use the huffman table itself (so RFC1951/deflate rather than 1952/gzip). It consists of a series of symbols (0-255 bytes plus 30odd distance codes IIRC) and their bit-lengths, which are used to construct a canonical huffman tree. Such a huffman tree therefore contains information, at the very least 1 bit per symbol (whether it is in the tree) or more if we utilise code-lengths too, so a tree with zero size compressed payload becomes data itself. Utterly messy, probably evil, but completely transparent to all decoders. It's also a royal PITA to encode and decode though, requiring a custom deflate implementation. It was really just a thought process :)

dbolser commented 8 years ago

I'm not against footers per se, assuming they aren't implemented as an ugly hack. I still don't understand the reasons why they are needed. On 19 Mar 2016 9:36 am, "John Marshall" notifications@github.com wrote:

When you are writing out a file, there is some information that you often don't know until you're finished: checksums, number of records,… There's basically three techniques for dealing with this:

-

Leave a gap in the headers and go back and fill it in later. But this is a pain when those headers are compressed, and it's a non-starter when you're streaming (to standard output for instance).

Have footers in your format, and write it out at the end of the stream. (Particularly good for information like checksums that the readers are only really going to need when they get to the end of the stream too!)

Put the data in a separate file alongside, as we do with .bai files. Logistically a pain, and again incompatible with streaming.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/samtools/hts-specs/issues/136#issuecomment-198674308

jkbonfield commented 8 years ago

One other thought, see https://sourceforge.net/p/staden/code/HEAD/tree/io_lib/trunk/io_lib/deflate_interlaced.c

The entire file is long and complicated - it dates back to the ZTR trace format. I implemented my own version of deflate because I required the ability to store deflate headers and deflate data blocks independently; ZTR would use one huffman table for many data blocks to gain efficiency, but zlib couldn't provide this functionality.

It permits a trivial starting point as we already have code that reads and writes huffman tables directly. The big question really is, is this the right plan? I'm certainly not convinced, but I mention it purely because it's possible, is mostly coded already, and therefore shouldn't be discounted purely on it being hard.

jkbonfield commented 8 years ago

@dbolser The desire is simply to have a way to encode sequence/name/quality/... checksums in the file. We have byte level checksums already (eg CRC32 built in to gzip), but they only validate the file and not the meaning.

Eg imagine we're writing a collation tool that collates records with the same name together, for purposes of piping into samtools fixmates say but without wanting to wait for a full name sort. We know the output file will differ from the input file, so crc, md5sum, etc will clearly differ. However a multiplicative or XORd hash of the checksums for all the individual sequences in the file should be the same regardless of their order. This means we can detect algorithmic bugs. Did our tool accidentally lose a sequence?

Another example in CRAM: We do BAM to CRAM (reference based sequence encoding) back to BAM. Does all of our sequence match the original still? What about those nasty corner cases, of sequence aligned off the end of a chromosome, or an N in a sequence matching an ambiguity code in the reference? These have been real bugs that caused corruptions in the past, which would have been spotted sooner with a sequence checksum.

In short, the checksums are algorithmic checks rather than file integrity checks. A footer makes them easy to implement when streaming.

dbolser commented 8 years ago

I didn't know you stored read and reference checksums On 19 Mar 2016 12:09 pm, "James Bonfield" notifications@github.com wrote:

@dbolser https://github.com/dbolser The desire is simply to have a way to encode sequence/name/quality/... checksums in the file. We have byte level checksums already (eg CRC32 built in to gzip), but they only validate the file and not the meaning.

Eg imagine we're writing a collation tool that collates records with the same name together, for purposes of piping into samtools fixmates say but without wanting to wait for a full name sort. We know the output file will differ from the input file, so crc, md5sum, etc will clearly differ. However a multiplicative or XORd hash of the checksums for all the individual sequences in the file should be the same regardless of their order. This means we can detect algorithmic bugs. Did our tool accidentally lose a sequence?

Another example in CRAM: We do BAM to CRAM (reference based sequence encoding) back to BAM. Does all of our sequence match the original still? What about those nasty corner cases, of sequence aligned off the end of a chromosome, or an N in a sequence matching an ambiguity code in the reference? These have been real bugs that caused corruptions in the past, which would have been spotted sooner with a sequence checksum.

In short, the checksums are algorithmic checks rather than file integrity checks. A footer makes them easy to implement when streaming.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/samtools/hts-specs/issues/136#issuecomment-198691846

samtools / hts-specs

cram and bam footers for storing key/value pairs #136

Leave a gap in the headers and go back and fill it in later. But this is a pain when those headers are compressed, and it's a non-starter when you're streaming (to standard output for instance).

Have footers in your format, and write it out at the end of the stream. (Particularly good for information like checksums that the readers are only really going to need when they get to the end of the stream too!)