samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
633 stars 174 forks source link

There's a new format for "quantitative genomics data" #528

Open yfarjoun opened 3 years ago

yfarjoun commented 3 years ago

It's called d4 (meaning D^4= Dense Depth Data Dump) and its described in https://github.com/38/d4-format and presented in a poster here: https://drive.google.com/file/d/17vLmwMcAn0l3-tzzn1DjUVkF0MXLxSy5/view (under MIT license)

It seems to use a poor-man's version of hoffman encoding to encode integer data. If the compression stats they provide a correct, and if this is a really need, it seems that there is much more room for advancement using the ideas behind cram compression on a single column.....

Regardless, it seems that this could drive the conversation about needed a binary bed format (.b2)?

jkbonfield commented 3 years ago

I'm curious what the data looks like, having no experience at all with bedgraph. Is it purely a rname, pos, depth table, or does it have various layers at different resolutions for rendering purposes? An example file would be useful.

One would think that compression wise depth plots are ideal for order-1 entropy encoders given the depth at pos P is highly correlated to the depth at pos P-1. I don't know the scale of numbers involved, but CRAM's rANS o1 mode is byte based which isn't ideal. For CRAM 3.1's name tokeniser I had a way to split integers into 4 byte streams and compress independently, which pragmatically was a cheap way to utilise existing codecs effectively. (I also did this in the ancient ZTR format; an alternative to SCF and AB1 for chromatograms.)