See the logic in the htscodecs test harness, which has a cut-down version of these tests but can also use these full copies if you git clone into the test subdirectory. (Although I haven't run the full test for a while so it's possible I broke something!)
https://github.com/jkbonfield/htscodecs/blob/master/tests/rans4x8.test
Yes they have newlines stripped out, so my test logic is:
tr -d '\012' < q4 > q4-nl
./rans4x8 -d r4x8/q4.0 r4x8.uncomp
cmp q4-nl r4x8.uncomp
q4 is NovaSeq data with 4 quals. q8 is HiSeq X with 8 quals. q40 is old HiSeq 2000 or 2500 era data with ~40 distinct qualities. qvar is either ONT or PacBio (I forget which) with variable length reads.
Some of the encoders, e.g. fqzcomp, can use the line length as part of the model, as the cycle number has an important correlation with quality. Hence it varies by codec whether to remove newlines first. The data presented in CRAM is minus newlines (it's just a concatenated block of data), so fqzcomp is actually embedding this line length information into the compressed data stream so it has no external dependencies.
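As a rough illustration of why the cycle number matters, here's a hypothetical sketch in C of folding the cycle into a model context; this is not the real fqzcomp context formula, just the general idea:

/* Hypothetical sketch: fold the read cycle and the previous quality
 * into a single context for an adaptive probability model. The real
 * fqzcomp context is more elaborate; this only shows why knowing the
 * line length (and hence the cycle number) helps. */
static unsigned int qual_ctx(int cycle, int prev_qual) {
    unsigned int c = cycle < 1024 ? (unsigned)cycle : 1023; /* clamp to 10 bits */
    return (c << 6) | ((unsigned)prev_qual & 0x3f);         /* plus 6 bits of quality */
}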
The q40 data also has "q40+dir". This holds the read1 vs read2 flag for this data set. Again it's an external additional piece of information given to the compression codec via a side channel, like the length, but it's not in the actual data block presented by CRAM. As before, this gets put into the compressed data stream by fqzcomp so it can decode without needing to have a dependency on the BF data series.
Note I'm thinking of providing separate RLE and PACK transforms external to the r4x16pr codec (so it'll just become r4x16 and the X4 transform can be moved internal to the name tokeniser). I have a branch of scramble using this now, but it still needs tidying up. It offers identical compression but is a cleaner design and permits more interesting combinations such as PACK followed by gzip or bzip2, which may work better for some data types. Code-wise there's no substantial difference though - it's just shuffling where things get done.
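For reference, a minimal C sketch of the PACK idea (bit-packing when there are few distinct symbols); the real r4x16 transform has its own header and corner cases, so this is illustrative only:

#include <stddef.h>

/* Pack symbols drawn from an alphabet of at most 4 distinct values
 * into 2-bit codes, four per output byte. map[] lists the distinct
 * symbols; each input byte is replaced by its index in map[]. */
size_t pack2(const unsigned char *in, size_t len,
             const unsigned char map[4], unsigned char *out) {
    size_t o = 0;
    for (size_t i = 0; i < len; i += 4) {
        unsigned char b = 0;
        for (size_t j = 0; j < 4 && i + j < len; j++) {
            int code;
            for (code = 0; code < 3 && map[code] != in[i+j]; code++)
                ;
            b |= (unsigned char)(code << (2*j));
        }
        out[o++] = b;
    }
    return o;
}

A decoder needs the symbol map and the unpacked length as side information, which is one reason it's cleaner to keep PACK as its own transform in front of a general-purpose compressor.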
Edit: In recent testing I also think we should add Zstd into the mix. It's basically a replacement for deflate in every situation - smaller, faster to encode, faster to decode. Sadly it only has an order-0 entropy encoder; otherwise I'd also propose deprecating rANS and using Zstd's FSE (tANS order-0).
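For example, swapping Zstd in is a small change via its one-shot API; a sketch, assuming libzstd is installed (link with -lzstd, and size out_cap with ZSTD_compressBound):

#include <zstd.h>

/* One-shot compression with the simple Zstd API; returns the
 * compressed size, or 0 on error. */
static size_t zstd_pack(const void *in, size_t in_len,
                        void *out, size_t out_cap) {
    size_t n = ZSTD_compress(out, out_cap, in, in_len, 3 /* level */);
    return ZSTD_isError(n) ? 0 : n;
}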
An example of using the read1 vs read2 flag and how it benefits fqzcomp:
$ awk '{print $1}' /var/tmp/htscodecs-corpus/dat/q40+dir | ~/work/htscodecs/build/tests/fqzcomp_qual -s 1 |wc -c
nlines=100000
Total output = 3466140
3466132
$ cat /var/tmp/htscodecs-corpus/dat/q40+dir | ~/work/htscodecs/build/tests/fqzcomp_qual -s 1 |wc -c
nlines=100000
Total output = 3197082
3197074
We could in fact do the same with rANS, but I figured it's best to have a simple mode and a complex mode and not bother with the in-between state. E.g.:
$ awk '{print $1}' /var/tmp/htscodecs-corpus/dat/q40+dir | ~/work/htscodecs/build/tests/rans4x16pr -o1|wc -c
4975832
$ (awk '$2==0 {print $1}' /var/tmp/htscodecs-corpus/dat/q40+dir | ~/work/htscodecs/build/tests/rans4x16pr -o1;awk '$2==1 {print $1}' /var/tmp/htscodecs-corpus/dat/q40+dir | ~/work/htscodecs/build/tests/rans4x16pr -o1)|wc -c
4891073
It doesn't seem to conform to that (beyond the first byte), nor does it match the output of either the htsjdk or the htscodecs javascript r4x8 order-0 codecs. Before I debug further I want to confirm the format.
One other thing to comment on here: this is designed to test decoding. The format does not mandate what the encoding format looks like. Think of Deflate, for example. We have gzip -1 to -9, plus libdeflate, cloudflare, intel, 7zip, zstd, etc. implementations of deflate. They all compress to different byte streams, but they can all decompress each other's output. Thus this test data should be used to validate decoders, not encoders.
Obviously it can also be used for round-trip testing.
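E.g. a round-trip check could look something like this in C; the rans_compress / rans_uncompress prototypes below are assumptions based on htscodecs' rANS_static.h, so verify against the real header before relying on them:

#include <stdlib.h>
#include <string.h>
#include "htscodecs/rANS_static.h"

/* Compress then decompress and compare against the input; returns 1
 * if the round trip is lossless. Prototypes assumed, see above. */
static int round_trip(unsigned char *in, unsigned int in_size, int order) {
    unsigned int csize = 0, usize = 0;
    unsigned char *comp = rans_compress(in, in_size, &csize, order);
    unsigned char *out  = comp ? rans_uncompress(comp, csize, &usize) : NULL;
    int ok = out && usize == in_size && memcmp(in, out, in_size) == 0;
    free(comp);
    free(out);
    return ok;
}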
An example of using the javascript code:
$ tr -d '\012' < /var/tmp/htscodecs-corpus/dat/q4 | md5sum
c703a7461dd65ddee6dd528a7d2f2d08 -
$ node javascript/main_rans.js -d /var/tmp/htscodecs-corpus/dat/r4x8/q4.0|md5sum
Decompress 988986 => 15100000
c703a7461dd65ddee6dd528a7d2f2d08 -
$ node javascript/main_rans.js -d /var/tmp/htscodecs-corpus/dat/r4x8/q4.1|md5sum
Decompress 911331 => 15100000
c703a7461dd65ddee6dd528a7d2f2d08 -
I know the javascript encoders sometimes take liberties and shortcuts to simplify the algorithm. It is designed to be more of a reference implementation to go alongside the specification than an indication of how to maximally compress data in the most speed-efficient manner.
Out of curiosity I wondered what the latest Paq compressor would do on that q40+dir file.
As-is:
$ ~/ftp/compression/paq8px_v183 -9 /var/tmp/htscodecs-corpus/dat/q40+dir
-----------------------
Total input size : 10300000
Total archive size : 3222021
Time 1673.46 sec, used 3941 MB (4132639872 bytes) of memory
So around 2000x slower than fqzcomp and within 1% of the same size. Still immensely impressive for a generic data compressor.
Splitting it up into R1 and R2 separate files (plus an additional 12.5k of bit-field data for the flag) is better:
paq8px archiver v183 (C) 2019, Matt Mahoney et al.
Creating archive r1.paq8px183 in single file mode...
Filename: r1 (5074543 bytes)
Block segmentation:
0 | text | 5074543 bytes [0 - 5074542]
-----------------------
Total input size : 5074543
Total archive size : 1459844
Time 773.41 sec, used 3941 MB (4132639730 bytes) of memory
paq8px archiver v183 (C) 2019, Matt Mahoney et al.
Creating archive r2.paq8px183 in single file mode...
Filename: r2 (5025457 bytes)
Block segmentation:
0 | text | 5025457 bytes [0 - 5025456]
-----------------------
Total input size : 5025457
Total archive size : 1666787
Time 819.16 sec, used 3941 MB (4132639730 bytes) of memory
So a total file size of 3139131, ~2% smaller than the fqzcomp_qual -s 1 model. Paq is much better on smaller subsets than fqz because it learns the statistics faster due to the context-mixing, but I'm happy with the way the fqz codec works given no one would realistically be willing to wait that long for Paq!
I see, thanks - that's all very helpful.
Thus this test data should be used to validate decoders, not encoders.
Right, but that's where I got into trouble. I had tried to validate the htsjdk decoder by running it on the compressed output file dat/r4x8/q4.0, but that file isn't itself a (spec-compliant) CRAM rANS stream, probably because it was output by the javascript code, which appears to break its input up into 1MB sections and adds additional headers in the output. I can use it as-is for now, but at some point we might want to include the raw unadorned compressed output streams to use for round-trip test purposes.
I think separating out RLE and PACK is a great idea. Also, IIRC, someone asked about zstd at the GA4GH conference during Rob's talk. It would probably be a good addition.
Ah yes, that's a valid point - I may have accidentally encapsulated the streams a bit with size meta-data. Ideally it should simply be the raw stream itself, letting the decompressor get the size from a file stat instead of needing some framing / encapsulation meta-data.
I think the reason I did it that way was so I could benchmark on large files, but it should probably be a mode of operation and not something we require.
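For what it's worth, the framing involved would be along these lines; the header layout here is hypothetical (the actual bytes the javascript tool writes may differ):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical frame header: compressed and uncompressed sizes as
 * 4-byte little-endian words before each ~1MB block. Illustrative
 * of the encapsulation described above, not the exact layout. */
static void write_frame_hdr(FILE *fp, uint32_t csize, uint32_t usize) {
    uint8_t h[8] = {
        csize & 0xff, (csize >> 8) & 0xff, (csize >> 16) & 0xff, (csize >> 24) & 0xff,
        usize & 0xff, (usize >> 8) & 0xff, (usize >> 16) & 0xff, (usize >> 24) & 0xff,
    };
    fwrite(h, 1, sizeof(h), fp);
}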
@jkbonfield I'd like to use the files in this repo as test cases. Assuming the 4 files in dat (q4, q8, q40+dir, qvar) are the raw data (?), some questions:
Is it correct that dat/r4x8/q4.0 contains the result of compressing q4 (stripped of newlines?) with the order-0 r4x8 codec? It doesn't seem to conform to that (beyond the first byte), nor does it match the output of either the htsjdk or the htscodecs javascript r4x8 order-0 codecs. Before I debug further I want to confirm the format.
What do q40+dir and qvar refer to?