Some good questions. It's badly documented, sorry, but cram_size and cram_dump were the first hacky tools I wrote to aid my own understanding of the CRAM format, and I never really got around to expanding and finishing them off. The code for both is shocking!
So the chars reported are via this code:
for (i = 1; i <= sizeof(methods)/sizeof(*methods); i++) {
    printf("%s%s%c%s",
           /* reverse video when this method accounts for >= 50% of uses */
           (count[i]+0.01)/(count[0]+0.01) >= 0.50 ? "\033[7m" : "",
           /* underline when it accounts for >= 10% */
           (count[i]+0.01)/(count[0]+0.01) >= 0.10 ? "\033[4m" : "",
           /* one-character code per method; ' ' if never used */
           " gblrR01458923fnNBz01458923"[count[i] ? i : 0],
           "\033[0m");  /* reset terminal attributes */
}
As each slice could use a different compression method, and io_lib learns what works as it goes along, we get a distribution of methods used for each data type. If more than 50% of uses are the same method it's shown in reverse video; if more than 10%, it's underlined.
The codes listed there are the same as those in the table above it, ie in order:
enum cram_block_method methods[] = {
GZIP, BZIP2, LZMA, RANS0, RANS1,
RANS_PR0, RANS_PR1, RANS_PR64, RANS_PR9,
RANS_PR128, RANS_PR129, RANS_PR192, RANS_PR193,
FQZ, NAME_TOK3, NAME_TOKA, BSC, ZSTD,
ARITH_PR0, ARITH_PR1, ARITH_PR64, ARITH_PR9,
ARITH_PR128, ARITH_PR129, ARITH_PR192, ARITH_PR193,
};
In short, the ones you're going to see from CRAM 3.0 are gzip (g), bzip2 (b), lzma (l) and rANS order-0 (r) or order-1 (R). CRAM 3.1 brings in new codecs, which will be tokenise-name (n or N), fqzcomp-qual (f) and a whole bunch of new 16-bit rANS methods (0-9). They make sense to me :-), but they're named after the last digit of the -o option to the test/rans4x16pr command in https://github.com/samtools/htscodecs, which is a bit field. So -o0 / -o1 are order-0 / order-1 as before; -o64 / -o65 add RLE (order-0+RLE or order-1+RLE), which are codes '4' and '5' (the last digit of 64/65); +128 is bit-packing (codes '8'/'9'); and, naturally, 192 is bit-packing + RLE (codes '2'/'3').
There are arithmetic coder variants too, which get the same numbers but in a different column. They're unlikely to be used IMO as they're a bit slower while rarely being much smaller.
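If it helps, here's a tiny stand-alone C sketch of that -o bit field; the function and names are mine (not io_lib's or htscodecs'), but the flag values and code characters follow the description above:

#include <stdio.h>

/* Illustrative only: decode the rans4x16pr "-o" bit field described above.
 * Bit 0 = order-1, +64 = RLE, +128 = bit-packing. */
void describe_opt(int o) {
    int order = o & 1;           /* order-0 vs order-1 entropy model */
    int rle   = (o & 64)  != 0;  /* run-length encoding pre-pass     */
    int pack  = (o & 128) != 0;  /* bit-packing pre-pass             */
    /* cram_size reports the last decimal digit of the -o value */
    printf("-o%-3d => order-%d%s%s, code '%d'\n",
           o, order, rle ? " +RLE" : "", pack ? " +pack" : "", o % 10);
}

int main(void) {
    int opts[] = { 0, 1, 64, 65, 128, 129, 192, 193 };
    for (int i = 0; i < (int)(sizeof(opts)/sizeof(*opts)); i++)
        describe_opt(opts[i]);
    return 0;
}

Running it prints, for example, "-o65 => order-1 +RLE, code '5'" and "-o193 => order-1 +RLE +pack, code '3'", matching the character codes above.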
As for the other methods, enabling lzma tells the encoder it can use lzma, not that it must use lzma. It has a cost-benefit analysis based on the compressed size vs the expected CPU time for encoding / decoding. You can see the raw size of blocks being produced if you run scramble -jZ -X archive -vv. Eg:
Try compression of block ID 12 from 15000000 to 1315522 by method GZIP, strat 1
Try compression of block ID 12 from 15000000 to 960796 by method BZIP2, strat 0
Try compression of block ID 12 from 15000000 to 1148660 by method LZMA, strat 0
Try compression of block ID 12 from 15000000 to 1046795 by method RANS0, strat 0
Try compression of block ID 12 from 15000000 to 950660 by method RANS1, strat 0
Try compression of block ID 12 from 15000000 to 1447579 by method GZIP-1, strat 0
Compressed block ID 12 from 15000000 to 950660 by method RANS1 strat 0
Try compression of block ID 11 from 3947215 to 551087 by method GZIP, strat 1
Try compression of block ID 11 from 3947215 to 486775 by method BZIP2, strat 0
Try compression of block ID 11 from 3947215 to 471368 by method LZMA, strat 0
Try compression of block ID 11 from 3947215 to 633584 by method GZIP-1, strat 0
Compressed block ID 11 from 3947215 to 471368 by method LZMA strat 0
You can see LZMA is best for block 11, but it's only a bit smaller than BZIP2. CRAM periodically tries each method using a sampling strategy, to learn which methods work and to adapt. LZMA will be used for this block as we've tried all methods and we may as well use the smallest, but it may not be chosen for subsequent blocks. The tradeoff between speed vs size is also controlled by the compression level -1 to -9. At -9 it'll choose whatever works best, but that may be prohibitively expensive in CPU. What these options are essentially doing is biasing that cost-benefit analysis, skewing how much smaller a block needs to be before using that method is considered justified. It's all just a bunch of heuristics, but it seems to work pretty well.
Note the above also shows exploration of different flavours of a specific compression method, eg GZIP vs GZIP-1 (also GZIP-RLE, zlib only). These are all the same GZIP as far as the CRAM format and decoders are concerned, but sometimes we find we don't need an expensive compression level as the quick and dirty setting gives comparable results (sometimes, bizarrely, even better).
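To make that cost-benefit idea a bit more concrete, here's a toy sketch; it is not io_lib's actual code and the per-method speed weights are invented, but it shows how scaling size by an expected-cost factor biases the choice:

#include <stdio.h>

struct trial { const char *method; long size; double speed_weight; };

int main(void) {
    /* Sizes from the block 11 trial log above; the speed weights are made up. */
    struct trial t[] = {
        { "GZIP",  551087, 1.0 },
        { "BZIP2", 486775, 1.3 },
        { "LZMA",  471368, 1.8 },
    };
    int best = 0;
    for (int i = 1; i < (int)(sizeof(t)/sizeof(*t)); i++) {
        /* Lower compression levels would raise the weights of the slower
         * codecs, biasing the choice towards the cheaper methods. */
        if (t[i].size * t[i].speed_weight < t[best].size * t[best].speed_weight)
            best = i;
    }
    printf("Chosen: %s (%ld bytes)\n", t[best].method, t[best].size);
    return 0;
}

With these made-up weights GZIP wins even though LZMA gave the smallest block, which is roughly how a lower compression level can steer later blocks away from the expensive codecs.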
If we enable CRAM 3.1 the above example file shows this:
Try compression of block ID 12 from 15000000 to 1315522 by method GZIP, strat 1
Try compression of block ID 12 from 15000000 to 960796 by method BZIP2, strat 0
Try compression of block ID 12 from 15000000 to 1148660 by method LZMA, strat 0
Try compression of block ID 12 from 15000000 to 1046141 by method RANS_PR0, strat 0
Try compression of block ID 12 from 15000000 to 1012637 by method ARITH_PR0, strat 0
Try compression of block ID 12 from 15000000 to 757562 by method FQZ, strat 3
Try compression of block ID 12 from 15000000 to 776742 by method FQZ_b, strat 259
Try compression of block ID 12 from 15000000 to 1447579 by method GZIP-1, strat 0
Try compression of block ID 12 from 15000000 to 951910 by method RANS_PR1, strat 0
Try compression of block ID 12 from 15000000 to 1069578 by method RANS_PR64, strat 0
Try compression of block ID 12 from 15000000 to 959648 by method RANS_PRstripe1, strat 0
Try compression of block ID 12 from 15000000 to 953033 by method RANS_PR128, strat 0
Try compression of block ID 12 from 15000000 to 809239 by method RANS_PR193, strat 0
Try compression of block ID 12 from 15000000 to 933517 by method ARITH_PR1, strat 0
Try compression of block ID 12 from 15000000 to 1118972 by method ARITH_PR64, strat 0
Try compression of block ID 12 from 15000000 to 946254 by method ARITH_PRstripe1, strat 0
Try compression of block ID 12 from 15000000 to 927471 by method ARITH_PR128, strat 0
Try compression of block ID 12 from 15000000 to 823957 by method ARITH_PR129, strat 0
Try compression of block ID 12 from 15000000 to 917436 by method ARITH_PR192, strat 0
Try compression of block ID 12 from 15000000 to 806972 by method ARITH_PR193, strat 0
Compressed block ID 12 from 15000000 to 757562 by method FQZ strat 0
Many more codecs are being assessed here, with FQZcomp winning out at 757562 for this block of quality values. If we weren't using the small or archive profile then it'd choose RANS4x16 method 193 (order-1 RLE + bit-packing) at 809239. Both of these are considerably smaller than the 950660 for rANS4x8 order-1 used in CRAM 3.0.
Block 11 (which we store read names in) now looks like this:
Try compression of block ID 11 from 3947215 to 551087 by method GZIP, strat 1
Try compression of block ID 11 from 3947215 to 486775 by method BZIP2, strat 0
Try compression of block ID 11 from 3947215 to 471368 by method LZMA, strat 0
Try compression of block ID 11 from 3947215 to 2031447 by method ARITH_PR0, strat 0
Try compression of block ID 11 from 3947215 to 633584 by method GZIP-1, strat 0
Try compression of block ID 11 from 3947215 to 324306 by method TOK3_A, strat 1
Try compression of block ID 11 from 3947215 to 990473 by method ARITH_PR1, strat 0
Try compression of block ID 11 from 3947215 to 2038622 by method ARITH_PR64, strat 0
Try compression of block ID 11 from 3947215 to 1163212 by method ARITH_PRstripe1, strat 0
Try compression of block ID 11 from 3947215 to 2031447 by method ARITH_PR128, strat 0
Try compression of block ID 11 from 3947215 to 990473 by method ARITH_PR129, strat 0
Try compression of block ID 11 from 3947215 to 2038622 by method ARITH_PR192, strat 0
Try compression of block ID 11 from 3947215 to 984019 by method ARITH_PR193, strat 0
Try compression of block ID 11 from 3947215 to 0 by method ZSTD-1, strat 0
Compressed block ID 11 from 3947215 to 324306 by method TOK3_A strat 0
Here it's now chosen the name tokeniser with the inbuilt arithmetic coder (the tokeniser can use either rANS or the arithmetic coder as its back end; arith is enabled in archive mode, otherwise it's rANS), at 324306 bytes vs the former 471368 from LZMA.
If you let this run for a while it'll stop trying all these and become briefer (and faster!):
Encode slice 0
Compressed block ID 13 from 7480 to 1147 by method GZIP strat 1
Compressed block ID 12 from 15000000 to 798409 by method FQZ strat 3
Compressed block ID 30 from 324869 to 63998 by method RANS_PR193 strat 0
Compressed block ID 11 from 3947035 to 319850 by method TOK3_A strat 0
Compressed block ID 20 from 5347 to 848 by method RANS_PR0 strat 0
Compressed block ID 4281155 from 1687 to 1508 by method ARITH_PR193 strat 0
Compressed block ID 5131587 from 2162 to 6 by method RANS_PR128 strat 0
Compressed block ID 4281171 from 185292 to 83154 by method RANS_PR1 strat 0
Compressed block ID 5194586 from 3592 to 1223 by method BZIP2 strat 0
Compressed block ID 5459283 from 191612 to 60330 by method RANS_PR1 strat 0
Compressed block ID 4342618 from 600000 to 54 by method RANS_PR193 strat 0
Compressed block ID 5459267 from 4194 to 1977 by method ARITH_PR193 strat 0
Compressed block ID 5456218 from 1244 to 412 by method GZIP-1 strat 0
Compressed block ID 14 from 308082 to 75100 by method LZMA strat 0
Compressed block ID 15 from 160200 to 38558 by method RANS_PR1 strat 0
Compressed block ID 16 from 100000 to 14193 by method RANS_PR1 strat 0
Compressed block ID 17 from 100007 to 34346 by method RANS_PR64 strat 0
Compressed block ID 18 from 100000 to 12141 by method RANS_PR1 strat 0
...
After a while it reverts to showing all the "Try compression of block" messages again, flip-flopping between trials and "just do it". If repeated trials show a method just isn't remotely in the right ballpark - eg rANS doesn't help on read names as they really need some LZ or BWT method - then it gets culled completely and isn't used in subsequent trials.
This way it does a reasonable job of learning the optimal compression methods, but the trials may penalise CPU cost on small files, so when assessing performance don't just scale up the numbers from a small file and assume it'll be that speed on something larger.
I hope this helps clarify a rather complex mess of heuristics. :-)
Thank you very much for the very detailed explanation!
I didn't expect different blocks of the same ID to also have different compression schemes. This format is even more complicated than I thought ^^
The tradeoff between speed vs size is also controlled by the compression level -1 to -9
So this is not passed to GZIP/LZMA as -1 to -9 but is instead an option that influences the heuristic?
On a more general note: does compressibility of data with a given codec vary so much between blocks in a file (and between files) that it makes sense to explore which works best every time? Would a specific combination for "small" or "fast" not produce something close to the smallest or fastest encoding every time?
I didn't expect different blocks of the same ID to also have different compression schemes. This format is even more complicated than I thought ^^
Well, the format allows a lot of scope for how an encoder chooses to store any given data. It's even more complex than described above because it permits (perhaps wrongly) multiple data series to be embedded within the same block. So the encoder could choose to always store two aux tags together if it determined they were highly correlated, permitting better compression. I don't do that as it'd be a bit of a nightmare to do efficiently, but it's one potential win. (The downside is that it harms selective decompression, introducing dependencies between data series.)
The tradeoff between speed vs size is also controlled by the compression level -1 to -9
So this is not passed to GZIP/LZMA as -1 to -9 but is instead an option that influences the heuristic?
It's a bit of both. The level is still passed on to the underlying compression library, but it's also factored into the cost comparison between one method and another. Note it may sometimes be tweaked a little too. Eg zlib has levels 1 to 9, but libdeflate has 1 to 11 (or 12 - I forget now), so a certain amount of skewing happens.
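As a purely hypothetical illustration of that kind of tweak (this is not io_lib's actual mapping), a 1-9 user level might be rescaled onto libdeflate's wider range along these lines:

/* Hypothetical sketch only: rescale a 1-9 user level onto libdeflate's
 * 1-12 range.  io_lib's real adjustment may well differ. */
int map_level_to_libdeflate(int level) {
    if (level < 1) level = 1;
    if (level > 9) level = 9;
    return 1 + (level - 1) * 11 / 8;   /* 1 -> 1, 5 -> 6, 9 -> 12 */
}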
On a more general note: does compressibility of data with a given codec vary so much between blocks in a file (and between files) that it makes sense to explore which works best every time? Would a specific combination for "small" or "fast" not produce something close to the smallest or fastest encoding every time?
Generally no, but care needs to be taken at the junction between aligned and unaligned data, as the statistics gathered can skew considerably. Also note that the code doesn't explore which method works best every time. It fully explores the first 3 slices, and then uses whatever worked best on those for the next 50. It then goes back to 3 more trials, 50 repeats, etc. This way the overhead is minimised, but there is some room for changing that 3/50 ratio for different compression levels (eg 3/200 at -1) or compression profiles (small vs fast). I haven't measured to see what impact that would have. There are a few more subtleties which could probably be improved further. Data series where the answer as to which method is best isn't clear may get more than 3 out of 50 trials before picking a winner, and likewise if the same answer keeps occurring then the number of repeat (50) slices can increase.
Finally it has the notion of a revised method list. Initially each data series has a list of compression methods it's happy to evaluate (some are type specific, such as CRAM 3.1's name tokeniser or the fqzcomp qual encoder). If repeated 3/50 cycles illustrate that a particular compression method is always at least 20% behind the chosen one, then after 4 such rounds it's completely removed from the list of available methods and doesn't take part in subsequent trials (unless reset at the mapped / unmapped junction). Again those parameters perhaps should be adjusted by -1 to -9 or fast / small profiles. There's lots of room for tweaking as you can see.
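For what it's worth, here's a rough stand-alone sketch of the 3-trial / 50-reuse cycle and the 20%-behind culling rule described above. The 3/50 and 20%/4-round parameters come from the description; the structure and names are mine, not io_lib's:

#include <stdbool.h>
#include <stddef.h>

#define NMETHODS 8

struct series_state {
    bool enabled[NMETHODS];   /* methods still worth evaluating           */
    int  behind[NMETHODS];    /* consecutive rounds >= 20% off the winner */
    int  best;                /* winner of the most recent trial round    */
    int  slice;               /* slices seen so far for this data series  */
};

/* True if this slice should trial every still-enabled method. */
bool is_trial_slice(struct series_state *s) {
    int phase = s->slice++ % (3 + 50);   /* 3 trial slices, then 50 reuse */
    return phase < 3;
}

/* Feed back each method's best size after a trial round (0 = not tried). */
void update_after_trial(struct series_state *s, const size_t size[NMETHODS]) {
    size_t best_sz = (size_t)-1;
    for (int i = 0; i < NMETHODS; i++)
        if (s->enabled[i] && size[i] && size[i] < best_sz) {
            best_sz = size[i];
            s->best = i;
        }

    for (int i = 0; i < NMETHODS; i++) {
        if (!s->enabled[i] || !size[i])
            continue;
        if (size[i] * 5 >= best_sz * 6)      /* at least 20% behind the best */
            s->behind[i]++;
        else
            s->behind[i] = 0;
        if (s->behind[i] >= 4)               /* culled from future trials */
            s->enabled[i] = false;
    }
}

In practice the 3 and 50 counts themselves also adapt, as described above, but the skeleton is the same idea.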
Fortunately CRAM decoders don't need to know any of this. The format is self-describing, so they just follow the instructions for which method to use on each block and apply it.
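As a minimal sketch of what that means on the decode side (the enum names and helper below are hypothetical illustrations, not the CRAM spec's actual method identifiers), the decoder simply dispatches on the method recorded in the block header:

#include <string.h>
#include <stddef.h>

/* Hypothetical method tags; the real block header stores a method identifier. */
enum block_method { RAW, GZIP, BZIP2, LZMA, RANS4x8, RANS4x16, ARITH, FQZ, TOK3 };

/* Dispatch on the recorded method.  Only RAW is handled in this sketch; a real
 * decoder would call the matching codec's decompress routine for the others. */
int uncompress_block(enum block_method m,
                     const unsigned char *in, size_t in_len,
                     unsigned char *out, size_t *out_len) {
    switch (m) {
    case RAW:
        memcpy(out, in, in_len);
        *out_len = in_len;
        return 0;
    default:
        return -1;  /* would hand off to gzip/bzip2/lzma/rANS/fqzcomp/tok3 here */
    }
}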
Closing as I think this question has been answered. You should still be able to comment, though.
I hope this is the right place to ask.
Before getting into lossy compression, I want to compare the lossless compression algorithms, so I want to recompress our CRAM files with different algorithms/options. I think that cram_size should already tell me which block is how big (compressed) and which algorithm is used, but I can't find any documentation on it. Reading the code I deduced that g means gzip compression and b bzip2, but I am unsure of the other values and the highlighting. Could you tell me what they mean?
Regarding recompression, I previously used samtools --output-fmt-option use_lzma=1, but I realised that it doesn't actually change the compression of most fields. In total, the file size only went from 17GB to 15GB. I expected the improvement to be better... What would you recommend to achieve the best lossless compression?
Thank you!