samtools / htscodecs

Custom compression for CRAM and others.
Other
30 stars 18 forks source link

Bit packing inputs with 256 symbols do not disable bit packing #65

Closed zaeleus closed 2 years ago

zaeleus commented 2 years ago

Encoders that use hts_pack disable bit packing if there are more than 16 symbols, e.g., rANS Nx16 and the arithmetic range coder. However, the checks that disable bit packing compare against a potentially truncated value. When attempting to bit pack inputs with 256 symbols, out[c_meta_len] is 0, out[c_meta_len] > 16 fails, and the encoders do not disable bit packing.

An example using the rANS Nx16 test program (version 1.3.0):

$ ruby -e "print (0..255).to_a.pack('C*')" | ./rans4x16pr -r -o 128 | xxd
Took 46003 microseconds,   0.0 MB/s
00000000: a082 0000 8200 0001 0203 0405 0607 0809  ................
00000010: 0a0b 0c0d 0e0f 1011 1213 1415 1617 1819  ................
00000020: 1a1b 1c1d 1e1f 2021 2223 2425 2627 2829  ...... !"#$%&'()
00000030: 2a2b 2c2d 2e2f 3031 3233 3435 3637 3839  *+,-./0123456789
00000040: 3a3b 3c3d 3e3f 4041 4243 4445 4647 4849  :;<=>?@ABCDEFGHI
00000050: 4a4b 4c4d 4e4f 5051 5253 5455 5657 5859  JKLMNOPQRSTUVWXY
00000060: 5a5b 5c5d 5e5f 6061 6263 6465 6667 6869  Z[\]^_`abcdefghi
00000070: 6a6b 6c6d 6e6f 7071 7273 7475 7677 7879  jklmnopqrstuvwxy
00000080: 7a7b 7c7d 7e7f 8081 8283 8485 8687 8889  z{|}~...........
00000090: 8a8b 8c8d 8e8f 9091 9293 9495 9697 9899  ................
000000a0: 9a9b 9c9d 9e9f a0a1 a2a3 a4a5 a6a7 a8a9  ................
000000b0: aaab acad aeaf b0b1 b2b3 b4b5 b6b7 b8b9  ................
000000c0: babb bcbd bebf c0c1 c2c3 c4c5 c6c7 c8c9  ................
000000d0: cacb cccd cecf d0d1 d2d3 d4d5 d6d7 d8d9  ................
000000e0: dadb dcdd dedf e0e1 e2e3 e4e5 e6e7 e8e9  ................
000000f0: eaeb eced eeef f0f1 f2f3 f4f5 f6f7 f8f9  ................
00000100: fafb fcfd feff                           ......
jkbonfield commented 2 years ago

I think this was probably something that I tightened up in the spec with an explicit "not permitted to have nsym > 16 or nsym == 0" statement at a later stage.

The bug is really one of not knowing how best to deal with errors; eg failed to allocate memory (hard error) vs inappropriate data (soft error). I think I took the wrong choice in making it return a buffer in the latter case, as it's both wasteful of CPU cycles and also punts the error handling elsewhere. Specifically here: https://github.com/samtools/htscodecs/blob/843d4f63b1c64905881b4648916a4d027baa1a1c/htscodecs/rANS_static4x16pr.c#L1267 (which should check for 0 too, as stated in the specification).

However given the handling of the NULL case anyway, hts_pack should probably just treat all error types the same. The caller can, if it chooses, always look at errno for ENOMEM to distinguish.

Thanks for the bug report.