I got some way into reverse-engineering the format so that I can do the bitshuffle independently of LZ4 in my application, but kept stubbing my toes. Some clear documentation on how it is used would be very useful for non-canonical implementations.
For example, it would appear that the on-disk representation takes the form of:
BE uint32_t compressed_block_size <compressed block> BE uint32_t compressed_block_size <compressed block> BE uint32_t compressed_block_size <compressed block> ...
where each <compressed block> is the result of compressing 8192 bytes of input. After the full blocks there is a partial block, which is smaller, and finally what looks like a verbatim, uncompressed residual of a few bytes at the end. I could try compressing and then unpacking arbitrary bit patterns to resolve this, but it feels like a canonical definition of the on-disk format (beyond, of course, reading the source code) would be a useful addition to this library.
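As a concrete reading of the layout guessed at above, here is a minimal Python sketch that splits a buffer into length-prefixed blocks plus a trailing residual. It is based only on the observed behaviour, not on the library source. The names `split_blocks` and `n_blocks` are hypothetical, and the sketch assumes the caller already knows how many blocks to expect, since nothing in the observed layout marks where the blocks end and the verbatim tail begins. It does not decompress or un-shuffle anything.

```python
import struct


def split_blocks(buf: bytes, n_blocks: int):
    """Split buf into n_blocks length-prefixed payloads plus trailing bytes.

    Assumed layout (from observation, not from a spec): each block is a
    big-endian uint32 giving the compressed size, followed by that many
    bytes of compressed data. Whatever follows the last block is returned
    untouched as the presumed verbatim residual.
    """
    blocks = []
    offset = 0
    for _ in range(n_blocks):
        if offset + 4 > len(buf):
            raise ValueError("truncated block header")
        # ">I" is a big-endian unsigned 32-bit integer.
        (size,) = struct.unpack_from(">I", buf, offset)
        offset += 4
        if offset + size > len(buf):
            raise ValueError("truncated block payload")
        blocks.append(buf[offset:offset + size])
        offset += size
    return blocks, buf[offset:]


# Synthetic example: two fake "compressed" blocks and a two-byte tail.
sample = struct.pack(">I", 3) + b"abc" + struct.pack(">I", 2) + b"de" + b"xy"
blocks, residual = split_blocks(sample, 2)
```

If the guess is right, each returned payload could then be handed to an LZ4 block decompressor and un-shuffled separately, which is exactly the part that a documented format would pin down.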