ronanh / intcomp

Fast integer compression library
Apache License 2.0
87 stars 4 forks source link

Seekable streams #7

Open klauspost opened 1 year ago

klauspost commented 1 year ago

Force creation of blocks at fixed boundaries to enable arbitrary position decoding and reversed iteration

The idea of fixed offsets doesn't sound like a great solution to seeking.

The easiest is to limit the "frames" (or whatever you call the collection of blocks) you generate. This will allow you to skip forward just by reading the first byte and skipping that number of bytes.

For sorted seeking, however you would need to add an optional min/max value. You only really need 1 bit to signify a min/max value follows the size. So if you limit the size of frames (for 32 bits), you have plenty of bits left to add an indicator that a min/max value follows.

Say your frames are max 1<<20 entries each. This means that you have 4MB of uint32 data. Considering the decompression speed you don't really need more fine grained seeking. The frame size can easily be configurable on the compression side.

This will as a side effect also allow you to compress and decompress these frames concurrently.

If you want reverse seeking you need an additional block at the end with the frame size, so you can skip backwards.

You can of course also have the index separately, where you for each frame record uncompressed length, compressed length and min/max. This would be similar to the index I made for S2 - except for the min/max part.

A separate index of course means that you don't have to modify anything, except make frames smaller.

ronanh commented 1 year ago

Thanks for the advice. I've started some work for this 'fixed boundaries' frames thing, and it adds more complexity that I had thought at first, so you're probably right.

I use your zstd lib extensively, saw this S2 thing in the release notes, but never really looked what it really was: this is really interesting.

klauspost commented 1 year ago

Yeah. Fixed boundaries will limit the compression significantly (unless you can do some magic I cannot think of).

Having a bunch of blocks as a frame (or what you'd call it) that can be a starting point for decompression would be reasonable.

The min/max can of each frame can either cater for the the search in sorted or quick rejection of frames (if looking for outliers).