Example for compressing/decompressing an arbitrary number of u32 int values

quickwit-oss / bitpacking

SIMD algorithms for integer compression via bitpacking. This crate is a port of a C library called simdcomp.

MIT License

267 stars 30 forks source link

Example for compressing/decompressing an arbitrary number of u32 int values #27

Closed cisaacson closed 4 years ago

cisaacson commented 4 years ago

I have been able to compress and decompress using BitPacker4x, and the compression looks quite good on sorted u32 values. I have a large number of u32 values to compress (up to a few 1000), and to compress I looped through chunks equal to 128 u32 values in size. I need to store the final compressed result, and I did this by appending each to a result Vec<u8> which I can then store. This works fine.

Now I need to take the result Vec<u8> and decompress each chunk. At this point I no longer have the compressed_len for each chunk of bytes. Is there a way to do this so that I can decompress each compressed chunk and rebuild the original Vec<u32>? In the original library from Daniel Lemere in C++ he does support any arbitrary number of int values, that would be really nice here too. But if you can guide me on how to use it properly that will be very helpful.

fulmicoton commented 4 years ago

The library only supports compresssing 128 integers (assuming you use BitPacker4x). You need to build your own codec on top of it.

You also need to somehow define how you will store the number of bits used to compress your block.

For instance,

[1 byte: store num bits per int for block #1]
[block #1: integers 0..128 ]
[1 byte: store num bits per int for block #2 
[block #2: integers 128..256]
...
[1 byte: store num bits per int for block #n]
[block #n: integers (n-1)*128..n*128]
[last incomplete block (< 128 integers) for the reminder, using variable byte encoding]

The length of a bitpacked block packed using b bits per int is : (16 x b) bytes.

The following bench shows an implementation of such a codec, but is missing the "remaining variable int encoded block".

https://github.com/tantivy-search/bitpacking/blob/master/src/bitpacking_bench.rs#L95-L122

cisaacson commented 4 years ago

ThanksPaul for the quick response, makes sense. We will implement it this way.

Cory

-- Cory Isaacson http://www.coryisaacson.com On Jun 13, 2020, 5:50 PM -0600, Paul Masurel notifications@github.com, wrote:

The library only supports compresssing 128 integers (assuming you use BitPacker4x). You need to build your own codec on top of it. You also need to somehow define how you will store the number of bits used to compress your block. For instance, [1 byte: store num bits for block #1] [block #1: integers 0..128 ] [1 byte: store num bits for block #2 [block #2: integers 128..256] ... [1 byte: store num bits for block #n] [block #n: integers (n-1)128..n128] [last incomplete block (< 128 integers) for the reminder, using variable byte encoding] — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

cisaacson commented 4 years ago

Paul, your advice helped a lot, we have things working the way we want with excellent compression. Great work on the library.