lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

feat: apply general compression for string field's offsets #3019

Open · niyue opened 1 week ago

niyue commented 1 week ago

In Lance format v2, while the byte array of a string field can benefit from general compression, the offsets array for the string field remains uncompressed (not even bit-packed). As a result, if the strings are relatively short, the overall compression ratio of the field is limited. Since general compression is applied at the buffer level, it could also be applied to other buffers, including the offsets array, to improve overall compression efficiency.
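
To make this concrete, here is a rough illustration with plain arrow-rs (not Lance's internals; the arrow_array dependency and the numbers are only for illustration) of how much of a short-string column is offsets rather than string bytes:

// Illustrative only: why offsets dominate for short strings.
use arrow_array::StringArray;

fn main() {
    // Four 2-byte strings: 8 bytes of UTF-8 payload, but the i32 offsets
    // buffer holds (4 + 1) * 4 = 20 bytes, and today only the payload side
    // benefits from general compression.
    let arr = StringArray::from(vec!["ab", "cd", "ef", "gh"]);
    let value_bytes = arr.values().len();
    let offset_bytes = arr.offsets().len() * std::mem::size_of::<i32>();
    println!("values: {value_bytes} B, offsets: {offset_bytes} B");
}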

broccoliSpicy commented 3 days ago

I had a rough idea of applying encodings and compressions based on DataBlock (more generalized than the Arrow data type) and centralized some data block statistics in PR https://github.com/lancedb/lance/pull/2986.

The next step I plan to take in this direction is a refactor of the current encoders, to make them work with DataBlock rather than Arrow arrays.

Something like this:

// block-level encoder (e.g. bit-pack): DataBlock in, DataBlock out
fn encode_block(data: DataBlock) -> DataBlock {
    // ...
}

// page-level encoder
fn encode_page(data: DataBlock) -> EncodedPage {
    // for each data block in this data block structure,
    //     while the data can still be encoded and we haven't hit the encoding cascade level limit:
    //         select an encoding
    //         do the encode
}

For the string type in this issue, the corresponding data block is VariableWidthBlock: https://github.com/lancedb/lance/blob/038f239345826f75bc1e6f9831209f40ce6b530d/rust/lance-encoding/src/data.rs#L446-L462

I may want to change it to:

pub struct VariableWidthBlock {
    /// The data buffer
    pub data: FixedWidthDataBlock,

    /// The offsets buffer (contains num_values + 1 offsets)
    ///
    /// Offsets MUST start at 0
    pub offsets: FixedWidthDataBlock,

    /// The number of values represented by this block
    pub num_values: u64,

    pub block_info: BlockInfo,

    pub used_encodings: UsedEncoding,
}

Then we will be able to apply encodings and compressions dynamically based on the data block type and its statistics. The output of an encoding will also be a data block, so we can apply encodings recursively.
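
As a rough, self-contained sketch of that recursion (the DataBlock stub, BlockEncoder, choose_encoding, and MAX_CASCADE_LEVELS below are all placeholder names, not Lance's current API):

struct DataBlock { /* buffers + statistics */ }

trait BlockEncoder {
    fn encode(&self, block: DataBlock) -> DataBlock;
}

// Stand-in for a statistics-driven selector (bit width, cardinality, ...).
fn choose_encoding(_block: &DataBlock) -> Option<Box<dyn BlockEncoder>> {
    None // placeholder: a real selector would inspect the block statistics
}

fn cascade(mut block: DataBlock) -> DataBlock {
    const MAX_CASCADE_LEVELS: usize = 3;
    for _ in 0..MAX_CASCADE_LEVELS {
        match choose_encoding(&block) {
            // the output is again a DataBlock, so we can keep going
            Some(encoder) => block = encoder.encode(block),
            // nothing profitable left to apply
            None => break,
        }
    }
    block
}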

westonpace commented 3 days ago

I definitely think we should bitpack string offsets. I think we are close to solidifying bitpacking in 2.1 (thanks @broccoliSpicy).

I think it's reasonable to allow general compression to be a catch-all in 2.1. Perhaps a boolean flag like lance-encoding:maximize-compression which, if true, will apply general compression in any situation where it would previously have applied plain value encoding?
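
Since the proposed key follows the lance-encoding:* field metadata naming, one way it could be set is through Arrow field metadata; purely illustrative, the key itself is only a proposal in this comment:

use std::collections::HashMap;
use arrow_schema::{DataType, Field};

fn main() {
    // NOTE: "lance-encoding:maximize-compression" is the key proposed above;
    // it is not an existing Lance option.
    let field = Field::new("description", DataType::Utf8, true).with_metadata(
        HashMap::from([(
            "lance-encoding:maximize-compression".to_string(),
            "true".to_string(),
        )]),
    );
    println!("{field:?}");
}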

niyue commented 2 days ago

centralized some datablock statistics in PR https://github.com/lancedb/lance/pull/2986

I definitely think we should bitpack string offsets

Bit packing may not always outperform Zstd, as its effectiveness depends on the data distribution. Choosing either bit packing or zstd might not yield optimal results in certain scenarios. Consulting data statistics to select the most suitable compression method for each case seems like a more effective approach, but I am unsure whether the statistics available in DataBlock are sufficient for this purpose.

westonpace commented 2 days ago

The main thing we look at for bit packing is bit width (i.e. how many bits are actually used), which should give us a good indication of how effective bit packing will be.

So it'll probably be: "bit width < X" implies bit packing, and use general compression otherwise.
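
In code, the heuristic could look roughly like this (the Compression enum, the bit-width helper, and the threshold value are placeholders, not Lance's actual types):

enum Compression {
    BitPack { bit_width: u8 },
    General, // e.g. a general compressor such as zstd over the raw buffer
}

// Bit width of the largest offset: the statistic the selection would consult.
fn max_bit_width(offsets: &[i32]) -> u8 {
    offsets
        .iter()
        .map(|&o| 32 - (o.max(0) as u32).leading_zeros())
        .max()
        .unwrap_or(0) as u8
}

fn pick(offsets: &[i32]) -> Compression {
    const THRESHOLD: u8 = 20; // the "X" above; this value is only a placeholder
    let bit_width = max_bit_width(offsets);
    if bit_width < THRESHOLD {
        Compression::BitPack { bit_width }
    } else {
        Compression::General
    }
}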

broccoliSpicy commented 1 day ago

Bit packing may not always outperform Zstd, as its effectiveness depends on the data distribution. Choosing either bit packing or zstd might not yield optimal results in certain scenarios. Consulting data statistics to select the most suitable compression method for each case seems like a more effective approach, but I am unsure whether the statistics available in DataBlock are sufficient for this purpose.

Yeah, encoding selection is going to be a very fun problem to tackle. In BtrBlocks, they have a fixed set of encodings, apply them to a sample of the input, and then select the best encoding; Vortex uses a similar strategy. In Nimble, the encoding selection policy is designed to be pluggable; some policies they already have can be found here.
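
For reference, the sampling-based approach could be sketched like this (Candidate and the function-pointer encoders are hypothetical stand-ins; the real BtrBlocks/Vortex machinery is more sophisticated):

struct Candidate {
    name: &'static str,
    encode: fn(&[u8]) -> Vec<u8>,
}

fn select<'a>(sample: &[u8], candidates: &'a [Candidate]) -> &'a Candidate {
    // Encode only the sample with each candidate and keep the one that
    // produces the smallest output; the winner is then applied to the full buffer.
    candidates
        .iter()
        .min_by_key(|c| (c.encode)(sample).len())
        .expect("at least one candidate encoding")
}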

We still need to ponder this problem and do lots of empirical analysis to get a sense of how to do it better (write speed, read speed, compression ratio, intended workload and its data patterns).

but I am unsure whether the statistics available in DataBlock are sufficient for this purpose.

This should be all right; we can add more statistics as we go along.