lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, a vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

Better determination of FSST output buffer size #2606

Open westonpace opened 4 months ago

westonpace commented 4 months ago

Currently, the FSST output buffer is initialized to 8 * compressed_length. This should be safe (no FSST symbol is more than 8 bytes I think) but it is slow and excessive. We can store the original length of the string as part of the encoded page and use that to size the output buffer exactly.
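A minimal sketch of that sizing change, using a hypothetical helper rather than the actual Lance decoder API:

// Hedged sketch (hypothetical helper): size the FSST output buffer from a
// stored original length instead of the 8x worst case.
fn alloc_fsst_output(compressed_len: usize, original_len: Option<usize>) -> Vec<u8> {
    match original_len {
        // Exact size recorded at encode time (e.g. in the page metadata).
        Some(len) => Vec::with_capacity(len),
        // Fallback: worst case, since no FSST symbol is longer than 8 bytes.
        None => Vec::with_capacity(8 * compressed_len),
    }
}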

broccoliSpicy commented 4 months ago

One idea is to store the original string size inside some header (maybe inside the symbol table) and allocate the exact size before decompression.
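A rough sketch of that layout, assuming a hypothetical serialization rather than the actual fsst crate format: write the uncompressed length as a little-endian u64 in front of the symbol table and read it back before allocating the output.

// Hedged sketch (hypothetical layout): prepend the uncompressed length to the
// serialized symbol table so the decoder can size its output buffer exactly.
fn write_header(out: &mut Vec<u8>, uncompressed_len: u64, symbol_table: &[u8]) {
    out.extend_from_slice(&uncompressed_len.to_le_bytes());
    out.extend_from_slice(symbol_table);
}

fn read_header(buf: &[u8]) -> (u64, &[u8]) {
    let (len_bytes, symbol_table) = buf.split_at(8);
    let uncompressed_len = u64::from_le_bytes(len_bytes.try_into().unwrap());
    (uncompressed_len, symbol_table)
}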

broccoliSpicy commented 4 months ago

In our fsst implementation, the symbol table and the decompressed output are decoupled; the user is responsible for associating each symbol table with its decompressed output, which makes me a little worried. Maybe we could add some kind of matching code to both the symbol table and the decompressed output.

broccoliSpicy commented 4 months ago

This should be safe (no FSST symbol is more than 8 bytes I think) but it is slow and excessive.

Yes, FSST symbols are between 1 and 8 bytes in size.

westonpace commented 4 months ago

One idea is to store the original string size inside some header (maybe inside the symbol table) and allocate the exact size before decompression.

That could work; I think that's fine for the FSST crate. We always have encodings.proto too. I don't see any big pros/cons to either approach, so feel free to pick one.

broccoliSpicy commented 3 months ago

ha @westonpace, I started to code storing the original size and found a problem: we decode in decode batch sizes, not the original compression size, so the original size overshoots even more than `8 * compressed_length`.

I checked the original implementation; they just allocate a big chunk of memory for decompression:

#define FSST_MEMBUF (1ULL<<22)

---

   dstMem[0] = srcBuf[1] + (FSST_MEMBUF*(1ULL+decompress));
   dstMem[1] = dstMem[0] + (FSST_MEMBUF*(2ULL-decompress));

---

    dstLen[swap] = fsst_decompress(&decoder, srcLen[swap] - hdr, srcBuf[swap] + hdr, FSST_MEMBUF, dstBuf[swap] = dstMem[swap]);

original implementation

let me check out duckdb...

broccoliSpicy commented 3 months ago

In duckdb, the decompressed values are emitted one by one and they reuse the same [decompress_buffer](https://github.com/duckdb/duckdb/blob/18254ec5d9162db2e4aeec8159691a3b17217494/src/storage/compression/fsst.cpp#L511), whose size is `StringUncompressed::GetStringBlockLimit(segment.GetBlockManager().GetBlockSize())`:

struct StringUncompressed {
public:
    static CompressionFunction GetFunction(PhysicalType data_type);
    static idx_t GetStringBlockLimit(const idx_t block_size) {
        return MinValue(AlignValueFloor(block_size / 4), DEFAULT_STRING_BLOCK_LIMIT);
    }

public:
    //! The default maximum string size for sufficiently big block sizes
    static constexpr idx_t DEFAULT_STRING_BLOCK_LIMIT = 4096;
};

duckdb

westonpace commented 3 months ago

we decode in decode batch sizes, not the original compression size

Oh, good point.

I checked the original implementation; they just allocate a big chunk of memory for decompression

In duckdb, the decompressed values are emitted one by one and they reuse the same `decompress_buffer`

These seem like good approaches. When the scheduler is created we can allocate one largish (64MiB?) decompression buffer. Or maybe, during decompression we can check the size (with unlikely / cold on the "not large enough" branch) and increase as needed.

Then, after decompression, we just copy out the decompressed data into a newly allocated exact-size buffer.
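A hedged sketch of the grow-as-needed variant, with a stand-in closure for the real FSST decompress call and made-up names:

// Hedged sketch (hypothetical API, not the actual Lance decoder): reuse a
// scratch buffer, grow it on a cold path when it is too small, then copy the
// decompressed bytes into an exact-size allocation.
#[cold]
fn grow_scratch(scratch: &mut Vec<u8>, needed: usize) {
    scratch.resize(needed, 0);
}

fn decompress_exact(
    scratch: &mut Vec<u8>,
    compressed: &[u8],
    // Stand-in for the real FSST decompress; returns the number of bytes written.
    decompress: impl Fn(&[u8], &mut [u8]) -> usize,
) -> Vec<u8> {
    // Worst case: every compressed byte expands to an 8-byte symbol.
    let worst_case = 8 * compressed.len();
    if scratch.len() < worst_case {
        grow_scratch(scratch, worst_case);
    }
    let written = decompress(compressed, scratch.as_mut_slice());
    // Hand back an exact-size buffer so the scratch space can be reused.
    scratch[..written].to_vec()
}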

broccoliSpicy commented 3 months ago

I am still trying to figure out how to reuse one PageScheduler.decompression_buffer in PageDecoder; PageDecoder.decode is multi-threaded.

westonpace commented 3 months ago

@broccoliSpicy can you use some kind of thread local?
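A minimal sketch of the thread-local idea (names are made up, not the actual Lance code): each decode thread keeps its own scratch buffer and reuses it across calls on that thread.

// Hedged sketch (hypothetical names): a per-thread scratch buffer reused
// across decode calls on the same thread.
use std::cell::RefCell;

thread_local! {
    static DECOMPRESS_SCRATCH: RefCell<Vec<u8>> = RefCell::new(Vec::new());
}

fn with_scratch<R>(min_len: usize, f: impl FnOnce(&mut [u8]) -> R) -> R {
    DECOMPRESS_SCRATCH.with(|cell| {
        let mut buf = cell.borrow_mut();
        if buf.len() < min_len {
            buf.resize(min_len, 0);
        }
        f(&mut buf[..min_len])
    })
}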

broccoliSpicy commented 3 months ago

ha, I actually didn't know about thread_local, thank you so much @westonpace. I think our PageDecoder.decode threads run once and are gone, right? We actually want each of them to have its own output buffer so they can run concurrently.

After PR #2626, decompression performance is on par with Parquet. We can add cascading later; the file size will then be smaller than Parquet's, though decompression will be slower.