westonpace opened 4 months ago
One idea is to store the original string size inside some header (maybe inside the symbol table) and allocate the exact size before decompression
in our fsst implementation, the symbol table and the decompressed output are decoupled: the user is responsible for associating each symbol table with its decompressed output. That worries me a bit; maybe we can put some matching code in both the symbol table and the decompressed output so they can be paired up.
> This should be safe (no FSST symbol is more than 8 bytes I think) but it is slow and excessive.
yes, an FSST symbol has a size in [1, 8] bytes
> One idea is to store the original string size inside some header (maybe inside the symbol table) and allocate the exact size before decompression
That could work, I think that's fine for the FSST crate. We always have encodings.proto too. I don't see any big pros/cons to either approach so feel free to pick one.
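A minimal sketch of the header idea, assuming a simple length-prefix layout. The 4-byte little-endian header and the function names here are illustrative, not the fsst crate's actual format or API:

```rust
// Sketch: prefix the compressed blob with the uncompressed length so the
// decompressor can allocate an exact-size output buffer up front.
// Assumption: a 4-byte little-endian length header (illustrative only).

fn write_with_header(uncompressed_len: u32, compressed: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + compressed.len());
    out.extend_from_slice(&uncompressed_len.to_le_bytes());
    out.extend_from_slice(compressed);
    out
}

/// Split the header back out; returns (uncompressed_len, compressed_bytes).
fn read_header(blob: &[u8]) -> (u32, &[u8]) {
    let (hdr, rest) = blob.split_at(4);
    (u32::from_le_bytes(hdr.try_into().unwrap()), rest)
}

fn main() {
    let blob = write_with_header(100, b"compressed-bytes");
    let (len, body) = read_header(&blob);
    assert_eq!(len, 100);
    assert_eq!(body, b"compressed-bytes");
    // The caller can now allocate exactly: let out = vec![0u8; len as usize];
    println!("uncompressed len = {}", len);
}
```

The alternative mentioned above would carry the same length in `encodings.proto` metadata instead of an inline header; either way the decompressor sees the exact output size before allocating.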
ha @westonpace , I started coding the "store the original size" approach
and found a problem:
we are decoding in decode-batch-size chunks, not at the original compressed size, so the stored original size is an even bigger overshoot than `8 * compressed_length`
I checked the original implementation; it just allocates a big chunk of memory for decompression:
```c
#define FSST_MEMBUF (1ULL<<22)

dstMem[0] = srcBuf[1] + (FSST_MEMBUF*(1ULL+decompress));
dstMem[1] = dstMem[0] + (FSST_MEMBUF*(2ULL-decompress));

dstLen[swap] = fsst_decompress(&decoder, srcLen[swap] - hdr, srcBuf[swap] + hdr, FSST_MEMBUF, dstBuf[swap] = dstMem[swap]);
```
let me check out duckdb...
in duckdb, the decompressed values are emitted one by one and they reuse the same [decompress_buffer](https://github.com/duckdb/duckdb/blob/18254ec5d9162db2e4aeec8159691a3b17217494/src/storage/compression/fsst.cpp#L511) of size `StringUncompressed::GetStringBlockLimit(segment.GetBlockManager().GetBlockSize())`:
```cpp
struct StringUncompressed {
public:
	static CompressionFunction GetFunction(PhysicalType data_type);
	static idx_t GetStringBlockLimit(const idx_t block_size) {
		return MinValue(AlignValueFloor(block_size / 4), DEFAULT_STRING_BLOCK_LIMIT);
	}

public:
	//! The default maximum string size for sufficiently big block sizes
	static constexpr idx_t DEFAULT_STRING_BLOCK_LIMIT = 4096;
};
```
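For reference, duckdb's sizing rule transcribed into Rust. One assumption here: `AlignValueFloor` is taken to floor-align to an 8-byte boundary (duckdb's default alignment); check the duckdb source before relying on that:

```rust
// Rust transcription of duckdb's GetStringBlockLimit sizing rule.
// Assumption: AlignValueFloor floor-aligns to an 8-byte boundary
// (duckdb's default alignment) -- verify against the duckdb source.

const DEFAULT_STRING_BLOCK_LIMIT: usize = 4096;

fn align_value_floor(v: usize) -> usize {
    v & !7 // round down to a multiple of 8
}

fn get_string_block_limit(block_size: usize) -> usize {
    std::cmp::min(align_value_floor(block_size / 4), DEFAULT_STRING_BLOCK_LIMIT)
}

fn main() {
    // With duckdb's default 256 KiB block size, the cap of 4096 wins.
    assert_eq!(get_string_block_limit(262_144), 4096);
    // Smaller blocks fall below the cap: 8192 / 4 = 2048.
    assert_eq!(get_string_block_limit(8192), 2048);
    println!("ok");
}
```

So duckdb bounds the per-value decompression buffer by a fixed string-size limit rather than by the compressed length, which is what lets it reuse one buffer for the whole segment.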
> we are decoding it in decode batch size, not the original compression size
Oh, good point.
> I checked the original implementation, they just allocate a big chunk of memory for decompression
> in duckdb, the decompressed values are emitted one by one and they reuse the same decompress_buffer
These seem like good approaches. When the scheduler is created we can allocate one largish (64MiB?) decompression buffer. Or maybe, during decompression we can check the size (with unlikely / cold on the "not large enough" branch) and increase as needed.
Then, after decompression, we just copy out the decompressed data into a newly allocated exact-size buffer.
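The "check and grow" variant could look like the sketch below: one long-lived buffer, with the reallocation confined to a `#[cold]` path. The 64 MiB starting size comes from the suggestion above; the type and method names are illustrative:

```rust
// Sketch of a grow-on-demand decompression buffer: keep one long-lived
// buffer and only reallocate on the (hopefully rare) too-small path.
// INITIAL_CAPACITY and all names here are illustrative.

const INITIAL_CAPACITY: usize = 64 * 1024 * 1024; // 64 MiB

struct DecompressionBuffer {
    buf: Vec<u8>,
}

impl DecompressionBuffer {
    fn new() -> Self {
        Self { buf: vec![0u8; INITIAL_CAPACITY] }
    }

    /// Return a scratch slice of exactly `needed` bytes, growing only
    /// when the current buffer is too small.
    fn get(&mut self, needed: usize) -> &mut [u8] {
        if needed > self.buf.len() {
            self.grow(needed); // cold path: keep it out of the hot loop
        }
        &mut self.buf[..needed]
    }

    #[cold]
    #[inline(never)]
    fn grow(&mut self, needed: usize) {
        // Round up so repeated small overshoots don't cause repeated growth.
        self.buf.resize(needed.next_power_of_two(), 0);
    }
}

fn main() {
    let mut b = DecompressionBuffer::new();
    assert_eq!(b.get(1024).len(), 1024);
    let big = 128 * 1024 * 1024;
    assert_eq!(b.get(big).len(), big); // triggers the cold grow path once
    println!("ok");
}
```

After decompressing into this scratch space, the exact-size copy-out described above keeps the long-lived buffer from pinning memory inside the returned arrays.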
I am still trying to figure out how to reuse one `PageScheduler.decompression_buffer` in `PageDecoder`; `PageDecoder.decode` is multi-threaded
@broccoliSpicy can you use some kind of thread local?
ha, I actually didn't know about thread_local
, thank you so much @westonpace
I think our PageDecoder.decode
thread run once and gone, right? we actually want each of them to have their own output buffer so they can run concurrently.
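For reference, the thread-local idea in Rust could be sketched as below. Note the caveat matching the concern above: if decode threads are spawned per task and then exit, each thread pays the allocation once and drops it, so the reuse only pays off on a thread pool. All names here are illustrative:

```rust
use std::cell::RefCell;

thread_local! {
    // One reusable decompression buffer per thread. Threads never share it,
    // so concurrent decodes each get their own scratch space. If decode
    // threads are short-lived, the buffer dies with the thread and reuse
    // only helps when threads come from a pool.
    static DECOMPRESS_BUF: RefCell<Vec<u8>> = RefCell::new(Vec::new());
}

/// Run `f` with a per-thread scratch slice of at least `needed` bytes.
fn with_decompress_buffer<R>(needed: usize, f: impl FnOnce(&mut [u8]) -> R) -> R {
    DECOMPRESS_BUF.with(|cell| {
        let mut buf = cell.borrow_mut();
        if buf.len() < needed {
            buf.resize(needed, 0); // grow once, reuse afterwards
        }
        f(&mut buf[..needed])
    })
}

fn main() {
    let n = with_decompress_buffer(16, |scratch| {
        scratch.copy_from_slice(b"0123456789abcdef");
        scratch.len()
    });
    assert_eq!(n, 16);
    println!("ok");
}
```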
after PR#2626, the decompression performance is on par with parquet; we can add cascading later, and then the file size will be smaller than parquet's at the cost of slower decompression
Currently, the FSST output buffer is initialized to `8 * compressed_length`. This should be safe (no FSST symbol is more than 8 bytes I think) but it is slow and excessive. We can store the original length of the string as part of the encoded page and use that to size the output buffer exactly.