broccoliSpicy opened this issue 1 month ago
Another concern here: under the assumption that we want our write/read page size to align with the physical disk's or cloud storage's optimal write/read size, we actually can't tell the output size of an encoding at the time we issue it.
We spoke in person (well, Google Meet) about this issue, and here is my understanding:
Now that we have compressive encodings, we need to worry about the difference between "decoded size" and "encoded size". Our current approach is "accumulate at least 8MB of decoded data, encode all of it, write a page" (the 8MB is configurable). If an encoding is very compressive, then we might write small pages.
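To make the size mismatch concrete, here is a minimal Rust sketch of that accumulate-then-encode loop. Everything in it (`PageAccumulator`, `encode`, `DEFAULT_PAGE_SIZE`) is an invented stand-in for illustration, not the actual Lance writer code:

```rust
// Hypothetical sketch; names do not correspond to real Lance types.
const DEFAULT_PAGE_SIZE: usize = 8 * 1024 * 1024; // 8MB of *decoded* data

/// Stand-in for a real (possibly very compressive) encoder.
fn encode(chunks: &[Vec<u8>]) -> Vec<u8> {
    chunks.concat()
}

#[derive(Default)]
struct PageAccumulator {
    buffered: Vec<Vec<u8>>, // decoded data waiting to be flushed
    decoded_bytes: usize,
}

impl PageAccumulator {
    /// Buffer decoded data; once at least 8MB of decoded bytes have
    /// accumulated, encode everything and emit a page.
    fn push(&mut self, data: Vec<u8>) -> Option<Vec<u8>> {
        self.decoded_bytes += data.len();
        self.buffered.push(data);
        if self.decoded_bytes >= DEFAULT_PAGE_SIZE {
            let page = encode(&self.buffered);
            self.buffered.clear();
            self.decoded_bytes = 0;
            Some(page)
        } else {
            None
        }
    }
}
```

The flush decision is made on decoded bytes, so a 10x-compressive encoding would emit pages of roughly 0.8MB rather than 8MB.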
In addition, many encodings have a preferred "chunk size". For example, FSST creates a unique symbol table for each chunk. In Fastlanes-style bitpacking/FOR/delta, the authors operate on chunks of 1024 rows (and each chunk may have a unique bit width).
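As a rough illustration of the per-chunk idea, here is a scalar sketch of choosing one bit width per 1024-row chunk. Real Fastlanes uses SIMD-friendly transposed layouts, which are omitted here:

```rust
// Illustrative only; not the actual Fastlanes packing layout.
const CHUNK_SIZE: usize = 1024;

/// Minimal bit width needed to represent every value in the chunk.
fn chunk_bit_width(chunk: &[u32]) -> u32 {
    let max = chunk.iter().copied().max().unwrap_or(0);
    32 - max.leading_zeros() // 0 for an all-zero chunk
}

/// One bit width per 1024-row chunk, as in Fastlanes-style bitpacking.
fn plan_chunks(values: &[u32]) -> Vec<u32> {
    values.chunks(CHUNK_SIZE).map(chunk_bit_width).collect()
}
```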
I propose something like this:
@westonpace Ha, there are actually some comments related to this issue under #2563, though nothing new that we haven't already covered in the Google Meet.
For encodings like FSST, we store the decoding metadata in `PageInfo.encoding`; however, after compaction we may expect multiple pieces of decoding metadata in a single page.
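A sketch of what that could look like, assuming a page may carry several independently encoded chunks. The `PageInfo`/`ChunkEncoding` shapes below are hypothetical stand-ins, not Lance's actual types:

```rust
// Hypothetical shapes, not Lance's real metadata types: if one compacted
// page carries several independently encoded chunks, the page metadata
// needs one decoding description per chunk instead of a single one.
struct ChunkEncoding {
    num_rows: u64,         // rows covered by this chunk
    symbol_table: Vec<u8>, // e.g. the FSST symbol table for this chunk
}

struct PageInfo {
    // Today, roughly: one encoding description per page.
    // After compaction we may instead need a list:
    encodings: Vec<ChunkEncoding>,
}
```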