
Opinion on primitive block implementation with chunks instead of single value arrays #22852

Open sdruzkin opened 3 months ago

sdruzkin commented 3 months ago

In presto-orc and nimble we have a use case where we need to merge/flatten a bunch of blocks into a single map, i.e. construct a flat map. A flat map is an encoding for maps in which we store a value stream for each key independently, instead of having one big value stream for all keys. Flat maps usually have a complex value type like array<int> or array<long>.

Here is the code that does the flat map merging: https://github.com/prestodb/presto/blob/master/presto-orc/src/main/java/com/facebook/presto/orc/reader/MapFlatSelectiveStreamReader.java#L445

map key1: entry values block 1 \
map key2: entry values block 2 \
map key3: entry values block 3 ---> map[key1, value1,....]
map key4: entry values block 4 /
map key5: entry values block 5 /

We use a regular map block builder for merging. One of the problems is that the leaf-level block builders need to copy the leaf value arrays multiple times as they grow, and that copy becomes quite expensive once they reach 100k-1m+ elements. Ideally, to avoid merging we could have a dedicated block and type for flat maps, or represent the flat map block as a struct, but that would require quite a lot of changes. Calculating the exact number of elements before merging, so that the block builders can be created at exactly the expected size, is also expensive due to the complex data types and the different block encodings.
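For context, here is a minimal sketch of the copy-on-grow pattern in a flat array-backed builder (a simplified illustration, not Presto's actual builder code); once the values array holds hundreds of thousands of elements, every resize copies all of them again:

import java.util.Arrays;

// Simplified illustration of a flat int builder, not Presto's actual code.
class FlatIntBuilder {
  private int[] values = new int[1024];
  private int size;

  void writeInt(int value) {
    if (size == values.length) {
      // Doubling copies every value written so far;
      // at 100k-1m+ elements this copy dominates the merge cost.
      values = Arrays.copyOf(values, values.length * 2);
    }
    values[size++] = value;
  }
}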

To mitigate the problem of expensive resizing in the block builder, I'm thinking of creating chunked implementations of the Int and Long blocks and their builders.

Something like this, which is pretty similar to an Apache Arrow chunked array:

ChunkedIntArrayBlock {
 int[][] valueChunks;

 int getInt(int position) {
  int chunk = getChunkForPosition(position);
  int inChunkPosition = getPositionInChunk(position);
  return valueChunks[chunk][inChunkPosition];
 }

 int getChunkForPosition(int position) {
   ...
 }
 int getPositionInChunk(int position) {
   ...
 }
}

Construction will be cheaper, but read access will be a bit more expensive. Behavior-wise it would be identical to the current implementations and should be compatible with the existing block encoders. Unfortunately the block interface is a bit complicated, so it will take at least a few days of work per type.
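As an illustration of why construction gets cheaper, here is a minimal sketch of a chunked builder, assuming a fixed power-of-two chunk size (the class name, method names, and chunk size are placeholders, not an existing Presto API). Appending allocates a new chunk when the current one fills up, so values that were already written are never copied, unlike a flat builder that copies the whole array on every resize:

import java.util.ArrayList;
import java.util.List;

// Sketch only, not an actual Presto block builder.
class ChunkedIntBlockBuilder {
  private static final int CHUNK_SIZE_BITS = 13;        // 8192 values per chunk
  private static final int CHUNK_SIZE = 1 << CHUNK_SIZE_BITS;
  private static final int CHUNK_MASK = CHUNK_SIZE - 1;

  private final List<int[]> sealedChunks = new ArrayList<>();
  private int[] currentChunk = new int[CHUNK_SIZE];
  private int positionCount;

  void writeInt(int value) {
    int inChunkPosition = positionCount & CHUNK_MASK;
    if (inChunkPosition == 0 && positionCount > 0) {
      // Current chunk is full: seal it and start a new one.
      // No previously written values are copied, no matter how large the block already is.
      sealedChunks.add(currentChunk);
      currentChunk = new int[CHUNK_SIZE];
    }
    currentChunk[inChunkPosition] = value;
    positionCount++;
  }

  int[][] buildValueChunks() {
    sealedChunks.add(currentChunk);
    return sealedChunks.toArray(new int[0][]);
  }
}

With a power-of-two chunk size, getChunkForPosition and getPositionInChunk in the block above reduce to position >>> CHUNK_SIZE_BITS and position & CHUNK_MASK, so a read costs two bit operations plus one extra array dereference compared to the flat block.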

Do you guys have any opinions on this approach, or on creating new block types in general?

tdcmeehan commented 3 months ago

Calculating the exact number of elements before merging to create the block builders of the exact expected size is also expensive due to complex data types and different block encodings.

Can you help me understand why this is? Isn't this just peeking at the metadata of the underlying blocks--what makes it expensive?

sdruzkin commented 3 months ago

Calculating the exact number of elements before merging to create the block builders of the exact expected size is also expensive due to complex data types and different block encodings.

Can you help me understand why this is? Isn't this just peeking at the metadata of the underlying blocks--what makes it expensive?

There are a few factors that contribute to the complexity:

  1. Currently there is no API that lets you get the number of elements in a block broken down by nesting level for complex types.
  2. In the majority of cases we merge roughly 10K-25K value blocks, with an upper limit of 30K. The value blocks usually hold complex types, and dereferencing them is very unfriendly to the CPU cache.
  3. Implementing the counting logic for flat-encoded blocks is fine, but for dictionary-encoded blocks we would need multiple traversals between the parent dictionary ids and the dictionary values block (see the sketch after this list). By taking the block encoding into account, we put ourselves in a position where the counting code has to be updated whenever a new block encoding is added.
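To make points 2 and 3 concrete, here is a rough sketch (with hypothetical interfaces, not the real Block API) of counting leaf elements behind a flat versus a dictionary-encoded array block. Every position goes through an extra indirection into the dictionary, and each new encoding needs yet another variant of the counting method:

// Hypothetical interfaces, for illustration only.
interface ArrayValuesBlock {
  int getPositionCount();
  int getElementCount(int position);   // number of leaf elements of the array at this position
}

interface DictionaryValuesBlock {
  int getPositionCount();
  int getId(int position);             // index into the dictionary block
  ArrayValuesBlock getDictionary();
}

final class LeafElementCounter {
  // Flat case: one pass over the positions.
  static long countLeafElements(ArrayValuesBlock block) {
    long total = 0;
    for (int i = 0; i < block.getPositionCount(); i++) {
      total += block.getElementCount(i);
    }
    return total;
  }

  // Dictionary case: every position is an extra indirection into the dictionary,
  // and each new block encoding needs its own variant of this method.
  static long countLeafElements(DictionaryValuesBlock block) {
    ArrayValuesBlock dictionary = block.getDictionary();
    long total = 0;
    for (int i = 0; i < block.getPositionCount(); i++) {
      total += dictionary.getElementCount(block.getId(i));
    }
    return total;
  }
}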

Another use case for chunked blocks is applying deltas: we have a base block and a bunch of delete-row or update-cell deltas that are merged with the base block after loading.

There are also a few more alternatives to the chunked encoding that I can think of: