Based on #262, this PR removes the measurable overhead of copying values into a temporary buffer before inserting them into dictionaries. I had been doing this because the values may come from sparse memory locations (e.g. packed in a `parquet.Value`) and needed to be presented as a contiguous array of keys to the probing tables.
I introduced a `hashprobe/sparse` package which offers an abstraction layer for accessing arrays of values in sparse memory locations, and allows the underlying algorithms to work on non-contiguous memory areas. I think this package could be reused in other places in parquet-go; I'll look at refactoring the code to leverage it elsewhere in a follow-up.
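To illustrate the idea of a sparse array, here is a minimal sketch of a strided view over keys embedded in larger records. It is not the actual `hashprobe/sparse` API: `Row`, `Uint32Array`, `RowKeys`, and `CountDistinct` are hypothetical names made up for this example, and the 16-byte row layout is an assumption.

```go
package main

import "unsafe"

// Row is a hypothetical record whose key is surrounded by other data,
// so consecutive keys are not contiguous in memory.
type Row struct {
	Key     uint32
	Payload [12]byte
}

// Uint32Array is a hypothetical strided view: element i lives at
// ptr + i*stride. It reads values in place instead of copying them.
type Uint32Array struct {
	ptr    unsafe.Pointer
	len    int
	stride uintptr
}

// RowKeys builds a view over the Key field of each row without copying.
func RowKeys(rows []Row) Uint32Array {
	if len(rows) == 0 {
		return Uint32Array{}
	}
	return Uint32Array{
		ptr:    unsafe.Pointer(&rows[0].Key),
		len:    len(rows),
		stride: unsafe.Sizeof(rows[0]),
	}
}

// Len returns the number of elements in the view.
func (a Uint32Array) Len() int { return a.len }

// Index returns the i-th element, panicking when i is out of range.
func (a Uint32Array) Index(i int) uint32 {
	if i < 0 || i >= a.len {
		panic("sparse: index out of range")
	}
	return *(*uint32)(unsafe.Add(a.ptr, uintptr(i)*a.stride))
}

// CountDistinct is an example algorithm written against the view, so it
// works the same whether the keys are contiguous or strided.
func CountDistinct(keys Uint32Array) int {
	seen := make(map[uint32]struct{}, keys.Len())
	for i := 0; i < keys.Len(); i++ {
		seen[keys.Index(i)] = struct{}{}
	}
	return len(seen)
}
```

With this shape, a caller passes `RowKeys(rows)` straight to the algorithm and no temporary `[]uint32` is allocated or filled.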
I measured a significant performance regression on large dictionary inserts (1M+ values), which I tracked down to CPU cache eviction: with that many values, each of the multiple passes over the memory areas misses the cache again, which amplifies the cost of cache misses. I worked out a mechanism for chunking the inserts so that the second pass over each chunk hits the CPU caches. I documented these findings in the code as well.
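The chunking idea can be sketched as follows. This is a simplified illustration rather than the code in this PR: it assumes the two passes are "hash every key" and "probe the table", and `Table`, `NewTable`, `Probe`, `hash32`, and the `chunkSize` of 1024 are hypothetical choices for the example.

```go
package main

// chunkSize is a hypothetical value, chosen so that one chunk of keys and
// hashes is small enough to stay in the CPU caches between passes.
const chunkSize = 1024

// Table is a toy linear-probing table mapping each distinct key to the
// order in which it was first inserted.
type Table struct {
	keys  []uint32
	vals  []int32
	used  []bool
	count int32
}

// NewTable creates a table; capacity must be a power of two and larger
// than the number of distinct keys (this sketch never grows).
func NewTable(capacity int) *Table {
	if capacity <= 0 || capacity&(capacity-1) != 0 {
		panic("capacity must be a power of two")
	}
	return &Table{
		keys: make([]uint32, capacity),
		vals: make([]int32, capacity),
		used: make([]bool, capacity),
	}
}

// hash32 is a simple multiplicative hash, used only for illustration.
func hash32(k uint32) uint32 {
	k *= 0x9E3779B1
	return k ^ (k >> 15)
}

// Probe inserts keys and writes the index of each key to values, which
// must be at least as long as keys. Both passes run on one chunk before
// moving to the next, so the second pass reads data that is still cached.
func (t *Table) Probe(keys []uint32, values []int32) {
	var hashes [chunkSize]uint32
	for len(keys) > 0 {
		n := len(keys)
		if n > chunkSize {
			n = chunkSize
		}
		chunk := keys[:n]
		for i, k := range chunk { // pass 1: hash the chunk
			hashes[i] = hash32(k)
		}
		for i, k := range chunk { // pass 2: probe with the cached hashes
			values[i] = t.insert(k, hashes[i])
		}
		keys, values = keys[n:], values[n:]
	}
}

// insert finds or adds key using linear probing from its hash slot.
func (t *Table) insert(key, hash uint32) int32 {
	mask := uint32(len(t.keys) - 1)
	for i, n := hash&mask, 0; n < len(t.keys); i, n = (i+1)&mask, n+1 {
		if !t.used[i] {
			t.used[i], t.keys[i], t.vals[i] = true, key, t.count
			t.count++
			return t.vals[i]
		}
		if t.keys[i] == key {
			return t.vals[i]
		}
	}
	panic("table is full")
}
```

Without the outer chunk loop, pass 1 would walk all 1M+ keys before pass 2 started, by which point the beginning of the input would likely have been evicted from the caches.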