segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0
341 stars 58 forks source link

Do not copy values to temporary memory buffer when inserting values into dictionaries #263

Closed achille-roussel closed 2 years ago

achille-roussel commented 2 years ago

Based on #262, this PR removes a measurable overhead of copying values to insert into dictionaries to a temporary buffer. I had doing this because the values may have come from sparse memory locations (e.g. packed in a parquet.Value) and needed to be presented as a contiguous array of keys to the probing tables.

I introduced a hashprobe/sparse package which offers an abstraction layer for accessing arrays of values in sparse memory locations, and allows the underlying algorithms to work on non-contiguous memory areas. I think this package could be reused in other places in parquet-go, I'll see how I could refactor the code to leverage it elsewhere in a follow up.

I measured a significant performance regression on large dictionary inserts (1M+ values), which I tracked down to being caused by CPU cache invalidation causing multiple passes over the memory areas to amplify the cost of cache misses. I figured out a mechanism for chunking the inserts so that the second passes are able to hit CPU caches. I documented these learnings in the code as well.

name                                                  old time/op  new time/op  delta
Dictionary/Insert/INT32/N=100                          237ns ± 0%   227ns ± 1%   -4.32%  (p=0.000 n=10+10)
Dictionary/Insert/INT32/N=1000                        2.15µs ± 0%  1.98µs ± 0%   -7.78%  (p=0.000 n=10+9)
Dictionary/Insert/INT32/N=10000                       24.3µs ± 0%  22.2µs ± 0%   -8.50%  (p=0.000 n=10+9)
Dictionary/Insert/INT32/N=100000                       361µs ± 6%   328µs ± 2%   -9.08%  (p=0.000 n=10+10)
Dictionary/Insert/INT32/N=1000000                     8.32ms ± 9%  8.02ms ± 4%     ~     (p=0.095 n=10+9)
Dictionary/Insert/INT64/N=100                          255ns ± 0%   216ns ± 0%  -15.26%  (p=0.000 n=10+10)
Dictionary/Insert/INT64/N=1000                        2.29µs ± 0%  2.11µs ± 0%   -7.66%  (p=0.000 n=10+10)
Dictionary/Insert/INT64/N=10000                       26.5µs ± 0%  23.8µs ± 0%  -10.24%  (p=0.000 n=9+10)
Dictionary/Insert/INT64/N=100000                       748µs ± 2%   708µs ± 2%   -5.27%  (p=0.000 n=10+9)
Dictionary/Insert/INT64/N=1000000                     12.2ms ± 4%  11.8ms ± 4%   -3.25%  (p=0.010 n=10+9)
Dictionary/Insert/FIXED_LEN_BYTE_ARRAY(16)/N=100       404ns ± 0%   413ns ± 0%   +2.29%  (p=0.000 n=10+10)
Dictionary/Insert/FIXED_LEN_BYTE_ARRAY(16)/N=1000     4.62µs ± 0%  4.41µs ± 3%   -4.34%  (p=0.000 n=9+10)
Dictionary/Insert/FIXED_LEN_BYTE_ARRAY(16)/N=10000    95.1µs ± 1%  96.4µs ± 2%   +1.38%  (p=0.011 n=10+10)
Dictionary/Insert/FIXED_LEN_BYTE_ARRAY(16)/N=100000   1.20ms ± 6%  1.20ms ± 1%     ~     (p=0.780 n=10+9)
Dictionary/Insert/FIXED_LEN_BYTE_ARRAY(16)/N=1000000  27.6ms ± 2%  26.5ms ± 3%   -3.93%  (p=0.000 n=9+9)

name                                                  old value/s  new value/s  delta
Dictionary/Insert/INT32/N=100                           422M ± 0%    441M ± 1%   +4.50%  (p=0.000 n=10+10)
Dictionary/Insert/INT32/N=1000                          465M ± 0%    504M ± 0%   +8.41%  (p=0.000 n=9+9)
Dictionary/Insert/INT32/N=10000                         412M ± 0%    450M ± 0%   +9.29%  (p=0.000 n=10+9)
Dictionary/Insert/INT32/N=100000                        278M ± 6%    305M ± 2%   +9.87%  (p=0.000 n=10+10)
Dictionary/Insert/INT32/N=1000000                       120M ± 9%    125M ± 4%     ~     (p=0.095 n=10+9)
Dictionary/Insert/INT64/N=100                           393M ± 0%    463M ± 0%  +17.99%  (p=0.000 n=10+10)
Dictionary/Insert/INT64/N=1000                          437M ± 0%    474M ± 0%   +8.30%  (p=0.000 n=10+10)
Dictionary/Insert/INT64/N=10000                         378M ± 0%    421M ± 0%  +11.41%  (p=0.000 n=9+10)
Dictionary/Insert/INT64/N=100000                        134M ± 2%    141M ± 2%   +5.56%  (p=0.000 n=10+9)
Dictionary/Insert/INT64/N=1000000                      82.0M ± 4%   84.0M ± 7%     ~     (p=0.052 n=10+10)
Dictionary/Insert/FIXED_LEN_BYTE_ARRAY(16)/N=100        247M ± 0%    242M ± 0%   -2.23%  (p=0.000 n=10+10)
Dictionary/Insert/FIXED_LEN_BYTE_ARRAY(16)/N=1000       217M ± 0%    227M ± 3%   +4.56%  (p=0.000 n=9+10)
Dictionary/Insert/FIXED_LEN_BYTE_ARRAY(16)/N=10000      105M ± 1%    104M ± 2%   -1.35%  (p=0.011 n=10+10)
Dictionary/Insert/FIXED_LEN_BYTE_ARRAY(16)/N=100000    83.1M ± 5%   83.1M ± 1%     ~     (p=0.780 n=10+9)
Dictionary/Insert/FIXED_LEN_BYTE_ARRAY(16)/N=1000000   36.3M ± 2%   37.7M ± 3%   +4.09%  (p=0.000 n=9+9)