segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0

Improve BYTE_ARRAY dictionaries #273

Closed achille-roussel closed 2 years ago

achille-roussel commented 2 years ago

This PR contributes to #226 by optimizing the byteArrayDictionary type.

I made multiple attempts to improve the throughput of probing operations for string values (reflected in this branch's history), but could not increase it significantly the way I did for probing fixed-length values (e.g. int32). The added complexity was not justified by the gains, so we will stick with map[string]int32 in this case.
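For readers unfamiliar with the approach, a dictionary backed by map[string]int32 can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the code in this branch: the names byteArrayDict, newByteArrayDict, insert, and lookup are hypothetical, and the real byteArrayDictionary type may lay out its values differently.

```go
package main

import "fmt"

// byteArrayDict is a hypothetical, simplified stand-in for a BYTE_ARRAY
// dictionary. It is NOT the library's byteArrayDictionary type.
type byteArrayDict struct {
	index  map[string]int32 // value -> position in values
	values []string         // distinct values in insertion order
}

func newByteArrayDict() *byteArrayDict {
	return &byteArrayDict{index: make(map[string]int32)}
}

// insert returns the dictionary index of value, adding it if absent.
func (d *byteArrayDict) insert(value []byte) int32 {
	// Probing with string(value) directly in the map index expression
	// lets the Go compiler avoid allocating a temporary string.
	if i, ok := d.index[string(value)]; ok {
		return i
	}
	// First occurrence: copy the bytes so the dictionary does not alias
	// the caller's buffer, then record the new index.
	key := string(value)
	i := int32(len(d.values))
	d.values = append(d.values, key)
	d.index[key] = i
	return i
}

// lookup returns the value stored at index i.
func (d *byteArrayDict) lookup(i int32) string {
	return d.values[i]
}

func main() {
	d := newByteArrayDict()
	for _, v := range []string{"a", "b", "a", "c", "b"} {
		fmt.Println(v, d.insert([]byte(v)))
	}
}
```

Repeated values map to the index assigned on first insertion, which is the property a dictionary-encoded column needs; the map probe is the operation the benchmarks below measure.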

Overall, this change brings measurable but minor performance improvements, bug fixes, and simpler code.

name                                    old time/op  new time/op  delta
Dictionary/Insert/BYTE_ARRAY/N=100      1.40µs ± 0%  1.36µs ± 1%  -2.88%  (p=0.000 n=8+9)
Dictionary/Insert/BYTE_ARRAY/N=1000     13.8µs ± 1%  13.6µs ± 1%  -1.73%  (p=0.000 n=10+9)
Dictionary/Insert/BYTE_ARRAY/N=10000     414µs ± 2%   403µs ± 0%  -2.72%  (p=0.000 n=10+10)
Dictionary/Insert/BYTE_ARRAY/N=100000   5.58ms ± 4%  5.57ms ± 2%    ~     (p=0.579 n=10+10)
Dictionary/Insert/BYTE_ARRAY/N=1000000   105ms ± 2%   105ms ± 2%    ~     (p=0.436 n=10+10)

name                                    old value/s  new value/s  delta
Dictionary/Insert/BYTE_ARRAY/N=100       71.3M ± 1%   73.5M ± 1%  +3.04%  (p=0.000 n=9+9)
Dictionary/Insert/BYTE_ARRAY/N=1000      72.3M ± 1%   73.6M ± 1%  +1.77%  (p=0.000 n=10+9)
Dictionary/Insert/BYTE_ARRAY/N=10000     24.1M ± 2%   24.8M ± 0%  +2.79%  (p=0.000 n=10+10)
Dictionary/Insert/BYTE_ARRAY/N=100000    17.9M ± 4%   18.0M ± 2%    ~     (p=0.579 n=10+10)
Dictionary/Insert/BYTE_ARRAY/N=1000000   9.56M ± 2%   9.51M ± 2%    ~     (p=0.436 n=10+10)

achille-roussel commented 2 years ago

Thanks for the review!