polarsignals / frostdb

❄️ Coolest database around 🧊 Embeddable column database written in Go.
Apache License 2.0

pqarrow: centralize ReadValues when writing a page #917

Closed asubiotto closed 1 month ago

asubiotto commented 1 month ago

Currently, when converting a parquet page to an arrow record, every writer repeats the same slow path: allocating a parquet.Values slice, reading all values into it, and writing them to its underlying builder. However, this code already exists one level above, where it is more efficient because it reuses a parquet.Values slice.

This commit removes the repetition from the writers and leaves only the concrete implementation of writing already-read values to an arrow builder. Callers can also check whether the ValueWriter implements the PageWriter interface, which can offer a fast path for writing a parquet page directly.
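The pattern described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual pqarrow code: `Value`, `Page`, `sliceWriter`, `fastWriter`, and `writePage` are hypothetical stand-ins, and the method signatures are assumptions rather than the real parquet-go or frostdb APIs. Only the interface names `ValueWriter` and `PageWriter` come from the description.

```go
package main

import "fmt"

// Value and Page are hypothetical stand-ins for parquet values and pages.
type Value int64

type Page struct{ values []Value }

// ReadValues copies values starting at off into buf and returns the count copied.
func (p *Page) ReadValues(off int, buf []Value) int {
	return copy(buf, p.values[off:])
}

// ValueWriter writes values that have already been read (assumed signature).
type ValueWriter interface {
	Write(vs []Value)
}

// PageWriter is an optional fast path that consumes a whole page (assumed signature).
type PageWriter interface {
	WritePage(p *Page) error
}

// sliceWriter implements only ValueWriter.
type sliceWriter struct{ out []Value }

func (w *sliceWriter) Write(vs []Value) { w.out = append(w.out, vs...) }

// fastWriter implements both ValueWriter and PageWriter.
type fastWriter struct {
	sliceWriter
	pages int // number of pages written through the fast path
}

func (w *fastWriter) WritePage(p *Page) error {
	w.out = append(w.out, p.values...)
	w.pages++
	return nil
}

// writePage is the single, centralized place that reads values.
// It prefers the PageWriter fast path and otherwise reuses buf for every read.
func writePage(w ValueWriter, p *Page, buf []Value) error {
	if pw, ok := w.(PageWriter); ok {
		return pw.WritePage(p)
	}
	for off := 0; off < len(p.values); {
		n := p.ReadValues(off, buf)
		if n == 0 {
			return fmt.Errorf("read no values at offset %d", off)
		}
		w.Write(buf[:n])
		off += n
	}
	return nil
}

func main() {
	page := &Page{values: []Value{1, 2, 3, 4, 5}}
	buf := make([]Value, 2) // reused across reads instead of allocated per writer

	slow, fast := &sliceWriter{}, &fastWriter{}
	if err := writePage(slow, page, buf); err != nil {
		panic(err)
	}
	if err := writePage(fast, page, buf); err != nil {
		panic(err)
	}
	fmt.Println(len(slow.out), len(fast.out), fast.pages)
}
```

The point of the sketch is that writers no longer own the read loop: they either accept a slice of values or opt in to handling a full page themselves.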

The improvement is especially noticeable in Query/Values, since the slow path previously fell back to writing all the page values rather than just the dictionary values.

                │  benchmain   │               benchpw                │
                │    sec/op    │    sec/op     vs base                │
Query/Types-12     109.9m ± 1%    109.7m ± 2%        ~ (p=0.353 n=10)
Query/Labels-12    219.5µ ± 1%    214.1µ ± 2%   -2.46% (p=0.011 n=10)
Query/Values-12   7716.3µ ± 3%    207.7µ ± 4%  -97.31% (p=0.000 n=10)
Query/Merge-12     223.1m ± 2%    220.6m ± 1%   -1.08% (p=0.035 n=10)
Query/Range-12     117.5m ± 1%    116.1m ± 2%        ~ (p=0.218 n=10)
Query/Filter-12    9.888m ± 3%   10.025m ± 4%        ~ (p=0.684 n=10)
geomean            19.08m         10.38m       -45.58%

                │   benchmain    │               benchpw                │
                │      B/op      │     B/op      vs base                │
Query/Types-12      254.3Mi ± 1%   252.1Mi ± 3%        ~ (p=0.353 n=10)
Query/Labels-12     400.6Ki ± 0%   400.7Ki ± 0%        ~ (p=0.796 n=10)
Query/Values-12   12644.7Ki ± 0%   853.5Ki ± 0%  -93.25% (p=0.000 n=10)
Query/Merge-12      574.7Mi ± 1%   576.6Mi ± 1%        ~ (p=0.247 n=10)
Query/Range-12      212.0Mi ± 0%   212.0Mi ± 0%        ~ (p=0.190 n=10)
Query/Filter-12     13.52Mi ± 0%   13.52Mi ± 0%        ~ (p=0.739 n=10)
geomean             35.56Mi        22.67Mi       -36.25%

                │  benchmain  │               benchpw               │
                │  allocs/op  │  allocs/op   vs base                │
Query/Types-12    64.32k ± 6%   64.30k ± 4%        ~ (p=0.631 n=10)
Query/Labels-12   1.802k ± 0%   1.802k ± 0%        ~ (p=0.840 n=10)
Query/Values-12   3.677k ± 0%   2.192k ± 0%  -40.37% (p=0.000 n=10)
Query/Merge-12    1.435M ± 0%   1.435M ± 0%        ~ (p=0.424 n=10)
Query/Range-12    174.3k ± 0%   174.2k ± 0%   -0.00% (p=0.044 n=10)
Query/Filter-12   4.255k ± 0%   4.255k ± 0%        ~ (p=0.487 n=10)
geomean           27.72k        25.43k        -8.25%

                │  benchmain   │            benchpw             │
                │    B/msec    │    B/msec     vs base          │
Query/Filter-12   3.238Mi ± 3%   3.194Mi ± 4%  ~ (p=0.724 n=10)
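As a rough sanity check on the `vs base` column, the Query/Values deltas can be recomputed from the displayed medians. This is a small sketch with a hypothetical helper, `percentChange`; benchstat works from unrounded medians, so figures recomputed from the rounded table values may differ slightly in the last digit.

```go
package main

import "fmt"

// percentChange returns the relative change from base to next, in percent.
// Hypothetical helper for illustration; not part of benchstat.
func percentChange(base, next float64) float64 {
	return (next - base) / base * 100
}

func main() {
	// Query/Values sec/op medians, in microseconds, as shown in the table.
	fmt.Printf("sec/op: %.2f%%\n", percentChange(7716.3, 207.7))
	// Query/Values B/op medians, in KiB, as shown in the table.
	fmt.Printf("B/op:   %.2f%%\n", percentChange(12644.7, 853.5))
}
```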