GregoryKimball opened 7 months ago
I have some concerns about the run-end encoding. On the one hand, it is compatible with what pyarrow is doing, so that's good. And run-end data is easier to use than run-length data, because the run ends form a sorted prefix sum that you can binary-search for random access. So there's upside there, too.
The downside of run-end encoding for Parquet files is that it doesn't compress as well as run-length encoding, because lengths need fewer bits than ends: the largest end equals the total row count, while the largest length is only the longest run. So it makes sense that the Apache Parquet developers selected run-length rather than run-end encoding.
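To make the trade-off concrete, here is a small illustration using made-up runs (the names and values are hypothetical):

```python
# Hypothetical runs expressed both ways: lengths (Parquet RLE) and ends (Arrow REE).
from bisect import bisect_right
from itertools import accumulate

values = ["a", "b", "c"]
run_lengths = [5, 3, 7]                    # Parquet-style: one small integer per run
run_ends = list(accumulate(run_lengths))   # Arrow-style: [5, 8, 15], a sorted prefix sum

# Random access with run ends is a binary search, because the ends are sorted.
def value_at(row):
    return values[bisect_right(run_ends, row)]

assert value_at(0) == "a" and value_at(7) == "b" and value_at(14) == "c"

# With lengths alone, finding row 14 requires a linear scan. But ends need
# more bits: the largest end is the row count (15), the largest length only 7.
```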
So the question is: are we still getting a good performance improvement with this plan? I worry that we won't, because the transcoding will be slow; it is dominated by memory bandwidth.
Currently, without any RLE support, a normal database query runs like this:
In the image below, we can see step 2, where we decode the RLE columns (three of them in this case); it takes 43.2 ms. Then we perform a group-by operation, which takes 4.4 ms.
With RLE processing support, it can run like this:
The decompression step is the same in both cases, and the RLE decode takes roughly the same amount of time, only slightly faster at 40.9 ms. The processing, however, gets much faster: the group-by operation drops from 4.4 ms to 1.48 ms.
With RLE transcoding to REE, I think it will look like this:
I think the transcoding from RLE to REE will take about as long as decoding RLE data to plain data would, probably a little faster because there is less memory to write. The subsequent processing of the REE data, because it is no longer dictionary encoded, will be slower than processing the RLE data but still faster than processing fully decoded data. And finally, the decoding of the REE data will be dominated by memory bandwidth.
In my estimation, the extra round trip through memory will make the whole thing slower overall. I already have code that does all of these steps, so I can measure a prototype and see what we get.
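For reference, here is a minimal CPU-side sketch of the transcode step itself, assuming the run values and run lengths have already been unpacked from the Parquet page (the variable names are made up). On the GPU the prefix sum would be a single scan kernel, which is why this step is bandwidth-bound:

```python
# RLE -> REE transcode sketch: converting run lengths into run ends is a
# prefix sum; the ends and values can then be wrapped as an Arrow REE array.
import numpy as np
import pyarrow as pa

run_values = pa.array([10, 20, 10], type=pa.int32())   # hypothetical unpacked runs
run_lengths = np.array([4, 2, 3], dtype=np.int64)

run_ends = pa.array(np.cumsum(run_lengths), type=pa.int64())
ree = pa.RunEndEncodedArray.from_arrays(run_ends, run_values)

assert ree.type == pa.run_end_encoded(pa.int64(), pa.int32())
assert len(ree) == 9  # logical row count = last run end
```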
**Is your feature request related to a problem? Please describe.**

Using a Parquet reader option, we could allow the user to specify columns that they would like to receive as dictionary-encoded columns in the output table. For the specified columns, the Parquet reader would transcode multiple Parquet dictionary-encoded column chunks into a single Arrow dictionary-encoded column.
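For context, pyarrow's CPU Parquet reader already exposes a per-column option of this shape; a brief sketch (the file and column names here are made up):

```python
# pyarrow can return selected Parquet columns dictionary-encoded instead of
# materialized; the proposed libcudf option would behave analogously.
import pyarrow as pa
import pyarrow.parquet as pq

table = pq.read_table("example.parquet", read_dictionary=["city"])
assert pa.types.is_dictionary(table.schema.field("city").type)
```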
**Describe the solution you'd like**

**Part 1 - Confirm correct and efficient dictionary processing in libcudf**

- Expand test coverage of `encode` and `decode`, with axes including data type, cardinality and row count. Add checks that data is correctly round-tripped through dictionary encoding and decoding (see the sketch below).
- Expand benchmark coverage beyond the `int32` and `float` value types. Benchmarks should include the strings data type and axes for varying cardinality and row count.
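As a shape for the round-trip checks, here is an illustrative sketch written against pyarrow's dictionary kernels for brevity; the libcudf gtests would exercise `dictionary::encode`/`decode` the same way, and the cardinality and row-count axes below are examples only:

```python
# Round-trip check: dictionary encode then decode must reproduce the input.
import pyarrow as pa
import pyarrow.compute as pc

for cardinality in (1, 10, 1_000):
    for row_count in (100, 10_000):
        data = pa.array([f"key{i % cardinality}" for i in range(row_count)])
        encoded = pc.dictionary_encode(data)
        decoded = encoded.dictionary_decode()
        assert decoded.equals(data)
```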
**Part 2 - Parquet-to-Arrow dictionary transcoding**

Add a Parquet reader option that lets the user request dictionary-encoded output columns, with the reader transcoding Parquet dictionary-encoded column chunks directly into Arrow dictionary-encoded columns, as described above.

**Describe alternatives you've considered**

Use `dictionary::encode` to encode target columns immediately after materialization by the Parquet reader. This approach would realize the downstream benefits of dictionary encoding, at the cost of additional work in Parquet decode and dictionary encode. We would benefit from sample queries and profiles that compare materialized-column versus dictionary-column processing in libcudf workflows. Such profiles could be used to estimate the performance improvement from adding Parquet-to-Arrow dictionary transcoding to the libcudf Parquet reader.
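A CPU-side sketch of this alternative, using pyarrow for illustration (in libcudf it would be `dictionary::encode` applied to the materialized column; the file and column names are hypothetical):

```python
# "Encode after materialization": the column is fully decoded by the reader,
# then dictionary-encoded in an extra pass; this is the cost Part 2 would avoid.
import pyarrow.compute as pc
import pyarrow.parquet as pq

table = pq.read_table("example.parquet")       # column fully materialized
encoded = pc.dictionary_encode(table["city"])  # extra pass over the data
```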
**Part 3 - Introduce run-end encoded type in libcudf, and then add Parquet-to-Arrow run-length/run-end transcoding**

The Parquet format supports a run-length encoding / bit-packing hybrid, and this could be transcoded into a run-end encoded Arrow type. To begin this project, we need to add run-end encoding as a new type in libcudf, introduce decode and encode functions, confirm correctness across libcudf APIs, and audit for performance hotspots. A run-end encoded type in libcudf would allow us to support "constant" or "scalar" columns as requested in #15308. If libcudf supported a run-end encoded type, transcoding into this type from Parquet run-length encoded data would not be a zero-copy operation: it would require converting the Parquet bit-packed "lengths" to Arrow fixed-width "ends".
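For reference, the Arrow run-end encoded type that a libcudf type would mirror is already exercisable through pyarrow's compute kernels (a sketch; note that the ends are fixed-width integers, unlike Parquet's bit-packed lengths):

```python
# Arrow REE round trip: run_end_encode produces run ends and values,
# run_end_decode materializes the plain array back.
import pyarrow as pa
import pyarrow.compute as pc

plain = pa.array([7, 7, 7, 7, 0, 0, 7])
ree = pc.run_end_encode(plain, run_end_type=pa.int32())
print(ree.type)  # run_end_encoded<run_ends: int32, values: int64>
assert pc.run_end_decode(ree).equals(plain)
```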