rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Add Parquet-to-Arrow dictionary transcoding to the parquet reader #15199

Open GregoryKimball opened 7 months ago

GregoryKimball commented 7 months ago

Is your feature request related to a problem? Please describe.

Using a Parquet reader option, we could allow the user to specify columns that they would like to receive as dictionary-encoded in the output table. For the specified columns, the Parquet reader would transcode multiple Parquet dictionary-encoded column chunks into an Arrow dictionary-encoded column.

Describe the solution you'd like

Part 1 - Confirm correct and efficient dictionary processing in libcudf

  1. Add benchmarks for dictionary encode and decode with axes for data type, cardinality and row count. Add checks that data is correctly round-tripped through dictionary encoding and decoding (see the sketch after this list).
  2. Expand unit testing of dictionary types used for reductions, join keys, aggregation keys, aggregation values and other operations. Include string and numeric types as dictionary values. Please note that although libcudf may be able to represent dictionaries of lists (this needs to be checked), in Parquet only leaf values can be dictionary-encoded.
  3. Expand benchmarks for dictionary operations. As of 24.04 we only have dictionary reduction benchmarks on int32 and float value types. Benchmarks should include the strings data type and axes for varying cardinality and row count.
  4. Consider a signed integer index type. Revisit which integer types can be used as indices, and revisit compatibility differences between libcudf dictionaries and Arrow dictionaries.
  5. Consider dropping the sorted-key requirement for improved Python compatibility. Today the indices follow the natural order of the keys, and we could add a mapping layer so that the indices are no longer constrained.
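
A minimal sketch of the round-trip check from item 1, assuming libcudf's public `cudf::dictionary::encode`/`decode` APIs and the `CUDF_TEST_EXPECT_COLUMNS_EQUAL` test utility; the benchmark axes (data type, cardinality, row count) would be layered around this core:

```cpp
#include <cudf/column/column_view.hpp>
#include <cudf/dictionary/dictionary_column_view.hpp>
#include <cudf/dictionary/encode.hpp>

#include <cudf_test/column_utilities.hpp>

// Round-trip a column through dictionary encoding and decoding and verify
// that the result matches the original input.
void check_dictionary_round_trip(cudf::column_view const& input)
{
  // Encode the input into a dictionary column (keys + indices).
  auto encoded = cudf::dictionary::encode(input);

  // Decode back to a plain column of the original type.
  auto decoded =
    cudf::dictionary::decode(cudf::dictionary_column_view{encoded->view()});

  // The decoded column must be element-wise equal to the input.
  CUDF_TEST_EXPECT_COLUMNS_EQUAL(input, decoded->view());
}
```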

Part 2 - Parquet-to-Arrow dictionary transcoding

  1. Estimate the performance of transcoding Parquet dictionary-encoded column chunks into Arrow dictionary-encoded columns. Each Parquet dictionary-encoded column chunk begins with a dictionary page. To create an Arrow-compliant dictionary column, we need to merge the values from the dictionary page of each column chunk into a single set of values for the Arrow dictionary-encoded column. Then, to generate the indices data, we need to re-map the indices from each column chunk to positions in the combined values (see the sketch after this list).
  2. Please note that PLAIN_DICTIONARY encoding is deprecated as of Parquet 2.0. To support the new default, RLE_DICTIONARY, we will need to add a conversion step from Parquet bit-packed indices to Arrow fixed-width indices.
  3. The Parquet format allows different encodings for each column chunk within a column. In the case of dictionaries, the Parquet specification describes cases where PLAIN encoding will be mixed with DICTIONARY encoding: "If the dictionary grows too big, whether in size or number of distinct values, the encoding will fall back to the plain encoding". To support this case we would need to add special handling.
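
For illustration, here is a host-side sketch of the merge and re-map from item 1. The types and function names are hypothetical, not an existing libcudf API; real code would run on the GPU and also handle nulls, the bit-unpacking from item 2, and the PLAIN fallback from item 3.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical per-chunk input: the values from the chunk's dictionary page
// plus its already bit-unpacked indices.
struct chunk {
  std::vector<std::string> dictionary_page;
  std::vector<uint32_t> indices;
};

// Hypothetical Arrow-style output: one merged key set plus per-row indices.
struct arrow_dictionary_column {
  std::vector<std::string> keys;
  std::vector<uint32_t> indices;
};

arrow_dictionary_column transcode(std::vector<chunk> const& chunks)
{
  arrow_dictionary_column out;

  // 1. Merge every chunk's dictionary page into a single de-duplicated key set.
  for (auto const& c : chunks)
    out.keys.insert(out.keys.end(), c.dictionary_page.begin(), c.dictionary_page.end());
  std::sort(out.keys.begin(), out.keys.end());
  out.keys.erase(std::unique(out.keys.begin(), out.keys.end()), out.keys.end());

  // 2. Re-map each chunk's local indices to positions in the merged key set.
  std::unordered_map<std::string, uint32_t> position;
  for (uint32_t i = 0; i < out.keys.size(); ++i) position[out.keys[i]] = i;
  for (auto const& c : chunks)
    for (auto local : c.indices)
      out.indices.push_back(position.at(c.dictionary_page[local]));

  return out;
}
```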

Describe alternatives you've considered

Use dictionary::encode to encode target columns immediately after materialization by the Parquet reader. This approach would realize the downstream benefits of dictionary encoding, at the cost of additional work in Parquet decode and dictionary encode. We would benefit from sample queries and profiles that compare materialized-column versus dictionary-column processing in libcudf workflows. Such profiles could be used to estimate the performance improvement from adding Parquet-to-Arrow dictionary transcoding to the libcudf Parquet reader.
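
For reference, this alternative can be expressed with today's APIs; a sketch, assuming the target column happens to be column 0 of the materialized table:

```cpp
#include <cudf/dictionary/encode.hpp>
#include <cudf/io/parquet.hpp>

#include <memory>
#include <string>

std::unique_ptr<cudf::column> read_and_encode(std::string const& path)
{
  // Fully materialize the table with the Parquet reader.
  auto options =
    cudf::io::parquet_reader_options::builder(cudf::io::source_info{path}).build();
  auto result = cudf::io::read_parquet(options);

  // Pay an extra encode pass on the target column (index 0 is a placeholder).
  return cudf::dictionary::encode(result.tbl->view().column(0));
}
```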

Part 3 - Introduce run-end encoded type in libcudf, and then add Parquet-to-Arrow run-length/run-end transcoding

The Parquet format supports a run-length encoding / bit-packing hybrid and this could be transcoded into a run-end encoded Arrow type. To begin this project, we need to add run-end encoding as a new type to libcudf, introduce decode and encode functions, confirm correctness across libcudf APIs and audit for performance hotspots. A run-end encoded type in libcudf would allow us to support "constant" or "scalar" columns as requested in #15308. If libcudf supported a run-end encoded type, transcoding into this type from Parquet run-length encoded data would not be a zero-copy operation and would require converting the Parquet bit-packed "lengths" to Arrow fixed-width "ends".
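
A minimal sketch of that lengths-to-ends conversion (hypothetical host code, not an existing libcudf API): the Arrow run ends are the inclusive prefix sum of the Parquet run lengths, so on the GPU this would amount to a scan after bit-unpacking the lengths.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Parquet RLE stores (value, run length) pairs; Arrow REE stores run ends.
// The ends are simply the inclusive prefix sum of the lengths, e.g.
// lengths {3, 2, 4} -> ends {3, 5, 9}.
std::vector<int32_t> run_lengths_to_run_ends(std::vector<int32_t> const& lengths)
{
  std::vector<int32_t> ends(lengths.size());
  std::partial_sum(lengths.begin(), lengths.end(), ends.begin());
  return ends;
}
```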

esoha-nvidia commented 4 months ago

I have worries about the run-end encoding. On the one hand, it is compatible with what pyArrow is doing. So that's good. And run-end is easier to use than run-length because you can search through it more quickly. So there's upside there, too.

The downside with run-end encoding, for Parquet files, is that it doesn't compress as well as run-length encoding, because run ends grow with the row count and therefore need more bits than run lengths. So it makes sense that the Apache developers of the Parquet format selected run-length and not run-end.

So the question is: are we still getting a good performance improvement with this plan? I worry that we won't, because the transcoding will be slow, dominated by memory bandwidth.


Currently, without any RLE support, a normal database query runs like this (a rough libcudf sketch follows the list):

  1. Decompress parquet file. (gzip/zstd/etc)
  2. Decode RLE columns.
  3. Process query on decoded data.
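
For concreteness, this fully-decoded path looks roughly like the following with today's libcudf APIs (the key and value column choices are placeholders):

```cpp
#include <cudf/aggregation.hpp>
#include <cudf/groupby.hpp>
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>

#include <string>
#include <vector>

void decoded_group_by(std::string const& path)
{
  // Steps 1-2: decompress the file and fully decode the RLE columns.
  auto options =
    cudf::io::parquet_reader_options::builder(cudf::io::source_info{path}).build();
  auto table = cudf::io::read_parquet(options).tbl;

  // Step 3: run the query (here a SUM group-by) on the decoded columns.
  // Column 0 as the key and column 1 as the value are arbitrary placeholders.
  cudf::groupby::groupby gb(cudf::table_view({table->view().column(0)}));
  std::vector<cudf::groupby::aggregation_request> requests(1);
  requests[0].values = table->view().column(1);
  requests[0].aggregations.push_back(
    cudf::make_sum_aggregation<cudf::groupby_aggregation>());
  auto [group_keys, results] = gb.aggregate(requests);
}
```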

In the image below, we can see step 2, where we decode RLE columns (3 of them in this case). It takes 43.2ms. Then we perform a group-by operation, which takes 4.4ms.

(profiler screenshot)

With RLE processing support, it can run like this:

  1. Decompress parquet file into RLE-encoded data. (gzip/zstd/etc)
  2. Process query on RLE-encoded data.
  3. Decode RLE columns.

The decompression step is the same in both cases, and the RLE decoding takes roughly the same amount of time, only slightly faster at 40.9ms. However, the processing gets faster: the group-by operation itself has gotten much faster, down to 1.48ms.

(profiler screenshot)

With RLE transcoding to REE, I think that it will look like this:

  1. Decompress parquet file into RLE-encoded data. (gzip/zstd/etc)
  2. Transcode columns from RLE to REE.
  3. Process REE data.
  4. Decode REE data.

I think that the transcoding from RLE to REE will take a similar amount of time as decoding RLE data to regular data, probably a little faster because there is less memory to write. I think that the subsequent processing of the REE data, because it's no longer dictionary encoded, will be slower than processing RLE data but still faster than processing decoded data. And finally, the decoding of REE data will be dominated by memory bandwidth.

The extra round-trip through memory will make the whole thing slower overall, in my estimation. I already have code that does all of these steps, so I can measure a prototype and see what we get.