rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Add Parquet transcoding to `cpp/examples` #15344

Closed GregoryKimball closed 3 months ago

GregoryKimball commented 5 months ago

Is your feature request related to a problem? Please describe.

Recently we added a libcudf example for processing nested data types. The deduplication example uses a command-line interface to receive a filename, performs some relational algebra, and outputs basic timing data to the console.

Let's add an example that performs Parquet file transcoding. The example can read a Parquet file that contains a single column, write it using a specified encoding and compression, and then read the file again. Finally, the example can confirm that the data is the same between the first and second reads.

Describe the solution you'd like

Here is a snippet showing how the example might be called: `./parquet_io ~/in.pq ~/out.pq DELTA_BYTE_ARRAY ZSTD`, where the parameters represent input_filepath, output_filepath, column_encoding and compression_type. Valid encodings include DICTIONARY, PLAIN, DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY, plus (soon to be) BYTE_STREAM_SPLIT.
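To make the proposed command line concrete, here is a minimal sketch of how the four positional arguments might be parsed. It uses only the C++17 standard library. The names `encoding_kind`, `cli_args`, `parse_encoding` and `parse_args` are hypothetical and exist only for illustration; a real example would map the string onto libcudf's own encoding and compression types instead.

```cpp
#include <optional>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical enum for illustration only; libcudf defines its own encoding type.
enum class encoding_kind {
  DICTIONARY,
  PLAIN,
  DELTA_BINARY_PACKED,
  DELTA_LENGTH_BYTE_ARRAY,
  DELTA_BYTE_ARRAY,
  BYTE_STREAM_SPLIT  // listed in the request as "soon to be" supported
};

// Map an encoding name given on the command line to the enum.
// Returns std::nullopt when the name is not one of the listed encodings.
std::optional<encoding_kind> parse_encoding(std::string_view name)
{
  static const std::unordered_map<std::string_view, encoding_kind> table = {
    {"DICTIONARY", encoding_kind::DICTIONARY},
    {"PLAIN", encoding_kind::PLAIN},
    {"DELTA_BINARY_PACKED", encoding_kind::DELTA_BINARY_PACKED},
    {"DELTA_LENGTH_BYTE_ARRAY", encoding_kind::DELTA_LENGTH_BYTE_ARRAY},
    {"DELTA_BYTE_ARRAY", encoding_kind::DELTA_BYTE_ARRAY},
    {"BYTE_STREAM_SPLIT", encoding_kind::BYTE_STREAM_SPLIT}};
  auto const it = table.find(name);
  if (it == table.end()) { return std::nullopt; }
  return it->second;
}

// Hypothetical holder for the four positional parameters named in the request.
struct cli_args {
  std::string input_filepath;
  std::string output_filepath;
  encoding_kind column_encoding;
  std::string compression_type;  // kept as text here, e.g. "ZSTD"
};

// Expects: program input_filepath output_filepath column_encoding compression_type.
// Returns std::nullopt on a wrong argument count or an unknown encoding name.
std::optional<cli_args> parse_args(int argc, char const* const* argv)
{
  if (argc != 5) { return std::nullopt; }
  auto const encoding = parse_encoding(argv[3]);
  if (!encoding.has_value()) { return std::nullopt; }
  return cli_args{argv[1], argv[2], *encoding, argv[4]};
}
```

This sketch only checks that the encoding name is recognized. It deliberately does not check whether the encoding suits the column's data type, and it leaves the compression name as an unvalidated string.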

The example will print to console the time elapsed for (1) the initial read, (2) the write, and (3) the second read.
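One way to report the three timings is a small wall-clock helper wrapped around each stage. The sketch below uses only `std::chrono` from the standard library; `time_stage` is a hypothetical name, and the sketch assumes that each stage blocks until its work has finished (if a stage returned before its GPU work completed, the measured time would be too short).

```cpp
#include <chrono>
#include <iostream>
#include <string>
#include <utility>

// Run `stage`, print its wall-clock duration in milliseconds under `label`,
// and return that duration. Assumes `stage` returns only when its work is done.
template <typename F>
double time_stage(std::string const& label, F&& stage)
{
  auto const start = std::chrono::steady_clock::now();
  std::forward<F>(stage)();
  auto const stop = std::chrono::steady_clock::now();
  std::chrono::duration<double, std::milli> const elapsed = stop - start;
  std::cout << label << ": " << elapsed.count() << " ms\n";
  return elapsed.count();
}
```

The example could then call `time_stage("initial read", ...)`, `time_stage("write", ...)` and `time_stage("second read", ...)`, passing a lambda that performs each step.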

The example does not need to verify whether the requested encoding is valid for the column's data type (e.g. it need not reject a string column with DELTA_BINARY_PACKED or an int64 column with DELTA_BYTE_ARRAY). Let's only operate on the first column to keep things simple.

Describe alternatives you've considered

We could use a cuDF-python example instead, but I'd rather have a C++ example for each feature that we write a blog about.

etseidl commented 5 months ago

Currently, if an invalid encoding is requested, the Parquet writer will print a warning and fall back to the default (dictionary in most cases).