rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Add Avro reader benchmarks to the cuIO benchmarking suite #14574

Open GregoryKimball opened 7 months ago

GregoryKimball commented 7 months ago

Is your feature request related to a problem? Please describe.

We have reader benchmarks for CSV, JSON, Parquet and ORC in the cuIO nvbench benchmarking suite. We should add benchmarking for the Avro reader.

The cuIO benchmarks are located here: https://github.com/rapidsai/cudf/tree/branch-24.02/cpp/benchmarks/io

Unfortunately, we don't have an Avro writer implementation in libcudf, so the naive approach of modeling benchmarks after json_reader_input.cpp will not work.

Describe the solution you'd like

Our options would be:

Describe alternatives you've considered

Continue without automated benchmarks for the Avro reader.

Additional context

The libcudf Avro reader does not support nested types, so the benchmarks should start by covering only primitive types.
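For illustration only, and not necessarily one of the options considered in this issue: because libcudf has no Avro writer, benchmark input has to come from somewhere else. The sketch below hand-encodes a flat record of primitive types into an uncompressed Avro object container file using only the Python standard library, following the Avro specification's binary encoding. The function names (`encode_long`, `encode_string`, `write_avro`), the record name `bench_row`, and the choice of column types are all hypothetical. Whether the libcudf reader accepts each of these types would still need to be checked.

```python
import json
import os
import struct


def encode_long(n: int) -> bytes:
    """Avro int/long: zigzag-encode, then emit a base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps signed 64-bit values to unsigned
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Avro string: byte length as a long, followed by the UTF-8 bytes."""
    raw = s.encode("utf-8")
    return encode_long(len(raw)) + raw


# Hypothetical subset of primitive types, chosen for this sketch.
ENCODERS = {
    "long": encode_long,
    "double": lambda x: struct.pack("<d", x),  # 8 bytes, little-endian
    "boolean": lambda b: b"\x01" if b else b"\x00",
    "string": encode_string,
}


def write_avro(fields, rows, sync=None) -> bytes:
    """Serialize rows into a single-block, uncompressed Avro container.

    fields: list of (name, avro_type) pairs; rows: list of tuples.
    """
    schema = {
        "type": "record",
        "name": "bench_row",  # hypothetical record name
        "fields": [{"name": n, "type": t} for n, t in fields],
    }
    sync = sync or os.urandom(16)  # 16-byte sync marker
    meta = {
        "avro.schema": json.dumps(schema).encode("utf-8"),
        "avro.codec": b"null",  # no compression
    }

    out = bytearray(b"Obj\x01")  # container magic
    out += encode_long(len(meta))  # metadata map: one block of entries
    for key, value in meta.items():
        out += encode_string(key) + encode_long(len(value)) + value
    out += encode_long(0)  # end of metadata map
    out += sync

    body = bytearray()
    for row in rows:
        for (_, avro_type), value in zip(fields, row):
            body += ENCODERS[avro_type](value)

    # Data block: object count, byte size, payload, sync marker.
    out += encode_long(len(rows)) + encode_long(len(body)) + body + sync
    return bytes(out)
```

The returned bytes can be written to a file and handed to any Avro reader. A benchmark inside the C++ suite would need an equivalent generator in C++ or pre-generated input files; this sketch only shows how small the writer side is when restricted to flat primitive columns.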

GregoryKimball commented 5 months ago

Update: we received a request to prioritize Avro support because it is the format used in the OSCAR dataset (https://oscar-project.github.io/documentation/versions/oscar-2301/). If additional NLP datasets and LLM applications turn out to need Avro, we may choose to prioritize Avro development.