Open GregoryKimball opened 7 months ago
Update: we received a request to prioritize Avro support because it is the format used in the OSCAR dataset (https://oscar-project.github.io/documentation/versions/oscar-2301/). If additional NLP datasets and LLM applications need Avro, we may prioritize Avro development.
Is your feature request related to a problem? Please describe.
We have reader benchmarks for CSV, JSON, Parquet and ORC in the cuIO nvbench benchmarking suite. We should add benchmarking for the Avro reader.
The cuIO benchmarks are located here: https://github.com/rapidsai/cudf/tree/branch-24.02/cpp/benchmarks/io
Unfortunately, we don't have an Avro writer implementation in libcudf, so the naive approach of modeling benchmarks after json_reader_input.cpp will not work, since that approach relies on a libcudf writer to produce the benchmark input.
Describe the solution you'd like
Our options would be:
Describe alternatives you've considered
Continue without automated benchmarks for the Avro reader.
Additional context
The libcudf Avro reader does not support nested types, so the benchmarks should start by covering only primitive types.
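
For illustration only, here is a minimal sketch of how primitive-only Avro input could be produced without a libcudf writer. It hand-encodes an uncompressed Avro object container file using the Python standard library. Everything in it is an assumption made for the example rather than part of cuIO or a proposed design: the record schema (`id`, `value`, `name`), the helper names, and the single-block layout. It has not been tested against the libcudf Avro reader.

```python
import json
import os
import struct

MAGIC = b"Obj\x01"  # Avro object container file magic bytes

# Hypothetical primitive-only schema, chosen for this example.
SCHEMA = {
    "type": "record",
    "name": "BenchRow",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "value", "type": "double"},
        {"name": "name", "type": "string"},
    ],
}


def encode_long(n: int) -> bytes:
    """Encode an Avro int/long: zigzag, then a variable-length base-128 integer."""
    z = (n << 1) ^ (n >> 63)  # zigzag mapping for 64-bit signed values
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_bytes(b: bytes) -> bytes:
    """Encode Avro bytes: a long length followed by the raw bytes."""
    return encode_long(len(b)) + b


def encode_row(row) -> bytes:
    """Encode one (id, value, name) record; fields are concatenated in schema order."""
    id_, value, name = row
    return (
        encode_long(id_)
        + struct.pack("<d", value)  # Avro double: 8 bytes, little-endian
        + encode_bytes(name.encode("utf-8"))  # Avro string: length + UTF-8
    )


def build_avro_file(rows, sync=None) -> bytes:
    """Build a container file holding all rows in a single uncompressed block."""
    sync = sync if sync is not None else os.urandom(16)  # 16-byte sync marker
    meta = {
        "avro.schema": json.dumps(SCHEMA).encode("utf-8"),
        "avro.codec": b"null",  # no compression
    }
    header = bytearray(MAGIC)
    header += encode_long(len(meta))  # one map block with len(meta) entries
    for key, val in meta.items():
        header += encode_bytes(key.encode("utf-8")) + encode_bytes(val)
    header += encode_long(0)  # end of the metadata map
    header += sync

    body = b"".join(encode_row(r) for r in rows)
    block = encode_long(len(rows)) + encode_long(len(body)) + body + sync
    return bytes(header) + block
```

Calling `build_avro_file([(i, i * 0.5, f"row{i}") for i in range(1000)])` returns the file contents as bytes, which could be written to disk or kept in a host buffer. Using only uncompressed data with three primitive column types keeps the sketch short; compression codecs and nullable unions are left out.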