huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Optimize parquet output for remote reading #263

Open H-Plus-Time opened 1 month ago

H-Plus-Time commented 1 month ago

TLDR: the primary pain point here is huge row groups (in terms of total uncompressed byte size). Writing the PageIndex, reducing row group sizes, or ideally both, would help a lot.

Basically, the defaults in pyarrow (and most parquet implementations) for row group sizes (1 million rows per row group) are predicated on assumptions about what a typical parquet file looks like: lots of numerics, booleans, and relatively short strings amenable to dictionary, RLE and delta encoding. Wide text datasets are very much not typical, and the default row group size gets you ~2GB row groups on disk (and nearly 4GB uncompressed, just for the text column).
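
You can see the effect directly from the file metadata with pyarrow (a quick diagnostic sketch; the shard path here is just a placeholder):

```python
import pyarrow.parquet as pq

# Placeholder path to one of the written shards
pf = pq.ParquetFile("output/00000.parquet")
md = pf.metadata

for i in range(md.num_row_groups):
    rg = md.row_group(i)
    # total_byte_size is the uncompressed size of all column data in the row group
    print(f"row group {i}: {rg.num_rows} rows, "
          f"{rg.total_byte_size / 1e9:.2f} GB uncompressed")
    for j in range(rg.num_columns):
        col = rg.column(j)
        print(f"  {col.path_in_schema}: "
              f"{col.total_compressed_size / 1e9:.2f} GB compressed / "
              f"{col.total_uncompressed_size / 1e9:.2f} GB uncompressed")
```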

The simplest thing to do would be to default the row_group_size parameter to 100k rows - more or less the inflection point of this benchmark by DuckDB (the size overhead is about 0.5%).
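
At the pyarrow level that amounts to something like this (a sketch, not datatrove's actual writer code; the table is purely illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative table; a real shard would hold millions of long text rows
table = pa.table({"id": ["doc-0", "doc-1"], "text": ["first document", "second document"]})

# Cap row groups at 100k rows instead of pyarrow's 1M-row default
pq.write_table(table, "output.parquet", row_group_size=100_000)
```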

Setting write_page_index to True should help a great deal (arguably much more than smaller row groups), as readers can use it to refine reads down to individual data pages (it's not unusual for a point lookup to touch only ~0.1% of a file).
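
Putting the two together with pyarrow's streaming writer (again just a sketch, and it assumes a pyarrow version new enough to expose write_page_index):

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("id", pa.string()), ("text", pa.string())])

# write_page_index=True makes pyarrow emit the ColumnIndex/OffsetIndex footer
# structures, so readers that understand the PageIndex can narrow a point
# lookup to individual data pages instead of whole row groups.
with pq.ParquetWriter("output.parquet", schema, write_page_index=True) as writer:
    table = pa.table(
        {"id": ["doc-0", "doc-1"], "text": ["first document", "second document"]},
        schema=schema,
    )
    # keep the 100k-row cap from above as well
    writer.write_table(table, row_group_size=100_000)
```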

guipenedo commented 3 weeks ago

Thanks for bringing this up. Would you be willing to implement this change in a PR?

H-Plus-Time commented 2 weeks ago

Yep, should have something over the weekend