patrikkj opened 1 year ago
Is this meant to nudge the development of native polars datasets, and for those to include metadata files? I ask because polars doesn't support datasets anyway.
Yup! There is a non-native workaround, which is to wrap it as a pyarrow dataset and use `pl.scan_pyarrow_dataset(...)`.
Is there active development to implement native polars datasets?
By native I mean something along the lines of `pl.scan_dataset(...)` using the native parquet reader. I could not find any related PRs/branches; the closest would probably be support for partition-aware parquet scanning, #426 and #4347.
I think the problem description of #4347 perfectly addresses the issues with using pyarrow -> polars to scan partitioned datasets 😬
Problem description

It would be great to support parsing `_metadata` and `_custom_metadata` files, as implemented in Spark/PyArrow/Dask 🌟

As described in the PyArrow docs:

> Despite not being a part of the official parquet standard, support for this convention is essential for interop. with systems where this is the agreed upon parquet storage spec. It offers a clear separation between data and metadata and can result in significant performance improvements for remote filesystems.

Links to implementations:

- Spark: `parquet.hadoop.ParquetFileReader` | Link to GitHub / Docs (see `spark.sql.parquet.mergeSchema`)
- PyArrow: `arrow::dataset::ParquetDatasetFactory` | Link to GitHub / Docs
- Dask: `dataframe/io/parquet/core` | Link to GitHub / Docs

Fully working example using `pyarrow`:

Relevant docs: