pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Support reading partitioned parquet datasets using `_metadata` files #7707

Open patrikkj opened 1 year ago

patrikkj commented 1 year ago

Problem description

It would be great to support parsing _metadata and _common_metadata files, as implemented in Spark/PyArrow/Dask 🌟

As described in the PyArrow docs:

Those files include information about the schema of the full dataset (for _common_metadata) and potentially all row group metadata of all files in the partitioned dataset as well (for _metadata). The actual files are metadata-only Parquet files. Note this is not a Parquet standard, but a convention set in practice by those frameworks.

Using those files can give a more efficient creation of a parquet Dataset, since it can use the stored schema and file paths of all row groups, instead of inferring the schema and crawling the directories for all Parquet files (this is especially the case for filesystems where accessing files is expensive).

Despite not being part of the official Parquet standard, support for this convention is essential for interoperability with systems where it is the agreed-upon storage layout. It offers a clear separation between data and metadata and can yield significant performance improvements on remote filesystems.

Link to implementations:

Fully working example using pyarrow:

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Create some example data
table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
                  'animal': ["Flamingo", "Parrot", "Dog", "Horse",
                             "Brittle stars", "Centipede"]})

# Write dataset
metadata_collector = []
pq.write_to_dataset(table=table, 
                    root_path="./dummy", 
                    metadata_collector=metadata_collector)

# Write metadata
pq.write_metadata(schema=table.schema, 
                  where="./dummy/_metadata", 
                  metadata_collector=metadata_collector)

# Results in the following file tree:
# .
# └── dummy
#     ├── 4895d51acbc3460bb16f1d058ec89e20-0.parquet
#     └── _metadata

# Read data from metadata file
dataset = ds.parquet_dataset("./dummy/_metadata")

# >>> print(dataset.to_table())
# pyarrow.Table
# n_legs: int64
# animal: string
# ----
# n_legs: [[2,2,4,4,5,100]]
# animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]
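
For context, the _metadata file itself can be inspected with pyarrow: it is a regular Parquet footer carrying the dataset schema plus the row-group metadata (file paths, statistics, byte ranges) for every file in the dataset, which is what lets a reader plan a scan without listing the directory. A minimal sketch, continuing from the example above:

# Inspect the metadata-only file written above
meta = pq.read_metadata("./dummy/_metadata")
print(meta.num_row_groups)            # one entry per row group across all files
print(meta.row_group(0).column(0))    # column chunk metadata: file path, statistics, ...
print(meta.schema.to_arrow_schema())  # the full dataset schema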

Relevant docs:

deanm0000 commented 1 year ago

Is this meant to nudge the development of native polars datasets, and for those to include metadata files? I ask because polars doesn't do datasets anyway.

patrikkj commented 1 year ago

Yup! There is a non-native workaround, which is to wrap it as a pyarrow dataset and use pl.scan_pyarrow_dataset(...).
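
For reference, a minimal sketch of that workaround, reusing the ./dummy/_metadata file from the example above:

import polars as pl
import pyarrow.dataset as ds

# pyarrow plans the scan from the _metadata file; polars then consumes it
# through the pyarrow scanner rather than its native parquet reader.
dataset = ds.parquet_dataset("./dummy/_metadata")
lf = pl.scan_pyarrow_dataset(dataset)
print(lf.collect())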

Is there active development towards native polars datasets? By native I mean something along the lines of pl.scan_dataset(...) backed by the native parquet reader; I could not find any related PRs or branches. The closest would probably be support for partition-aware parquet scanning, #426 and #4347.

I think the problem description of #4347 perfectly addresses the issues with using pyarrow -> polars to scan partitioned datasets 😬
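
For comparison, the closest native route today is glob-based scanning, which still has to list and open the files itself and cannot take advantage of a _metadata file; a minimal sketch against the ./dummy directory from the example above:

import polars as pl

# Native reader via glob: polars discovers the files by listing the directory.
# For a nested hive-partitioned layout the pattern would be "./dummy/**/*.parquet".
lf = pl.scan_parquet("./dummy/*.parquet")
print(lf.collect())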