pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Optionally return metadata when using scan_parquet #6915

Open yuuuxt opened 1 year ago

yuuuxt commented 1 year ago

Problem description

I'm used to printing the shape of a parquet file when I load it, which helps me check that it's the correct file and estimate execution time:

import pandas as pd
from pathlib import Path
import polars as pl
# in pandas
data = pd.read_parquet(Path(xx,xx,"xx.parquet"))
print(data.shape)

In polars it's not straightforward to get the number of rows, and some boilerplate code is needed. (ref: #3792)

# currently in polars
from pyarrow.parquet import ParquetFile

def get_rows(file_path):
    pf = ParquetFile(file_path)
    return pf.metadata.num_rows

file_path_data = Path(xx,xx,"xx.parquet") # separate path variable is required
data = pl.scan_parquet(file_path_data)
print(get_rows(file_path_data), data.width)

It would be nice if pl.scan_parquet could optionally return the metadata as well:

# what if:
data, metadata = pl.scan_parquet(Path(xx,xx,"xx.parquet"), return_metadata=True)
print(metadata.num_rows, data.width)

ref: #11404

deanm0000 commented 1 year ago

I think this works for finding the number of rows.

data = pl.scan_parquet(file_path_data)
data.select(pl.count()).collect()[0,0]

I mean I know it works but I believe it works by using the metadata and not reading the whole file.

I agree that it'd be nice to have a metadata utility but in the interim you can use that.

Personally, I don't like the metadata plus data returned tuple from the same function. I'd rather see read_parquet_schema get replaced with read_parquet_metadata and have that be more like pq.ParquetFile that returns all the metadata/statistics.

Linking to https://github.com/pola-rs/polars/issues/6870

yuuuxt commented 1 year ago

> data = pl.scan_parquet(file_path_data)
> data.select(pl.count()).collect()[0,0]
>
> I mean I know it works but I believe it works by using the metadata and not reading the whole file.

I guess you mean it's not using the metadata and it's not instantaneous? In one arbitrary example on my side, reading the row count from the metadata takes "0.0s", while this method takes ~1 minute.

> Personally, I don't like the metadata plus data returned tuple from the same function. I'd rather see read_parquet_schema get replaced with read_parquet_metadata and have that be more like pq.ParquetFile that returns all the metadata/statistics.

Having read_parquet_metadata would be an improvement, but as you can see, a similar result can already be achieved with pyarrow. Also, because it's a separate function, three lines of code are still required: one for file_path, one for scan_parquet, and one for read_parquet_metadata. For example, loading dozens of files in one notebook would introduce dozens of redundant, single-use file_path variables.

But of course it's not a big deal; it only comes down to the preferred API design. In the meantime, I've started using a simple custom "scan_parquet" that wraps pl.scan_parquet and ParquetFile.metadata from pyarrow to achieve this.

RmStorm commented 1 week ago

I would also really like access to the metadata via polars without having to pull in pyarrow :man_shrugging: . It would be very useful.