Open yuuuxt opened 1 year ago
I think this works for finding the number of rows:
data = pl.scan_parquet(file_path_data)
data.select(pl.count()).collect()[0,0]
I mean, I know it works, but I believe it works by using the metadata and not reading the whole file.
I agree that it'd be nice to have a metadata utility but in the interim you can use that.
Personally, I don't like the metadata-plus-data tuple returned from the same function. I'd rather see read_parquet_schema get replaced with read_parquet_metadata and have that be more like pq.ParquetFile, which returns all the metadata/statistics.
data = pl.scan_parquet(file_path_data)
data.select(pl.count()).collect()[0,0]
I mean, I know it works, but I believe it works by using the metadata and not reading the whole file.
I guess you mean it's not using the metadata, and so it's not instantaneous? In one arbitrary example on my side, getting the row count from the metadata takes "0.0s", while this method takes ~1 minute.
Personally, I don't like the metadata-plus-data tuple returned from the same function. I'd rather see read_parquet_schema get replaced with read_parquet_metadata and have that be more like pq.ParquetFile, which returns all the metadata/statistics.
Having read_parquet_metadata is an improvement, but as you can see, a similar result can already be achieved with pyarrow. Also, because it's a separate function, three lines of code are still required: one for file_path, one for scan_parquet, and one for read_parquet_metadata. For example, loading dozens of files in one notebook would introduce dozens of redundant, single-use file_path variables.
But of course it's not a big deal; it's only a question of the preferred API design. While this is being discussed, I've started using a simple custom "scan_parquet" that wraps pl.scan_parquet and ParquetFile.metadata from pyarrow to achieve this.
I would also really like access to the metadata via polars without having to pull in pyarrow :man_shrugging:. It would be very useful.
Problem description
I'm used to printing the shape of a parquet file when I load it, which helps with checking that it's the correct file and with estimating execution time:
In polars it's not straightforward to get the number of rows, and some boilerplate code is needed. (ref: #3792)
It would be nice if pl.scan_parquet could optionally return the metadata as well:
ref: #11404