Closed aseembits93 closed 3 years ago
Petastorm currently has no API that reports the number of rows a reader will return. It should be possible to add one to the make_reader API, but you can also get the count with standard pyarrow Parquet reading tools.
Supposing your dataset was created with materialize_dataset, you can directly query the _metadata Parquet metadata file:
import pyarrow.parquet as pq
pmd = pq.read_metadata("/tmp/helloworld/_metadata")
pmd.num_rows
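If the dataset has no _metadata file, one possible fallback is to sum the row counts from the footers of the individual Parquet files. This is only a rough sketch and not part of petastorm: the helper name count_parquet_rows and its read_metadata parameter are made up for illustration, and it assumes a local directory whose data files end in .parquet. Only the footers are read, so no row data is loaded.

```python
from pathlib import Path


def count_parquet_rows(dataset_dir, read_metadata=None):
    """Sum num_rows over every *.parquet file found under dataset_dir.

    Hypothetical helper, not a petastorm API. `read_metadata` is an
    optional hook (an assumption of this sketch) that defaults to
    pyarrow.parquet.read_metadata.
    """
    if read_metadata is None:
        # Imported lazily so pyarrow is only required for the default reader.
        import pyarrow.parquet as pq
        read_metadata = pq.read_metadata

    total = 0
    # rglob also descends into partition subdirectories.
    # Files such as _metadata or _SUCCESS do not match the pattern.
    for path in sorted(Path(dataset_dir).rglob("*.parquet")):
        total += read_metadata(str(path)).num_rows
    return total


# Usage, assuming the dataset from the example above lives on local disk:
# count_parquet_rows("/tmp/helloworld")
```

Note that this counts the rows stored on disk. If you pass a predicate or shard the data in make_reader, the number of rows the reader actually yields can be smaller.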
Thanks @v01dXYZ! I'll close the issue.
Hi, thanks for sharing this repo. It's really useful for my work. I was wondering how I can find out the size of a dataset I'm loading with the PyTorch API.
Let me know how I could do it. Thanks in advance!