uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

Schema inference does not apply filters to Metadata Discovery #591

Closed. dmcguire81 closed this issue 3 years ago.

dmcguire81 commented 4 years ago

Schema inference in petastorm.reader.make_reader uses pyarrow.parquet.ParquetDataset to discover metadata (via petastorm.etl.dataset_metadata.get_schema_from_dataset_url), but does not pass filters as an argument when instantiating it. As a result, metadata discovery is a bottleneck in both make_reader and make_batch_reader (the latter also goes through schema inference in order to give better error messages) for any dataset with a large number of partitions, irrespective of the filters passed in to those functions. Essentially, adding the filters argument is not meaningful unless it is applied to metadata discovery as well as to reading the data.
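To illustrate why this matters, here is a minimal, standard-library-only sketch of discovery over a hive-style partitioned directory tree. It is not Petastorm or pyarrow code: `make_partitioned_tree` and `discover` are hypothetical helpers, and the `filters` dict is a simplified stand-in for the filter tuples those libraries accept. It only shows that pruning during the directory walk reduces the number of directories that have to be listed, whereas filtering after discovery does not.

```python
import os
import tempfile


def make_partitioned_tree(root, years, months):
    """Create an empty hive-style layout: root/year=Y/month=M/part-0.parquet."""
    for year in years:
        for month in months:
            directory = os.path.join(root, f"year={year}", f"month={month}")
            os.makedirs(directory)
            # Empty placeholder file; only the directory layout matters here.
            open(os.path.join(directory, "part-0.parquet"), "w").close()


def discover(root, filters=None):
    """Walk the tree and return (files found, number of directories listed).

    `filters` is a hypothetical, simplified form: a dict mapping a partition
    key to the set of allowed (string) values. Directories whose partition
    value is not allowed are skipped without being listed.
    """
    filters = filters or {}
    files, listed = [], 0
    stack = [root]
    while stack:
        current = stack.pop()
        listed += 1  # one listing call per directory visited
        for entry in sorted(os.listdir(current)):
            path = os.path.join(current, entry)
            if os.path.isdir(path):
                key, _, value = entry.partition("=")
                if key in filters and value not in filters[key]:
                    continue  # prune: never descend into this partition
                stack.append(path)
            else:
                files.append(path)
    return files, listed


with tempfile.TemporaryDirectory() as root:
    make_partitioned_tree(root, years=[2018, 2019, 2020], months=[1, 2, 3, 4])

    # Discovery that ignores filters lists every partition directory.
    all_files, all_listed = discover(root)

    # Discovery that applies filters lists only the matching partitions.
    some_files, some_listed = discover(root, filters={"year": {"2020"}})

print(f"unfiltered: {len(all_files)} files, {all_listed} directories listed")
print(f"filtered:   {len(some_files)} files, {some_listed} directories listed")
```

With 3 years and 4 months per year, the unfiltered walk lists 16 directories (the root, 3 year directories, and 12 month directories), while the filtered walk lists 6. On a dataset with many thousands of partitions on remote storage, that difference in listing calls is the bottleneck described above.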

dmcguire81 commented 3 years ago

Circling back after testing: the filters argument does not appear to be passed on to pyarrow.parquet.ParquetManifest, so supplying it can't actually speed up discovery.