Schema inference does not apply filters to Metadata Discovery

uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.

Apache License 2.0

1.78k stars, 285 forks

Schema inference in petastorm.reader.make_reader uses pyarrow.parquet.ParquetDataset to discover metadata (via petastorm.etl.dataset_metadata.get_schema_from_dataset_url), but does not pass filters to it (see the instantiation here). As a result, metadata discovery is a bottleneck for any dataset with a large number of partitions, regardless of the filters passed in. This affects both make_reader and make_batch_reader, since the latter also runs schema inference in order to produce error messages. In effect, the filters argument brings little benefit unless it is applied to metadata discovery as well as to data reads.

Schema inference does not apply filters to Metadata Discovery #591