The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0
Schema inference does not apply filters to Metadata Discovery #591
Schema inference in petastorm.reader.make_reader uses pyarrow.parquet.ParquetDataset to discover metadata (via petastorm.etl.dataset_metadata.get_schema_from_dataset_url), but does not pass filters as an argument (see the instantiation here). As a result, both make_reader and make_batch_reader (which runs schema inference in order to produce error messages) hit a bottleneck on any dataset with a large number of partitions, regardless of the filters passed to those functions. In effect, the filters argument brings little benefit unless it is applied to metadata discovery as well as to data reading.
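To illustrate why this matters, here is a minimal standard-library sketch of hive-style partition discovery with and without pruning. It is not Petastorm or pyarrow code: `make_partitioned_tree`, `discover`, and the dict-based filter form are hypothetical names and simplifications introduced only for this example. It shows that when filters are applied during discovery, the number of directories listed scales with the matching partitions rather than with the whole dataset.

```python
import os
import tempfile


def make_partitioned_tree(root, years, months):
    """Create hive-style partition directories, each with one placeholder file."""
    for year in years:
        for month in months:
            part_dir = os.path.join(root, f"year={year}", f"month={month}")
            os.makedirs(part_dir)
            # Empty placeholder; a real dataset would hold Parquet data here.
            open(os.path.join(part_dir, "part-0.parquet"), "w").close()


def discover(root, filters=None):
    """Walk the tree and return (files, number_of_directories_listed).

    filters is a simplified, hypothetical form: a dict mapping a partition
    key to the set of allowed string values. Subtrees that fail a filter
    are skipped without being listed.
    """
    filters = filters or {}
    files, listed = [], 0
    stack = [root]
    while stack:
        current = stack.pop()
        listed += 1  # one directory listing per visited directory
        for entry in sorted(os.listdir(current)):
            path = os.path.join(current, entry)
            if os.path.isdir(path):
                key, _, value = entry.partition("=")
                if key in filters and value not in filters[key]:
                    continue  # pruned: this partition is never listed
                stack.append(path)
            else:
                files.append(path)
    return files, listed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        make_partitioned_tree(root, range(2018, 2022), range(1, 13))
        all_files, all_listed = discover(root)
        few_files, few_listed = discover(root, {"year": {"2021"}})
        print(len(all_files), all_listed)  # 48 53
        print(len(few_files), few_listed)  # 12 14
```

In this toy layout the unfiltered walk lists 53 directories, while the filtered walk lists 14. On object stores, where each listing is a network round trip, the same gap would dominate start-up time for heavily partitioned datasets.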