The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
It seems that the rowgroup_selector parameter of make_reader currently allows filtering on only a single indexed field. This is limiting, since one might need to filter a dataset by multiple indexed fields. Predicates can be combined with rowgroup_selector to get the desired data, but a predicate is evaluated only after the selected row groups have been read, so this approach might require extensive I/O against the data repository.
The code below addresses this by adding support for filtering row groups on multiple indexes with IntersectIndexSelector and UnionIndexSelector.
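To make the intended semantics concrete, here is a minimal, library-independent sketch of how per-field row-group selections could be combined. It is not the code from this change and does not use the Petastorm API: the index dictionaries and the helpers select_single, select_intersection, and select_union are hypothetical names chosen for illustration, and each index is assumed to map a field value to the set of row-group ids that contain it.

```python
from functools import reduce

# Hypothetical in-memory indexes (illustration only): each maps a field value
# to the set of row-group ids holding at least one row with that value.
PARTITION_INDEX = {"p_1": {0, 1, 2}, "p_2": {3, 4}}
SENSOR_INDEX = {"cam": {1, 2, 3}, "lidar": {0, 4}}


def select_single(index, values):
    """Row groups containing any of the requested values of one indexed field."""
    return set().union(*(index.get(value, set()) for value in values))


def select_intersection(selections):
    """Row groups present in every per-field selection."""
    selections = list(selections)
    if not selections:
        return set()
    return reduce(set.intersection, selections)


def select_union(selections):
    """Row groups present in at least one per-field selection."""
    return set().union(*selections)


by_partition = select_single(PARTITION_INDEX, ["p_1"])
by_sensor = select_single(SENSOR_INDEX, ["cam"])

both = select_intersection([by_partition, by_sensor])
either = select_union([by_partition, by_sensor])
```

Note that selection at this level is coarse: a row group in the intersection contains rows matching each field, but not necessarily a single row matching both, so a predicate may still be needed for exact row-level filtering. The benefit is that far fewer row groups have to be read before the predicate runs.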