uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

Multiple index selectors #369

Closed by GregAru 5 years ago

GregAru commented 5 years ago

Currently, the rowgroup_selector parameter of make_reader appears to allow filtering on only a single indexed field. This is limiting, since one might need to filter a dataset by multiple indexed fields. Predicates can be combined with rowgroup_selector to get the desired data, but this approach might require extensive I/O against the data repository.

This pull request addresses the issue by adding IntersectIndexSelector and UnionIndexSelector, which filter row groups using multiple indexes.
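To illustrate the idea of combining indexes, here is a minimal standalone sketch of the set logic involved. It does not use Petastorm and is not the code from this PR: the `INDEXES` dictionary and the helper names `select_single`, `select_intersection`, and `select_union` are hypothetical, and the index is assumed to map each field value to the set of row-group ids containing it.

```python
# Hypothetical in-memory stand-in for row-group indexes (not Petastorm's format):
# index name -> {field value -> set of row-group ids containing that value}
INDEXES = {
    "sensor": {"cam": {0, 1, 2}, "lidar": {2, 3}},
    "city": {"sf": {1, 2, 4}, "nyc": {0, 3}},
}


def select_single(indexes, index_name, values):
    """Return row groups where the indexed field takes any of the given values."""
    index = indexes[index_name]
    selected = set()
    for value in values:
        selected |= index.get(value, set())
    return selected


def select_intersection(indexes, criteria):
    """Return row groups that satisfy every (index_name, values) criterion."""
    per_index = [select_single(indexes, name, vals) for name, vals in criteria]
    return set.intersection(*per_index) if per_index else set()


def select_union(indexes, criteria):
    """Return row groups that satisfy at least one (index_name, values) criterion."""
    per_index = [select_single(indexes, name, vals) for name, vals in criteria]
    return set().union(*per_index)


criteria = [("sensor", ["cam"]), ("city", ["sf"])]
both = select_intersection(INDEXES, criteria)
either = select_union(INDEXES, criteria)
print(sorted(both), sorted(either))
```

Because selection in this sketch happens at row-group granularity, a row group returned by the intersection contains at least one row matching each criterion, but not necessarily a single row matching all of them. A predicate can still be applied afterwards for exact row-level filtering, now over fewer row groups.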

CLAassistant commented 5 years ago

CLA assistant check
All committers have signed the CLA.