uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Support to get sample by index #378

Closed: un-knight closed this issue 5 years ago

un-knight commented 5 years ago

I want to get a sample by index, but there seems to be no way to do this, since Reader only yields samples by walking through the rows. As a result, I can't even shuffle the samples by index.

selitvin commented 5 years ago

If you want to query for a certain row, you can use predicates, or a combination of indexes and predicates. This is not a good tool for implementing arbitrary sampling policies, though, since it is not very efficient. The underlying issue is that the Parquet format is designed for chunk/batch processing and is not well suited to random row access.
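To illustrate why per-row access is costly in a chunked format, here is a minimal pure-Python sketch. It is a toy model, not Petastorm's or Parquet's actual implementation: `ChunkedStore`, `get_row`, `filter`, and the `id` field are all hypothetical names, and the only assumption is that a whole row group is the smallest unit that can be read.

```python
from bisect import bisect_right
from itertools import accumulate


class ChunkedStore:
    """Toy stand-in for a file made of row groups (hypothetical, not real Parquet)."""

    def __init__(self, row_groups):
        self.row_groups = row_groups
        sizes = [len(group) for group in row_groups]
        # starts[k] is the global index of the first row in row group k.
        self.starts = [0] + list(accumulate(sizes))[:-1]
        self.total_rows = sum(sizes)
        self.rows_decoded = 0  # counts how many rows we had to read

    def _read_group(self, k):
        # Assumption: a row group can only be read as a whole.
        self.rows_decoded += len(self.row_groups[k])
        return list(self.row_groups[k])

    def get_row(self, index):
        """Fetch one row by global index; still reads its entire row group."""
        if not 0 <= index < self.total_rows:
            raise IndexError(index)
        k = bisect_right(self.starts, index) - 1
        return self._read_group(k)[index - self.starts[k]]

    def filter(self, predicate):
        """Predicate-style lookup; with no index, every row group is scanned."""
        for k in range(len(self.row_groups)):
            for row in self._read_group(k):
                if predicate(row):
                    yield row


# 10 row groups of 100 rows each, with ids 0..999.
store = ChunkedStore(
    [[{"id": i} for i in range(start, start + 100)] for start in range(0, 1000, 100)]
)

row = store.get_row(537)
decoded_for_index = store.rows_decoded  # one full row group

store.rows_decoded = 0
matches = list(store.filter(lambda r: r["id"] == 537))
decoded_for_predicate = store.rows_decoded  # every row group
```

In this toy model, fetching a single row by index reads its whole 100-row group, and an unindexed predicate scan reads all 1000 rows. An index that narrows the candidate row groups reduces the second cost but not the first, which is why shuffling at the level of row groups tends to fit this kind of format better than shuffling by individual row index.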