uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

Reduce the size of a rowgroup for mnist example. #398

Closed selitvin closed 5 years ago

selitvin commented 5 years ago

The default row-group size is 256 MB, so the MNIST example ends up with a single row group holding all 60K rows. Such a dataset is really awkward to read, since all 60K images need to be decoded before the user gets a single row. With a smaller row-group size, the user gets the first samples right away, which makes the code easier to work with.
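A rough back-of-the-envelope sketch of why this happens, and what a smaller size would give. The helper `row_group_layout`, the 784 bytes-per-row figure (raw 28x28 uint8 pixels, ignoring labels, image encoding, and Parquet compression), and the 1 MB target are all illustrative assumptions, not values taken from the example or from the fix:

```python
import math


def row_group_layout(num_rows, bytes_per_row, row_group_size_mb):
    """Estimate (rows_per_group, num_groups) for a target row-group size.

    Hypothetical helper: real Parquet writers size row groups from the
    encoded data, so actual counts will differ.
    """
    target_bytes = row_group_size_mb * 1024 * 1024
    # A row group cannot hold more rows than the dataset has,
    # and must hold at least one row.
    rows_per_group = max(1, min(num_rows, target_bytes // bytes_per_row))
    num_groups = math.ceil(num_rows / rows_per_group)
    return rows_per_group, num_groups


MNIST_ROWS = 60_000
BYTES_PER_IMAGE = 28 * 28  # assumption: raw uint8 pixels, no overhead

# 60K rows * 784 bytes is about 47 MB, well under the 256 MB default,
# so everything lands in one row group.
default_layout = row_group_layout(MNIST_ROWS, BYTES_PER_IMAGE, 256)

# With an illustrative 1 MB target, a reader only has to decode
# roughly 1.3K images before the first row is available.
small_layout = row_group_layout(MNIST_ROWS, BYTES_PER_IMAGE, 1)

print(default_layout)  # (60000, 1)
print(small_layout)    # (1337, 45)
```

If I remember the API correctly, the row-group size is controlled through the `row_group_size_mb` argument of `materialize_dataset`, so the change for the MNIST example would amount to passing a small value there.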