uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

make_batch_reader + shuffling queue clarification #705

Closed · brent-lemieux closed this issue 3 years ago

brent-lemieux commented 3 years ago

Hello, I am hoping to get clarification on this NOTE in the docstrings of petastorm.pytorch.DataLoader: https://github.com/uber/petastorm/blob/20e46e03ab303148cad4e1e803e8a354469949c7/petastorm/pytorch.py#L155

NOTE: if you are using ``make_batch_reader``, this shuffling queue will be randomizing the order of the
entire batches and not changing the order of elements within a batch. This is likely not what you intend to do.

I'm not quite sure what this means. Does "batch" here refer to the batch returned by make_batch_reader, or to the batch fed to the model? It seems to me that each make_batch_reader batch is completely randomized as long as shuffling_queue_capacity is larger than the make_batch_reader batch size.

Is this what the NOTE is referring to? If so, is this discouraged because of the added latency? I was hoping to use make_batch_reader together with petastorm.pytorch.DataLoader and its RandomShufflingBuffer, and wanted to make sure I'm not violating best practices by doing so.
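For concreteness, here is roughly the setup I have in mind (a minimal sketch; the dataset URL, batch size, and queue capacity are placeholders):

```python
from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

# Placeholder URL; any Parquet dataset reachable by petastorm works here.
with make_batch_reader('file:///tmp/my_parquet_dataset') as reader:
    # shuffling_queue_capacity > 0 enables the shuffling queue inside
    # petastorm.pytorch.DataLoader before rows are re-batched to batch_size.
    loader = DataLoader(reader, batch_size=128, shuffling_queue_capacity=10000)
    for batch in loader:
        # `batch` is a dict mapping column names to torch tensors
        pass
```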

Thanks for your help!

v01dXYZ commented 3 years ago

It seems the code adds the batch elements one by one to the shuffling buffer:

https://github.com/uber/petastorm/blob/20e46e03ab303148cad4e1e803e8a354469949c7/petastorm/pytorch.py#L210-L215

The NOTE seems outdated.
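For what it's worth, the effective behavior then looks something like this simplified sketch (not the actual implementation; names are illustrative):

```python
import random

class ShufflingBufferSketch:
    """Rows go in one at a time; they come out in random order once full."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._items = []

    def add(self, row):
        self._items.append(row)

    def can_retrieve(self):
        return len(self._items) >= self._capacity

    def retrieve(self):
        # Swap a random element to the end and pop it: O(1) random removal.
        i = random.randrange(len(self._items))
        self._items[i], self._items[-1] = self._items[-1], self._items[i]
        return self._items.pop()

# Because each make_batch_reader batch is unrolled row by row before being
# added, rows from different batches interleave once the buffer fills up,
# so shuffling happens at row granularity, not batch granularity.
```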

selitvin commented 3 years ago

I edited the docstring. Can you please see if it answers your question?

https://github.com/uber/petastorm/pull/709/files

brent-lemieux commented 3 years ago

Thanks, that helps a lot!