uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

make_batch_reader + shuffling queue clarification #705

Closed · brent-lemieux closed this issue 3 years ago

brent-lemieux commented 3 years ago

Hello, I am hoping to get clarification on this NOTE in the docstrings of petastorm.pytorch.DataLoader: https://github.com/uber/petastorm/blob/20e46e03ab303148cad4e1e803e8a354469949c7/petastorm/pytorch.py#L155

NOTE: if you are using ``make_batch_reader``, this shuffling queue will be randomizing the order of the
entire batches and not changing the order of elements within a batch. This is likely not what you intend to do.

I'm not quite sure what this means. Does "batch" here refer to the batch returned by make_batch_reader, or to the batch fed to the model? It seems to me that each make_batch_reader batch is completely randomized as long as shuffling_queue_capacity is larger than the make_batch_reader batch size.

Is this what the NOTE is referring to? If so, is this discouraged because of the added latency? I was hoping to use make_batch_reader together with petastorm.pytorch.DataLoader and its RandomShufflingBuffer, and wanted to make sure I'm not violating best practices by doing so.
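For concreteness, here is roughly the setup I have in mind (a minimal sketch; the dataset URL, batch size, and queue capacity are placeholders):

```python
from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

# Placeholder URL; any Parquet dataset reachable by petastorm works here.
with make_batch_reader('file:///tmp/my_parquet_dataset') as reader:
    # shuffling_queue_capacity > 0 enables the shuffling queue inside
    # petastorm.pytorch.DataLoader before rows are re-batched to batch_size.
    loader = DataLoader(reader, batch_size=128, shuffling_queue_capacity=10000)
    for batch in loader:
        # `batch` is a dict mapping column names to torch tensors
        pass
```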

Thanks for your help!

v01dXYZ commented 3 years ago

It seems the code adds the batch elements one by one to the shuffling buffer:

https://github.com/uber/petastorm/blob/20e46e03ab303148cad4e1e803e8a354469949c7/petastorm/pytorch.py#L210-L215

The NOTE seems outdated.
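For what it's worth, the effective behavior then looks something like this simplified sketch (not the actual implementation; names are illustrative):

```python
import random

class ShufflingBufferSketch:
    """Rows go in one at a time; they come out in random order once full."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._items = []

    def add(self, row):
        self._items.append(row)

    def can_retrieve(self):
        return len(self._items) >= self._capacity

    def retrieve(self):
        # Swap a random element to the end and pop it: O(1) random removal.
        i = random.randrange(len(self._items))
        self._items[i], self._items[-1] = self._items[-1], self._items[i]
        return self._items.pop()

# Because each make_batch_reader batch is unrolled row by row before being
# added, rows from different batches interleave once the buffer fills up,
# so shuffling happens at row granularity, not batch granularity.
```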

selitvin commented 3 years ago

I edited the docstring. Can you please see if it answers your question?

https://github.com/uber/petastorm/pull/709/files

brent-lemieux commented 3 years ago

Thanks, that helps a lot!