Petastorm is a library that enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Hello, I am hoping to get clarification on this NOTE in the docstrings of `petastorm.pytorch.DataLoader`: https://github.com/uber/petastorm/blob/20e46e03ab303148cad4e1e803e8a354469949c7/petastorm/pytorch.py#L155

> NOTE: if you are using ``make_batch_reader``, this shuffling queue will be randomizing the order of the
> entire batches and not changing the order of elements within a batch. This is likely not what you intend to do.

I'm not quite sure what this means. Does "batch" here refer to the `make_batch_reader` batch or the batch to be fed to the model? It seems to me that each `make_batch_reader` batch is completely randomized if `shuffling_queue_capacity` is larger than the `make_batch_reader` batch size.

Is this what the NOTE is referring to? If so, is it not recommended because of the added latency? I was hoping to use `make_batch_reader` + `petastorm.pytorch.DataLoader` with the `RandomShufflingBuffer`, and wanted to make sure I'm not violating best practices by doing so. Thanks for your help!