[Open] scottjlee opened 9 months ago
@scottjlee Hi, I'm curious: have there been any updates on this issue? I've seen this severely impact our training loop as well.
@scottjlee Following up on this: my performance for tabular data loading (arrays of 12K float32s) goes from about 200 s per epoch to 1200 s per epoch when I set local_shuffle_buffer_size to <= batch_size * prefetch_batches.
This seems like a major degradation. Are there any tips for improving it (beyond the doc page that already exists)?
Sorry for the delay on this folks, we haven't had bandwidth to look into this code path for quite some time.
Other than the tips on this docs page, I would suggest decreasing local_shuffle_buffer_size until the "Batch iteration time breakdown" time shown in ds.stats() is low.
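For concreteness, a minimal sketch of that tuning loop; the dataset shape, batch size, and candidate buffer sizes below are illustrative assumptions, not values from this issue:

```python
import ray

# Stand-in dataset; substitute your real one.
ds = ray.data.range_tensor(10_000, shape=(128,))

for buf_size in (None, 1_000, 10_000):
    # One full pass so the iteration stats get populated.
    for _ in ds.iter_batches(batch_size=256, local_shuffle_buffer_size=buf_size):
        pass
    # Inspect the "Batch iteration time breakdown" section of the stats and
    # keep shrinking local_shuffle_buffer_size while shuffling dominates it.
    print(f"local_shuffle_buffer_size={buf_size}")
    print(ds.stats())
```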
@scottjlee No problem. I am using file-based shuffling.
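(For illustration, one way to express file-based shuffling in Ray Data; recent Ray versions accept a shuffle="files" argument on the read APIs, and the path below is a placeholder:)

```python
import ray

# shuffle="files" randomizes the order in which input files are read,
# giving coarse-grained, file-level shuffling without a per-row buffer.
ds = ray.data.read_parquet("s3://my-bucket/train/", shuffle="files")
```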
What about performing a map_batches(shuffle_func, batch_size=None) on the dataset? Is it possible you could get better shuffle performance that way, assuming the block size is very different from the batch_size, leading to some differentiation in batches from epoch to epoch? (A sketch is below.)
I'm also a bit surprised that local_shuffle_buffer_size degrades performance so much. In theory, if our buffer_size is less than or equal to prefetch_batches * batch_size, shouldn't it be pretty fast?
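For concreteness, a sketch of that map_batches-based shuffle; shuffle_batch is a hypothetical helper, and note it only permutes rows within each block rather than globally:

```python
import numpy as np
import ray

def shuffle_batch(batch: dict) -> dict:
    # With batch_size=None, each call receives one whole block;
    # permute all of its rows.
    n = len(next(iter(batch.values())))
    idx = np.random.permutation(n)
    return {k: v[idx] for k, v in batch.items()}

ds = ray.data.range_tensor(10_000, shape=(128,))
shuffled = ds.map_batches(shuffle_batch, batch_size=None, batch_format="numpy")
```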
What happened + What you expected to happen
When iterating over a Ray Dataset within the TorchTrainer train loop, a non-None local_shuffle_buffer_size causes a decrease in throughput compared to disabling the local shuffle buffer. In the case of a single GPU node running the multi_node_train_benchmark, we observed 417 img/s (local shuffle buffer enabled) vs. 753 img/s (disabled). A sketch of the iteration pattern in question follows.
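This is a hedged sketch, not the benchmark itself; the dataset, batch size, and buffer size are placeholder assumptions rather than the benchmark's actual values:

```python
import ray
from ray import train
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    it = train.get_dataset_shard("train")
    for _ in range(config["num_epochs"]):
        for _batch in it.iter_torch_batches(
            batch_size=32,
            prefetch_batches=4,
            # Setting this to a non-None value is what degrades throughput.
            local_shuffle_buffer_size=1024,
        ):
            pass  # "empty model": no compute, isolating data loading

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 1},
    datasets={"train": ray.data.range_tensor(10_000, shape=(128,))},
    scaling_config=train.ScalingConfig(num_workers=1, use_gpu=True),
)
trainer.fit()
```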
Versions / Dependencies
Ray master
Reproduction script
Run the multi_node_train_benchmark benchmark with prefetch batches specified and an empty model. A standalone timing comparison in the same spirit is sketched below.
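Since the exact benchmark invocation is not shown here, this is a hedged standalone comparison of the two configurations (dataset shape and sizes are assumptions):

```python
import time
import ray

ds = ray.data.range_tensor(50_000, shape=(128,))

def time_epoch(buf_size):
    start = time.perf_counter()
    rows = 0
    for batch in ds.iter_batches(
        batch_size=32,
        prefetch_batches=4,
        local_shuffle_buffer_size=buf_size,
        batch_format="numpy",
    ):
        rows += len(batch["data"])  # range_tensor's column is named "data"
    elapsed = time.perf_counter() - start
    print(f"local_shuffle_buffer_size={buf_size}: {rows / elapsed:.0f} rows/s")

time_epoch(None)   # local shuffle buffer disabled
time_epoch(1024)   # enabled; this is the slower configuration reported above
```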
Running multi_node_train_benchmark in a heterogeneous setting (1 GPU, N extra CPUs), we observe the following throughput gaps between enabling and disabling the local shuffle buffer. There are two gaps of interest:
Issue Severity
None