ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io

[Data] Enabling local shuffle buffer reduces throughput when iterating with Ray Trainer #42317

Open scottjlee opened 9 months ago

scottjlee commented 9 months ago

What happened + What you expected to happen

When iterating over a Ray Dataset within the TorchTrainer train loop, setting a non-None local_shuffle_buffer_size decreases throughput compared to disabling the local shuffle buffer.

With 1 GPU node running the multi_node_train_benchmark, we observed 417 img/s (local shuffle buffer enabled) vs. 753 img/s (disabled).
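
For context, the iteration pattern exercised by the benchmark is essentially the following (a minimal sketch rather than the exact benchmark code; the dataset path, batch size, and buffer size below are placeholders):

```python
import ray
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    it = train.get_dataset_shard("train")
    for _ in range(config["num_epochs"]):
        # local_shuffle_buffer_size=None disables the local shuffle buffer;
        # any non-None value enables it, which is where the throughput drop
        # shows up.
        for _batch in it.iter_torch_batches(
            batch_size=config["batch_size"],
            prefetch_batches=16,
            local_shuffle_buffer_size=config["local_shuffle_buffer_size"],
        ):
            pass  # --skip-train-model: no forward/backward pass


ds = ray.data.read_images("s3://...")  # image dataset; path elided here

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={
        "num_epochs": 2,
        "batch_size": 32,                   # placeholder
        "local_shuffle_buffer_size": None,  # vs. a non-None value to enable
    },
    datasets={"train": ds},
    scaling_config=ScalingConfig(num_workers=1, use_gpu=True),
)
trainer.fit()
```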

Versions / Dependencies

Ray master

Reproduction script

Run the multi_node_train_benchmark with prefetch batches specified and an empty model (--skip-train-model), e.g.:

```
python multi_node_train_benchmark.py --num-workers 1 --file-type image --use-gpu --num-epochs 2 --skip-train-model --prefetch-batches 16
```

When running the multi_node_train_benchmark in a heterogeneous setting (1 GPU, N extra CPUs), we observe the following throughput (img/s) with the local shuffle buffer disabled vs. enabled:

| # Extra CPUs | Local shuffle buffer disabled | Local shuffle buffer size = 16 |
| --- | --- | --- |
| 0 | 753 | 417 |
| 1 | 1233 (+480) | 443 (+26) |
| 5 | 1711 | 468 |
| 10 | 1764 | 472 |

There are two gaps of interest:

- At a fixed number of extra CPUs, enabling the local shuffle buffer roughly halves throughput (e.g. 417 vs. 753 img/s with 0 extra CPUs).
- Adding extra CPUs substantially improves throughput with the buffer disabled (+480 img/s going from 0 to 1 extra CPU) but barely helps with it enabled (+26 img/s).

Issue Severity

None

oceanusxiv commented 4 months ago

@scottjlee Hi, I'm curious whether there have been any updates on this issue? We've seen this severely impact our training loop as well.

ae20cg commented 1 week ago

@scottjlee following up on this - my performance for tabular data loading (arrays of 12K float32 values) goes from about 200 s per epoch to 1200 s per epoch when I set local_shuffle_buffer_size to a value <= batch_size * prefetch_batches.
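
For concreteness, the iteration call in question looks roughly like this (ds_shard is the dataset shard from ray.train.get_dataset_shard, and the batch size / prefetch numbers are placeholders rather than our actual settings):

```python
batch_size = 256        # placeholder value
prefetch_batches = 4    # placeholder value

for batch in ds_shard.iter_torch_batches(
    batch_size=batch_size,
    prefetch_batches=prefetch_batches,
    # ~200 s per epoch with this set to None; ~1200 s per epoch when set
    # to a value <= batch_size * prefetch_batches.
    local_shuffle_buffer_size=batch_size * prefetch_batches,
):
    ...
```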

This seems like a major degradation. Are there any tips for improving this (outside of the docs page that already exists)?

scottjlee commented 1 week ago

Sorry for the delay on this, folks; we haven't had the bandwidth to look into this code path for quite some time.

Other than the tips on this docs page, I would suggest:

ae20cg commented 1 week ago

@scottjlee

No problem. I am using file-based shuffling.

What about performing a map_batches(shuffle_func, batch_size=None) on the dataset? Is it possible you could get better shuffle performance, assuming the block size is very different from the batch_size, leading to some variation in batches from epoch to epoch?
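
Something like this is what I have in mind (shuffle_func below is just a sketch that permutes rows within each block using NumPy):

```python
import numpy as np


def shuffle_func(block: dict) -> dict:
    # With batch_size=None, map_batches receives whole blocks, so this
    # permutes rows within each block (shuffle granularity == block size).
    num_rows = len(next(iter(block.values())))
    perm = np.random.permutation(num_rows)
    return {col: vals[perm] for col, vals in block.items()}


ds = ds.map_batches(shuffle_func, batch_size=None, batch_format="numpy")
```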

I'm also a bit surprised that local_shuffle_buffer_size degrades performance so much. In theory, if our buffer_size is less than or equal to prefetch_batches * batch_size, shouldn't it be pretty fast?