ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io

[Data] Enabling local shuffle buffer reduces throughput when iterating with Ray Trainer #42317

Open scottjlee opened 9 months ago

scottjlee commented 9 months ago

What happened + What you expected to happen

When iterating over a Ray Dataset within the TorchTrainer train loop, setting a non-None local_shuffle_buffer_size decreases throughput compared to disabling the local shuffle buffer.

With 1 GPU node running the multi_node_train_benchmark, we observed 417 img/s (local shuffle buffer enabled) vs. 753 img/s (disabled).
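
For context, the iteration pattern exercised by the benchmark is essentially the following (a minimal sketch rather than the exact benchmark code; the dataset path, batch size, and buffer size below are placeholders):

```python
import ray
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    it = train.get_dataset_shard("train")
    for _ in range(config["num_epochs"]):
        # local_shuffle_buffer_size=None disables the local shuffle buffer;
        # any non-None value enables it, which is where the throughput drop
        # shows up.
        for _batch in it.iter_torch_batches(
            batch_size=config["batch_size"],
            prefetch_batches=16,
            local_shuffle_buffer_size=config["local_shuffle_buffer_size"],
        ):
            pass  # --skip-train-model: no forward/backward pass


ds = ray.data.read_images("s3://...")  # image dataset; path elided here

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={
        "num_epochs": 2,
        "batch_size": 32,                   # placeholder
        "local_shuffle_buffer_size": None,  # vs. a non-None value to enable
    },
    datasets={"train": ds},
    scaling_config=ScalingConfig(num_workers=1, use_gpu=True),
)
trainer.fit()
```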

Versions / Dependencies

Ray master

Reproduction script

Run the multi_node_train_benchmark with prefetch batches specified and an empty model (--skip-train-model), e.g.:

```
python multi_node_train_benchmark.py --num-workers 1 --file-type image --use-gpu --num-epochs 2 --skip-train-model --prefetch-batches 16
```

When running the multi_node_train_benchmark in a heterogeneous setting (1 GPU, N extra CPUs), we observe the following throughput (img/s) with the local shuffle buffer disabled vs. enabled:

| # Extra CPUs | Local shuffle buffer disabled | Local shuffle buffer size = 16 |
| --- | --- | --- |
| 0 | 753 | 417 |
| 1 | 1233 (+480) | 443 (+26) |
| 5 | 1711 | 468 |
| 10 | 1764 | 472 |

There are two gaps of interest:

- At a fixed number of extra CPUs, enabling the local shuffle buffer roughly halves throughput (e.g. 417 vs. 753 img/s with 0 extra CPUs).
- Adding extra CPUs substantially improves throughput with the buffer disabled (+480 img/s going from 0 to 1 extra CPU) but barely helps with it enabled (+26 img/s).

Issue Severity

None

oceanusxiv commented 4 months ago

@scottjlee Hi, I'm curious whether there have been any updates on this issue? We've seen this severely impact our training loop as well.

ae20cg commented 1 week ago

@scottjlee following up on this - my performance for tabular data loading (arrays of 12K float32 values) goes from about 200 s per epoch to 1200 s per epoch when I set local_shuffle_buffer_size to a value <= batch_size * prefetch_batches.
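
For concreteness, the iteration call in question looks roughly like this (ds_shard is the dataset shard from ray.train.get_dataset_shard, and the batch size / prefetch numbers are placeholders rather than our actual settings):

```python
batch_size = 256        # placeholder value
prefetch_batches = 4    # placeholder value

for batch in ds_shard.iter_torch_batches(
    batch_size=batch_size,
    prefetch_batches=prefetch_batches,
    # ~200 s per epoch with this set to None; ~1200 s per epoch when set
    # to a value <= batch_size * prefetch_batches.
    local_shuffle_buffer_size=batch_size * prefetch_batches,
):
    ...
```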

This seems like a major degradation. Are there any tips for improving this (outside of the docs page that already exists)?

scottjlee commented 1 week ago

Sorry for the delay on this, folks; we haven't had the bandwidth to look into this code path for quite some time.

Other than the tips on this docs page, I would suggest:

ae20cg commented 1 week ago

@scottjlee

No problem. I am using file-based shuffling.

What about performing a map_batches(shuffle_func, batch_size=None) on the dataset? Is it possible you could get better shuffle performance, assuming the block size is very different from the batch_size, leading to some variation in batches from epoch to epoch?
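
Something like this is what I have in mind (shuffle_func below is just a sketch that permutes rows within each block using NumPy):

```python
import numpy as np


def shuffle_func(block: dict) -> dict:
    # With batch_size=None, map_batches receives whole blocks, so this
    # permutes rows within each block (shuffle granularity == block size).
    num_rows = len(next(iter(block.values())))
    perm = np.random.permutation(num_rows)
    return {col: vals[perm] for col, vals in block.items()}


ds = ds.map_batches(shuffle_func, batch_size=None, batch_format="numpy")
```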

I'm also a bit surprised that local_shuffle_buffer_size degrades performance so much. In theory, if our buffer_size is less than or equal to prefetch_batches * batch_size, shouldn't it be pretty fast?