ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

[core] Scale shuffle to 200+ nodes #20499

Open stephanie-wang opened 2 years ago

stephanie-wang commented 2 years ago

Shuffle is a key workload for stressing Ray core's distributed dataplane. For large datasets, it requires all-to-all communication and spilling to disk. Thus, shuffle stresses the object transfer and object spilling/restoring protocols in Ray's backend.
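For reference on the spilling half of this, here is a minimal sketch of enabling Ray's filesystem object spilling backend, so that intermediate shuffle blocks exceeding object store memory get spilled to local disk. The spill directory is illustrative, and the exact `_system_config` keys may differ across Ray versions:

```python
import json
import ray

# Illustrative spill directory; the keys follow Ray's documented
# filesystem spilling backend but may vary by Ray version.
ray.init(
    _system_config={
        "object_spilling_config": json.dumps({
            "type": "filesystem",
            "params": {"directory_path": "/tmp/ray_spill"},
        })
    }
)
```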

So far, we have successfully tested shuffles up to 1TB (and once at 100TB, but only on 50 nodes) using both Dask-on-Ray and custom shuffle implementations written with Ray tasks. Eventually, we want to ensure that Ray can scale to petabyte-scale shuffles on large clusters. This may require different shuffle algorithms and optimizations to Ray's dataplane. A related effort is optimizing distributed shuffle performance to match the state of the art.
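For concreteness, a minimal sketch of the kind of task-based shuffle we test with. The mapper/reducer counts and the synthetic data are hypothetical stand-ins; a real benchmark sizes these to the cluster:

```python
import numpy as np
import ray

ray.init()

NUM_MAPPERS = 4      # hypothetical; a real test scales these to the cluster
NUM_REDUCERS = 4
ROWS_PER_MAPPER = 100_000

@ray.remote(num_returns=NUM_REDUCERS)
def map_task(mapper_id):
    # Generate this mapper's partition, then hash-split it into one
    # block per reducer. Each block becomes a separate Ray object that
    # can be transferred between nodes or spilled to disk.
    rng = np.random.default_rng(mapper_id)
    data = rng.integers(0, 1 << 30, size=ROWS_PER_MAPPER)
    return tuple(data[data % NUM_REDUCERS == r] for r in range(NUM_REDUCERS))

@ray.remote
def reduce_task(*blocks):
    # The all-to-all step: gather one block from every mapper and
    # combine them (here, just counting rows as a stand-in for real work).
    return sum(len(b) for b in blocks)

# M map tasks, each returning R blocks: M * R intermediate objects.
map_out = [map_task.remote(m) for m in range(NUM_MAPPERS)]

# Reducer r fetches the r-th block from every mapper, so the backend
# performs O(M * R) object transfers, the pattern that stresses
# Ray's dataplane at scale.
results = ray.get([
    reduce_task.remote(*[map_out[m][r] for m in range(NUM_MAPPERS)])
    for r in range(NUM_REDUCERS)
])
print("total rows shuffled:", sum(results))
```

Because every reducer pulls one block from every mapper, transfer count grows multiplicatively with cluster size, which is why this workload exercises the object transfer and spilling protocols so heavily at 200+ nodes.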

stephanie-wang commented 2 years ago

@franklsf95 has been working on some ideas here.