Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
Shuffle is a key workload for stressing Ray core's distributed dataplane. For large datasets, it requires all-to-all communication and spilling to disk. Thus, shuffle stresses the object transfer and object spilling/restoring protocols in Ray's backend.
So far, we have successfully tested shuffles up to 1TB scale (and once at 100TB, though only on a 50-node cluster) using both Dask-on-Ray and custom shuffle implementations written with Ray tasks. Eventually, we want to ensure that Ray can scale to a petabyte-scale shuffle on a large cluster. This may require different shuffle algorithms and further optimization of Ray's dataplane. A related goal is optimizing distributed shuffle performance to match the state of the art.
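To make the all-to-all structure concrete, here is a minimal sketch of the map/reduce shuffle pattern that a task-based implementation follows. This is plain Python (no Ray dependency) standing in for Ray tasks: each `map_task` call corresponds to a remote map task that hash-partitions its input block, and each `reduce_task` call corresponds to a remote reduce task that pulls one partition from every map task. The names `map_task`, `reduce_task`, and the partition counts are illustrative, not Ray APIs.

```python
from collections import defaultdict

NUM_MAPPERS = 4
NUM_REDUCERS = 4

def map_task(block):
    # Hash-partition the block: one output list per reducer.
    # In a Ray-tasks shuffle, each output would be a separate object
    # placed in the object store (spilled to disk if memory is full).
    parts = [[] for _ in range(NUM_REDUCERS)]
    for record in block:
        parts[hash(record) % NUM_REDUCERS].append(record)
    return parts

def reduce_task(chunks):
    # Merge the chunks destined for this reducer. Fetching one chunk
    # from every mapper is the all-to-all communication step.
    merged = []
    for chunk in chunks:
        merged.extend(chunk)
    return sorted(merged)

# Map phase: every mapper partitions its block for every reducer.
blocks = [list(range(m * 10, m * 10 + 10)) for m in range(NUM_MAPPERS)]
map_out = [map_task(b) for b in blocks]  # NUM_MAPPERS x NUM_REDUCERS

# Shuffle + reduce phase: reducer r collects partition r from each mapper.
results = [reduce_task([map_out[m][r] for m in range(NUM_MAPPERS)])
           for r in range(NUM_REDUCERS)]
```

With M mappers and R reducers this produces M × R intermediate objects, which is why naive task-based shuffles stress the object transfer and spilling protocols as the data size grows.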