ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

[core] Scale shuffle to 200+ nodes #20499

Open stephanie-wang opened 2 years ago

stephanie-wang commented 2 years ago

Shuffle is a key workload for stressing Ray core's distributed dataplane. For large datasets, it requires all-to-all communication and spilling to disk. Thus, shuffle stresses the object transfer and object spilling/restoring protocols in Ray's backend.
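For reference on the spilling half of this, here is a minimal sketch of enabling Ray's filesystem object spilling backend, so that intermediate shuffle blocks exceeding object store memory get spilled to local disk. The spill directory is illustrative, and the exact `_system_config` keys may differ across Ray versions:

```python
import json
import ray

# Illustrative spill directory; the keys follow Ray's documented
# filesystem spilling backend but may vary by Ray version.
ray.init(
    _system_config={
        "object_spilling_config": json.dumps({
            "type": "filesystem",
            "params": {"directory_path": "/tmp/ray_spill"},
        })
    }
)
```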

So far, we have successfully tested shuffles up to 1TB (and once at 100TB, but only on 50 nodes) using both Dask-on-Ray and custom shuffle implementations written with Ray tasks. Eventually, we want to ensure that Ray can scale to petabyte-scale shuffles on large clusters. This may require different shuffle algorithms and optimizations to Ray's dataplane. A related effort is optimizing distributed shuffle performance to match the state of the art.
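For concreteness, a minimal sketch of the kind of task-based shuffle we test with. The mapper/reducer counts and the synthetic data are hypothetical stand-ins; a real benchmark sizes these to the cluster:

```python
import numpy as np
import ray

ray.init()

NUM_MAPPERS = 4      # hypothetical; a real test scales these to the cluster
NUM_REDUCERS = 4
ROWS_PER_MAPPER = 100_000

@ray.remote(num_returns=NUM_REDUCERS)
def map_task(mapper_id):
    # Generate this mapper's partition, then hash-split it into one
    # block per reducer. Each block becomes a separate Ray object that
    # can be transferred between nodes or spilled to disk.
    rng = np.random.default_rng(mapper_id)
    data = rng.integers(0, 1 << 30, size=ROWS_PER_MAPPER)
    return tuple(data[data % NUM_REDUCERS == r] for r in range(NUM_REDUCERS))

@ray.remote
def reduce_task(*blocks):
    # The all-to-all step: gather one block from every mapper and
    # combine them (here, just counting rows as a stand-in for real work).
    return sum(len(b) for b in blocks)

# M map tasks, each returning R blocks: M * R intermediate objects.
map_out = [map_task.remote(m) for m in range(NUM_MAPPERS)]

# Reducer r fetches the r-th block from every mapper, so the backend
# performs O(M * R) object transfers, the pattern that stresses
# Ray's dataplane at scale.
results = ray.get([
    reduce_task.remote(*[map_out[m][r] for m in range(NUM_MAPPERS)])
    for r in range(NUM_REDUCERS)
])
print("total rows shuffled:", sum(results))
```

Because every reducer pulls one block from every mapper, transfer count grows multiplicatively with cluster size, which is why this workload exercises the object transfer and spilling protocols so heavily at 200+ nodes.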

stephanie-wang commented 2 years ago

@franklsf95 has been working on some ideas here.