ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

Add 1K, 5K and 10K RayCluster/RayJob scalability test results #2218

Closed: andrewsykim closed this 6 days ago

andrewsykim commented 3 months ago

Why are these changes needed?

Per https://github.com/ray-project/kuberay/issues/2069, this PR adds 1K and 5K RayCluster / RayJob scalability test results.

Related issue number

https://github.com/ray-project/kuberay/issues/2069

andrewsykim commented 3 months ago

@kevin85421

kevin85421 commented 2 months ago

https://github.com/kubernetes/perf-tests/tree/master/clusterloader2#measurement

andrewsykim commented 2 months ago

> Great! Would you mind adding a README to briefly document the benchmark results we have? Thanks!

I updated the README with some references on how to understand the results.

andrewsykim commented 6 days ago

> Btw, what's the configuration of KubeRay (e.g. CPUs? memory? reconcile-concurrency)? Thanks!

We ran KubeRay with a 16 CPU request and a 32GiB memory limit, with --reconcile-concurrency=5. https://github.com/ray-project/kuberay/pull/2228 was an important fix for the RayJob scalability tests.

In practice we didn't need that much. These are the CPU / memory graphs from the most recent run:

[images: CPU and memory graphs from the most recent run]
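
For readers who want to reproduce a similar setup, the settings mentioned above could be expressed as a partial container spec for the operator Deployment. This is only a minimal Python sketch, not the configuration actually used in the tests: the container name `kuberay-operator` and the helper name `operator_container_patch` are hypothetical, and only the CPU request, the memory limit, and the `--reconcile-concurrency` value are taken from the comment. No CPU limit or memory request is set because none was stated.

```python
import json


def operator_container_patch(cpu_request="16", memory_limit="32Gi",
                             reconcile_concurrency=5):
    """Build a partial Kubernetes container spec for the operator.

    Only the values quoted in the thread are filled in; everything else
    (image, ports, probes, ...) is intentionally left out.
    """
    return {
        # Hypothetical container name, used for illustration only.
        "name": "kuberay-operator",
        # Flag quoted in the thread.
        "args": [f"--reconcile-concurrency={reconcile_concurrency}"],
        "resources": {
            # "16 CPU requests" is read here as a CPU request of 16 cores.
            "requests": {"cpu": cpu_request},
            # 32GiB is written as the Kubernetes quantity "32Gi".
            "limits": {"memory": memory_limit},
        },
    }


if __name__ == "__main__":
    # Print the fragment as JSON so it can be compared against a manifest.
    print(json.dumps(operator_container_patch(), indent=2))
```

Since the graphs suggest the operator used less than this in practice, the defaults could be lowered by passing smaller values for `cpu_request` and `memory_limit`.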