ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[RayService] Reduce Flakiness by Avoiding Overwhelm of the Proxy Actor in RayService Tests. #1999

Closed Yicheng-Lu-llll closed 5 months ago

Yicheng-Lu-llll commented 6 months ago

Why are these changes needed?

https://buildkite.com/ray-project/ray-ecosystem-ci-kuberay-ci/builds/3506#018e3637-d117-4fdc-a098-d13cdd2850d1/243-2217

The build above shows a request to the RayService failing during the rolling upgrade test. After examining the logs and code, I found that the failure occurs after the serve service has been fully updated, when requests are already being sent to the new RayCluster.

KubeRay only switches to the new RayCluster once the new RayCluster is ready and the application running on it is also ready. Since we allocate only a small amount of CPU and memory to the Ray Pod during the test, it is possible that we are overwhelming the proxy actor.

This PR adds a delay to avoid overwhelming the proxy actor.
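
For illustration only, the idea of spacing out test requests could be sketched as below. This is not the code in this PR: `send_requests_with_delay`, `send_request`, and the default `delay_seconds` value are hypothetical names and values chosen for the example, and the real test helpers may be structured differently.

```python
import time
from typing import Callable, List


def send_requests_with_delay(
    send_request: Callable[[], bool],
    num_requests: int,
    delay_seconds: float = 1.0,  # assumed value, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> List[bool]:
    """Call send_request num_requests times, pausing between consecutive calls.

    send_request is a hypothetical callable that returns True on success.
    The pause gives a resource-constrained proxy actor time to finish one
    request before the next one arrives.
    """
    results: List[bool] = []
    for i in range(num_requests):
        if i > 0:
            # No pause before the first request, only between requests.
            sleep(delay_seconds)
        results.append(send_request())
    return results
```

Passing `sleep` as a parameter keeps the sketch easy to unit test without actually waiting; in a real test it would default to `time.sleep`.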

Related issue number

Checks