The example above shows that the request sent to RayService failed during the rolling upgrade test. Upon further examination of the logs and code, I found the failure occurs at the stage where the serve service is fully updated, and the request is being sent to the new RayCluster.
KubeRay only switches to the new RayCluster once new RayCluster is ready, and the application running on it is also ready. Considering we allocate only a little cpu and memory to the Ray Pod during the test, it is possible that we are overwhelming the proxy actor.
This PR adds a delay to avoid overwhelming the proxy actor.
Why are these changes needed?
https://buildkite.com/ray-project/ray-ecosystem-ci-kuberay-ci/builds/3506#018e3637-d117-4fdc-a098-d13cdd2850d1/243-2217
The example above shows that the request sent to RayService failed during the rolling upgrade test. Upon further examination of the logs and code, I found the failure occurs at the stage where the serve service is fully updated, and the request is being sent to the new RayCluster.
KubeRay only switches to the new RayCluster once new RayCluster is ready, and the application running on it is also ready. Considering we allocate only a little cpu and memory to the Ray Pod during the test, it is possible that we are overwhelming the proxy actor.
This PR adds a delay to avoid overwhelming the proxy actor.
Related issue number
Checks