Closed Yicheng-Lu-llll closed 4 months ago
I have triggered the CI test 5 times, and the RayService test passed every time. The RayService Zero Downtime Rollout Test seems stable now. Note that the failed CI tests are RayCluster tests, which are unrelated to this PR.
Why are these changes needed?
Recently, the RayService zero-downtime rollout test has become flaky:
https://github.com/ray-project/kuberay/pull/1837 describes the root cause and introduces a hotfix. My previous PR, https://github.com/ray-project/kuberay/pull/1928, attempted to address the issue and remove the hotfix but, as shown above, it failed to resolve the flakiness.
This PR introduces logic to ensure that, for zero-downtime rollout tests, RayServiceUpdateCREvent expects the query request to be processed by either the old or the new RayCluster during the period when the Serve service has not been fully updated and the old RayCluster has not been deleted.
Related issue number
Checks