ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[RayService] Address Recent Flakiness in RayService Zero Downtime Rollout Test #1979

Closed Yicheng-Lu-llll closed 4 months ago

Yicheng-Lu-llll commented 4 months ago

Why are these changes needed?

Recently, the RayService zero-downtime rollout test has become flaky:

https://github.com/ray-project/kuberay/pull/1837 describes the root cause and introduces a hotfix. My previous PR, https://github.com/ray-project/kuberay/pull/1928, attempted to address the issue and remove the hotfix but, as shown above, it did not actually resolve the flakiness.

This PR adds logic to the zero-downtime rollout tests for the transition period, that is, while the serve service has not yet been fully updated and the old RayCluster has not yet been deleted. During that period, RayServiceUpdateCREvent expects the query request to be processed by either the old or the new RayCluster.
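The idea can be illustrated with a minimal sketch. This is not the code from this PR: the function name `is_acceptable_result` and its parameters are hypothetical, and it assumes the test can tell whether the rollout has finished and knows the expected output for both the old and the new RayCluster.

```python
from typing import Any


def is_acceptable_result(
    actual: Any,
    old_expected: Any,
    new_expected: Any,
    rollout_finished: bool,
) -> bool:
    """Hypothetical check, not the actual KubeRay test code.

    While the rollout is still in progress, a query may be served by
    either the old or the new RayCluster, so both outputs are accepted.
    Once the rollout has finished, only the new output is accepted.
    """
    if rollout_finished:
        return actual == new_expected
    return actual in (old_expected, new_expected)


if __name__ == "__main__":
    # During the transition, a reply from either cluster is fine.
    print(is_acceptable_result("old", "old", "new", rollout_finished=False))
    print(is_acceptable_result("new", "old", "new", rollout_finished=False))
    # After the rollout, a reply from the old cluster is a failure.
    print(is_acceptable_result("old", "old", "new", rollout_finished=True))
```

Accepting both outputs only during the transition window keeps the test strict once the rollout completes, while tolerating the nondeterminism in which cluster serves a given request.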

Related issue number

Checks

Yicheng-Lu-llll commented 4 months ago

I have triggered the CI test five times, and the RayService test has passed every time. The RayService Zero Downtime Rollout Test seems stable now. Note that the failed CI tests are RayCluster tests, which are unrelated to this PR.