Open shrekris-anyscale opened 2 years ago
cc @brucez-anyscale @wilsonwang371 @iycheng
@brucez-anyscale Bruce, I remember we have seen something similar to this before while using port forwarding, right?
I think @simon-mo and @iycheng have fixed this.
Can we close this issue? Thanks! @shrekris-anyscale @brucez-anyscale
Hi @kevin85421, this is still an issue, but I'm not sure if it's caused by Ray Serve itself or by KubeRay. It's somewhat mitigated by this Ray change, but I think we should leave this issue open for tracking. I've classified it as a P2.
@shrekris-anyscale what's the priority and impact of this issue now?
We've made more progress on this issue. #33384 will further reduce any downtime while the worker nodes are down. That change should ensure minimal downtime when this issue happens.
After merging that change, I'd be comfortable marking this issue as a P3, or closing it altogether.
Search before asking
KubeRay Component
Others
What happened + What you expected to happen
I had a Kubernetes cluster on GKE with 2 nodes that was running a RayService. It had 2 worker pods and 1 head pod. It also had a 1-node Redis cluster configured to support GCS Fault Tolerance:
I started a port-forward to a worker pod and successfully got responses from my deployments:
I then killed the head pod:
Once the head pod was deleted, it started recovering:
My port-forward did not immediately die, and the worker pod was not immediately restarted, which makes me think that GCS fault tolerance was configured correctly. However, while the head pod was recovering, all my curl requests hung. Note: my port-forward was eventually terminated and the worker pods were restarted after the head pod came back up.
Eventually, the head pod came back up, and the worker pods were restarted. After that, I could reconnect to the cluster and get successful responses from my deployments.
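For anyone trying to reproduce this, the hang can be measured rather than eyeballed. Below is a minimal sketch (not part of the original report) that polls the port-forwarded endpoint with a hard per-request timeout and reports the longest window during which requests failed or hung. The function names `probe`, `monitor`, and `longest_outage` are hypothetical helpers, and the URL you pass in is an assumption: it must match whatever local port your own port-forward exposes.

```python
import http.client
import time
import urllib.request
from typing import List, Optional, Tuple

# One observation: (monotonic timestamp in seconds, did the request succeed?)
Sample = Tuple[float, bool]


def probe(url: str, timeout_s: float) -> bool:
    """Return True if the URL answers with an HTTP 2xx within timeout_s."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, connection resets, and socket timeouts are all
        # OSError subclasses; a hung request surfaces here as a timeout.
        return False


def longest_outage(samples: List[Sample]) -> float:
    """Longest span in seconds from the first failure of a run to the next
    success, or to the final sample if the endpoint never recovered."""
    longest = 0.0
    outage_start: Optional[float] = None
    last_ts: Optional[float] = None
    for ts, ok in samples:
        if not ok and outage_start is None:
            outage_start = ts
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None
        last_ts = ts
    if outage_start is not None and last_ts is not None:
        longest = max(longest, last_ts - outage_start)
    return longest


def monitor(url: str, duration_s: float, interval_s: float = 1.0,
            timeout_s: float = 2.0) -> List[Sample]:
    """Poll url for roughly duration_s seconds and return the samples."""
    samples: List[Sample] = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        ts = time.monotonic()
        ok = probe(url, timeout_s)
        samples.append((ts, ok))
        print(f"{ts:.1f}s {'ok' if ok else 'FAILED/HUNG'}")
        time.sleep(interval_s)
    return samples
```

To use it, start `monitor(...)` against the forwarded address in one terminal, delete the head pod in another, and pass the returned samples to `longest_outage`. Keep in mind that the measured window also includes any time the port-forward itself was down, so it is an upper bound on the Serve-side downtime, not an exact figure.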
I can't tell if I simply misconfigured GCS fault tolerance, or if this is how GCS fault tolerance is meant to behave.
Reproduction script
Serve application: https://github.com/ray-project/serve_config_examples/blob/42d10bab77741b40d11304ad66d39a4ec2345247/sleepy_pid.py
Kubernetes config file:
Anything else
No response
Are you willing to submit a PR?