shrekris-anyscale closed this issue 2 years ago
This issue is not reproducible, so it's unlikely to be an actual bug. However, https://github.com/ray-project/kuberay/issues/616 raised the same issue but without as clean a repro. I'm filing this issue (and subsequently closing it) to track a clean repro in case it's useful in the future.
Search before asking
KubeRay Component
Others
What happened + What you expected to happen
Note: It turns out that this is not an issue. The bug in #616 was likely due to a misconfigured config file. The HTTPProxies do update with new replicas. I'm filing (and subsequently closing) this issue for tracking purposes since #616 didn't have a clean repro.
When KubeRay is deployed with GCS Fault Tolerance (FT), Serve's HTTPProxies don't update when a replica crashes and recovers. Instead, they seem to route requests only to the replicas that never crashed.
Reproduction script
Python Code:
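As a rough sketch, a SleepyPid deployment for this repro might look like the following. Assumptions: the deployment simply returns the replica's PID, the replica count of 6 matches the 6 PIDs mentioned in the reproduction steps below, and the startup sleep plus the `app = SleepyPid.bind()` entry point are my own guesses rather than the original script.

```python
import os
import time

from ray import serve


@serve.deployment(num_replicas=6)  # 6 replicas, matching the 6 PIDs seen in the repro steps
class SleepyPid:
    def __init__(self):
        # Assumed: simulate a slow startup so replica recovery takes a noticeable moment.
        time.sleep(1)

    async def __call__(self, request=None) -> int:
        # Return this replica's PID so callers can tell replicas apart.
        return os.getpid()


# Assumed entry point that a Serve config (or `serve.run`) can deploy.
app = SleepyPid.bind()
```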
Vanilla Kubernetes Config
Fault Tolerant (FT) Kubernetes Config
Reproduction Steps
- Crash one of the replicas, e.g. by `exec`ing into one of the pods and using the Python interpreter. See this guide for an explanation.
- Make `curl` requests again. If you deployed the vanilla config, you should still see 6 different PIDs, but one of them will be new. This is the replica that died and recovered. This is the expected behavior.
- The new replica can also be reached directly through a handle (e.g. by `exec`ing into a worker pod, using the Python interpreter to call `serve.get_deployment("SleepyPid").get_handle()`, and then making requests through the handle; a sketch of this check follows these steps). You can also see that the new replica is alive by running `ray list actors --filter "class_name=ServeReplica:SleepyPid"`.
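For completeness, here is a rough sketch of that handle-based check, run from a Python interpreter inside a worker pod. The `serve.get_deployment("SleepyPid").get_handle()` call is the one described above; the `ray.init` arguments and the number of requests are assumptions.

```python
import ray
from ray import serve

# Attach to the running cluster from inside the pod; the address and
# namespace used here are assumptions about how the cluster is reachable.
ray.init(address="auto", namespace="serve")

# Grab a handle to the deployment and send requests to it directly,
# bypassing the HTTPProxies.
handle = serve.get_deployment("SleepyPid").get_handle()

# Collect the PIDs that answer the handle requests; the recovered
# replica's new PID should appear in this set.
pids = {ray.get(handle.remote()) for _ in range(20)}
print(pids)
```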
Anything else
This issue is not reproducible, so it's unlikely to be an actual bug. However, #616 raised the same issue but without as clean a repro. I'm filing this issue (and subsequently closing it) to track a clean repro in case it's useful in the future.
Are you willing to submit a PR?