Search before asking
[X] I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
I deployed a RayService and it worked fine. After two days I noticed that the worker pod had restarted once and the Serve deployment no longer existed, so I dug into the kuberay-operator's log and found the following:
```
16:20:18.281326087Z 16:20:18.280Z INFO controllers.RayService Reconciling the cluster component. {"ServiceName": "bps/chatglm3"}
16:20:18.281408262Z 16:20:18.281Z INFO controllers.RayService Reconciling the Serve component. Only the active Ray cluster exists. {"ServiceName": "bps/chatglm3"}
16:20:18.281654133Z 16:20:18.281Z INFO controllers.RayService FetchHeadServiceURL {"head service URL": "chatglm3-raycluster-hl5jb-head-svc.bps.svc.cluster.local:52365", "port": "dashboard-agent"}
16:20:18.338618804Z 16:20:18.338Z ERROR controllers.RayService Fail to reconcileServe. {"ServiceName": "bps/chatglm3", "error": "Failed to get Serve deployment statuses from the head's dashboard agent port (the head service's port with the name `dashboard-agent`). If you observe this error consistently, please check https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayservice-troubleshooting.md for more details. err: Get \"http://chatglm3-raycluster-hl5jb-head-svc.bps.svc.cluster.local:52365/api/serve/deployments/status\": dial tcp 100.64.224.28:52365: connect: connection refused"}
16:20:18.338676359Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
16:20:18.338685589Z /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114
16:20:18.338693600Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
16:20:18.338700382Z /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311
16:20:18.338707131Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
16:20:18.338713847Z /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266
16:20:18.338720242Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
16:20:18.338726480Z /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227
16:20:18.338752120Z 16:20:18.338Z DEBUG events Normal {"object": {"kind":"RayService","namespace":"bps","name":"chatglm3","uid":"35022649-d398-47f7-95f7-27cbc4d5cbdb","apiVersion":"ray.io/v1","resourceVersion":"550144671"}, "reason": "FailedToGetServeDeploymentStatus", "message": "Failed to get Serve deployment statuses from the head's dashboard agent port (the head service's port with the name `dashboard-agent`). If you observe this error consistently, please check https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayservice-troubleshooting.md for more details. err: Get \"http://chatglm3-raycluster-hl5jb-head-svc.bps.svc.cluster.local:52365/api/serve/deployments/status\": dial tcp 100.64.224.28:52365: connect: connection refused"}
16:20:19.21687644Z 16:20:19.021Z INFO controllers.RayCluster reconciling RayCluster {"cluster name": "chatglm3-raycluster-hl5jb"}
16:20:19.21750968Z 16:20:19.021Z INFO controllers.RayCluster Reconciling Ingress
16:20:19.21762167Z 16:20:19.021Z INFO controllers.RayCluster reconcileHeadService {"1 head service found": "chatglm3-raycluster-hl5jb-head-svc"}
16:20:19.21796719Z 16:20:19.021Z INFO controllers.RayCluster reconcilePods {"Found 1 head Pod": "chatglm3-raycluster-hl5jb-head-7gmlz", "Pod status": "Running", "Pod restart policy": "Always", "Ray container terminated status": "nil"}
16:20:19.21810681Z 16:20:19.021Z INFO controllers.RayCluster reconcilePods {"head Pod": "chatglm3-raycluster-hl5jb-head-7gmlz", "shouldDelete": false, "reason": "KubeRay does not need to delete the head Pod chatglm3-raycluster-hl5jb-head-7gmlz. The Pod status is Running, and the Ray container terminated status is nil."}
16:20:19.21821506Z 16:20:19.021Z INFO controllers.RayCluster reconcilePods {"desired workerReplicas (always adhering to minReplicas/maxReplica)": 1, "worker group": "qa-group", "maxReplicas": 1, "minReplicas": 1, "replicas": 1}
16:20:19.22296523Z 16:20:19.021Z INFO controllers.RayCluster reconcilePods {"worker Pod": "chatglm3-raycluster-hl5jb-worker-qa-group-9mknf", "shouldDelete": false, "reason": "KubeRay does not need to delete the worker Pod chatglm3-raycluster-hl5jb-worker-qa-group-9mknf. The Pod status is Running, and the Ray container terminated status is nil."}
16:20:19.22349399Z 16:20:19.022Z INFO controllers.RayCluster reconcilePods {"removing the pods in the scaleStrategy of": "qa-group"}
16:20:19.22359950Z 16:20:19.022Z INFO controllers.RayCluster reconcilePods {"workerReplicas": 1, "runningPods": 1, "diff": 0}
16:20:19.22368752Z 16:20:19.022Z INFO controllers.RayCluster reconcilePods {"all workers already exist for group": "qa-group"}
16:20:19.22844133Z 16:20:19.022Z INFO controllers.RayCluster Environment variable RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV is not set, using default value of 300 seconds {"cluster name": "chatglm3-raycluster-hl5jb"}
16:20:19.22906402Z 16:20:19.022Z INFO controllers.RayCluster Unconditional requeue after {"cluster name": "chatglm3-raycluster-hl5jb", "seconds": 300}
16:20:20.340197424Z 16:20:20.339Z INFO controllers.RayService Reconciling the cluster component. {"ServiceName": "bps/chatglm3"}
16:20:20.340512551Z 16:20:20.340Z INFO controllers.RayService Active Ray cluster config matches goal config.
16:20:20.340567747Z 16:20:20.340Z INFO controllers.RayService Reconciling the Serve component. Only the active Ray cluster exists. {"ServiceName": "bps/chatglm3"}
16:20:20.340575138Z 16:20:20.340Z INFO controllers.RayService FetchHeadServiceURL {"head service name": "chatglm3-raycluster-hl5jb-head-svc", "namespace": "bps"}
16:20:20.340581374Z 16:20:20.340Z INFO controllers.RayService FetchHeadServiceURL {"head service URL": "chatglm3-raycluster-hl5jb-head-svc.bps.svc.cluster.local:52365", "port": "dashboard-agent"}
16:20:20.340899593Z 16:20:20.340Z DEBUG controllers.RayService shouldUpdate {"shouldUpdateServe": false, "reason": "Current Serve config matches cached Serve config, and some deployments have been deployed for cluster chatglm3-raycluster-hl5jb", "cachedServeConfig": {"importPath":"qa.deployment","runtimeEnv":"working_dir: \"file:///home/ray/qa-embedding/qa.zip\"\n","deployments":[{"name":"QADeployment","numReplicas":1,"rayActorOptions":{"numGpus":1}}]}, "current Serve config": {"importPath":"qa.deployment","runtimeEnv":"working_dir: \"file:///home/ray/qa-embedding/qa.zip\"\n","deployments":[{"name":"QADeployment","numReplicas":1,"rayActorOptions":{"numGpus":1}}]}}
16:20:20.364648100Z 16:20:20.364Z ERROR controllers.RayService Fail to reconcileServe. {"ServiceName": "bps/chatglm3", "error": "Failed to get Serve deployment statuses from the head's dashboard agent port (the head service's port with the name `dashboard-agent`). If you observe this error consistently, please check https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayservice-troubleshooting.md for more details. err: Get \"http://chatglm3-raycluster-hl5jb-head-svc.bps.svc.cluster.local:52365/api/serve/deployments/status\": dial tcp 100.64.224.28:52365: connect: connection refused"}
16:20:20.364711548Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
16:20:20.364720634Z /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114
16:20:20.364728182Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
16:20:20.364735023Z /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311
16:20:20.364741819Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
16:20:20.364748243Z /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266
16:20:20.364754711Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
16:20:20.364761174Z /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227
16:20:20.364788281Z 16:20:20.364Z DEBUG events Normal {"object": {"kind":"RayService","namespace":"bps","name":"chatglm3","uid":"35022649-d398-47f7-95f7-27cbc4d5cbdb","apiVersion":"ray.io/v1","resourceVersion":"550144703"}, "reason": "FailedToGetServeDeploymentStatus", "message": "Failed to get Serve deployment statuses from the head's dashboard agent port (the head service's port with the name `dashboard-agent`). If you observe this error consistently, please check https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayservice-troubleshooting.md for more details. err: Get \"http://chatglm3-raycluster-hl5jb-head-svc.bps.svc.cluster.local:52365/api/serve/deployments/status\": dial tcp 100.64.224.28:52365: connect: connection refused"}
16:20:22.365567027Z 16:20:22.365Z INFO controllers.RayService Reconciling the cluster component. {"ServiceName": "bps/chatglm3"}
16:20:22.365986832Z 16:20:22.365Z INFO controllers.RayService Active Ray cluster config matches goal config.
16:20:22.365996182Z 16:20:22.365Z INFO controllers.RayService Reconciling the Serve component. Only the active Ray cluster exists. {"ServiceName": "bps/chatglm3"}
16:20:22.366000604Z 16:20:22.365Z INFO controllers.RayService FetchHeadServiceURL {"head service name": "chatglm3-raycluster-hl5jb-head-svc", "namespace": "bps"}
16:20:22.366013411Z 16:20:22.365Z INFO controllers.RayService FetchHeadServiceURL {"head service URL": "chatglm3-raycluster-hl5jb-head-svc.bps.svc.cluster.local:52365", "port": "dashboard-agent"}
16:20:22.366341590Z 16:20:22.366Z DEBUG controllers.RayService shouldUpdate {"shouldUpdateServe": false, "reason": "Current Serve config matches cached Serve config, and some deployments have been deployed for cluster chatglm3-raycluster-hl5jb", "cachedServeConfig": {"importPath":"qa.deployment","runtimeEnv":"working_dir: \"file:///home/ray/qa-embedding/qa.zip\"\n","deployments":[{"name":"QADeployment","numReplicas":1,"rayActorOptions":{"numGpus":1}}]}, "current Serve config": {"importPath":"qa.deployment","runtimeEnv":"working_dir: \"file:///home/ray/qa-embedding/qa.zip\"\n","deployments":[{"name":"QADeployment","numReplicas":1,"rayActorOptions":{"numGpus":1}}]}}
16:20:23.265201098Z 16:20:23.264Z DEBUG controllers.RayService getAndCheckServeStatus {"prev statuses": {"default":{"status":"RUNNING","lastUpdateTime":"2024-03-29T05:18:32Z","healthLastUpdateTime":"2024-03-29T05:18:32Z","serveDeploymentStatuses":{"default_QADeployment":{"status":"HEALTHY","lastUpdateTime":"2024-03-29T05:18:32Z","healthLastUpdateTime":"2024-03-29T05:18:32Z"}}}}, "serve statuses": {"default":{"name":"default","status":"NOT_STARTED","deployments":{}}}}
16:20:23.265275079Z 16:20:23.265Z DEBUG controllers.RayService getAndCheckServeStatus {"new statuses": {"default":{"status":"NOT_STARTED","lastUpdateTime":"16:20:23Z","healthLastUpdateTime":"16:20:23Z"}}}
16:20:23.265286862Z 16:20:23.265Z INFO controllers.RayService Check serve health {"ServiceName": "bps/chatglm3", "isHealthy": true, "isReady": false, "isActive": true}
16:20:23.281653316Z 16:20:23.281Z INFO controllers.RayService Mark cluster as waiting for Serve deployments {"ServiceName": "bps/chatglm3", "rayCluster": {"apiVersion": "ray.io/v1", "kind": "RayCluster", "namespace": "bps", "name": "chatglm3-raycluster-hl5jb"}}
16:20:23.281709212Z 16:20:23.281Z INFO controllers.RayService Cluster is healthy but not ready: checking again in 2s {"ServiceName": "bps/chatglm3"}
16:20:23.337077721Z 16:20:23.336Z DEBUG events Normal {"object": {"kind":"RayService","namespace":"bps","name":"chatglm3","uid":"35022649-d398-47f7-95f7-27cbc4d5cbdb","apiVersion":"ray.io/v1","resourceVersion":"550144753"}, "reason": "ServiceNotReady", "message": "The service is not ready yet. Controller will perform a round of actions in 2s."}
```
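The failing call is a plain HTTP GET against the head service's dashboard-agent port, so it can be checked by hand. Below is a minimal Go sketch of the same probe, assuming it is run from a pod inside the cluster (the URL is copied from the log above); a "connection refused" from it would confirm that nothing on the head pod is listening on port 52365:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// The endpoint the operator polls, copied from the log above.
	url := "http://chatglm3-raycluster-hl5jb-head-svc.bps.svc.cluster.local:52365/api/serve/deployments/status"

	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		// "connection refused" here matches the operator's error and means
		// the dashboard agent on the head pod is not serving on this port.
		fmt.Println("dashboard agent unreachable:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```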
The error message says "If you observe this error consistently, please check https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayservice-troubleshooting.md for more details.", but that link no longer works; the guide appears to have moved to https://docs.ray.io/en/latest/cluster/kubernetes/troubleshooting/rayservice-troubleshooting.html.
If someone is willing to help me find out why the Serve deployment stayed down until I recreated the RayService, I can provide the RayService CR YAML.
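In the meantime, the Serve config the operator had cached is visible in the shouldUpdate log lines above; rendered back as the serveConfig block of the CR, it would look roughly like the sketch below (field names assumed from the RayService serveConfig schema; the rest of the CR is omitted):

```yaml
# Reconstructed from the cachedServeConfig in the shouldUpdate log lines;
# only the Serve portion of the RayService CR is shown.
serveConfig:
  importPath: qa.deployment
  runtimeEnv: |
    working_dir: "file:///home/ray/qa-embedding/qa.zip"
  deployments:
    - name: QADeployment
      numReplicas: 1
      rayActorOptions:
        numGpus: 1
```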
Reproduction script
None
Anything else
No response
Are you willing to submit a PR?