ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] Fail to reconcileServe #2057

Open LronDC opened 3 months ago


Search before asking

KubeRay Component

ray-operator

What happened + What you expected to happen

I deployed a RayService and it worked fine. Two days later I noticed that the worker Pod had restarted once and the Serve deployment no longer existed, so I dug into the kuberay-operator log and found the following:


16:20:23.337077721Z   16:20:23.336Z DEBUG   events  Normal  {"object": {"kind":"RayService","namespace":"bps","name":"chatglm3","uid":"35022649-d398-47f7-95f7-27cbc4d5cbdb","apiVersion":"ray.io/v1","resourceVersion":"550144753"}, "reason": "ServiceNotReady", "message": "The service is not ready yet. Controller will perform a round of actions in 2s."}

16:20:23.281653316Z   16:20:23.281Z INFO    controllers.RayService  Mark cluster as waiting for Serve deployments   {"ServiceName": "bps/chatglm3", "rayCluster": {"apiVersion": "ray.io/v1", "kind": "RayCluster", "namespace": "bps", "name": "chatglm3-raycluster-hl5jb"}}

16:20:23.281709212Z   16:20:23.281Z INFO    controllers.RayService  Cluster is healthy but not ready: checking again in 2s  {"ServiceName": "bps/chatglm3"}

16:20:23.265201098Z   16:20:23.264Z DEBUG   controllers.RayService  getAndCheckServeStatus  {"prev statuses": {"default":{"status":"RUNNING","lastUpdateTime":"2024-03-29T05:18:32Z","healthLastUpdateTime":"2024-03-29T05:18:32Z","serveDeploymentStatuses":{"default_QADeployment":{"status":"HEALTHY","lastUpdateTime":"2024-03-29T05:18:32Z","healthLastUpdateTime":"2024-03-29T05:18:32Z"}}}}, "serve statuses": {"default":{"name":"default","status":"NOT_STARTED","deployments":{}}}}

16:20:23.265275079Z   16:20:23.265Z DEBUG   controllers.RayService  getAndCheckServeStatus  {"new statuses": {"default":{"status":"NOT_STARTED","lastUpdateTime":"16:20:23Z","healthLastUpdateTime":"16:20:23Z"}}}

16:20:23.265286862Z   16:20:23.265Z INFO    controllers.RayService  Check serve health  {"ServiceName": "bps/chatglm3", "isHealthy": true, "isReady": false, "isActive": true}

16:20:22.366000604Z   16:20:22.365Z INFO    controllers.RayService  FetchHeadServiceURL {"head service name": "chatglm3-raycluster-hl5jb-head-svc", "namespace": "bps"}

16:20:22.366013411Z   16:20:22.365Z INFO    controllers.RayService  FetchHeadServiceURL {"head service URL": "chatglm3-raycluster-hl5jb-head-svc.bps.svc.cluster.local:52365", "port": "dashboard-agent"}

16:20:22.366341590Z   16:20:22.366Z DEBUG   controllers.RayService  shouldUpdate    {"shouldUpdateServe": false, "reason": "Current Serve config matches cached Serve config, and some deployments have been deployed for cluster chatglm3-raycluster-hl5jb", "cachedServeConfig": {"importPath":"qa.deployment","runtimeEnv":"working_dir: \"file:///home/ray/qa-embedding/qa.zip\"\n","deployments":[{"name":"QADeployment","numReplicas":1,"rayActorOptions":{"numGpus":1}}]}, "current Serve config": {"importPath":"qa.deployment","runtimeEnv":"working_dir: \"file:///home/ray/qa-embedding/qa.zip\"\n","deployments":[{"name":"QADeployment","numReplicas":1,"rayActorOptions":{"numGpus":1}}]}}

16:20:22.365567027Z   16:20:22.365Z INFO    controllers.RayService  Reconciling the cluster component.  {"ServiceName": "bps/chatglm3"}

16:20:22.365986832Z   16:20:22.365Z INFO    controllers.RayService  Active Ray cluster config matches goal config.

16:20:22.365996182Z   16:20:22.365Z INFO    controllers.RayService  Reconciling the Serve component. Only the active Ray cluster exists.    {"ServiceName": "bps/chatglm3"}

16:20:20.364754711Z   sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2

16:20:20.364761174Z     /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227

16:20:20.364788281Z   16:20:20.364Z DEBUG   events  Normal  {"object": {"kind":"RayService","namespace":"bps","name":"chatglm3","uid":"35022649-d398-47f7-95f7-27cbc4d5cbdb","apiVersion":"ray.io/v1","resourceVersion":"550144703"}, "reason": "FailedToGetServeDeploymentStatus", "message": "Failed to get Serve deployment statuses from the head's dashboard agent port (the head service's port with the name `dashboard-agent`). If you observe this error consistently, please check https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayservice-troubleshooting.md for more details. err: Get \"http://chatglm3-raycluster-hl5jb-head-svc.bps.svc.cluster.local:52365/api/serve/deployments/status\": dial tcp 100.64.224.28:52365: connect: connection refused"}

16:20:20.364648100Z   16:20:20.364Z ERROR   controllers.RayService  Fail to reconcileServe. {"ServiceName": "bps/chatglm3", "error": "Failed to get Serve deployment statuses from the head's dashboard agent port (the head service's port with the name `dashboard-agent`). If you observe this error consistently, please check https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayservice-troubleshooting.md for more details. err: Get \"http://chatglm3-raycluster-hl5jb-head-svc.bps.svc.cluster.local:52365/api/serve/deployments/status\": dial tcp 100.64.224.28:52365: connect: connection refused"}

16:20:20.364711548Z   sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile

16:20:20.364720634Z     /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114

16:20:20.364728182Z   sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler

16:20:20.364735023Z     /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311

16:20:20.364741819Z   sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem

16:20:20.364748243Z     /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266

16:20:20.340512551Z   16:20:20.340Z INFO    controllers.RayService  Active Ray cluster config matches goal config.

16:20:20.340567747Z   16:20:20.340Z INFO    controllers.RayService  Reconciling the Serve component. Only the active Ray cluster exists.    {"ServiceName": "bps/chatglm3"}

16:20:20.340575138Z   16:20:20.340Z INFO    controllers.RayService  FetchHeadServiceURL {"head service name": "chatglm3-raycluster-hl5jb-head-svc", "namespace": "bps"}

16:20:20.340197424Z   16:20:20.339Z INFO    controllers.RayService  Reconciling the cluster component.  {"ServiceName": "bps/chatglm3"}

16:20:20.340581374Z   16:20:20.340Z INFO    controllers.RayService  FetchHeadServiceURL {"head service URL": "chatglm3-raycluster-hl5jb-head-svc.bps.svc.cluster.local:52365", "port": "dashboard-agent"}

16:20:20.340899593Z   16:20:20.340Z DEBUG   controllers.RayService  shouldUpdate    {"shouldUpdateServe": false, "reason": "Current Serve config matches cached Serve config, and some deployments have been deployed for cluster chatglm3-raycluster-hl5jb", "cachedServeConfig": {"importPath":"qa.deployment","runtimeEnv":"working_dir: \"file:///home/ray/qa-embedding/qa.zip\"\n","deployments":[{"name":"QADeployment","numReplicas":1,"rayActorOptions":{"numGpus":1}}]}, "current Serve config": {"importPath":"qa.deployment","runtimeEnv":"working_dir: \"file:///home/ray/qa-embedding/qa.zip\"\n","deployments":[{"name":"QADeployment","numReplicas":1,"rayActorOptions":{"numGpus":1}}]}}

16:20:19.22906402Z   16:20:19.022Z  INFO    controllers.RayCluster  Unconditional requeue after {"cluster name": "chatglm3-raycluster-hl5jb", "seconds": 300}

16:20:19.22844133Z   16:20:19.022Z  INFO    controllers.RayCluster  Environment variable RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV is not set, using default value of 300 seconds  {"cluster name": "chatglm3-raycluster-hl5jb"}

16:20:19.22349399Z   16:20:19.022Z  INFO    controllers.RayCluster  reconcilePods   {"removing the pods in the scaleStrategy of": "qa-group"}

16:20:19.22359950Z   16:20:19.022Z  INFO    controllers.RayCluster  reconcilePods   {"workerReplicas": 1, "runningPods": 1, "diff": 0}

16:20:19.22368752Z   16:20:19.022Z  INFO    controllers.RayCluster  reconcilePods   {"all workers already exist for group": "qa-group"}

16:20:19.22296523Z   16:20:19.021Z  INFO    controllers.RayCluster  reconcilePods   {"worker Pod": "chatglm3-raycluster-hl5jb-worker-qa-group-9mknf", "shouldDelete": false, "reason": "KubeRay does not need to delete the worker Pod chatglm3-raycluster-hl5jb-worker-qa-group-9mknf. The Pod status is Running, and the Ray container terminated status is nil."}

16:20:19.21810681Z   16:20:19.021Z  INFO    controllers.RayCluster  reconcilePods   {"head Pod": "chatglm3-raycluster-hl5jb-head-7gmlz", "shouldDelete": false, "reason": "KubeRay does not need to delete the head Pod chatglm3-raycluster-hl5jb-head-7gmlz. The Pod status is Running, and the Ray container terminated status is nil."}

16:20:19.21821506Z   16:20:19.021Z  INFO    controllers.RayCluster  reconcilePods   {"desired workerReplicas (always adhering to minReplicas/maxReplica)": 1, "worker group": "qa-group", "maxReplicas": 1, "minReplicas": 1, "replicas": 1}

16:20:19.21762167Z   16:20:19.021Z  INFO    controllers.RayCluster  reconcileHeadService    {"1 head service found": "chatglm3-raycluster-hl5jb-head-svc"}

16:20:19.21750968Z   16:20:19.021Z  INFO    controllers.RayCluster  Reconciling Ingress

16:20:19.21796719Z   16:20:19.021Z  INFO    controllers.RayCluster  reconcilePods   {"Found 1 head Pod": "chatglm3-raycluster-hl5jb-head-7gmlz", "Pod status": "Running", "Pod restart policy": "Always", "Ray container terminated status": "nil"}

16:20:19.21687644Z   16:20:19.021Z  INFO    controllers.RayCluster  reconciling RayCluster  {"cluster name": "chatglm3-raycluster-hl5jb"}

16:20:18.338676359Z   sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile

16:20:18.338685589Z     /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114

16:20:18.338693600Z   sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler

16:20:18.338700382Z     /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311

16:20:18.338713847Z     /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266

16:20:18.338720242Z   sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2

16:20:18.338618804Z   16:20:18.338Z ERROR   controllers.RayService  Fail to reconcileServe. {"ServiceName": "bps/chatglm3", "error": "Failed to get Serve deployment statuses from the head's dashboard agent port (the head service's port with the name `dashboard-agent`). If you observe this error consistently, please check https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayservice-troubleshooting.md for more details. err: Get \"http://chatglm3-raycluster-hl5jb-head-svc.bps.svc.cluster.local:52365/api/serve/deployments/status\": dial tcp 100.64.224.28:52365: connect: connection refused"}

16:20:18.338707131Z   sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem

16:20:18.338726480Z     /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227

16:20:18.338752120Z   16:20:18.338Z DEBUG   events  Normal  {"object": {"kind":"RayService","namespace":"bps","name":"chatglm3","uid":"35022649-d398-47f7-95f7-27cbc4d5cbdb","apiVersion":"ray.io/v1","resourceVersion":"550144671"}, "reason": "FailedToGetServeDeploymentStatus", "message": "Failed to get Serve deployment statuses from the head's dashboard agent port (the head service's port with the name `dashboard-agent`). If you observe this error consistently, please check https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayservice-troubleshooting.md for more details. err: Get \"http://chatglm3-raycluster-hl5jb-head-svc.bps.svc.cluster.local:52365/api/serve/deployments/status\": dial tcp 100.64.224.28:52365: connect: connection refused"}

16:20:18.281654133Z   16:20:18.281Z INFO    controllers.RayService  FetchHeadServiceURL {"head service URL": "chatglm3-raycluster-hl5jb-head-svc.bps.svc.cluster.local:52365", "port": "dashboard-agent"}

16:20:18.281326087Z   16:20:18.280Z INFO    controllers.RayService  Reconciling the cluster component.  {"ServiceName": "bps/chatglm3"}

16:20:18.281408262Z   16:20:18.281Z INFO    controllers.RayService  Reconciling the Serve component. Only the active Ray cluster exists.    {"ServiceName": "bps/chatglm3"}
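The `getAndCheckServeStatus` entries in the log above carry the key transition: the previously recorded status of the `default` application is `RUNNING`, while the status reported by Serve at 16:20:23 is `NOT_STARTED` with no deployments. Below is a minimal Python sketch of that comparison. The payloads are copied from the log with the timestamp fields trimmed, the function name `regressed_apps` is hypothetical, and this is only an illustration, not the operator's own logic.

```python
import json

# Payloads copied from the "prev statuses" and "serve statuses" fields of the
# getAndCheckServeStatus log entry (timestamp fields trimmed for brevity).
prev_statuses = json.loads(
    '{"default":{"status":"RUNNING","serveDeploymentStatuses":'
    '{"default_QADeployment":{"status":"HEALTHY"}}}}'
)
serve_statuses = json.loads(
    '{"default":{"name":"default","status":"NOT_STARTED","deployments":{}}}'
)


def regressed_apps(prev, current):
    """Return names of apps that were RUNNING before but are not RUNNING now.

    Hypothetical helper for reading the log; not part of KubeRay.
    """
    return sorted(
        name
        for name, old in prev.items()
        if old.get("status") == "RUNNING"
        and current.get(name, {}).get("status") != "RUNNING"
    )


print(regressed_apps(prev_statuses, serve_statuses))
```

In the same reconcile round the operator logs `"shouldUpdateServe": false` because the current Serve config matches the cached one.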

The error message says "If you observe this error consistently, please check https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayservice-troubleshooting.md for more details", but that link has since changed.
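The `connection refused` errors in the log suggest that nothing was accepting TCP connections on the dashboard agent port (52365) at those moments. One way to check this is a plain TCP probe, sketched below. It assumes the script runs somewhere the head service's DNS name resolves, such as a Pod in the same cluster; the helper name `port_is_open` is hypothetical.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and DNS resolution failures.
        return False


# Example call, using the host and port that appear in the operator log:
# port_is_open("chatglm3-raycluster-hl5jb-head-svc.bps.svc.cluster.local", 52365)
```

A `False` result only shows that the port is unreachable from where the probe runs. It does not explain why the dashboard agent stopped listening.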

If someone is willing to help me find out why the Serve deployment stayed down until I recreated the RayService, I can provide the RayService CR YAML.

Reproduction script

None

Anything else

No response

Are you willing to submit a PR?