Search before asking
[X] I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
When I deploy the RayCluster below on Kubernetes, the worker pod never finishes initializing and its log repeatedly shows:
Timed out while waiting for GCS to become available
The gcs_server.err file in /tmp/ray/session_latest/logs/ on the head pod is empty.
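To narrow this down, a basic DNS/TCP check can be run from inside the stuck worker pod. This is only a diagnostic sketch using the Python standard library; the host and port are taken from the "Failed to connect to GCS at address ..." line in the worker log further down, so adjust them if your setup differs.

# Quick connectivity check; run with python3 inside the worker pod.
import socket

host = "raycluster-complete-head-svc.sunjian.svc.cluster.local"
port = 6379

# 1. Does the head service name resolve via cluster DNS?
print("resolved:", socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP))

# 2. Can we open a TCP connection to the GCS port?
with socket.create_connection((host, port), timeout=5) as s:
    print("TCP connect to GCS port succeeded:", s.getpeername())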
Reproduction script
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  # A unique identifier for the head node and workers of this cluster.
  name: raycluster-complete
spec:
  rayVersion: '2.38.0'
  # Ray head pod template
  headGroupSpec:
    serviceType: ClusterIP
    rayStartParams:
      dashboard-host: '0.0.0.0'
    # pod template
    template:
      metadata:
        labels: {}
      spec:
        containers:
          - name: ray-head
            image: harbor.unijn.cn/sunjian/ray
            resources:
              limits:
                cpu: 10
                memory: 20Gi
              requests:
                cpu: 10
                memory: 20Gi
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
            lifecycle:
              preStop:
                exec:
                  command: ["/bin/sh","-c","ray stop"]
            volumeMounts:
              - mountPath: /fs/nlp/sunjian
                name: ray-logs
        volumes:
          - name: ray-logs
            emptyDir: {}
  workerGroupSpecs:
    # the pod replicas in this group typed worker
    - replicas: 1
      minReplicas: 1
      maxReplicas: 4
      # logical group name, for this called large-group, also can be functional
      groupName: large-group
      rayStartParams: {}
      # pod template
      template:
        spec:
          containers:
            - name: ray-worker
              image: harbor.unijn.cn/sunjian/ray
              resources:
                limits:
                  cpu: 10
                  nvidia.com/gpu: 4
                  memory: 20Gi
                requests:
                  cpu: 10
                  nvidia.com/gpu: 4
                  memory: 20Gi
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh","-c","ray stop"]
              # use volumeMounts.Optional.
              # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
              volumeMounts:
                - mountPath: /fs/nlp/sunjian
                  name: ray-logs
          # use volumes
          # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
          volumes:
            - name: ray-logs
              emptyDir: {}
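As a sanity check on the manifest above, it may also be worth confirming that the operator actually created the head Service with the GCS port exposed. A rough sketch with the official kubernetes Python client (the client library, the service name, and the sunjian namespace are assumptions inferred from the service FQDN in the worker log below):

# Sketch: verify the head Service exists and exposes port 6379 (gcs).
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

svc = v1.read_namespaced_service(
    name="raycluster-complete-head-svc",  # <cluster name>-head-svc
    namespace="sunjian",                  # inferred from the FQDN in the logs
)
for p in svc.spec.ports:
    print(p.name, p.port, "->", p.target_port)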
Anything else
Worker pod log (while stuck in init):
67 seconds elapsed: Waiting for GCS to be ready.
134 seconds elapsed: Waiting for GCS to be ready.
[2024-11-14 09:10:17,717 E 208 208] gcs_rpc_client.h:179: Failed to connect to GCS at address raycluster-complete-head-svc.sunjian.svc.cluster.local:6379 within 5 seconds.
[2024-11-14 09:10:51,721 W 208 208] gcs_client.cc:178: Failed to get cluster ID from GCS server: TimedOut: Timed out while waiting for GCS to become available.
Traceback (most recent call last):
File "/usr/local/bin/ray", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/ray/scripts/scripts.py", line 2614, in main
return cli()
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/scripts/scripts.py", line 2377, in healthcheck
gcs_client = ray._raylet.GcsClient(address=address)
File "python/ray/_raylet.pyx", line 2705, in ray._raylet.GcsClient.__cinit__
File "python/ray/includes/gcs_client.pxi", line 64, in ray._raylet.NewGcsClient.standalone
File "python/ray/includes/common.pxi", line 114, in ray._raylet.check_status_timeout_as_rpc_error
ray.exceptions.RpcError: Timed out while waiting for GCS to become available.
203 seconds elapsed: Still waiting for GCS to be ready. For troubleshooting, refer to the FAQ at https://github.com/ray-project/kuberay/blob/master/docs/guidance/FAQ.md.
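The traceback above comes from the health check entry point in ray/scripts/scripts.py. The same failure can be reproduced in isolation from a Python shell in the worker pod with just the GCS client constructor shown in the stack trace (a minimal sketch; the address is the one from the error message):

# Sketch: reproduce only the failing call from the traceback above.
import ray._raylet

address = "raycluster-complete-head-svc.sunjian.svc.cluster.local:6379"

# Constructing the GcsClient is the step that raises
# ray.exceptions.RpcError in the traceback.
gcs_client = ray._raylet.GcsClient(address=address)
print("GcsClient constructed successfully for", address)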
Head pod gcs_server.out:
[2024-11-14 17:16:38,071 I 75 75] (gcs_server) gcs_server_main.cc:52: Ray cluster metadata ray_version=2.38.0 ray_commit=385ee466260ef3cd218d5e372aef5d39338b7b94
[2024-11-14 17:16:38,074 I 75 75] (gcs_server) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2024-11-14 17:16:38,078 I 75 75] (gcs_server) event.cc:491: Ray Event initialized for GCS
[2024-11-14 17:16:38,078 I 75 75] (gcs_server) event.cc:491: Ray Event initialized for EXPORT_NODE
[2024-11-14 17:16:38,078 I 75 75] (gcs_server) event.cc:491: Ray Event initialized for EXPORT_ACTOR
[2024-11-14 17:16:38,078 I 75 75] (gcs_server) event.cc:491: Ray Event initialized for EXPORT_DRIVER_JOB
[2024-11-14 17:16:38,078 I 75 75] (gcs_server) event.cc:322: Set ray event level to warning
[2024-11-14 17:16:38,080 I 75 75] (gcs_server) gcs_server.cc:76: GCS storage type is StorageType::IN_MEMORY
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:42: Loading job table data.
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:54: Loading node table data.
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:80: Loading actor table data.
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:93: Loading actor task spec table data.
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:66: Loading placement group table data.
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:46: Finished loading job table data, size = 0
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:58: Finished loading node table data, size = 0
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:84: Finished loading actor table data, size = 0
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:97: Finished loading actor task spec table data, size = 0
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:71: Finished loading placement group table data, size = 0
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_server.cc:164: No existing server cluster ID found. Generating new ID: 4f7b90abd98f7cdc6d1076bfa86d2f5f978300a83f585ace064062bf
[2024-11-14 17:16:38,083 I 75 75] (gcs_server) gcs_server.cc:645: Autoscaler V2 enabled: 0
[2024-11-14 17:16:38,087 I 75 75] (gcs_server) grpc_server.cc:134: GcsServer server started, listening on port 6379.
[2024-11-14 17:16:38,311 I 75 75] (gcs_server) gcs_server.cc:247: Gcs Debug state:
GcsNodeManager:
- RegisterNode request count: 0
- DrainNode request count: 0
- GetAllNodeInfo request count: 0
GcsActorManager:
- RegisterActor request count: 0
- CreateActor request count: 0
- GetActorInfo request count: 0
- GetNamedActorInfo request count: 0
- GetAllActorInfo request count: 0
- KillActor request count: 0
- ListNamedActors request count: 0
- Registered actors count: 0
- Destroyed actors count: 0
- Named actors count: 0
- Unresolved actors count: 0
- Pending actors count: 0
- Created actors count: 0
- owners_: 0
- actor_to_register_callbacks_: 0
- actor_to_restart_callbacks_: 0
- actor_to_create_callbacks_: 0
- sorted_destroyed_actor_list_: 0
GcsResourceManager:
- GetAllAvailableResources request count: 0
- GetAllTotalResources request count: 0
- GetAllResourceUsage request count: 0
GcsPlacementGroupManager:
- CreatePlacementGroup request count: 0
- RemovePlacementGroup request count: 0
- GetPlacementGroup request count: 0
- GetAllPlacementGroup request count: 0
- WaitPlacementGroupUntilReady request count: 0
- GetNamedPlacementGroup request count: 0
- Scheduling pending placement group count: 0
- Registered placement groups count: 0
- Named placement group count: 0
- Pending placement groups count: 0
- Infeasible placement groups count: 0
Publisher:
[runtime env manager] ID to URIs table:
[runtime env manager] URIs reference table:
GcsTaskManager:
-Total num task events reported: 0
-Total num status task events dropped: 0
-Total num profile events dropped: 0
-Current num of task events stored: 0
-Total num of actor creation tasks: 0
-Total num of actor tasks: 0
-Total num of normal tasks: 0
-Total num of driver tasks: 0
[2024-11-14 17:16:38,311 I 75 75] (gcs_server) gcs_server.cc:843: main_service_ Event stats:
Global stats: 23 total (6 active)
Queueing time: mean = 88.973 ms, max = 228.266 ms, min = 5.591 us, total = 2.046 s
Execution time: mean = 9.936 ms, total = 228.523 ms
Event stats:
GcsInMemoryStore.Put - 9 total (0 active), Execution time: mean = 25.370 ms, total = 228.326 ms, Queueing time: mean = 176.614 ms, max = 227.521 ms, min = 5.591 us, total = 1.590 s
GcsInMemoryStore.GetAll - 5 total (0 active), Execution time: mean = 25.400 us, total = 126.999 us, Queueing time: mean = 101.508 us, max = 105.025 us, min = 95.198 us, total = 507.540 us
PeriodicalRunner.RunFnPeriodically - 4 total (2 active, 1 running), Execution time: mean = 1.814 us, total = 7.256 us, Queueing time: mean = 114.083 ms, max = 228.266 ms, min = 228.064 ms, total = 456.330 ms
NodeInfoGcsService.grpc_server.GetClusterId - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
RayletLoadPulled - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
GcsInMemoryStore.Get - 1 total (0 active), Execution time: mean = 62.770 us, total = 62.770 us, Queueing time: mean = 10.564 us, max = 10.564 us, min = 10.564 us, total = 10.564 us
NodeInfoGcsService.grpc_server.GetClusterId.HandleRequestImpl - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[2024-11-14 17:16:38,311 I 75 75] (gcs_server) gcs_server.cc:845: pubsub_io_context_ Event stats:
Global stats: 2 total (1 active)
Queueing time: mean = 10.771 us, max = 21.543 us, min = 21.543 us, total = 21.543 us
Execution time: mean = 85.118 us, total = 170.237 us
Event stats:
Publisher.CheckDeadSubscribers - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
PeriodicalRunner.RunFnPeriodically - 1 total (0 active), Execution time: mean = 170.237 us, total = 170.237 us, Queueing time: mean = 21.543 us, max = 21.543 us, min = 21.543 us, total = 21.543 us
[2024-11-14 17:16:38,311 I 75 75] (gcs_server) gcs_server.cc:847: task_io_context_ Event stats:
Global stats: 2 total (1 active)
Queueing time: mean = 8.081 us, max = 16.161 us, min = 16.161 us, total = 16.161 us
Execution time: mean = 46.675 us, total = 93.350 us
Event stats:
PeriodicalRunner.RunFnPeriodically - 1 total (0 active), Execution time: mean = 93.350 us, total = 93.350 us, Queueing time: mean = 16.161 us, max = 16.161 us, min = 16.161 us, total = 16.161 us
GcsTaskManager.GcJobSummary - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[2024-11-14 17:16:38,311 I 75 75] (gcs_server) gcs_server.cc:849: ray_syncer_io_context_ Event stats:
Global stats: 2 total (0 active)
Queueing time: mean = 27.348 us, max = 32.609 us, min = 22.087 us, total = 54.696 us
Execution time: mean = 4.326 us, total = 8.652 us
Event stats:
RaySyncerRegister - 2 total (0 active), Execution time: mean = 4.326 us, total = 8.652 us, Queueing time: mean = 27.348 us, max = 32.609 us, min = 22.087 us, total = 54.696 us
[2024-11-14 17:16:39,538 I 75 75] (gcs_server) gcs_node_manager.cc:85: Registering node info, address = 10.233.70.150, node name = 10.233.70.150 node_id=c429e1f1a07b5244e103af32116ac9528307ec62f91a13a56b3b9625
[2024-11-14 17:16:39,539 I 75 75] (gcs_server) gcs_node_manager.cc:91: Finished registering node info, address = 10.233.70.150, node name = 10.233.70.150 node_id=c429e1f1a07b5244e103af32116ac9528307ec62f91a13a56b3b9625
[2024-11-14 17:16:39,539 I 75 75] (gcs_server) gcs_placement_group_manager.cc:819: A new node: c429e1f1a07b5244e103af32116ac9528307ec62f91a13a56b3b9625 registered, will try to reschedule all the infeasible placement groups.
[2024-11-14 17:16:39,547 I 75 151] (gcs_server) ray_syncer.cc:377: Get connection node_id=c429e1f1a07b5244e103af32116ac9528307ec62f91a13a56b3b9625
Are you willing to submit a PR?