ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0
1.3k stars 415 forks source link

[Bug] ray.exceptions.RpcError: Timed out while waiting for GCS to become available. #2540

Open BeerTai opened 1 week ago

BeerTai commented 1 week ago

Search before asking

KubeRay Component

ray-operator

What happened + What you expected to happen

When I execute ray-cluster in k8s, my worker pod keeps initializing, the log shows: Timed out while waiting for GCS to become available log gcs_server.err in /tmp/ray/session_latest/logs/ is empty.

Reproduction script

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
    # A unique identifier for the head node and workers of this cluster.
  name: raycluster-complete
spec:
  rayVersion: '2.38.0'
  # Ray head pod template
  headGroupSpec:
    serviceType: ClusterIP
    rayStartParams:
      dashboard-host: '0.0.0.0'
    # pod template
    template:
      metadata:
        labels: {}
      spec:
        containers:
        - name: ray-head
          image: harbor.unijn.cn/sunjian/ray
          resources:
            limits:
              cpu: 10
              memory: 20Gi
            requests:
              cpu: 10
              memory: 20Gi
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          volumeMounts:
            - mountPath: /fs/nlp/sunjian
              name: ray-logs
        volumes:
          - name: ray-logs
            emptyDir: {}
  workerGroupSpecs:
  # the pod replicas in this group typed worker
  - replicas: 1
    minReplicas: 1
    maxReplicas: 4
    # logical group name, for this called large-group, also can be functional
    groupName: large-group
    rayStartParams: {}
    #pod template
    template:
      spec:
        containers:
        - name: ray-worker
          image: harbor.unijn.cn/sunjian/ray
          resources:
            limits:
              cpu: 10
              nvidia.com/gpu: 4
              memory: 20Gi 
            requests:
              cpu: 10
              nvidia.com/gpu: 4 
              memory: 20Gi 
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          # use volumeMounts.Optional.
          # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
          volumeMounts:
            - mountPath:  /fs/nlp/sunjian
              name: ray-logs
        # use volumes
        # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
        volumes:
          - name: ray-logs
            emptyDir: {}
######################status#################################

Anything else

67 seconds elapsed: Waiting for GCS to be ready.
134 seconds elapsed: Waiting for GCS to be ready.
[2024-11-14 09:10:17,717 E 208 208] gcs_rpc_client.h:179: Failed to connect to GCS at address raycluster-complete-head-svc.sunjian.svc.cluster.local:6379 within 5 seconds.
[2024-11-14 09:10:51,721 W 208 208] gcs_client.cc:178: Failed to get cluster ID from GCS server: TimedOut: Timed out while waiting for GCS to become available.
Traceback (most recent call last):
  File "/usr/local/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/ray/scripts/scripts.py", line 2614, in main
    return cli()
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/scripts/scripts.py", line 2377, in healthcheck
    gcs_client = ray._raylet.GcsClient(address=address)
  File "python/ray/_raylet.pyx", line 2705, in ray._raylet.GcsClient.__cinit__
  File "python/ray/includes/gcs_client.pxi", line 64, in ray._raylet.NewGcsClient.standalone
  File "python/ray/includes/common.pxi", line 114, in ray._raylet.check_status_timeout_as_rpc_error
ray.exceptions.RpcError: Timed out while waiting for GCS to become available.
203 seconds elapsed: Still waiting for GCS to be ready. For troubleshooting, refer to the FAQ at https://github.com/ray-project/kuberay/blob/master/docs/guidance/FAQ.md.

gcs_server.out:

[2024-11-14 17:16:38,071 I 75 75] (gcs_server) gcs_server_main.cc:52: Ray cluster metadata ray_version=2.38.0 ray_commit=385ee466260ef3cd218d5e372aef5d39338b7b94
[2024-11-14 17:16:38,074 I 75 75] (gcs_server) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2024-11-14 17:16:38,078 I 75 75] (gcs_server) event.cc:491: Ray Event initialized for GCS
[2024-11-14 17:16:38,078 I 75 75] (gcs_server) event.cc:491: Ray Event initialized for EXPORT_NODE
[2024-11-14 17:16:38,078 I 75 75] (gcs_server) event.cc:491: Ray Event initialized for EXPORT_ACTOR
[2024-11-14 17:16:38,078 I 75 75] (gcs_server) event.cc:491: Ray Event initialized for EXPORT_DRIVER_JOB
[2024-11-14 17:16:38,078 I 75 75] (gcs_server) event.cc:322: Set ray event level to warning
[2024-11-14 17:16:38,080 I 75 75] (gcs_server) gcs_server.cc:76: GCS storage type is StorageType::IN_MEMORY
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:42: Loading job table data.
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:54: Loading node table data.
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:80: Loading actor table data.
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:93: Loading actor task spec table data.
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:66: Loading placement group table data.
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:46: Finished loading job table data, size = 0
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:58: Finished loading node table data, size = 0
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:84: Finished loading actor table data, size = 0
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:97: Finished loading actor task spec table data, size = 0
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_init_data.cc:71: Finished loading placement group table data, size = 0
[2024-11-14 17:16:38,082 I 75 75] (gcs_server) gcs_server.cc:164: No existing server cluster ID found. Generating new ID: 4f7b90abd98f7cdc6d1076bfa86d2f5f978300a83f585ace064062bf
[2024-11-14 17:16:38,083 I 75 75] (gcs_server) gcs_server.cc:645: Autoscaler V2 enabled: 0
[2024-11-14 17:16:38,087 I 75 75] (gcs_server) grpc_server.cc:134: GcsServer server started, listening on port 6379.
[2024-11-14 17:16:38,311 I 75 75] (gcs_server) gcs_server.cc:247: Gcs Debug state:

GcsNodeManager:
- RegisterNode request count: 0
- DrainNode request count: 0
- GetAllNodeInfo request count: 0

GcsActorManager:
- RegisterActor request count: 0
- CreateActor request count: 0
- GetActorInfo request count: 0
- GetNamedActorInfo request count: 0
- GetAllActorInfo request count: 0
- KillActor request count: 0
- ListNamedActors request count: 0
- Registered actors count: 0
- Destroyed actors count: 0
- Named actors count: 0
- Unresolved actors count: 0
- Pending actors count: 0
- Created actors count: 0
- owners_: 0
- actor_to_register_callbacks_: 0
- actor_to_restart_callbacks_: 0
- actor_to_create_callbacks_: 0
- sorted_destroyed_actor_list_: 0

GcsResourceManager:
- GetAllAvailableResources request count: 0
- GetAllTotalResources request count: 0
- GetAllResourceUsage request count: 0

GcsPlacementGroupManager:
- CreatePlacementGroup request count: 0
- RemovePlacementGroup request count: 0
- GetPlacementGroup request count: 0
- GetAllPlacementGroup request count: 0
- WaitPlacementGroupUntilReady request count: 0
- GetNamedPlacementGroup request count: 0
- Scheduling pending placement group count: 0
- Registered placement groups count: 0
- Named placement group count: 0
- Pending placement groups count: 0
- Infeasible placement groups count: 0

Publisher:

[runtime env manager] ID to URIs table:
[runtime env manager] URIs reference table:

GcsTaskManager:
-Total num task events reported: 0
-Total num status task events dropped: 0
-Total num profile events dropped: 0
-Current num of task events stored: 0
-Total num of actor creation tasks: 0
-Total num of actor tasks: 0
-Total num of normal tasks: 0
-Total num of driver tasks: 0

[2024-11-14 17:16:38,311 I 75 75] (gcs_server) gcs_server.cc:843: main_service_ Event stats:

Global stats: 23 total (6 active)
Queueing time: mean = 88.973 ms, max = 228.266 ms, min = 5.591 us, total = 2.046 s
Execution time:  mean = 9.936 ms, total = 228.523 ms
Event stats:
        GcsInMemoryStore.Put - 9 total (0 active), Execution time: mean = 25.370 ms, total = 228.326 ms, Queueing time: mean = 176.614 ms, max = 227.521 ms, min = 5.591 us, total = 1.590 s
        GcsInMemoryStore.GetAll - 5 total (0 active), Execution time: mean = 25.400 us, total = 126.999 us, Queueing time: mean = 101.508 us, max = 105.025 us, min = 95.198 us, total = 507.540 us
        PeriodicalRunner.RunFnPeriodically - 4 total (2 active, 1 running), Execution time: mean = 1.814 us, total = 7.256 us, Queueing time: mean = 114.083 ms, max = 228.266 ms, min = 228.064 ms, total = 456.330 ms
        NodeInfoGcsService.grpc_server.GetClusterId - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
        RayletLoadPulled - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
        GcsInMemoryStore.Get - 1 total (0 active), Execution time: mean = 62.770 us, total = 62.770 us, Queueing time: mean = 10.564 us, max = 10.564 us, min = 10.564 us, total = 10.564 us
        NodeInfoGcsService.grpc_server.GetClusterId.HandleRequestImpl - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
        ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s

[2024-11-14 17:16:38,311 I 75 75] (gcs_server) gcs_server.cc:845: pubsub_io_context_ Event stats:

Global stats: 2 total (1 active)
Queueing time: mean = 10.771 us, max = 21.543 us, min = 21.543 us, total = 21.543 us
Execution time:  mean = 85.118 us, total = 170.237 us
Event stats:
        Publisher.CheckDeadSubscribers - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
        PeriodicalRunner.RunFnPeriodically - 1 total (0 active), Execution time: mean = 170.237 us, total = 170.237 us, Queueing time: mean = 21.543 us, max = 21.543 us, min = 21.543 us, total = 21.543 us

[2024-11-14 17:16:38,311 I 75 75] (gcs_server) gcs_server.cc:847: task_io_context_ Event stats:

Global stats: 2 total (1 active)
Queueing time: mean = 8.081 us, max = 16.161 us, min = 16.161 us, total = 16.161 us
Execution time:  mean = 46.675 us, total = 93.350 us
Event stats:
        PeriodicalRunner.RunFnPeriodically - 1 total (0 active), Execution time: mean = 93.350 us, total = 93.350 us, Queueing time: mean = 16.161 us, max = 16.161 us, min = 16.161 us, total = 16.161 us
        GcsTaskManager.GcJobSummary - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s

[2024-11-14 17:16:38,311 I 75 75] (gcs_server) gcs_server.cc:849: ray_syncer_io_context_ Event stats:

Global stats: 2 total (0 active)
Queueing time: mean = 27.348 us, max = 32.609 us, min = 22.087 us, total = 54.696 us
Execution time:  mean = 4.326 us, total = 8.652 us
Event stats:
        RaySyncerRegister - 2 total (0 active), Execution time: mean = 4.326 us, total = 8.652 us, Queueing time: mean = 27.348 us, max = 32.609 us, min = 22.087 us, total = 54.696 us

[2024-11-14 17:16:39,538 I 75 75] (gcs_server) gcs_node_manager.cc:85: Registering node info, address = 10.233.70.150, node name = 10.233.70.150 node_id=c429e1f1a07b5244e103af32116ac9528307ec62f91a13a56b3b9625
[2024-11-14 17:16:39,539 I 75 75] (gcs_server) gcs_node_manager.cc:91: Finished registering node info, address = 10.233.70.150, node name = 10.233.70.150 node_id=c429e1f1a07b5244e103af32116ac9528307ec62f91a13a56b3b9625
[2024-11-14 17:16:39,539 I 75 75] (gcs_server) gcs_placement_group_manager.cc:819: A new node: c429e1f1a07b5244e103af32116ac9528307ec62f91a13a56b3b9625 registered, will try to reschedule all the infeasible placement groups.
[2024-11-14 17:16:39,547 I 75 151] (gcs_server) ray_syncer.cc:377: Get connection node_id=c429e1f1a07b5244e103af32116ac9528307ec62f91a13a56b3b9625

Are you willing to submit a PR?

beccabai commented 2 days ago

I meet similar problem. Did you solve it?