ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] Job submit failed with actor died #1744

Open SimonCqk opened 9 months ago

SimonCqk commented 9 months ago

Search before asking

KubeRay Component

ray-operator

What happened + What you expected to happen

I launched a simple RayJob instance in a standard Alibaba Cloud Kubernetes cluster, where the head/worker Pods and Services were created as expected. However, I encountered the following intermittent error when submitting the job (running straightforward example code):

(screenshot: intermittent error returned by the job submission)

Inside the head node, I found the following error stack trace in the corresponding session directory.

(screenshot: error stack trace from the session log directory)

Reproduction script

(screenshot: reproduction script)
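
The reproduction script is only visible in the screenshot. For reference, a minimal script of the kind typically mounted as sample_code.py from the ray-job-code-sample ConfigMap would look roughly like this (a hedged sketch, not necessarily the exact code used here):

# sample_code.py -- hypothetical sketch based on the standard KubeRay sample
import ray

ray.init()


@ray.remote
class Counter:
    def __init__(self):
        self.counter = 0

    def inc(self):
        self.counter += 1

    def get_counter(self):
        return self.counter


counter = Counter.remote()
for _ in range(5):
    ray.get(counter.inc.remote())
    print(ray.get(counter.get_counter.remote()))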

Anything else

No response

Are you willing to submit a PR?

kevin85421 commented 9 months ago

cc @architkulkarni

architkulkarni commented 9 months ago

@SimonCqk can you share the rayjob spec?

architkulkarni commented 9 months ago

Does the issue happen on multi-node clusters or does it happen on single-node clusters too?

SimonCqk commented 9 months ago

@architkulkarni Hi, the RayJob spec is shown below; the Docker registry has been redacted since it is an internal address.

apiVersion: ray.io/v1
kind: RayJob
metadata:
  finalizers:
  - ray.io/rayjob-finalizer
  generation: 2
  name: rayjob-test
  namespace: damo-cv
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  rayClusterSpec:
    headGroupSpec:
      rayStartParams:
        dashboard-host: 0.0.0.0
      template:
        metadata: {}
        spec:
          containers:
          - image: xxxx/cvl/ray:2.8.0-py310-cu118
            name: ray-head
            ports:
            - containerPort: 6379
              name: gcs-server
              protocol: TCP
            - containerPort: 8265
              name: dashboard
              protocol: TCP
            - containerPort: 10001
              name: client
              protocol: TCP
            resources:
              limits:
                cpu: "4"
                memory: 8G
              requests:
                cpu: "4"
                memory: 8G
            volumeMounts:
            - mountPath: /home/ray/samples
              name: code-sample
            value: sigma_public
          volumes:
          - configMap:
              items:
              - key: sample_code.py
                path: sample_code.py
              name: ray-job-code-sample
            name: code-sample
    rayVersion: 2.8.0
    workerGroupSpecs:
    - groupName: small-group
      maxReplicas: 5
      minReplicas: 1
      rayStartParams: {}
      replicas: 1
      scaleStrategy: {}
      template:
        metadata: {}
        spec:
          containers:
          - image: xxxx/cvl/ray:2.8.0-py310-cu118
            lifecycle:
              preStop:
                exec:
                  command:
                  - /bin/sh
                  - -c
                  - ray stop
            name: ray-worker
            resources:
              limits:
                cpu: "4"
                memory: 8G
              requests:
                cpu: "4"
                memory: 8G
  submitterPodTemplate:
    metadata: {}
    spec:
      containers:
      - args:
        - "while true; do\n  if ray health-check --address ${RAYCLUSTER_SVC_NAME}.damo-cv.svc.cluster.local:6379;
          then\n    echo \"GCS is ready.\"\n    break\n  fi\n  echo \"Waiting for
          GCS to be ready.\"\n  sleep 5    \ndone\n\nray job submit --address http://${RAYCLUSTER_SVC_NAME}.damo-cv.svc.cluster.local:8265
          -- python /home/ray/samples/sample_code.py\n\nwhile true; echo \"hello\";
          do sleep 30; done;\n"
        command:
        - /bin/sh
        - -c
        image: xxxx/cvl/ray:2.8.0-py310-cu118
        name: rayjob-submitter
        resources: {}
      restartPolicy: Never
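
For completeness, the same submission can also be driven and inspected from Python via the Ray Job Submission SDK against the dashboard endpoint the submitter container targets (a hedged sketch, not the exact tooling used here; it assumes RAYCLUSTER_SVC_NAME is set in the environment, as it is for the submitter pod):

import os

from ray.job_submission import JobSubmissionClient

# Dashboard address built the same way as in the submitterPodTemplate above.
dashboard = f"http://{os.environ['RAYCLUSTER_SVC_NAME']}.damo-cv.svc.cluster.local:8265"
client = JobSubmissionClient(dashboard)

job_id = client.submit_job(entrypoint="python /home/ray/samples/sample_code.py")
print("submitted:", job_id)
print("status:", client.get_job_status(job_id))
# On failure, the JobSupervisor creation error reported in this issue shows up here:
print(client.get_job_logs(job_id))
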
SimonCqk commented 9 months ago

Does the issue happen on multi-node clusters or does it happen on single-node clusters too?

The issue occurs in our production environment on multi-node clusters at large scale.

SimonCqk commented 9 months ago

@architkulkarni Hi, any update on this issue?

rickyyx commented 9 months ago

@SimonCqk - would it be possible for you to provide the head node's logs? In particular gcs_server.out and gcs_server.err, if any.

It would be nice to see dashboard.log and raylet.out/err as well.

If it's not too large, a zip of all the Ray log files would be appreciated!

By looking at just the error, it seems that the Job Manager fails to connect to GCS. If GCS on the head node is operating normally, it might be a network issue from the job manager's node.

  • Do you know where the job manager is run, and if that node has any issues?

architkulkarni commented 9 months ago

Thanks, and just to add on -- although the log comes from job_manager.py, the failure is coming from a JobSupervisor Ray actor (there's one of these per Ray job). In the first screenshot, the IP address of the actor is listed, which might be helpful in locating the node.
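
If it helps with locating that node, the JobSupervisor actors and the nodes hosting them can also be listed with the Ray state API from inside the cluster (a rough sketch; assumes ray.util.state is available in this Ray version):

# Hypothetical sketch: list JobSupervisor actors and where they were scheduled.
from ray.util.state import list_actors

for actor in list_actors(filters=[("class_name", "=", "JobSupervisor")], detail=True):
    # Each record includes actor_id, state, pid and the node_id of the hosting node.
    print(actor)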

v4if commented 9 months ago

@SimonCqk - would it be possible for you to provide the head node's logs? In particular gcs_server.out and gcs_server.err, if any.

It would be nice to see dashboard.log and raylet.out/err as well.

If it's not too large, a zip of all the Ray log files would be appreciated!

By looking at just the error, it seems that the Job Manager fails to connect to GCS. If GCS on the head node is operating normally, it might be a network issue from the job manager's node.

  • Do you know where the job manager is run, and if that node has any issues?

See if these help.

dashboard.log:

2023-12-07 18:22:08,056 INFO web_log.py:206 -- 33.4.129.27 [08/Dec/2023:02:22:08 +0000] 'GET /api/jobs/rayjob-test-p967g HTTP/1.1' 404 195 bytes 2293 us '-' 'Go-http-client/1.1'
2023-12-07 18:22:11,069 INFO web_log.py:206 -- 33.4.129.27 [08/Dec/2023:02:22:11 +0000] 'GET /api/jobs/rayjob-test-p967g HTTP/1.1' 404 195 bytes 2681 us '-' 'Go-http-client/1.1'
2023-12-07 18:22:13,968 ERROR web_protocol.py:403 -- Error handling request
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aiohttp/web_protocol.py", line 433, in _handle_request
    resp = await request_handler(request)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aiohttp/web_app.py", line 504, in _handle
    resp = await handler(request)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aiohttp/web_middlewares.py", line 117, in impl
    return await handler(request)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/http_server_head.py", line 150, in metrics_middleware
    status_tag = f"{floor(response.status / 100)}xx"
AttributeError: 'NoneType' object has no attribute 'status'
2023-12-07 18:22:13,983 INFO web_log.py:206 -- 33.6.146.66 [08/Dec/2023:02:22:13 +0000] 'GET /api/jobs/raysubmit_2siN5swbQYgJ7N8N HTTP/1.1' 200 1530 bytes 2519 us '-' 'python-requests/2.31.0'

gcs_server.out:

[2023-12-07 18:22:01,579 I 44 44] (gcs_server) gcs_actor_scheduler.cc:447: Start creating actor 5756071e2fcde3da8df5881901000000 on worker c1eb7e136bf13cfbd7170a65343e3528014a4ec79a9aa81be4425f7a at node 8f5adc209e27065f1aaf9d21d4378038c9590c9d850bc955f4dc28f4, job id = 01000000
[2023-12-07 18:22:13,893 I 44 44] (gcs_server) gcs_actor_scheduler.cc:484: Finished actor creation task for actor 5756071e2fcde3da8df5881901000000 on worker c1eb7e136bf13cfbd7170a65343e3528014a4ec79a9aa81be4425f7a at node 8f5adc209e27065f1aaf9d21d4378038c9590c9d850bc955f4dc28f4, job id = 01000000
[2023-12-07 18:22:13,893 I 44 44] (gcs_server) gcs_actor_manager.cc:1245: Failed to create an actor due to the application failure, actor id = 5756071e2fcde3da8df5881901000000, job id = 01000000
[2023-12-07 18:22:13,893 I 44 44] (gcs_server) gcs_actor_manager.cc:294: Finished creating actor, job id = 01000000, actor id = 5756071e2fcde3da8df5881901000000, status = CreationTaskError: CreationTaskError: Exception raised from an actor init method. Traceback: The actor died because of an error raised in its creation task, ray::_ray_internal_job_actor_raysubmit_2siN5swbQYgJ7N8N:JobSupervisor.__init__() (pid=264, ip=33.60.76.44, actor_id=5756071e2fcde3da8df5881901000000, repr=<ray.dashboard.modules.job.job_manager.JobSupervisor object at 0x7ff1941ef250>)
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_manager.py", line 162, in __init__
    gcs_aio_client = GcsAioClient(address=gcs_address)
ray.exceptions.RaySystemError: System error: RPC Error message: failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.251.238:6379: tcp handshaker shutdown; RPC Error details:
[2023-12-07 18:22:13,895 W 44 44] (gcs_server) gcs_worker_manager.cc:55: Reporting worker exit, worker id = c1eb7e136bf13cfbd7170a65343e3528014a4ec79a9aa81be4425f7a, node id = ffffffffffffffffffffffffffffffffffffffffffffffffffffffff, address = , exit_type = USER_ERROR, exit_detail = Worker exits because there was an exception in the initialization method (e.g., __init__). Fix the exceptions from the initialization to resolve the issue. Exception raised from an actor init method. Traceback: The actor died because of an error raised in its creation task, ray::_ray_internal_job_actor_raysubmit_2siN5swbQYgJ7N8N:JobSupervisor.__init__() (pid=264, ip=33.60.76.44, actor_id=5756071e2fcde3da8df5881901000000, repr=<ray.dashboard.modules.job.job_manager.JobSupervisor object at 0x7ff1941ef250>)
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_manager.py", line 162, in __init__
    gcs_aio_client = GcsAioClient(address=gcs_address)
ray.exceptions.RaySystemError: System error: RPC Error message: failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.251.238:6379: tcp handshaker shutdown; RPC Error details:. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2023-12-07 18:22:13,895 W 44 44] (gcs_server) gcs_actor_manager.cc:961: Worker c1eb7e136bf13cfbd7170a65343e3528014a4ec79a9aa81be4425f7a on node 8f5adc209e27065f1aaf9d21d4378038c9590c9d850bc955f4dc28f4 exits, type=USER_ERROR, has creation_task_exception = 1
[2023-12-07 18:22:13,895 I 44 44] (gcs_server) gcs_actor_manager.cc:1132: Actor 5756071e2fcde3da8df5881901000000 is failed on worker c1eb7e136bf13cfbd7170a65343e3528014a4ec79a9aa81be4425f7a at node 8f5adc209e27065f1aaf9d21d4378038c9590c9d850bc955f4dc28f4, need_reschedule = 0, death context type = CreationTaskFailureContext, remaining_restarts = 0, job id = 01000000
[2023-12-07 18:22:13,895 I 44 44] (gcs_server) gcs_actor_manager.cc:730: Actor name _ray_internal_job_actor_raysubmit_2siN5swbQYgJ7N8N is cleand up.
[2023-12-07 18:22:13,895 I 44 44] (gcs_server) gcs_actor_manager.cc:807: Destroying actor, actor id = 5756071e2fcde3da8df5881901000000, job id = 01000000
[2023-12-07 18:22:13,899 I 44 44] (gcs_server) gcs_actor_manager.cc:807: Destroying actor, actor id = 5756071e2fcde3da8df5881901000000, job id = 01000000
[2023-12-07 18:22:13,899 I 44 44] (gcs_server) gcs_actor_manager.cc:812: Tried to destroy actor that does not exist 5756071e2fcde3da8df5881901000000

raylet.out:

[2023-12-07 18:22:00,895 I 141 141] (raylet) runtime_env_agent_client.cc:307: Create runtime env for job 01000000
[2023-12-07 18:22:00,897 I 141 141] (raylet) worker_pool.cc:498: Started worker process with pid 264, the token is 0
[2023-12-07 18:22:01,578 I 141 155] (raylet) object_store.cc:35: Object store current usage 8e-09 / 2.31002 GB.
[2023-12-07 18:22:13,894 I 141 141] (raylet) node_manager.cc:1464: NodeManager::DisconnectClient, disconnect_type=2, has creation task exception = true
[2023-12-07 18:22:13,894 I 141 141] (raylet) node_manager.cc:1493: Formatted creation task exception: Traceback (most recent call last):

  File "python/ray/_raylet.pyx", line 1669, in ray._raylet.execute_task

  File "python/ray/_raylet.pyx", line 1769, in ray._raylet.execute_task

  File "python/ray/_raylet.pyx", line 1675, in ray._raylet.execute_task

  File "python/ray/_raylet.pyx", line 1610, in ray._raylet.execute_task.function_executor

  File "python/ray/_raylet.pyx", line 4393, in ray._raylet.CoreWorker.run_async_func_or_coro_in_event_loop

  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()

  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception

  File "python/ray/_raylet.pyx", line 4380, in async_func

  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/async_compat.py", line 42, in wrapper
    return func(*args, **kwargs)

  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/function_manager.py", line 726, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)

  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
    return method(self, *_args, **_kwargs)

  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_manager.py", line 162, in __init__
    gcs_aio_client = GcsAioClient(address=gcs_address)

  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/gcs_aio_client.py", line 64, in __init__
    self._gcs_client = GcsClient(address, nums_reconnect_retry)

  File "python/ray/_raylet.pyx", line 2492, in ray._raylet.GcsClient.__cinit__

  File "python/ray/_raylet.pyx", line 2501, in ray._raylet.GcsClient._connect

  File "python/ray/_raylet.pyx", line 468, in ray._raylet.check_status

ray.exceptions.RaySystemError: System error: RPC Error message: failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.251.238:6379: tcp handshaker shutdown; RPC Error details:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "python/ray/_raylet.pyx", line 2064, in ray._raylet.task_execution_handler

  File "python/ray/_raylet.pyx", line 1960, in ray._raylet.execute_task_with_cancellation_handler

  File "python/ray/_raylet.pyx", line 1617, in ray._raylet.execute_task

  File "python/ray/_raylet.pyx", line 1618, in ray._raylet.execute_task

  File "python/ray/_raylet.pyx", line 1856, in ray._raylet.execute_task

  File "python/ray/_raylet.pyx", line 959, in ray._raylet.store_task_errors

ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::_ray_internal_job_actor_raysubmit_2siN5swbQYgJ7N8N:JobSupervisor.__init__() (pid=264, ip=33.60.76.44, actor_id=5756071e2fcde3da8df5881901000000, repr=<ray.dashboard.modules.job.job_manager.JobSupervisor object at 0x7ff1941ef250>)
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_manager.py", line 162, in __init__
    gcs_aio_client = GcsAioClient(address=gcs_address)
ray.exceptions.RaySystemError: System error: RPC Error message: failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.251.238:6379: tcp handshaker shutdown; RPC Error details:
, worker_id: c1eb7e136bf13cfbd7170a65343e3528014a4ec79a9aa81be4425f7a

rickyyx commented 9 months ago

So it looks like the job manager (which is a Ray actor) failed to connect to GCS at creation time.

Could you check whether ipv4:192.168.251.238:6379 (where GCS lives) is reachable from your other node with IP 33.60.76.44 (the node where the job actor is being created)?

I suspect it might be a networking issue, given that GCS doesn't seem to have crashed.
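
As a minimal check, something like the following, run on the 33.60.76.44 node (or from a pod scheduled on it), would confirm basic TCP reachability of the GCS address from the error:

# Simple TCP probe for the GCS address seen in the error message.
import socket

GCS_HOST, GCS_PORT = "192.168.251.238", 6379

try:
    with socket.create_connection((GCS_HOST, GCS_PORT), timeout=5):
        print(f"TCP connect to {GCS_HOST}:{GCS_PORT} succeeded")
except OSError as exc:
    print(f"TCP connect to {GCS_HOST}:{GCS_PORT} failed: {exc}")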

SimonCqk commented 9 months ago

@SimonCqk - would it be possible for you to provide the head node's logs? In particular gcs_server.out and gcs_server.err, if any.

It would be nice to see dashboard.log and raylet.out/err as well.

If it's not too large, a zip of all the Ray log files would be appreciated!

By looking at just the error, it seems that the Job Manager fails to connect to GCS. If GCS on the head node is operating normally, it might be a network issue from the job manager's node.

  • Do you know where the job manager is run, and if that node has any issues?

Unfortunately, both gcs_server.err and raylet.err are empty. Below are gcs_server.out and raylet.out respectively:

gcs_server.out:

[2023-12-07 18:21:36,695 I 44 44] (gcs_server) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2023-12-07 18:21:36,696 I 44 44] (gcs_server) event.cc:234: Set ray event level to warning
[2023-12-07 18:21:36,696 I 44 44] (gcs_server) event.cc:342: Ray Event initialized for GCS
[2023-12-07 18:21:36,697 I 44 44] (gcs_server) gcs_server.cc:74: GCS storage type is StorageType::IN_MEMORY
[2023-12-07 18:21:36,697 I 44 44] (gcs_server) gcs_init_data.cc:42: Loading job table data.
[2023-12-07 18:21:36,697 I 44 44] (gcs_server) gcs_init_data.cc:54: Loading node table data.
[2023-12-07 18:21:36,697 I 44 44] (gcs_server) gcs_init_data.cc:80: Loading actor table data.
[2023-12-07 18:21:36,697 I 44 44] (gcs_server) gcs_init_data.cc:93: Loading actor task spec table data.
[2023-12-07 18:21:36,697 I 44 44] (gcs_server) gcs_init_data.cc:66: Loading placement group table data.
[2023-12-07 18:21:36,697 I 44 44] (gcs_server) gcs_init_data.cc:46: Finished loading job table data, size = 0
[2023-12-07 18:21:36,697 I 44 44] (gcs_server) gcs_init_data.cc:58: Finished loading node table data, size = 0
[2023-12-07 18:21:36,697 I 44 44] (gcs_server) gcs_init_data.cc:84: Finished loading actor table data, size = 0
[2023-12-07 18:21:36,697 I 44 44] (gcs_server) gcs_init_data.cc:97: Finished loading actor task spec table data, size = 0
[2023-12-07 18:21:36,697 I 44 44] (gcs_server) gcs_init_data.cc:71: Finished loading placement group table data, size = 0
[2023-12-07 18:21:36,697 I 44 44] (gcs_server) gcs_server.cc:164: No existing server cluster ID found. Generating new ID: ea9a37e74d9f4c9d834c7ecd04ec94a02354ac3923e7bfaca3bf660d
[2023-12-07 18:21:36,698 I 44 44] (gcs_server) gcs_server.cc:653: Autoscaler V2 enabled: 0
[2023-12-07 18:21:36,699 I 44 44] (gcs_server) grpc_server.cc:129: GcsServer server started, listening on port 6379.
[2023-12-07 18:21:36,759 I 44 44] (gcs_server) gcs_server.cc:255: GcsNodeManager: 
- RegisterNode request count: 0
- DrainNode request count: 0
- GetAllNodeInfo request count: 0
- GetInternalConfig request count: 0

GcsActorManager: 
- RegisterActor request count: 0
- CreateActor request count: 0
- GetActorInfo request count: 0
- GetNamedActorInfo request count: 0
- GetAllActorInfo request count: 0
- KillActor request count: 0
- ListNamedActors request count: 0
- Registered actors count: 0
- Destroyed actors count: 0
- Named actors count: 0
- Unresolved actors count: 0
- Pending actors count: 0
- Created actors count: 0
- owners_: 0
- actor_to_register_callbacks_: 0
- actor_to_create_callbacks_: 0
- sorted_destroyed_actor_list_: 0

GcsResourceManager: 
- GetResources request count: 0
- GetAllAvailableResources request count0
- ReportResourceUsage request count: 0
- GetAllResourceUsage request count: 0

GcsPlacementGroupManager: 
- CreatePlacementGroup request count: 0
- RemovePlacementGroup request count: 0
- GetPlacementGroup request count: 0
- GetAllPlacementGroup request count: 0
- WaitPlacementGroupUntilReady request count: 0
- GetNamedPlacementGroup request count: 0
- Scheduling pending placement group count: 0
- Registered placement groups count: 0
- Named placement group count: 0
- Pending placement groups count: 0
- Infeasible placement groups count: 0

GcsPublisher {}

[runtime env manager] ID to URIs table:
[runtime env manager] URIs reference table:

GcsTaskManager: 
-Total num task events reported: 0
-Total num status task events dropped: 0
-Total num profile events dropped: 0
-Total num bytes of task event stored: 0MiB
-Current num of task events stored: 0
-Total num of actor creation tasks: 0
-Total num of actor tasks: 0
-Total num of normal tasks: 0
-Total num of driver tasks: 0

[2023-12-07 18:21:36,759 I 44 44] (gcs_server) gcs_server.cc:859: Event stats:

Global stats: 26 total (15 active)
Queueing time: mean = 7.128 ms, max = 61.874 ms, min = 2.195 us, total = 185.327 ms
Execution time:  mean = 2.382 ms, total = 61.944 ms
Event stats:
    InternalKVGcsService.grpc_client.InternalKVPut - 6 total (6 active), CPU time: mean = 0.000 s, total = 0.000 s
    InternalKVGcsService.grpc_server.InternalKVPut - 5 total (5 active), CPU time: mean = 0.000 s, total = 0.000 s
    GcsInMemoryStore.GetAll - 5 total (0 active), CPU time: mean = 6.920 us, total = 34.598 us
    PeriodicalRunner.RunFnPeriodically - 4 total (2 active, 1 running), CPU time: mean = 1.470 us, total = 5.881 us
    GcsInMemoryStore.Put - 3 total (0 active), CPU time: mean = 20.630 ms, total = 61.891 ms
    GcsInMemoryStore.Get - 1 total (0 active), CPU time: mean = 11.791 us, total = 11.791 us
    ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    RayletLoadPulled - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s

[2023-12-07 18:21:36,760 I 44 44] (gcs_server) gcs_server.cc:860: GcsTaskManager Event stats:

Global stats: 0 total (0 active)
Queueing time: mean = -nan s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
Execution time:  mean = -nan s, total = 0.000 s
Event stats:

[2023-12-07 18:21:38,360 I 44 44] (gcs_server) gcs_node_manager.cc:53: Registering node info, node id = 8f5adc209e27065f1aaf9d21d4378038c9590c9d850bc955f4dc28f4, address = 33.60.76.44, node name = 33.60.76.44
[2023-12-07 18:21:38,360 I 44 44] (gcs_server) gcs_node_manager.cc:59: Finished registering node info, node id = 8f5adc209e27065f1aaf9d21d4378038c9590c9d850bc955f4dc28f4, address = 33.60.76.44, node name = 33.60.76.44
[2023-12-07 18:21:38,360 I 44 44] (gcs_server) gcs_placement_group_manager.cc:793: A new node: 8f5adc209e27065f1aaf9d21d4378038c9590c9d850bc955f4dc28f4 registered, will try to reschedule all the infeasible placement groups.
[2023-12-07 18:21:51,095 I 44 44] (gcs_server) gcs_node_manager.cc:53: Registering node info, node id = f7e39c00d132635d71a544e061171cefa527b343f14aebed7318eb0f, address = 33.50.139.160, node name = 33.50.139.160
[2023-12-07 18:21:51,095 I 44 44] (gcs_server) gcs_node_manager.cc:59: Finished registering node info, node id = f7e39c00d132635d71a544e061171cefa527b343f14aebed7318eb0f, address = 33.50.139.160, node name = 33.50.139.160
[2023-12-07 18:21:51,095 I 44 44] (gcs_server) gcs_placement_group_manager.cc:793: A new node: f7e39c00d132635d71a544e061171cefa527b343f14aebed7318eb0f registered, will try to reschedule all the infeasible placement groups.
[2023-12-07 18:22:00,611 I 44 44] (gcs_server) gcs_job_manager.cc:42: Adding job, job id = 01000000, driver pid = 118
[2023-12-07 18:22:00,611 I 44 44] (gcs_server) gcs_job_manager.cc:57: Finished adding job, job id = 01000000, driver pid = 118
[2023-12-07 18:22:00,637 W 44 44] (gcs_server) gcs_actor_manager.cc:458: Actor with name '_ray_internal_job_actor_raysubmit_2siN5swbQYgJ7N8N' was not found.
[2023-12-07 18:22:00,892 I 44 44] (gcs_server) gcs_actor_manager.cc:253: Registering actor, job id = 01000000, actor id = 5756071e2fcde3da8df5881901000000
[2023-12-07 18:22:00,892 I 44 44] (gcs_server) gcs_actor_manager.cc:259: Registered actor, job id = 01000000, actor id = 5756071e2fcde3da8df5881901000000
[2023-12-07 18:22:00,892 I 44 44] (gcs_server) gcs_actor_manager.cc:278: Creating actor, job id = 01000000, actor id = 5756071e2fcde3da8df5881901000000
[2023-12-07 18:22:00,892 I 44 44] (gcs_server) gcs_actor_scheduler.cc:312: Start leasing worker from node f7e39c00d132635d71a544e061171cefa527b343f14aebed7318eb0f for actor 5756071e2fcde3da8df5881901000000, job id = 01000000
[2023-12-07 18:22:00,893 I 44 44] (gcs_server) gcs_actor_scheduler.cc:633: Finished leasing worker from f7e39c00d132635d71a544e061171cefa527b343f14aebed7318eb0f for actor 5756071e2fcde3da8df5881901000000, job id = 01000000
[2023-12-07 18:22:00,893 I 44 44] (gcs_server) gcs_actor_scheduler.cc:312: Start leasing worker from node 8f5adc209e27065f1aaf9d21d4378038c9590c9d850bc955f4dc28f4 for actor 5756071e2fcde3da8df5881901000000, job id = 01000000
[2023-12-07 18:22:01,579 I 44 44] (gcs_server) gcs_actor_scheduler.cc:633: Finished leasing worker from 8f5adc209e27065f1aaf9d21d4378038c9590c9d850bc955f4dc28f4 for actor 5756071e2fcde3da8df5881901000000, job id = 01000000
[2023-12-07 18:22:01,579 I 44 44] (gcs_server) gcs_actor_scheduler.cc:447: Start creating actor 5756071e2fcde3da8df5881901000000 on worker c1eb7e136bf13cfbd7170a65343e3528014a4ec79a9aa81be4425f7a at node 8f5adc209e27065f1aaf9d21d4378038c9590c9d850bc955f4dc28f4, job id = 01000000
[2023-12-07 18:22:13,893 I 44 44] (gcs_server) gcs_actor_scheduler.cc:484: Finished actor creation task for actor 5756071e2fcde3da8df5881901000000 on worker c1eb7e136bf13cfbd7170a65343e3528014a4ec79a9aa81be4425f7a at node 8f5adc209e27065f1aaf9d21d4378038c9590c9d850bc955f4dc28f4, job id = 01000000
[2023-12-07 18:22:13,893 I 44 44] (gcs_server) gcs_actor_manager.cc:1245: Failed to create an actor due to the application failure, actor id = 5756071e2fcde3da8df5881901000000, job id = 01000000
[2023-12-07 18:22:13,893 I 44 44] (gcs_server) gcs_actor_manager.cc:294: Finished creating actor, job id = 01000000, actor id = 5756071e2fcde3da8df5881901000000, status = CreationTaskError: CreationTaskError: Exception raised from an actor init method. Traceback: The actor died because of an error raised in its creation task, ray::_ray_internal_job_actor_raysubmit_2siN5swbQYgJ7N8N:JobSupervisor.__init__() (pid=264, ip=33.60.76.44, actor_id=5756071e2fcde3da8df5881901000000, repr=<ray.dashboard.modules.job.job_manager.JobSupervisor object at 0x7ff1941ef250>)
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_manager.py", line 162, in __init__
    gcs_aio_client = GcsAioClient(address=gcs_address)
ray.exceptions.RaySystemError: System error: RPC Error message: failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.251.238:6379: tcp handshaker shutdown; RPC Error details:
[2023-12-07 18:22:13,895 W 44 44] (gcs_server) gcs_worker_manager.cc:55: Reporting worker exit, worker id = c1eb7e136bf13cfbd7170a65343e3528014a4ec79a9aa81be4425f7a, node id = ffffffffffffffffffffffffffffffffffffffffffffffffffffffff, address = , exit_type = USER_ERROR, exit_detail = Worker exits because there was an exception in the initialization method (e.g., __init__). Fix the exceptions from the initialization to resolve the issue. Exception raised from an actor init method. Traceback: The actor died because of an error raised in its creation task, ray::_ray_internal_job_actor_raysubmit_2siN5swbQYgJ7N8N:JobSupervisor.__init__() (pid=264, ip=33.60.76.44, actor_id=5756071e2fcde3da8df5881901000000, repr=<ray.dashboard.modules.job.job_manager.JobSupervisor object at 0x7ff1941ef250>)
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_manager.py", line 162, in __init__
    gcs_aio_client = GcsAioClient(address=gcs_address)
ray.exceptions.RaySystemError: System error: RPC Error message: failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.251.238:6379: tcp handshaker shutdown; RPC Error details:. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2023-12-07 18:22:13,895 W 44 44] (gcs_server) gcs_actor_manager.cc:961: Worker c1eb7e136bf13cfbd7170a65343e3528014a4ec79a9aa81be4425f7a on node 8f5adc209e27065f1aaf9d21d4378038c9590c9d850bc955f4dc28f4 exits, type=USER_ERROR, has creation_task_exception = 1
[2023-12-07 18:22:13,895 I 44 44] (gcs_server) gcs_actor_manager.cc:1132: Actor 5756071e2fcde3da8df5881901000000 is failed on worker c1eb7e136bf13cfbd7170a65343e3528014a4ec79a9aa81be4425f7a at node 8f5adc209e27065f1aaf9d21d4378038c9590c9d850bc955f4dc28f4, need_reschedule = 0, death context type = CreationTaskFailureContext, remaining_restarts = 0, job id = 01000000
[2023-12-07 18:22:13,895 I 44 44] (gcs_server) gcs_actor_manager.cc:730: Actor name _ray_internal_job_actor_raysubmit_2siN5swbQYgJ7N8N is cleand up.
[2023-12-07 18:22:13,895 I 44 44] (gcs_server) gcs_actor_manager.cc:807: Destroying actor, actor id = 5756071e2fcde3da8df5881901000000, job id = 01000000
[2023-12-07 18:22:13,899 I 44 44] (gcs_server) gcs_actor_manager.cc:807: Destroying actor, actor id = 5756071e2fcde3da8df5881901000000, job id = 01000000
[2023-12-07 18:22:13,899 I 44 44] (gcs_server) gcs_actor_manager.cc:812: Tried to destroy actor that does not exist 5756071e2fcde3da8df5881901000000
[2023-12-07 18:22:36,760 I 44 44] (gcs_server) gcs_server.cc:255: GcsNodeManager: 
- RegisterNode request count: 2
- DrainNode request count: 0
- GetAllNodeInfo request count: 40
- GetInternalConfig request count: 3

GcsActorManager: 
- RegisterActor request count: 1
- CreateActor request count: 1
- GetActorInfo request count: 1
- GetNamedActorInfo request count: 1
- GetAllActorInfo request count: 1
- KillActor request count: 1
- ListNamedActors request count: 0
- Registered actors count: 0
- Destroyed actors count: 1
- Named actors count: 0
- Unresolved actors count: 0
- Pending actors count: 0
- Created actors count: 0
- owners_: 0
- actor_to_register_callbacks_: 0
- actor_to_create_callbacks_: 0
- sorted_destroyed_actor_list_: 1

GcsResourceManager: 
- GetResources request count: 0
- GetAllAvailableResources request count0
- ReportResourceUsage request count: 0
- GetAllResourceUsage request count: 12

GcsPlacementGroupManager: 
- CreatePlacementGroup request count: 0
- RemovePlacementGroup request count: 0
- GetPlacementGroup request count: 0
- GetAllPlacementGroup request count: 0
- WaitPlacementGroupUntilReady request count: 0
- GetNamedPlacementGroup request count: 0
- Scheduling pending placement group count: 0
- Registered placement groups count: 0
- Named placement group count: 0
- Pending placement groups count: 0
- Infeasible placement groups count: 0

GcsPublisher {}

[runtime env manager] ID to URIs table:
[runtime env manager] URIs reference table:

GcsTaskManager: 
-Total num task events reported: 9
-Total num status task events dropped: 0
-Total num profile events dropped: 0
-Total num bytes of task event stored: 0.0290661MiB
-Current num of task events stored: 4
-Total num of actor creation tasks: 1
-Total num of actor tasks: 2
-Total num of normal tasks: 0
-Total num of driver tasks: 1

raylet.out:

[2023-12-07 18:21:38,345 I 141 141] (raylet) main.cc:174: Setting cluster ID to: ea9a37e74d9f4c9d834c7ecd04ec94a02354ac3923e7bfaca3bf660d
[2023-12-07 18:21:38,347 I 141 141] (raylet) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2023-12-07 18:21:38,348 I 141 141] (raylet) store_runner.cc:32: Allowing the Plasma store to use up to 2.31002GB of memory.
[2023-12-07 18:21:38,348 I 141 141] (raylet) store_runner.cc:48: Starting object store with directory /dev/shm, fallback /tmp/ray, and huge page support disabled
[2023-12-07 18:21:38,348 I 141 155] (raylet) dlmalloc.cc:154: create_and_mmap_buffer(2310144008, /dev/shm/plasmaXXXXXX)
[2023-12-07 18:21:38,348 I 141 155] (raylet) store.cc:535: ========== Plasma store: =================
Current usage: 0 / 2.31002 GB
- num bytes created total: 0
0 pending objects of total size 0MB
- objects spillable: 0
- bytes spillable: 0
- objects unsealed: 0
- bytes unsealed: 0
- objects in use: 0
- bytes in use: 0
- objects evictable: 0
- bytes evictable: 0

- objects created by worker: 0
- bytes created by worker: 0
- objects restored: 0
- bytes restored: 0
- objects received: 0
- bytes received: 0
- objects errored: 0
- bytes errored: 0

[2023-12-07 18:21:38,350 I 141 141] (raylet) grpc_server.cc:129: ObjectManager server started, listening on port 43321.
[2023-12-07 18:21:38,351 I 141 141] (raylet) worker_killing_policy.cc:101: Running GroupByOwner policy.
[2023-12-07 18:21:38,352 I 141 141] (raylet) memory_monitor.cc:47: MemoryMonitor initialized with usage threshold at 7600000000 bytes (0.95 system memory), total system memory bytes: 8000000000
[2023-12-07 18:21:38,352 I 141 141] (raylet) node_manager.cc:319: Initializing NodeManager with ID 8f5adc209e27065f1aaf9d21d4378038c9590c9d850bc955f4dc28f4
[2023-12-07 18:21:38,352 I 141 141] (raylet) grpc_server.cc:129: NodeManager server started, listening on port 37317.
[2023-12-07 18:21:38,359 I 141 183] (raylet) agent_manager.cc:64: Monitor agent process with name dashboard_agent/424238335
[2023-12-07 18:21:38,359 I 141 185] (raylet) agent_manager.cc:64: Monitor agent process with name runtime_env_agent
[2023-12-07 18:21:38,360 I 141 141] (raylet) event.cc:234: Set ray event level to warning
[2023-12-07 18:21:38,360 I 141 141] (raylet) event.cc:342: Ray Event initialized for RAYLET
[2023-12-07 18:21:38,361 I 141 141] (raylet) raylet.cc:127: Raylet of id, 8f5adc209e27065f1aaf9d21d4378038c9590c9d850bc955f4dc28f4 started. Raylet consists of node_manager and object_manager. node_manager address: 33.60.76.44:37317 object_manager address: 33.60.76.44:43321 hostname: rayjob-test-raycluster-ss6p8-head-cxpxq
[2023-12-07 18:21:38,362 I 141 141] (raylet) node_manager.cc:537: [state-dump] NodeManager:
[state-dump] Node ID: 8f5adc209e27065f1aaf9d21d4378038c9590c9d850bc955f4dc28f4
[state-dump] Node name: 33.60.76.44
[state-dump] InitialConfigResources: {accelerator_type:T4: 10000, memory: 80000000000000, CPU: 40000, object_store_memory: 23100186620000, GPU: 20000, node:33.60.76.44: 10000, node:__internal_head__: 10000}
[state-dump] ClusterTaskManager:
[state-dump] ========== Node: 8f5adc209e27065f1aaf9d21d4378038c9590c9d850bc955f4dc28f4 =================
[state-dump] Infeasible queue length: 0
[state-dump] Schedule queue length: 0
[state-dump] Dispatch queue length: 0
[state-dump] num_waiting_for_resource: 0
[state-dump] num_waiting_for_plasma_memory: 0
[state-dump] num_waiting_for_remote_node_resources: 0
[state-dump] num_worker_not_started_by_job_config_not_exist: 0
[state-dump] num_worker_not_started_by_registration_timeout: 0
[state-dump] num_tasks_waiting_for_workers: 0
[state-dump] num_cancelled_tasks: 0
[state-dump] cluster_resource_scheduler state: 
[state-dump] Local id: 5077174522555753333 Local resources: {"total":{CPU: [40000], GPU: [10000, 10000], node:33.60.76.44: [10000], node:__internal_head__: [10000], memory: [80000000000000], accelerator_type:T4: [10000], object_store_memory: [23100186620000]}}, "available": {CPU: [40000], GPU: [10000, 10000], node:33.60.76.44: [10000], node:__internal_head__: [10000], memory: [80000000000000], accelerator_type:T4: [10000], object_store_memory: [23100186620000]}}, "labels":{"ray.io/node_id":"8f5adc209e27065f1aaf9d21d4378038c9590c9d850bc955f4dc28f4",} is_draining: 0 is_idle: 1 Cluster resources: node id: 5077174522555753333{"total":{memory: 80000000000000, node:__internal_head__: 10000, GPU: 20000, node:33.60.76.44: 10000, CPU: 40000, object_store_memory: 23100186620000, accelerator_type:T4: 10000}}, "available": {memory: 80000000000000, node:__internal_head__: 10000, GPU: 20000, node:33.60.76.44: 10000, CPU: 40000, object_store_memory: 23100186620000, accelerator_type:T4: 10000}}, "labels":{"ray.io/node_id":"8f5adc209e27065f1aaf9d21d4378038c9590c9d850bc955f4dc28f4",}, "is_draining": 0} { "placment group locations": [], "node to bundles": []}
[state-dump] Waiting tasks size: 0
[state-dump] Number of executing tasks: 0
[state-dump] Number of pinned task arguments: 0
[state-dump] Number of total spilled tasks: 0
[state-dump] Number of spilled waiting tasks: 0
[state-dump] Number of spilled unschedulable tasks: 0
[state-dump] Resource usage {
[state-dump] }
[state-dump] Running tasks by scheduling class:
[state-dump] ==================================================
[state-dump] 
[state-dump] ClusterResources:
[state-dump] LocalObjectManager:
[state-dump] - num pinned objects: 0
[state-dump] - pinned objects size: 0
[state-dump] - num objects pending restore: 0
[state-dump] - num objects pending spill: 0
[state-dump] - num bytes pending spill: 0
[state-dump] - num bytes currently spilled: 0
[state-dump] - cumulative spill requests: 0
[state-dump] - cumulative restore requests: 0
[state-dump] - spilled objects pending delete: 0
[state-dump] 
[state-dump] ObjectManager:
[state-dump] - num local objects: 0
[state-dump] - num unfulfilled push requests: 0
[state-dump] - num object pull requests: 0
[state-dump] - num chunks received total: 0
[state-dump] - num chunks received failed (all): 0
[state-dump] - num chunks received failed / cancelled: 0
[state-dump] - num chunks received failed / plasma error: 0
[state-dump] Event stats:
[state-dump] Global stats: 0 total (0 active)
[state-dump] Queueing time: mean = -nan s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] Execution time:  mean = -nan s, total = 0.000 s
[state-dump] Event stats:
[state-dump] PushManager:
[state-dump] - num pushes in flight: 0
[state-dump] - num chunks in flight: 0
[state-dump] - num chunks remaining: 0
[state-dump] - max chunks allowed: 409
[state-dump] OwnershipBasedObjectDirectory:
[state-dump] - num listeners: 0
[state-dump] - cumulative location updates: 0
[state-dump] - num location updates per second: 0.000
[state-dump] - num location lookups per second: 0.000
[state-dump] - num locations added per second: 0.000
[state-dump] - num locations removed per second: 0.000
[state-dump] BufferPool:
[state-dump] - create buffer state map size: 0
[state-dump] PullManager:
[state-dump] - num bytes available for pulled objects: 2310018662
[state-dump] - num bytes being pulled (all): 0
[state-dump] - num bytes being pulled / pinned: 0
[state-dump] - get request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - wait request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - task request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - first get request bundle: N/A
[state-dump] - first wait request bundle: N/A
[state-dump] - first task request bundle: N/A
[state-dump] - num objects queued: 0
[state-dump] - num objects actively pulled (all): 0
[state-dump] - num objects actively pulled / pinned: 0
[state-dump] - num bundles being pulled: 0
[state-dump] - num pull retries: 0
[state-dump] - max timeout seconds: 0
[state-dump] - max timeout request is already processed. No entry.
[state-dump] 
[state-dump] WorkerPool:
[state-dump] - registered jobs: 0
[state-dump] - process_failed_job_config_missing: 0
[state-dump] - process_failed_rate_limited: 0
[state-dump] - process_failed_pending_registration: 0
[state-dump] - process_failed_runtime_env_setup_failed: 0
[state-dump] - num CPP workers: 0
[state-dump] - num CPP drivers: 0
[state-dump] - num object spill callbacks queued: 0
[state-dump] - num object restore queued: 0
[state-dump] - num util functions queued: 0
[state-dump] - num PYTHON workers: 0
[state-dump] - num PYTHON drivers: 0
[state-dump] - num object spill callbacks queued: 0
[state-dump] - num object restore queued: 0
[state-dump] - num util functions queued: 0
[state-dump] - num idle workers: 0
[state-dump] TaskDependencyManager:
[state-dump] - task deps map size: 0
[state-dump] - get req map size: 0
[state-dump] - wait req map size: 0
[state-dump] - local objects map size: 0
[state-dump] WaitManager:
[state-dump] - num active wait requests: 0
[state-dump] Subscriber:
[state-dump] Channel WORKER_OBJECT_EVICTION
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] Channel WORKER_REF_REMOVED_CHANNEL
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] Channel WORKER_OBJECT_LOCATIONS_CHANNEL
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] num async plasma notifications: 0
[state-dump] Remote node managers: 
[state-dump] Event stats:
[state-dump] Global stats: 24 total (13 active)
[state-dump] Queueing time: mean = 1.237 ms, max = 8.289 ms, min = 8.829 us, total = 29.700 ms
[state-dump] Execution time:  mean = 602.184 us, total = 14.452 ms
[state-dump] Event stats:
[state-dump]    PeriodicalRunner.RunFnPeriodically - 11 total (2 active, 1 running), CPU time: mean = 141.203 us, total = 1.553 ms
[state-dump]    NodeManager.deadline_timer.record_metrics - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump]    NodeManager.ScheduleAndDispatchTasks - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump]    RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump]    ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump]    InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump]    NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump]    NodeManager.GCTaskFailureReason - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump]    NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 193.186 us, total = 193.186 us
[state-dump]    NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump]    MemoryMonitor.CheckIsMemoryUsageAboveThreshold - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump]    NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 12.706 ms, total = 12.706 ms
[state-dump]    NodeManager.deadline_timer.spill_objects_when_over_threshold - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump]    InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] DebugString() time ms: 0
[state-dump] 
[state-dump] 
[2023-12-07 18:21:38,363 I 141 141] (raylet) accessor.cc:627: Received notification for node id = 8f5adc209e27065f1aaf9d21d4378038c9590c9d850bc955f4dc28f4, IsAlive = 1
[2023-12-07 18:21:51,095 I 141 141] (raylet) accessor.cc:627: Received notification for node id = f7e39c00d132635d71a544e061171cefa527b343f14aebed7318eb0f, IsAlive = 1
[2023-12-07 18:22:00,612 I 141 141] (raylet) node_manager.cc:622: New job has started. Job id 01000000 Driver pid 118 is dead: 0 driver address: 33.50.139.160
[2023-12-07 18:22:00,895 I 141 141] (raylet) runtime_env_agent_client.cc:307: Create runtime env for job 01000000
[2023-12-07 18:22:00,897 I 141 141] (raylet) worker_pool.cc:498: Started worker process with pid 264, the token is 0
[2023-12-07 18:22:01,578 I 141 155] (raylet) object_store.cc:35: Object store current usage 8e-09 / 2.31002 GB.
[2023-12-07 18:22:13,894 I 141 141] (raylet) node_manager.cc:1464: NodeManager::DisconnectClient, disconnect_type=2, has creation task exception = true
[2023-12-07 18:22:13,894 I 141 141] (raylet) node_manager.cc:1493: Formatted creation task exception: Traceback (most recent call last):

  File "python/ray/_raylet.pyx", line 1669, in ray._raylet.execute_task

  File "python/ray/_raylet.pyx", line 1769, in ray._raylet.execute_task

  File "python/ray/_raylet.pyx", line 1675, in ray._raylet.execute_task

  File "python/ray/_raylet.pyx", line 1610, in ray._raylet.execute_task.function_executor

  File "python/ray/_raylet.pyx", line 4393, in ray._raylet.CoreWorker.run_async_func_or_coro_in_event_loop

  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()

  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception

  File "python/ray/_raylet.pyx", line 4380, in async_func

  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/async_compat.py", line 42, in wrapper
    return func(*args, **kwargs)

  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/function_manager.py", line 726, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)

  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
    return method(self, *_args, **_kwargs)

  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_manager.py", line 162, in __init__
    gcs_aio_client = GcsAioClient(address=gcs_address)

  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/gcs_aio_client.py", line 64, in __init__
    self._gcs_client = GcsClient(address, nums_reconnect_retry)

  File "python/ray/_raylet.pyx", line 2492, in ray._raylet.GcsClient.__cinit__

  File "python/ray/_raylet.pyx", line 2501, in ray._raylet.GcsClient._connect

  File "python/ray/_raylet.pyx", line 468, in ray._raylet.check_status

ray.exceptions.RaySystemError: System error: RPC Error message: failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.251.238:6379: tcp handshaker shutdown; RPC Error details: 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "python/ray/_raylet.pyx", line 2064, in ray._raylet.task_execution_handler

  File "python/ray/_raylet.pyx", line 1960, in ray._raylet.execute_task_with_cancellation_handler

  File "python/ray/_raylet.pyx", line 1617, in ray._raylet.execute_task

  File "python/ray/_raylet.pyx", line 1618, in ray._raylet.execute_task

  File "python/ray/_raylet.pyx", line 1856, in ray._raylet.execute_task

  File "python/ray/_raylet.pyx", line 959, in ray._raylet.store_task_errors

ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::_ray_internal_job_actor_raysubmit_2siN5swbQYgJ7N8N:JobSupervisor.__init__() (pid=264, ip=33.60.76.44, actor_id=5756071e2fcde3da8df5881901000000, repr=<ray.dashboard.modules.job.job_manager.JobSupervisor object at 0x7ff1941ef250>)
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_manager.py", line 162, in __init__
    gcs_aio_client = GcsAioClient(address=gcs_address)
ray.exceptions.RaySystemError: System error: RPC Error message: failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.251.238:6379: tcp handshaker shutdown; RPC Error details:
, worker_id: c1eb7e136bf13cfbd7170a65343e3528014a4ec79a9aa81be4425f7a
[2023-12-07 18:22:38,348 I 141 155] (raylet) store.cc:535: ========== Plasma store: =================
Current usage: 0 / 2.31002 GB
- num bytes created total: 8
0 pending objects of total size 0MB
- objects spillable: 0
- bytes spillable: 0
- objects unsealed: 0
- bytes unsealed: 0
- objects in use: 0
- bytes in use: 0
- objects evictable: 0
- bytes evictable: 0

- objects created by worker: 0
- bytes created by worker: 0
- objects restored: 0
- bytes restored: 0
- objects received: 0
- bytes received: 0
- objects errored: 0
- bytes errored: 0

The above exception logs occur during the initial stage of component startup. Subsequent logs are normal, so they are not uploaded again to avoid repetition.

SimonCqk commented 9 months ago

192.168.251.238

This is the ClusterIP of the head Service. I modified the KubeRay code to inject the service name into the driver pod through an environment variable, ensuring that the service can be resolved before startup.

Do you mean I have to ensure the GCS server port is reachable when the driver pod starts up?
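
For reference, the kind of guard I mean is a wait-for-GCS loop before submission, along the lines of this sketch (it mirrors the ray health-check loop in the submitter template; the hostname is built from the injected RAYCLUSTER_SVC_NAME variable):

# Hypothetical sketch: block until the GCS port accepts TCP connections.
import os
import socket
import time

GCS_HOST = os.environ["RAYCLUSTER_SVC_NAME"] + ".damo-cv.svc.cluster.local"
GCS_PORT = 6379
deadline = time.time() + 300  # give the head pod up to 5 minutes

while time.time() < deadline:
    try:
        with socket.create_connection((GCS_HOST, GCS_PORT), timeout=5):
            print("GCS is ready.")
            break
    except OSError:
        print("Waiting for GCS to be ready.")
        time.sleep(5)
else:
    raise TimeoutError(f"GCS at {GCS_HOST}:{GCS_PORT} not reachable before deadline")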

rickyyx commented 9 months ago

I'm not too familiar with KubeRay personally, but yes, I would expect the GCS server port to be reachable from all other worker pods.

cc @architkulkarni

architkulkarni commented 9 months ago

@SimonCqk thanks for the additional details. If you could answer a few more questions, it might help:

cc @kevin85421

kevin85421 commented 8 months ago

@architkulkarni can you triage this issue? Thanks!

adriansblack commented 4 months ago

Also hitting this, wondering if anyone has found a resolution? Thanks!