Upon submitting the job:
ray job submit --address http://localhost:8265 -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
we sometimes get the correct response back.
At other times, however, the job fails with this error:
Status message: Unexpected error occurred: The actor died because of an error raised in its creation task, ray::_ray_internal_job_actor_raysubmit_xjugKRUQJcNatswT:JobSupervisor.__init__() (pid=2561, ip=10.116.130.14, actor_id=a437fd10ab3c9414e3e8f49001000000, repr=<ray.dashboard.modules.job.job_supervisor.JobSupervisor object at 0x7f6c445128d0>)
File "/home/ray/anaconda3/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/dashboard/modules/job/job_supervisor.py", line 71, in __init__
gcs_aio_client = GcsAioClient(address=gcs_address)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RaySystemError: System error: RPC Error message: failed to connect to all addresses; last error: UNKNOWN: ipv4:10.100.196.52:6379: Failed to connect to remote host: FD Shutdown; RPC Error details:
Thanks
Reproduction script
Not really reproducible as we're using a custom image / cluster. We've attempted this on a fresh cluster and still hit the error, suggesting the issue may be with Ray. Any help greatly appreciated.
Anything else
Occurs roughly 80% of the time. As noted, sometimes there is no error at all.
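Since the failure is intermittent and the error names a specific GCS address (10.100.196.52:6379), one possible way to narrow it down is to probe that TCP port repeatedly from inside the pod where the job supervisor runs. The sketch below is only a diagnostic idea, not something from the Ray or KubeRay tooling: `probe_tcp` is a hypothetical helper built on the Python standard library, and it assumes the problem shows up at the plain TCP level rather than only inside gRPC.

```python
import socket
import time


def probe_tcp(host: str, port: int, attempts: int = 20,
              timeout: float = 2.0, pause: float = 0.5) -> list[bool]:
    """Try to open a TCP connection `attempts` times and record each outcome.

    Hypothetical diagnostic helper: True means the connection was
    established, False means it failed or timed out.
    """
    results = []
    for i in range(attempts):
        try:
            # create_connection raises OSError (including timeouts) on failure.
            with socket.create_connection((host, port), timeout=timeout):
                results.append(True)
        except OSError:
            results.append(False)
        if i + 1 < attempts:
            time.sleep(pause)  # space the attempts out a little
    return results
```

Calling `probe_tcp("10.100.196.52", 6379)` from the head or worker pod and counting the `False` entries would show whether plain TCP connections to the GCS address fail at a rate similar to the job submissions. If every probe succeeds while jobs still fail, the cause is more likely above the network layer.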
Search before asking
KubeRay Component
Others
What happened + What you expected to happen
Hi, I've spun up a RayCluster roughly following the RayCluster QuickStart, although we use a custom image: https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/raycluster-quick-start.html
Are you willing to submit a PR?