ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] RayCluster Sporadic RPC Error #2180

Closed adriansblack closed 3 weeks ago

adriansblack commented 3 weeks ago

KubeRay Component

Others

What happened + What you expected to happen

Hi, I've spun up a RayCluster roughly following the RayCluster QuickStart (https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/raycluster-quick-start.html), except that we use a custom image.

When we submit a job with ray job submit --address http://localhost:8265 -- python -c "import ray; ray.init(); print(ray.cluster_resources())" we sometimes get the correct response back. At other times, however, the job crashes with this error:

Status message: Unexpected error occurred: The actor died because of an error raised in its creation task, ray::_ray_internal_job_actor_raysubmit_xjugKRUQJcNatswT:JobSupervisor.__init__() (pid=2561, ip=10.116.130.14, actor_id=a437fd10ab3c9414e3e8f49001000000, repr=<ray.dashboard.modules.job.job_supervisor.JobSupervisor object at 0x7f6c445128d0>)
  File "/home/ray/anaconda3/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/dashboard/modules/job/job_supervisor.py", line 71, in __init__
    gcs_aio_client = GcsAioClient(address=gcs_address)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RaySystemError: System error: RPC Error message: failed to connect to all addresses; last error: UNKNOWN: ipv4:10.100.196.52:6379: Failed to connect to remote host: FD Shutdown; RPC Error details:

Thanks
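Since the error above reports a failed connection to 10.100.196.52:6379, one way to narrow this down is to check whether that address is intermittently unreachable at the TCP level from inside the pod that runs the job. The following is a minimal sketch, not a Ray utility: probe_tcp is a hypothetical helper name, and a successful TCP connect only shows that the port accepts connections, not that the GCS is healthy.

```python
import socket
import time


def probe_tcp(host, port, attempts=10, timeout=2.0, delay=0.5):
    """Open a TCP connection `attempts` times and record each outcome.

    Hypothetical diagnostic helper (not part of Ray or KubeRay).
    Returns a list of booleans, one per attempt: True if the connect
    succeeded, False if it was refused, timed out, or otherwise failed.
    """
    results = []
    for i in range(attempts):
        try:
            # create_connection resolves the host and performs the TCP handshake.
            with socket.create_connection((host, port), timeout=timeout):
                results.append(True)
        except OSError:
            # Covers connection refused, timeouts, and unreachable hosts.
            results.append(False)
        if i < attempts - 1:
            time.sleep(delay)
    return results
```

Calling probe_tcp("10.100.196.52", 6379) with the address from the error message and counting the True entries would show whether connects fail at a rate similar to the job submissions. If every connect succeeds while jobs still fail, the problem is more likely above the TCP layer.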

Reproduction script

Not readily reproducible, since we're using a custom image and cluster. We've tried this on a fresh cluster and still hit the error, which suggests the issue may be with Ray itself. Any help is greatly appreciated.

Anything else

Occurs roughly 80% of the time. As noted above, some submissions complete without any error.
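To put a firmer number on that estimate, the submission can be repeated in a loop and the failures counted. The sketch below is illustrative only: measure_failure_rate is a hypothetical helper, and it assumes a failed run either exits non-zero or prints the text "RPC Error" (as in the message above). That assumption about the CLI's behaviour has not been verified against the Ray source.

```python
import subprocess


def measure_failure_rate(cmd, attempts=10, marker="RPC Error"):
    """Run `cmd` repeatedly and return the fraction of runs that failed.

    Hypothetical helper. A run counts as failed if the process exits
    non-zero OR its combined output contains `marker` (assumed here to
    be the text seen in the reported error).
    """
    failures = 0
    for _ in range(attempts):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        output = proc.stdout + proc.stderr
        if proc.returncode != 0 or marker in output:
            failures += 1
    return failures / attempts
```

For the command in this report, cmd would be the list ["ray", "job", "submit", "--address", "http://localhost:8265", "--", "python", "-c", "import ray; ray.init(); print(ray.cluster_resources())"]. Passing a list rather than a shell string avoids quoting problems with the inline Python snippet.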

adriansblack commented 3 weeks ago

Found a similar issue, will follow up there: https://github.com/ray-project/kuberay/issues/1744