ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Ray Client] - Client server failed with runtime_env container #29852

Open igorgad opened 2 years ago

igorgad commented 2 years ago

What happened + What you expected to happen

Hi,

Even though runtime_env containers are still experimental, I've had success using them at the job level in Ray applications launched inside the cluster via job submission, i.e. the script that runs on the cluster calls ray.init(runtime_env={'container': ...}) (see the sketch below). Given that, I don't think there's anything wrong with the podman setup on my custom cluster images, which inherit from rayproject/ray:2.0.0-py38.
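
For reference, the working job-level pattern looks roughly like this (a minimal sketch; the image and run_options are the same ones as in the reproduction script below, and the remote task is just illustrative):

# driver.py -- runs on the cluster, e.g. submitted through the Ray Jobs API
import ray

# Attach to the existing cluster and ask Ray to start worker processes
# inside the given container image.
ray.init(runtime_env={
    'container': {
        'image': 'docker.io/rayproject/ray:2.0.0-py38',
        'run_options': ['--cgroups=enabled'],
    },
})

@ray.remote
def hello():
    return 'running inside the container'

print(ray.get(hello.remote()))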

However, using runtime_env containers with the Ray client for interactive development leads to the following error during initialization of the Ray client server.

---------------------------------------------------------------------------
ConnectionAbortedError                    Traceback (most recent call last)
Cell In [2], line 3
      1 import ray
----> 3 ray.init('ray://localhost:10001', runtime_env={
      4     'container': {
      5             'image': 'docker.io/rayproject/ray:2.0.0-py38',
      6             'run_options': ['--cgroups=enabled'],
      7         },
      8 })

File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/_private/client_mode_hook.py:105, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
    103     if func.__name__ != "init" or is_client_mode_enabled_by_default:
    104         return getattr(ray, func.__name__)(*args, **kwargs)
--> 105 return func(*args, **kwargs)

File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/_private/worker.py:1248, in init(address, num_cpus, num_gpus, resources, object_store_memory, local_mode, ignore_reinit_error, include_dashboard, dashboard_host, dashboard_port, job_config, configure_logging, logging_level, logging_format, log_to_driver, namespace, runtime_env, storage, **kwargs)
   1246 passed_kwargs.update(kwargs)
   1247 builder._init_args(**passed_kwargs)
-> 1248 ctx = builder.connect()
   1249 from ray._private.usage import usage_lib
   1251 if passed_kwargs.get("allow_multiple") is True:

File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/client_builder.py:178, in ClientBuilder.connect(self)
    175 if self._allow_multiple_connections:
    176     old_ray_cxt = ray.util.client.ray.set_context(None)
--> 178 client_info_dict = ray.util.client_connect.connect(
    179     self.address,
    180     job_config=self._job_config,
    181     _credentials=self._credentials,
    182     ray_init_kwargs=self._remote_init_kwargs,
    183     metadata=self._metadata,
    184 )
    185 get_dashboard_url = ray.remote(ray._private.worker.get_dashboard_url)
    186 dashboard_url = ray.get(get_dashboard_url.options(num_cpus=0).remote())

File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/util/client_connect.py:47, in connect(conn_str, secure, metadata, connection_retries, job_config, namespace, ignore_version, _credentials, ray_init_kwargs)
     42 _explicitly_enable_client_mode()
     44 # TODO(barakmich): https://github.com/ray-project/ray/issues/13274
     45 # for supporting things like cert_path, ca_path, etc and creating
     46 # the correct metadata
---> 47 conn = ray.connect(
     48     conn_str,
     49     job_config=job_config,
     50     secure=secure,
     51     metadata=metadata,
     52     connection_retries=connection_retries,
     53     namespace=namespace,
     54     ignore_version=ignore_version,
     55     _credentials=_credentials,
     56     ray_init_kwargs=ray_init_kwargs,
     57 )
     58 return conn

File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/util/client/__init__.py:252, in RayAPIStub.connect(self, *args, **kw_args)
    250 def connect(self, *args, **kw_args):
    251     self.get_context()._inside_client_test = self._inside_client_test
--> 252     conn = self.get_context().connect(*args, **kw_args)
    253     global _lock, _all_contexts
    254     with _lock:

File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/util/client/__init__.py:102, in _ClientContext.connect(self, conn_str, job_config, secure, metadata, connection_retries, namespace, ignore_version, _credentials, ray_init_kwargs)
     94 self.client_worker = Worker(
     95     conn_str,
     96     secure=secure,
   (...)
     99     connection_retries=connection_retries,
    100 )
    101 self.api.worker = self.client_worker
--> 102 self.client_worker._server_init(job_config, ray_init_kwargs)
    103 conn_info = self.client_worker.connection_info()
    104 self._check_versions(conn_info, ignore_version)

File /opt/miniconda3/envs/harmon/lib/python3.8/site-packages/ray/util/client/worker.py:838, in Worker._server_init(self, job_config, ray_init_kwargs)
    830     response = self.data_client.Init(
    831         ray_client_pb2.InitRequest(
    832             job_config=serialized_job_config,
   (...)
    835         )
    836     )
    837     if not response.ok:
--> 838         raise ConnectionAbortedError(
    839             f"Initialization failure from server:\n{response.msg}"
    840         )
    842 except grpc.RpcError as e:
    843     raise decode_exception(e)

ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 685, in Datapath
    raise RuntimeError(
RuntimeError: Starting Ray client server failed. See ray_client_server_23000.err for detailed logs.

The file ray_client_server_23000.err contains

Trying to pull docker.io/rayproject/ray:2.0.0-py38...
Getting image source signatures
Copying blob sha256:d8135c8d3f0ebe84b529d185558505d5dd4b524e282c17b6152aba56b02ed31e
Copying blob sha256:f0d19e69127971cff8b7bfbbe024890de117604b5861e2b106da8cfd3fb81d53
Copying blob sha256:cde2dbf8dc867dda82c869f13f50d1d88a854128ab07916e9df3d45086b1aca3
Copying blob sha256:3b65ec22a9e96affe680712973e88355927506aa3f792ff03330f3a3eb601a98
Copying blob sha256:87f7a5ff197c9418519c096f1f7aa5afceac82f8ada0df33a21a384d55acde5f
Copying blob sha256:8a0031b53b4d14665f9c7ab891ece272998721af9b0d969924d88fc9408ed57c
Copying blob sha256:3b65ec22a9e96affe680712973e88355927506aa3f792ff03330f3a3eb601a98
Copying blob sha256:87f7a5ff197c9418519c096f1f7aa5afceac82f8ada0df33a21a384d55acde5f
Copying blob sha256:8a0031b53b4d14665f9c7ab891ece272998721af9b0d969924d88fc9408ed57c
Copying blob sha256:cde2dbf8dc867dda82c869f13f50d1d88a854128ab07916e9df3d45086b1aca3
Copying blob sha256:d8135c8d3f0ebe84b529d185558505d5dd4b524e282c17b6152aba56b02ed31e
Copying blob sha256:f0d19e69127971cff8b7bfbbe024890de117604b5861e2b106da8cfd3fb81d53
Copying blob sha256:57c67e634ccf3c72945b4da73023e28c0efaae0fa95c8c1644180bd9df46be68
Copying blob sha256:57c67e634ccf3c72945b4da73023e28c0efaae0fa95c8c1644180bd9df46be68
Copying blob sha256:aea4f35623b6f74ffaaf14a60cf010fa0c69942480aeeb34853366ad58fd4c00
Copying blob sha256:aea4f35623b6f74ffaaf14a60cf010fa0c69942480aeeb34853366ad58fd4c00
Copying blob sha256:78f7682f5042b61bad31612b833dde54498ffcebcd18057bcff8255687020ba7
Copying blob sha256:78f7682f5042b61bad31612b833dde54498ffcebcd18057bcff8255687020ba7
Copying config sha256:c3b4447db3d173fcc94d5736ee633a6223ef07efc15a2ba1c69a34f673f6c299
Writing manifest to image destination
Storing signatures
2022-10-31 05:37:33,217 INFO server.py:875 -- Starting Ray Client server on 0.0.0.0:23000
2022-10-31 05:37:38,239 INFO server.py:922 -- 25 idle checks before shutdown.
2022-10-31 05:37:43,249 INFO server.py:922 -- 20 idle checks before shutdown.
2022-10-31 05:37:48,260 INFO server.py:922 -- 15 idle checks before shutdown.
2022-10-31 05:37:53,272 INFO server.py:922 -- 10 idle checks before shutdown.
2022-10-31 05:37:58,282 INFO server.py:922 -- 5 idle checks before shutdown.

I can find more info in ray_client_server.err:

2022-10-31 05:36:33,435 INFO server.py:875 -- Starting Ray Client server on 0.0.0.0:10001
2022-10-31 05:36:48,552 INFO proxier.py:670 -- New data connection from client 71aa1ee5efa1441b937aecb493ed977f: 
2022-10-31 05:36:48,566 INFO proxier.py:229 -- Increasing runtime env reference for ray_client_server_23000.Serialized runtime env is {"container": {"image": "docker.io/rayproject/ray:2.0.0-py38", "run_options": ["--cgroups=enabled"]}}.
2022-10-31 05:38:03,708 ERROR proxier.py:332 -- SpecificServer startup failed for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:03,708 INFO proxier.py:340 -- SpecificServer started on port: 23000 with PID: 229 for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:03,708 ERROR proxier.py:681 -- Server startup failed for client: 71aa1ee5efa1441b937aecb493ed977f, using JobConfig: <ray.job_config.JobConfig object at 0x7f85ec1ee460>!
2022-10-31 05:38:03,709 INFO proxier.py:390 -- Specific server 71aa1ee5efa1441b937aecb493ed977f is no longer running, freeing its port 23000
2022-10-31 05:38:33,710 ERROR proxier.py:379 -- Timeout waiting for channel for 71aa1ee5efa1441b937aecb493ed977f
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 374, in get_channel
    grpc.channel_ready_future(server.channel).result(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/grpc/_utilities.py", line 139, in result
    self._block(timeout)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/grpc/_utilities.py", line 85, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2022-10-31 05:38:33,711 WARNING proxier.py:777 -- Retrying Logstream connection. 1 attempts failed.
2022-10-31 05:38:33,712 INFO proxier.py:742 -- 71aa1ee5efa1441b937aecb493ed977f last started stream at 1667219808.5511196. Current stream started at 1667219808.5511196.
2022-10-31 05:38:35,713 ERROR proxier.py:350 -- Unable to find channel for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:35,713 WARNING proxier.py:777 -- Retrying Logstream connection. 2 attempts failed.
2022-10-31 05:38:37,715 ERROR proxier.py:350 -- Unable to find channel for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:37,715 WARNING proxier.py:777 -- Retrying Logstream connection. 3 attempts failed.
2022-10-31 05:38:39,717 ERROR proxier.py:350 -- Unable to find channel for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:39,717 WARNING proxier.py:777 -- Retrying Logstream connection. 4 attempts failed.
2022-10-31 05:38:41,719 ERROR proxier.py:350 -- Unable to find channel for client: 71aa1ee5efa1441b937aecb493ed977f
2022-10-31 05:38:41,719 WARNING proxier.py:777 -- Retrying Logstream connection. 5 attempts failed

Also, in runtime_env_setup-ray_client_server_23000.log I found:

2022-10-31 05:36:48,569    INFO container.py:47 -- start worker in container with prefix: podman run -v /tmp/ray:/tmp/ray --cgroup-manager=cgroupfs --network=host --pid=host --ipc=host --env-host --env RAY_RAYLET_PID=154 --cgroups=enabled --entrypoint python docker.io/rayproject/ray:2.0.0-py38

I think this issue is related to the connection between the client proxy and the client server that runs inside the container; however, as the logs show, the container is created with the --network=host flag. I wonder if someone from the Ray team could point me towards a workaround, or some documentation on how the client servers are set up, as I am willing to contribute.

Regarding issue severity, I'll leave it at Medium, since I do have workarounds, just not convenient ones.

Thanks.

Versions / Dependencies

About ray

ray[default]==2.0.0
kuberay-operator: kuberay/operator:v0.3.0

Podman installed on cluster base image

(base) ray@lany-cluster-head-bvkg6:~$ podman info
host:
  arch: amd64
  buildahVersion: 1.23.1
  cgroupControllers: []
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: 'conmon: /usr/libexec/podman/conmon'
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.1.2, commit: '
  cpus: 8
  distribution:
    codename: focal
    distribution: ubuntu
    version: "20.04"
  eventLogger: file
  hostname: lany-cluster-head-bvkg6
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 100
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.10.133+
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 27025526784
  memTotal: 33671999488
  ociRuntime:
    name: crun
    package: 'crun: /usr/bin/crun'
    path: /usr/bin/crun
    version: |-
      crun version UNKNOWN
      commit: ea1fe3938eefa14eb707f1d22adff4db670645d6
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /tmp/podman-run-1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: 'slirp4netns: /usr/bin/slirp4netns'
    version: |-
      slirp4netns version 1.1.8
      commit: unknown
      libslirp: 4.3.1-git
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.4.3
  swapFree: 0
  swapTotal: 0
  uptime: 283h 18m 10.55s (Approximately 11.79 days)
plugins:
  log:
  - k8s-file
  - none
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - docker.io
  - quay.io
store:
  configFile: /home/ray/.config/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 0
    stopped: 1
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: 'fuse-overlayfs: /usr/bin/fuse-overlayfs'
      Version: |-
        fusermount3 version: 3.9.0
        fuse-overlayfs: version 1.5
        FUSE library version 3.9.0
        using FUSE kernel interface version 7.31
  graphRoot: /home/ray/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: overlayfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 1
  runRoot: /tmp/podman-run-1000/containers
  volumePath: /home/ray/.local/share/containers/storage/volumes
version:
  APIVersion: 3.4.2
  Built: 0
  BuiltTime: Wed Dec 31 16:00:00 1969
  GitCommit: ""
  GoVersion: go1.15.2
  OsArch: linux/amd64
  Version: 3.4.2

Reproduction script

import ray

ray.init('ray://localhost:10001', runtime_env={
    'container': {
        'image': 'docker.io/rayproject/ray:2.0.0-py38',
        'run_options': ['--cgroups=enabled'],
    },
})

Issue Severity

Medium: It is a significant difficulty but I can work around it.

architkulkarni commented 2 years ago

cc @SongGuyang in case there are any workarounds for the container issue.

As another possible workaround: you mentioned conda takes 10 minutes to install. If the conda environment isn't changing often, would it fit your use case to preinstall it and then just specify the name of the existing environment in the runtime_env, e.g. runtime_env={"conda": "my-existing-env"}? Then Ray would just activate the existing environment at runtime instead of installing it, which should be faster.
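
Concretely, something like this (just a sketch, reusing the client address from your reproduction script and assuming an environment named my-existing-env already exists on every node):

import ray

# Reuse a conda env that was preinstalled (with the same name) on every node,
# so Ray only has to activate it instead of building it at connection time.
ray.init(
    'ray://localhost:10001',
    runtime_env={'conda': 'my-existing-env'},  # name of an existing env, not a YAML spec
)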

igorgad commented 2 years ago

Hey @architkulkarni, thanks for your quick reply.

Yes, that's an alternative. I'm curious, though: does preinstalling the conda environment on the head node make it shareable with new workers? If not, it would take a considerable amount of time to install the conda environment on new workers, unless it is baked into the cluster's base image. The problem at the moment is that we are trying to run a more generic cluster that serves multiple projects through the use of runtime environments.

architkulkarni commented 2 years ago

Ah no, you would need the conda environment to be on all the nodes of the cluster and have the same name on all nodes.

peterghaddad commented 1 year ago

@architkulkarni I am experiencing issues when trying to test the Alpha Container Runtime feature. Is podman a necessary dependency? I noticed the container runtime is specified in the code (see this issue: https://github.com/ray-project/ray/issues/29665).

We are using KubeRay with CRI-O as our container runtime on Kubernetes. Is the expectation for this feature that the autoscaler launches a new worker? Does this work natively with existing Kubernetes architectures?

We don't have Podman installed, nor do we use it: bash: line 0: exec: podman: not found

architkulkarni commented 1 year ago

Hi @peterghaddad, I believe podman is required. You might be able to find some more details in this thread, but support is limited at the moment: https://discuss.ray.io/t/how-to-use-container-in-runtime-environments/6175/11

I don't expect this feature to have any special compatibility with Kubernetes. Like other runtime_env fields such as conda, it applies to worker processes, not to nodes launched by the autoscaler (which are also, unfortunately, called "workers"), so it shouldn't interact with the autoscaler at all.
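
For illustration, a runtime_env can be attached to an individual task or actor, which is why it is the worker process, not the node, that gets the environment. A sketch, reusing the hypothetical preinstalled env name from above:

import ray

ray.init()

# The runtime_env applies to the worker process that executes this task,
# on whichever node the task happens to be scheduled.
@ray.remote(runtime_env={'conda': 'my-existing-env'})
def task_in_env():
    return 'running in the preinstalled conda env'

print(ray.get(task_in_env.remote()))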

peterghaddad commented 1 year ago

Thanks for the response @architkulkarni. So the worker is what pulls the actual image, i.e. an image runs within an image when using KubeRay? It may make sense to have an integration for KubeRay where it launches a new Pod with the specified image, installs the environment dependencies, and then kicks off the job. Food for thought, but I think this would be robust when running in Kubernetes environments!

arneyjfs commented 7 months ago

Is there any update on this? I have exactly the same problem but unfortunately don't have a workaround.

arneyjfs commented 7 months ago

Here's a bit more info.

I'm attempting to start a job from a Python interactive environment. It's important to do it this way, as jobs will eventually be submitted by the Prefect job scheduler, which integrates with Ray via prefect-ray. Here is the Python code I am using:

import ray
import time
import logging
from ray.runtime_env import RuntimeEnv

logger = logging.getLogger()

env = RuntimeEnv(container={
    "image": "europe-west2-docker.pkg.dev/<GCP_PROJECT>/test-docker/test-prefect-ray:0.0.1b1",
    "run_options": ["--log-level=debug"]
})

ray.init("ray://<Server-IP>:10001", runtime_env=env)

@ray.remote
def square(x):
    logger.warning('Example log')
    return x * x

start = time.time()
object_references = [
    square.remote(item) for item in range(8)
]
data = ray.get(object_references)
print(data)

I have one node at the moment, the head node, which is a GCP Virtual Machine, started with ray start --head --port=6379 --dashboard-host=<Server-IP>

Logs

There's not too much useful information that I can see in the logs. As far as I can tell, the container is being downloaded on the head node, and from there it is struggling to reach the Ray server on the VM (the same machine the container is running on). Starting the container manually, I am at least able to ping the host IP from a bash session inside the container.
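
In case it's useful for debugging, here's a quick connectivity check (just a sketch; the host IP and ports are the ones from my logs) that can be run from a Python shell inside the manually started container:

import socket

# Check whether the Ray GCS port (6379) on the head node is reachable from
# inside the container, and whether anything is listening on the specific
# client-server port (23000).
for host, port in [('10.128.0.52', 6379), ('10.128.0.52', 23000)]:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(3)
    try:
        sock.connect((host, port))
        print(f'reachable: {host}:{port}')
    except OSError as exc:
        print(f'NOT reachable: {host}:{port} ({exc})')
    finally:
        sock.close()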

Output from ray_client_server_23000.err

... ^ truncated ^ ...
time="2024-03-26T11:02:16Z" level=debug msg="running conmon: /usr/libexec/podman/conmon" args="[--api-version 1 -c 35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35 -u 35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35 -r /usr/bin/crun -b /home/jamesarney/.local/share/containers/storage/overlay-containers/35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35/userdata -p /run/user/1006/containers/overlay-containers/35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35/userdata/pidfile -n focused_buck --exit-dir /run/user/1006/libpod/tmp/exits --full-attach -l journald --log-level debug --syslog --conmon-pidfile /run/user/1006/containers/overlay-containers/35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /home/jamesarney/.local/share/containers/storage --exit-command-arg --runroot --exit-command-arg /run/user/1006/containers --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg cgroupfs --exit-command-arg --tmpdir --exit-command-arg /run/user/1006/libpod/tmp --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg --syslog --exit-command-arg container --exit-command-arg cleanup --exit-command-arg 35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35]"
[conmon:d]: failed to write to /proc/self/oom_score_adj: Permission denied

time="2024-03-26T11:02:16Z" level=info msg="Failed to add conmon to cgroupfs sandbox cgroup: error creating cgroup for cpu: mkdir /sys/fs/cgroup/cpu/conmon: permission denied"
time="2024-03-26T11:02:16Z" level=debug msg="Received: 73931"
time="2024-03-26T11:02:16Z" level=info msg="Got Conmon PID as 73928"
time="2024-03-26T11:02:16Z" level=debug msg="Created container 35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35 in OCI runtime"
time="2024-03-26T11:02:16Z" level=debug msg="Attaching to container 35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35"
time="2024-03-26T11:02:16Z" level=debug msg="Starting container 35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35 with command [python -m ray.util.client.server --address=10.128.0.52:6379 --host=0.0.0.0 --port=23000 --mode=specific-server]"
time="2024-03-26T11:02:16Z" level=debug msg="Started container 35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35"
time="2024-03-26T11:02:16Z" level=debug msg="Enabling signal proxying"
2024-03-26 11:02:18,161   INFO server.py:885 -- Starting Ray Client server on 0.0.0.0:23000, args Namespace(host='0.0.0.0', port=23000, mode='specific-server', address='10.128.0.52:6379', redis_password=None, runtime_env_agent_address=None)
2024-03-26 11:02:23,208   INFO server.py:930 -- 25 idle checks before shutdown.
2024-03-26 11:02:28,221   INFO server.py:930 -- 20 idle checks before shutdown.
2024-03-26 11:02:33,233   INFO server.py:930 -- 15 idle checks before shutdown.
2024-03-26 11:02:38,244   INFO server.py:930 -- 10 idle checks before shutdown.
2024-03-26 11:02:43,256   INFO server.py:930 -- 5 idle checks before shutdown.
time="2024-03-26T11:02:48Z" level=debug msg="Called run.PersistentPostRunE(podman run -v /tmp/ray:/tmp/ray --cgroup-manager=cgroupfs --network=host --pid=host --ipc=host --userns=keep-id --env RAY_RAYLET_PID=69972 --env RAY_JOB_ID= --env RAY_CLIENT_MODE=0 --env RAY_LD_PRELOAD=1 --env RAY_NODE_ID=f1810c0e0436d3671a5d97bfd1583d77408d9605b7a186f6be6bb733 --env RAY_enable_pipe_based_agent_to_parent_health_check=1 --log-level=debug --entrypoint python europe-west2-docker.pkg.dev/biocortex-project/test-docker/test-prefect-ray:0.0.1b1 -m ray.util.client.server --address=10.128.0.52:6379 --host=0.0.0.0 --port=23000 --mode=specific-server)"

Output from ray_client_server.err

2024-03-25 19:25:36,955   INFO server.py:885 -- Starting Ray Client server on 0.0.0.0:10001, args Namespace(host='0.0.0.0', port=10001, mode='proxy', address='10.128.0.52:6379', redis_password=None, runtime_env_agent_address='http://10.128.0.52:56619')
2024-03-26 11:02:15,537   INFO proxier.py:696 -- New data connection from client afda9a422aa8463fad3f5dcf1f09ebe3:
2024-03-26 11:02:15,553   INFO proxier.py:223 -- Increasing runtime env reference for ray_client_server_23000.Serialized runtime env is {"container": {"image": "europe-west2-docker.pkg.dev/biocortex-project/test-docker/test-prefect-ray:0.0.1b1", "run_options": ["--log-level=debug"]}}.
2024-03-26 11:02:48,668   ERROR proxier.py:333 -- SpecificServer startup failed for client: afda9a422aa8463fad3f5dcf1f09ebe3
2024-03-26 11:02:48,669   INFO proxier.py:341 -- SpecificServer started on port: 23000 with PID: 73886 for client: afda9a422aa8463fad3f5dcf1f09ebe3
2024-03-26 11:02:48,669   ERROR proxier.py:707 -- Server startup failed for client: afda9a422aa8463fad3f5dcf1f09ebe3, using JobConfig: <ray.job_config.JobConfig object at 0x7f0c26b8c490>!
2024-03-26 11:02:56,925   INFO proxier.py:391 -- Specific server afda9a422aa8463fad3f5dcf1f09ebe3 is no longer running, freeing its port 23000
2024-03-26 11:03:18,673   ERROR proxier.py:380 -- Timeout waiting for channel for afda9a422aa8463fad3f5dcf1f09ebe3
Traceback (most recent call last):
  File "/home/jamesarney/.cache/pypoetry/virtualenvs/jamesarney-Ei4ktb2p-py3.10/lib/python3.10/site-packages/ray/util/client/server/proxier.py", line 375, in get_channel
    grpc.channel_ready_future(server.channel).result(
  File "/home/jamesarney/.cache/pypoetry/virtualenvs/jamesarney-Ei4ktb2p-py3.10/lib/python3.10/site-packages/grpc/_utilities.py", line 162, in result
    self._block(timeout)
  File "/home/jamesarney/.cache/pypoetry/virtualenvs/jamesarney-Ei4ktb2p-py3.10/lib/python3.10/site-packages/grpc/_utilities.py", line 106, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2024-03-26 11:03:18,677   INFO proxier.py:768 -- afda9a422aa8463fad3f5dcf1f09ebe3 last started stream at 1711450935.384032. Current stream started at 1711450935.384032.
2024-03-26 11:03:18,678   WARNING proxier.py:804 -- Retrying Logstream connection. 1 attempts failed.
2024-03-26 11:03:20,680   ERROR proxier.py:351 -- Unable to find channel for client: afda9a422aa8463fad3f5dcf1f09ebe3
2024-03-26 11:03:20,681   WARNING proxier.py:804 -- Retrying Logstream connection. 2 attempts failed.
2024-03-26 11:03:22,683   ERROR proxier.py:351 -- Unable to find channel for client: afda9a422aa8463fad3f5dcf1f09ebe3
2024-03-26 11:03:22,683   WARNING proxier.py:804 -- Retrying Logstream connection. 3 attempts failed.
2024-03-26 11:03:24,685   ERROR proxier.py:351 -- Unable to find channel for client: afda9a422aa8463fad3f5dcf1f09ebe3
2024-03-26 11:03:24,686   WARNING proxier.py:804 -- Retrying Logstream connection. 4 attempts failed.
2024-03-26 11:03:26,688   ERROR proxier.py:351 -- Unable to find channel for client: afda9a422aa8463fad3f5dcf1f09ebe3
2024-03-26 11:03:26,689   WARNING proxier.py:804 -- Retrying Logstream connection. 5 attempts failed.

tanguy-s commented 6 months ago

I am facing the same issue with ray==2.22.0 on Ubuntu 22.04, Podman version 4.6.2.

Is there any workaround or pending bug fix?

jjyao commented 4 months ago

@zcin could you take this one?