Open igorgad opened 2 years ago
cc @SongGuyang in case there are any workarounds for the container issue.
As another possible workaround: you mentioned conda takes 10 minutes to install. If the conda environment isn't changing often, would it fit your use case to preinstall the conda environment and then just specify the name of the existing environment in the `runtime_env`? E.g. `runtime_env={"conda": "my-existing-env"}`. Then it would just be activating the existing environment at runtime instead of installing it, so it should be faster.
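For concreteness, a minimal sketch of that workaround (assuming an environment named `my-existing-env` has already been created, with the same name, on every node):

```python
import ray

# Minimal sketch: "my-existing-env" is a placeholder for a conda environment
# that was preinstalled (with the same name) on every node of the cluster.
# Ray activates the existing environment for the job's workers instead of
# building a new one, so startup should be much faster.
ray.init(runtime_env={"conda": "my-existing-env"})

@ray.remote
def which_python():
    import sys
    return sys.executable  # should point inside the preinstalled conda env

print(ray.get(which_python.remote()))
```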
Hey @architkulkarni, thanks for your quick reply.
Yes, it's an alternative. I'm curious, though: does preinstalling the conda environment on the head node make it shareable with new workers? If not, it would take a considerable amount of time to install the conda environment on new workers, unless it is otherwise installed in the cluster's base image. The problem at the moment is that we are trying to run a more generic cluster that serves multiple projects through the use of runtime environments.
Ah no, you would need the conda environment to be on all the nodes of the cluster and have the same name on all nodes.
@architkulkarni I am experiencing issues when trying to test the Alpha Container Runtime feature. Is podman a necessary dependency? I noticed the container runtime is specified in the code (see this issue: https://github.com/ray-project/ray/issues/29665).
We are using KubeRay + CRI-O as our container runtime on Kubernetes. Is the expectation for this feature to have the autoscaler launch a new worker? Does this work natively with existing Kubernetes architectures?
We don't have Podman installed, nor do we use it: `bash: line 0: exec: podman: not found`
Hi @peterghaddad, I believe `podman` is required. You might be able to find some more details in this thread, but support is limited at the moment: https://discuss.ray.io/t/how-to-use-container-in-runtime-environments/6175/11
I don't expect that this feature has any special compatibility with Kubernetes. Like other `runtime_env` fields such as `conda`, this feature would be for worker processes, not nodes launched by the autoscaler (which are also unfortunately called "workers"), so it shouldn't have any interaction with the autoscaler.
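For illustration, a rough sketch of that scoping (the image name is a placeholder, and container support is still experimental):

```python
import ray

# The container runtime_env affects the Ray worker *processes* that execute
# this job's tasks and actors; it does not create new nodes or pods, so the
# autoscaler (and any KubeRay pod specs) are unaffected by it.
# The image below is a placeholder.
ray.init(runtime_env={
    "container": {
        "image": "docker.io/myorg/my-image:latest",
    }
})

@ray.remote
def describe_runtime():
    import platform
    return platform.platform()  # runs inside the container image above

print(ray.get(describe_runtime.remote()))
```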
Thanks for the response @architkulkarni. So the worker is what pulls the actual image, i.e. an image runs within an image when using KubeRay? It may make sense to have an integration for KubeRay where it launches a new Pod with the specified image, installs environment dependencies, then kicks off a job. Food for thought, but I think this would be robust when running in K8s environments!
Is there any update on this? I have exactly the same problem but don't have a workaround unfortunately
I'm attempting to start a job from a Python interactive environment. It's important to do it this way, as jobs will eventually be submitted by the Prefect job scheduler, which integrates with Ray via prefect-ray. Here is the Python code I am using:
```python
import ray
import time
import logging

from ray.runtime_env import RuntimeEnv

logger = logging.getLogger()

env = RuntimeEnv(container={
    "image": "europe-west2-docker.pkg.dev/<GCP_PROJECT>/test-docker/test-prefect-ray:0.0.1b1",
    "run_options": ["--log-level=debug"],
})

ray.init("ray://<Server-IP>:10001", runtime_env=env)

@ray.remote
def square(x):
    logger.warning('Example log')
    return x * x

start = time.time()

object_references = [
    square.remote(item) for item in range(8)
]
data = ray.get(object_references)

print(data)
```
I have one node at the moment, the head node, which is a GCP virtual machine, started with `ray start --head --port=6379 --dashboard-host=<Server-IP>`.
There's not much useful information I can see in the logs. As far as I can tell, the container is being downloaded on the head node and is then struggling to reach the Ray server on the VM (the same machine the container is running on). Starting this container manually, I am at least able to ping the host IP from a container bash session.
Output from `ray_client_server_23000.err`:
```
... ^ truncated ^ ...
time="2024-03-26T11:02:16Z" level=debug msg="running conmon: /usr/libexec/podman/conmon" args="[--api-version 1 -c 35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35 -u 35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35 -r /usr/bin/crun -b /home/jamesarney/.local/share/containers/storage/overlay-containers/35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35/userdata -p /run/user/1006/containers/overlay-containers/35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35/userdata/pidfile -n focused_buck --exit-dir /run/user/1006/libpod/tmp/exits --full-attach -l journald --log-level debug --syslog --conmon-pidfile /run/user/1006/containers/overlay-containers/35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /home/jamesarney/.local/share/containers/storage --exit-command-arg --runroot --exit-command-arg /run/user/1006/containers --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg cgroupfs --exit-command-arg --tmpdir --exit-command-arg /run/user/1006/libpod/tmp --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg --syslog --exit-command-arg container --exit-command-arg cleanup --exit-command-arg 35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35]"
[conmon:d]: failed to write to /proc/self/oom_score_adj: Permission denied
time="2024-03-26T11:02:16Z" level=info msg="Failed to add conmon to cgroupfs sandbox cgroup: error creating cgroup for cpu: mkdir /sys/fs/cgroup/cpu/conmon: permission denied"
time="2024-03-26T11:02:16Z" level=debug msg="Received: 73931"
time="2024-03-26T11:02:16Z" level=info msg="Got Conmon PID as 73928"
time="2024-03-26T11:02:16Z" level=debug msg="Created container 35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35 in OCI runtime"
time="2024-03-26T11:02:16Z" level=debug msg="Attaching to container 35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35"
time="2024-03-26T11:02:16Z" level=debug msg="Starting container 35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35 with command [python -m ray.util.client.server --address=10.128.0.52:6379 --host=0.0.0.0 --port=23000 --mode=specific-server]"
time="2024-03-26T11:02:16Z" level=debug msg="Started container 35c98171fdc91debf0b4bd87f196408f29f626cd1c31bc56434f54fc50c6fe35"
time="2024-03-26T11:02:16Z" level=debug msg="Enabling signal proxying"
2024-03-26 11:02:18,161 INFO server.py:885 -- Starting Ray Client server on 0.0.0.0:23000, args Namespace(host='0.0.0.0', port=23000, mode='specific-server', address='10.128.0.52:6379', redis_password=None, runtime_env_agent_address=None)
2024-03-26 11:02:23,208 INFO server.py:930 -- 25 idle checks before shutdown.
2024-03-26 11:02:28,221 INFO server.py:930 -- 20 idle checks before shutdown.
2024-03-26 11:02:33,233 INFO server.py:930 -- 15 idle checks before shutdown.
2024-03-26 11:02:38,244 INFO server.py:930 -- 10 idle checks before shutdown.
2024-03-26 11:02:43,256 INFO server.py:930 -- 5 idle checks before shutdown.
time="2024-03-26T11:02:48Z" level=debug msg="Called run.PersistentPostRunE(podman run -v /tmp/ray:/tmp/ray --cgroup-manager=cgroupfs --network=host --pid=host --ipc=host --userns=keep-id --env RAY_RAYLET_PID=69972 --env RAY_JOB_ID= --env RAY_CLIENT_MODE=0 --env RAY_LD_PRELOAD=1 --env RAY_NODE_ID=f1810c0e0436d3671a5d97bfd1583d77408d9605b7a186f6be6bb733 --env RAY_enable_pipe_based_agent_to_parent_health_check=1 --log-level=debug --entrypoint python europe-west2-docker.pkg.dev/biocortex-project/test-docker/test-prefect-ray:0.0.1b1 -m ray.util.client.server --address=10.128.0.52:6379 --host=0.0.0.0 --port=23000 --mode=specific-server)"
```
Output from `ray_client_server.err`:
```
2024-03-25 19:25:36,955 INFO server.py:885 -- Starting Ray Client server on 0.0.0.0:10001, args Namespace(host='0.0.0.0', port=10001, mode='proxy', address='10.128.0.52:6379', redis_password=None, runtime_env_agent_address='http://10.128.0.52:56619')
2024-03-26 11:02:15,537 INFO proxier.py:696 -- New data connection from client afda9a422aa8463fad3f5dcf1f09ebe3:
2024-03-26 11:02:15,553 INFO proxier.py:223 -- Increasing runtime env reference for ray_client_server_23000.Serialized runtime env is {"container": {"image": "europe-west2-docker.pkg.dev/biocortex-project/test-docker/test-prefect-ray:0.0.1b1", "run_options": ["--log-level=debug"]}}.
2024-03-26 11:02:48,668 ERROR proxier.py:333 -- SpecificServer startup failed for client: afda9a422aa8463fad3f5dcf1f09ebe3
2024-03-26 11:02:48,669 INFO proxier.py:341 -- SpecificServer started on port: 23000 with PID: 73886 for client: afda9a422aa8463fad3f5dcf1f09ebe3
2024-03-26 11:02:48,669 ERROR proxier.py:707 -- Server startup failed for client: afda9a422aa8463fad3f5dcf1f09ebe3, using JobConfig: <ray.job_config.JobConfig object at 0x7f0c26b8c490>!
2024-03-26 11:02:56,925 INFO proxier.py:391 -- Specific server afda9a422aa8463fad3f5dcf1f09ebe3 is no longer running, freeing its port 23000
2024-03-26 11:03:18,673 ERROR proxier.py:380 -- Timeout waiting for channel for afda9a422aa8463fad3f5dcf1f09ebe3
Traceback (most recent call last):
  File "/home/jamesarney/.cache/pypoetry/virtualenvs/jamesarney-Ei4ktb2p-py3.10/lib/python3.10/site-packages/ray/util/client/server/proxier.py", line 375, in get_channel
    grpc.channel_ready_future(server.channel).result(
  File "/home/jamesarney/.cache/pypoetry/virtualenvs/jamesarney-Ei4ktb2p-py3.10/lib/python3.10/site-packages/grpc/_utilities.py", line 162, in result
    self._block(timeout)
  File "/home/jamesarney/.cache/pypoetry/virtualenvs/jamesarney-Ei4ktb2p-py3.10/lib/python3.10/site-packages/grpc/_utilities.py", line 106, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2024-03-26 11:03:18,677 INFO proxier.py:768 -- afda9a422aa8463fad3f5dcf1f09ebe3 last started stream at 1711450935.384032. Current stream started at 1711450935.384032.
2024-03-26 11:03:18,678 WARNING proxier.py:804 -- Retrying Logstream connection. 1 attempts failed.
2024-03-26 11:03:20,680 ERROR proxier.py:351 -- Unable to find channel for client: afda9a422aa8463fad3f5dcf1f09ebe3
2024-03-26 11:03:20,681 WARNING proxier.py:804 -- Retrying Logstream connection. 2 attempts failed.
2024-03-26 11:03:22,683 ERROR proxier.py:351 -- Unable to find channel for client: afda9a422aa8463fad3f5dcf1f09ebe3
2024-03-26 11:03:22,683 WARNING proxier.py:804 -- Retrying Logstream connection. 3 attempts failed.
2024-03-26 11:03:24,685 ERROR proxier.py:351 -- Unable to find channel for client: afda9a422aa8463fad3f5dcf1f09ebe3
2024-03-26 11:03:24,686 WARNING proxier.py:804 -- Retrying Logstream connection. 4 attempts failed.
2024-03-26 11:03:26,688 ERROR proxier.py:351 -- Unable to find channel for client: afda9a422aa8463fad3f5dcf1f09ebe3
2024-03-26 11:03:26,689 WARNING proxier.py:804 -- Retrying Logstream connection. 5 attempts failed.
```
I am facing the same issue with ray==2.22.0 on Ubuntu 22.04.
Podman version: 4.6.2
Is there any workaround or pending bug fix?
@zcin could you take this one?
What happened + What you expected to happen
Hi,
Even though runtime_env containers are still experimental, I've been having success using them at the job level, in Ray applications launched inside the cluster with job submission, i.e. the script that runs on the cluster does `ray.init(runtime_env={'container': ...})`. That being said, I don't think there's anything wrong with the podman setup on my custom cluster images, which inherit from rayproject/ray:2.0.0-py38.
However, using runtime_env containers with Ray Client for interactive development leads to the following errors during the initialization of the Ray Client server.
The file `ray_client_server_23000.err` contains … I can find more info in `ray_client_server.err` … Also, in `runtime_env_setup-ray_client_server_23000.log` I could find …
I think this issue is related to the connection between the client proxy and the client server, which seems to run in the container; however, as stated in the logs, the container is created with the `--net host` flag. I wonder if someone from the Ray team could point me towards a workaround, or some documentation regarding the setup of the client servers, as I am willing to contribute.
Regarding issue severity, I'll leave it at Medium, since my only alternatives are: …
Thanks.
Versions / Dependencies
About ray
Podman installed on cluster base image
Reproduction script
Issue Severity
Medium: It is a significant difficulty but I can work around it.