ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

"executable file `python` not found in $PATH" when using runtime_env container in cluster based on anyscale/ray-ml:nightly-py38-cpu image. #27734

Open onlyone2019 opened 2 years ago

onlyone2019 commented 2 years ago

What happened + What you expected to happen

ray job submit --address='http://192.168.0.192:8265' --runtime-env-json='{"working_dir":"./","container":{"image": "anyscale/ray-ml:nightly-py38-cpu", "worker_path": "/root/python/ray/workers/default_worker.py", "run_options": ["--cap-drop SYS_ADMIN","--log-level=debug"]}}' -- python ./debug.py
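The inline JSON in the command above is easy to mis-quote in a shell. As a minimal sketch (not part of the original report), the same arguments can be assembled in Python so that `json.dumps` handles the quoting. The helper name `build_submit_command` is hypothetical; the flags and values are copied from the command above.

```python
import json
import shlex


def build_submit_command(address, runtime_env, entrypoint):
    """Return the argv list for `ray job submit`, with runtime_env serialized as JSON.

    Hypothetical helper for illustration; it only builds the command and does not run it.
    """
    return [
        "ray", "job", "submit",
        f"--address={address}",
        f"--runtime-env-json={json.dumps(runtime_env)}",
        "--",
        *entrypoint,
    ]


# Same runtime_env as in the report, written as a Python dict.
runtime_env = {
    "working_dir": "./",
    "container": {
        "image": "anyscale/ray-ml:nightly-py38-cpu",
        "worker_path": "/root/python/ray/workers/default_worker.py",
        "run_options": ["--cap-drop SYS_ADMIN", "--log-level=debug"],
    },
}

argv = build_submit_command(
    "http://192.168.0.192:8265", runtime_env, ["python", "./debug.py"]
)
# shlex.join produces a shell-safe one-liner that can be pasted into a terminal.
print(shlex.join(argv))
```

Passing `argv` to `subprocess.run` would submit the job without going through a shell at all, which avoids the quoting question entirely.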

I submitted a job with the command above, but I never got the result of f(x). The job appears to hang while building the runtime_env. In addition, raylet.err reports "executable file `python` not found in $PATH", and I don't know how to fix it.

This is the output of the submit command:

Job submission server address: http://192.168.0.192:8265
2022-08-10 14:36:41,777 INFO dashboard_sdk.py:319 -- Package gcs://_ray_pkg_698a6544fb43c3a9.zip already exists, skipping upload.

-------------------------------------------------------
Job 'raysubmit_G3syLLn4YmfZ28um' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_G3syLLn4YmfZ28um
  Query the status of the job:
    ray job status raysubmit_G3syLLn4YmfZ28um
  Request the job to be stopped:
    ray job stop raysubmit_G3syLLn4YmfZ28um

Tailing logs until the job exits (disable with --no-wait):

I checked the job status with ray job status raysubmit_G3syLLn4YmfZ28um:

Job submission server address: None
2022-08-10 14:42:06,737 INFO dashboard_sdk.py:129 -- No address provided, defaulting to http://localhost:8265.
Status for job 'raysubmit_G3syLLn4YmfZ28um': PENDING
Status message: Job has not started yet, likely waiting for the runtime_env to be set up.

I got some error messages from raylet.err:

time="2022-08-10T14:43:46+08:00" level=debug msg="Received: -1"
time="2022-08-10T14:43:46+08:00" level=debug msg="Cleaning up container 5076a7f5e866e4d7f1afa374a36dcc5c5b561477172319c54ebb023b08f45c83"
time="2022-08-10T14:43:46+08:00" level=debug msg="Network is already cleaned up, skipping..."
time="2022-08-10T14:43:53+08:00" level=debug msg="unmounted container \"5076a7f5e866e4d7f1afa374a36dcc5c5b561477172319c54ebb023b08f45c83\""
time="2022-08-10T14:43:54+08:00" level=debug msg="ExitCode msg: \"executable file `python` not found in $path: no such file or directory: oci runtime attempted to invoke a command that was not found\""
Error: executable file `python` not found in $PATH: No such file or directory: OCI runtime attempted to invoke a command that was not found
[2022-08-10 14:43:58,130 E 1481817 1481817] (raylet) worker_pool.cc:500: Some workers of the worker process(1487985) have not registered within the timeout. The process is dead, probably it crashed during start.
time="2022-08-10T14:43:58+08:00" level=warning msg="Error validating CNI config file /home/wangjie/.config/cni/net.d/87-podman.conflist: [failed to find plugin \"bridge\" in path [/usr/local/libexec/cni /usr/libexec/cni /usr/local/lib/cni /usr/lib/cni /opt/cni/bin] failed to find plugin \"firewall\" in path [/usr/local/libexec/cni /usr/libexec/cni /usr/local/lib/cni /usr/lib/cni /opt/cni/bin]]"
Error: executable file `python` not found in $PATH: No such file or directory: OCI runtime attempted to invoke a command that was not found
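The error above says the container runtime could not resolve `python` on the `PATH` it was given. One way to narrow this down, assuming a script can be run inside the same image, is to print the `PATH` and check which interpreter names resolve. This is a diagnostic sketch only; the function name `resolve_interpreters` is hypothetical and nothing here is specific to Ray.

```python
import os
import shutil


def resolve_interpreters(names=("python", "python3"), path=None):
    """Map each executable name to its resolved location on PATH, or None if absent."""
    return {name: shutil.which(name, path=path) for name in names}


if __name__ == "__main__":
    # Show the PATH this process actually sees, then what each name resolves to.
    print("PATH =", os.environ.get("PATH", ""))
    for name, location in resolve_interpreters().items():
        print(f"{name}: {location or 'NOT FOUND'}")
```

If `python` resolves in an interactive shell inside the image but not here, the `PATH` handed to the worker process likely differs from the shell's `PATH`.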

Versions / Dependencies

ray: 3.0.0.dev0
python: 3.8

Reproduction script

debug.py

import ray
ray.init()

@ray.remote
def f(x):
    return x * x

futures = [f.remote(i) for i in range(2)]
print(ray.get(futures))

Issue Severity

High: It blocks me from completing my task.

jjyao commented 2 years ago

Is this the same question that was asked on Discuss? https://discuss.ray.io/t/how-does-container-in-runtime-env-work/7108

onlyone2019 commented 2 years ago

@jjyao yes, I posted it on Tuesday.

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

wuisawesome commented 1 year ago

Marking as p2 since containers are still experimental/alpha. Perhaps @SongGuyang has an update though?

anyscalesam commented 4 months ago

@jjyao - this is old, but we can close it, right? Containers for runtime_envs are now supported.