Open wjzhou-ep opened 1 year ago
I'm running into the same intermittent failure with Python 3.10 and Ray 2.4.0, using the rayproject/ray-ml:2.4.0-py310-gpu
image with some additional packages and pip installations. It fails after ~3 seconds.
@wjzhou-ep @nate-bush, could you tell me how I can reproduce this?
Are you using ray client? How do you use runtime env? Do you have job level or task/actor level runtime env?
Sorry, we don't have a reliable reproduction script yet.
Are you using ray client? We are using ray client from an IPython notebook.
How do you use runtime env? We are using a local folder, something like:
client = ray.init(
    runtime_env={
        "working_dir": _get_working_dir(),
        "excludes": ["*.ipynb", "tests", "folder_tmp", "*_test.py", "*.html"],
    },
)
Do you have job level or task/actor level runtime env? It's a job-level runtime env.
For example, we have an environment where the working_dir content has not changed for a long time and which is used by multiple clients. Other than that, I can't see anything special about our setup.
FYI, and please take this with a grain of salt: switching to external Redis (KubeRay FT) might have fixed the problem, but we are not sure; we might just have been lucky.
I'm seeing the same issue every time I start new jobs in a scaled-to-0 cluster with a somewhat large codebase. I normally have to restart the script once the node is up and the runtime env has failed to install for unknown reasons (possibly the missing file). I'm willing to spend some time debugging what goes wrong, but would love some pointers on how things are put together and how to debug it.
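Until the root cause is found, the manual "restart the script" step could be automated with a small retry wrapper. This is only a sketch of a generic workaround, not a fix and not part of Ray: `retry`, `attempts`, `delay_s`, and `retry_on` are hypothetical names, and which exception to retry on is an assumption (for a runtime env failure it may be `ray.exceptions.RuntimeEnvSetupError`, but check what your job actually raises).

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    fn: Callable[[], T],
    attempts: int = 3,
    delay_s: float = 5.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Call fn() until it succeeds or `attempts` calls have failed.

    Hypothetical helper: only exceptions listed in `retry_on` are retried;
    the last failure is re-raised unchanged.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            time.sleep(delay_s)  # give the new node time to come up
    raise AssertionError("unreachable")
```

Usage would be something like `retry(lambda: ray.get(my_task.remote()), attempts=3, delay_s=30)`, where `my_task` stands in for your own remote function. Narrowing `retry_on` to the specific setup error avoids hiding unrelated bugs.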
What happened + What you expected to happen
Failed to download runtime_env file package gcs://_ray_pkg_xxxx.zip.
The workaround is to change any file in the folder and re-run remote function.
Expected: not seeing this error.
Info: It seems that, somehow, the client thinks the GCS has the runtime_uri while the GCS doesn't have it. Changing a file locally changes the hash, so the client uploads the package again.
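To illustrate why editing any file works around the problem, here is a minimal sketch of content-addressed packaging. It is not Ray's actual hashing code: `directory_content_hash` and `package_uri` are hypothetical helpers, SHA-1 is an arbitrary choice, and the URI layout is only modeled on the error message above. The point is that the URI is derived from the directory contents, so identical contents map to the same URI (which the client may assume is already uploaded), while any change produces a new URI and forces a fresh upload.

```python
import hashlib
from pathlib import Path


def directory_content_hash(root: str) -> str:
    """Hash relative paths and file bytes under `root` in sorted order.

    Hypothetical helper for illustration; Ray's real implementation differs.
    """
    digest = hashlib.sha1()
    root_path = Path(root)
    files = sorted(p for p in root_path.rglob("*") if p.is_file())
    for path in files:
        # Include the relative path so renames also change the hash.
        digest.update(path.relative_to(root_path).as_posix().encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def package_uri(root: str) -> str:
    # Assumed URI layout, modeled on "gcs://_ray_pkg_xxxx.zip" from the error.
    return f"gcs://_ray_pkg_{directory_content_hash(root)}.zip"
```

If the client only checks whether a URI is known before deciding to upload, an unchanged working_dir would keep resolving to the same, possibly already deleted, package. That would match the observed behavior, but it remains a guess.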
Our env is about 1 MB, so I think the 10 min should be long enough for it.
Also, not sure if related, I see these in the runtime_env_agent.log
full log:
Versions / Dependencies
Ray: 2.4.0, Python: 3.10.0, OS: Ubuntu 20.04 in Docker, KubeRay: 0.5.0
Reproduction script
Don't know how to reproduce this reliably.
Issue Severity
Low: It annoys or frustrates me.