JacksonCakes opened 10 months ago
My issue might be related: https://github.com/ray-project/ray/issues/39565
@JacksonCakes does the issue happen consistently, and does it happen with Ray 2.7.1? If it happens again, can you send a zip of all the logs from the head node (`/tmp/ray/session_latest/logs`)?
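One way to gather those logs, as a minimal sketch assuming the cluster config file from this thread and SSH access to the head node (the archive name is illustrative):

```bash
# Open a shell on the head node via the cluster launcher config.
ray attach cluster_config.yml

# On the head node: archive the current session's logs.
zip -r ray-head-logs.zip /tmp/ray/session_latest/logs
```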
Also having this issue.
any update?
I debugged this issue by SSH'ing into the worker nodes and found that the default volume size in the cluster launcher is too small for some common Docker container images (like nvcr.io/nvidia/pytorch:24.02-py3 with a few additional dependencies).
Similarly, the default timeout may be too low, so worker nodes can get initialized and then stuck in an endless loop of being killed after hitting the timeout limit, while also running out of disk space due to the default volume size limit. A config sketch for the volume size is below.
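As a sketch of the mitigation, assuming an AWS cluster launcher config: the `BlockDeviceMappings` block below follows the pattern in Ray's example AWS YAMLs to request a larger root EBS volume (the node type name, instance type, device name, and 200 GiB size are illustrative, not recommendations). Which setting governs the kill timeout depends on the Ray version; the `AUTOSCALER_HEARTBEAT_TIMEOUT_S` environment variable is one knob the autoscaler monitor reads, but treat that as an assumption to verify against your version.

```yaml
# Fragment of a cluster launcher config (AWS); names and sizes are illustrative.
available_node_types:
  ray.worker.default:
    node_config:
      InstanceType: g4dn.xlarge
      # Request a larger root volume so large images like
      # nvcr.io/nvidia/pytorch:24.02-py3 fit with room to spare.
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            VolumeSize: 200  # GiB; the default is much smaller
```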
same issue
@JacksonCakes @shuiqingliu @ajaichemmanam did upping the volume size as well as extending the default timeout mitigate the problem?
@jaanphare do you think we should up the defaults a bit? If so what size/timeout secs would you recommend?
I will give it a try.
What happened + What you expected to happen
When I run `ray up -y cluster_config.yml --no-config-cache`, it seems like the cluster has been set up successfully based on the output below, but even after waiting for more than 10 minutes it still does not manage to start. I checked with `ray status` but got the following output; `ray monitor cluster_config.yaml` is stuck too. I am not sure if there is a problem with my config file below, but I can run successfully with the same config if I delete `/tmp/ray/cluster-default.state` and re-run `ray up` (a sketch of the workaround follows).
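A minimal sketch of that workaround, using the exact path and command from this report:

```bash
# Remove the stale cluster state file left by the previous launch attempt.
rm /tmp/ray/cluster-default.state

# Relaunch the cluster with the same config.
ray up -y cluster_config.yml --no-config-cache
```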
Versions / Dependencies
python==3.8.15 ray==2.6.1
Reproduction script
Issue Severity
High: It blocks me from completing my task.