Open ccruttjr opened 6 months ago
I think it's the ports. By default, Ray selects random ports for internal traffic, and these port numbers change every time you start it. So you can't decide which ports to forward based on the results of a single run.
https://docs.ray.io/en/latest/ray-core/configure.html#ports-configurations
You can set fixed port numbers based on this doc and see if it works.
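As a rough illustration of the suggestion above, the sketch below keeps one port list and derives the Docker publish flags from it, so the ports Ray is pinned to and the ports Docker forwards cannot drift apart. The helper name `publish_flags` and the exact port plan are assumptions made for this example; only the `ray start` flags already used in this thread, plus `--port`, `--ray-client-server-port` and `--min-worker-port`, appear. Nothing is started here; the commands are only printed.

```shell
# Print "-p PORT:PORT" for each argument, space-separated on one line.
# publish_flags is a hypothetical helper, not part of Ray or Docker.
publish_flags() {
  flags=""
  for port in "$@"; do
    flags="${flags:+$flags }-p ${port}:${port}"
  done
  printf '%s\n' "$flags"
}

# Hypothetical fixed port plan (values chosen for illustration only).
PORTS="6379 10001 10002 10003 10004 10005 10006 10007 10008"

# Print, rather than run, the matching commands.
echo "ray start --head --port 6379 --ray-client-server-port 10001 --min-worker-port 10002 --max-worker-port 10005 --node-manager-port 10006 --object-manager-port 10007 --runtime-env-agent-port 10008"
# Word splitting of $PORTS is intentional here.
echo "docker publish flags: $(publish_flags $PORTS)"
```

Whether pinning these ports is sufficient for a containerized worker is not confirmed by this thread; the sketch only keeps the two lists consistent.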
Hmm, but why does the worker connect initially and then stop? Wouldn't it just never connect? Anyway, I also tried --network=host and --publish-all, to no avail, if that was supposed to fix something.
I also tried running the following, which I believe is what you were referring to in the link above, but got the same results:
ray start --head --max-worker-port 10005 --node-manager-port 10006 --object-manager-port 10007 --runtime-env-agent-port 10008
Edit: I'm also now using ray==2.22.0 instead of 2.20.0.
docker run -it --gpus all --ipc=host -p 8000:8000 -p 6379:6379 -p 10001:10001 -p 10003:10003 -p 10004:10004 -p 10005:10005 -p 10006:10006 -p 10007:10007 -p 10008:10008 -p 10009:10009 -p 10010:10010 -p 10011:10011 -p 10012:10012 -p 10013:10013 -p 10014:10014 -p 33189:33189 -p 38065:38065 -p 44217:44217 -p 63051:63051 --name my-python-gpu-container my-python-cuda-app
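One thing that may be worth checking with a long `-p` list like the one above is whether it covers every port in the range Ray workers can bind to. The sketch below is a plain shell helper, `missing_ports` (a hypothetical name), that reports which ports in a range are absent from a published list. The lower bound of 10002 used in the example is an assumption, since the `ray start` command in this thread sets only `--max-worker-port`.

```shell
# List every port in [LO, HI] that is absent from the published list.
# Usage: missing_ports LO HI PORT...
# missing_ports is a hypothetical helper, not part of Ray or Docker.
missing_ports() {
  lo=$1
  hi=$2
  shift 2
  p=$lo
  missing=""
  while [ "$p" -le "$hi" ]; do
    found=0
    for q in "$@"; do
      if [ "$q" -eq "$p" ]; then
        found=1
      fi
    done
    if [ "$found" -eq 0 ]; then
      missing="${missing:+$missing }$p"
    fi
    p=$((p + 1))
  done
  printf '%s\n' "$missing"
}

# Assumed worker range 10002-10005 checked against the low-numbered
# ports published in the docker run command above.
missing_ports 10002 10005 10001 10003 10004 10005
```

An empty line of output means every port in the range is published. This only compares lists; it says nothing about whether the ports are actually reachable.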
It turns out you can't easily dockerize a Ray worker because we have many different interconnection requirements. Can you try this and see if it works?
docker run -it -v /ray/tmp:/ray/tmp --gpus all --ipc=host --pid=host --network=host --userns=keep-id --env-file <(env) --name my-python-gpu-container my-python-cuda-app
What happened + What you expected to happen
Without Docker, my two computers communicate correctly. It also works fine if I run Ray in a Docker container on one computer and connect to it from another computer that is not using Docker. If both computers are interacting via Docker instances, or the Docker container is not the head, it works for a time, but then the worker Docker container stops connecting to the head. I can tell from the output of ray status. More details and reproduction steps are below.
Versions / Dependencies
ray==2.20.0
Reproduction script
How to easily reproduce
This works (straight computer to computer):
This semi-works (Dockerfile):
What is happening in Docker that isn't happening on the "normal" computer? Is it putting the process to sleep? As a side note, stopping a worker instance while it is connected to the head usually stops 2 Ray processes. Stopping Ray in the Docker container after the head sees only one node, however, stops only one process.
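To narrow down when the worker drops out, one option is to log the cluster state at a fixed interval and look at the timestamp of the first sample where the worker is missing. The sketch below is a generic polling helper. `poll_status` is a hypothetical name, and the interval and sample count in the example are arbitrary. It prints whatever the given command outputs without parsing it, so it makes no assumptions about the format of `ray status`.

```shell
# Run a command COUNT times, INTERVAL seconds apart, printing a UTC
# timestamp before each run. Usage: poll_status INTERVAL COUNT CMD [ARGS...]
# poll_status is a hypothetical helper, not part of Ray.
poll_status() {
  interval=$1
  count=$2
  shift 2
  i=0
  while [ "$i" -lt "$count" ]; do
    printf '=== %s ===\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    # Keep polling even if the command fails, and record its exit status.
    "$@" 2>&1 || printf 'command exited with status %s\n' "$?"
    i=$((i + 1))
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
  done
}

# Example (not run here): sample the cluster every 10 seconds, 60 times.
# poll_status 10 60 ray status
```

Running this on the head while the worker container is up should show roughly how long after startup the node count changes, which can then be matched against the worker's logs.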
Issue Severity
None