ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Ray Cluster does not work across multiple docker containers #45252

Open ccruttjr opened 4 months ago

ccruttjr commented 4 months ago

What happened + What you expected to happen

Without Docker, my two computers communicate correctly. Likewise, if Ray runs in a Docker container on one machine and the other computer connects to it without Docker, it works fine. But if both computers run Ray inside Docker containers, or if the Docker container is not the head, it works for a while and then the worker's Docker container stops connecting to the head. I can see this with ray status. More detail and reproduction steps are below.

Versions / Dependencies

ray==2.20.0

Reproduction script

How to easily reproduce

This works (straight computer to computer):

# On Computer 1
ray start --head # Local node IP: 192.168.250.20
# On Computer 1
ray status # Shows one node
# On Computer 2 on same network
ray start --address='192.168.250.20:6379'
# On Computer 2, wait a few seconds then
ray status # shows two nodes
# On Computer 1, quickly
ray status # shows two nodes
# On Computer 1, wait a bit and then
ray status # shows two nodes
# On both computers
ray stop # should stop all processes :)

This semi-works. Dockerfile:

FROM nvcr.io/nvidia/pytorch:24.04-py3
WORKDIR /app
CMD ["bash"]
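# Build the image and start the container (run on the host)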
docker build -t my-python-cuda-app .
# I know the port forwarding is clunky and overkill but just wanted to be sure
docker run -it --gpus all --ipc=host -p 8000:8000 -p 6379:6379 -p 10001:10001 -p 10003:10003 -p 10004:10004 -p 10005:10005 -p 10006:10006 -p 10007:10007 -p 10008:10008 -p 10009:10009 -p 10010:10010 -p 10011:10011 -p 10012:10012 -p 10013:10013 -p 10014:10014 -p 33189:33189 -p 38065:38065 -p 44217:44217 -p 63051:63051 --name my-python-gpu-container my-python-cuda-app
# Now in docker instance
pip install ray==2.20.0
ray start --head # should give different ip
# On Computer 2
ray start --address='192.168.250.20:6379' # Still use the host computer's ip
# Run ray status like above and see two nodes are connected and staying connected
# Now stop both ray instances and make Computer 2 the head and Docker the worker
# If you do ray status soon after adding the Docker worker, it will show two nodes.
# If you wait a bit, however, it will only show Computer 2's node - the head
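
Spelled out, that failing arrangement looks roughly like this (Computer 2's IP is hypothetical, just for illustration):

# On Computer 2 (head, no Docker)
ray stop
ray start --head # say Local node IP: 192.168.250.30
# Inside the Docker container on Computer 1 (worker)
ray stop
ray start --address='192.168.250.30:6379'
# On Computer 2, right away
ray status # shows two nodes
# On Computer 2, after waiting a bit
ray status # shows only the head node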

What is happening in Docker that isn't happening on the "normal" computer? Is it putting the process to sleep? As a side note, stopping the worker instance while it is still connected to the head usually stops 2 Ray processes. Stopping Ray in the Docker container after the head only sees one node, however, shows it stopping only one process.
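
For reference, a rough way to compare what is still running in each environment before and after the node drops out (process names assumed from a typical ray start layout):

# Inside the Docker worker container
ps aux | grep raylet | grep -v grep       # the worker-side Ray daemon
ray status                                # what this node thinks the cluster looks like
# On the head node, the GCS should also be running
ps aux | grep gcs_server | grep -v grep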

Issue Severity

None

rynewang commented 4 months ago

I think it's the ports. By default Ray picks some random ports to serve internal traffic, and these port numbers change every time you start, so you can't forward ports based on one run's results.

https://docs.ray.io/en/latest/ray-core/configure.html#ports-configurations

You can set fixed port numbers based on this doc and see if it works.
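
For example, assuming the Docker container is the worker node, something like this might work (an untested sketch; the flag names are the ones listed in that doc, the port numbers are arbitrary, and the exact set of ports needed may vary by Ray version):

# On the host: publish only the fixed ports you are about to pin
docker run -it --gpus all --ipc=host \
  -p 10002-10007:10002-10007 -p 10100-10120:10100-10120 \
  --name my-python-gpu-container my-python-cuda-app
# Inside the container: pin every worker-side port to those same numbers
pip install ray==2.20.0
ray start --address='192.168.250.20:6379' \
  --node-manager-port=10002 --object-manager-port=10003 \
  --runtime-env-agent-port=10004 --metrics-export-port=10005 \
  --dashboard-agent-grpc-port=10006 --dashboard-agent-listen-port=10007 \
  --min-worker-port=10100 --max-worker-port=10120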

ccruttjr commented 4 months ago

Hmm, but why does the worker connect initially and then stop? Wouldn't it just never connect? Anyway, I also tried --network=host and --publish-all to no avail, if those were supposed to fix something.

I also tried running this, which I believe is what you were referencing in the link above, but got the same results.

ray start --head --max-worker-port 10005 --node-manager-port 10006 --object-manager-port 10007 --runtime-env-agent-port 10008

edit: also using ray==2.22.0 now instead of 2.20.0

rynewang commented 3 months ago

docker run -it --gpus all --ipc=host -p 8000:8000 -p 6379:6379 -p 10001:10001 -p 10003:10003 -p 10004:10004 -p 10005:10005 -p 10006:10006 -p 10007:10007 -p 10008:10008 -p 10009:10009 -p 10010:10010 -p 10011:10011 -p 10012:10012 -p 10013:10013 -p 10014:10014 -p 33189:33189 -p 38065:38065 -p 44217:44217 -p 63051:63051 --name my-python-gpu-container my-python-cuda-app

It turns out you can't easily dockerize a Ray worker, because we have many different interconnection requirements. Can you try this:

docker run -it -v /ray/tmp:/ray/tmp --gpus all --ipc=host --pid=host --network=host --userns=keep-id --env-file <(env) --name my-python-gpu-container my-python-cuda-app

and see if it works
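
The idea is that --network=host, --pid=host, and --ipc=host make the container share the host's network, PID, and shared-memory namespaces, so none of Ray's randomly chosen ports need to be published, and the -v mount shares Ray's temp/session directory with the host. Two hedges: Ray's default session directory is /tmp/ray, so depending on your --temp-dir you may want -v /tmp/ray:/tmp/ray instead, and --userns=keep-id is (as far as I know) a Podman option that plain Docker may reject. A stripped-down variant to try if the full command errors out:

docker run -it --gpus all --ipc=host --pid=host --network=host \
  -v /tmp/ray:/tmp/ray \
  --name my-python-gpu-container my-python-cuda-app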