What happened + What you expected to happen
I'm running a toy Ray cluster on two machines with GPUs, using Docker with rayproject/ray-ml:2.30.0-py39-gpu as the base image (my workload installs some additional packages, including a VPN, but I don't think that's relevant here).
Then the cluster is spawned as follows:
A head node is created via docker run, passing my custom entrypoint shell script, which starts the VPN and eventually runs ray start --head --port=6379 --num-cpus=NUM_CPUS_PLACEHOLDER --node-ip-address=${VPN_IP} --num-gpus=NUM_GPUS_PLACEHOLDER --include-dashboard=true --dashboard-host=127.0.0.1 --node-manager-port=1915 --object-manager-port=1916 --dashboard-agent-grpc-port=1917 --dashboard-agent-listen-port=1918 --disable-usage-stats --block &
Then, on another machine, the same image is spawned, but a little differently: the entrypoint is tail -f /dev/null, so the container basically sits idle.
Then I create an interactive session via docker exec -it ray_worker_gpu /bin/bash and run commands in that shell, as follows:
The VPN is started
Ray is started via ray start --address=${HEADNODE_IP_ADDRESS}:6379 --node-name=$(hostname)-ray --node-manager-port=1915 --object-manager-port=1916 --dashboard-agent-grpc-port=1917 --dashboard-agent-listen-port=1918 --disable-usage-stats --block &
At this point, I have two devices running Ray containers and connected to each other just fine.
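Roughly the kind of check that statement is based on (a minimal sketch, not part of the setup itself; it assumes a driver is started inside the head container and just lists the nodes Ray reports):

import ray

# Attach to the already-running cluster rather than starting a new one.
ray.init(address="auto")

# One entry per node; both entries should report Alive=True at this point.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Alive"], node.get("Resources", {}))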
Now, if I submit a simple Ray job that wants to read something from stdin (provided in the reproduction script below), execution on the head node pauses until the worker is eventually marked dead because its heartbeats stop coming, with a message like "has been marked dead because the detector has missed too many heartbeats from it".
Processes on the worker node are still happily running, though, and there is nothing suspicious in the logs as far as I can see; however, the logs stop growing, so it feels like the raylet code actually freezes.
If I remove the --block option on the worker, the repro script quits with a (mostly expected) error:
Command failed with error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
OSError: [Errno 5] Input/output error
Whether blocking indefinitely while waiting for stdin is better than failing like above is debatable, but the node definitely should not die when running such a script.
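For what it's worth, one way to avoid triggering the hang (a sketch of a possible workaround, not something verified on this exact setup) is to keep the child process away from the worker's stdin:

import subprocess
import sys

# Redirecting stdin explicitly means the child never inherits the Ray
# worker's stdin; input() then fails fast with EOFError instead of hanging.
result = subprocess.run(
    [sys.executable, '-c', 'print(input())'],
    stdin=subprocess.DEVNULL,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
print(result.returncode, result.stderr.decode('utf-8'))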
Versions / Dependencies
Ray 2.30
Docker image rayproject/ray-ml:2.30.0-py39-gpu with some additional packages installed inside
Reproduction script
import ray
import subprocess
import sys


@ray.remote
def run_command_with_input():
    try:
        result = subprocess.run(
            [sys.executable, '-c', 'print(input())'],
            # ['sudo', 'ls'],
            check=True,
            # stdin=None,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE
        )
        return result.stdout.decode('utf-8'), None
    except subprocess.CalledProcessError as e:
        return None, e.stderr.decode('utf-8')


def main():
    ray.init()
    stdout, stderr = ray.get(run_command_with_input.remote())
    if stdout:
        print("Command executed successfully:\n", stdout)
    if stderr:
        print("Command failed with error:\n", stderr)
    if stdout is None and stderr is None:
        print("No output or error was returned.")


if __name__ == "__main__":
    main()
Issue Severity
Medium: It is a significant difficulty but I can work around it.