ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Worker seen dead by driver if run with `--block` in a TTY console and a task wants to read from stdin #47879

Open vnlitvinov opened 1 month ago

vnlitvinov commented 1 month ago

What happened + What you expected to happen

I'm running a toy Ray cluster on two machines with GPUs, using Docker with the rayproject/ray-ml:2.30.0-py39-gpu image as the base (my workload installs some additional packages, including a VPN, but I don't think that's relevant here).

The cluster is spawned as follows:

  1. A head node is created via `docker run`, passing my custom entrypoint shell script, which starts the VPN and eventually runs `ray start --head --port=6379 --num-cpus=NUM_CPUS_PLACEHOLDER --node-ip-address=${VPN_IP} --num-gpus=NUM_GPUS_PLACEHOLDER --include-dashboard=true --dashboard-host=127.0.0.1 --node-manager-port=1915 --object-manager-port=1916 --dashboard-agent-grpc-port=1917 --dashboard-agent-listen-port=1918 --disable-usage-stats --block &`
  2. On another machine, the same image is spawned, but with the entrypoint set to `tail -f /dev/null` so that the container just sits idle.
  3. I create an interactive session via `docker exec -it ray_worker_gpu /bin/bash` and run the following in that shell:
     a. The VPN is started.
     b. Ray is started via `ray start --address=${HEADNODE_IP_ADDRESS}:6379 --node-name=$(hostname)-ray --node-manager-port=1915 --object-manager-port=1916 --dashboard-agent-grpc-port=1917 --dashboard-agent-listen-port=1918 --disable-usage-stats --block &`

At this point, I have two machines running Ray containers that are connected to each other just fine. Now, if I submit a simple Ray job that wants to read something from stdin (see the reproduction script), execution on the head node pauses until the worker is eventually marked as dead because its heartbeats stopped arriving, with a message like "has been marked dead because the detector has missed too many heartbeats from it".

The processes on the worker node are still running, though, and I see nothing suspicious in the logs. However, the logs stop growing, so it looks like the raylet actually freezes.
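
One possible explanation, not confirmed anywhere in this thread: on POSIX systems a background job that reads from its controlling terminal is sent SIGTTIN, whose default action is to stop the process. If that is what happens here, the Ray processes would still exist but sit in the stopped state, which would match the logs no longer growing. Below is a minimal, Linux-only sketch for checking this on the worker node. The helper names `parse_proc_state` and `find_stopped_processes` are made up for illustration and are not part of Ray.

```python
import os


def parse_proc_state(stat_line: str) -> str:
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the last ')'.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def find_stopped_processes(name_fragment: str = "") -> list:
    """List (pid, cmdline) pairs for processes in state 'T' (stopped)."""
    stopped = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(f"/proc/{entry}/stat") as f:
                state = parse_proc_state(f.read())
            with open(f"/proc/{entry}/cmdline", "rb") as f:
                raw = f.read()
        except (OSError, IndexError):
            continue  # process exited while we were scanning
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        if state == "T" and name_fragment in cmdline:
            stopped.append((int(entry), cmdline))
    return stopped
```

Calling `find_stopped_processes("raylet")` inside the worker container while the job hangs would return a non-empty list if the raylet has been stopped by a job-control signal, and an empty list otherwise.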

If I remove the --block option on the worker, the repro script quits with a (mostly expected) error:

Command failed with error:
 Traceback (most recent call last):
  File "<string>", line 1, in <module>
OSError: [Errno 5] Input/output error

Whether blocking indefinitely while waiting for stdin is better than failing as above is debatable, but the node definitely should not die when running such a script.
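
A possible task-level workaround, sketched under the assumption that the hang comes from the child process reading the terminal it inherited as stdin (this has not been verified against Ray). Passing `stdin=subprocess.DEVNULL` makes the child see end-of-file immediately, so `input()` fails fast with `EOFError` and never touches the terminal. The function name `run_command_without_stdin` is illustrative; in the reproduction script it would take the place of the body of the `@ray.remote` task.

```python
import subprocess
import sys


def run_command_without_stdin():
    """Run the child with stdin attached to /dev/null instead of inheriting it."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", "print(input())"],
            check=True,
            stdin=subprocess.DEVNULL,  # child sees EOF immediately
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        return result.stdout.decode("utf-8"), None
    except subprocess.CalledProcessError as e:
        # input() raises EOFError in the child, which exits non-zero.
        return None, e.stderr.decode("utf-8")
```

This only covers children that read from their standard input. A program that opens the terminal device directly would be unaffected by the redirect.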

Versions / Dependencies

Ray 2.30, Docker image rayproject/ray-ml:2.30.0-py39-gpu, with some additional packages installed on top

Reproduction script

import ray
import subprocess
import sys

@ray.remote
def run_command_with_input():
    try:
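        # stdin is left at its default (None), so the child inherits the
        # Ray worker's stdin; input() then tries to read from it.
        result = subprocess.run(
            [sys.executable, '-c', 'print(input())'],
            #['sudo', 'ls'],
            check=True,
            # stdin=None,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE
        )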
        return result.stdout.decode('utf-8'), None
    except subprocess.CalledProcessError as e:
        return None, e.stderr.decode('utf-8')

def main():
    ray.init()

    stdout, stderr = ray.get(run_command_with_input.remote())

    if stdout:
        print("Command executed successfully:\n", stdout)
    if stderr:
        print("Command failed with error:\n", stderr)
    if stdout is None and stderr is None:
        print("No output or error was returned.")

if __name__ == "__main__":
    main()

Issue Severity

Medium: It is a significant difficulty but I can work around it.

jjyao commented 1 month ago

@dentiny could you help with this user and see what's wrong here?