ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Allow manual marking of node death via CLI and API #45632

Open terraflops1048576 opened 1 month ago

terraflops1048576 commented 1 month ago

Description

Ray should let the user mark a node as dead, by IP address or by Node ID, through the CLI or a Ray API, so that the scheduler stops assigning new tasks to that node and the node's current tasks are marked as failed with NodeDiedError.

Use case

In our Ray deployment, currently there are some unknown conditions that cause tasks on some preempted nodes to register as "running" and the node to appear in the Ray Dashboard as alive, even though the node is long gone. The node page on the Ray Dashboard displays an empty screen, and the task continues "running" forever.

jjyao commented 1 month ago

> In our Ray deployment, currently there are some unknown conditions that cause tasks on some preempted nodes to register as "running" and the node to appear in the Ray Dashboard as alive, even though the node is long gone. The node page on the Ray Dashboard displays an empty screen, and the task continues "running" forever.

Hi @terraflops1048576, this seems like a Ray bug that we should fix. Could you elaborate?

terraflops1048576 commented 1 month ago

I don't really have the ability to diagnose what's going on here. Opening the Chrome DevTools on the node page (http://<cluster ip>/#/cluster/nodes/<node id>) shows:

TypeError: Cannot read properties of undefined (reading '0')
at hc (NodeDetail.tsx:115:30)
at oo (react-dom.production.min.js:157:137)
...

which suggests to me that the cluster can't fetch the information for the node because it's gone. The node IP is unreachable over SSH, which suggests that the node has been preempted.

However, the task continues to show as "running" (blue) in the Ray Core Dashboard and never terminates. We ran into this problem because tasks appeared to run forever, and clicking on one of them to get the node information yielded a blank screen. I have screenshots of the problem, but I'm not sure that they're helpful.

terraflops1048576 commented 1 month ago

I should add that I understand that this information is certainly not sufficient to reproduce the bug, and I would love to collect information to track this down -- if I could be told what exactly to gather, because this seems to happen often enough.

I think at least the CLI/API would be a workaround to unstick tasks that get stuck in this state.

jjyao commented 4 weeks ago

@terraflops1048576 Ray has a health check, so if the underlying machine of a Ray node is gone, Ray will eventually mark the node as dead after a few minutes. Is this not the case?
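
One way to check this from a driver script is to look at the cluster's node table and see which nodes are reported as not alive. The following is a minimal sketch, not an official diagnostic: it assumes that `ray.nodes()` returns a list of dicts with `NodeID`, `Alive`, and `NodeManagerAddress` keys, and the helper name `find_dead_nodes` is hypothetical.

```python
from typing import Dict, List


def find_dead_nodes(node_table: List[dict]) -> Dict[str, str]:
    """Map node ID -> IP address for every node reported as not alive.

    `node_table` is assumed to have the shape returned by `ray.nodes()`:
    a list of dicts with "NodeID", "Alive", and "NodeManagerAddress" keys.
    """
    return {
        node["NodeID"]: node.get("NodeManagerAddress", "")
        for node in node_table
        if not node.get("Alive", False)
    }


# Usage against a running cluster (requires Ray to be installed):
#
#   import ray
#   ray.init(address="auto")  # connect to the existing cluster
#   for node_id, ip in find_dead_nodes(ray.nodes()).items():
#       print(f"dead: {node_id} ({ip})")
```

If a preempted node is still missing from this result several minutes after the machine disappeared, the cluster is still reporting it as alive, which matches the behavior described in this issue.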

terraflops1048576 commented 3 weeks ago

This was indeed not the case, for some reason, when I tried this on Ray 2.12: the running tasks simply hung. However, I cannot reproduce the issue on Ray 2.24. I suspect PR #44692 fixed this.