ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.38k stars 5.66k forks

CLI: kill actor via ray kill actors *actor_id* or via dashboard #39240

Open raycharleston opened 1 year ago

raycharleston commented 1 year ago

Description

Allow killing an actor from the CLI and/or dashboard. I know we can use ray.kill(handle) from within a driver script or from within the job submitted to a cluster, but what about killing a misbehaving actor when the job does not have the kill logic?

I previously mentioned this on the message board and was asked to open this issue.

Link to message board post: https://discuss.ray.io/t/how-to-kill-actor-from-cli-or-dashboard/11952

Use case

I'd like to be able to kill an actor from the Dashboard via a Kill button, which would terminate the actor and call a function with a conventional name (def exit) within the Actor, if one is defined.

Following the same pattern, I'd like to be able to terminate an Actor from the CLI. We already have 'ray list actors', so I think it also makes sense to have 'ray kill actors actor_id actor_id actor_id'. If called from the CLI, we would also call the same def kill function on the Actor.
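To make the proposed command shape concrete, here is a minimal sketch of how 'ray kill actors actor_id ...' could be parsed. This command does not exist in Ray; the parser layout, the run function, and the kill_fn callback are all hypothetical names used only for illustration, and the actual kill mechanism is deliberately left to the caller.

```python
import argparse
from typing import Callable, Dict, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the PROPOSED (not existing) 'ray kill actors' command."""
    parser = argparse.ArgumentParser(prog="ray")
    commands = parser.add_subparsers(dest="command", required=True)

    kill = commands.add_parser("kill", help="kill cluster resources (hypothetical)")
    resources = kill.add_subparsers(dest="resource", required=True)

    actors = resources.add_parser("actors", help="kill one or more actors by ID")
    # One or more actor IDs, mirroring 'ray kill actors actor_id actor_id actor_id'.
    actors.add_argument("actor_ids", nargs="+", metavar="ACTOR_ID")
    return parser


def run(argv: Sequence[str], kill_fn: Callable[[str], None]) -> Dict[str, bool]:
    """Parse argv and call kill_fn once per actor ID.

    kill_fn is a hypothetical hook standing in for whatever mechanism would
    actually terminate the actor. Returns a map of actor ID to success flag,
    so one failing ID does not stop the rest from being processed.
    """
    args = build_parser().parse_args(list(argv))
    results: Dict[str, bool] = {}
    for actor_id in args.actor_ids:
        try:
            kill_fn(actor_id)
            results[actor_id] = True
        except Exception:
            results[actor_id] = False
    return results
```

Reporting a per-ID result rather than failing on the first error seems useful when several hung actors are passed at once, though that is a design assumption and not something the request specifies.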

I'm working on an application that will leverage detached named actors as well as non-named Actors in many ActorPools. While working on some operational documentation for the application, I came across this question and realized there is no way to kill an actor other than from within the job code. Of course, we could always figure out the PID for the actor and kill that, but that solution is not very user-friendly and in certain environments might require a sysadmin.
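For reference, the PID workaround mentioned above amounts to something like the following sketch. It assumes the actor's PID is already known and that the code runs on the same node as the actor process with sufficient permissions; terminate_actor_process is a hypothetical helper name, not a Ray API.

```python
import os
import signal


def terminate_actor_process(pid: int, sig: int = signal.SIGTERM) -> bool:
    """Send a signal to the worker process hosting an actor.

    Returns True if the signal was delivered and False if no process with
    that PID exists. A PermissionError is not caught here: it propagates,
    which is the case where a sysadmin might be needed.
    """
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        return False
    return True
```

Note that this only ends the process. If the actor was created with a non-zero max_restarts, Ray may start it again, so this is not equivalent to ray.kill(handle).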

How does the community handle terminating hung/runaway actors when the job itself isn't smart enough to recognize an actor is hung and perform the cleanup without user interaction? Or when you don't want to kill the entire job?

If we did go down the path of allowing the termination of an actor from the dashboard and/or CLI, we would want to clean up the ActorPool references to that actor so the pool does not hold references to the terminated Actor.

OpenCoderX commented 2 months ago

This would be a great feature. I just found myself in a situation where I launched a detached actor with max_restarts=-1 from a job. It seems like there is no way to kill that actor now, is that correct? Should I just reboot the entire cluster?
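For the detached-actor case, one possible workaround is to connect a new driver to the cluster and look the actor up by name. This is a minimal sketch that assumes the actor was created with a name; kill_named_actor is a hypothetical helper, while ray.init, ray.get_actor, and ray.kill are existing Ray calls.

```python
from typing import Optional


def kill_named_actor(name: str, namespace: Optional[str] = None) -> None:
    """Kill a named actor from a separate driver script.

    Assumes the actor was created with a name (hypothetical example:
    Actor.options(name="my_actor", lifetime="detached").remote()).
    """
    # Imported lazily so this module can be loaded where Ray is not installed.
    import ray

    if not ray.is_initialized():
        # address="auto" connects to an already running cluster.
        ray.init(address="auto")

    # Raises ValueError if no actor with this name exists in the namespace.
    handle = ray.get_actor(name, namespace=namespace)

    # no_restart=True asks Ray not to restart the actor, which matters
    # when it was created with max_restarts=-1.
    ray.kill(handle, no_restart=True)
```

A call would look like kill_named_actor("my_actor", namespace="my_namespace"), where both values are placeholders. This does not help with unnamed actors, since there is no name to look up.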

zhangkuantian commented 1 month ago

We have started a bunch of unnamed actors through ActorPools. When an actor fails, it keeps retrying, and resources are not being released. However, we currently cannot find a way to kill an actor using its actor ID; the only way to clean up anonymous actors is to restart the cluster, which is too costly. We urgently need a method or command to kill an actor using its actor ID.