Closed (alanwguo closed this 1 month ago)
Ray has an OOM killer; it would be good to have a metric showing how many tasks Ray has killed, for monitoring purposes.
I'd like to create a graph of the rate of tasks being OOM-killed. I can set up alerts on this as well.
Also, a metric for the number of dead nodes, with a label for the reason, would be nice, especially if we can detect OOM'd nodes.
Looks like this already exists, closing.
ray_memory_manager_worker_eviction_total
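For anyone landing here with the same use case, here is a rough sketch of how the graph/alert query could be built from that counter. It assumes the Ray metrics are being scraped by a Prometheus server; the address `http://localhost:9090`, the 5m window, and the helper names are all placeholders of mine, not anything Ray ships.

```python
from urllib.parse import urlencode

# Metric name taken from this thread; everything else below is illustrative.
METRIC = "ray_memory_manager_worker_eviction_total"


def oom_kill_rate_query(window: str = "5m") -> str:
    """Return PromQL for the per-second eviction rate, summed over all series."""
    return f"sum(rate({METRIC}[{window}]))"


def prometheus_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # Hypothetical Prometheus address; replace with your own server.
    print(prometheus_query_url("http://localhost:9090", oom_kill_rate_query()))
```

The same PromQL expression can be pasted into a dashboard panel for the graph, and an alert could fire when it stays above some threshold (for example `> 0`) for a few minutes; the right threshold depends on the workload.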