ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] [Observability] Metrics for num oom killed tasks #47587

Closed alanwguo closed 1 month ago

alanwguo commented 1 month ago

Description

Ray has an OOM killer. It would be good to have a metric showing how many tasks Ray has killed, for monitoring purposes.

Use case

I'd like to create a graph of the rate of tasks being OOM-killed. I could set up alerts on this as well.

alanwguo commented 1 month ago

Also, a metric for the number of dead nodes, with a label for the reason, would be nice, especially if we can detect OOM'd nodes.

alanwguo commented 1 month ago

Looks like this already exists, closing.

ray_memory_manager_worker_eviction_total
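
For anyone who wants the rate graph described in the use case, here is a minimal sketch of how the query could be built. It assumes Ray's metrics are scraped into Prometheus under the metric name above. The helper names, the 5m window, and the `http://localhost:9090` address are illustrative assumptions, not part of Ray. `rate()`, `sum()`, and the `/api/v1/query` endpoint are standard Prometheus features.

```python
from urllib.parse import urlencode

# Metric name taken from this thread.
METRIC = "ray_memory_manager_worker_eviction_total"


def eviction_rate_query(window: str = "5m") -> str:
    """Build a PromQL expression for the per-second eviction rate.

    The rate is computed over `window` and summed across all label sets.
    """
    return f"sum(rate({METRIC}[{window}]))"


def prometheus_query_url(base_url: str, query: str) -> str:
    """Build the URL for Prometheus's instant-query HTTP endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


if __name__ == "__main__":
    # Assumption: a Prometheus server is reachable at this address.
    print(prometheus_query_url("http://localhost:9090", eviction_rate_query()))
```

The expression returned by `eviction_rate_query()` can be pasted into a Grafana panel as-is, or compared against a threshold in an alerting rule.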