Open vipese-idoven opened 4 months ago
Monitoring tasks this way introduces overhead due to the blocking nature of Python threads (at least until the optional no-GIL build in Python 3.13), and it cannot scale to more observability dimensions; depending on what telemetry you collect, the overhead can reach up to 50% if you are not careful. Ad-hoc monitoring like this can be acceptable as a quick workaround for certain jobs, but it is better done externally to the worker process through a global monitor.
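For illustration, a minimal sketch of what such an external, global monitor could look like, assuming the monitor runs on the same node as the workers and is handed their PIDs; `poll_worker_memory` and the one-second interval are hypothetical choices, not Ray APIs:

```python
# Hypothetical external monitor: polls worker processes by PID from a
# separate process, so no monitoring thread runs inside the workers.
import time

import psutil


def poll_worker_memory(pids, rss_limit_bytes, interval_s=1.0):
    """Poll the RSS of the given worker PIDs; yield any PID over the limit."""
    while True:
        for pid in pids:
            try:
                rss = psutil.Process(pid).memory_info().rss
            except psutil.NoSuchProcess:
                continue  # worker already exited
            if rss > rss_limit_bytes:
                yield pid, rss
        time.sleep(interval_s)
```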
I see... Thanks for the quick response @Superskyyy.
I tried monitoring externally using Ray State, but it does not provide similar functionality, and passing PIDs from other nodes resulted in errors. Also, Ray's guidelines indicate that Ray generates some system metrics that can be used for monitoring, but the Prometheus metrics fall short because they are reported at the cluster level (rather than at the Task or Actor level, which would help with preemptively killing tasks that use more memory than initially allocated). Roughly what I tried is sketched below.
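For reference, the Ray State attempt looked roughly like this (a sketch assuming a recent Ray 2.x where `ray.util.state.list_actors` is available); the PID it reports is only meaningful on the node where the actor runs, which is where the cross-node errors came from:

```python
# Sketch of the Ray State attempt: PIDs are node-local, so resolving them
# with psutil only works for actors running on the same node as this script.
import psutil
from ray.util.state import list_actors

for actor in list_actors(filters=[("state", "=", "ALIVE")]):
    try:
        proc = psutil.Process(actor.pid)  # raises for actors on other nodes
        print(actor.actor_id, proc.memory_info().rss)
    except psutil.NoSuchProcess:
        print(actor.actor_id, "PID not resolvable from this node")
```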
Do you have any suggestions as to how to do it?
We do have an internal, experimental implementation of such a continuous monitoring system with feedback on Ray. We are discussing needs and seeing whether it can be gradually integrated back into open source. Happy to loop you in when we propose an RFC or REPs on it.
I'd really appreciate that – thank you @Superskyyy!
@Superskyyy When can we expect it to be publicly available? Or could you share a rough idea of how you are using/implementing it, so I can figure something out myself?
@Superskyyy Thanks for the information. Making that part open source would be great!
Description
Continuous resource monitoring of Ray Tasks and Actors with metrics extraction.
Use case
Currently, the only OOM prevention Ray supports is allocating more resources to Ray Tasks and reducing concurrency. However, OOM can still be triggered if a Task uses more resources than it was allocated, since the allocation is not enforced at runtime (see the sketch below).
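To make the current situation concrete, the snippet below shows these knobs; the `memory` argument only guides scheduling and placement, it is not a runtime limit, so the over-allocation in `heavy_task` can still cause OOM:

```python
import ray

ray.init()

# Reserve 2 GiB per task: this affects scheduling/packing decisions only;
# it does not stop the task from allocating more than 2 GiB at runtime.
@ray.remote(memory=2 * 1024**3)
def heavy_task():
    data = bytearray(3 * 1024**3)  # exceeds the reservation -> possible OOM
    return len(data)

# "Reducing concurrency" means fewer such tasks per node, e.g. by making
# each task also claim a full CPU so fewer of them get packed together.
ray.get(heavy_task.options(num_cpus=1).remote())
```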
Implementing continuous monitoring of the resources used by Ray Tasks and Actors (`ray.util.state` falls short of this) would make it possible to design routines that prevent OOM by killing Tasks/Actors before they use more memory than specified in `resources`. So far, I've been able to implement a self-monitoring threaded Actor (also posted in the Ray Discussion Forum), but this approach seems sub-optimal; a sketch is included below.
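For completeness, a minimal sketch of that self-monitoring threaded Actor, assuming `psutil` is installed on the workers and using a hypothetical 1 GiB limit; this is exactly the in-process pattern flagged above as adding overhead:

```python
# Self-monitoring actor: a daemon thread inside the actor process polls
# its own RSS and terminates the process before the node-level OOM killer
# takes it down less gracefully.
import os
import threading
import time

import psutil
import ray


@ray.remote
class SelfMonitoredActor:
    def __init__(self, rss_limit_bytes=1 * 1024**3, interval_s=1.0):
        self._proc = psutil.Process(os.getpid())
        self._limit = rss_limit_bytes
        self._interval = interval_s
        threading.Thread(target=self._watch, daemon=True).start()

    def _watch(self):
        while True:
            if self._proc.memory_info().rss > self._limit:
                # Force-exit the worker process; Ray then marks the actor
                # as dead and fails any pending calls with RayActorError.
                os._exit(1)
            time.sleep(self._interval)

    def do_work(self, n: int) -> int:
        return sum(range(n))
```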