Currently the metrics push is fire-and-forget, and happens on a fixed interval whether or not the previous push has finished.
Use case
Our system is running a very large number of DeploymentHandles (see https://github.com/ray-project/ray/issues/44784 for more details). We've noticed that the Serve controller gets overloaded (>100% CPU usage) trying to accept all of the metrics pushes. This leads to an ever-growing backlog of increasingly stale record_handle_metrics tasks queued on the controller, and the controller eventually runs out of memory and crashes.
@zcin FYI with https://github.com/ray-project/ray/pull/45957, I suspect this won't be necessary from a performance/scalability perspective, though it may be good design to do it anyway
Description
It would be nice to provide backpressure on handle metrics pushes to the Serve controller so that the controller does not become overloaded.
Relevant code is around these locations: