ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
34.17k stars 5.8k forks source link

[Autoscaler] Hook function for specific states in Autoscaler #40734

Open astron8t-voyagerx opened 1 year ago

astron8t-voyagerx commented 1 year ago

Description

Situation

We're running a ML server with Ray Serve on Google Cloud. Nowadays, the zone I'm using on GC is suffering "out of GPUs" issue, and on those times, we cannot scale up since they cannot launch any more virtual instances. The problem is, in this situation, the Ray Autoscaler just tries to scale-up forever and we cannot know even this thing is happening.

Wanted Feature

I want to insert a hook function, so that I can send alerts to my team's workspace inside the autoscaler. And there's no option for these demand as far as I know.

Use case

No response

jjyao commented 1 year ago

cc @rickyyx