volga-project / volga

Feature Engine for real-time AI/ML
Apache License 2.0

[Platform] Autoscaling #45

Open · anovv opened this issue 4 months ago


Umbrella task to track the autoscaling mechanism for Volga jobs.

Ray provides a built-in autoscaler that automatically provisions nodes based on the resource requests of pending actors: https://docs.ray.io/en/latest/cluster/vms/user-guides/configuring-autoscaling.html. It is also integrated with Kubernetes, where each Ray node maps to a Kubernetes pod: https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/configuring-autoscaling.html.
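The demand-driven behavior described above can be illustrated with a simplified sketch. This is not Ray's implementation: the `Node` class and `nodes_to_launch` function are hypothetical, and the sketch assumes a single node type and ignores scale-down, placement groups and idle timeouts. It packs pending actors' resource requests onto free capacity (first-fit decreasing) and counts how many new nodes would have to be launched.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical, simplified model of demand-based node provisioning.
# Ray's real autoscaler handles multiple node types, placement groups
# and idle-node termination; this only shows the core idea.

Resources = Dict[str, float]


@dataclass
class Node:
    capacity: Resources
    free: Resources = field(init=False)

    def __post_init__(self) -> None:
        # A freshly created node starts with all of its capacity free.
        self.free = dict(self.capacity)

    def fits(self, demand: Resources) -> bool:
        # Missing resource keys (e.g. no "GPU") count as zero capacity.
        return all(self.free.get(k, 0.0) >= v for k, v in demand.items())

    def place(self, demand: Resources) -> None:
        for k, v in demand.items():
            self.free[k] -= v


def nodes_to_launch(pending: List[Resources], node_type: Resources,
                    existing_free: List[Resources]) -> int:
    """Return how many new nodes of `node_type` are needed so that every
    pending actor's resource request fits somewhere (first-fit decreasing)."""
    # Model existing nodes by their currently free resources (copied,
    # so the caller's dicts are not mutated).
    nodes = [Node(dict(free)) for free in existing_free]
    launched = 0
    # Place larger requests first (crude size heuristic) to reduce fragmentation.
    for demand in sorted(pending, key=lambda d: sum(d.values()), reverse=True):
        target = next((n for n in nodes if n.fits(demand)), None)
        if target is None:
            new_node = Node(dict(node_type))
            if not new_node.fits(demand):
                raise ValueError(f"request {demand} exceeds node capacity")
            nodes.append(new_node)
            launched += 1
            target = new_node
        target.place(demand)
    return launched
```

For example, four pending actors that each request `{"CPU": 2}` on a cluster with no free capacity and 4-CPU node types need two new nodes; if an existing node already has 2 free CPUs, one actor lands there and the remaining three still need two new nodes.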

We can build our own actor-provisioning logic based on performance metrics and integrate it with the existing Ray autoscaler. Coupling it with a cloud-native Kubernetes autoscaler (e.g. cluster-autoscaler for cloud-agnostic setups or Karpenter for AWS) would let us run fully elastic, serverless jobs in the cloud. This will consist of two parts: