Autoscaling and policy-driven automations

Hello everyone, 😀

This is a proposal, and a place to discuss the implementation of autoscaling and policy-driven automations for Vitess. The general idea is to be able to provide a list of policies / rules (possibly in the spec) so that certain events / actions take place automatically. This would be very useful for specifying custom autoscaling scenarios or alerts, for example.

The high-level approach to this could be:

1. We create an "orchestrator" server that collects metrics from our Vitess clusters.
2. We create "policies" that define when and how to scale up and/or down (based on metrics and limits). For each policy, we also specify how often it is checked.
3. The server checks each policy at its given interval and, if the policy applies, runs custom predefined actions against our Vitess clusters.
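To make the loop above concrete, here is a minimal sketch of a single evaluation pass. Everything in it is an assumption for illustration only: the `Policy` class, the `run_due_policies` function, and the metric name `avg_cpu_load` are hypothetical and not part of Vitess or the vitess-operator.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Policy:
    """Hypothetical policy: check `condition` every `interval_s` seconds."""
    name: str
    interval_s: float
    condition: Callable[[Metrics], bool]  # when the automation should run
    action: Callable[[Metrics], None]     # what the automation does
    last_checked: float = float("-inf")   # never checked yet


def run_due_policies(policies: List[Policy], metrics: Metrics, now: float) -> List[str]:
    """Check every policy whose interval has elapsed; return names that fired."""
    fired = []
    for policy in policies:
        if now - policy.last_checked < policy.interval_s:
            continue  # not due yet
        policy.last_checked = now
        if policy.condition(metrics):
            policy.action(metrics)
            fired.append(policy.name)
    return fired
```

A real server would call something like `run_due_policies` from a timer, passing freshly collected metrics and the current time. Passing `now` explicitly keeps the logic easy to test.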

To achieve this, we need to be able to specify the following info in the spec for each policy:

- Metrics and limits (e.g. shard size > 256GB, avg CPU load > 60%). This allows us to specify when our automations will be executed. It involves deciding which metrics are useful, as well as finding a reliable and accurate way to obtain them.
- Set of actions to run when the event gets triggered (e.g. execute script, alert, perform backup). This allows us to specify what our automations will do when executed. "Execute script" is really the only necessary action, since it allows for custom-made workflows and automations.
- Interval / frequency of checking (e.g. every 1 hour). If we don't specify any metric-limit, the automation just runs at the specified interval (useful for backups and reports).
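As a rough illustration of what such a policy entry could look like, here is a minimal sketch. The field names (`rules`, `actions`, `interval_s`), the metric name `shard_size_gb`, and the action strings are all hypothetical, not an existing vitess-operator schema. It also assumes that multiple rules are combined with AND, which is one possible choice the proposal leaves open.

```python
import operator
from dataclasses import dataclass, field
from typing import Dict, List

# Comparison operators a rule may use (an assumed, minimal set).
_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass
class Rule:
    metric: str   # e.g. "shard_size_gb" (hypothetical metric name)
    op: str       # one of the keys in _OPS
    limit: float

    def holds(self, metrics: Dict[str, float]) -> bool:
        return _OPS[self.op](metrics[self.metric], self.limit)


@dataclass
class PolicySpec:
    name: str
    interval_s: int      # how often the policy is checked
    actions: List[str]   # e.g. ["execute_script"] (hypothetical names)
    rules: List[Rule] = field(default_factory=list)

    def should_trigger(self, metrics: Dict[str, float]) -> bool:
        # With no rules, all() is True: the policy runs purely on its interval.
        return all(rule.holds(metrics) for rule in self.rules)


def policy_from_dict(raw: dict) -> PolicySpec:
    """Build a PolicySpec from a plain dict, e.g. one loaded from a spec file."""
    return PolicySpec(
        name=raw["name"],
        interval_s=raw["interval_s"],
        actions=list(raw["actions"]),
        rules=[Rule(**rule) for rule in raw.get("rules", [])],
    )
```

A policy with a `shard_size_gb > 256` rule triggers only when the metric exceeds the limit, while a policy with no rules (such as a periodic backup) triggers on every check.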

All this could be tremendously useful, allowing for custom autoscaling (horizontal and vertical), alerts, reports, integrations, and automated backups.

Please give your thoughts and ideas!

This is a follow-up to a Slack discussion. Please check it out for more info.

planetscale / vitess-operator

Autoscaling and policy-driven automations #259