honeycombio / refinery

Refinery is a trace-aware tail-based sampling proxy. It examines whole traces and intelligently applies sampling decisions (whether to keep or discard) to each trace.

Enable HPA-compatibility #1087

Open VinozzZ opened 2 months ago

VinozzZ commented 2 months ago

Use stress level as an indicator of when Refinery should scale up — when it rises too high for more than a little while, we should add new refinery capacity.

Unfortunately, stress_level is designed to sit close to zero both when a Refinery is normally loaded and when it is underloaded. The best way to think about being underloaded is time-based: if the server doesn't pop into stress for several minutes, the cluster probably has too many pods.

While we will be adjusting which metrics are used to measure stress level in the new release, the basic logic behind how it works will remain unchanged.

Kubernetes has HPA — Horizontal Pod Autoscaling — which normally monitors CPU and/or memory for all of the pods in a cluster. However, it can also be taught to monitor other metrics, provided there is a custom metrics API server in the cluster.

We should implement an instance of this to return stress level as a custom metric. That may be all we need to do (plus configuring k8s to use it); the k8s autoscaler is pretty smart and can be tuned for aggressiveness, so we might be able to avoid changing anything about the basic way that stress_level works.
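
For illustration, the resulting HPA object might look something like the sketch below. The metric name and numbers are placeholders, assuming the custom metrics server exposes stress level per pod as refinery_stress_level:

```yaml
# Sketch only: HPA scaling Refinery on a per-pod custom metric.
# Assumes a custom metrics API server in the cluster exposes Refinery's
# stress level under the (hypothetical) name "refinery_stress_level".
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: refinery
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: refinery
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: refinery_stress_level
        target:
          type: AverageValue
          averageValue: "50"   # illustrative threshold; tune for your workload
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600  # wait several minutes before scaling down
```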

We’ll have to build the metrics server with Refinery and include it in our release bundle, and update our Refinery Helm chart to use it.
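
As a rough sketch of how that could surface in the chart (these values keys are hypothetical, not the chart's current schema):

```yaml
# Hypothetical values.yaml additions for the Refinery Helm chart.
# Key names are illustrative only; the real schema is TBD.
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  customMetric:
    name: refinery_stress_level    # served by the bundled custom metrics server
    targetAverageValue: "50"
```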

Note that the custom metrics server could also serve metrics relating to Redis, and possibly even allow the Redis part of the cluster to be autoscaled as well. That's advanced mode for another day, but the option is there.

VinozzZ commented 2 months ago

Discovery so far from @kentquirk:

VinozzZ commented 2 months ago

@TylerHelmuth was able to get Refinery autoscaling up and down on stress_level via the Prometheus Adapter (which requires a Prometheus instance). This means we can, with no extra code or executables, give anyone who does not already have a custom metrics server running on their cluster a way to auto-scale Refinery with stress_level. The solution is to install a bunch of Prometheus stuff, but it's technically a solution.

The next step would be to see how much I could integrate into the Helm chart so that when you install Refinery you can optionally install the extra bits/HorizontalPodAutoscaler needed to scale based on stress_level.
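
For reference, the Prometheus Adapter mapping behind this approach looks roughly like the rule below. The series name refinery_stress_level is an assumption; the exact name depends on how Prometheus scrapes Refinery's metrics.

```yaml
# Sketch of a Prometheus Adapter rule that exposes Refinery's stress level
# to the custom metrics API so an HPA can target it per pod.
rules:
  - seriesQuery: 'refinery_stress_level{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "refinery_stress_level"
      as: "refinery_stress_level"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```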