honeycombio / refinery

Refinery is a trace-aware tail-based sampling proxy. It examines whole traces and intelligently applies sampling decisions (whether to keep or discard) to each trace.

Enable HPA-compatibility #1087

Open VinozzZ opened 2 months ago

VinozzZ commented 2 months ago

Use stress level as an indicator of when Refinery should scale up — when it rises too high for more than a little while, we should add new refinery capacity.

Unfortunately, stress_level is designed to sit close to zero both when a Refinery is normally loaded and when it is underloaded. The best way to think about being underloaded is time-based: if the server doesn't pop into stress for several minutes, the cluster probably has too many pods.

While we will be adjusting which metrics are used to measure stress level in the new release, the basic logic behind how it works will remain unchanged.

Kubernetes has HPA — Horizontal Pod Autoscaling — which normally monitors CPU and/or memory for all of the pods in a cluster. However, it can also be taught to monitor other metrics, provided there is a custom metrics API server in the cluster.

We should implement an instance of this to return stress level as a custom metric. That may be all we need to do (plus configuring k8s to use it); the k8s autoscaler is pretty smart and can be tuned for aggressiveness, so we might be able to avoid changing anything about the basic way that stress_level works.
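
For illustration, the resulting HPA object might look something like the sketch below. The metric name and numbers are placeholders, assuming the custom metrics server exposes stress level per pod as refinery_stress_level:

```yaml
# Sketch only: HPA scaling Refinery on a per-pod custom metric.
# Assumes a custom metrics API server in the cluster exposes Refinery's
# stress level under the (hypothetical) name "refinery_stress_level".
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: refinery
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: refinery
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: refinery_stress_level
        target:
          type: AverageValue
          averageValue: "50"   # illustrative threshold; tune for your workload
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600  # wait several minutes before scaling down
```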

We’ll have to build the metrics server with Refinery and include it in our release bundle, and update our Refinery Helm chart to use it.
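
As a rough sketch of how that could surface in the chart (these values keys are hypothetical, not the chart's current schema):

```yaml
# Hypothetical values.yaml additions for the Refinery Helm chart.
# Key names are illustrative only; the real schema is TBD.
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  customMetric:
    name: refinery_stress_level    # served by the bundled custom metrics server
    targetAverageValue: "50"
```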

Note that the custom metrics server could also serve metrics relating to Redis, and possibly even allow the Redis part of the cluster to be autoscaled as well. That's advanced mode for another day, but the option is there.

VinozzZ commented 2 months ago

Discovery so far from @kentquirk:

VinozzZ commented 2 months ago

@TylerHelmuth was able to get Refinery autoscaling up and down on stress_level via the Prometheus Adapter (which requires a Prometheus instance). This means we can, with no extra code or executables, give anyone who does not already have a custom metrics server running on their cluster a way to auto-scale Refinery with stress_level. The solution is to install a bunch of Prometheus stuff, but it's technically a solution.

The next step would be to see how much I could integrate into the Helm chart so that when you install Refinery you can optionally install the extra bits/HorizontalPodAutoscaler needed to scale based on stress_level.
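
For reference, the Prometheus Adapter mapping behind this approach looks roughly like the rule below. The series name refinery_stress_level is an assumption; the exact name depends on how Prometheus scrapes Refinery's metrics.

```yaml
# Sketch of a Prometheus Adapter rule that exposes Refinery's stress level
# to the custom metrics API so an HPA can target it per pod.
rules:
  - seriesQuery: 'refinery_stress_level{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "refinery_stress_level"
      as: "refinery_stress_level"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```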