honeycombio / refinery

Refinery is a trace-aware tail-based sampling proxy. It examines whole traces and intelligently applies sampling decisions (whether to keep or discard) to each trace.

Vertical Pod Autoscaling for Refinery #605

Open fujin opened 1 year ago

fujin commented 1 year ago

When I set up my Refinery clusters (8 environments, one cluster per environment), I followed the advice in the documentation to size my cache, and I haven't re-sized the caches since.

I noticed (albeit much later) that my per-environment Refineries have drastically different CPU/memory profiles, which is not entirely unreasonable, but I want to free myself from managing the resource requests by hand, as I try to do for everything I run. As I have done in many other cases (especially for Go workloads), I looked into vertical pod autoscaling (VPA).

If you are unfamiliar with VPA, the premise is that a histogram of CPU and memory utilisation (covering about 8 days, iirc) is used to calculate a recommendation, plus lower and upper bounds, for each container's resource requests; it also takes OOM-kill events into account. It can be configured so that recommendations are only applied when a pod is scheduled, or so that it automatically evicts the workload to apply a change (a voluntary disruption). I know the Refinery docs recommend against scaling the HPAs down because of trace loss, so that would have to be thought through before using the Auto mode. The other mode, Off, only calculates recommendations, which is useful for feeding into other tools.

It looks like, if Refinery were to infer MaxAlloc from the container memory limit and calculate CacheCapacity from that, then whenever the resource request/limit is changed, e.g. by VPA (or even manually), both MaxAlloc and CacheCapacity would adjust themselves automatically. It's fairly straightforward to take this information from the cgroups, and you may already be doing so for GOMAXPROCS; a rough sketch follows below. I accept that it wouldn't be one-size-fits-all.
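
To illustrate the idea, here is a minimal sketch of reading the container memory limit from the cgroup filesystem and deriving the two settings from it. The cgroup paths are the standard v2/v1 locations; the 20% headroom and the bytes-per-trace divisor are made-up numbers purely for illustration, not Refinery's actual behaviour.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// containerMemoryLimit returns the memory limit imposed on this container,
// reading cgroup v2 first and falling back to cgroup v1.
// A return value of 0 means "no limit detected".
func containerMemoryLimit() (uint64, error) {
	// cgroup v2: the literal string "max" means unlimited.
	if b, err := os.ReadFile("/sys/fs/cgroup/memory.max"); err == nil {
		s := strings.TrimSpace(string(b))
		if s == "max" {
			return 0, nil
		}
		return strconv.ParseUint(s, 10, 64)
	}
	// cgroup v1 fallback.
	b, err := os.ReadFile("/sys/fs/cgroup/memory/memory.limit_in_bytes")
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
}

func main() {
	limit, err := containerMemoryLimit()
	if err != nil || limit == 0 {
		fmt.Println("no container memory limit found; fall back to configured values")
		return
	}

	// Illustrative derivation only: leave ~20% headroom for MaxAlloc, and
	// assume an average in-memory footprint per cached trace to size
	// CacheCapacity. Neither number comes from Refinery.
	maxAlloc := limit * 80 / 100
	const assumedBytesPerTrace = 100_000
	cacheCapacity := maxAlloc / assumedBytesPerTrace

	fmt.Printf("inferred MaxAlloc=%d CacheCapacity=%d\n", maxAlloc, cacheCapacity)
}
```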

automaxprocs is also relevant here: it sets GOMAXPROCS based on the container's CPU quota (which VPA would adjust).
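
For reference, adopting automaxprocs is just a blank import in main; this is the library's standard usage rather than anything Refinery-specific:

```go
package main

import (
	"fmt"
	"runtime"

	// On init, adjusts GOMAXPROCS to match the container's CPU quota
	// (and logs what it changed).
	_ "go.uber.org/automaxprocs"
)

func main() {
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
```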

kentquirk commented 1 year ago

Hi, @fujin. Apologies for the delay; we had a company offsite last week.

Your proposal is interesting. It's also a little tricky, requiring specific features of k8s (afaict). I spent about an hour on it but wasn't able to find a Go library that I thought would be a relatively easy drop-in. If you know of one, please let me know. It's not a must-have, but it would make things easier.

In any case, given the sequence of things we're currently planning, this doesn't really fit in immediately, but I can see us getting to it sometime later in the spring.