Closed sharnoff closed 11 months ago
This is also related to neondatabase/cloud#6444. From the discussion there: in addition to providing the current CPU/memory reserved, it'd be good to also provide the minimum and maximum for the VM (probably taking the `autoscaling.neon.tech/bounds` annotation into account?)
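As a rough sketch of what reading min/max from the bounds annotation could look like — note that the JSON shape used here (`{"min":{"cpu":...,"mem":...},"max":{...}}`) is an assumption for illustration, not the annotation's actual schema:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// resources / vmBounds mirror an ASSUMED JSON shape for the
// autoscaling.neon.tech/bounds annotation, e.g.
// {"min":{"cpu":0.25,"mem":"1Gi"},"max":{"cpu":4,"mem":"16Gi"}}.
// The real schema may differ.
type resources struct {
	CPU float64 `json:"cpu"`
	Mem string  `json:"mem"`
}

type vmBounds struct {
	Min resources `json:"min"`
	Max resources `json:"max"`
}

// parseBounds extracts the min/max resources from the annotation value.
func parseBounds(annotation string) (vmBounds, error) {
	var b vmBounds
	err := json.Unmarshal([]byte(annotation), &b)
	return b, err
}

func main() {
	b, err := parseBounds(`{"min":{"cpu":0.25,"mem":"1Gi"},"max":{"cpu":4,"mem":"16Gi"}}`)
	if err != nil {
		panic(err)
	}
	fmt.Printf("min cpu=%g mem=%s; max cpu=%g mem=%s\n",
		b.Min.CPU, b.Min.Mem, b.Max.CPU, b.Max.Mem)
}
```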
Rationale
There are a few reasons:
Implementation ideas
I think all these problems can be solved in one go.

My tentative idea is to create a new, separately deployed, single-instance-per-cluster component that will expose two metrics for every running pod in the cluster: `pod_or_vm_cpu_requests` and `pod_or_vm_mem_requests` (in bytes). Maybe the metrics should be prefixed by the component name, not sure.

These metrics will be defined for each running K8s pod as follows: if the `vm.neon.tech/usage` annotation is defined, use the CPU and memory given there. This is what our patched cluster-autoscaler is making decisions with.
The actual implementation should be pretty easy: `pkg/util/watch` can be used, and the "add"/"update"/"delete" callbacks should be relatively simple as well.

Areas of future work
We might also want to have a separate metric for VM usage that's only present for VMs, so in Grafana we can say "VM usage OR regular pod CPU usage" (where pod CPU usage is actually more like …).

I could equally see this being used to show migrations for each VM. Cluster-wide totals can be derived from the autoscaler-agent's billing metrics (because it tracks the counts for each VM phase), but we don't have anything available per-VM.
Prior issues, discussions:
cc @cicdteam, @arssher as people who may be interested in the outcome of this.