neondatabase / autoscaling

Postgres vertical autoscaling in k8s
Apache License 2.0

neonvm-controller: Requeue after 15s if Pending or Running #1016

Closed sharnoff closed 4 months ago

sharnoff commented 4 months ago

Requeueing every VM once per second places a lot of load on our systems, and produces a lot of log lines for reconciles where nothing changed.

Currently we requeue that often because there can be changes to QEMU that we won't pick up otherwise. Those shouldn't happen during the "Running" phase though, so we can reduce the requeue frequency there from "every 1s" to "every 15s".

Something similar applies to Pending VMs -- either we're waiting on the pod to start (in which case we'll be notified anyway), or we're waiting on neonvm-runner/QEMU to start (in which case we'll retry from errors). So we can delay requeueing there as well.

To compensate for these changes, we add a "requeueAfter" field to the "Successful reconciliation" log line so that we can still tell if a gap between reconciles is unusually long.

In the future, we may want to reduce this even further. We'll see.


Part of neondatabase/cloud#15591.

This actually ended up much smaller than I thought :)