Requeueing every second for every VM places a lot of load on our systems, and results in a lot of logs where nothing changed.
Currently we do it because there can be changes to QEMU that we won't pick up otherwise. Those shouldn't happen during the "Running" phase though, so we can reduce the requeue frequency from "every 1s" to "every 15s" for that.
The same reasoning applies to Pending VMs: either we're waiting on the pod to start (in which case we'll be notified anyway), or we're waiting on neonvm-runner/QEMU to start (in which case we'll retry from errors). So we can delay retrying there as well.
To compensate for these changes, we add a "requeueAfter" field to the "Successful reconciliation" log line so that we can still tell if a gap between reconciles is unusually long.
In the future, we may want to reduce the requeue frequency even further. We'll see.
Part of neondatabase/cloud#15591.
This actually ended up much smaller than I thought :)