Closed by sharnoff 5 months ago
The mentioned types of VM stuckness are covered by other metrics:
I understand this is more fine-grained, at a per-VM granularity level, but there is some duplication, particularly if we are looking to set up alerts based on this. @sharnoff wdyt?
@sharnoff and I discussed this. It's partially covered by the other metrics, but having this makes it much easier to track the progress of each VM.
Status: after the deployment to prod yesterday there is a non-zero number of stuck VMs on the dashboard.
For those VMs, the vm-monitor consistently rejects downscaling requests, and they are now considered stuck. The question is: do we want to investigate why the autoscaler-agent wants to downscale these VMs but the vm-monitor rejects the requests, or do we want to consider this situation normal and remove it from the definition of stuckness?
Thread: https://neondb.slack.com/archives/C03F5SM1N02/p1713893262823009
Implementation finished, some follow-up in #926
Problem description / Motivation
Currently, "autoscaling stuck" metrics and logs use the following definition:
Current implementation is here.
This misses various other ways that autoscaling may currently be failing for a particular VM, some of which we've seen in prod (e.g. if a pod start event is missed, the scheduler doesn't know about the VM and always returns 404 to the autoscaler-agent).
Feature idea(s) / DoD
Some other types of "stuckness" we should look at including:
And possibly also (although more difficult):
See also #594. Tracking the delay between VM object change and VM status change is currently blocked on #592.