Track more liveness in "autoscaling stuck"

sharnoff commented 8 months ago

Problem description / Motivation

Currently, "autoscaling stuck" metrics and logs use the following definition:

An autoscaling-enabled VM is "stuck" if there has not been a successful health check response for the last 20s.

Current implementation is here.

This misses various other ways that autoscaling may currently be failing for a particular VM, some of which we've seen in prod (e.g. due to missing a pod start event, the scheduler doesn't know about a particular VM and always returns 404 to the autoscaler-agent).

Feature idea(s) / DoD

Some other types of "stuckness" we should look at including:

Requests to the scheduler plugin are failing
Requests to update the VM object are failing
Other requests to the vm-monitor are failing

And possibly also (although more difficult):

Scheduler plugin consistently denying desired upscaling
vm-monitor consistently denying desired downscaling (currently sometimes expected in practice)

See also #594. Tracking the delay between VM object change and VM status change is currently blocked on #592.

shayanh commented 8 months ago

The mentioned types of VM stuckness are covered by other metrics:

Requests to the scheduler plugin are failing: covered by Agent → Scheduler plugin request errors
Requests to update the VM object are failing: covered by Agent → NeonVM API request errors
Other requests to the vm-monitor are failing: covered by Agent → vm-monitor request errors

I understand this is more fine-grained, at per-VM granularity, but there is some duplication, particularly if we plan to set up alerts based on this. @sharnoff wdyt?

shayanh commented 8 months ago

@sharnoff and I discussed this. It's partially covered by the other metrics, but having this makes it much easier to track progress for each VM.

Omrigan commented 5 months ago

Status: after the deployment to prod yesterday, there is a non-zero number of stuck VMs on the dashboard.

For those VMs, vm-monitor consistently rejects downscaling requests, so they are now considered stuck. The question is: do we want to investigate why the autoscaler-agent wants to downscale these VMs but vm-monitor rejects the request, or do we want to consider this situation normal and remove it from the definition of stuckness?

Thread: https://neondb.slack.com/archives/C03F5SM1N02/p1713893262823009

stradig commented 5 months ago

Implementation finished, some follow-up in #926.

neondatabase / autoscaling

Track more liveness in "autoscaling stuck" #770
