Problem description / Motivation
Currently we use a single gauge for the "autoscaling stuck" metric exposed by the autoscaler-agent.
This is nice because it's simple.
The downside is that we only know how many VMs were stuck at a particular moment in time; we don't know, for example, how many VMs became stuck between two points in time.
Knowing the number of distinct VMs that became stuck between two points in time would require either (a) looking at the logs, or (b) high-cardinality metrics. But if we're ok having duplicate entries for the same VM, we can just look at the total number of times any VM became stuck, which can be represented by a single counter (a metric that only ever increases).
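To make the gap concrete, here is a minimal Python sketch (not the autoscaler-agent's actual code) that replays a made-up event timeline and compares what a point-in-time gauge reports with what a cumulative "became stuck" count reports. The `Event` type, the VM names, and the timestamps are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    t: int        # timestamp in seconds (hypothetical)
    vm: str       # VM name (hypothetical)
    stuck: bool   # True = VM became stuck, False = VM became unstuck


# Made-up timeline: vm-a gets stuck twice, vm-b once, all of them recover.
EVENTS = [
    Event(10, "vm-a", True),
    Event(20, "vm-a", False),
    Event(30, "vm-b", True),
    Event(40, "vm-b", False),
    Event(50, "vm-a", True),
    Event(60, "vm-a", False),
]


def gauge_at(events, t):
    """Number of VMs stuck at time t, i.e. what a point-in-time gauge shows."""
    stuck = set()
    for e in sorted(events, key=lambda e: e.t):
        if e.t > t:
            break
        if e.stuck:
            stuck.add(e.vm)
        else:
            stuck.discard(e.vm)
    return len(stuck)


def became_stuck_total_at(events, t):
    """Cumulative number of 'became stuck' events up to and including time t."""
    return sum(1 for e in events if e.stuck and e.t <= t)


def became_stuck_between(events, t0, t1):
    """Increase of the cumulative count over the interval (t0, t1]."""
    return became_stuck_total_at(events, t1) - became_stuck_total_at(events, t0)


print(gauge_at(EVENTS, 0), gauge_at(EVENTS, 70))   # gauge at both ends of the window
print(became_stuck_between(EVENTS, 0, 70))          # stuck events inside the window
```

Sampling the gauge at t=0 and t=70 gives 0 both times, while the cumulative count shows 3 stuck events in between. Note that vm-a is counted twice, which is the "duplicate entries for the same VM" trade-off described above.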
Feature idea(s) / DoD
DoD is that we need some way to get the number of VMs that became stuck between two timestamps, rather than just the number that are currently stuck — without this, we're significantly under-counting the rate/quantity of stuck VMs.
Implementation ideas
Continuing from the motivation, if we also have a counter for the number of times any VM became unstuck, we can subtract it from the number of times VMs have become stuck to get the current number of stuck VMs, replacing or augmenting the current metric.
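The subtraction can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the class and field names (`StuckCounters`, `became_stuck_total`, `became_unstuck_total`) are hypothetical and not existing autoscaler-agent metrics.

```python
class StuckCounters:
    """Two counters that only ever increase (hypothetical names)."""

    def __init__(self):
        self.became_stuck_total = 0
        self.became_unstuck_total = 0

    def on_stuck(self):
        # Called each time any VM transitions into the stuck state.
        self.became_stuck_total += 1

    def on_unstuck(self):
        # Called each time any VM leaves the stuck state.
        self.became_unstuck_total += 1

    def currently_stuck(self):
        # Derived value that could replace or augment the existing gauge.
        return self.became_stuck_total - self.became_unstuck_total


counters = StuckCounters()
counters.on_stuck()      # vm-a becomes stuck
counters.on_stuck()      # vm-b becomes stuck
counters.on_unstuck()    # vm-a recovers
print(counters.currently_stuck())
```

One caveat worth keeping in mind: the difference only equals the number of currently stuck VMs if both counters start counting from the same point, with no VM already stuck at that moment.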
Alternatively, because stuckness is currently represented by the autoscaling_agent_runners_current{state=...} metric, we could introduce a "runner state transitions" metric, where new_state="stuck" means the VM has become stuck, and old_state="stuck" means the VM has become unstuck.
This would similarly allow us to unify the handling for panicked/errored runners (instead of having separate autoscaling_agent_runner_fatal_errors_total / autoscaling_agent_runner_thread_panics_total metrics).
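As a rough sketch of the "runner state transitions" idea, the Python below models a counter keyed by (old_state, new_state) and derives the stuck, unstuck, and panicked totals from it. Everything here is an assumption for illustration: the `TransitionCounter` class and the state names other than "stuck" ("ok", "panicked", "errored") are hypothetical, and a real implementation would use the agent's metrics library rather than a plain dictionary.

```python
from collections import Counter


class TransitionCounter:
    """Counts runner state transitions, labelled by (old_state, new_state)."""

    def __init__(self):
        self._counts = Counter()

    def record(self, old_state, new_state):
        # One increment per transition; the counts never decrease.
        self._counts[(old_state, new_state)] += 1

    def total_into(self, state):
        """Transitions whose new_state is `state` (e.g. VMs that became stuck)."""
        return sum(n for (_, new), n in self._counts.items() if new == state)

    def total_out_of(self, state):
        """Transitions whose old_state is `state` (e.g. VMs that became unstuck)."""
        return sum(n for (old, _), n in self._counts.items() if old == state)


transitions = TransitionCounter()
transitions.record("ok", "stuck")
transitions.record("stuck", "ok")
transitions.record("ok", "stuck")
transitions.record("stuck", "errored")
transitions.record("ok", "panicked")

became_stuck = transitions.total_into("stuck")
became_unstuck = transitions.total_out_of("stuck")
print(became_stuck, became_unstuck, became_stuck - became_unstuck)
print(transitions.total_into("panicked"), transitions.total_into("errored"))
```

With this shape, the panic and fatal-error counts fall out of the same metric as the stuck counts, at the cost of one time series per (old_state, new_state) pair that actually occurs.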
Related issues
770