Closed MitchellGerdisch closed 1 year ago
gaugevec is not a kind of metric -- instead, there is a set of stacks_failing gauges, differentiated by name and namespace labels. This should just be changed in the docs to say "gauge".
I can see one pretty obvious problem with how stacks_failing is recorded. The scheme is this:
the gauge has labels namespace and name, meaning each gauge in the set is particular to an individual stack.
This works at all because the labels identify an individual stack, so the gauge doesn't have to keep track of a count, just say whether the single stack in question qualifies as a failed stack or not. A query of the stacks_failing metric will usually aggregate over the set of time series to get a count. This is a bit of an abuse of labels, because you're generally not supposed to name individual things (since that can generate arbitrary numbers of bitty time series). But it does what you expect, at least.
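To make the scheme concrete, here is a minimal sketch in Go using only the standard library. It is a toy model of the labelled gauge set, not the operator's code and not the Prometheus client API: the names stackKey, failingGauges, Record and Sum are hypothetical, and the stack names other than podinfo-autoapi are made up. Each stack gets its own 0/1 value, and the count of failing stacks only appears when the values are summed, which is what an aggregating query such as sum(stacks_failing) would do.

```go
package main

import "fmt"

// stackKey stands in for the (namespace, name) label pair (hypothetical type).
type stackKey struct {
	Namespace, Name string
}

// failingGauges models the labelled gauge set: one 0/1 value per stack.
type failingGauges map[stackKey]float64

// Record sets the gauge for a single stack after a reconcile.
func (g failingGauges) Record(k stackKey, failed bool) {
	if failed {
		g[k] = 1
		return
	}
	g[k] = 0
}

// Sum aggregates over every series in the set, giving the number of
// stacks whose last recorded reconcile failed.
func (g failingGauges) Sum() float64 {
	total := 0.0
	for _, v := range g {
		total += v
	}
	return total
}

func main() {
	g := failingGauges{}
	g.Record(stackKey{"default", "podinfo-autoapi"}, true)
	g.Record(stackKey{"default", "other-stack"}, false)
	g.Record(stackKey{"staging", "third-stack"}, true)
	fmt.Println(g.Sum()) // prints 2
}
```

No series in this model ever holds a count; the count exists only in the aggregation.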
Except: the stacks_failing gauge isn't set to 0 when the stack is deleted. So, if you delete the stack while it's failing, there will be a stacks_failing{namespace=..., name=...} 1 left behind:
$ kubectl get stacks
No resources found in default namespace.
$ curl http://localhost:8383/metrics | grep stacks_failing
# HELP stacks_failing Number of stacks currently registered where the last reconcile failed
# TYPE stacks_failing gauge
stacks_failing{name="podinfo-autoapi",namespace="default"} 1
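The leftover series can be reproduced with a small standard-library Go model. This is a sketch under assumptions, not the operator's actual fix: failingGauges, Sum and Forget are hypothetical names. It shows that a series left at 1 keeps inflating the aggregate after its stack is gone, and that removing the series on deletion brings the count back down. Setting the value to 0 on deletion would also fix the count, but the series itself would remain. With prometheus/client_golang, the corresponding removal call on a GaugeVec is DeleteLabelValues; the thread does not say which approach the fix took.

```go
package main

import "fmt"

// stackKey stands in for the (namespace, name) label pair (hypothetical type).
type stackKey struct {
	Namespace, Name string
}

// failingGauges models the labelled gauge set: one 0/1 value per stack.
type failingGauges map[stackKey]float64

// Sum aggregates over every series that still exists in the set.
func (g failingGauges) Sum() float64 {
	total := 0.0
	for _, v := range g {
		total += v
	}
	return total
}

// Forget drops the series for a stack that has been deleted, so it no
// longer contributes to any aggregate.
func (g failingGauges) Forget(k stackKey) {
	delete(g, k)
}

func main() {
	g := failingGauges{}
	k := stackKey{"default", "podinfo-autoapi"}

	g[k] = 1 // last reconcile failed

	// The stack is deleted here, but nothing touches the gauge:
	// the stale series still counts as a failing stack.
	fmt.Println(g.Sum()) // prints 1

	// Cleaning up on deletion removes the false positive.
	g.Forget(k)
	fmt.Println(g.Sum()) // prints 0
}
```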
> the stacks_failing gauge isn't set to 0 when the stack is deleted.
I've had word that fixing this has removed false positives for a production user. So, on the basis that the documentation is corrected, and the reported problem with it is fixed, I'm going to close this.
What happened?
If one sets up port-forwarding from the pulumi operator pod on 8383/metrics one sees something like:
The # TYPE stacks_failing gauge line implies it's a gauge, while the documentation here: https://github.com/pulumi/pulumi-kubernetes-operator/blob/master/docs/metrics.md#metrics-overview indicates it's a gaugevec metric.
Steps to reproduce
The code here can be used to set up an environment for testing what the operator metrics endpoint emits: https://github.com/MitchellGerdisch/pulumi-work/tree/master/pulumi-operator
Expected Behavior
The docs and the output from metrics should be in sync.
Actual Behavior
One says stacks_failing is a gauge and one says it's a gaugevec.
Output of pulumi about
No response
Additional context
No response
Contributing
Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).