pulumi / pulumi-kubernetes-operator

A Kubernetes Operator that automates the deployment of Pulumi Stacks
Apache License 2.0
221 stars 55 forks source link

stacks_failing metric may not be documented or implemented correctly #399

Closed MitchellGerdisch closed 1 year ago

MitchellGerdisch commented 1 year ago

What happened?

If one sets up port-forwarding from the pulumi operator pod on 8383/metrics one sees something like:

# HELP stacks_active Number of stacks currently tracked by the Pulumi Kubernetes Operator
# TYPE stacks_active gauge
stacks_active 1
# HELP stacks_failing Number of stacks currently registered where the last reconcile failed
# TYPE stacks_failing gauge
stacks_failing{name="core-vault",namespace="pulumi-operator"} 0

The # TYPE stacks_failing gauge line implies it's a gauge

While the documentation here: https://github.com/pulumi/pulumi-kubernetes-operator/blob/master/docs/metrics.md#metrics-overview indicates it's a gaugevec metric.

Steps to reproduce

The code here should be able to be used to set up an environment to test what is emitted by the operator metrics: https://github.com/MitchellGerdisch/pulumi-work/tree/master/pulumi-operator

Expected Behavior

The docs and the output from metrics should be in sync.

Actual Behavior

One says stacks_failing is a gauge and one says it's gaugevec

Output of pulumi about

No response

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

squaremo commented 1 year ago
squaremo commented 1 year ago

402 fixes the type of the stacks_failing metric in the documentation, which should prevent some confusion.

I can see one pretty obvious problem with how stacks_failing is recorded. The scheme is this:

This works at all because the labels identify an individual stack, so it doesn't have to keep track of a count, just to say whether the single stack in question qualifies as a failed stack or not. A query of the stacks_failing metric will usually aggregate over the set of time series to get a count. This is a bit of an abuse of labels, because you're generally not supposed to name individual things (since then it can generate arbitrary numbers of bitty time series). But it does what you expect, at least.

Except: the stacks_failing gauge isn't set to 0 when the stack is deleted. So, if you delete the stack while it's failing, there will be a stacks_failing{namespace=..., name=...} 1 left behind:

$ kubectl get stacks
No resources found in default namespace.

$ curl http://localhost:8383/metrics | grep stacks_failing
# HELP stacks_failing Number of stacks currently registered where the last reconcile failed
# TYPE stacks_failing gauge
stacks_failing{name="podinfo-autoapi",namespace="default"} 1
squaremo commented 1 year ago

the stacks_failing gauge isn't set to 0 when the stack is deleted.

I've had word that fixing this has removed false positives for a production user. So, on the basis that the documentation is corrected, and the reported problem with it is fixed, I'm going to close this.