Add more labels to prometheus output format

sebhoss commented 6 months ago

Describe the feature:

The prometheus output format should include references to individual test cases in order to allow more fine-grained alerting. The current implementation only contains a summary of test cases for each resource type and a summary across all tests. Since test cases can have different criticality levels, neither of those summaries is sufficient to decide whether a person should be called right now (potentially in the middle of the night) or not.

This has been previously discussed in #607, and my understanding is that it was postponed but not rejected. There is a valid concern about storage cost, so I don't think this should/must be enabled by default; rather, it should be a format option for the prometheus output.

Describe the solution you'd like

The metric goss_tests_outcomes_total should contain additional labels that uniquely identify a single test case, or a different metric should be introduced that does exactly that.
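
To make the request concrete, here is a rough sketch of what a per-test-case sample could look like in the Prometheus text exposition format. It is written in Python purely for illustration and is not goss code. The `type` and `outcome` labels are assumed from the per-resource-type summary described above; `resource_id` and `property` are hypothetical label names that goss does not emit today.

```python
# Sketch only: renders one per-test-case sample in the Prometheus text
# exposition format. The labels resource_id and property are hypothetical.


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(metric: str, labels: dict[str, str], value: int) -> str:
    """Render one sample line of the form: metric{k="v",...} value."""
    pairs = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{metric}{{{pairs}}} {value}"


if __name__ == "__main__":
    # One failing test case: the sshd service is expected to be running.
    print(
        render_sample(
            "goss_tests_outcomes_total",
            {
                "type": "service",
                "outcome": "fail",
                "resource_id": "sshd",  # hypothetical label
                "property": "running",  # hypothetical label
            },
            1,
        )
    )
```

With labels like these, an alert rule could match on a specific test case instead of on the aggregate count. The cost is one time series per test case and outcome, which is the storage concern mentioned above.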

Describe alternatives you've considered

My current workaround is an annotation I placed on the prometheus alert for goss tests that links to a runbook which tells the poor soul who has to deal with this to do:

$ ssh ...
$ goss --gossfile ... validate

This returns the failing tests and they can decide whether to fix it right now or go back to bed.

aelsabbahy commented 6 months ago

@petemounce would love your thoughts on this?

petemounce commented 5 months ago

A format option makes sense to me if this is done at all. Should be off by default.

However, there's a potential workaround to needing this depending on the environment's observability capabilities.

@sebhoss the way we handled dealing with goss failure was twofold:

- goss hosted a health check. If too many failures happened, the machine was terminated and replaced by auto scaling. No human was woken up; we filed a ticket-priority alert if this happened "too often" (as in, "important but not paging").
- We logged the goss output centrally (with an output format that was quiet on success and verbose on failure). The ticket contained the log query to ascertain the failing test(s). We also had a way to make the health check pull the machine from service instead of terminating it. It was then daytime support work to debug whatever was worth debugging to lower the machine-replacement rate.

sebhoss commented 5 months ago

Thanks for the feedback!

I'm totally fine with this feature being disabled by default. That said, I think even your setup, @petemounce, would benefit from it: instead of terminating machines based on a count of all failures, you could count only pager-priority failures. Some failures do not really impact service availability, and terminating machines because of those might not be what you want.
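
A minimal sketch of that idea, in Python for illustration only. It assumes each test result carries a severity; the `CheckResult` type, the `severity` field, and its `"page"`/`"ticket"` values are all hypothetical and are not part of goss.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CheckResult:
    """One test case outcome. The severity field is a hypothetical annotation."""

    resource_id: str
    severity: str  # hypothetical values: "page" or "ticket"
    passed: bool


def count_paging_failures(results: Iterable[CheckResult]) -> int:
    """Count only the failed checks that are marked as pager-priority."""
    return sum(1 for r in results if not r.passed and r.severity == "page")


def should_terminate(results: Iterable[CheckResult], max_paging_failures: int = 0) -> bool:
    """Decide on replacement from pager-priority failures, ignoring the rest."""
    return count_paging_failures(results) > max_paging_failures


if __name__ == "__main__":
    results = [
        CheckResult("sshd", "page", passed=False),
        CheckResult("/etc/motd", "ticket", passed=False),
        CheckResult("nginx", "page", passed=True),
    ]
    print(count_paging_failures(results), should_terminate(results))
```

Here the failing `/etc/motd` check does not count towards replacement, while the failing `sshd` check does.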

Since this feature is kinda blocking us from adopting goss for more machines/tests in our org, how about I open up a PR and we discuss technical details over there?

petemounce commented 5 months ago

Sure, PR welcome. Please include test coverage for the new code path?

(One of our guiding principles is that a machine failure should never wake someone up; the system should be resilient to that, and sand is cheaper than carbon to run. We replace machines "quickly enough" that there's no overall impact. We page if service is impacted, which can happen for a number of reasons, machine taint and replacement being one.)

goss-org / goss

Add more labels to prometheus output format #862