ndonegan opened 10 months ago
Cc: @petemounce
I think the original intention for `/healthz` is "hey, do the tests run and pass?", and `/metrics` is for the Prometheus scraper to come along and pull the cumulatively increasing counters.
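For concreteness, the scraper side of that split might be wired up like the following minimal sketch (an assumption for illustration: `goss serve` listening on localhost:8080, the port used in the reproduction steps below):

```yaml
# Minimal Prometheus scrape config sketch — not taken from the goss docs.
scrape_configs:
  - job_name: goss
    metrics_path: /metrics        # cumulative counters live here; /healthz is for health checks
    static_configs:
      - targets: ['localhost:8080']   # assumed goss serve address
```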
The `/healthz` endpoint has a prometheus format option, which I would expect to return the same data as the other options. From Prometheus' own documentation:
> A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. Gauges are typically used for measured values like temperatures or current memory usage.
This is a pretty good description of the data which I'd expect to be returned from the `/healthz` endpoint. It would make it very easy to write up alerts like `sum(goss_tests_outcomes_total{outcome="fail"}) > 0` without having to worry about calculating the rate over time.
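To sketch what that gauge-style output could look like on the wire (hypothetical: the metric name `goss_tests_outcomes` and the values are made up for illustration, matching the two-test example later in this issue; this is not goss's current output):

```
# HELP goss_tests_outcomes Outcomes of the test run triggered by this request (hypothetical)
# TYPE goss_tests_outcomes gauge
goss_tests_outcomes{outcome="pass"} 1
goss_tests_outcomes{outcome="fail"} 1
```

Every hit to `/healthz` would overwrite these values with the results of just that run.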
For context, I use `goss serve` as a health endpoint which AWS can hit. AWS tends to be aggressive with checking health endpoints, which is good. It would also be beneficial for Prometheus to be able to hit the same endpoint, so we can get metrics on what types of tests are passing and failing.

The problem with using a counter which increments every time the endpoint is hit is that the AWS healthchecks inflate the numbers artificially fast. This makes the metrics less useful even when rate is applied, because the counters grow with every health-check hit rather than only with genuine changes in test results.
Having something in the main `/metrics` endpoint tracking the total number of test runs would actually be handy! However, the `/healthz` endpoint's prometheus output really should just be a point-in-time snapshot, using gauges.
Thanks for elaborating.
> Gauges are typically used for measured values like temperatures or current memory usage
The counter docs:

> A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.
This is what goss' `/metrics` implementation does: count test outcomes ("requests for tests to run") over time.
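For illustration, the counter style looks roughly like this (the values, and the existence of a `pass` label value, are assumptions for the sketch; only `goss_tests_outcomes_total` and the `outcome` label appear in this issue):

```
# TYPE goss_tests_outcomes_total counter
goss_tests_outcomes_total{outcome="pass"} 41
goss_tests_outcomes_total{outcome="fail"} 7
```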
The intention is that:

- hits to `/healthz` cause tests to run
- `/metrics` yields the metrics since the start of `goss serve`
- the Prometheus scraper hits `/metrics` whenever it wants to

> ... write up alerts like `sum(goss_tests_outcomes_total{outcome="fail"}) > 0`
Is there a semantic difference between "AWS health checks hit `/healthz` and tests run" and "prometheus scraper hits `/healthz` and tests run" for you?
I think the gauges approach suffers if particular tests fail intermittently, compared to the counters approach, which does not.

With gauges, an alert as you describe will only catch an intermittently-failing test by chance, if a scrape happens to coincide with a failure. With counters, the alert looks at the complete history of the outcomes and is not subject to that "leak".

With gauges, one can increase the scrape rate to lower the chance of missing an intermittent failure (though one would only know to do that by luck), at the cost of increased load and storage on Prometheus because it scrapes more often. This is unnecessary with the counters approach. The two alert styles are sketched side by side below.
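A hedged sketch of the two styles (`goss_tests_failed` is a hypothetical gauge name used only for illustration; `goss_tests_outcomes_total` is the counter discussed in this issue):

```
# Gauge style (hypothetical metric): only fires if a failure happens to
# overlap a scrape, so intermittent failures can slip through unseen.
goss_tests_failed > 0

# Counter style: increase() surfaces every failure recorded in the window,
# whether or not a scrape coincided with the failing run.
increase(goss_tests_outcomes_total{outcome="fail"}[1h]) > 0
```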
WDYT?
> ... without having to worry about calculating the rate over time.
Is that rate over time different from (roughly) `sum(rate(goss_tests_outcomes_total{outcome="fail"}[60m])) == 0`? (As in "the sum of the rate of failed tests in the last 60m should be 0"; an alert would probably apply a `for: 2m` or similar to allow for flickers.)
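Spelled out as a Prometheus alerting rule, that might look like the following sketch (the group and alert names are made up, and the `== 0` condition is inverted so the rule fires when failures appear):

```yaml
groups:
  - name: goss                    # hypothetical group name
    rules:
      - alert: GossTestsFailing   # hypothetical alert name
        expr: sum(rate(goss_tests_outcomes_total{outcome="fail"}[60m])) > 0
        for: 2m                   # allow for flickers, as above
        annotations:
          summary: goss tests have failed within the last 60m
```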
Aside: I appreciate that the lack of documentation isn't helpful here. I intend to address that after #856 merges.
Let's bring this back to basics.
The problem arises from the expectation of what the Prometheus output should be for the `/healthz` endpoint. I fully agree that the `/metrics` endpoint should give a good overall view, with metrics since the daemon was started.

However, every other format available for the `/healthz` endpoint reflects the checks run when the endpoint was queried, and JUST that particular run. Using the example goss.yaml from above:
documentation:

```
Total Duration: 0.000s
Count: 2, Failed: 1, Skipped: 0
```

json:

```
"summary-line": "Count: 2, Failed: 1, Skipped: 0, Duration: 0.000s",
```

junit:

```
<testsuite name="goss" errors="0" tests="2" failures="1" skipped="0" time="0.000" timestamp="2024-01-25T21:49:38Z">
```

nagios:

```
GOSS CRITICAL - Count: 2, Failed: 1, Skipped: 0, Duration: 0.000s
```

rspecish:

```
Count: 2, Failed: 1, Skipped: 0
```

structured:

```
"summary-line": "Count: 2, Failed: 1, Duration: 0.000s
```
The ONLY output format for `/healthz` which doesn't represent the tests run for that particular call is prometheus, and this can be easily remedied by converting it to the gauge type.
Any conclusion on this? I haven't chimed in so far since I've mostly stayed out of the Prometheus implementation, and I haven't had time to go down the Prometheus rabbit hole or its best practices. So, I'm mostly depending on community consensus and contributor time to maintain it.
@aelsabbahy I was awaiting merge of #856 so I could more clearly document the intended use-cases that are supported. I see that's merged now, so I'll plan to write some content in June once back from traveling.
Thank you for the update, have a great trip!
Describe the bug
Looking at using the Prometheus output for health checks, it looks like the incorrect data types are being used for some of the metrics. Rather than using a gauge data type to give the results from this run, it's instead a counter with the cumulative results from multiple runs.

For a `/healthz` endpoint, it makes more sense to output the data as a gauge with just the results for the requested run.
How To Reproduce
Using any example set of goss tests, run `goss s -f prometheus`. Then run `curl http://localhost:8080/healthz`.
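The issue's original example file isn't reproduced here, but a minimal goss file with one passing and one failing check might look like this sketch (the paths are arbitrary assumptions):

```yaml
# Hypothetical minimal goss.yaml: one passing and one failing file check
file:
  /etc/hosts:
    exists: true      # passes on most systems
  /no/such/file:
    exists: true      # fails, since the path doesn't exist
```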
For testing I've been using a simple goss file along these lines, which will return a pass and a fail.

Expected Behavior
When using the prometheus-format healthz endpoint, all the HELP lines indicate that the results returned will be for this run. In practice, they're all counters rather than gauges, which means they increment between runs.
Actual Behavior
With the simple goss.yaml above just doing two file checks:

First run:

Second run:

Note that `goss_tests_outcomes_total` and `goss_tests_run_outcomes_total` are counting across runs rather than just returning a gauge for the results from this run.

Environment: