goss-org / goss

Quick and Easy server testing/validation
https://goss.rocks
Apache License 2.0

Prometheus Output Format Oddity #867

Open ndonegan opened 5 months ago

ndonegan commented 5 months ago

Describe the bug Looking at using the Prometheus output for health checks, it looks like incorrect data types are being used for some of the metrics. Rather than using a gauge data type to give the results from this run, it's instead a counter with the cumulative results from multiple runs.

For a /healthz endpoint, it makes more sense to output the data as a gauge with just the results for the requested run.

How To Reproduce Using any example set of goss tests, run goss s -f prometheus, then run curl http://localhost:8080/healthz. For testing I've been using a simple goss file which returns one pass and one fail.

file:
  docs:
    title: Check docs exists
    exists: true
  nodocs:
    title: Check nodocs exists
    exists: true

Expected Behavior When using the prometheus format healthz endpoint, all the HELP lines indicate that the results returned will be for this run. In practice, they're all counters rather than gauges, which means they increment between runs.

Actual Behavior With the simple goss.yaml above just doing two file checks:

First run:

# HELP goss_tests_outcomes_duration_milliseconds The duration of tests from this run. Note; tests run concurrently.
# TYPE goss_tests_outcomes_duration_milliseconds counter
goss_tests_outcomes_duration_milliseconds{outcome="fail",type="file"} 0
goss_tests_outcomes_duration_milliseconds{outcome="pass",type="file"} 0
# HELP goss_tests_outcomes_total The number of test-outcomes from this run.
# TYPE goss_tests_outcomes_total counter
goss_tests_outcomes_total{outcome="fail",type="file"} 1
goss_tests_outcomes_total{outcome="pass",type="file"} 1
# HELP goss_tests_run_duration_milliseconds The end-to-end duration of this run.
# TYPE goss_tests_run_duration_milliseconds counter
goss_tests_run_duration_milliseconds{outcome="pass"} 0
# HELP goss_tests_run_outcomes_total The outcomes of this run as a whole.
# TYPE goss_tests_run_outcomes_total counter
goss_tests_run_outcomes_total{outcome="pass"} 1

Second Run:

# HELP goss_tests_outcomes_duration_milliseconds The duration of tests from this run. Note; tests run concurrently.
# TYPE goss_tests_outcomes_duration_milliseconds counter
goss_tests_outcomes_duration_milliseconds{outcome="fail",type="file"} 0
goss_tests_outcomes_duration_milliseconds{outcome="pass",type="file"} 0
# HELP goss_tests_outcomes_total The number of test-outcomes from this run.
# TYPE goss_tests_outcomes_total counter
goss_tests_outcomes_total{outcome="fail",type="file"} 2
goss_tests_outcomes_total{outcome="pass",type="file"} 2
# HELP goss_tests_run_duration_milliseconds The end-to-end duration of this run.
# TYPE goss_tests_run_duration_milliseconds counter
goss_tests_run_duration_milliseconds{outcome="pass"} 0
# HELP goss_tests_run_outcomes_total The outcomes of this run as a whole.
# TYPE goss_tests_run_outcomes_total counter
goss_tests_run_outcomes_total{outcome="pass"} 2

Note that goss_tests_outcomes_total and goss_tests_run_outcomes_total are counting across runs rather than just returning a gauge for the results from this run.
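
For comparison, a per-run gauge version of these metrics would report the same point-in-time values on every run. A sketch of that intended shape (not current goss output; metric names kept as-is):

# hypothetical per-run gauge output (not what goss currently emits)
# TYPE goss_tests_outcomes_total gauge
goss_tests_outcomes_total{outcome="fail",type="file"} 1
goss_tests_outcomes_total{outcome="pass",type="file"} 1
# TYPE goss_tests_run_outcomes_total gauge
goss_tests_run_outcomes_total{outcome="pass"} 1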

Environment:

aelsabbahy commented 5 months ago

Cc: @petemounce

petemounce commented 5 months ago

I think the original intention for /healthz is "hey, do the tests run and pass?", and /metrics is for the prometheus scraper to come along and pull the cumulatively increasing counters.
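
For concreteness, that scraping side is just a normal scrape config pointed at goss; a minimal sketch, assuming goss serve's default listen address of :8080 (the job name is a placeholder):

scrape_configs:
  - job_name: goss                    # placeholder job name
    metrics_path: /metrics            # the cumulative counters endpoint
    static_configs:
      - targets: ['localhost:8080']   # assumes the default goss serve listen address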

ndonegan commented 5 months ago

The /healthz endpoint has a prometheus format option, which I would expect to return the same data as the other options. From Prometheus' own documentation:

A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.

Gauges are typically used for measured values like temperatures or current memory usage

This is a pretty good description of the data which I'd expect to be returned from the /healthz endpoint. It would make it very easy to write up alerts like sum(goss_tests_outcomes_total{outcome="fail"}) > 0 without having to worry about calculating the rate over time.
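
For illustration, that alert could live in an ordinary Prometheus rule file along these lines; a minimal sketch, assuming the metric only reflects the latest run (group and alert names are placeholders):

groups:
  - name: goss-healthz                # placeholder group name
    rules:
      - alert: GossTestFailing        # placeholder alert name
        # fires whenever the latest run reported any failing test;
        # assumes goss_tests_outcomes_total holds only that run's results
        expr: 'sum(goss_tests_outcomes_total{outcome="fail"}) > 0'
        annotations:
          summary: one or more goss tests failed in the latest run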

For context, I use goss serve as a health endpoint which AWS can hit. AWS tends to be aggressive with checking the health endpoints, which is good. It would also be beneficial for Prometheus to be able to hit the same endpoint so we can get metrics on what type of tests are passing and failing.

The problem with using a counter which increments every time the endpoint is hit is that the AWS healthchecks artificially inflate the number stupidly fast. This makes the metrics less useful even when rate is applied.

Having something in the main /metrics endpoint tracking the total number of test runs would actually be handy! However, the /healthz endpoint's prometheus output really should just be a point-in-time view of that run's results.

petemounce commented 5 months ago

Thanks for elaborating.

Gauges are typically used for measured values like temperatures or current memory usage

The counter docs:

A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.

This is what goss' /metrics implementation does: count test outcomes ("requests for tests to run") over time.

The intention is exactly that:

... write up alerts like sum(goss_tests_outcomes_total{outcome="fail"}) > 0

Is there a semantic difference between "AWS health checks hit /healthz and tests run" and "prometheus scraper hits /healthz and tests run" for you?

I think the gauges approach suffers if the particular tests intermittently fail, compared to the counters approach which does not.

With gauges, an alert as you describe will only catch an intermittently-failing test by chance if a scrape happens to coincide with a failure. With counters, the alert looks at the complete history of the outcomes and is not subject to that "leak".

With gauges, one can increase the scrape rate to lower the chance of missing an intermittent failure (though knowing to do so would itself be down to luck), at the cost of increased load and storage on prometheus because it scrapes more often. This is unnecessary with the counters approach.

WDYT?

... without having to worry about calculating the rate over time.

Is that rate over time different from (roughly) sum(rate(goss_tests_outcomes_total{outcome="fail"}[60m])) == 0? (As in "the sum of the rate of failed tests in the last 60m should be 0", and an alert would probably apply a for: 2m or similar to allow for flickers.)
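
Spelled out as a rule, the counter-based version might look roughly like this; a sketch (group and alert names are placeholders), where the expression is just the inverse of the healthy condition above:

groups:
  - name: goss-metrics                # placeholder group name
    rules:
      - alert: GossTestFailures       # placeholder alert name
        # fires when the failure counter has increased over the last 60m,
        # i.e. sum(rate(...)) == 0 no longer holds
        expr: 'sum(rate(goss_tests_outcomes_total{outcome="fail"}[60m])) > 0'
        for: 2m                       # allow for flickers, as above
        annotations:
          summary: goss tests have recorded failures in the last 60m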

Aside: I appreciate that the lack of documentation isn't helpful here. I intend to address that after #856 merges.

ndonegan commented 5 months ago

Let's bring this back to basics.

The problem arises from the expectation of what the Prometheus output should be for the /healthz endpoint. I fully agree that the /metrics endpoint should give a good overall view, with metrics accumulated since the daemon was started.

However, every other format available for the healthz endpoint reflects the checks run when the endpoint was queried, and JUST that particular run. Using the example goss.yaml from above:

documentation:

Total Duration: 0.000s
Count: 2, Failed: 1, Skipped: 0

json:

"summary-line": "Count: 2, Failed: 1, Skipped: 0, Duration: 0.000s",

junit:

<testsuite name="goss" errors="0" tests="2" failures="1" skipped="0" time="0.000" timestamp="2024-01-25T21:49:38Z">

nagios:

GOSS CRITICAL - Count: 2, Failed: 1, Skipped: 0, Duration: 0.000s

rspecish:

Count: 2, Failed: 1, Skipped: 0

structured:

"summary-line": "Count: 2, Failed: 1, Duration: 0.000s

The ONLY output format for /healthz which doesn't represent the tests run for that particular call is prometheus, and this can be easily remedied by converting it to the gauge type.

aelsabbahy commented 1 month ago

Any conclusion on this? I haven't chimed in so far since I've mostly stayed out of the Prometheus implementation, and I haven't had time to go down the Prometheus rabbit hole or its best practices. So I'm mostly depending on community consensus and contributors' time to maintain it.

petemounce commented 1 month ago

@aelsabbahy I was awaiting merge of #856 so I could more clearly document the intended use-cases that are supported. I see that's merged now, so I'll plan to write some content in June once back from traveling.

aelsabbahy commented 1 month ago

Thank you for the update, have a great trip!