goss-org / goss

Quick and Easy server testing/validation
https://goss.rocks
Apache License 2.0
5.53k stars 470 forks source link

Prometheus output format #771

Closed mhuysamen closed 1 year ago

mhuysamen commented 2 years ago
Checklist

Description of change

Add a prometheus output format.

The format was copied from DracoBlue.

Additions include: meta are added to labels on the testoutput metric goss_test_count metric added

There is no special negotiation (Accept Header) logic, so goss serve must be started with -f prometheus, or scraper must be configured to send Accept header as application/vnd.goss-prometheus.

Endpoint will respond with text/plain; version=0.0.4 as per Prometheus specification.

aelsabbahy commented 1 year ago

Can you explain at a high level the difference between this PR and #607?

I'm going to cross reference this PR with the open issue to hopefully get some reviews on this.

Apologies for missing this earlier, got it mixed up with #607 and didn't realize it was a new PR/approach.

mhuysamen commented 1 year ago

This PR adds a new output format for goss, to present the results in Prometheus format. This is not prometheus metrics for how well goss is running, but this is meant to be triggered via serve, to trigger a test run and output the test results in detail.

Unlike #607 this PR doesn't add a new endpoint. goss already allows the endpoint to be changed with configuration, which serves the purpose to run as a prometheus collector.

Also, no additional external libraries are added, creating the prometheus output by hand.

aelsabbahy commented 1 year ago

@petemounce would love your review on this since you spent a lot of time thinking about this problem.

I'm open to merging whatever the community agrees on.

aelsabbahy commented 1 year ago

I've tagged this issue in #607 in the hopes that it garners the attention of those watching that issue.

To be honest with you, not being a Prometheus expert myself, I'll have to depend on the community on this one.

Basically, whatever the community agrees on, I'll merge/release. The hope is that what's agreed upon is something that makes most (ideally all) happy and won't change much. Similar to the other output formats.

@petemounce would be ideal.

t2d commented 1 year ago

I would be happy to see this merged. It's basically what I asked for in https://github.com/aelsabbahy/goss/pull/607#issuecomment-1112366423

aelsabbahy commented 1 year ago

Thanks for the feedback @t2d !

Tagging a few of the active participants from the previous chat to get a better gauge on consensus on this PR. Glad to see movement on this again.

@harre-orz @Smithx10 @petemounce @timeu any thoughts?

Many thanks for doing this work and clear explanation of differences @mhuysamen

aelsabbahy commented 1 year ago

Ok, going to take the inverse approach. Last call to provide feedback on this before it goes out.

If I don't get an objection to this PR within a week, I'll merge/release it.

timeu commented 1 year ago

@aelsabbahy : I skimmed through the code and from my point if view it looks good. The change is relatively small and the added labels would help us a lot with the prometheus alerting because it would make it easier to track flapping goss tests which happens sometimes in our environment.

petemounce commented 1 year ago

Hey, long time! I've been dealing with some pretty hefty personal stuff.

I made some time to compare the two approaches in what I hope is an objective way:

row thing 771 607 comments references
1 function different different 771 uses gauges per iteration; 607 uses continuously incrementing counters. 607 means that functions like rate etc work. 771 means alerts would be (I think) harder to maintain since they'd be forced to be threshold-based and also they'd need to have their thresholds move in lock-step with the test-suite. (My assumption is that metrics exist to be alerted against, but that might not be everyone's starting point)
2 endpoint n y It's a prometheus-ism to host a /metrics endpoint - so I elected to do the same. If this is unwanted (it was in 2020), it's 1 line to remove in 607 Relevant discussion
3 "deals with new concept of 'metrics' as a new feature/paradigm" n n Was @aelsabbahy's feedback in 2020. If 771 is mergeable without this, then I think 607 is too? Relevant discussion
4 includes integration tests n y 771 doesn't, but easily could steal them from 607! Relevant diff
5 includes unit tests n y 771 doesn't. Some could be readily added via prometheus.Output asserting the contents of a buffer. (607 arguably needs fewer since it uses the upstream dependency and chooses to trust its own coverage - so only testing at the dependency-seam) 607 relevant diff
6 avoids missing metrics n y (note: 771 is "I think 'no' here; I could be mistaken?) Missing metrics happen when they're not initialised to 0; 771 is reliant on the inputs to be "complete", here. 607 relevant diff
7 upstream dependency n y The upstream dependency on the prometheus client means a) the endpoint content type; b) the output text format; should be "correct" (as in, need no maintenance beyond dependency bumps) This should mean less maintenance required over time due to upstream changes (though tbh prometheus is pretty damn settled at this point w.r.t. contract type things!)
8 process health metrics n y via the upstream dependency, goss serve also includes golang process health metrics relevant discussion

My thoughts:

So I think - if 771 is considered mergeable today, I think 607 should also be - and if both are mergeable, I would prefer 607 to be based on the comparison above.

timeu commented 1 year ago

@petemounce: Thanks for the detailed comparison. @aelsabbahy: Based on the above table and comparison I would also prefer #607 especially because it avoids missing metrics because I have run into issues when previous metrics didn't exist in regards to alerting. Also I think that a counter would make more sense than a guage for the reasons @petemounce mentioned.

timeu commented 1 year ago

@petemounce: One difference I noticed between your PR and this one is that this one additonally adds the Property and ResourceId as a label to the metrics. I remember we discussed this and considered of making it configurable. I guess this will be left for a potential followup ?

petemounce commented 1 year ago

@timeu yes, that would be my intention; more scope later.

aelsabbahy commented 1 year ago

@mhuysamen and @t2d do you agree with the above?

I know @petemounce time is limited, but if #607 is the way to go, it's a minor merge conflict away from being mergeable. I'm willing to drop the whole metrics part of the chat, it can be a future iteration/improvement.

Hopefully we're very close on this

mhuysamen commented 1 year ago

I don't understand the gauge vs counter comments. Counters imply that there is state stored between subsequent checks, which means it's only of use when goss is running in serve mode.

My patch only adds an output format for goss output. It's not meant to serve goss metrics (how many checks were done, total failed/succeeded, etc). This would have to be a specific implementation, exposing those generic metrics on a /metrics endpoint, and only useful in serve mode again, where counters do make sense.

Yes, I don't have tests, but as noted these can be added.

The missing values argument doesn't make sense either, since the output depends on which checks have been configured. We cannot determine which checks will be performed/configured. Dealing with missing metrics is more relevant in summaries, where counters for all the different types of checks are predetermined.

The gauge/counter selection could also be configurable, but adds complexity.

I don't particularly care which implementation is accepted, as long as someone is able to run goss and request a prometheus output. Either via serve, or via validate.

petemounce commented 1 year ago

My use-case is goss running in serve mode. It's used to implement health/readiness check endpoints for machines - the poller hits the endpoint every few seconds and the tests run. If tests fail 5x in a row, the machine is automatically killed and replaced.

607 returns the prometheus output for both serve (where it is available via the content-type negotiation that #609 implemented) and the one-shot validate command.

mhuysamen commented 1 year ago

Then #607 being more complete, should be the one to be merged.

aelsabbahy commented 1 year ago

607 has been merged, so I am closing this for now.

Next goss release will have #607 in it.

Follow up PRs welcome, glad to see this finally get completed =)

Thank you all for your patience and contribution.. this was a pretty large effort and lots of opinions on the matter.

Thank you again!