sensu / sensu-plugin-sdk

A framework for creating Sensu plugins
MIT License

Feature/add prometheus exposition #67

Closed c-kruse closed 2 years ago

c-kruse commented 2 years ago

What

First half of https://github.com/sensu/sensu-plugin-sdk/issues/65.

Adds a helper function for writing Sensu metrics to the Prometheus exposition format.
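
For context, here is a minimal sketch of what a helper like this can look like, built on the Prometheus client_model and expfmt packages. The function name, the lack of metric-name sanitization, and the seconds-to-milliseconds conversion are illustrative assumptions rather than the SDK's actual API (and that timestamp assumption is exactly what gets debated later in this thread):

```go
package promexposition

import (
	"io"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
	corev2 "github.com/sensu/sensu-go/api/core/v2"
)

// writePromText renders Sensu metric points as Prometheus text exposition
// format, one untyped metric family per point. Illustrative sketch only.
func writePromText(w io.Writer, points []*corev2.MetricPoint) error {
	for _, p := range points {
		name := p.Name
		value := p.Value
		// Assumes the Sensu timestamp is seconds since epoch; the Prometheus
		// exposition format wants milliseconds.
		ts := p.Timestamp * 1000

		metric := &dto.Metric{
			Untyped:     &dto.Untyped{Value: &value},
			TimestampMs: &ts,
		}
		for _, tag := range p.Tags {
			n, v := tag.Name, tag.Value
			metric.Label = append(metric.Label, &dto.LabelPair{Name: &n, Value: &v})
		}

		mt := dto.MetricType_UNTYPED
		family := &dto.MetricFamily{
			Name:   &name, // real code would sanitize names (dots are not valid in Prometheus)
			Type:   &mt,
			Metric: []*dto.Metric{metric},
		}
		if _, err := expfmt.MetricFamilyToText(w, family); err != nil {
			return err
		}
	}
	return nil
}
```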

Testing

Was able to reproduce nearly identical* Prometheus output text end to end by using the prom2json tool to convert the text to JSON so that it could be sorted consistently.

*Sensu stores metric timestamps in seconds and Prometheus uses milliseconds, so the result will have 3 digits of precision dropped.

Setup:

---
type: CheckConfig
api_version: core/v2
metadata:
  namespace: default
  name: sysinfo
spec:
  command: system-check
  handlers:
    - dump-to-prom
  interval: 30
  output_metric_format: "prometheus_text"
  publish: true
  runtime_assets:
  - sensu/system-check
  subscriptions:
  - coconut
---
type: Handler
api_version: core/v2
metadata:
  namespace: default
  name: dump-to-prom
spec:
  command: dump-to-prom
  type: pipe

1. Wait for the check to run, produce an event, and get handled by the dump-to-prom handler. Kill the agent.
2. Pipe the dump-to-prom output through the prom2json tool https://github.com/prometheus/prom2json and then through jq, sorting by metric family name.
3. Get the event check output from sensuctl, pipe it through the prom2json tool, then through jq, sorting the same way.
4. Compare the two near-identical outputs.
    » cat /tmp/promout.txt | prom2json | jq -r '. | sort_by(.name)' > result.json
    » s event info << Your agent entity name >> sysinfo --format json \
    | jq -r '.check.output' \
    | prom2json | jq -r '. | sort_by(.name)' > input.json
jspaleta commented 2 years ago

Ran a local test, and the timestamps in the promout.txt produced by the dump-to-prom handler do not match the timestamps of the original output, nor do they match the timestamps in the event JSON captured by a jq pipe handler.

- sysinfo.txt holds the original check command output; all prom data points have timestamp 1640302638451.
- event.json holds the sensu event with ingested metrics; all sensu metric points have timestamp 1640302638451.
- promout.txt holds the output of dump-to-prom; all prom data points have timestamp 1640301929000.

timestamp 1640301929000 does not appear anywhere in event.json or sysinfo.txt

Across multiple runs of the check, the timestamps in sysinfo.txt and event.json update as expected; the promout.txt timestamps are always 1640301929000.

Here are the relevant Sensu resources I used (with local binaries for the commands):

type: CheckConfig
api_version: core/v2
metadata:
  created_by: admin
  name: sysinfo
  namespace: default
spec:
  check_hooks: null
  command: /tmp/system-check | tee /tmp/sysinfo.txt
  env_vars: null
  handlers:
  - diagnostic
  - prom-dump
  high_flap_threshold: 0
  interval: 60
  low_flap_threshold: 0
  output_metric_format: prometheus_text
  output_metric_handlers: null
  pipelines: []
  proxy_entity_name: ""
  publish: false
  round_robin: false
  runtime_assets: null
  secrets: null
  stdin: false
  subdue: null
  subscriptions:
  - test
  timeout: 0
  ttl: 0
---
type: Handler
api_version: core/v2
metadata:
  created_by: admin
  labels:
    sensu.io/managed_by: sensuctl
  name: diagnostic
  namespace: default
spec:
  command: jq . > /tmp/event.json
  timeout: 0
  type: pipe
---
type: Handler
api_version: core/v2
metadata:
  created_by: admin
  name: prom-dump
  namespace: default
spec:
  command: /tmp/dump-to-prom
  timeout: 0
  type: pipe
jspaleta commented 2 years ago

Here's the problem... you can't assume the timestamp in the Sensu event is seconds since epoch. The precision of the Sensu event timestamp is undefined. You have to use a heuristic and figure out the precision.

Here's what's happening... the sensu agent is ingesting prom metrics and storing the millisecond timestamp verbatim into the sensu event metric points. But the new sdk functions assume sensu event metric point timestamps are second precision... this is the wrong assumption.

I worked around this in the sumologic pipe handler by using a heuristic to guess the time precision based on the size of the integer.
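
For illustration, here is a magnitude-based version of that kind of heuristic, normalizing everything to milliseconds. This is not the sumologic handler's actual code, and the cutoffs are assumptions; it is just the shape of the idea:

```go
// timestampToMillis guesses the precision of a Sensu metric point timestamp
// from its magnitude and normalizes it to milliseconds (illustrative only).
func timestampToMillis(ts int64) int64 {
	switch {
	case ts < 1e11: // ~10 digits or fewer: seconds since epoch
		return ts * 1000
	case ts < 1e14: // ~13 digits: already milliseconds
		return ts
	case ts < 1e17: // ~16 digits: microseconds
		return ts / 1000
	default: // ~19 digits: nanoseconds
		return ts / 1000000
	}
}
```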

c-kruse commented 2 years ago

@jspaleta I did a more thorough read-through this AM, and I could use your help tracking down how you are getting sub-second precision in the event metric point timestamp field.

> sensu agent is ingesting prom metrics and storing the millisecond timestamp verbatim into the sensu event metrics points.

Maybe we're looking at different codebases, but this sounds untrue. The agent prometheus transformer always truncates times to the second: https://github.com/sensu/sensu-go/blob/f9849e45c7936dcb1254f9450e8e6492a5a2c4c5/agent/transformers/prometheus.go#L47 and the related prometheus model Go docs.

A quick browse of the other agent metric format transformers didn't lead me to an obvious check output format I could use to test end to end.

jspaleta commented 2 years ago

@c-kruse If the agent was intended to provide 1-second precision in all cases, that's clearly not working, but that's immaterial. My test above produced sub-second precision in the event metrics using the agent. The event.json I captured from my jq handler shows sub-second precision.

And it's absolutely possible to produce events outside of the agent, using the API, with sub-second precision. The handler functions shouldn't truncate; they should provide prom millisecond precision using whatever timestamp the event has.

c-kruse commented 2 years ago

@jspaleta - I cannot reproduce what you ran into with the latest versions of sensu and the sensu/system-check check, so it is difficult for me to re-test the heuristic code end to end.

Circumventing the agent and posting directly to the sensu backend with a 13-digit (ms precision) timestamp, it is looking okay to me now. Should be ready for you to re-test regardless of the check/agent behavior you are seeing that I cannot reproduce.
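
For reference, roughly how that bypass test can be driven, posting an event with a millisecond metric timestamp straight to the core/v2 events API. The entity/check names, the metric itself, and the SENSU_API_KEY environment variable are assumptions for the example, not anything this PR prescribes:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Hand-rolled event with a 13-digit (millisecond) metric timestamp.
	body := []byte(`{
	  "entity": {"entity_class": "proxy", "metadata": {"name": "fake-entity", "namespace": "default"}},
	  "check": {"metadata": {"name": "sysinfo", "namespace": "default"}, "status": 0, "output": "ok", "handlers": ["dump-to-prom"]},
	  "metrics": {"points": [{"name": "fake_metric", "value": 42, "timestamp": 1641349650354}]}
	}`)

	req, err := http.NewRequest(http.MethodPost,
		"http://127.0.0.1:8080/api/core/v2/namespaces/default/events",
		bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	// API-key auth; assumes an API key is exported as SENSU_API_KEY.
	req.Header.Set("Authorization", "Key "+os.Getenv("SENSU_API_KEY"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```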

jspaleta commented 2 years ago

@c-kruse okay, I just did a local test. The latest agent is producing 1-second timestamp metrics for me as well.

But to complete the loop, I did a local test with a hand-rolled event with metric timestamps of varying numbers of digits:

- 16 digits is interpreted as microseconds and chopped to 13 digits: 1641349650354123 -> 1641349650354
- 13 digits is interpreted as milliseconds and left alone: 1641349650354 -> 1641349650354

Things get weird for integers with 14 and 15 digits... they are basically left alone with no chopping. Now, this is a heuristic, so I have no idea what the best action here really is, but it seems to me that 14- and 15-digit integers should be treated as microseconds like 16-digit ones and chopped accordingly. I don't think it's possible to interpret 14 and 15 digits any differently.

For example, I think 164134965035412 should be chopped to 1641349650354 instead of appearing in the prom output as 164134965035412 (which would be interpreted as a date far in the future, compared to chopping it to 13 digits).

I know it's a bit of a corner case that isn't going to come up much. But other than that, it looks like it's working.

c-kruse commented 2 years ago

@jspaleta went through and added a test to hopefully clear up some of the fuzziness and, incidentally, clear up some integer overflow and floating point gotchas. I bet your 14/15-digit timestamps are a little better behaved this go-around. As is, this handles nano, micro, milli, and second precision between April 1970 and the year 2260. I don't feel there's much room for improvement in messing with these bounds - this person's timestamp tool seems to follow the same heuristics: https://www.unixtimestamp.com/index.php
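
A sketch of what a table-driven test over such a heuristic can look like, using the hypothetical timestampToMillis helper from the earlier sketch rather than the SDK's actual function or bounds:

```go
package promexposition

import "testing"

func TestTimestampToMillis(t *testing.T) {
	cases := []struct {
		name string
		in   int64
		want int64
	}{
		{"seconds", 1641349650, 1641349650000},
		{"milliseconds", 1641349650354, 1641349650354},
		{"microseconds", 1641349650354123, 1641349650354},
		{"nanoseconds", 1641349650354123456, 1641349650354},
	}
	for _, tc := range cases {
		if got := timestampToMillis(tc.in); got != tc.want {
			t.Errorf("%s: timestampToMillis(%d) = %d, want %d", tc.name, tc.in, got, tc.want)
		}
	}
}
```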

In your example where a 15-digit timestamp is provided, 164134965035412, it now gets interpreted as microseconds, i.e. "some time in 1975" - 164134965035 milliseconds since epoch. We could go about interpreting this as you had suggested - as 1/100,000 of a second (a christian-sec) instead of 1/1,000,000th (a microsecond). The main downside for me is I don't feel like anything is likely to expect or use this behavior. It would also cut our range of 1970-2200 down to 2001-2200 for the microsecond range - which sounds ay-okay to me.
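
A quick way to sanity-check those two interpretations of the 15-digit value (illustrative snippet, not part of the PR):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Interpreted as microseconds: divide by 1000 to get milliseconds.
	fmt.Println(time.UnixMilli(164134965035412 / 1000).UTC()) // mid-March 1975
	// Taken verbatim as milliseconds: roughly the year 7171.
	fmt.Println(time.UnixMilli(164134965035412).UTC())
}
```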

jspaleta commented 2 years ago

Yeah, because it's a heuristic it will always have weird corners. The best we can do is the least unexpected thing in the corners, I guess. Let me do one more local test and I should be able to thumbs-up.