Ran a local test, and the timestamps in the promout.txt produced by dump-to-prom do not match the timestamps of the original output, nor do they match the timestamps in the event JSON captured by a jq pipe handler.
- sysinfo.txt holds the original check command output; all prom data points have timestamp 1640302638451.
- event.json holds the Sensu event with the ingested metrics; all Sensu metric points have timestamp 1640302638451.
- promout.txt holds the output of dump-to-prom; all prom data points have timestamp 1640301929000.

The timestamp 1640301929000 does not appear anywhere in event.json or sysinfo.txt.
Across multiple runs of the check, the timestamps in sysinfo.txt and event.json update as expected, but the promout.txt timestamps are always 1640301929000.
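For reference, the timestamps in question are the optional trailing millisecond-precision values on prometheus exposition lines; a made-up example (not actual system-check output) looks like:

# TYPE system_mem_used gauge
system_mem_used{entity="server01"} 43.5 1640302638451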
Here are the relevant Sensu resources I used (with local binaries for the commands):
type: CheckConfig
api_version: core/v2
metadata:
  created_by: admin
  name: sysinfo
  namespace: default
spec:
  check_hooks: null
  command: /tmp/system-check | tee /tmp/sysinfo.txt
  env_vars: null
  handlers:
  - diagnostic
  - prom-dump
  high_flap_threshold: 0
  interval: 60
  low_flap_threshold: 0
  output_metric_format: prometheus_text
  output_metric_handlers: null
  pipelines: []
  proxy_entity_name: ""
  publish: false
  round_robin: false
  runtime_assets: null
  secrets: null
  stdin: false
  subdue: null
  subscriptions:
  - test
  timeout: 0
  ttl: 0
---
type: Handler
api_version: core/v2
metadata:
  created_by: admin
  labels:
    sensu.io/managed_by: sensuctl
  name: diagnostic
  namespace: default
spec:
  command: jq . > /tmp/event.json
  timeout: 0
  type: pipe
---
type: Handler
api_version: core/v2
metadata:
  created_by: admin
  name: prom-dump
  namespace: default
spec:
  command: /tmp/dump-to-prom
  timeout: 0
  type: pipe
Here's the problem... you can't assume the timestamp in the Sensu event is seconds since the epoch. The precision of the Sensu event timestamp is undefined. You have to use a heuristic to figure out the precision.
Here's what's happening... the sensu agent is ingesting prom metrics and storing the millisecond timestamp verbatim into the sensu event metric points. But the new SDK functions assume sensu event metric point timestamps are second precision... this is the wrong assumption.
I worked around this in the sumologic pipe handler by using a heuristic to guess the time precision based on the size of the integer.
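The idea is roughly this (an illustrative sketch only, not the actual sumologic handler code): guess the precision from the size of the integer and normalize everything to milliseconds.

package main

import "fmt"

// toMilliseconds guesses the precision of an epoch timestamp from the size of
// the integer and normalizes it to milliseconds since the epoch.
func toMilliseconds(ts int64) int64 {
    switch {
    case ts < 1e10: // up to 10 digits: treat as seconds
        return ts * 1000
    case ts < 1e13: // 11-13 digits: treat as milliseconds
        return ts
    case ts < 1e16: // 14-16 digits: treat as microseconds
        return ts / 1000
    default: // anything longer: treat as nanoseconds
        return ts / 1000000
    }
}

func main() {
    fmt.Println(toMilliseconds(1640302638))       // seconds      -> 1640302638000
    fmt.Println(toMilliseconds(1640302638451))    // milliseconds -> 1640302638451
    fmt.Println(toMilliseconds(1641349650354123)) // microseconds -> 1641349650354
}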
@jspaleta I did a more thorough read-through this AM and could use your help tracking down how you are getting sub-second precision in the event metric point timestamp field.
sensu agent is ingesting prom metrics and storing the millisecond timestamp verbatim into the sensu event metric points.
Maybe we're looking at different codebases, but this sounds untrue. The agent prometheus transformer always truncates times to the second: https://github.com/sensu/sensu-go/blob/f9849e45c7936dcb1254f9450e8e6492a5a2c4c5/agent/transformers/prometheus.go#L47 (see also the related prometheus model Go docs).
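For context, prometheus model timestamps are milliseconds since the epoch, so truncating to whole seconds is just an integer division. A toy illustration of the effect (not the sensu-go code itself):

package main

import "fmt"

func main() {
    // Prometheus model/exposition timestamps are milliseconds since the epoch.
    tsMs := int64(1640302638451)

    // Truncating to second precision is an integer division; the sub-second
    // part is lost for good.
    tsSec := tsMs / 1000

    fmt.Println(tsSec)        // 1640302638
    fmt.Println(tsSec * 1000) // 1640302638000 (the trailing 451 ms are gone)
}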
A quick browse of the other agent metric format transformers didn't lead me to an obvious check output format I could use to test end to end.
@c-kruse If the agent was intended to provide 1-second precision in all cases, that's clearly not working, but that's immaterial. My test above produced sub-second precision in the event metrics using the agent. The event.json I captured from my jq handler shows subsecond precision.
And it's absolutely possible to produce events outside of the agent, using the API, with subsecond precision. The handler functions shouldn't truncate; they should produce prom millisecond precision using whatever timestamp the event has.
@jspaleta - I cannot reproduce what you ran into with the latest versions of sensu and the sensu/system-check check, so it is difficult for me to re-test the heuristic code end to end.
Circumventing the agent and posting directly to the sensu backend with a 13-digit (ms precision) timestamp, it is looking okay to me now. It should be ready for you to re-test regardless of the check/agent behavior you are seeing that I cannot reproduce.
@c-kruse Okay, I just did a local test. The latest agent is producing 1-second timestamp metrics for me as well.
But to complete the loop I did a local test with a hand-rolled event whose metric timestamps have varying numbers of digits:
- 16 digits is interpreted as microseconds and chopped to 13 digits: 1641349650354123 -> 1641349650354
- 13 digits is interpreted as milliseconds and left alone: 1641349650354 -> 1641349650354

Things get weird for integers with 14 and 15 digits... they are basically left alone with no chopping. Now, this is a heuristic, so I have no idea what the best action here really is, but it seems to me that 14- and 15-digit integers should be treated as microseconds like the 16-digit case and chopped accordingly. I don't think it's possible to interpret 14 and 15 digits any differently.

For example, I think 164134965035412 should be chopped to 1641349650354 instead of being emitted in the prom output as 164134965035412 (which would be interpreted as thousands of years in the future, compared to chopping it to 13 digits).

I know it's a bit of a corner case that isn't going to come up much. But other than that, it looks like it's working.
@jspaleta I went through and added a test to hopefully clear up some of the fuzziness, and incidentally cleared up some integer overflow and floating point gotchas. I bet your 14/15-digit timestamps are a little better behaved this go around. As is, this handles nano, micro, milli, and second precision between April 1970 and the year 2260. I don't feel there's much room for improvement messing with these bounds - this person's timestamp tool seems to follow the same heuristics: https://www.unixtimestamp.com/index.php
In your example where a 15-digit timestamp is provided, 164134965035412, it now gets interpreted as microseconds and comes out as "some time in 1975" - 164134965035 milliseconds since epoch. We could go about interpreting it as you had suggested - as 1/100,000th of a second (a christian-sec) instead of 1/1,000,000th (a microsecond). The main downside for me is I don't feel like anything is likely to expect or use this behavior. It would also cut our range of 1970-2200 down to 2001-2200 for the microsecond range - which sounds ay-okay to me.
Yeah, because it's a heuristic it will always have weird corners. The best we can do is the least unexpected thing in those corners, I guess. Let me do one more local test and I should be able to give this a thumbs up.
What
First half of https://github.com/sensu/sensu-plugin-sdk/issues/65.
Adds a helper function for writing Sensu Metrics to the prometheus exposition format.
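For readers who just want the shape of the idea, here is a stand-alone sketch of rendering simplified metric points as prometheus exposition lines. It is not the helper added in this PR (whose real name, signature, and types live in the diff); the Tag/Point types below are simplified stand-ins for the corev2 event types.

package main

import (
    "fmt"
    "io"
    "os"
    "strings"
)

// Tag and Point are simplified stand-ins for the Sensu metric point types.
type Tag struct{ Name, Value string }

type Point struct {
    Name      string
    Value     float64
    Timestamp int64 // assumed already normalized to milliseconds
    Tags      []Tag
}

// writeProm renders points as prometheus text exposition lines:
//   name{label="value",...} value timestamp_ms
func writeProm(w io.Writer, points []Point) {
    for _, p := range points {
        labels := make([]string, 0, len(p.Tags))
        for _, t := range p.Tags {
            labels = append(labels, fmt.Sprintf("%s=%q", t.Name, t.Value))
        }
        name := p.Name
        if len(labels) > 0 {
            name += "{" + strings.Join(labels, ",") + "}"
        }
        fmt.Fprintf(w, "%s %v %d\n", name, p.Value, p.Timestamp)
    }
}

func main() {
    writeProm(os.Stdout, []Point{{
        Name:      "system_mem_used",
        Value:     43.5,
        Timestamp: 1640302638451,
        Tags:      []Tag{{Name: "entity", Value: "server01"}},
    }})
    // prints: system_mem_used{entity="server01"} 43.5 1640302638451
}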
Testing
Was able to reproduce nearly identical* prometheus output text end to end by using a prom2json tool to jsonize the text so that it could be sorted consistently.

*sensu stores metric timestamps in seconds and prometheus uses milliseconds, so the result will have 3 decimals of precision dropped.

Setup:
- dump-to-prom: a pipe handler that uses this PR to write event metrics as prometheus text to a file (/tmp/promout.txt): https://github.com/c-kruse/dump-to-prom/blob/main/main.go
- sysinfo: a check that runs the sensu/system-check check, has prometheus_text as the metric format, and is pointed at the dump-to-prom handler.