sensu / sensu-go

Simple. Scalable. Multi-cloud monitoring.
https://sensu.io
MIT License
1.03k stars 174 forks source link

sensu-go agent does not handle non-zero metric values properly with graphite_plaintext formatter #3214

Open zenmetsu opened 5 years ago

zenmetsu commented 5 years ago

If a check produces metrics containing NaN, and the graphite_plaintext formatter is selected, the check will fail silently and appear to have not executed when examining the dashboard.

Expected Behavior

The agent should pass the NaN value as-is. It appears that this is a default setting within graphite/carbon. See : link

Current Behavior

Currently, if any metric value from the check contains NaN, the check fails to return data back to sensu-backend

Possible Solution

Steps to Reproduce (for bugs)

  1. Using the current sensu-plugins-cassandra, issue a metrics-cassandra-graphite.rb check and set the output formatter to "none"
  2. The resulting values will likely contain one or more NaN values, verify that this is the case via the dashboard.
  3. Set the output formatter to "graphite_plaintext"
  4. Verify that the check stops updating and no further data is received by the sensu-backend. You can do this by either checking for the last result of the check within the dashboard, or issue a sensuctl event info <hostname> <checkname> from the commandline.
  5. Modify the check to replace NaN with "0". I have an experimental release available for the plugin I am testing located here
  6. Verify that the plugin works as expected. sensuctl event info <hostname> <checkname> should show current data after removing the non-numerical values from the check.

Context

In its current state, the sensu-agent will fail silently and no notification will be sent that the check is failing.

Your Environment

jspaleta commented 5 years ago

Okay so this is probably going to impact other line formats. NaN Inf -Inf Infinity and -Infinity have special meaning in a lot of systems so we should probably be handling them in a consistent manner.

Strawman Similar to the graphite internal fix:

  1. silently ingest case insensitive nan as null value (not zero)
  2. silently ingest case insensitve inf and infinity as 1e9999
  3. silently ingest case insensitive -inf and -infinity as -1e9999
nikkictl commented 5 years ago

Hey @zenmetsu I'm so sorry you're experiencing this issue. To help assist you better, would you mind providing the output of the plugin that is being consumed by the check?

jspaleta commented 5 years ago

from the slack conversion:

stats.w51-90.cassandra.system_auth.read_latency NaN 1565988924
stats.w51-90.cassandra.system_auth.write_count 0 1565988924
stats.w51-90.cassandra.system_auth.write_latency NaN 1565988924
stats.w51-90.cassandra.system_auth.pending_flushes 0 1565988924

Ref: https://sensucommunity.slack.com/archives/C68LV5M9U/p1565988965279200?thread_ts=1565888902.269100&cid=C68LV5M9U

zenmetsu commented 5 years ago

stats.hostname.cassandra.system_auth.read_count 0 1565993185 stats.hostname.cassandra.system_auth.read_latency NaN 1565993185 stats.hostname.cassandra.system_auth.write_count 0 1565993185 stats.hostname.cassandra.system_auth.write_latency NaN 1565993185

This is just a snippet. The check could potentially return hundreds of lines of data.

nikkictl commented 5 years ago

Thanks so much for filing this @zenmetsu, it's actually stricken up a very interesting discussion internally with how a case such as this should be handled.

To explain the behavior you are seeing, Failed to obtain reader, Failed to marshal fields to JSON, json: unsupported value: NaN should be logged on the backend. The agent does not report an error because it's actually able to successfully parse NaN, +Inf, and -Inf as valid values. During serialization/deserialization (which is done with protobuf) those values fail to be rendered as JSON, and the event fails to update since there is no valid JSON to store. This error certainly manifested itself in an interesting way!

We discussed the following solutions:

The solution we decided to move forward with maintains data integrity and is backwards compatible:

zenmetsu commented 5 years ago

Any update on the fix? I'm finding that other plugins are exposing this bug.

Thanks!

tovern commented 4 years ago

As a temporary workaround for this you can use sed in your check:

metrics-cassandra-graphite.rb -s cassandra -c | sed 's/NaN/0/'

jspaleta commented 3 years ago

So update on for sensu go 6.2

graphite_plaintext and influxdb_line still appear to be impacted. but prometheus_text appears to handle the issue by just ignoring lines with NaN values entirely and not including them.

So its a bit of an inconsistent behavior now.

Testing sensu-backend metrics

For example on my system the sensu-backend metrics endpoint https://localhost:8080/metrics tends to have NaN values associated with the prometheus histogram metric named sensu_go_websocket_upgrade_duration . Testing sensu-agent ingesting the sensu-backend metrics confirms promethus_text is not impacted but other output formats are.

test: direct injestion with prometheus_text

Using wget with prometheus_text works and the offending lines in the raw metrics/ endpoint output are just silently ignored and not converted into points... and the event is processed as expected by the backend.

test: sensu-prometheus-collector with influx output

The collector converts metrics lines with NaN values into influxdb line protocol faithfully. Agent fails to ingest due to the NaN values in the influxdb_line

test: sensu-prometheus-collector with graphite output

The collector converts metrics lines with NaN values into graphite plaintext faithfully. Agent fails to ingest due to the NaN values in the graphite_plaintext