Open zenmetsu opened 5 years ago
Okay so this is probably going to impact other line formats. NaN Inf -Inf Infinity and -Infinity have special meaning in a lot of systems so we should probably be handling them in a consistent manner.
Strawman Similar to the graphite internal fix:
Hey @zenmetsu I'm so sorry you're experiencing this issue. To help assist you better, would you mind providing the output of the plugin that is being consumed by the check?
from the slack conversion:
stats.w51-90.cassandra.system_auth.read_latency NaN 1565988924
stats.w51-90.cassandra.system_auth.write_count 0 1565988924
stats.w51-90.cassandra.system_auth.write_latency NaN 1565988924
stats.w51-90.cassandra.system_auth.pending_flushes 0 1565988924
stats.hostname.cassandra.system_auth.read_count 0 1565993185 stats.hostname.cassandra.system_auth.read_latency NaN 1565993185 stats.hostname.cassandra.system_auth.write_count 0 1565993185 stats.hostname.cassandra.system_auth.write_latency NaN 1565993185
This is just a snippet. The check could potentially return hundreds of lines of data.
Thanks so much for filing this @zenmetsu, it's actually stricken up a very interesting discussion internally with how a case such as this should be handled.
To explain the behavior you are seeing, Failed to obtain reader, Failed to marshal fields to JSON, json: unsupported value: NaN
should be logged on the backend. The agent does not report an error because it's actually able to successfully parse NaN
, +Inf
, and -Inf
as valid values. During serialization/deserialization (which is done with protobuf) those values fail to be rendered as JSON, and the event fails to update since there is no valid JSON to store. This error certainly manifested itself in an interesting way!
We discussed the following solutions:
NaN
/Inf
and drop them (users lose data).NaN
/Inf
and replace with 0 (data is mutated).float64
to json.Number (backwards incompatible).The solution we decided to move forward with maintains data integrity and is backwards compatible:
string
field named Error
to the MetricPoint struct.NaN
/Inf
and replace with 0, storing a specific error in the Error
field.NaN
/Inf
or error out. This can configured with command line flags and handled on a case by case basis for each destination.Any update on the fix? I'm finding that other plugins are exposing this bug.
Thanks!
As a temporary workaround for this you can use sed in your check:
metrics-cassandra-graphite.rb -s cassandra -c | sed 's/NaN/0/'
So update on for sensu go 6.2
graphite_plaintext
and influxdb_line
still appear to be impacted. but prometheus_text
appears to handle the issue by just ignoring lines with NaN values entirely and not including them.
So its a bit of an inconsistent behavior now.
For example on my system the sensu-backend metrics endpoint https://localhost:8080/metrics
tends to have NaN values associated with the prometheus histogram metric named sensu_go_websocket_upgrade_duration
.
Testing sensu-agent ingesting the sensu-backend metrics confirms promethus_text
is not impacted but other output formats are.
Using wget
with prometheus_text
works and the offending lines in the raw metrics/ endpoint output are just silently ignored and not converted into points... and the event is processed as expected by the backend.
The collector converts metrics lines with NaN values into influxdb line protocol faithfully. Agent fails to ingest due to the NaN values in the influxdb_line
The collector converts metrics lines with NaN values into graphite plaintext faithfully. Agent fails to ingest due to the NaN values in the graphite_plaintext
If a check produces metrics containing
NaN
, and the graphite_plaintext formatter is selected, the check will fail silently and appear to have not executed when examining the dashboard.Expected Behavior
The agent should pass the
NaN
value as-is. It appears that this is a default setting within graphite/carbon. See : linkCurrent Behavior
Currently, if any metric value from the check contains
NaN
, the check fails to return data back to sensu-backendPossible Solution
Steps to Reproduce (for bugs)
sensuctl event info <hostname> <checkname>
from the commandline.NaN
with "0". I have an experimental release available for the plugin I am testing located heresensuctl event info <hostname> <checkname>
should show current data after removing the non-numerical values from the check.Context
In its current state, the sensu-agent will fail silently and no notification will be sent that the check is failing.
Your Environment