openconfig / gnmic

gNMIc is a gNMI CLI client and collector
https://gnmic.openconfig.net
Apache License 2.0

Silent drop of optical power Events with "-inf" value from Juniper MPC7 line cards. #240

Closed flyboyjon closed 1 year ago

flyboyjon commented 1 year ago

Version 0.32.0

When subscribing to a Juniper MX router using the path "/junos/system/linecard/optics/" to obtain transceiver optical information, the interfaces on this MPC7E line card report the metric "lane_laser_receiver_power_dbm" with the value "-inf". This value is unexpected: normally the metric contains a float, either 0 or possibly -40 when no light is received. That is the behaviour observed on other line card models in the same chassis.

The event is silently dropped (no error) when using the file output with --format=event. When using --format prototext, the metric is visible in the file.

The only way we discovered this is that the Prometheus output reports this error in the log file:

2023/10/05 22:06:36.648677 [prometheus_output:prometheus] failed to convert message to event: invalid character 'i' in numeric literal

There does not seem to be a similar validation error for the file output, which is why we didn't notice it.

I don't seem to be able to apply a Processor to normalise this value, so the whole metric is missing from output when we use --format=event in order to perform event filtering and processing.


So, a couple of observations:

  1. Metric was silently dropped with no error in the log file when using the file output
  2. Metric was dropped with an error in the logfile when using Prometheus or Influx outputs
  3. It was impossible to identify the offending metric in the incoming data without some extra print statements in prometheus_output.go:304 to dump the raw message and identify the value causing the issue.

Not sure what the fix could be here, since the device is not reporting a numerical value on this metric as it does on other line card/interfaces, but on the other hand, Juniper is a common enough platform that I suspect others will run into this issue too. At least the error needs to be logged somewhere for file outputs that use --format=event so that we know the metric was dropped.

karimra commented 1 year ago

The issue here is that -inf is not a valid JSON value; it has to be quoted as "-inf". The standard Go JSON library cannot unmarshal it.
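A minimal sketch of that behaviour, using only the standard `encoding/json` package. The helper `decodeValue` is a hypothetical name used for illustration and is not part of gNMIc; it only shows that a bare `-inf` is rejected by the decoder while a finite number or a quoted string is accepted.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeValue is a hypothetical helper (not gNMIc code): it unmarshals a raw
// JSON value into an empty interface, as a generic JSON consumer might.
func decodeValue(raw string) (interface{}, error) {
	var v interface{}
	err := json.Unmarshal([]byte(raw), &v)
	return v, err
}

func main() {
	// A finite number, a quoted "-inf" string, and the bare -inf token.
	for _, raw := range []string{`-40.0`, `"-inf"`, `-inf`} {
		v, err := decodeValue(raw)
		if err != nil {
			// The bare token fails during syntax checking.
			fmt.Printf("%-8s error: %v\n", raw, err)
			continue
		}
		fmt.Printf("%-8s ok: %v (%T)\n", raw, v, v)
	}
}
```

The error printed for the bare token should resemble the one in the Prometheus output log above, which suggests the value is rejected while the JSON payload is being parsed.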

flyboyjon commented 1 year ago

Agreed, the device is sending bad data.

As far as I can see though, I cannot capture this and correct it in a Processor, because gNMIc is dropping the Event before it gets to the Processors? Please confirm my understanding that Processors work only on Event formats. Thanks.

karimra commented 1 year ago

That's correct. Can the device send the same data with a different encoding?

flyboyjon commented 1 year ago

Nice, yes. I changed the device subscription to use 'proto' encoding and now I can apply a Starlark processor to fix up the value, thanks! Happy to close this. You might consider adding the same validation/error message to the 'file' output as in the Prometheus, SNMP and Influx outputs.