influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

[inputs.cisco_telemetry_mdt] Error in plugin: failed to decode: string field contains invalid UTF-8 #13928

Closed emalzer closed 1 year ago

emalzer commented 1 year ago

Relevant telegraf.conf

[[inputs.cisco_telemetry_mdt]]
  ## Telemetry transport can be "tcp" or "grpc".  TLS is only supported when
  ## using the grpc transport.
  transport = "tcp"

  ## Address and port to host telemetry listener
  service_address = "0.0.0.0:2015"

Logs from Telegraf

2023-09-15T10:06:42Z E! [inputs.cisco_telemetry_mdt] Error in plugin: failed to decode: string field contains invalid UTF-8
2023-09-15T10:07:12Z E! [inputs.cisco_telemetry_mdt] Error in plugin: failed to decode: string field contains invalid UTF-8
2023-09-15T10:07:42Z E! [inputs.cisco_telemetry_mdt] Error in plugin: failed to decode: string field contains invalid UTF-8
2023-09-15T10:08:12Z E! [inputs.cisco_telemetry_mdt] Error in plugin: failed to decode: string field contains invalid UTF-8
2023-09-15T10:08:42Z E! [inputs.cisco_telemetry_mdt] Error in plugin: failed to decode: string field contains invalid UTF-8
2023-09-15T10:09:12Z E! [inputs.cisco_telemetry_mdt] Error in plugin: failed to decode: string field contains invalid UTF-8
2023-09-15T10:09:45Z E! [inputs.cisco_telemetry_mdt] Error in plugin: failed to decode: string field contains invalid UTF-8
2023-09-15T10:10:12Z E! [inputs.cisco_telemetry_mdt] Error in plugin: failed to decode: string field contains invalid UTF-8

System info

Telegraf 1.28.1-1, Ubuntu 20.04

Docker

No response

Steps to reproduce

  1. Use the provided Telegraf config.
  2. Configure the currently used sensor paths on the Cisco device:
    telemetry model-driven
     destination-group PROD
      address-family ipv4 x.x.x.x port 2015
       encoding self-describing-gpb
       protocol tcp
      !
     !
     sensor-group SGRP_INTERFACES-30S
      sensor-path openconfig-interfaces:interfaces/interface/state
      sensor-path openconfig-interfaces:interfaces/interface/subinterfaces/subinterface/state
      sensor-path Cisco-IOS-XR-infra-statsd-oper:infra-statistics/interfaces/interface/latest/protocols/protocol
     !
     subscription SUBS_INTERFACES
      sensor-group-id SGRP_INTERFACES-30S sample-interval 30000
      destination-id PROD
      source-interface Loopback0
     !

Expected behavior

Decode all fields correctly, or skip only the single affected metric.

Actual behavior

Metrics are decoded correctly until an invalid UTF-8 character is hit. All following metrics within that batch are lost. As a result, we are missing metrics from certain interfaces and do not see some interfaces at all.

Additional info

I can provide tcpdumps with the telemetry traffic from the Cisco device to telegraf.

powersj commented 1 year ago

Hi,

When the plugin receives a message via gRPC, it passes that message to the upstream protobuf library to unmarshal the received data. That upstream library is where the error about invalid UTF-8 data is generated. At that point Telegraf gets the "failed to decode" error and gives up on creating a metric, as there may not be anything received to create metrics from.

If additional messages were received without errors then those messages would continue to get parsed.
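The behavior described above can be sketched roughly as follows. This is only an illustration: `decode` is a hypothetical stand-in for the protobuf unmarshal step (it merely checks UTF-8 validity with the standard library), and `processAll` is not Telegraf's actual code. It shows that one failing message is dropped as a whole, while later messages are still processed.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errInvalidUTF8 mirrors the wording of the error seen in the logs.
var errInvalidUTF8 = errors.New("string field contains invalid UTF-8")

// decode is a hypothetical stand-in for the protobuf unmarshal step.
// It rejects the whole payload if any byte sequence is not valid UTF-8.
func decode(payload []byte) (string, error) {
	if !utf8.Valid(payload) {
		return "", errInvalidUTF8
	}
	return string(payload), nil
}

// processAll decodes each message independently. A failure drops that
// entire message but does not affect the messages that follow it.
func processAll(messages [][]byte) (decoded []string, failed int) {
	for _, m := range messages {
		s, err := decode(m)
		if err != nil {
			fmt.Printf("failed to decode: %v\n", err)
			failed++
			continue
		}
		decoded = append(decoded, s)
	}
	return decoded, failed
}

func main() {
	messages := [][]byte{
		[]byte("interface A"),
		{'i', 'f', ' ', 0xFC}, // made-up payload with a stray 0xFC byte
		[]byte("interface C"),
	}
	decoded, failed := processAll(messages)
	fmt.Println(len(decoded), failed)
}
```

Because the failure is reported per message, everything carried in the affected message is lost together, which matches the missing interfaces reported in this issue.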

It is not clear to me whether there is anything better for us to do here. We can have a look, but I think this is working as expected.

emalzer commented 1 year ago

Hi,

we do not use gRPC, we use plain TCP as transport and self-describing-gpb as encoding.

I would need to identify exactly where this UTF-8 decoding problem is located - meaning is it on the receiving side with telegraf due to a bug or is it already on the sending side from the Cisco device.

powersj commented 1 year ago

> we do not use gRPC, we use plain TCP as transport and self-describing-gpb as encoding.

The parsing of the metrics is the same for both transports; see: https://github.com/influxdata/telegraf/blob/master/plugins/inputs/cisco_telemetry_mdt/cisco_telemetry_mdt.go#L361-L366

> meaning is it on the receiving side with telegraf due to a bug or is it already on the sending side from the Cisco device.

As I mentioned above, this error happens while parsing a message we received, meaning the message was produced by your device.

emalzer commented 1 year ago

So, is there a way to get more debug logs to identify which invalid UTF-8 char it's complaining about? I still need to pinpoint the cause.

powersj commented 1 year ago

You could do a packet capture at the time of the error and see if you can look at the packet data. The other option is to possibly build a custom telegraf and log out the messages you are getting via MarshalTextString.
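As a rough sketch of what such logging in a custom build could look like, the snippet below checks already extracted string field values and reports the ones that are not valid UTF-8. The `field` and `finding` types and the sample values are hypothetical and not part of Telegraf or the protobuf library. Note that this only makes sense on individual field values: the raw protobuf payload contains binary framing bytes that would be flagged as well.

```go
package main

import (
	"fmt"
	"strconv"
	"unicode/utf8"
)

// field is a hypothetical name/value pair taken from a received message.
type field struct {
	Name  string
	Value []byte
}

// finding describes one field whose value is not valid UTF-8.
type finding struct {
	Name   string
	Offset int    // byte offset of the first invalid byte
	Quoted string // value with invalid bytes escaped as \xNN
}

// firstInvalid returns the offset of the first byte that does not start
// a valid UTF-8 sequence, or -1 if b is entirely valid.
func firstInvalid(b []byte) int {
	for i := 0; i < len(b); {
		r, size := utf8.DecodeRune(b[i:])
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

// findInvalid reports every field whose value fails UTF-8 validation.
func findInvalid(fields []field) []finding {
	var out []finding
	for _, f := range fields {
		if off := firstInvalid(f.Value); off >= 0 {
			out = append(out, finding{f.Name, off, strconv.Quote(string(f.Value))})
		}
	}
	return out
}

func main() {
	fields := []field{
		{"name", []byte("GigabitEthernet0/0/0/1")},
		{"description", []byte{'K', 0xFC, 'c', 'h', 'e'}}, // made-up value
	}
	for _, f := range findInvalid(fields) {
		fmt.Printf("%s offset=%d %s\n", f.Name, f.Offset, f.Quoted)
	}
}
```

`strconv.Quote` escapes invalid bytes as `\xNN`, so the offending byte stays visible in a log line instead of being replaced or garbling the terminal.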

emalzer commented 1 year ago

Hm, the packet capture is too huge, as even this single sensor path exports data for a lot of interfaces... It is quite hard for me to find this needle in the haystack.

I will try to take a look into the custom plugin then.

emalzer commented 1 year ago

Hi!

I was successful in identifying the issue. Thanks for your quick responses. The issue was a German special character that got incorrectly encoded for whatever reason... :)
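To illustrate how a German special character can end up as invalid UTF-8, here is a small sketch. It assumes the character was sent as a single ISO-8859-1 (Latin-1) byte, which this thread does not confirm. The sample value and the `latin1ToUTF8` helper are made up for the example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// latin1ToUTF8 treats every byte as the Unicode code point with the same
// value (which is how ISO-8859-1 is defined) and re-encodes it as UTF-8.
func latin1ToUTF8(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	// "Küche" with "ü" as the single Latin-1 byte 0xFC (assumed encoding).
	raw := []byte{'K', 0xFC, 'c', 'h', 'e'}
	fmt.Println(utf8.Valid(raw)) // false: 0xFC cannot start a UTF-8 sequence

	fixed := latin1ToUTF8(raw)
	fmt.Println(utf8.ValidString(fixed), fixed) // true Küche
}
```

In UTF-8 the same character takes two bytes (0xC3 0xBC), so a lone 0xFC byte in a string field fails validation.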

Just as info in case others encounter such an issue:

powersj commented 1 year ago

> which got wrong encoded due to whatever... :)

ugh that is frustrating for both you as a user and me

I am going to keep this open and see if we can improve that message as you have done here.

powersj commented 1 year ago

I put up #13963, which uses the msg.String() method. It should do the same thing as what you did. The github.com/golang/protobuf library was superseded by the library we currently use, so this keeps us on the current one.