influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

GNMI plugin crashes in telegraf 1.30.2 #15431

Closed akarneliuk closed 5 months ago

akarneliuk commented 5 months ago

Relevant telegraf.conf

# gNMI telemetry input plugin
[[inputs.gnmi]]
  addresses = ["nexus-lab:50051"]
  username = "***"
  password = "***"
  encoding = "proto"

  [[inputs.gnmi.subscription]]
    name = "oc-interfaces"
    origin = ""
    path = "/interfaces/interface/state"
    subscription_mode = "sample"
    sample_interval = "10s"

Logs from Telegraf

2024-05-30T12:39:44Z I! Loading config: /etc/telegraf/telegraf.conf
2024-05-30T12:39:44Z I! Starting Telegraf 1.30.2 brought to you by InfluxData the makers of InfluxDB
2024-05-30T12:39:44Z I! Available plugins: 233 inputs, 9 aggregators, 31 processors, 24 parsers, 60 outputs, 6 secret-stores 
2024-05-30T12:39:44Z I! Loaded inputs: gnmi (2x)
2024-05-30T12:39:44Z I! Loaded aggregators:
2024-05-30T12:39:44Z I! Loaded processors:
2024-05-30T12:39:44Z I! Loaded secretstores: 
2024-05-30T12:39:44Z I! Loaded outputs: kafka
2024-05-30T12:39:44Z I! Tags enabled: host=telegraf-nasa-10s-69c59f6976-z8n1r 
2024-05-30T12:39:44Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"telegraf", Flush Interval:10s
2024-05-30T12:40:05Z W! [agent] ["outputs.kafka"] did not complete within its flush interval
2024-05-30T12:40:15Z W! [agent] ["outputs.kafka"] did not complete within its flush interval
2024-05-30T12:40:25Z W! [agent] ["outputs.kafka"] did not complete within its flush interval
2024-05-30T12:40:36Z W! [agent] ["outputs.kafka"] did not complete within its flush interval
2024-05-30T12:40:47Z W! [agent] ["outputs.kafka"] did not complete within its flush interval
2024-05-30T12:43:35Z W! [agent] [inputs.gnmi] Got empty metric-name for response, usually indicating
configuration issues as the response data cannot be related to any subscription.
Please open an issue on https://github.com/influxdata/telegraf including your
device model and the following response data:
update:{path:{elem:{}}}
This message is only printed once.

panic: runtime error: index out of range [-1]

goroutine 158 [running]:
github.com/influxdata/telegraf/plugins/inputs/gnmi.(*handler).handleSubscribeResponseUpdate(...)
  /go/src/github.com/influxdata/telegraf/plugins/inputs/gnmi/handler.go:256 +0x12f0
github.com/influxdata/telegraf/plugins/inputs/gnmi.(*handler).subscribeGNMI(...)
  /go/src/github.com/influxdata/telegraf/plugins/inputs/gnmi/handler.go:111 +0xa05
github.com/influxdata/telegraf/plugins/inputs/gnmi.(*GNMI).Start.func1({0x001a73ef1, 0x2b})
  /go/src/github.com/influxdata/telegraf/plugins/inputs/gnmi/gnmi.go:238 +0x591
created by github.com/influxdata/telegraf/plugins/inputs/gnmi.(*GNMI).Start in goroutine 1
  /go/src/github.com/influxdata/telegraf/plugins/inputs/gnmi/gnmi.go:221 +0x22e
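
For readers unfamiliar with this panic: `index out of range [-1]` is the message Go produces when a slice is indexed with `len(s)-1` while the slice is empty. The exact code at `handler.go:256` is not shown in this report, so the following is only a minimal sketch of the general failure mode and a guard against it, not the plugin's actual code or the actual fix. `pathElem` and `lastElemName` are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// pathElem is a hypothetical stand-in for a gNMI path element.
// It is NOT the real type used by the Telegraf gnmi plugin.
type pathElem struct {
	Name string
}

// lastElemName returns the name of the final path element and reports
// whether one exists. Checking the length first avoids evaluating
// elems[len(elems)-1] on an empty slice, which would panic with
// "index out of range [-1]".
func lastElemName(elems []pathElem) (string, bool) {
	if len(elems) == 0 {
		return "", false
	}
	return elems[len(elems)-1].Name, true
}

func main() {
	// A path that carries no elements at all.
	var empty []pathElem
	if _, ok := lastElemName(empty); !ok {
		fmt.Println("skipping update: empty path")
	}

	// A path resembling the subscription in the config above.
	full := []pathElem{{Name: "interfaces"}, {Name: "interface"}, {Name: "state"}}
	if name, ok := lastElemName(full); ok {
		fmt.Println("last element:", name)
	}
}
```

Returning a boolean alongside the value lets the caller decide whether to skip or log the update instead of crashing the whole process.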

System info

Telegraf 1.30.2, official container

Docker

No response

Steps to reproduce

  1. Collect streaming telemetry with the gNMI plugin from Cisco NX-OS 10.2(6).
  2. Use proto encoding.
  3. After 3-4 minutes of running, the Telegraf container crashes and restarts. It then runs for another 3-4 minutes and crashes and restarts again.
  4. JSON encoding doesn't have this issue, but it has many other bugs related to how Cisco implements the encoding; Cisco therefore recommends using proto.

Expected behavior

Telegraf doesn't crash and keeps processing messages. If there are messages that aren't related to any subscription, which seems to happen from time to time, they should simply be dropped at flush time (I'm using namedrop=[""] on the output plugin).
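
The behavior expected here, dropping metrics that have no name rather than crashing, can be sketched roughly as follows. This is only an illustration of the idea under simplified assumptions; it is not how Telegraf implements `namedrop` or any other filter. The `metric` type, the `dropUnnamed` helper, and the field names are all hypothetical.

```go
package main

import "fmt"

// metric is a hypothetical, minimal stand-in for a collected metric.
// It is NOT Telegraf's metric type.
type metric struct {
	Name   string
	Fields map[string]any
}

// dropUnnamed returns only the metrics that have a non-empty name.
// Metrics with an empty name cannot be related to any subscription,
// so they are skipped instead of being passed on to the output.
func dropUnnamed(in []metric) []metric {
	out := make([]metric, 0, len(in))
	for _, m := range in {
		if m.Name == "" {
			continue
		}
		out = append(out, m)
	}
	return out
}

func main() {
	// Field names and values below are made up for the example.
	batch := []metric{
		{Name: "oc-interfaces", Fields: map[string]any{"in_octets": 42}},
		{Name: "", Fields: map[string]any{"unknown": 1}},
	}
	for _, m := range dropUnnamed(batch) {
		fmt.Println(m.Name)
	}
}
```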

Actual behavior

Telegraf crashes every 3-4 minutes (perhaps on receiving certain gNMI messages).

Additional info

The issue is observed with Cisco NX-OS 10.2(6) when using proto encoding.

powersj commented 5 months ago

This looks like the same issue as https://github.com/influxdata/telegraf/issues/15257, which I fixed in https://github.com/influxdata/telegraf/pull/15259; the fix was released in v1.30.3.

Telegraf 1.30.2

Please update to 1.30.3 and let us know if you still get the crash.

akarneliuk commented 5 months ago

it looks like it works, thanks for quick turnaround @powersj

powersj commented 5 months ago

Thanks for confirming.