influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.55k stars 5.56k forks source link

SNMP panic #15200

Closed llamafilm closed 1 month ago

llamafilm commented 5 months ago

Relevant telegraf.conf

[agent]
  snmp_translator = "gosmi"

Logs from Telegraf

Starting Telegraf...
telegraf.service: Can't open PID file /run/telegraf.pid (yet?) after start: Operation not permitted
2024-04-21T14:50:06Z I! Loading config: /apps/aragorn/telegraf.d/inputs_snmp_trap.conf
2024-04-21T14:50:06Z W! DeprecationWarning: Option "timeout" of plugin "inputs.snmp_trap" deprecated since version 1.20.0 and will be removed in 2.0.0: unused option
2024-04-21T14:50:06Z I! Loading config: /apps/aragorn/telegraf.d/netbox-cloud1.conf
2024-04-21T14:50:06Z W! DeprecationWarning: Option "fieldpass" of plugin "inputs.prometheus" deprecated since version 1.29.0 and will be removed in 2.0.0: use 'fieldinclude' instead
2024-04-21T14:50:06Z W! DeprecationWarning: Option "fieldpass" of plugin "inputs.prometheus" deprecated since version 1.29.0 and will be removed in 2.0.0: use 'fieldinclude' instead
2024-04-21T14:50:06Z I! Loading config: /apps/aragorn/telegraf.d/outputs_grafanacloud.conf
2024-04-21T14:50:06Z I! Loading config: /apps/aragorn/telegraf.d/outputs_loki.conf
2024-04-21T14:50:06Z I! Loading config: /apps/aragorn/telegraf.d/processors.conf
2024-04-21T14:50:06Z I! Loading config: /apps/aragorn/telegraf.d/telegraf.conf
2024-04-21T14:50:06Z I! Starting Telegraf 1.29.5-66b924ec (customized) brought to you by InfluxData the makers of InfluxDB
2024-04-21T14:50:06Z I! Available plugins: 17 inputs, 0 aggregators, 7 processors, 5 parsers, 3 outputs, 0 secret-stores
2024-04-21T14:50:06Z I! Loaded inputs: cpu disk diskio exec (2x) internal mem net net_response netstat ping (66x) prometheus (8x) snmp (19x) snmp_trap swap system systemd_units
2024-04-21T14:50:06Z I! Loaded aggregators:
2024-04-21T14:50:06Z I! Loaded processors: converter (4x) enum (3x) regex (3x) rename (2x) scale (2x) starlark strings
2024-04-21T14:50:06Z I! Loaded secretstores:
2024-04-21T14:50:06Z I! Loaded outputs: http loki
2024-04-21T14:50:06Z I! Tags enabled: host=cloud1
2024-04-21T14:50:06Z W! Deprecated inputs: 0 and 1 options
2024-04-21T14:50:06Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"cloud1", Flush Interval:1m30s
2024-04-21T14:50:06Z D! [agent] Initializing plugins
2024-04-21T14:50:06Z I! Config watcher started
Started Telegraf.
2024-04-21T14:50:06Z I! Config watcher started
2024-04-21T14:50:06Z I! Config watcher started
2024-04-21T14:50:06Z I! Config watcher started
2024-04-21T14:50:06Z I! Config watcher started
2024-04-21T14:50:06Z I! Config watcher started
2024-04-21T14:50:07Z I! [inputs.prometheus::9] Using the label selector:  and field selector:
2024-04-21T14:50:07Z I! [inputs.prometheus::36] Using the label selector:  and field selector:
2024-04-21T14:50:07Z I! [inputs.prometheus::47] Using the label selector:  and field selector:
2024-04-21T14:50:07Z I! [inputs.prometheus::62] Using the label selector:  and field selector:
2024-04-21T14:50:07Z I! [inputs.prometheus::78] Using the label selector:  and field selector:
2024-04-21T14:50:07Z I! [inputs.prometheus::79] Using the label selector:  and field selector:
2024-04-21T14:50:07Z I! [inputs.prometheus::5844] Using the label selector:  and field selector:
2024-04-21T14:50:07Z I! [inputs.prometheus::5915] Using the label selector:  and field selector:
2024-04-21T14:50:07Z W! DeprecationWarning: Value "false" for option "ignore_protocol_stats" of plugin "inputs.net" deprecated since version 1.27.3 and will be removed in 1.36.0: use the 'inputs.nstat' plugin instead
2024-04-21T14:50:07Z I! [processors.regex] tags: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] tags: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z D! [agent] Connecting outputs
2024-04-21T14:50:07Z D! [agent] Attempting connection to [outputs.http::grafana-cloud]
2024-04-21T14:50:07Z D! [agent] Successfully connected to outputs.http::grafana-cloud
2024-04-21T14:50:07Z D! [agent] Attempting connection to [outputs.loki]
2024-04-21T14:50:07Z D! [agent] Successfully connected to outputs.loki
2024-04-21T14:50:07Z D! [agent] Starting service inputs
2024-04-21T14:50:07Z I! [inputs.snmp_trap] Listening on udp://:162
2024-04-21T14:50:07Z E! [inputs.snmp::70] Error in plugin: agent 10.92.177.47: performing get on field Cat1600SerialNumber: error reading from socket: read udp 100.65.16.139:11296->10.92.177.47:161: recvfrom: connection refused
2024-04-21T14:50:07Z E! [inputs.snmp::71] Error in plugin: agent 10.92.177.48: performing get on field Cat1600SerialNumber: error reading from socket: read udp 100.65.16.139:36407->10.92.177.48:161: recvfrom: connection refused
2024-04-21T14:50:07Z D! [outputs.http::grafana-cloud] Wrote batch of 1000 metrics in 237.499482ms
2024-04-21T14:50:07Z D! [outputs.http::grafana-cloud] Buffer fullness: 304 / 1000000 metrics
2024-04-21T14:50:07Z D! [processors.starlark] Removed string field: serverSoftwareVersion from snmp_adder_aim
panic: interface conversion: interface {} is nil, not []uint8
goroutine 642 [running]:
github.com/sleepinggenius2/gosmi/models.GetEnumBitsFormatted({0x0?, 0x0?}, 0x90?, 0xc00221da20?)
        /home/builds/go/pkg/mod/github.com/sleepinggenius2/gosmi@v0.4.4/models/format_bits.go:31 +0x598
github.com/sleepinggenius2/gosmi/models.Type.FormatValue({0xb, 0x1, {0x0, 0x0}, 0xc002112340, {0x0, 0x0}, {0x22a2cb2, 0x4}, {0x3939620, ...}, ...}, ...)
        /home/builds/go/pkg/mod/github.com/sleepinggenius2/gosmi@v0.4.4/models/format.go:163 +0x23c
github.com/sleepinggenius2/gosmi/models.Node.FormatValue(...)
        /home/builds/go/pkg/mod/github.com/sleepinggenius2/gosmi@v0.4.4/models/format.go:127
github.com/influxdata/telegraf/plugins/inputs/snmp.(*gosmiTranslator).SnmpFormatEnum(0xc0004c6037?, {0xc002143230?, 0x265ed10?}, {0x0, 0x0}, 0x0)
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/gosmi.go:139 +0x338
github.com/influxdata/telegraf/plugins/inputs/snmp.fieldConvert({0x267f628, 0x3939620}, {0xc0004c6037, 0x4}, {{0x0, 0x0}, {0xc002143230, 0x23}, 0x5})
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:771 +0xb27
github.com/influxdata/telegraf/plugins/inputs/snmp.Table.Build({{0x22a29b6, 0x4}, {0x0, 0x0, 0x0}, 0x0, {0xc0001fb208, 0x10, 0x10}, {0x0, ...}, ...}, ...)
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:439 +0x685
github.com/influxdata/telegraf/plugins/inputs/snmp.(*Snmp).gatherTable(0xc00047db00, {0x2699be0, 0xc0003b1e80}, {0x268e2d0, 0xc002396f20}, {{0x22a29b6, 0x4}, {0x0, 0x0, 0x0}, ...}, ...)
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:355 +0xa7
github.com/influxdata/telegraf/plugins/inputs/snmp.(*Snmp).Gather.func1(0xc00257cfd0?, {0xc0004b0de1, 0xc})
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:337 +0x20b
created by github.com/influxdata/telegraf/plugins/inputs/snmp.(*Snmp).Gather in goroutine 641
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:323 +0x66
telegraf.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
telegraf.service: Failed with result 'exit-code'.
telegraf.service: Consumed 1.379s CPU time.
telegraf.service: Scheduled restart job, restart counter is at 2.
Stopped Telegraf.
telegraf.service: Consumed 1.379s CPU time.

System info

Telegraf 1.29.5-66b924ec, Ubuntu 22.04.4

Docker

No response

Steps to reproduce

Unknown

Expected behavior

no crash

Actual behavior

Telegraf has been running for several days under systemd, and this weekend it crashed. Systemd tried to restart it several times, and it kept crashing repeatedly. This log snippet from journald shows a full cycle, beginning after the first crash, until it crashes again. My telegraf config is several thousand lines long, so I'm not sure which part is relevant here. I have dozens of different SNMP devices with different input configs and processors.

There was a power outage Saturday morning, about 24 hours before this crash occurred, so it's likely some of the SNMP devices were in a bad state, but I can't reproduce it. This morning after restarting the service it's working fine.

Additional info

I built this telegraf binary using the custom builder to reduce the input and output plugins. But I did not customize anything else. So it's weird that the log references lines that don't exist like snmp.go:323.

srebhan commented 5 months ago

@llamafilm could you please try to reproduce this with latest master and maybe only the SNMP (and a file output) plugin?!? We shifted code for SNMP quite a bit between v1.29 and v1.30...

Hipska commented 5 months ago

You need to look at the correct version of the source code, this is snmp.go:323: https://github.com/influxdata/telegraf/blob/v1.29.5/plugins/inputs/snmp/snmp.go#L323

Going though the stack trace, the panic actually happens here: https://github.com/sleepinggenius2/gosmi/blob/v0.4.4/models/format_bits.go#L31 Which should be replaced by

    octets := v.Bytes()
powersj commented 5 months ago

Thanks for calling that out. I guess the question then is if Telegraf should make a change as well? If the value is nil, should Telegraf even be calling the format value function?

Hipska commented 5 months ago

I would try to get that fixed upstream and see what the maintainers say.

powersj commented 5 months ago

I have put up issue https://github.com/sleepinggenius2/gosmi/issues/44 and a PR https://github.com/sleepinggenius2/gosmi/pull/45

Happy to have reviews or comments to those. I did not realize this library had not had a lot of updates in a while, so let's see if we get a response.

Hipska commented 5 months ago

It appears like the maintainer didn't do much anymore lately. Let's see indeed.

llamafilm commented 1 month ago

It seems like the upstream library has been abandoned. What should be done about this?

This same crash happened again today on version 1.30.3. Do you have any ideas how I could determine which SNMP device is the cause? It happens very intermittently, and I have hundreds of SNMP devices in the config, so I can't easily test them one by one.

llamafilm commented 1 month ago

Here's a more concise log output

Started Telegraf.
panic: interface conversion: interface {} is nil, not []uint8
goroutine 641 [running]:
github.com/sleepinggenius2/gosmi/models.GetEnumBitsFormatted({0x0?, 0x0?}, 0x2?, 0xc000534680?)
        /home/builds/go/pkg/mod/github.com/sleepinggenius2/gosmi@v0.4.4/models/format_bits.go:31 +0x598
github.com/sleepinggenius2/gosmi/models.Type.FormatValue({0xb, 0x1, {0x0, 0x0}, 0xc0032e7480, {0x0, 0x0}, {0x234a0fd, 0x4}, {0x3a331c0, ...}, ...}, ...)
        /home/builds/go/pkg/mod/github.com/sleepinggenius2/gosmi@v0.4.4/models/format.go:163 +0x23c
github.com/sleepinggenius2/gosmi/models.Node.FormatValue(...)
        /home/builds/go/pkg/mod/github.com/sleepinggenius2/gosmi@v0.4.4/models/format.go:127
github.com/influxdata/telegraf/internal/snmp.(*gosmiTranslator).SnmpFormatEnum(0xc0004bcd17?, {0xc003c41920?, 0x271da20?}, {0x0, 0x0}, 0x0)
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/internal/snmp/translator_gosmi.go:68 +0x338
github.com/influxdata/telegraf/internal/snmp.(*Field).Convert(0xc003162b60, {{0x0, 0x0}, {0xc003c41920, 0x23}, 0x5})
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/internal/snmp/field.go:251 +0xab8
github.com/influxdata/telegraf/internal/snmp.Table.Build({{0x23499c1, 0x4}, {0x0, 0x0, 0x0}, 0x0, {0xc000113c08, 0x10, 0x10}, {0x0, ...}, ...}, ...)
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/internal/snmp/table.go:175 +0x66d
github.com/influxdata/telegraf/plugins/inputs/snmp.(*Snmp).gatherTable(0xc00043bb00, {0x275aec0, 0xc000866b60}, {0x274ec40, 0xc001e71b80}, {{0x23499c1, 0x4}, {0x0, 0x0, 0x0}, ...}, ...)
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:135 +0x87
github.com/influxdata/telegraf/plugins/inputs/snmp.(*Snmp).Gather.func1(0xc001ecdfd0?, {0xc0004f4211, 0xc})
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:117 +0x20b
created by github.com/influxdata/telegraf/plugins/inputs/snmp.(*Snmp).Gather in goroutine 608
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:103 +0x66
telegraf.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
telegraf.service: Failed with result 'exit-code'.
powersj commented 1 month ago

Hi,

We chatted about this briefly today and the next steps will involve looking to see what Telegraf can do about this. Either by dealing with the nil or adding some sort of other check. We will not be forking the upstream project unless we absolutely must do so.

powersj commented 1 month ago

I've put up https://github.com/influxdata/telegraf/pull/15743, but I'm not entirely sure if that resolves this or is the correct behavior. Essentially, I think your use-case is a nil value and we should return an empty string. Correct me if that is wrong.

srebhan commented 1 month ago

@llamafilm did you have any chance to test the mentioned PR? There is a release on Monday and we would really love to include this fix!

llamafilm commented 1 month ago

I haven't updated yet. The crash has not happened again since I last mentioned it a month ago. If the fix is low risk then I would suggest you go ahead and include it in the release. Then I'll upgrade and if it ever happens again I'll reopen this issue.