Closed llamafilm closed 1 month ago
@llamafilm could you please try to reproduce this with the latest master, ideally with only the SNMP input (and a file output) plugin? We moved the SNMP code around quite a bit between v1.29 and v1.30...
You need to look at the correct version of the source code; this is snmp.go:323: https://github.com/influxdata/telegraf/blob/v1.29.5/plugins/inputs/snmp/snmp.go#L323
Going through the stack trace, the panic actually happens here: https://github.com/sleepinggenius2/gosmi/blob/v0.4.4/models/format_bits.go#L31 which should be replaced by
octets := v.Bytes()
Thanks for calling that out. I guess the question then is if Telegraf should make a change as well? If the value is nil, should Telegraf even be calling the format value function?
I would try to get that fixed upstream and see what the maintainers say.
I have put up issue https://github.com/sleepinggenius2/gosmi/issues/44 and a PR https://github.com/sleepinggenius2/gosmi/pull/45
Happy to have reviews or comments on those. I did not realize this library had not had many updates in a while, so let's see if we get a response.
It appears the maintainer has not been very active lately. Let's see indeed.
It seems like the upstream library has been abandoned. What should be done about this?
This same crash happened again today on version 1.30.3. Do you have any ideas how I could determine which SNMP device is the cause? It happens very intermittently, and I have hundreds of SNMP devices in the config, so I can't easily test them one by one.
Here's a more concise log output:
Started Telegraf.
panic: interface conversion: interface {} is nil, not []uint8
goroutine 641 [running]:
github.com/sleepinggenius2/gosmi/models.GetEnumBitsFormatted({0x0?, 0x0?}, 0x2?, 0xc000534680?)
/home/builds/go/pkg/mod/github.com/sleepinggenius2/gosmi@v0.4.4/models/format_bits.go:31 +0x598
github.com/sleepinggenius2/gosmi/models.Type.FormatValue({0xb, 0x1, {0x0, 0x0}, 0xc0032e7480, {0x0, 0x0}, {0x234a0fd, 0x4}, {0x3a331c0, ...}, ...}, ...)
/home/builds/go/pkg/mod/github.com/sleepinggenius2/gosmi@v0.4.4/models/format.go:163 +0x23c
github.com/sleepinggenius2/gosmi/models.Node.FormatValue(...)
/home/builds/go/pkg/mod/github.com/sleepinggenius2/gosmi@v0.4.4/models/format.go:127
github.com/influxdata/telegraf/internal/snmp.(*gosmiTranslator).SnmpFormatEnum(0xc0004bcd17?, {0xc003c41920?, 0x271da20?}, {0x0, 0x0}, 0x0)
/data/agent/workspace/MSE-aragorn-publish/build/telegraf/internal/snmp/translator_gosmi.go:68 +0x338
github.com/influxdata/telegraf/internal/snmp.(*Field).Convert(0xc003162b60, {{0x0, 0x0}, {0xc003c41920, 0x23}, 0x5})
/data/agent/workspace/MSE-aragorn-publish/build/telegraf/internal/snmp/field.go:251 +0xab8
github.com/influxdata/telegraf/internal/snmp.Table.Build({{0x23499c1, 0x4}, {0x0, 0x0, 0x0}, 0x0, {0xc000113c08, 0x10, 0x10}, {0x0, ...}, ...}, ...)
/data/agent/workspace/MSE-aragorn-publish/build/telegraf/internal/snmp/table.go:175 +0x66d
github.com/influxdata/telegraf/plugins/inputs/snmp.(*Snmp).gatherTable(0xc00043bb00, {0x275aec0, 0xc000866b60}, {0x274ec40, 0xc001e71b80}, {{0x23499c1, 0x4}, {0x0, 0x0, 0x0}, ...}, ...)
/data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:135 +0x87
github.com/influxdata/telegraf/plugins/inputs/snmp.(*Snmp).Gather.func1(0xc001ecdfd0?, {0xc0004f4211, 0xc})
/data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:117 +0x20b
created by github.com/influxdata/telegraf/plugins/inputs/snmp.(*Snmp).Gather in goroutine 608
/data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:103 +0x66
telegraf.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
telegraf.service: Failed with result 'exit-code'.
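On the question of which device triggers this: the trace shows the panic escaping from the per-agent goroutine started in Gather, so one way to narrow it down, assuming you can run a patched build, is to recover inside that goroutine and log the agent. The following is only a sketch of that idea and not Telegraf's actual code; gatherAgent, gatherAll, and the agent addresses are hypothetical names made up for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// gatherAgent is a hypothetical stand-in for the per-agent work. It panics
// for one agent to imitate the nil type assertion seen in the trace above.
func gatherAgent(agent string) {
	var value interface{} // nil, as a device in a bad state might return
	if agent == "10.0.0.2" {
		_ = value.([]byte) // panics: value is nil, not []byte
	}
}

// gatherAll runs every agent in its own goroutine and converts a panic into
// an error that names the agent, so one bad device does not crash the process.
func gatherAll(agents []string) map[string]error {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		errs = make(map[string]error)
	)
	for _, agent := range agents {
		wg.Add(1)
		go func(agent string) {
			defer wg.Done()
			defer func() {
				// recover only works in a function deferred by the
				// goroutine that panicked, hence the per-agent placement.
				if r := recover(); r != nil {
					mu.Lock()
					errs[agent] = fmt.Errorf("agent %s: recovered panic: %v", agent, r)
					mu.Unlock()
				}
			}()
			gatherAgent(agent)
		}(agent)
	}
	wg.Wait()
	return errs
}

func main() {
	for agent, err := range gatherAll([]string{"10.0.0.1", "10.0.0.2"}) {
		fmt.Println(agent, err)
	}
}
```

With something like this in place, the log would name the offending agent the next time the intermittent failure occurs, instead of the whole service exiting.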
Hi,
We chatted about this briefly today, and the next step is to see what Telegraf can do about this, either by handling the nil value or by adding some other check. We will not fork the upstream project unless we absolutely must.
I've put up https://github.com/influxdata/telegraf/pull/15743, but I'm not entirely sure if that resolves this or is the correct behavior. Essentially, I think your use-case is a nil value and we should return an empty string. Correct me if that is wrong.
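To illustrate the behavior described above (a nil value yields an empty string), here is a minimal sketch of the guard. It is not the code from the PR or from gosmi; formatBits and safeFormatBits are hypothetical helpers, and the hex formatting is just a placeholder for the real enum/bits formatting.

```go
package main

import "fmt"

// formatBits is a hypothetical formatter that assumes it always receives raw
// bytes. The unchecked assertion panics when value is nil or not a []byte.
func formatBits(value interface{}) string {
	octets := value.([]byte)
	return fmt.Sprintf("% x", octets)
}

// safeFormatBits uses the comma-ok form of the type assertion and returns an
// empty string for nil or non-byte values instead of panicking.
func safeFormatBits(value interface{}) string {
	octets, ok := value.([]byte)
	if !ok {
		return ""
	}
	return fmt.Sprintf("% x", octets)
}

func main() {
	fmt.Printf("%q\n", safeFormatBits(nil))                // ""
	fmt.Printf("%q\n", safeFormatBits([]byte{0x80, 0x01})) // "80 01"
}
```

The same check could live on the caller's side instead, so that the formatting function is never called with a nil value.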
@llamafilm did you have any chance to test the mentioned PR? There is a release on Monday and we would really love to include this fix!
I haven't updated yet. The crash has not happened again since I last mentioned it a month ago. If the fix is low risk then I would suggest you go ahead and include it in the release. Then I'll upgrade and if it ever happens again I'll reopen this issue.
Relevant telegraf.conf
Logs from Telegraf
System info
Telegraf 1.29.5-66b924ec, Ubuntu 22.04.4
Docker
No response
Steps to reproduce
Unknown
Expected behavior
no crash
Actual behavior
Telegraf has been running for several days under systemd, and this weekend it crashed. Systemd tried to restart it several times, and it kept crashing repeatedly. This log snippet from journald shows a full cycle, beginning after the first crash, until it crashes again. My telegraf config is several thousand lines long, so I'm not sure which part is relevant here. I have dozens of different SNMP devices with different input configs and processors.
There was a power outage Saturday morning, about 24 hours before this crash occurred, so it's likely some of the SNMP devices were in a bad state, but I can't reproduce it. This morning after restarting the service it's working fine.
Additional info
I built this telegraf binary using the custom builder to reduce the input and output plugins, but I did not customize anything else. So it's weird that the log references lines that don't exist, like snmp.go:323.