prometheus / snmp_exporter

SNMP Exporter for Prometheus
Apache License 2.0
1.68k stars 624 forks source link

Some 'context canceled' error in logs #487

Closed shdxiang closed 4 years ago

shdxiang commented 4 years ago

Host operating system: output of uname -a

Docker container: Linux d8827dda937a 4.19.95-flatcar #1 SMP Sat Feb 8 07:25:12 -00 2020 x86_64 GNU/Linux

snmp_exporter version: output of snmp_exporter -version

version=0.17.0, branch=HEAD, revision=f0ad4551a5c2023e383bc8dde2222f47dc760b83 f0ad4551a5c2023e383bc8dde2222f47dc760b83

What device/snmpwalk OID are you using?

generator.yml

nas:
    version: 2
    auth:
      community: public
    walk:
      - 1.3.6.1.4.1.24681.1.3.1 # systemCPU-UsageEX
      - 1.3.6.1.4.1.24681.1.3.11 # systemHdTableEX
      - 1.3.6.1.4.1.24681.1.3.15 # systemFanTable
      - 1.3.6.1.4.1.24681.1.3.17 # systemVolumeTable

If this is a new device, please link to the MIB(s).

Download from QNAP itself.

What did you do that produced an error?

I can get the metrics, but sometimes prometheus says that target is unreachable. And there are several error in snmp-exporter logs.

What did you expect to see?

No error in logs.

What did you see instead?

snmp-exporter_1       | level=info ts=2020-03-04T07:20:52.964Z caller=collector.go:216 module=dell_idrac target=172.30.60.220:161 msg="Error scraping target" err="error walking target 172.30.60.220: context canceled"
snmp-exporter_1       | level=info ts=2020-03-04T07:21:28.405Z caller=collector.go:216 module=nas target=172.30.50.104:161 msg="Error scraping target" err="error getting target 172.30.50.104: context canceled"
snmp-exporter_1       | level=info ts=2020-03-04T07:22:01.730Z caller=collector.go:216 module=dell_idrac target=172.30.60.209:161 msg="Error scraping target" err="error walking target 172.30.60.209: context canceled"
snmp-exporter_1       | level=info ts=2020-03-04T07:24:28.404Z caller=collector.go:216 module=nas target=172.30.50.104:161 msg="Error scraping target" err="error getting target 172.30.50.104: context canceled"
snmp-exporter_1       | level=info ts=2020-03-04T07:28:26.369Z caller=collector.go:216 module=dell_idrac target=172.30.60.211:161 msg="Error scraping target" err="error walking target 172.30.60.211: context canceled"
snmp-exporter_1       | level=info ts=2020-03-04T07:34:28.405Z caller=collector.go:216 module=nas target=172.30.50.104:161 msg="Error scraping target" err="error getting target 172.30.50.104: context canceled"
brian-brazil commented 4 years ago

That's a timeout, your device is presumably slow or far away from the exporter.

shdxiang commented 4 years ago

@brian-brazil They are in the same LAN, maybe the device sent response too slow.

SuperQ commented 4 years ago

@brian-brazil We could improve the error by checking if err == context.Canceled.

shdxiang commented 4 years ago

I have troubleshooted this issue for several days, noticed that it is not a simple timeout, my steps are:

  1. For debugging, I simplified the snmp.yml as:
qnap:
  get:
  - 1.3.6.1.4.1.24681.1.3.1.0
  version: 2
  auth:
    community: public

Start snmp_exporter on a Linux server(172.30.50.121) in the same LAN with target QNAP(172.30.50.104):

./snmp_exporter --log.level=debug

Run a script which request to snmp_exporter on 172.30.50.121:

exporter.sh

set -e
while true;
do
   curl -m 10 'http://localhost:9116/snmp?module=qnap&target=172.30.50.104'
   sleep 1
done

It will report timeout in minutes, and the snmp_exporter logged:

...
level=debug ts=2020-03-13T02:35:34.258Z caller=main.go:99 module=qnap target=172.30.50.104:161 msg="Starting scrape"
level=debug ts=2020-03-13T02:35:34.259Z caller=collector.go:132 module=qnap target=172.30.50.104:161 msg="Getting OIDs" oids=1
level=info ts=2020-03-13T02:36:34.259Z caller=collector.go:225 module=qnap target=172.30.50.104:161 msg="Error scraping target" err="scrape canceled (possible timeout) getting target 172.30.50.104"
level=debug ts=2020-03-13T02:36:34.259Z caller=main.go:110 module=qnap target=172.30.50.104:161 msg="Finished scrape" duration_seconds=60.000760423
  1. I tested the same OID with snmpget, also on 172.30.50.121:

snmpget.sh:

set -e
while true;
do
   snmpget -t 5 -v 2c -c public 172.30.50.104 1.3.6.1.4.1.24681.1.3.1.0
   sleep 1
done

But this script never timeout.

  1. I captured their packets with tcpdump, attached them, please help to analyze: tcpdump.zip

.

brian-brazil commented 4 years ago

Those look identical on the wire, so this is probably some networking issue on the box itself. I'd suggest checking your routing setup and which interfaces are being used.

shdxiang commented 4 years ago

@brian-brazil I am confused, If the NAS did not work well, snmpget should timeout too, right?

brian-brazil commented 4 years ago

netsnmp does some non-standard network stuff iirc, something about not checking that packets are coming back from where you sent them to. So it may work in situations where your routing is actually broken.

shdxiang commented 4 years ago

Actually the test server and the QNAP NAS are connected to the same switch, so I think there should not have routing issue.

brian-brazil commented 4 years ago

I'd suggest using strace then to see if the packets are making it to the snmp exporter.

shdxiang commented 4 years ago

I have run tcpdump on the NAS itself, and it captured the response, but the tcpdump on the Linux server which run snmp_exporter did not get that response, so this should be a network issue. Thanks.