prometheus / snmp_exporter

SNMP Exporter for Prometheus
Apache License 2.0
1.63k stars 611 forks source link

Scans are slow every 5 minutes, with a single retry #1156

Open desultory opened 4 months ago

desultory commented 4 months ago

snmp_exporter version: output of snmp_exporter -version

snmp_exporter | ts=2024-04-11T15:09:38.160Z caller=main.go:194 level=info msg="Starting snmp_exporter" version="(version=0.25.0, branch=HEAD, revision=9c42d6c874d479314e612bca69558c81f8e26287)" concurrency=1 snmp_exporter | ts=2024-04-11T15:09:38.160Z caller=main.go:195 level=info build_context="(go=go1.21.5, platform=linux/amd64, user=root@880115266f70, date=20231210-10:05:18, tags=netgo)"

What device/snmpwalk OID are you using?

IF-MIB:

oid=1.3.6.1.2.1.2

If this is a new device, please link to the MIB(s).

It's a Netgear GS324TP S350

What did you do that produced an error?

Nothing different than what doesn't produce an error.

What did you expect to see?

snmp_exporter  | ts=2024-04-11T15:11:24.535Z caller=collector.go:460 level=debug auth=netgear_switch target=172.16.99.2 module=if_mib msg="Starting scrape"
snmp_exporter  | ts=2024-04-11T15:11:24.535Z caller=collector.go:214 level=debug auth=netgear_switch target=172.16.99.2 module=if_mib msg="Walking subtree" oid=1.3.6.1.2.1.2
snmp_exporter  | ts=2024-04-11T15:11:24.731Z caller=collector.go:228 level=debug auth=netgear_switch target=172.16.99.2 module=if_mib msg="Walk of subtree completed" oid=1.3.6.1.2.1.2 duration_seconds=195.774683ms
snmp_exporter  | ts=2024-04-11T15:11:24.731Z caller=collector.go:214 level=debug auth=netgear_switch target=172.16.99.2 module=if_mib msg="Walking subtree" oid=1.3.6.1.2.1.31.1.1
snmp_exporter  | ts=2024-04-11T15:11:24.917Z caller=collector.go:228 level=debug auth=netgear_switch target=172.16.99.2 module=if_mib msg="Walk of subtree completed" oid=1.3.6.1.2.1.31.1.1 duration_seconds=186.542915ms
snmp_exporter  | ts=2024-04-11T15:11:24.935Z caller=collector.go:464 level=debug auth=netgear_switch target=172.16.99.2 module=if_mib msg="Finished scrape" duration_seconds=0.400332611

What did you see instead?

# HELP snmp_scrape_duration_seconds Total SNMP time scrape took (walk and processing).
# TYPE snmp_scrape_duration_seconds gauge
snmp_scrape_duration_seconds{module="if_mib"} 5.410868688
# HELP snmp_scrape_packets_retried Packets retried for get, bulkget, and walk.
# TYPE snmp_scrape_packets_retried gauge
snmp_scrape_packets_retried{module="if_mib"} 1
# HELP snmp_scrape_packets_sent Packets sent for get, bulkget, and walk; including retries.
# TYPE snmp_scrape_packets_sent gauge
snmp_scrape_packets_sent{module="if_mib"} 58
# HELP snmp_scrape_pdus_returned PDUs returned from get, bulkget, and walk.
# TYPE snmp_scrape_pdus_returned gauge
snmp_scrape_pdus_returned{module="if_mib"} 1366
# HELP snmp_scrape_walk_duration_seconds Time SNMP walk/bulkwalk took.
# TYPE snmp_scrape_walk_duration_seconds gauge
snmp_scrape_walk_duration_seconds{module="if_mib"} 5.392635394

This result was taken from a different response, so the times differ a bit, but each query with a spike says one retried packet.

snmp_exporter  | ts=2024-04-11T15:14:24.534Z caller=collector.go:460 level=debug auth=netgear_switch target=172.16.99.2 module=if_mib msg="Starting scrape"
snmp_exporter  | ts=2024-04-11T15:14:24.534Z caller=collector.go:214 level=debug auth=netgear_switch target=172.16.99.2 module=if_mib msg="Walking subtree" oid=1.3.6.1.2.1.2
snmp_exporter  | ts=2024-04-11T15:10:29.763Z caller=collector.go:228 level=debug auth=netgear_switch target=172.16.99.2 module=if_mib msg="Walk of subtree completed" oid=1.3.6.1.2.1.2 duration_seconds=5.228701568s
snmp_exporter  | ts=2024-04-11T15:10:29.763Z caller=collector.go:214 level=debug auth=netgear_switch target=172.16.99.2 module=if_mib msg="Walking subtree" oid=1.3.6.1.2.1.31.1.1
snmp_exporter  | ts=2024-04-11T15:10:29.962Z caller=collector.go:228 level=debug auth=netgear_switch target=172.16.99.2 module=if_mib msg="Walk of subtree completed" oid=1.3.6.1.2.1.31.1.1 duration_seconds=199.234426ms
snmp_exporter  | ts=2024-04-11T15:10:29.980Z caller=collector.go:464 level=debug auth=netgear_switch target=172.16.99.2 module=if_mib msg="Finished scrape" duration_seconds=5.446228759

When I check the response, I can see that it says it retied a single packet, but it's not clear what was retried. The logs don't make any mention of what failed. Similar behavior is being observed with a Lenovo Xclarity BMC, also happening every 5 minutes with a single retry.

desultory commented 4 months ago

image it could be something related to the device? maybe it does something on 5m intervals that slow things down. I've been watching the network and I don't see anything unusual happening at that time, and mostly see the exporters talking to the device. There's a small firmware revision available for that device, but it has no mention of SNMP changes or anything that could affect this.

SuperQ commented 4 months ago

Yes, that very much seems like something about the device.

Best you can do is maybe tcpdump/wireshark the scrape to compare send and receive times to look for a response latency problem with the device.

desultory commented 4 months ago

Yes, that very much seems like something about the device.

Best you can do is maybe tcpdump/wireshark the scrape to compare send and receive times to look for a response latency problem with the device.

I agree, but debug info about which oid was retried would be great info here.

desultory commented 4 months ago

I'm using a cut down version of the IF-MIB config, not sure if that is impacting this somehow:

  if_mib:
    walk:
    - 1.3.6.1.2.1.2
    - 1.3.6.1.2.1.31.1.1
    metrics:
    - name: ifNumber
      oid: 1.3.6.1.2.1.2.1
      type: gauge
      help: The number of network interfaces (regardless of their current state) present
        on this system. - 1.3.6.1.2.1.2.1
    - name: ifIndex
      oid: 1.3.6.1.2.1.2.2.1.1
      type: gauge
      help: A unique value, greater than zero, for each interface - 1.3.6.1.2.1.2.2.1.1
      indexes:
      - labelname: ifIndex
        type: gauge
      lookups:
      - labels:
        - ifIndex
        labelname: ifAlias
        oid: 1.3.6.1.2.1.31.1.1.1.18
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifDescr
        oid: 1.3.6.1.2.1.2.2.1.2
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifName
        oid: 1.3.6.1.2.1.31.1.1.1.1
        type: DisplayString
    - name: ifType
      oid: 1.3.6.1.2.1.2.2.1.3
      type: EnumAsInfo
      help: The type of interface - 1.3.6.1.2.1.2.2.1.3
      indexes:
      - labelname: ifIndex
        type: gauge
      lookups:
      - labels:
        - ifIndex
        labelname: ifAlias
        oid: 1.3.6.1.2.1.31.1.1.1.18
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifDescr
        oid: 1.3.6.1.2.1.2.2.1.2
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifName
        oid: 1.3.6.1.2.1.31.1.1.1.1
        type: DisplayString
      enum_values:
        1: other
        2: regular1822
        3: hdh1822
        4: ddnX25
        5: rfc877x25
        6: ethernetCsmacd
        7: iso88023Csmacd
        8: iso88024TokenBus
        9: iso88025TokenRing
        10: iso88026Man
        11: starLan
        12: proteon10Mbit
        13: proteon80Mbit
        14: hyperchannel
        15: fddi
        16: lapb
        17: sdlc
        18: ds1
        19: e1
        20: basicISDN
        21: primaryISDN
        22: propPointToPointSerial
        23: ppp
        24: softwareLoopback
        25: eon
        26: ethernet3Mbit
        27: nsip
        28: slip
        29: ultra
        30: ds3
        31: sip
        32: frameRelay
        33: rs232
        34: para
        35: arcnet
        36: arcnetPlus
        37: atm
        38: miox25
        39: sonet
        40: x25ple
        41: iso88022llc
        42: localTalk
        43: smdsDxi
        44: frameRelayService
        45: v35
        46: hssi
        47: hippi
        48: modem
        49: aal5
        50: sonetPath
        51: sonetVT
        52: smdsIcip
        53: propVirtual
        54: propMultiplexor
        55: ieee80212
        56: fibreChannel
        57: hippiInterface
        58: frameRelayInterconnect
        59: aflane8023
        60: aflane8025
        61: cctEmul
        62: fastEther
        63: isdn
        64: v11
        65: v36
        66: g703at64k
        67: g703at2mb
        68: qllc
        69: fastEtherFX
        70: channel
        71: ieee80211
        72: ibm370parChan
        73: escon
        74: dlsw
        75: isdns
        76: isdnu
        77: lapd
        78: ipSwitch
        79: rsrb
        80: atmLogical
        81: ds0
        82: ds0Bundle
        83: bsc
        84: async
        85: cnr
        86: iso88025Dtr
        87: eplrs
        88: arap
        89: propCnls
        90: hostPad
        91: termPad
        92: frameRelayMPI
        93: x213
        94: adsl
        95: radsl
        96: sdsl
        97: vdsl
        98: iso88025CRFPInt
        99: myrinet
        100: voiceEM
        101: voiceFXO
        102: voiceFXS
        103: voiceEncap
        104: voiceOverIp
        105: atmDxi
        106: atmFuni
        107: atmIma
        108: pppMultilinkBundle
        109: ipOverCdlc
        110: ipOverClaw
        111: stackToStack
        112: virtualIpAddress
        113: mpc
        114: ipOverAtm
        115: iso88025Fiber
        116: tdlc
        117: gigabitEthernet
        118: hdlc
        119: lapf
        120: v37
        121: x25mlp
        122: x25huntGroup
        123: transpHdlc
        124: interleave
        125: fast
        126: ip
        127: docsCableMaclayer
        128: docsCableDownstream
        129: docsCableUpstream
        130: a12MppSwitch
        131: tunnel
        132: coffee
        133: ces
        134: atmSubInterface
        135: l2vlan
        136: l3ipvlan
        137: l3ipxvlan
        138: digitalPowerline
        139: mediaMailOverIp
        140: dtm
        141: dcn
        142: ipForward
        143: msdsl
        144: ieee1394
        145: if-gsn
        146: dvbRccMacLayer
        147: dvbRccDownstream
        148: dvbRccUpstream
        149: atmVirtual
        150: mplsTunnel
        151: srp
        152: voiceOverAtm
        153: voiceOverFrameRelay
        154: idsl
        155: compositeLink
        156: ss7SigLink
        157: propWirelessP2P
        158: frForward
        159: rfc1483
        160: usb
        161: ieee8023adLag
        162: bgppolicyaccounting
        163: frf16MfrBundle
        164: h323Gatekeeper
        165: h323Proxy
        166: mpls
        167: mfSigLink
        168: hdsl2
        169: shdsl
        170: ds1FDL
        171: pos
        172: dvbAsiIn
        173: dvbAsiOut
        174: plc
        175: nfas
        176: tr008
        177: gr303RDT
        178: gr303IDT
        179: isup
        180: propDocsWirelessMaclayer
        181: propDocsWirelessDownstream
        182: propDocsWirelessUpstream
        183: hiperlan2
        184: propBWAp2Mp
        185: sonetOverheadChannel
        186: digitalWrapperOverheadChannel
        187: aal2
        188: radioMAC
        189: atmRadio
        190: imt
        191: mvl
        192: reachDSL
        193: frDlciEndPt
        194: atmVciEndPt
        195: opticalChannel
        196: opticalTransport
        197: propAtm
        198: voiceOverCable
        199: infiniband
        200: teLink
        201: q2931
        202: virtualTg
        203: sipTg
        204: sipSig
        205: docsCableUpstreamChannel
        206: econet
        207: pon155
        208: pon622
        209: bridge
        210: linegroup
        211: voiceEMFGD
        212: voiceFGDEANA
        213: voiceDID
        214: mpegTransport
        215: sixToFour
        216: gtp
        217: pdnEtherLoop1
        218: pdnEtherLoop2
        219: opticalChannelGroup
        220: homepna
        221: gfp
        222: ciscoISLvlan
        223: actelisMetaLOOP
        224: fcipLink
        225: rpr
        226: qam
        227: lmp
        228: cblVectaStar
        229: docsCableMCmtsDownstream
        230: adsl2
        231: macSecControlledIF
        232: macSecUncontrolledIF
        233: aviciOpticalEther
        234: atmbond
        235: voiceFGDOS
        236: mocaVersion1
        237: ieee80216WMAN
        238: adsl2plus
        239: dvbRcsMacLayer
        240: dvbTdm
        241: dvbRcsTdma
        242: x86Laps
        243: wwanPP
        244: wwanPP2
        245: voiceEBS
        246: ifPwType
        247: ilan
        248: pip
        249: aluELP
        250: gpon
        251: vdsl2
        252: capwapDot11Profile
        253: capwapDot11Bss
        254: capwapWtpVirtualRadio
        255: bits
        256: docsCableUpstreamRfPort
        257: cableDownstreamRfPort
        258: vmwareVirtualNic
        259: ieee802154
        260: otnOdu
        261: otnOtu
        262: ifVfiType
        263: g9981
        264: g9982
        265: g9983
        266: aluEpon
        267: aluEponOnu
        268: aluEponPhysicalUni
        269: aluEponLogicalLink
        270: aluGponOnu
        271: aluGponPhysicalUni
        272: vmwareNicTeam
        277: docsOfdmDownstream
        278: docsOfdmaUpstream
        279: gfast
        280: sdci
        281: xboxWireless
        282: fastdsl
        283: docsCableScte55d1FwdOob
        284: docsCableScte55d1RetOob
        285: docsCableScte55d2DsOob
        286: docsCableScte55d2UsOob
        287: docsCableNdf
        288: docsCableNdr
        289: ptm
        290: ghn
        291: otnOtsi
        292: otnOtuc
        293: otnOduc
        294: otnOtsig
        295: microwaveCarrierTermination
        296: microwaveRadioLinkTerminal
        297: ieee8021axDrni
        298: ax25
        299: ieee19061nanocom
        300: cpri
        301: omni
        302: roe
        303: p2pOverLan
    - name: ifPhysAddress
      oid: 1.3.6.1.2.1.2.2.1.6
      type: PhysAddress48
      help: The interface's address at its protocol sub-layer - 1.3.6.1.2.1.2.2.1.6
      indexes:
      - labelname: ifIndex
        type: gauge
      lookups:
      - labels:
        - ifIndex
        labelname: ifAlias
        oid: 1.3.6.1.2.1.31.1.1.1.18
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifDescr
        oid: 1.3.6.1.2.1.2.2.1.2
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifName
        oid: 1.3.6.1.2.1.31.1.1.1.1
        type: DisplayString
    - name: ifAdminStatus
      oid: 1.3.6.1.2.1.2.2.1.7
      type: EnumAsStateSet
      help: The desired state of the interface - 1.3.6.1.2.1.2.2.1.7
      indexes:
      - labelname: ifIndex
        type: gauge
      lookups:
      - labels:
        - ifIndex
        labelname: ifAlias
        oid: 1.3.6.1.2.1.31.1.1.1.18
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifDescr
        oid: 1.3.6.1.2.1.2.2.1.2
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifName
        oid: 1.3.6.1.2.1.31.1.1.1.1
        type: DisplayString
      enum_values:
        1: up
        2: down
        3: testing
    - name: ifOperStatus
      oid: 1.3.6.1.2.1.2.2.1.8
      type: EnumAsStateSet
      help: The current operational state of the interface - 1.3.6.1.2.1.2.2.1.8
      indexes:
      - labelname: ifIndex
        type: gauge
      lookups:
      - labels:
        - ifIndex
        labelname: ifAlias
        oid: 1.3.6.1.2.1.31.1.1.1.18
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifDescr
        oid: 1.3.6.1.2.1.2.2.1.2
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifName
        oid: 1.3.6.1.2.1.31.1.1.1.1
        type: DisplayString
      enum_values:
        1: up
        2: down
        3: testing
        4: unknown
        5: dormant
        6: notPresent
        7: lowerLayerDown
    - name: ifLastChange
      oid: 1.3.6.1.2.1.2.2.1.9
      type: gauge
      help: The value of sysUpTime at the time the interface entered its current operational
        state - 1.3.6.1.2.1.2.2.1.9
      indexes:
      - labelname: ifIndex
        type: gauge
      lookups:
      - labels:
        - ifIndex
        labelname: ifAlias
        oid: 1.3.6.1.2.1.31.1.1.1.18
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifDescr
        oid: 1.3.6.1.2.1.2.2.1.2
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifName
        oid: 1.3.6.1.2.1.31.1.1.1.1
        type: DisplayString
    - name: ifInOctets
      oid: 1.3.6.1.2.1.2.2.1.10
      type: counter
      help: The total number of octets received on the interface, including framing
        characters - 1.3.6.1.2.1.2.2.1.10
      indexes:
      - labelname: ifIndex
        type: gauge
      lookups:
      - labels:
        - ifIndex
        labelname: ifAlias
        oid: 1.3.6.1.2.1.31.1.1.1.18
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifDescr
        oid: 1.3.6.1.2.1.2.2.1.2
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifName
        oid: 1.3.6.1.2.1.31.1.1.1.1
        type: DisplayString
    - name: ifInDiscards
      oid: 1.3.6.1.2.1.2.2.1.13
      type: counter
      help: The number of inbound packets which were chosen to be discarded even though
        no errors had been detected to prevent their being deliverable to a higher-layer
        protocol - 1.3.6.1.2.1.2.2.1.13
      indexes:
      - labelname: ifIndex
        type: gauge
      lookups:
      - labels:
        - ifIndex
        labelname: ifAlias
        oid: 1.3.6.1.2.1.31.1.1.1.18
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifDescr
        oid: 1.3.6.1.2.1.2.2.1.2
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifName
        oid: 1.3.6.1.2.1.31.1.1.1.1
        type: DisplayString
    - name: ifInErrors
      oid: 1.3.6.1.2.1.2.2.1.14
      type: counter
      help: For packet-oriented interfaces, the number of inbound packets that contained
        errors preventing them from being deliverable to a higher-layer protocol -
        1.3.6.1.2.1.2.2.1.14
      indexes:
      - labelname: ifIndex
        type: gauge
      lookups:
      - labels:
        - ifIndex
        labelname: ifAlias
        oid: 1.3.6.1.2.1.31.1.1.1.18
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifDescr
        oid: 1.3.6.1.2.1.2.2.1.2
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifName
        oid: 1.3.6.1.2.1.31.1.1.1.1
        type: DisplayString
    - name: ifOutOctets
      oid: 1.3.6.1.2.1.2.2.1.16
      type: counter
      help: The total number of octets transmitted out of the interface, including
        framing characters - 1.3.6.1.2.1.2.2.1.16
      indexes:
      - labelname: ifIndex
        type: gauge
      lookups:
      - labels:
        - ifIndex
        labelname: ifAlias
        oid: 1.3.6.1.2.1.31.1.1.1.18
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifDescr
        oid: 1.3.6.1.2.1.2.2.1.2
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifName
        oid: 1.3.6.1.2.1.31.1.1.1.1
        type: DisplayString
    - name: ifOutDiscards
      oid: 1.3.6.1.2.1.2.2.1.19
      type: counter
      help: The number of outbound packets which were chosen to be discarded even
        though no errors had been detected to prevent their being transmitted - 1.3.6.1.2.1.2.2.1.19
      indexes:
      - labelname: ifIndex
        type: gauge
      lookups:
      - labels:
        - ifIndex
        labelname: ifAlias
        oid: 1.3.6.1.2.1.31.1.1.1.18
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifDescr
        oid: 1.3.6.1.2.1.2.2.1.2
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifName
        oid: 1.3.6.1.2.1.31.1.1.1.1
        type: DisplayString
    - name: ifOutErrors
      oid: 1.3.6.1.2.1.2.2.1.20
      type: counter
      help: For packet-oriented interfaces, the number of outbound packets that could
        not be transmitted because of errors - 1.3.6.1.2.1.2.2.1.20
      indexes:
      - labelname: ifIndex
        type: gauge
      lookups:
      - labels:
        - ifIndex
        labelname: ifAlias
        oid: 1.3.6.1.2.1.31.1.1.1.18
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifDescr
        oid: 1.3.6.1.2.1.2.2.1.2
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifName
        oid: 1.3.6.1.2.1.31.1.1.1.1
        type: DisplayString
    - name: ifOutQLen
      oid: 1.3.6.1.2.1.2.2.1.21
      type: gauge
      help: The length of the output packet queue (in packets). - 1.3.6.1.2.1.2.2.1.21
      indexes:
      - labelname: ifIndex
        type: gauge
      lookups:
      - labels:
        - ifIndex
        labelname: ifAlias
        oid: 1.3.6.1.2.1.31.1.1.1.18
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifDescr
        oid: 1.3.6.1.2.1.2.2.1.2
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifName
        oid: 1.3.6.1.2.1.31.1.1.1.1
        type: DisplayString
    - name: ifConnectorPresent
      oid: 1.3.6.1.2.1.31.1.1.1.17
      type: EnumAsStateSet
      help: This object has the value 'true(1)' if the interface sublayer has a physical
        connector and the value 'false(2)' otherwise. - 1.3.6.1.2.1.31.1.1.1.17
      indexes:
      - labelname: ifIndex
        type: gauge
      lookups:
      - labels:
        - ifIndex
        labelname: ifAlias
        oid: 1.3.6.1.2.1.31.1.1.1.18
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifDescr
        oid: 1.3.6.1.2.1.2.2.1.2
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifName
        oid: 1.3.6.1.2.1.31.1.1.1.1
        type: DisplayString
      enum_values:
        1: "true"
        2: "false"
    - name: ifCounterDiscontinuityTime
      oid: 1.3.6.1.2.1.31.1.1.1.19
      type: gauge
      help: The value of sysUpTime on the most recent occasion at which any one or
        more of this interface's counters suffered a discontinuity - 1.3.6.1.2.1.31.1.1.1.19
      indexes:
      - labelname: ifIndex
        type: gauge
      lookups:
      - labels:
        - ifIndex
        labelname: ifAlias
        oid: 1.3.6.1.2.1.31.1.1.1.18
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifDescr
        oid: 1.3.6.1.2.1.2.2.1.2
        type: DisplayString
      - labels:
        - ifIndex
        labelname: ifName
        oid: 1.3.6.1.2.1.31.1.1.1.1
        type: DisplayString
SuperQ commented 4 months ago

We would need to have retry callback hooks in the SNMP library in order to get more per walk debugging. It may be possible, I would have to look at the library code.

desultory commented 4 months ago

We would need to have retry callback hooks in the SNMP library in order to get more per walk debugging. It may be possible, I would have to look at the library code.

if this is possible, it would be great, I'm not sure how else I could easily determine which oid was failing. I may try to decode some of the snmp3 packets to try to see what was requested twice.

SuperQ commented 4 months ago

Can you give this PR a try? https://github.com/prometheus/snmp_exporter/pull/1157

You can find the output build artifacts in CircleCI.

desultory commented 4 months ago

Can you give this PR a try? #1157

You can find the output build artifacts in CircleCI.

Thank you very much, I was able to build that branch with "go build" and can run the output with "./snmp_exporter --snmp.debug-packets --log.level=debug".

I've been running this build on my desktop for a while, which is accessing that problem device over a VPN, and in 10 minutes I haven't been able to record a duration greater than 600ms in the last 10 minutes. I have curls running with "watch -n 60" so it does one query a minute.

Do you know what I could filter the output by to only see that particular new debug message? That may make it easier to catch this issue with this improved debugging.

SuperQ commented 4 months ago

I would look for Retry number and request timeout in the logs.

desultory commented 4 months ago

For whatever reason I'm not able to reproduce this on my desktop:

> grep "Retry number" log.txt
> grep "request timeout" log.txt
> grep "Finished scrape" log.txt
ts=2024-04-12T16:22:10.522Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.647985982
ts=2024-04-12T16:22:26.162Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.634859661
ts=2024-04-12T16:22:41.775Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.607646046
ts=2024-04-12T16:22:57.411Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.630984308
ts=2024-04-12T16:23:13.042Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.626529731
ts=2024-04-12T16:23:28.661Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.613124682
ts=2024-04-12T16:23:44.301Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.634122488
ts=2024-04-12T16:23:59.934Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.627908045
ts=2024-04-12T16:24:15.549Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.610465316
ts=2024-04-12T16:24:31.166Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.611277663
ts=2024-04-12T16:24:46.799Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.628145301
ts=2024-04-12T16:25:02.422Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.61721881
ts=2024-04-12T16:25:18.060Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.632497305
ts=2024-04-12T16:25:33.698Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.633533768
ts=2024-04-12T16:25:49.326Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.622159064
ts=2024-04-12T16:26:04.950Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.618739644
ts=2024-04-12T16:26:20.579Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.623703638
ts=2024-04-12T16:26:36.203Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.618068765
ts=2024-04-12T16:26:51.821Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.613324038
ts=2024-04-12T16:27:07.453Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.625984028
ts=2024-04-12T16:27:23.092Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.633752763
ts=2024-04-12T16:27:38.735Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.638352054
ts=2024-04-12T16:27:54.368Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.626199045
ts=2024-04-12T16:28:10.009Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.636024974
ts=2024-04-12T16:28:25.630Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.615045888
ts=2024-04-12T16:28:41.265Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.630309173
ts=2024-04-12T16:28:56.892Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.621721412
ts=2024-04-12T16:29:12.546Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.64898922
ts=2024-04-12T16:29:28.179Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.627764007
ts=2024-04-12T16:29:43.820Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.635411245
ts=2024-04-12T16:29:59.454Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.627675049
ts=2024-04-12T16:30:15.075Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.616801067
ts=2024-04-12T16:30:30.718Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.63589801
ts=2024-04-12T16:30:46.359Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.635852961
ts=2024-04-12T16:31:02.014Z caller=collector.go:448 level=debug auth=netgear_switch target=172.16.99.2 worker=0 module=if_mib msg="Finished scrape" duration_seconds=0.649401456
desultory commented 4 months ago

image I'm still observing that behavior on the existing docker container. I'll try to run this patched code directly on that server next.

SuperQ commented 4 months ago

Yea, I typically don't recommend running SNMP over VPN/WAN links. SNMP, being UDP-based, is very sensitive to packet loss and latency.

My standard recommendation is to run the snmp_exporter as close to the target devices as possible.

desultory commented 4 months ago

Yea, I typically don't recommend running SNMP over VPN/WAN links. SNMP, being UDP-based, is very sensitive to packet loss and latency.

My standard recommendation is to run the snmp_exporter as close to the target devices as possible.

The server with the issues is as close as it can be, it's under a router and then that router is attached to this switch.

What is strange is that I'm not experiencing the issue over a VPN.

The topology where the problem is observed: Docker container -> core router -> switch (snmp) The docker host is not NATing or anything, the core router is aware of the docker subnets on that host.

The topology where no issue is observed: My desktop -> vpn -> management router -> switch (snmp)

A different router is being used, but I've checked the core router and it doesn't report any dropped/error packets.