influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Unable to get SNMP plugin work with version 3 authentication #3655

Closed ashuw018 closed 6 years ago

ashuw018 commented 6 years ago

Directions

I am trying to get Telegraf working with the SNMP plugin. Everything works fine if the device IP in the agents section is a local network IP, but it does not work if I try the same with a device IP on any external network.

Everything works with snmpwalk and snmptable for both local and external network devices, but through Telegraf it doesn't work for the external device.

Relevant telegraf.conf:

[[inputs.snmp]]
  agents = [ "x.x.x.x:161" ]
  version = 3

  sec_name = "myuser"
  auth_protocol = "SHA"
  auth_password = "my_Auth"
  sec_level = "authPriv"
  context_name = ""
  priv_protocol = "AES"
  priv_password = "my_PWD"

  name = "system"

  [[inputs.snmp.field]]
    name = "hostname"
    oid = "RFC1213-MIB::sysName.0"
    is_tag = true

  [[inputs.snmp.table]]
    name = "snmp"
    inherit_tags = [ "hostname" ]
    oid = "IF-MIB::ifXTable"

    [[inputs.snmp.table.field]]
      name = "ifName"
      oid = "IF-MIB::ifName"
      is_tag = true

System info:

Telegraf version = v1.5.0
Net-SNMP
Windows 10 and Windows 2008 R2

Steps to reproduce:

  1. ...
  2. ...

Expected behavior:

It should work as it is working with local network devices

Actual behavior:

It works with a local device but not with an external device. It should also log exactly what is going wrong.

Additional info:

Errors I am getting:

2018-01-09T09:19:24Z E! Error in plugin [inputs.snmp]: agent x.x.x.x:161: performing get on field hostname: Request timeout (after 3 retries)
2018-01-09T09:19:34Z E! Error in plugin [inputs.snmp]: agent x.x.x.x:161: gathering table snmp: performing bulk walk for field ifName: Request timeout (after 3 retries)

danielnelson commented 6 years ago

What do you mean by external device, are you referring to another host on the same subnet or to a device on a routed network? There is a known issue where the response must be received from the same target host, I wonder if you are running into this: https://github.com/influxdata/telegraf/issues/3320

The logs seem to show what is going wrong, no response was received and request timeout, so I don't think they can be improved.
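One way a reply can be lost even though it reaches the host, as in the "response must come from the same target host" case mentioned above, is when the client uses a connected UDP socket: the kernel then delivers only datagrams whose source matches the connected peer. The Python sketch below illustrates that socket behaviour on loopback. It is not Telegraf or gosnmp code, the premise that the SNMP client uses a connected socket is an assumption rather than something this thread confirms, and `reply_is_delivered` is a made-up name.

```python
import socket


def reply_is_delivered(reply_from_target: bool, timeout: float = 0.5) -> bool:
    """Send one datagram to a 'target' and report whether a reply is received.

    The client socket is connected to the target, so only datagrams whose
    source address and port match the target should be delivered to it.
    """
    target = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    other = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        target.bind(("127.0.0.1", 0))  # stands in for the SNMP agent
        other.bind(("127.0.0.1", 0))   # stands in for a different source address
        target.settimeout(timeout)
        client.connect(target.getsockname())
        client.settimeout(timeout)

        client.send(b"request")
        _, client_addr = target.recvfrom(1024)

        # Reply either from the address the client sent to, or from another one.
        responder = target if reply_from_target else other
        responder.sendto(b"response", client_addr)

        try:
            client.recv(1024)
            return True
        except socket.timeout:
            return False  # the client only ever sees a timeout
    finally:
        for sock in (target, other, client):
            sock.close()
```

On a typical Linux host, `reply_is_delivered(True)` should return True and `reply_is_delivered(False)` should return False after the timeout, which is the same symptom as the "Request timeout" errors above.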

ashuw018 commented 6 years ago

Hi Danielnelson,

By external device, I mean the device is not on our premises; it resides outside of our network. net-snmp and other tools like snmp-exporter + prometheus work fine with it, but through Telegraf it doesn't work.

I might be running into #3320, not sure, but it does work through Telegraf as well if I try the same things against any local/on-premise device.

Thanks,

danielnelson commented 6 years ago

That is interesting that snmp-exporter is working, since we are both using the same gosnmp library. @phemmer do you have any idea what could cause this?

phemmer commented 6 years ago

Without seeing packet captures, not really. The only immediate thought is the default 5s timeout.

danielnelson commented 6 years ago

@ashuw018 Do you think you could collect a packet dump? There is an example of how to do it on unix systems here https://github.com/soniah/gosnmp#packet-captures. It would be helpful to collect it once with snmpget, and once using telegraf --input-filter snmp --test.

@phemmer Do you think it would be useful if I wrote a clone of snmpget/snmpwalk using the gosnmp library? We could add debugging code as needed and then users could run this when they have a problem.

ashuw018 commented 6 years ago

Hi Daniel and Phemmer,

Thanks for your efforts. Here is another update from my end. I am now more surprised: I have installed Telegraf and net-snmp on one of the external servers, which is on the same internal network as the external SNMP device this issue was created for. Unfortunately it is not working from there either. The issue seems to be SNMPv3-specific, as from there it is working fine with SNMPv2. I don't know what I am doing wrong with SNMPv3, as I get data over SNMPv3 using net-snmp's snmpwalk, snmpget, and snmptable, and with snmp exporter as well, but not with Telegraf.

This might change the direction of our investigation. As of now, for me SNMPv2 is working and SNMPv3 is not.

Thanks,

danielnelson commented 6 years ago

This might be https://github.com/soniah/gosnmp/issues/95

rrasale commented 6 years ago

Can you run a snmpwalk to your destination device from your Telegraf server? Do you see any output?

ashuw018 commented 6 years ago

@rrasale Yes, I do see valid output from snmpwalk over version 3. In fact, using snmp exporter and prometheus it works perfectly over SNMPv3.

glazzari commented 6 years ago

Hi, I'm getting the same error. I can snmpwalk to the production server from my machine, but telegraf is returning request timeout. For example:

$ snmpwalk -mALL -On -v2c -cintcacti <target> 1.3.6.1.4.1.35750.10.2.1.1.1

.1.3.6.1.4.1.35750.10.2.1.1.1.0 = Counter64: 768886

$ sudo tcpdump -s 0 -i eno1 host [target] and port 161

17:35:43.398564 IP [src].51646 > [target].snmp:  C="intcacti" GetNextRequest(33)  E:35750.10.2.1.1.1
17:35:43.425641 IP [target].snmp > [src].51646:  C="intcacti" GetResponse(37)  E:35750.10.2.1.1.1.0=769168
17:35:43.425777 IP [src].51646 > [target].snmp:  C="intcacti" GetNextRequest(34)  E:35750.10.2.1.1.1.0
17:35:43.451014 IP [target].snmp > [src].51646:  C="intcacti" GetResponse(38)  E:35750.10.2.1.1.2.0=240797616

I'm running the official docker image of telegraf, which exposes the following ports:

8125/udp 8092/udp 8094

SNMP plugin is configured as:

[[inputs.snmp]]
  agents = [ "<target>" ]
  version = 2
  community = "cintcacti"
  name = "system"

  [[inputs.snmp.field]]
    name = "total"
    oid = "1.3.6.1.4.1.35750.10.2.1.1.1"

telegraf    | 2018-01-17T19:46:20Z E! Error in plugin [inputs.snmp]: took longer to collect than collection interval (10s)
telegraf    | 2018-01-17T19:46:20Z E! Error in plugin [inputs.snmp]: agent [target]: performing get on field total: Request timeout (after 3 retries)

$ telegraf --input-filter snmp --test

2018/01/17 19:46:34 I! Using config file: /etc/telegraf/telegraf.conf
* Plugin: inputs.snmp, Collection 1
2018-01-17T19:46:44Z E! Error in plugin [inputs.snmp]: agent [target]: performing get on field total: Request timeout (after 3 retries)

$ sudo tcpdump -s 0 -i eno1 host [target] and port 161

17:41:00.012307 IP [src].34905 > [target].snmp:  C="cintcacti" GetRequest(33)  E:35750.10.2.1.1.1
17:41:01.262567 IP [src].34905 > [target].snmp:  C="cintcacti" GetRequest(33)  E:35750.10.2.1.1.1
17:41:02.512643 IP [src].34905 > [target].snmp:  C="cintcacti" GetRequest(33)  E:35750.10.2.1.1.1
17:41:03.762995 IP [src].34905 > [target].snmp:  C="cintcacti" GetRequest(33)  E:35750.10.2.1.1.1

danielnelson commented 6 years ago

@glazzari Can you test with this field:

[[inputs.snmp.field]]
  name = "hostname"
  oid = ".1.3.6.1.2.1.1.5.0"

and save the capture with:

sudo tcpdump -s 0 -i eno1 -w test.pcap host [target] and port 161

Capturing both commands would be helpful:

snmpget -v2c -c public 10.79.40.63:161 .1.3.6.1.2.1.1.5.0
telegraf --input-filter snmp --test

glazzari commented 6 years ago

@danielnelson Same results. I can snmpwalk correctly, but the plugin fails with a request timeout.

$ snmpwalk -On -v2c -cintcacti [target] .1.3.6.1.2.1.1.5.0

.1.3.6.1.2.1.1.5.0 = STRING: [returns sysName]

$ sudo tcpdump -s 0 -i eno1 host [target] and port 161

09:44:05.601227 IP [src].59290 > [target].snmp:  C="intcacti" GetNextRequest(28)  system.sysName.0
09:44:05.629768 IP [target].snmp > [src].59290:  C="intcacti" GetResponse(52)  system.sysLocation.0="[returns sysLocation]"
09:44:05.629909 IP [src].59290 > [target].snmp:  C="intcacti" GetRequest(28)  system.sysName.0
09:44:05.659471 IP [target].snmp > [src].59290:  C="intcacti" GetResponse(35)  system.sysName.0="[returns sysName]"

$ telegraf --input-filter snmp --test

2018/01/18 11:51:47 I! Using config file: /etc/telegraf/telegraf.conf
* Plugin: inputs.snmp, Collection 1
2018-01-18T11:51:57Z E! Error in plugin [inputs.snmp]: agent [target]: performing get on field hostname: Request timeout (after 3 retries)

$ sudo tcpdump -s 0 -i eno1 host [target] and port 161

10:12:30.008255 IP [src].41406 > [target].snmp:  C="cintcacti" GetRequest(28)  system.sysName.0
10:12:31.258367 IP [src].41406 > [target].snmp:  C="cintcacti" GetRequest(28)  system.sysName.0
10:12:32.508635 IP [src].41406 > [target].snmp:  C="cintcacti" GetRequest(28)  system.sysName.0
10:12:33.758892 IP [src].41406 > [target].snmp:  C="cintcacti" GetRequest(28)  system.sysName.0

glazzari commented 6 years ago

Not sure if it's related to #3320.

danielnelson commented 6 years ago

If the response source matches the request target then I don't think it is #3320

I notice that tcpdump reports the community name differently:

- C="intcacti"
+ C="cintcacti"
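A difference like this is easy to miss by eye, so it can help to pull the community strings out of both captures and compare them. A minimal Python sketch, assuming the text output format tcpdump printed above. The sample lines are copied from the captures in this thread, and the helper name `communities` is made up for the example.

```python
import re

# tcpdump prints the SNMP community as C="...".
COMMUNITY_RE = re.compile(r'C="([^"]*)"')


def communities(capture_text: str) -> set:
    """Return the distinct community strings seen in tcpdump text output."""
    return set(COMMUNITY_RE.findall(capture_text))


# Lines copied from the snmpwalk and Telegraf captures earlier in the thread.
snmpwalk_capture = '''
17:35:43.398564 IP [src].51646 > [target].snmp:  C="intcacti" GetNextRequest(33)  E:35750.10.2.1.1.1
17:35:43.425641 IP [target].snmp > [src].51646:  C="intcacti" GetResponse(37)  E:35750.10.2.1.1.1.0=769168
'''
telegraf_capture = '''
17:41:00.012307 IP [src].34905 > [target].snmp:  C="cintcacti" GetRequest(33)  E:35750.10.2.1.1.1
17:41:01.262567 IP [src].34905 > [target].snmp:  C="cintcacti" GetRequest(33)  E:35750.10.2.1.1.1
'''

working = communities(snmpwalk_capture)
failing = communities(telegraf_capture)
if working != failing:
    print("community mismatch:", sorted(working), "vs", sorted(failing))
```

With the lines above this reports `intcacti` for the snmpwalk capture and `cintcacti` for the Telegraf one.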

glazzari commented 6 years ago

Good catch! I've just noticed the community name in telegraf.conf was including the 'c' as part of its name. This is probably a copy/paste error, because I was running snmpwalk with "-cintcacti" instead of "-c intcacti". Note the space after the "-c".

[[inputs.snmp]]
  agents = [ "target:161" ]
  version = 2
  community = "intcacti"
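The reason `-cintcacti` works on the command line is the usual Unix option convention: an option's argument may be attached directly to the flag, so the community is `intcacti` in both spellings, and copying `cintcacti` into a config file adds a character. A small sketch using Python's `getopt` module, which follows the same convention. It only mimics the parsing rule and is not net-snmp's own code.

```python
import getopt

# "c:" declares -c as an option that takes an argument, like snmpwalk's -c.
attached, _ = getopt.getopt(["-cintcacti"], "c:")
separate, _ = getopt.getopt(["-c", "intcacti"], "c:")

# Both spellings parse to the same option/value pair.
print(attached)
print(separate)
```

Both calls yield `[('-c', 'intcacti')]`, so the value to put in `community` is `intcacti`.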

danielnelson commented 6 years ago

Did fixing that help?

glazzari commented 6 years ago

Yes! Thank you very much for your help.

danielnelson commented 6 years ago

@ashuw018 Could you upload your working snmp_exporter configuration for comparison against the Telegraf plugin?

ashuw018 commented 6 years ago

@danielnelson The requested files are attached.

prometheus.txt snmpexporter.txt

danielnelson commented 6 years ago

@ashuw018 Is this the agent you are unable to connect to? I notice the priv_protocol is DES while above in Telegraf it was set to AES.

if_mib:

  version: 3
  auth:
    username: user
    password: password
    #
    auth_protocol: SHA
    security_level: authPriv
    priv_protocol: DES
    priv_password: Passwordpriv
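A comparison like this can be done mechanically once the option names are lined up. A rough Python sketch, assuming only the protocol-related settings are worth comparing (the user names and passwords posted in this thread are placeholders). The key mapping reflects how the two configs above spell their options, and `v3_differences` is a made-up helper name.

```python
# Telegraf option name -> snmp_exporter key, as spelled in the configs above.
KEY_MAP = {
    "auth_protocol": "auth_protocol",
    "sec_level": "security_level",
    "priv_protocol": "priv_protocol",
}

# Values taken from the two configurations posted in this thread.
telegraf = {"auth_protocol": "SHA", "sec_level": "authPriv", "priv_protocol": "AES"}
exporter = {"auth_protocol": "SHA", "security_level": "authPriv", "priv_protocol": "DES"}


def v3_differences(telegraf_cfg: dict, exporter_cfg: dict) -> dict:
    """Return {telegraf_key: (telegraf_value, exporter_value)} for mismatches."""
    diffs = {}
    for t_key, e_key in KEY_MAP.items():
        t_val, e_val = telegraf_cfg.get(t_key), exporter_cfg.get(e_key)
        # Compare case-insensitively so "authPriv" and "authpriv" count as equal.
        if str(t_val).lower() != str(e_val).lower():
            diffs[t_key] = (t_val, e_val)
    return diffs


print(v3_differences(telegraf, exporter))
```

For the values posted here, the only mismatch reported is `priv_protocol`: AES on the Telegraf side and DES on the snmp_exporter side.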

ashuw018 commented 6 years ago

Hi Daniel

That's just a typo from filling in dummy data for posting.

I am using AES in both, and I also tried DES.

danielnelson commented 6 years ago

We also have a report about complex passwords causing problems. Would it be possible to test, temporarily of course, whether it helps to use a weak password with only ASCII letters?

ashuw018 commented 6 years ago

Hi Daniel,

Sorry for the delayed response. We are only going into the testing phase with InfluxDB, so the passwords were already kept simple. No character variations have been used.

danielnelson commented 6 years ago

@phemmer Do you know if it is possible to collect packet captures of version 3? I assume we would need sec_level = NoAuthNoPriv?

phemmer commented 6 years ago

oh, hrm. It's been so long since I've worked with anything using v3 I'm not sure what the protocol looks like. But to be safe yes, NoAuthNoPriv should ensure that you can view all request/response fields.

danielnelson commented 6 years ago

@ashuw018 Would it be possible to use NoAuthNoPriv temporarily? If so, could you capture the packets like this for both Telegraf and snmpget:

sudo tcpdump -s 0 -i [interface] -w test.pcap host [target] and port 161

BTW I looked through the snmp_exporter code for usage differences, but didn't see anything that stood out.

ashuw018 commented 6 years ago

@danielnelson I will reach out to our NetSec support this Monday and check whether they can make such changes. If they allow it, I will definitely do that.

Also, is there any net-snmp compatible command for the above? I do not have any Linux box out there; all are Windows.

Thanks.

danielnelson commented 6 years ago

It should be possible to use this windows version: https://www.winpcap.org/windump/, though I have not tested it. You run the tcpdump command in one shell while in another you run the net-snmp command, then you stop the tcpdump command and it should print that it captured some packets.

ashuw018 commented 6 years ago

@danielnelson I approached our support staff with this request, but they declined to configure NoAuthNoPriv, saying it is against their data center norms. Unfortunately, in my local office I do not have any device that supports SNMPv3, so I cannot test this here.

danielnelson commented 6 years ago

Understandable. I will try to set up a v3 device for testing, but it might take me a while.

cpajr commented 6 years ago

Any update on this issue? I'm encountering the same problem.

danielnelson commented 6 years ago

@cpajr I did do a sanity test using snmp v3 to a net-snmp server and didn't have any trouble. Are you also using Windows and what kind of device are you querying?

cpajr commented 6 years ago

@danielnelson I'm running on CentOS 7, Telegraf v1.5.3. I'm trying to query a Cisco ASA via SNMPv3. Oddly enough, I'm able to poll other Cisco devices via SNMPv3; I'm only encountering issues with the ASA. I can successfully perform an snmpwalk without issue on the ASA.

danielnelson commented 6 years ago

@cpajr Can you create a packet capture doing a snmpget and an equivalent query with Telegraf (with a single top level field)? I think even with v3 security enabled it may be of some use. Here is an example tcpdump:

sudo tcpdump -s 0 -i eth0 -w telegraf.pcap host 203.50.251.17 and port 161

Then upload these files along with your Telegraf snmp configuration (don't forget to remove your passwords or use a testing password).

ashuw018 commented 6 years ago

Hi, just FYI: my device was also a Cisco ASA 5515.

cpajr commented 6 years ago

@danielnelson I'll work to get this. It also appears that we have a commonality on the trouble device: Cisco ASA.

danielnelson commented 6 years ago

I just found this https://github.com/soniah/gosnmp/pull/108, I'll update our gosnmp dependency if you both can test it out.

cpajr commented 6 years ago

Let me know what needs to be updated and I will test it.

danielnelson commented 6 years ago

Here are some builds with the updated gosnmp for testing:

cpajr commented 6 years ago

That did the trick. Thank you.

danielnelson commented 6 years ago

I'll include this change in 1.6.0-rc3