influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.63k stars 5.58k forks source link

SNMP input: error in unmarshalResponse #9917

Open ADovgalyuk opened 3 years ago

ADovgalyuk commented 3 years ago

On the server I have application with snmp interface support. Snmpwalk example:

[root@SOR-01 telegraf.d]# sudo -u telegraf snmpwalk -mALL -v 2c -c public 127.0.0.1:4162 1.3.6.1.4.1.20873
PROTEI-SG-DIAM-MIB::pcsmIndex.1 = INTEGER: 1
PROTEI-SG-DIAM-MIB::pcsmIndex.2 = INTEGER: 2
PROTEI-SG-DIAM-MIB::pcsmIndex.3 = INTEGER: 3
PROTEI-SG-DIAM-MIB::pcsmIndex.4 = INTEGER: 4
PROTEI-SG-DIAM-MIB::pcsmCA.1 = STRING: "Sg.DIAM.PCSM.3"
PROTEI-SG-DIAM-MIB::pcsmCA.2 = STRING: "Sg.DIAM.PCSM.2"
PROTEI-SG-DIAM-MIB::pcsmCA.3 = STRING: "Sg.DIAM.PCSM.1"
PROTEI-SG-DIAM-MIB::pcsmCA.4 = STRING: "Sg.DIAM.PCSM.0"
PROTEI-SG-DIAM-MIB::pcsmOSTATE.1 = STRING: "ACTIVATE"
PROTEI-SG-DIAM-MIB::pcsmOSTATE.2 = STRING: "ACTIVATE"
PROTEI-SG-DIAM-MIB::pcsmOSTATE.3 = STRING: "ACTIVATE"
PROTEI-SG-DIAM-MIB::pcsmOSTATE.4 = STRING: "ACTIVATE"
PROTEI-AD-OMI-MIB::ad-omi-acclIndex.1 = INTEGER: 1
PROTEI-AD-OMI-MIB::ad-omi-acclIndex.2 = INTEGER: 2
PROTEI-AD-OMI-MIB::ad-omi-acclIndex.3 = INTEGER: 3
PROTEI-AD-OMI-MIB::ad-omi-acclCA.1 = STRING: "Ad.OMI.ACCL.2"
PROTEI-AD-OMI-MIB::ad-omi-acclCA.2 = STRING: "Ad.OMI.ACCL.3"
PROTEI-AD-OMI-MIB::ad-omi-acclCA.3 = STRING: "Ad.OMI.ACCL.1"
PROTEI-AD-OMI-MIB::ad-omi-acclOSTATE.1 = STRING: "ACTIVATE"
PROTEI-AD-OMI-MIB::ad-omi-acclOSTATE.2 = STRING: "ACTIVATE"
PROTEI-AD-OMI-MIB::ad-omi-acclOSTATE.3 = STRING: "ACTIVATE"

And the following conf file:

[[inputs.snmp]]
  agents = [ "127.0.0.1:4162" ]
  version = 2
  community = "public"
  tagexclude = ["agent_host"]

[inputs.snmp.tags]
     app="Protei_RG_LTE"
     Metrics_type="app"

[[inputs.snmp.table]]
    name = "OMI_ACCL"
    inherit_tags = [ "app","Metrics_type" ]

        [[inputs.snmp.table.field]]
        name = "CA"
        is_tag = true
        oid = "PROTEI-AD-OMI-MIB::ad-omi-acclCA"

        [[inputs.snmp.table.field]]
        name = "state"
        oid = "PROTEI-AD-OMI-MIB::ad-omi-acclOSTATE"

[[inputs.snmp.table]]
    name = "OMI_ASCL"
    inherit_tags = [ "app","Metrics_type" ]

        [[inputs.snmp.table.field]]
        name = "CA"
        is_tag = true
        oid = "PROTEI-AD-OMI-MIB::ad-omi-asclCA"

        [[inputs.snmp.table.field]]
        name = "state"
        oid = "PROTEI-AD-OMI-MIB::ad-omi-asclOSTATE"

[[processors.enum]]
  [[processors.enum.mapping]]
    field = "state"
#    default = 0

    ## Table of mappings
    [processors.enum.mapping.value_mappings]
      ACTIVATE = 1
      FAIL = 0

At least on version 1.14.5 - everything works as expected:

[root@SOR-01 telegraf.d]# sudo -u telegraf telegraf --test --config Test.conf --debug
2021-10-13T10:12:12Z I! Starting Telegraf 1.14.5
2021-10-13T10:12:12Z D! [agent] Initializing plugins
2021-10-13T10:12:12Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "PROTEI-AD-OMI-MIB::ad-omi-acclCA"
2021-10-13T10:12:12Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "PROTEI-AD-OMI-MIB::ad-omi-acclOSTATE"
2021-10-13T10:12:12Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "PROTEI-AD-OMI-MIB::ad-omi-asclCA"
2021-10-13T10:12:12Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "PROTEI-AD-OMI-MIB::ad-omi-asclOSTATE"
> OMI_ACCL,CA=Ad.OMI.ACCL.3,Metrics_type=app,app=Protei_RG_LTE,host=SOR-01 state="ACTIVATE" 1634119933000000000
> OMI_ACCL,CA=Ad.OMI.ACCL.1,Metrics_type=app,app=Protei_RG_LTE,host=SOR-01 state="ACTIVATE" 1634119933000000000
> OMI_ACCL,CA=Ad.OMI.ACCL.2,Metrics_type=app,app=Protei_RG_LTE,host=SOR-01 state="ACTIVATE" 1634119933000000000

But on version 1.20 and the same config we've faced an error:

[root@SOR-02 telegraf.d]# sudo -u telegraf telegraf --test --config RG_LTE_omi.conf --debug
ERRO[0000]log.go:120 gosnowflake.(*defaultLogger).Errorf failed to create cache directory. /etc/telegraf/.cache/snowflake, err: mkdir /etc/telegraf/.cache: permission denied. ignored 
ERRO[0000]log.go:120 gosnowflake.(*defaultLogger).Errorf failed to open. Ignored. open /etc/telegraf/.cache/snowflake/ocsp_response_cache.json: no such file or directory 
2021-10-13T10:23:48Z I! Starting Telegraf 1.20.0
2021-10-13T10:23:48Z D! [agent] Initializing plugins
2021-10-13T10:23:48Z D! [agent] Starting service inputs
2021-10-13T10:23:48Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "PROTEI-AD-OMI-MIB::ad-omi-acclCA"
2021-10-13T10:23:48Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "PROTEI-AD-OMI-MIB::ad-omi-acclOSTATE"
2021-10-13T10:23:48Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "PROTEI-AD-OMI-MIB::ad-omi-asclCA"
2021-10-13T10:23:49Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "PROTEI-AD-OMI-MIB::ad-omi-asclOSTATE"
2021-10-13T10:23:49Z E! [inputs.snmp] Error in plugin: agent 127.0.0.1:4162: gathering table OMI_ACCL: performing bulk walk for field CA: error in unmarshalResponse: error parsing OID Value: invalid OID length
2021-10-13T10:23:49Z E! [inputs.snmp] Error in plugin: agent 127.0.0.1:4162: gathering table OMI_ASCL: performing bulk walk for field CA: error in unmarshalResponse: error parsing OID Value: invalid OID length
2021-10-13T10:23:49Z D! [agent] Stopping service inputs
2021-10-13T10:23:49Z D! [agent] Input channel closed
2021-10-13T10:23:49Z D! [agent] Processor channel closed
2021-10-13T10:23:49Z D! [agent] Stopped Successfully
2021-10-13T10:23:49Z E! [telegraf] Error running agent: input plugins recorded 2 errors

I've checked on 1.19.3 and it works there as well. So I can only take a guess that it has something to do with new GO version implemented in 1.20

reimda commented 3 years ago

We don't have a way to reproduce this error because we don't have your snmp device, but I can share a few ideas about what might be related.

It's possible that building with a new go version could cause this change but I think it's very unlikely. I would expect a compiler problem to affect more than just snmp walk.

In the error log line, everything after performing bulk walk for field CA: comes from the gosnmp module That leads me to think that this could be the result of a change in the gosnmp module.

I understand from the issue description that Telegraf 1.14.5 and 1.19.3 both work and 1.20.0 doesn't work. Telegraf uses the same version of the gosnmp library in 1.19.3 and 1.20.0, so we can rule out a change in the gosnmp module:

$ git show v1.20.0:go.mod |grep snmp
    github.com/gosnmp/gosnmp v1.32.0
$ git show v1.19.3:go.mod |grep snmp
    github.com/gosnmp/gosnmp v1.32.0

I noticed that in your tests you used two different telegraf.conf files. The first is telegraf 1.14.5 with Test.conf and the second is 1.20.0 with RG_LTE_omi.conf. I wonder if the difference in behavior is related to a difference in configuration. It's hard to learn much from a test when two variables are changing.

Could you repeat your test with versions 1.20.0 and 1.19.3 but using the same config file. Please also post the config you use so we know for sure what telegraf is being asked to do.

ADovgalyuk commented 3 years ago

Config was the same, I was trying to remove some company specific info for more simple example, but missed second case, sorry. Good news, I was able to solve the problem on my proprietary application by tweaking option related to CyclicTreeWalk.

But if we still want to investigate a problem. Seems I was wrong about 19.3 being ok with that, maybe a miss during multiple versions changing. I've return all params as it was before and checked all major versions starting from 1.14.5 to 1.20, this case was broken somewhere between 1.18 and 1.19. Nothing was done between checks beside installing new version.

[root@SOR-01 telegraf.d]# sudo -u telegraf telegraf --test --config RG_LTE_omi.conf --debug
2021-10-15T09:52:34Z I! Starting Telegraf 1.18.0
2021-10-15T09:52:34Z D! [agent] Initializing plugins
2021-10-15T09:52:34Z D! [agent] Starting service inputs
2021-10-15T09:52:34Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "PROTEI-AD-OMI-MIB::ad-omi-acclCA"
2021-10-15T09:52:34Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "PROTEI-AD-OMI-MIB::ad-omi-acclOSTATE"
2021-10-15T09:52:34Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "PROTEI-AD-OMI-MIB::ad-omi-asclCA"
2021-10-15T09:52:34Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "PROTEI-AD-OMI-MIB::ad-omi-asclOSTATE"
> OMI_ACCL,CA=Ad.OMI.ACCL.3,Metrics_type=app,app=Protei_RG_LTE,host=SOR-01 state=1i 1634291555000000000
> OMI_ACCL,CA=Ad.OMI.ACCL.2,Metrics_type=app,app=Protei_RG_LTE,host=SOR-01 state=1i 1634291555000000000
> OMI_ACCL,CA=Ad.OMI.ACCL.1,Metrics_type=app,app=Protei_RG_LTE,host=SOR-01 state=1i 1634291555000000000
2021-10-15T09:52:34Z D! [agent] Stopping service inputs
2021-10-15T09:52:34Z D! [agent] Input channel closed
2021-10-15T09:52:34Z D! [agent] Processor channel closed
2021-10-15T09:52:34Z D! [agent] Stopped Successfully

And 1.19

[root@SOR-01 telegraf.d]# sudo -u telegraf telegraf --test --config RG_LTE_omi.conf --debug
ERRO[0000]log.go:120 gosnowflake.(*defaultLogger).Errorf failed to create cache directory. /etc/telegraf/.cache/snowflake, err: mkdir /etc/telegraf/.cache: permission denied. ignored 
ERRO[0000]log.go:120 gosnowflake.(*defaultLogger).Errorf failed to open. Ignored. open /etc/telegraf/.cache/snowflake/ocsp_response_cache.json: no such file or directory 
2021-10-15T09:52:52Z I! Starting Telegraf 1.19.3
2021-10-15T09:52:52Z D! [agent] Initializing plugins
2021-10-15T09:52:52Z D! [agent] Starting service inputs
2021-10-15T09:52:52Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "PROTEI-AD-OMI-MIB::ad-omi-acclCA"
2021-10-15T09:52:52Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "PROTEI-AD-OMI-MIB::ad-omi-acclOSTATE"
2021-10-15T09:52:52Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "PROTEI-AD-OMI-MIB::ad-omi-asclCA"
2021-10-15T09:52:52Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "PROTEI-AD-OMI-MIB::ad-omi-asclOSTATE"
2021-10-15T09:52:52Z E! [inputs.snmp] Error in plugin: agent 172.26.17.12:4162: gathering table OMI_ACCL: performing bulk walk for field CA: error in unmarshalResponse: error parsing OID Value: invalid OID length
2021-10-15T09:52:53Z E! [inputs.snmp] Error in plugin: agent 172.26.17.12:4162: gathering table OMI_ASCL: performing bulk walk for field CA: error in unmarshalResponse: error parsing OID Value: invalid OID length
2021-10-15T09:52:53Z D! [agent] Stopping service inputs
2021-10-15T09:52:53Z D! [agent] Input channel closed
2021-10-15T09:52:53Z D! [agent] Processor channel closed
2021-10-15T09:52:53Z D! [agent] Stopped Successfully
2021-10-15T09:52:53Z E! [telegraf] Error running agent: input plugins recorded 2 errors

Is there any additional info I can provide?

Hipska commented 3 years ago

@reimda gosnmp v1.33.0 has some fixed regarding parsing lengths..

reimda commented 2 years ago

I updated gosnmp to 1.33.0 in https://github.com/influxdata/telegraf/pull/10076. When the CI build completes for that PR, could you test that change and see whether the error that started happening in telegraf 1.20 still happens? Thanks!

reimda commented 2 years ago

The build is here: https://github.com/influxdata/telegraf/pull/10076#issuecomment-963626781. Click the dropdown triangle and choose your platform.

Hipska commented 2 years ago

Resolved by #10076 and included in telegraf 1.20, pleas reopen/comment if this did not solve the current issue.

ADovgalyuk commented 2 years ago

Hello @Hipska, @reimda Sorry for no testing previously, Is it too late to reopen this issue? It's still represented in latest version. Was mentioned again in this issue, but no answer. snmpwalk:

[root@XXX1 telegraf.d]# snmpwalk -v2c -Cc -cpublic 127.0.0.1:3161 .
DIAM-MIB::pcsmIndex.1 = INTEGER: 1
DIAM-MIB::pcsmCA.1 = STRING: "Sg.DIAM.PCSM.2"
DIAM-MIB::pcsmOSTATE.1 = STRING: "ACTIVATE"
End of MIB
[root@XXX1 telegraf.d]# snmpwalk -m ./ -v2c -Cc -cpublic 127.0.0.1:3161 .
Failed to parse MIB file ./
iso.3.6.1.4.1.20873.2.30.1.1.1.1 = INTEGER: 1
iso.3.6.1.4.1.20873.2.30.1.1.100.1 = STRING: "Sg.DIAM.PCSM.2"
iso.3.6.1.4.1.20873.2.30.1.1.4096.1 = STRING: "ACTIVATE"

Config:

[[inputs.snmp]]
    agents = [ "127.0.0.1:3161" ]
    version = 2
    community = "public"
    tagexclude = ["agent_host"]

   [[inputs.snmp.table]]
        name = "Diameter"
        inherit_tags = [ "app","Metrics_type" ]

        [[inputs.snmp.table.field]]
        name = "CA"
        is_tag = true
        oid = "DIAM-MIB::pcsmCA"

        [[inputs.snmp.table.field]]
        name = "State"
        oid = "DIAM-MIB::pcsmOSTATE"

Testing on 1.18:

[root@MME1 telegraf.d]# telegraf --test --config diam.conf --debug
2022-03-18T08:31:57Z I! Starting Telegraf 1.18.1
2022-03-18T08:31:57Z D! [agent] Initializing plugins
2022-03-18T08:31:57Z D! [agent] Starting service inputs
2022-03-18T08:31:57Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "DIAM-MIB::pcsmCA"
2022-03-18T08:31:57Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "DIAM-MIB::pcsmOSTATE"
2022-03-18T08:31:57Z D! [agent] Stopping service inputs
2022-03-18T08:31:57Z D! [agent] Input channel closed
2022-03-18T08:31:57Z D! [agent] Processor channel closed
2022-03-18T08:31:57Z D! [agent] Stopped Successfully
> Diameter,CA=Sg.DIAM.PCSM.2,host=MME1 State=1i 1647592317000000000

Testing on 21.4:

[root@MME1 telegraf.d]# telegraf --test --config MME_diam.conf 
2022-03-18T08:35:24Z I! Starting Telegraf 1.21.4
2022-03-18T08:35:24Z I! Loaded inputs: snmp
2022-03-18T08:35:24Z I! Loaded aggregators: 
2022-03-18T08:35:24Z I! Loaded processors: 
2022-03-18T08:35:24Z W! Outputs are not used in testing mode!
2022-03-18T08:35:24Z I! Tags enabled: host=MME1
2022-03-18T08:35:24Z E! [inputs.snmp] Error in plugin: agent 127.0.0.1:3161: gathering table Diameter: performing bulk walk for field CA: error in unmarshalResponse: error parsing OID Value: invalid OID length
2022-03-18T08:35:24Z E! [telegraf] Error running agent: input plugins recorded 1 errors
Hipska commented 2 years ago

Okay, will reopen, please report this back faster next time.

The issue you linked is this exact same issue. So did you try to link another issue? Please let us know which one so we can have a look as well..

ADovgalyuk commented 2 years ago

Thanks a lot. It was mentioned here: https://github.com/influxdata/telegraf/issues/10654

ADovgalyuk commented 1 year ago

Hi Team, Any news or can't be reproduced and no ideas then? Still relevant on latest telegraf versions.