influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

New snmp plugin a bit slow #1665

Closed StianOvrevage closed 6 years ago

StianOvrevage commented 8 years ago

I have a few problems with performance of the new SNMP plugin.

When doing an snmpwalk of EtherLike-MIB::dot3StatsTable, IF-MIB::ifXTable, and IF-MIB::ifTable on a Cisco router, the walks complete in ~2, ~3, and ~3.3 seconds respectively (~8.3 seconds combined, +/- 10%).

When polling with the snmp plugin it takes 17-19 seconds for a single run.

I'm unsure whether the snmp plugin polls every host in parallel or in sequence. I only have one host to test against, and even when I put each of the three tables in separate [[inputs.snmp]] sections they are polled sequentially, not in parallel.

We need to poll hundreds of devices with hundreds of interfaces every 5 or 10 seconds (which collectd and libsnmp handle easily).

phemmer commented 8 years ago

How many records do you have in those tables?

It is true that the plugin doesn't do multiple agents in parallel, but the old one didn't either. Did the old version of the plugin perform faster? Or did you not use it? Doing multiple agents in parallel would be rather easy to implement. It also might be possible to do some parallelization for multiple fields/tables within a host (without making the code stupidly complex), but this will be a little challenging due to limitations with the snmp library the plugin uses. But that said, I'd be interested in increasing performance in serial runs before trying to parallelize. Parallelization would just hide the underlying issue without fixing it.

StianOvrevage commented 8 years ago

Not many. 6 interfaces only. I never tried the old plugin since I saw there was a new one around the corner.

I agree that increasing serial performance is important to be able to query a single host/table fast enough. But at some point I think parallelizing will become necessary to query enough hosts within the allotted interval. Of course a workaround would be to split up the config and run dozens of telegraf instances simultaneously.

phemmer commented 8 years ago

Hrm, is this a WAN link then? I'm just trying to figure out why it would be slow; even 2-3 seconds for snmpwalk is slow. I was assuming it was due to massive amounts of data.

Oh, I'm not saying we shouldn't do parallelization, just that fixing the serial performance should be prioritized.

StianOvrevage commented 8 years ago

Agreed.

Yes, this is over a WAN link so that is why even snmpwalk is rather slow.

phemmer commented 8 years ago

Thanks, I'll look into simulating a high latency link and getting the performance on par with the net-snmp tools.

StianOvrevage commented 8 years ago

Great. I will hopefully have access to the low-latency environment where we will be using it next week and give you some performance numbers from there as soon as I can.

phemmer commented 8 years ago

I've done some experimentation, and while I'm not sure how snmpwalk is faster than this plugin, I do have a few ideas which might speed things up for you. Try setting these parameters:

max_repetitions = 10
timeout = "10s"
retries = 1

These settings should work better than the defaults on a high latency link. You might also be able to tweak them some more to get even better performance for your specific link. And changing the timeout does have a performance impact, as a retry is sent every $timeout / ( $retries + 1 ).
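
As a rough sanity check of whether the link or the plugin is the bottleneck, the same walk can be timed with the underlying gosnmp library directly and compared against snmpwalk. This is only an illustrative sketch, not plugin code: the agent address, community string, and OID are placeholders, and the settings mirror the values suggested above.

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/soniah/gosnmp"
)

func main() {
	gs := &gosnmp.GoSNMP{
		Target:         "192.0.2.1", // placeholder agent
		Port:           161,
		Community:      "public", // placeholder community
		Version:        gosnmp.Version2c,
		Timeout:        10 * time.Second, // mirrors timeout = "10s"
		Retries:        1,                // mirrors retries = 1
		MaxRepetitions: 10,               // mirrors max_repetitions = 10
	}
	if err := gs.Connect(); err != nil {
		log.Fatal(err)
	}
	defer gs.Conn.Close()

	start := time.Now()
	count := 0
	// Walk IF-MIB::ifXTable and just count the returned values.
	err := gs.BulkWalk(".1.3.6.1.2.1.31.1.1", func(pdu gosnmp.SnmpPDU) error {
		count++
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("walked %d values in %s\n", count, time.Since(start))
}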

However I do have some code change ideas to speed things up which I'm trying out right now.

@jwilder I wouldn't consider this a bug. Everything works as it's supposed to. This is just a request to make it faster. Nor is more info needed. Thanks :-)

Will-Beninger commented 7 years ago

My case will of course be atypical compared to most, but I'm polling roughly ~600 clients at a time and pulling maybe 3-4 tables and a few odd OIDs. The plugin is simply too slow to accomplish this. I've had to fall back to a suite of BASH scripts making forked snmpget/snmptable calls to make up the difference.

Just for a comparison between the two, I'm using BASH to call snmptable on 2 tables with roughly 8 columns each as well as pulling down 7 OIDs using snmpget for 10 hosts. It's pulled together into InfluxDB line protocol and echoed back. Unfortunately I can't release the data being pulled but I could potentially release the code being called if interested.

# /usr/bin/time telegraf -input-filter exec -test
<redacted>
1.60user 0.17system 0:00.62elapsed 285%CPU (0avgtext+0avgdata 14924maxresident)k
0inputs+0outputs (0major+253427minor)pagefaults 0swaps

Using the snmp plugin to do exactly the same with my config:

# /usr/bin/time telegraf -input-filter snmp -test > /dev/null
<redacted>
28.02user 0.31system 0:28.42elapsed 99%CPU (0avgtext+0avgdata 19604maxresident)k
0inputs+0outputs (0major+5607minor)pagefaults 0swaps

When I look through the plugin code, I see it uses an SNMP library for some calls, but the much faster C-based utilities on Linux are used as well. If the goal was to limit dependencies, it didn't work. Not to mention, the SNMP library project seems to be relatively in its infancy and probably not well suited for production collection.

A lot of the slowdowns in the code are caused by executing all operations serially. Why are channels/parallel functions not being used?

Will-Beninger commented 7 years ago

I was able to cut the run time to roughly a third simply by parallelizing the per-agent loop in the Gather() function.

# /root/go/bin/telegraf -config /root/go/bin/telegraf.snmp -test
* Plugin: inputs.snmp, Collection 1
<redacted>
31.80user 0.12system 0:09.51elapsed 335%CPU (0avgtext+0avgdata 25592maxresident)k
0inputs+0outputs (0major+26600minor)pagefaults 0swaps

Code that I changed:

# git diff master snmpTest
diff --git a/plugins/inputs/snmp/snmp.go b/plugins/inputs/snmp/snmp.go
index cc750e7..3cac1fa 100644
--- a/plugins/inputs/snmp/snmp.go
+++ b/plugins/inputs/snmp/snmp.go
@@ -9,6 +9,7 @@ import (
        "strconv"
        "strings"
        "time"
+       "sync"

        "github.com/influxdata/telegraf"
        "github.com/influxdata/telegraf/internal"
@@ -372,6 +373,33 @@ func (s *Snmp) Description() string {
        return description
 }

+func(s *Snmp) cleanGather(acc telegraf.Accumulator, agent string, wg *sync.WaitGroup) error {
+               defer wg.Done()
+      gs, err := s.getConnection(agent)
+      if err != nil {
+         acc.AddError(Errorf(err, "agent %s", agent))
+         return nil
+      }
+
+      // First is the top-level fields. We treat the fields as table prefixes with an empty index.
+      t := Table{
+         Name:   s.Name,
+         Fields: s.Fields,
+      }
+      topTags := map[string]string{}
+      if err := s.gatherTable(acc, gs, t, topTags, false); err != nil {
+         acc.AddError(Errorf(err, "agent %s", agent))
+      }
+
+      // Now is the real tables.
+      for _, t := range s.Tables {
+         if err := s.gatherTable(acc, gs, t, topTags, true); err != nil {
+            acc.AddError(Errorf(err, "agent %s", agent))
+         }
+      }
+       return nil
+}
+
 // Gather retrieves all the configured fields and tables.
 // Any error encountered does not halt the process. The errors are accumulated
 // and returned at the end.
@@ -380,30 +408,12 @@ func (s *Snmp) Gather(acc telegraf.Accumulator) error {
                return err
        }

+       var wg sync.WaitGroup
        for _, agent := range s.Agents {
-               gs, err := s.getConnection(agent)
-               if err != nil {
-                       acc.AddError(Errorf(err, "agent %s", agent))
-                       continue
-               }
-
-               // First is the top-level fields. We treat the fields as table prefixes with an empty index.
-               t := Table{
-                       Name:   s.Name,
-                       Fields: s.Fields,
-               }
-               topTags := map[string]string{}
-               if err := s.gatherTable(acc, gs, t, topTags, false); err != nil {
-                       acc.AddError(Errorf(err, "agent %s", agent))
-               }
-
-               // Now is the real tables.
-               for _, t := range s.Tables {
-                       if err := s.gatherTable(acc, gs, t, topTags, true); err != nil {
-                               acc.AddError(Errorf(err, "agent %s", agent))
-                       }
-               }
+               wg.Add(1)
+               go s.cleanGather(acc,agent,&wg)
        }
+       wg.Wait()

        return nil
 }

phemmer commented 7 years ago

A lot of the slowdowns in the code are caused by executing all operations serially. Why are channels/parallel functions not being used?

Because the underlying gosnmp library does not support it. We would have to spawn dozens of copies of it to achieve parallelism, and doing so in a controllable manner is difficult. We'd basically have to create a pool. I've attempted to make the gosnmp library able to handle parallel requests, but design issues in the library have made this very difficult.

Will-Beninger commented 7 years ago

@phemmer Seems we're both looking at this. See my obviously quick-and-dirty test code above. We don't necessarily need to parallelize the gosnmp library itself, just the calls that currently happen serially and are waited on.

phemmer commented 7 years ago

Your code will cause problems because you are reusing the same gosnmp object. It is not parallel-safe; doing so will result in receive errors.

Will-Beninger commented 7 years ago

Posting the full code this time instead of the diffs... but no, I'm instantiating a separate gosnmp object in each parallel call:

func(s *Snmp) cleanGather(acc telegraf.Accumulator, agent string, wg *sync.WaitGroup) error {
      defer wg.Done()
      gs, err := s.getConnection(agent)
      if err != nil {
         acc.AddError(Errorf(err, "agent %s", agent))
         return nil
      }

      // First is the top-level fields. We treat the fields as table prefixes with an empty index.
      t := Table{
         Name:   s.Name,
         Fields: s.Fields,
      }
      topTags := map[string]string{}
      if err := s.gatherTable(acc, gs, t, topTags, false); err != nil {
         acc.AddError(Errorf(err, "agent %s", agent))
      }

      // Now is the real tables.
      for _, t := range s.Tables {
         if err := s.gatherTable(acc, gs, t, topTags, true); err != nil {
            acc.AddError(Errorf(err, "agent %s", agent))
         }
      }
   return nil
}

// Gather retrieves all the configured fields and tables.
// Any error encountered does not halt the process. The errors are accumulated
// and returned at the end.
func (s *Snmp) Gather(acc telegraf.Accumulator) error {
   if err := s.init(); err != nil {
      return err
   }

   var wg sync.WaitGroup
   for _, agent := range s.Agents {
      wg.Add(1)
      go s.cleanGather(acc,agent,&wg)
   }
   wg.Wait()

   return nil
}

phemmer commented 7 years ago

Yes, that should in theory not cause any problems. But it is not how I would recommend addressing the issue. Much better results can be obtained by sending multiple simultaneous requests per-agent. For people requesting a large number of OIDs from one agent, your change won't help. The only way to send parallel requests per agent is to either create multiple gosnmp objects, or fix the gosnmp library so it's parallel safe. The latter is a much better solution as it scales far better than a pool.
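
For illustration, here is a rough sketch (not plugin code) of the "multiple gosnmp objects" approach for a single agent: the OID list is split into batches and each batch is fetched over its own connection, since one gosnmp object cannot safely carry concurrent requests. The agent address, community string, pool size, and example OIDs are placeholder assumptions.

package main

import (
	"fmt"
	"sync"
	"time"

	"github.com/soniah/gosnmp"
)

// pollAgent splits oids into poolSize batches and fetches each batch over its
// own gosnmp connection, so requests to the same agent can overlap in flight.
func pollAgent(target string, oids []string, poolSize int) {
	var wg sync.WaitGroup
	batch := (len(oids) + poolSize - 1) / poolSize // ceil(len/poolSize)
	if batch == 0 {
		batch = 1
	}

	for i := 0; i < len(oids); i += batch {
		end := i + batch
		if end > len(oids) {
			end = len(oids)
		}
		wg.Add(1)
		go func(chunk []string) {
			defer wg.Done()
			// One connection per goroutine; sharing a single object across
			// goroutines interleaves request IDs and causes receive errors.
			gs := &gosnmp.GoSNMP{
				Target:    target,
				Port:      161,
				Community: "public", // placeholder
				Version:   gosnmp.Version2c,
				Timeout:   5 * time.Second,
				Retries:   1,
			}
			if err := gs.Connect(); err != nil {
				fmt.Println("connect:", err)
				return
			}
			defer gs.Conn.Close()
			// gosnmp limits how many OIDs fit in one Get, so batches must
			// stay below gosnmp.MaxOids.
			pkt, err := gs.Get(chunk)
			if err != nil {
				fmt.Println("get:", err)
				return
			}
			for _, v := range pkt.Variables {
				fmt.Println(v.Name, v.Value)
			}
		}(oids[i:end])
	}
	wg.Wait()
}

func main() {
	// sysDescr and sysName as stand-in OIDs; real use would pass the
	// plugin's resolved field OIDs.
	pollAgent("192.0.2.1", []string{".1.3.6.1.2.1.1.1.0", ".1.3.6.1.2.1.1.5.0"}, 2)
}

The trade-off described above still applies: each extra connection costs a socket and buffers, which is why fixing the library to multiplex requests over one connection scales better than a pool.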

Will-Beninger commented 7 years ago

Agreed, but I'm on the scale of using ~600 agents so parallelizing this makes a huge difference.

One question: why are you using the gosnmp library at all? The code already makes calls to the net-snmp-utils programs for snmptranslate/snmptable/etc., so why not just use them throughout? Making parallel calls to these programs would be parallel-safe.

The only reason I can think of to stay with the gosnmp library would be to reduce dependencies; however, the dependencies are already implicit in using the aforementioned programs.

phemmer commented 7 years ago

The code already makes calls to the net-snmp-utils programs for snmptranslate/snmptable/etc., so why not just use them throughout?

These utilities are optional. They add additional functionality to the plugin, but the plugin does not require them. They are basically just used for parsing MIB files. But yes, the ultimate reason is so that telegraf can be used without having to install external dependencies.

toni-moreno commented 7 years ago

Hi @StianOvrevage, we are working on an SNMP collector tool for InfluxDB that behaves well with lots of metrics.

It's different from Telegraf because it is focused only on SNMP devices, and it also has a web UI which helps configure it in an easy way.

Perhaps you would like to test its performance.

https://github.com/toni-moreno/snmpcollector

Thank you, and sorry for the spam.

StianOvrevage commented 7 years ago

@toni-moreno Awesome! I will have a look at it when I have time. I would love to give you some feedback and performance numbers from real-world testing at a few different setups I have available.

sparrc commented 7 years ago

@phemmer we have a customer who is polling hundreds of SNMP agents. As @Will-Beninger noted, parallelizing just the agent connections would make a huge difference in performance.

Currently they work around this by simply creating a different [[inputs.snmp]] instance for each agent, but it's a bit unwieldy because the config file is hundreds of thousands of lines long, and it has also created problems because each instance independently tries to translate/look up OIDs.

If you think that @Will-Beninger's change could help with multiple agents without causing parallel access issues, I'd like to get it into a PR.

@Will-Beninger do you think you could submit a PR with the diff that you've copied above?

phemmer commented 7 years ago

If you think that @Will-Beninger's change could help with multiple agents without causing parallel access issues, I'd like to get it into a PR.

If we want to put that in as a temporary solution until the underlying gosnmp issues are fixed, I would be ok with that. Just as long as we rip it out once gosnmp is parallelized.

~Just note that it is kinda dangerous in large scale. Because each instance of the snmp object consumes memory, launching many hundreds of them could result in telegraf exhausting the box's memory. It could also result in running out of file descriptors as is the issue in #2104 since each snmp object instance would consume a file descriptor.~

Edit: Nevermind, brain not working right since I just woke up. I don't think it would result in any more resource utilization. I'd have to look at it in detail, but the resources would still be allocated, this just causes them to be used at the same time instead of serially.

Will-Beninger commented 7 years ago

@sparrc unfortunately I can't open a PR due to some of the IP issues surrounding signing the CLA. Would appreciate it if you / @phemmer could open one on my behalf.

We're trying to get internal approval from company legal to release some of our bug fixes/plugins in Telegraf but it's still a fairly long way off. Thanks!

jasonkeller commented 7 years ago

I'm likely to encounter this same issue in my upcoming looking glass project as well. I'm going to be querying every device (1800 of them) roughly every 150s.

So +1 or +1800 from me for this 👍

tardoe commented 7 years ago

Same here. I'm seeing very weird spikes on SNMP counters once a certain number of devices is added. If I increase the polling interval, the spikes stop occurring. I suspect that run N+1 is starting before run N has completed, resulting in high or low counter deltas.

willemdh commented 7 years ago

Same issue here; we are only polling 8 switches every 10 seconds. Every switch has its own config file in /etc/telegraf/telegraf.d with a separate snmp input. The load / queue length on this server is already more than 4, and we need to monitor 400 more switches. Seems rather unusable at this time for large setups.

willemdh commented 7 years ago

The server running Telegraf is monitored by Nagios, which is showing me a very weird difference in load:

[image: Nagios load graph]

As said before, this server has 6 CPUs. How is it possible that Telegraf, which is only monitoring 10 switches now, sometimes causes a small load (1-2) and sometimes a very high load (6+)?

[image: load graph]

/etc/telegraf/telegraf.conf

[agent]
  interval = "1s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "1s"
  flush_interval = "10s"
  flush_jitter = "0s"
  debug = false
  quiet = false
  hostname = ""
  omit_hostname = false

[[outputs.influxdb]]
  urls = ["https://influxpr:8086"]
  database = "db_net_02"
  retention_policy = "rp_net_02"
  write_consistency = "any"
  timeout = "5s"
  username = "writer"
  password = "password"
  user_agent = "telegraf"
  insecure_skip_verify = false
  namepass = ["system*","interfaces*"]

An example of 1 switch snmp config: /etc/telegraf/telegraf.d/switch01.conf

[[inputs.ping]]
  urls = ["switch01"]
  interval = "5s"
  count = 1
  timeout = 1.0

[[inputs.snmp]]
  agents = ["switch01"]
  interval = "10s"
  name = "system"
  version = 3
  sec_name = "bla"
  sec_level = "authPriv"
  auth_protocol = "sha"
  auth_password = "zfezefzfe"
  priv_protocol = "des"
  priv_password = "efzefezfzef"

  [[inputs.snmp.field]]
    name = "hostname"
    oid = "RFC1213-MIB::sysName.0"
    is_tag = true

  [[inputs.snmp.field]]
    name = "cpu"
    oid = "1.3.6.1.4.1.6486.801.1.2.1.16.1.1.1.1.1.15.0"

  [[inputs.snmp.table]]
    name = "interfaces" 
    inherit_tags = [ "hostname" ]

    [[inputs.snmp.table.field]]
      name = "ifIndex"
      oid = "IF-MIB::ifIndex"
      is_tag = true
    [[inputs.snmp.table.field]]
      name = "ifAlias"
      oid = "IF-MIB::ifAlias"
      is_tag = true
    [[inputs.snmp.table.field]]
      name = "ifDescr"
      oid = "IF-MIB::ifDescr"
      is_tag = true
    [[inputs.snmp.table.field]]
      name = "ifName"
      oid = "IF-MIB::ifName"
      is_tag = true

    [[inputs.snmp.table.field]]
      name = "ifHCInOctets"
      oid = "IF-MIB::ifHCInOctets"
    [[inputs.snmp.table.field]]
      name = "ifHCOutOctets"
      oid = "IF-MIB::ifHCOutOctets"

    [[inputs.snmp.table.field]]
      name = "ifHCInBroadcastPkts"
      oid = "IF-MIB::ifHCInBroadcastPkts"
    [[inputs.snmp.table.field]]
      name = "ifHCOutBroadcastPkts"
      oid = "IF-MIB::ifHCOutBroadcastPkts"

    [[inputs.snmp.table.field]]
      name = "ifHCInUcastPkts"
      oid = "IF-MIB::ifHCInUcastPkts"
    [[inputs.snmp.table.field]]
      name = "ifHCOutUcastPkts"
      oid = "IF-MIB::ifHCOutUcastPkts"

    [[inputs.snmp.table.field]]
      name = "ifHCInMulticastPkts"
      oid = "IF-MIB::ifHCInMulticastPkts"
    [[inputs.snmp.table.field]]
      name = "ifHCOutMulticastPkts"
      oid = "IF-MIB::ifHCOutMulticastPkts"

toni-moreno commented 7 years ago

Hi @willemdh.

I suggest testing snmpcollector (https://github.com/toni-moreno/snmpcollector); we are gathering 200k metrics per minute from close to 300 devices with a single agent and very low CPU (less than 10%) on a small VM with only 8 cores.

I would like to get some more feedback about the performance of this tool.

Thank you very much.

phemmer commented 7 years ago

No offense, but why is it that every single ticket that is opened that mentions the snmp plugin gets an advertisement for snmpcollector?

willemdh commented 7 years ago

Imho I would also prefer to get this working in Telegraf itself. Network monitoring is an important piece of any monitoring tool and should work with a reasonable load in Telegraf.

Can anyone give me a suggestion to improve my posted Telegraf configuration, or explain why the load is going up and down?

phemmer commented 7 years ago

@willemdh I would open up a new issue. Your problem is not what this ticket is about. I would also suspect your config is a lot more complex than what you show, as the config you provided cannot account for that much CPU usage.

willemdh commented 7 years ago

@phemmer Thanks for commenting and acknowledging this is not normal behaviour. I'll make some time ASAP to thoroughly document the setup in a new issue. (The config I provided really is the relevant part of my setup, except that I have 10 configuration files in telegraf.d, one file per switch.)

ayounas commented 7 years ago

Same issue here. I want to poll a few hundred SNMP network devices using the telegraf snmp input plugin, every minute. But initial setup has shown that the plugin takes 15 seconds just to poll 3 devices; adding 20 more means telegraf won't finish a poll before the next one starts. It would be good to poll multiple devices in parallel, as people above have suggested. Thanks

wang1219 commented 7 years ago

Same issue here. I use the SNMP input plugin to collect from 500 devices, 60 metrics each, and a full collection takes 10 minutes... but my requirement is one minute. @phemmer Is there any solution?

JerradGit commented 7 years ago

Just wanted to share my experience in case it helps anyone else out

We run all of our collection using the official telegraf Docker image, and up until I started to run into issues we ran everything within a single container. My CPU wasn't necessarily overly high, but I noticed that my graphs started to look very sporadic, with high/low spikes rather than the smooth line I was expecting. This got worse as I kept adding more devices to be polled. I could see that the timestamps stored in InfluxDB were not consistently 1 minute apart, so due to the varying collection intervals, functions like non_negative_derivative would report values out of range.

Example

[image: graph with sporadic spikes]

Since we build a custom Docker image with telegraf as the base image, I elected to move a number of my snmp configs into separate containers. So rather than one container polling 25 devices, I broke things down into device-role containers, e.g. firewalls, routers, switches, etc. The only extra work this required was a few extra Dockerfiles and updating my Makefile to produce different container names for these new roles (each container only had a copy of the config files for the devices which fell into that role).

After doing this my graphs immediately corrected themselves

[image: corrected graph]

I would obviously prefer to manage a single container for all devices, but this turned out to require very little effort to achieve similar results.

phemmer commented 7 years ago

Yeah, this issue, and everything else in this ticket boils down to the fact that the SNMP plugin runs serially. But the root issue keeping this from being addressed is the underlying SNMP library the plugin is using. It does not properly support parallel requests. Meaning you'd have to create multiple instances of the plugin running in memory. The memory usage of the plugin is rather high (due to buffering and such). Some users have thousands of network devices they want to poll, thus we cannot do this or the memory overhead would become huge. The solution is to fix the SNMP library to handle parallelization. I tried to tackle this back when I first wrote the SNMP plugin, but unfortunately the way the library supports SNMPv3 requires a massive redesign to support it. SNMPv2 works fine, just not v3. Discussion on this subject can be found here: https://github.com/soniah/gosnmp/issues/70#issuecomment-244584387
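
If many agents are gathered concurrently, the fan-out can at least be bounded so that thousands of devices do not all hold sockets and buffers at the same time. A minimal sketch, assuming a hypothetical pollOne callback that performs the gather for a single agent (none of these names are telegraf code):

package main

import "sync"

// gatherAll polls every agent concurrently but uses a buffered channel as a
// counting semaphore to cap how many polls are in flight at once.
func gatherAll(agents []string, maxConcurrent int, pollOne func(agent string)) {
	sem := make(chan struct{}, maxConcurrent)
	var wg sync.WaitGroup
	for _, agent := range agents {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxConcurrent polls are already running
		go func(a string) {
			defer wg.Done()
			defer func() { <-sem }()
			pollOne(a)
		}(agent)
	}
	wg.Wait()
}

func main() {
	agents := []string{"192.0.2.1", "192.0.2.2", "192.0.2.3"} // placeholders
	gatherAll(agents, 2, func(agent string) {
		// a real implementation would gather this agent's fields and tables here
	})
}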

Will-Beninger commented 7 years ago

@phemmer My work situation has changed and I'm considering contributing to the project in my free time. I'm still seeing a notification every few weeks on this so it's apparently still an issue.

I'm able to open a PR and contribute my earlier code (once I've updated it) that "fixed" some of the parallelization issues we saw. Are you okay with proceeding with it as a workaround until the goSNMP project can be fixed?

I started deep-diving the goSNMP project and it's a bit of a mess. It almost needs to be rebuilt from the RFCs up. Interested in how you'd recommend tackling it.

toni-moreno commented 7 years ago

Hi @Will-Beninger, @phemmer. Sorry for my ignorance regarding the SNMP protocol and parallelization issues.

I would like to know why you are saying that gosnmp cannot handle parallelization. I've been doing some tests with multiple parallel SNMP handlers in gosnmp and it's working fine for me (https://github.com/soniah/gosnmp/issues/64#issuecomment-291121905); I also fixed some performance issues detected while doing these parallelizations (https://github.com/soniah/gosnmp/pull/102).

I'm confused; I hope you can shed some light on the snmp plugin's inability to handle parallel requests and how that relates to the underlying gosnmp library.

Thank you very much

Will-Beninger commented 7 years ago

@toni-moreno the gosnmp library is built in such a way that each remote server is hardcoded into the base object. Looking at your parallel scripts, you're only attempting to poll 1 device (and the loopback address at that), and mainly just parallelizing the OID walks you're doing. What this plugin needs is to poll hundreds of different devices with potentially different OIDs. (My original use case had 500+ devices, each pulling hundreds of similar OIDs.)

This leaves us with 2 choices:

  1. Instantiate hundreds of gosnmp instances in parallel
  2. Serially reset the underlying gosnmp object to the next device and move on

As to @phemmer's concerns, I don't have a GREAT understanding of the underlying gosnmp library and would prefer he address that. Reading through it, I do see some areas where parallel callers will hit slowdowns and waits when sending requests, such as the sendOneRequest() and send() functions in marshal.go: there's a full pause while waiting for retries, and the retry timer is only checked at the beginning of the loop.

Honestly, I don't know the best way to solve this use case. Appreciate input from both of you.

danielnelson commented 6 years ago

@Will-Beninger If you could open a pull request with the parallel execution that would be very much appreciated.

jasonkeller commented 6 years ago

A little something to add to this - we're currently bumping into the telegraf error "socket: too many open files". This is related to us instantiating a separate inputs.snmp instance in our configuration file per device (they all have different community strings). We have about 1800 devices in total right now.

I have the clause LimitNOFILE=infinity set in the [Service] section of /usr/lib/systemd/system/telegraf.service to alleviate this (the base system ulimit has been raised to 16384 as well); however, the last yum update from 1.3.5 to 1.4.0-1 managed to clobber this file and I ended up losing a lot of data points overnight. I just noticed 1.4.1-1 dropped and, once again, it clobbered the file (this time I caught it before reload).

I bring this up as I'm unsure if this parallelization effort will also end up running into this wall when many devices are being polled.

phemmer commented 6 years ago

No, parallelization will make the issue worse, which is why I'm not fond of it. The issue really needs to be fixed within the gosnmp lib.

@jasonkeller See also https://www.freedesktop.org/software/systemd/man/systemd.unit.html (search for "drop-in") about how to alleviate your issue with package upgrades clobbering your LimitNOFILE override.
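
For example, a drop-in of the following shape survives package upgrades because it lives outside the packaged unit file (the path and limit value here are only illustrative, not a recommendation):

# /etc/systemd/system/telegraf.service.d/limits.conf
[Service]
LimitNOFILE=16384

After adding it, run systemctl daemon-reload and restart telegraf for the override to take effect.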

jasonkeller commented 6 years ago

Thanks @phemmer ! I had begun to wonder about how to keep those local overrides in but that link spells it out quite plainly (and I now have it integrated properly). Saved me loads of searching - thank you again.

danielnelson commented 6 years ago

Shouldn't the number of open sockets remain the same since we currently keep all sockets open between gathers?

danielnelson commented 6 years ago

Support for concurrently gathering across agents has been merged into master and should show up in the nightly builds in the next 24 hours. I expect this should help significantly if you have many agents.

I would appreciate any testing and feedback on how well this works in practice; we can determine whether this issue can be closed based on what we learn.

ayounas commented 6 years ago

Thanks @danielnelson. Just tried the latest nightly build and it is a huge improvement.

Time taken to poll 10 devices on the latest nightly:

real 0m4.696s user 0m0.363s sys 0m0.098s

Time taken to poll the same 10 devices with stable:

real 0m16.728s user 0m0.377s sys 0m0.116s

I will add more devices and report times

justindiaw commented 6 years ago

I am having trouble when there are errors for some devices. When I check the log, it seems that the snmp plugin spends a long time retrying each field, and the others have to wait. I guess this is because I am putting all the IP addresses in one agent list: when an error happens on one device, the others have to wait. To avoid this, I would have to separate each device by copying the same config, which would make for a very long config file for a large property. Is there any way to let the snmp plugin work asynchronously for the different IPs listed in the agent list? If so, users would save a lot of time creating an snmp config file.

danielnelson commented 6 years ago

@justindiaw What you are experiencing should be addressed in the 1.5 release, could you try the nightly build and let me know if it is working well for you.

justindiaw commented 6 years ago

@danielnelson Thanks for the fast reply. Good to know that. I'm going to try the new release.

danielnelson commented 6 years ago

Just to be clear, the 1.5 release with the change is not yet out; if you are able to help with testing you will need to use a nightly build or compile from source.

danielnelson commented 6 years ago

Should be a big improvement in 1.5. I'm closing this issue; we can open more targeted issues if needed.

zzcpower commented 2 years ago

Hi there, does anyone still face this slow collection in 2022? One switch with around 3000 indexes (ports) takes me 3 minutes to collect. My telegraf version is 1.15.4.