influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.68k stars 5.59k forks source link

[inputs.ping] Native method gives higher results comparing to exec #9729

Open manio opened 3 years ago

manio commented 3 years ago

Native method gives wrong results comparing to exec mode.

Relevant telegraf.conf:

[[inputs.ping]]
  urls = [
    "host1",
    "host2",
    "host3",
    "host4",
    "host5",
  ]

System info:

Telegraf 1.19.3 (git: HEAD a799489f) Debian Linux x86_64

Steps to reproduce:

  1. Collect ping data using preferred native method
  2. Collect ping data using the exec method
  3. Compare the results

Expected behavior:

The ping reply time should be at least comparable

Actual behavior:

Now it look like this image

You probably can see on this graph at which time I changed from native to exec :)

Additional info:

No idea why the ping is collecting the data wrongly in native mode - just guessing:

  1. spawning new ping thread overhead?
  2. thread's priority?
alkuzad commented 3 years ago

1ms means you are pinging something really close. Maybe go-ping does have weird behavior in this case. I was testing few sites using this gist, which compiles options used by the input plugin, and it's comparable to what system ping shows

manio commented 3 years ago

Yes, I am testing it on LAN host to have as much as reliable testing environment as I can. You can check it on your local gateway and try to compare it.

I built your gist, adjusted the host to mine and I can clearly see that the values are higher and are drifting much more then a regular ping tool. Here are the stats from the ping tool:

285 packets transmitted, 285 received, 0% packet loss, time 286287ms
rtt min/avg/max/mdev = 0.507/0.665/0.980/0.062 ms

while the go-ping average seems to be about 100ms higher (calling it multiple times with sending single request as in your code) and it's not uncommon to have values above 1ms which doesn't happen at all with a ping tool.

I was trying to raise the nice priority of the telegraf tool and all its thread - no change noticed. I also tested this on the collectd and ping module - results are same as using a ping utility.

alkuzad commented 3 years ago

@manio I looked into sources of go-ping and Linux curl, and it seems to be the difference in core mechanism:

It's accurate enough for long periods of time, but it could be not so accurate for small ones. Also, the generation of time for sent packets is not done ideally before a packet is sent. Maybe this introduces further delay.

Hipska commented 2 years ago

I just came to the same/similar problem of this issue. On my system I'm having almost 1700 ping configs (mostly with 1 IP in urls) and the native method results in response times of a few hundreds of ms, while the exec method is only a few ms:

2022-03-07T09:55:44Z I! Loaded inputs: ping (1692x)
> ping,host=xxx,url=x.x.x.x average_response_ms=536.686787,maximum_response_ms=720.719722,minimum_response_ms=115.140587,packets_received=5i,packets_transmitted=5i,percent_packet_loss=0,result_code=0i,standard_deviation_ms=226.911318,ttl=61i 1646646949000000000
...
image

When only running this one ping config:

> ping,host=xxx,url=x.x.x.x average_response_ms=7.628867,maximum_response_ms=8.917635,minimum_response_ms=6.887803,packets_received=5i,packets_transmitted=5i,percent_packet_loss=0,result_code=0i,standard_deviation_ms=0.821779,ttl=61i 1646647152000000000

I was surprised very much when looking at my results and seeing this section in the documentation:

This plugin has two main methods of operation: exec and native. The recommended method is native, which has greater system compatibility and performance. However, for backwards compatibility the exec method is the default.

alkuzad commented 2 years ago

@Hipska docs is right, native version is written in Go, which executes faster and does not need to do any subprocesses spawn. It also provides easy interface without having to parse a lot of different versions of ping outputs.

Linux ping relies on Kernel, and gets time values directly from ICMP packets. Go version uses Time measurements, which won't be as accurate but it does not rely on things only available on one system. The only improvement to that would be using specific per-system code to get timestamps of packets.

The only thing I can suggest is to do docs update to show this "problem".

manio commented 2 years ago

FWIW after a longer testing period i finally just switched to native (having a "wrong" results) because using a exec methods gives glitches - I mean it probably parsing the results wrong (especially when there was a ping timeout) and I have eg. negative ping values which totally messed up my plots...

Hipska commented 2 years ago

@alkuzad Yes I know it means the overhead of launching external commands, but that should not result in less reliable data? In my example setup it did 3 seconds longer to process 1k7 urls, but the data was more correct. So I'm not convinced.

Also, time.Now() should have nanosecond precision, so that also doesn't explain these differences. There must be something else wrong in the library or telegraf's implementation..

alkuzad commented 2 years ago

@Hipska time.Now() precision is irrelevant if goroutines are waiting for free resources to run on. 3s longer means that golang just waited more for all of these processes to close but it does not impact measurement as system ping does not use time at all.

My simple pinger gist (from above) gives me very similar results locally for every ping to native binary, but I call 1 adress with 1 packet (default). It's the same result even for master branch. However, you are doing 1700urls and each of it does 5 packets. This is different situation as goroutines will not start/complete at Check with different number of urls each time and compare the stddev between this and non-native. You can use this simplified go program to remove impact of telegraf itself and test it there, it could be quite easy to port the channels/waiting group to do similar thing.

Hipska commented 2 years ago

I'm sorry, I'm not completely following. Are you saying my problem is different as the one from @manio and thus should be a separate issue?

On the other hand, I see you say it is a goroutine resources issue. Is there any way to tweak this so telegraf has more resources to handle it?

alkuzad commented 2 years ago

@Hipska yes these look like two different core issues. I don't think you can raise goroutines limit as they are set around max cpu cores and this makes sense. I would focus on finding out limits first and experiment a bit to see what seem to be an issue.

mandud commented 6 months ago

Hi Everyone,

Sorry bumping this thread, getting same issue when switching from exec into native (I'm using telegraf on windows) ping result looks good (getting specific decimal ms), but the result always higher than before, this confusing another team that will monitoring the networks

image

Thanks

Hipska commented 6 months ago

To be complete, I solved my issue with adding a jitter to the ping plugins. I have interval at 60s and jitter is 45s, so it will spread all ping requests over 45s instead of trying to launch them all at the start of the minute.

mandud commented 6 months ago

Hi @Hipska

So, after you reconfigure both parameter (jitter & interval) the result getting better? there's no mismatch measurement anymore?

If you don't mind, can share your latest configuration for ping plugin? I also still looking for best configuration ping monitoring for hundred devices.

Thank You

Regards, Mandud

Hipska commented 1 month ago
# Ping given url(s) and return statistics
[[inputs.ping]]
  alias = {{item.hostname|tojson}}
{% if item.interval is defined and item.interval != 60 %}
  interval = "{{item.interval}}s"
{% endif %}

  ## Collection jitter is used to jitter the collection by a random amount.
  ## The plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things at the same time, 
  ## which can have a measurable effect on the system.
  collection_jitter = "45s"

  ## Hosts to send ping packets to.
  urls = ["{{item.address}}"{% if item.fallback is defined %}, "{{item.fallback}}"{% endif %}]

  ## Method used for sending pings, can be either "exec" or "native".  When set
  ## to "exec" the systems ping command will be executed.  When set to "native"
  ## the plugin will send pings directly.
  ##
  ## While the default is "exec" for backwards compatibility, new deployments
  ## are encouraged to use the "native" method for improved compatibility and
  ## performance.
  method = "native"

  ## Number of ping packets to send per interval.  Corresponds to the "-c"
  ## option of the ping command.
  count = 5

And this repeated for every host in my system..