influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.6k stars 5.57k forks source link

Gaps in all telegraf ping metrics when only one host is down #11145

Closed eklein closed 1 year ago

eklein commented 2 years ago

I have telegraf running to check latency of 6 different hosts. Some are inside my network and others are outside my network. When I experience an internet outage, pings drop to my local hosts as well. I can only imagine that this is because the ping processes are run serially and it runs out of time. Is there a way to run this so that the ping input still runs against hosts that are still up while being able to see that some are down? I tried doing some searching but couldn't find anything, my apologies if this is super obvious or even documented somewhere.

powersj commented 2 years ago

Hi,

Can you share your telegraf config (with any secrets removed) and your telegraf logs if possible? It would be nice to see how you have telegraf configured, given there are a number of config values for the ping plugin that may affect the behavior of the plugin, and see what logs you are seeing.

Additionally, the ping plugin will go through and create a separate go routine for every URL specified by the ping plugin. So I would have expected it to collect the data from other hosts still.

Thanks

eklein commented 2 years ago

I've attached a grafana dashboard showing the behavior along with the corresponding logs from the 13:00 timeframe. In the image, mikrotik and cablemodem are internal and are in the configuration as IP addresses. Thank you for your help!

Config:

[global_tags]
  # no configuration
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  omit_hostname = false
[[outputs.influxdb]]
  username = <username>
  password = <password>
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false
[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]
[[inputs.diskio]]
  # no configuration
[[inputs.kernel]]
  # no configuration
[[inputs.mem]]
  # no configuration
[[inputs.processes]]
  # no configuration
[[inputs.swap]]
  # no configuration
[[inputs.system]]
  # no configuration
[[inputs.ping]]
  urls = ["hosta","hostb","internal-hosta","internal-hostb"]
  count = 3

Logs:

May 19 13:15:00 beast docker[2368]: 2022-05-19T19:15:00Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:15:10 beast docker[2368]: 2022-05-19T19:15:10Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:15:18 beast docker[2368]: 2022-05-19T19:15:18Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:15:18 beast docker[2368]: 2022-05-19T19:15:18Z E! [inputs.ping] Error in plugin: host hostb: ping: hostb: Temporary failure in name resolution, exit status 2
May 19 13:15:30 beast docker[2368]: 2022-05-19T19:15:30Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:15:40 beast docker[2368]: 2022-05-19T19:15:40Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:15:48 beast docker[2368]: 2022-05-19T19:15:48Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:15:48 beast docker[2368]: 2022-05-19T19:15:48Z E! [inputs.ping] Error in plugin: host hostb: ping: hostb: Temporary failure in name resolution, exit status 2
May 19 13:16:00 beast docker[2368]: 2022-05-19T19:16:00Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:16:10 beast docker[2368]: 2022-05-19T19:16:10Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:16:20 beast docker[2368]: 2022-05-19T19:16:20Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:16:23 beast docker[2368]: 2022-05-19T19:16:23Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:16:40 beast docker[2368]: 2022-05-19T19:16:40Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:17:00 beast docker[2368]: 2022-05-19T19:17:00Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:17:10 beast docker[2368]: 2022-05-19T19:17:10Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:17:18 beast docker[2368]: 2022-05-19T19:17:18Z E! [inputs.ping] Error in plugin: host hostb: ping: hostb: Temporary failure in name resolution, exit status 2
May 19 13:17:18 beast docker[2368]: 2022-05-19T19:17:18Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:17:30 beast docker[2368]: 2022-05-19T19:17:30Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:17:40 beast docker[2368]: 2022-05-19T19:17:40Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:17:48 beast docker[2368]: 2022-05-19T19:17:48Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:17:50 beast docker[2368]: 2022-05-19T19:17:50Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:17:58 beast docker[2368]: 2022-05-19T19:17:58Z E! [inputs.ping] Error in plugin: host hostb: ping: hostb: Temporary failure in name resolution, exit status 2
May 19 13:18:10 beast docker[2368]: 2022-05-19T19:18:10Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:18:30 beast docker[2368]: 2022-05-19T19:18:30Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:18:40 beast docker[2368]: 2022-05-19T19:18:40Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:18:48 beast docker[2368]: 2022-05-19T19:18:48Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:18:48 beast docker[2368]: 2022-05-19T19:18:48Z E! [inputs.ping] Error in plugin: host hostb: ping: hostb: Temporary failure in name resolution, exit status 2
May 19 13:19:00 beast docker[2368]: 2022-05-19T19:19:00Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:19:10 beast docker[2368]: 2022-05-19T19:19:10Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:19:18 beast docker[2368]: 2022-05-19T19:19:18Z E! [inputs.ping] Error in plugin: host hostb: ping: hostb: Temporary failure in name resolution, exit status 2
May 19 13:19:18 beast docker[2368]: 2022-05-19T19:19:18Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:19:30 beast docker[2368]: 2022-05-19T19:19:30Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:19:40 beast docker[2368]: 2022-05-19T19:19:40Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:19:48 beast docker[2368]: 2022-05-19T19:19:48Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:19:48 beast docker[2368]: 2022-05-19T19:19:48Z E! [inputs.ping] Error in plugin: host hostb: ping: hostb: Temporary failure in name resolution, exit status 2
May 19 13:20:00 beast docker[2368]: 2022-05-19T19:20:00Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:20:10 beast docker[2368]: 2022-05-19T19:20:10Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:20:18 beast docker[2368]: 2022-05-19T19:20:18Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:20:18 beast docker[2368]: 2022-05-19T19:20:18Z E! [inputs.ping] Error in plugin: host hostb: ping: hostb: Temporary failure in name resolution, exit status 2
May 19 13:20:30 beast docker[2368]: 2022-05-19T19:20:30Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:20:40 beast docker[2368]: 2022-05-19T19:20:40Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:20:48 beast docker[2368]: 2022-05-19T19:20:48Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:20:48 beast docker[2368]: 2022-05-19T19:20:48Z E! [inputs.ping] Error in plugin: host hostb: ping: hostb: Temporary failure in name resolution, exit status 2
May 19 13:21:00 beast docker[2368]: 2022-05-19T19:21:00Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:21:10 beast docker[2368]: 2022-05-19T19:21:10Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:21:18 beast docker[2368]: 2022-05-19T19:21:18Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:21:18 beast docker[2368]: 2022-05-19T19:21:18Z E! [inputs.ping] Error in plugin: host hostb: ping: hostb: Temporary failure in name resolution, exit status 2
May 19 13:21:30 beast docker[2368]: 2022-05-19T19:21:30Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:21:40 beast docker[2368]: 2022-05-19T19:21:40Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:21:48 beast docker[2368]: 2022-05-19T19:21:48Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:21:48 beast docker[2368]: 2022-05-19T19:21:48Z E! [inputs.ping] Error in plugin: host hostb: ping: hostb: Temporary failure in name resolution, exit status 2
May 19 13:22:00 beast docker[2368]: 2022-05-19T19:22:00Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 22:39:30 beast docker[2368]: 2022-05-20T04:39:30Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s

Screen Shot 2022-05-20 at 11 01 12 AM

powersj commented 2 years ago

May 19 13:15:10 beast docker[2368]: 2022-05-19T19:15:10Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s May 19 13:15:18 beast docker[2368]: 2022-05-19T19:15:18Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2

You have two things going on in the above log messages.

1) your collection is taking longer than 10 seconds (your interval time), which may end up causing the next interval to skip 2) the reason for the long collection is you seem to be having DNS issues, which probably are timing out.

My suggestion to you is to 1) understand why you are having DNS resolution issues and 2) re-evaluate your collection interval for the ping.

I don't see either of these being telegraf issues/bugs.

eklein commented 2 years ago

In this specific case, the dns resolution failure was due to an internet outage. I would expect my internal IP based hosts to still collect, no?

eklein commented 2 years ago

The collection interval error only comes up when I have an internet outage, so I assume it’s due to the resolution issues.

powersj commented 2 years ago

The collection interval error only comes up when I have an internet outage, so I assume it’s due to the resolution issues.

Correct

In this specific case, the dns resolution failure was due to an internet outage. I would expect my internal IP based hosts to still collect, no?

With your config, telegraf is using the native "ping" on your system.

urls = ["hosta","hostb","internal-hosta","internal-hostb"]

Assuming these are actually IP addresses in your config and not DNS names then maybe, but I do not know enough about the ping implementation.

eklein commented 2 years ago

I've tried everything I can think of now - using exec, adding a timeout of 1 second, reducing the number of pings to 1.. in all cases, I get gaps in the graph. I have a separate window pinging the one host that should always be up (from the host where telegraf is running, no less) and it keeps going no problem, despite the gaps appearing in the graphs like it's down.

Could this be somehow due to running telegraf in a docker container?

powersj commented 2 years ago

Could this be somehow due to running telegraf in a docker container?

As you add layers it absolutely could be that something interrupting the network. Have you tried running outside the container?

telegraf-tiger[bot] commented 1 year ago

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Page. Thank you!

m13253 commented 11 months ago

I am able to reproduce the same issue with my config:

[agent]
  interval = "1s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "1s"
  flush_jitter = "0s"
  precision = "1ns"
[[outputs.influxdb_v2]]
  urls = ["http://localhost:8086"]
  token = "$INFLUX_TOKEN"
  organization = "org"
  bucket = "ping"
[[inputs.ping]]
  ## Hosts to send ping packets to.
  urls = [
    "8.8.8.8",
    "8.8.4.4",
    "1.1.1.1",
    "1.0.0.1",
    "www.google.com",
    "www.cloudflare.com",
  ]
  method = "native"
from(bucket: "ping")
  |> range(start: v.timeRangeStart, stop:v.timeRangeStop)
  |> filter(fn: (r) =>
    r._measurement == "ping" and
    r._field == "maximum_response_ms" and
    r.url == "${targetHost}"
  )

Screenshot 2023-11-13 at 22-45-40 Ping - Dashboards - Grafana

I receive the following message multiple times when the gap happens:

W! [inputs.ping] Colection took longer than expected; not complete after interval of 1s.

I guess, maybe one of the packet to one host is lost, causing the whole collection to take longer, blocking the next measurement to all hosts to start.

One important factor of this setup is: ping deadline >= collection interval. In my setup, ping deadline is default (10s), collection interval is 1s. In the original post, ping deadline is default (10s), collection interval is also 10s.

Most modern ping implementations (except for Windows’s ping.exe) should continuously send packets at a fixed interval (1s) and expect the responses to appear at any time and in any order. Out-of-order response is very common in wireless networks. However, the source code shows, Telegraf waits for all hosts to respond before sending another batch of pings.

I believe Telegraf’s logic has a mistake and needs to be fixed if my reasoning is correct.

Need information: Can anyone tell me whether my reasoning is correct that the timeout of one measurement will indeed block the next one to start?

m13253 commented 11 months ago

I made a better Ping plugin (using execd) in the past few days: https://github.com/m13253/telegraf-better-ping

Unlike the stock Ping plugin, it can still work even if some hosts are down. It also supports Pinging a large amount of hosts or multicast addresses. ~Also it correctly calls destination “destination” instead of “URL”.~

Please try it and let me know if it works for you or not.

Screenshot

opello commented 10 months ago

@powersj

Hi, I'm having this problem too.

  1. your collection is taking longer than 10 seconds (your interval time), which may end up causing the next interval to skip

It actually causes the next interval for all hosts to skip, which seems like the wrong kind of coupling to have for this tool.

The analysis by @m13253 matches my data:

I guess, maybe one of the packet to one host is lost, causing the whole collection to take longer, blocking the next measurement to all hosts to start.

Is it reasonable that while configuring telegraf to ping 2 hosts like this:

[[inputs.ping]]
  urls = ["192.168.1.1","1.1.1.1"]
  method = "native"
  ping_interval = 60.0
  timeout = 15.0
  deadline = 30

that when my Internet uplink is down, but local LAN is not, meaning I can easily reach 192.168.1.1 but 1.1.1.1 will timeout, that probing of 192.168.1.1 will essentially only happen every other ping_interval? It does not seem reasonable to me, but I am not sure I fully appreciate all of the issues involved. If it's unreasonable, I think this issue should be reopened, or a replacement, whatever the project workflow prefers.

If it's reasonable, it seems like there's a documentation deficiency because the "batching" behavior of the list of hosts isn't described anywhere that I can see in the README.md.

powersj commented 10 months ago

that when my Internet uplink is down, but local LAN is not, meaning I can easily reach 192.168.1.1 but 1.1.1.1 will timeout

It would be advisable to split your IPs between two ping plugins then: one for your local network and one for remote.

If it's reasonable, it seems like there's a documentation deficiency because the "batching" behavior of the list of hosts isn't described anywhere that I can see in the README.md.

This behavior is done at the telegraf agent level, not at the plugin level. The agent itself will call out to each plugin and expects it to complete before the next interval. If the plugin is not done running, then we skip the interval to avoid causing even more issues as we assume if the plugin is not done then something is either timing out or in an error state already.

There is work we want to get done to better cancel plugins that are running, but the agent behavior is not something that will fundamentally change.

opello commented 10 months ago

Thanks for the detail, I didn't have any appreciation for the scheduling out from the agent to the plugins. It makes sense to not "pile up" (compound a problem) in that way.

m13253 commented 10 months ago

It would be advisable to split your IPs between two ping plugins then: one for your local network and one for remote.

It would only work if we split all IPs into separate plugins, because a packet loss to any IP would block the whole plugin. Two is not enough.

It makes sense to not "pile up" (compound a problem) in that way.

I agree. So my telegraf-better-ping program has a different approach than this stock one. It sends out one ping each second, but use another thread to capture response packets and produce InfluxDB entries. There is no waiting, so there won’t be any “pile-up”. This approach is actually the same as Linux’s ping command. (But different from Windows’s ping, which waits for the response before proceeding.)

The only downside is, packet drop rate requires a post-processing step. (See my Readme for details.)

I would invite this stock ping plugin to implement my two-thread approach as an option.

m13253 commented 10 months ago

I think, the difference between the two implementation approaches are caused by two usage scenarios:

  1. Burst mode: For every minute, Telegraf sends out a bunch of pings, maybe 5 pings per host, then wait for 30 seconds, and collect the RTT and loss rate stats.
  2. Continuous mode: For every second, Telegraf sends out 1 ping per host and expect the responses to come at any time in any order. The stats are produced only upon receiving the arrival of each response.

The stock plugin is designed for the burst mode, but in order to monitor the reliability of home Internet, we want the continuous mode.