Closed eklein closed 1 year ago
Hi,
Can you share your telegraf config (with any secrets removed) and your telegraf logs if possible? It would be nice to see how you have telegraf configured, given there are a number of config values for the ping plugin that may affect the behavior of the plugin, and see what logs you are seeing.
Additionally, the ping plugin will go through and create a separate go routine for every URL specified by the ping plugin. So I would have expected it to collect the data from other hosts still.
Thanks
I've attached a grafana dashboard showing the behavior along with the corresponding logs from the 13:00 timeframe. In the image, mikrotik and cablemodem are internal and are in the configuration as IP addresses. Thank you for your help!
Config:
[global_tags]
# no configuration
[agent]
interval = "10s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = ""
hostname = ""
omit_hostname = false
[[outputs.influxdb]]
username = <username>
password = <password>
[[inputs.cpu]]
percpu = true
totalcpu = true
collect_cpu_time = false
report_active = false
[[inputs.disk]]
ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]
[[inputs.diskio]]
# no configuration
[[inputs.kernel]]
# no configuration
[[inputs.mem]]
# no configuration
[[inputs.processes]]
# no configuration
[[inputs.swap]]
# no configuration
[[inputs.system]]
# no configuration
[[inputs.ping]]
urls = ["hosta","hostb","internal-hosta","internal-hostb"]
count = 3
Logs:
May 19 13:15:00 beast docker[2368]: 2022-05-19T19:15:00Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:15:10 beast docker[2368]: 2022-05-19T19:15:10Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:15:18 beast docker[2368]: 2022-05-19T19:15:18Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:15:18 beast docker[2368]: 2022-05-19T19:15:18Z E! [inputs.ping] Error in plugin: host hostb: ping: hostb: Temporary failure in name resolution, exit status 2
May 19 13:15:30 beast docker[2368]: 2022-05-19T19:15:30Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:15:40 beast docker[2368]: 2022-05-19T19:15:40Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:15:48 beast docker[2368]: 2022-05-19T19:15:48Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:15:48 beast docker[2368]: 2022-05-19T19:15:48Z E! [inputs.ping] Error in plugin: host hostb: ping: hostb: Temporary failure in name resolution, exit status 2
May 19 13:16:00 beast docker[2368]: 2022-05-19T19:16:00Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:16:10 beast docker[2368]: 2022-05-19T19:16:10Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:16:20 beast docker[2368]: 2022-05-19T19:16:20Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:16:23 beast docker[2368]: 2022-05-19T19:16:23Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:16:40 beast docker[2368]: 2022-05-19T19:16:40Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:17:00 beast docker[2368]: 2022-05-19T19:17:00Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:17:10 beast docker[2368]: 2022-05-19T19:17:10Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:17:18 beast docker[2368]: 2022-05-19T19:17:18Z E! [inputs.ping] Error in plugin: host hostb: ping: hostb: Temporary failure in name resolution, exit status 2
May 19 13:17:18 beast docker[2368]: 2022-05-19T19:17:18Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:17:30 beast docker[2368]: 2022-05-19T19:17:30Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:17:40 beast docker[2368]: 2022-05-19T19:17:40Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:17:48 beast docker[2368]: 2022-05-19T19:17:48Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:17:50 beast docker[2368]: 2022-05-19T19:17:50Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:17:58 beast docker[2368]: 2022-05-19T19:17:58Z E! [inputs.ping] Error in plugin: host hostb: ping: hostb: Temporary failure in name resolution, exit status 2
May 19 13:18:10 beast docker[2368]: 2022-05-19T19:18:10Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:18:30 beast docker[2368]: 2022-05-19T19:18:30Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:18:40 beast docker[2368]: 2022-05-19T19:18:40Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:18:48 beast docker[2368]: 2022-05-19T19:18:48Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:18:48 beast docker[2368]: 2022-05-19T19:18:48Z E! [inputs.ping] Error in plugin: host hostb: ping: hostb: Temporary failure in name resolution, exit status 2
May 19 13:19:00 beast docker[2368]: 2022-05-19T19:19:00Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:19:10 beast docker[2368]: 2022-05-19T19:19:10Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:19:18 beast docker[2368]: 2022-05-19T19:19:18Z E! [inputs.ping] Error in plugin: host hostb: ping: hostb: Temporary failure in name resolution, exit status 2
May 19 13:19:18 beast docker[2368]: 2022-05-19T19:19:18Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:19:30 beast docker[2368]: 2022-05-19T19:19:30Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:19:40 beast docker[2368]: 2022-05-19T19:19:40Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:19:48 beast docker[2368]: 2022-05-19T19:19:48Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:19:48 beast docker[2368]: 2022-05-19T19:19:48Z E! [inputs.ping] Error in plugin: host hostb: ping: hostb: Temporary failure in name resolution, exit status 2
May 19 13:20:00 beast docker[2368]: 2022-05-19T19:20:00Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:20:10 beast docker[2368]: 2022-05-19T19:20:10Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:20:18 beast docker[2368]: 2022-05-19T19:20:18Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:20:18 beast docker[2368]: 2022-05-19T19:20:18Z E! [inputs.ping] Error in plugin: host hostb: ping: hostb: Temporary failure in name resolution, exit status 2
May 19 13:20:30 beast docker[2368]: 2022-05-19T19:20:30Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:20:40 beast docker[2368]: 2022-05-19T19:20:40Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:20:48 beast docker[2368]: 2022-05-19T19:20:48Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:20:48 beast docker[2368]: 2022-05-19T19:20:48Z E! [inputs.ping] Error in plugin: host hostb: ping: hostb: Temporary failure in name resolution, exit status 2
May 19 13:21:00 beast docker[2368]: 2022-05-19T19:21:00Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:21:10 beast docker[2368]: 2022-05-19T19:21:10Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:21:18 beast docker[2368]: 2022-05-19T19:21:18Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:21:18 beast docker[2368]: 2022-05-19T19:21:18Z E! [inputs.ping] Error in plugin: host hostb: ping: hostb: Temporary failure in name resolution, exit status 2
May 19 13:21:30 beast docker[2368]: 2022-05-19T19:21:30Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:21:40 beast docker[2368]: 2022-05-19T19:21:40Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:21:48 beast docker[2368]: 2022-05-19T19:21:48Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
May 19 13:21:48 beast docker[2368]: 2022-05-19T19:21:48Z E! [inputs.ping] Error in plugin: host hostb: ping: hostb: Temporary failure in name resolution, exit status 2
May 19 13:22:00 beast docker[2368]: 2022-05-19T19:22:00Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 22:39:30 beast docker[2368]: 2022-05-20T04:39:30Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
May 19 13:15:10 beast docker[2368]: 2022-05-19T19:15:10Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s May 19 13:15:18 beast docker[2368]: 2022-05-19T19:15:18Z E! [inputs.ping] Error in plugin: host hosta: ping: hosta: Temporary failure in name resolution, exit status 2
You have two things going on in the above log messages.
1) your collection is taking longer than 10 seconds (your interval time), which may end up causing the next interval to skip 2) the reason for the long collection is you seem to be having DNS issues, which probably are timing out.
My suggestion to you is to 1) understand why you are having DNS resolution issues and 2) re-evaluate your collection interval for the ping.
I don't see either of these being telegraf issues/bugs.
In this specific case, the dns resolution failure was due to an internet outage. I would expect my internal IP based hosts to still collect, no?
The collection interval error only comes up when I have an internet outage, so I assume it’s due to the resolution issues.
The collection interval error only comes up when I have an internet outage, so I assume it’s due to the resolution issues.
Correct
In this specific case, the dns resolution failure was due to an internet outage. I would expect my internal IP based hosts to still collect, no?
With your config, telegraf is using the native "ping" on your system.
urls = ["hosta","hostb","internal-hosta","internal-hostb"]
Assuming these are actually IP addresses in your config and not DNS names then maybe, but I do not know enough about the ping implementation.
I've tried everything I can think of now - using exec, adding a timeout of 1 second, reducing the number of pings to 1.. in all cases, I get gaps in the graph. I have a separate window pinging the one host that should always be up (from the host where telegraf is running, no less) and it keeps going no problem, despite the gaps appearing in the graphs like it's down.
Could this be somehow due to running telegraf in a docker container?
Could this be somehow due to running telegraf in a docker container?
As you add layers it absolutely could be that something interrupting the network. Have you tried running outside the container?
Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Page. Thank you!
I am able to reproduce the same issue with my config:
[agent]
interval = "1s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "1s"
flush_jitter = "0s"
precision = "1ns"
[[outputs.influxdb_v2]]
urls = ["http://localhost:8086"]
token = "$INFLUX_TOKEN"
organization = "org"
bucket = "ping"
[[inputs.ping]]
## Hosts to send ping packets to.
urls = [
"8.8.8.8",
"8.8.4.4",
"1.1.1.1",
"1.0.0.1",
"www.google.com",
"www.cloudflare.com",
]
method = "native"
from(bucket: "ping")
|> range(start: v.timeRangeStart, stop:v.timeRangeStop)
|> filter(fn: (r) =>
r._measurement == "ping" and
r._field == "maximum_response_ms" and
r.url == "${targetHost}"
)
I receive the following message multiple times when the gap happens:
W! [inputs.ping] Colection took longer than expected; not complete after interval of 1s.
I guess, maybe one of the packet to one host is lost, causing the whole collection to take longer, blocking the next measurement to all hosts to start.
One important factor of this setup is: ping deadline >= collection interval. In my setup, ping deadline is default (10s), collection interval is 1s. In the original post, ping deadline is default (10s), collection interval is also 10s.
Most modern ping implementations (except for Windows’s ping.exe) should continuously send packets at a fixed interval (1s) and expect the responses to appear at any time and in any order. Out-of-order response is very common in wireless networks. However, the source code shows, Telegraf waits for all hosts to respond before sending another batch of pings.
I believe Telegraf’s logic has a mistake and needs to be fixed if my reasoning is correct.
Need information: Can anyone tell me whether my reasoning is correct that the timeout of one measurement will indeed block the next one to start?
I made a better Ping plugin (using execd) in the past few days: https://github.com/m13253/telegraf-better-ping
Unlike the stock Ping plugin, it can still work even if some hosts are down. It also supports Pinging a large amount of hosts or multicast addresses. ~Also it correctly calls destination “destination” instead of “URL”.~
Please try it and let me know if it works for you or not.
@powersj
Hi, I'm having this problem too.
- your collection is taking longer than 10 seconds (your interval time), which may end up causing the next interval to skip
It actually causes the next interval for all hosts to skip, which seems like the wrong kind of coupling to have for this tool.
The analysis by @m13253 matches my data:
I guess, maybe one of the packet to one host is lost, causing the whole collection to take longer, blocking the next measurement to all hosts to start.
Is it reasonable that while configuring telegraf to ping 2 hosts like this:
[[inputs.ping]]
urls = ["192.168.1.1","1.1.1.1"]
method = "native"
ping_interval = 60.0
timeout = 15.0
deadline = 30
that when my Internet uplink is down, but local LAN is not, meaning I can easily reach 192.168.1.1 but 1.1.1.1 will timeout, that probing of 192.168.1.1 will essentially only happen every other ping_interval? It does not seem reasonable to me, but I am not sure I fully appreciate all of the issues involved. If it's unreasonable, I think this issue should be reopened, or a replacement, whatever the project workflow prefers.
If it's reasonable, it seems like there's a documentation deficiency because the "batching" behavior of the list of hosts isn't described anywhere that I can see in the README.md.
that when my Internet uplink is down, but local LAN is not, meaning I can easily reach 192.168.1.1 but 1.1.1.1 will timeout
It would be advisable to split your IPs between two ping plugins then: one for your local network and one for remote.
If it's reasonable, it seems like there's a documentation deficiency because the "batching" behavior of the list of hosts isn't described anywhere that I can see in the README.md.
This behavior is done at the telegraf agent level, not at the plugin level. The agent itself will call out to each plugin and expects it to complete before the next interval. If the plugin is not done running, then we skip the interval to avoid causing even more issues as we assume if the plugin is not done then something is either timing out or in an error state already.
There is work we want to get done to better cancel plugins that are running, but the agent behavior is not something that will fundamentally change.
Thanks for the detail, I didn't have any appreciation for the scheduling out from the agent to the plugins. It makes sense to not "pile up" (compound a problem) in that way.
It would be advisable to split your IPs between two ping plugins then: one for your local network and one for remote.
It would only work if we split all IPs into separate plugins, because a packet loss to any IP would block the whole plugin. Two is not enough.
It makes sense to not "pile up" (compound a problem) in that way.
I agree. So my telegraf-better-ping program has a different approach than this stock one.
It sends out one ping each second, but use another thread to capture response packets and produce InfluxDB entries. There is no waiting, so there won’t be any “pile-up”.
This approach is actually the same as Linux’s ping
command. (But different from Windows’s ping
, which waits for the response before proceeding.)
The only downside is, packet drop rate requires a post-processing step. (See my Readme for details.)
I would invite this stock ping plugin to implement my two-thread approach as an option.
I think, the difference between the two implementation approaches are caused by two usage scenarios:
The stock plugin is designed for the burst mode, but in order to monitor the reliability of home Internet, we want the continuous mode.
I have telegraf running to check latency of 6 different hosts. Some are inside my network and others are outside my network. When I experience an internet outage, pings drop to my local hosts as well. I can only imagine that this is because the ping processes are run serially and it runs out of time. Is there a way to run this so that the ping input still runs against hosts that are still up while being able to see that some are down? I tried doing some searching but couldn't find anything, my apologies if this is super obvious or even documented somewhere.