Ping plugin providing incorrect values

mabushey commented 9 years ago

2015-09-22_ping_time

From /etc/opt/telegraf/telegraf.conf:

[ping]
        # urls to ping
        urls = ["www.google.com"] # required
        # number of pings to send (ping -c <COUNT>)
        count = 1 # required
        # interval, in s, at which to ping. 0 == default (ping -i <PING_INTERVAL>)
        ping_interval = 0.0
        # ping timeout, in s. 0 == no timeout (ping -t <TIMEOUT>)
        timeout = 0.0
        # interface to send ping from (ping -I <INTERFACE>)
        interface = ""

I'm not sure if this is a bug in Telegraf or the cloud provider has some machine that's responding locally to ping requests to google.com. When I hover over the purple area at the bottom 90% of the way to the right they are mostly 1ms with an occasional 2ms or 4ms.

sparrc commented 9 years ago

@mabushey could you provide output from the ping command on that server? try running ping -c 10 www.google.com and show me the output.

Personally I frequently get 4ms pings from google.com with regular consumer-class comcast service, so this doesn't sound crazily out of the ordinary to me

mabushey commented 9 years ago

# ping -c 10 www.google.com
PING www.google.com (173.194.210.147) 56(84) bytes of data.
64 bytes from vn-in-f147.1e100.net (173.194.210.147): icmp_seq=1 ttl=44 time=36.9 ms
64 bytes from vn-in-f147.1e100.net (173.194.210.147): icmp_seq=2 ttl=44 time=37.0 ms
64 bytes from vn-in-f147.1e100.net (173.194.210.147): icmp_seq=3 ttl=44 time=36.9 ms
64 bytes from vn-in-f147.1e100.net (173.194.210.147): icmp_seq=4 ttl=44 time=36.9 ms
64 bytes from vn-in-f147.1e100.net (173.194.210.147): icmp_seq=5 ttl=44 time=36.9 ms
64 bytes from vn-in-f147.1e100.net (173.194.210.147): icmp_seq=6 ttl=44 time=36.9 ms
64 bytes from vn-in-f147.1e100.net (173.194.210.147): icmp_seq=7 ttl=44 time=37.0 ms
64 bytes from vn-in-f147.1e100.net (173.194.210.147): icmp_seq=8 ttl=44 time=36.9 ms
64 bytes from vn-in-f147.1e100.net (173.194.210.147): icmp_seq=9 ttl=44 time=36.9 ms
64 bytes from vn-in-f147.1e100.net (173.194.210.147): icmp_seq=10 ttl=44 time=36.9 ms

--- www.google.com ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9010ms
rtt min/avg/max/mdev = 36.913/36.973/37.097/0.199 ms

sparrc commented 9 years ago

Do you have ping output from when this was running? What makes you think these values are incorrect?

sparrc commented 9 years ago

Also, it seems to consistently drop across almost all of your servers, which indicates to me that telegraf is getting the correct values.

(with the exception of the "green" server, which appears to be somewhat independent)

mabushey commented 9 years ago

Can I specifically log the results of the pings? The log files are almost empty:

# zcat /var/log/telegraf/telegraf.log.1.gz
2015/09/23 00:49:46 Starting Telegraf (version 0.1.8)
2015/09/23 00:49:46 Loaded outputs: influxdb
2015/09/23 00:49:46 Loaded plugins: cpu disk haproxy io mem net nginx ping swap system
2015/09/23 00:49:46 Tags enabled: host=www7
2015/09/23 02:05:36 Error in plugin [ping]: PING www.google.com (173.194.218.105) 56(84) bytes of data.

--- www.google.com ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms, exit status 1

sparrc commented 9 years ago

@mabushey, There currently isn't, but you bring up a good point, we probably need multiple levels of debug logging in order to track down problems like this. Instead of just -debug, we should have verbosity levels like -v, -vv, etc.

mabushey commented 9 years ago

Also limiting the debug to a plugin would be helpful, something like -debug -filter ping

sparrc commented 9 years ago

@mabushey one thing you could do in the meantime would be to just run ping in the background and see if it corroborates, like:

nohup ping -i <INTERVAL> -D www.google.com > /tmp/pingoutput.log &

-D adds unix timestamps

mabushey commented 9 years ago

I ran a ping above and waited until I saw the issue in grafana (all servers except for one report 1ms ping time to google). 2015-09-25_ping_grafana

The log file, on a server that shows 1ms in the screenshot attached, shows only ping times between 38 to 41ms.

[1443134268.006129] 64 bytes from server-in-f4.1e100.net (216.58.219.228): icmp_seq=887 ttl=53 time=39.7 ms
[1443134278.016644] 64 bytes from server-in-f4.1e100.net (216.58.219.228): icmp_seq=888 ttl=53 time=40.3 ms
[1443134288.018696] 64 bytes from server-in-f4.1e100.net (216.58.219.228): icmp_seq=889 ttl=53 time=40.6 ms
[1443134298.020866] 64 bytes from server-in-f4.1e100.net (216.58.219.228): icmp_seq=890 ttl=53 time=41.5 ms
[1443134308.021290] 64 bytes from server-in-f4.1e100.net (216.58.219.228): icmp_seq=891 ttl=53 time=40.4 ms
[1443134318.022837] 64 bytes from server-in-f4.1e100.net (216.58.219.228): icmp_seq=892 ttl=53 time=40.7 ms
[1443134328.023950] 64 bytes from server-in-f4.1e100.net (216.58.219.228): icmp_seq=893 ttl=53 time=40.5 ms
[1443134338.025345] 64 bytes from server-in-f4.1e100.net (216.58.219.228): icmp_seq=894 ttl=53 time=40.7 ms
[1443134348.026521] 64 bytes from server-in-f4.1e100.net (216.58.219.228): icmp_seq=895 ttl=53 time=40.6 ms
[1443134358.027565] 64 bytes from server-in-f4.1e100.net (216.58.219.228): icmp_seq=896 ttl=53 time=40.5 ms
[1443134368.029110] 64 bytes from server-in-f4.1e100.net (216.58.219.228): icmp_seq=897 ttl=53 time=40.8 ms

sparrc commented 9 years ago

@mabushey I haven't been able to reproduce any issues with the ping plugin locally, so I'd like to take Grafana out of the mix as well.

You can run the query locally or remotely using a command like this: influx -database 'telegraf' -host 'localhost' -port '8086' -execute 'SELECT * FROM ping_average_response_ms'

This will give me the actual data in InfluxDB along with the UNIX timestamps, which can be matched up with the ping output as you've shown above

> SELECT * FROM ping_average_response_ms
name: ping_average_response_ms
------------------------------
time            host    url     value
1443141757000000000 tyrion  www.google.com  5.757
1443141762000000000 tyrion  www.google.com  5.353
1443141767000000000 tyrion  www.google.com  5.414
1443141772000000000 tyrion  www.google.com  5.41

sparrc commented 9 years ago

WHERE clause could help to find the low readings:

influx -database 'telegraf' -host 'localhost' -port '8086' -execute 'SELECT value FROM ping_average_response_ms WHERE value < 4'

mabushey commented 9 years ago

mysql> SELECT FROM_UNIXTIME(1443134160) - TIMEDIFF(NOW(), UTC_TIMESTAMP) as UTCtime;                                                           
+----------------+
| UTCtime        |
+----------------+
| 20150924223600 |
+----------------+

root@influxserver:~# /opt/influxdb/influx -database 'telegraf' -execute 'SELECT * FROM ping_average_response_ms WHERE time >  1443134160 AND value < 4 LIMIT 5'                                                                                                                                         
name: ping_average_response_ms
------------------------------
name: ping_average_response_ms
------------------------------
time                    host    url             value
2015-09-18T18:40:50Z    nottheserverIwant     www.google.com  1.903
2015-09-18T18:41:00Z    nottheserverIwant     www.google.com  1.413
2015-09-18T18:41:10Z    nottheserverIwant     www.google.com  1.481
2015-09-18T18:41:20Z    nottheserverIwant     www.google.com  0.739
2015-09-18T18:41:30Z    nottheserverIwant     www.google.com  1.762
...

Apparently InfluxDB timestamps are from a timezone 6 days off.

I have 30 servers, can someone help me translate AND host="server1" from SQL into "SQL Like"? This fake SQL syntax makes me sad. At least I have grep.

/opt/influxdb/influx -database 'telegraf' -execute 'SELECT * FROM ping_average_response_ms WHERE time > 1443109080 AND value < 4' | grep theserverIwant 
...
2015-09-24T20:07:26Z    theserverIwant     www.google.com  0.804
2015-09-24T22:36:36Z    theserverIwant     www.google.com  0.722
2015-09-24T22:36:46Z    theserverIwant     www.google.com  0.764
2015-09-24T22:36:56Z    theserverIwant     www.google.com  0.666
2015-09-24T22:37:06Z    theserverIwant     www.google.com  0.682
2015-09-24T22:37:16Z    theserverIwant     www.google.com  0.771
2015-09-24T22:37:26Z    theserverIwant     www.google.com  0.663
2015-09-24T22:37:36Z    theserverIwant     www.google.com  0.709
2015-09-24T22:37:46Z    theserverIwant     www.google.com  0.72
2015-09-24T22:37:56Z    theserverIwant     www.google.com  0.615
2015-09-24T22:38:06Z    theserverIwant     www.google.com  0.642
2015-09-24T22:38:16Z    theserverIwant     www.google.com  0.757
2015-09-24T22:38:26Z    theserverIwant     www.google.com  0.708
2015-09-24T22:38:36Z    theserverIwant     www.google.com  1.016
2015-09-24T22:38:46Z    theserverIwant     www.google.com  0.636
2015-09-24T22:38:56Z    theserverIwant     www.google.com  0.918
2015-09-24T22:39:06Z    theserverIwant     www.google.com  0.648
2015-09-24T22:39:16Z    theserverIwant     www.google.com  0.658
2015-09-24T22:39:26Z    theserverIwant     www.google.com  0.691
2015-09-24T22:39:36Z    theserverIwant     www.google.com  0.647
2015-09-24T22:39:46Z    theserverIwant     www.google.com  0.683
2015-09-24T22:39:56Z    theserverIwant     www.google.com  1.661
2015-09-24T22:40:06Z    theserverIwant     www.google.com  0.769
2015-09-24T22:40:17Z    theserverIwant     www.google.com  0.789
2015-09-24T22:40:26Z    theserverIwant     www.google.com  1.017
2015-09-24T22:40:36Z    theserverIwant     www.google.com  0.711
2015-09-24T22:40:46Z    theserverIwant     www.google.com  0.779
2015-09-24T22:40:56Z    theserverIwant     www.google.com  0.656
2015-09-25T09:21:16Z    theserverIwant     www.google.com  0.462
...

The time 2015-09-24T22:36:36Z is a match against the PDT value of 15:36 on 2015-09-24 displayed in Grafana. Telegraf is logging incorrect data for ping times.

sparrc commented 9 years ago

@mabushey re host filtering, the Influx QL sees double-quotes as an "identifier", strings need to be single-quoted like:

> SELECT * FROM ping_average_response_ms WHERE value < 5 AND host='tyrion'

see https://influxdb.com/docs/v0.9/query_language/data_exploration.html

Timestamps are also in unix nanoseconds, that is why your "InfluxDB timestamps are from a timezone 6 days off."

Re: the original issue, I agree it looks like something is off but unfortunately I still haven't been able to reproduce locally, I'll see if I can get unit tests to reproduce something

sparrc commented 9 years ago

@mabushey would it be possible for you to drop in a special binary that logs pings where they're less than 2ms? You can find it here:

https://circle-artifacts.com/gh/influxdb/telegraf/591/artifacts/0/tmp/circle-artifacts.SN2Ya2A/telegraf-linux-amd64

You can drop it in directly like this:

wget -O /opt/telegraf/telegraf https://circle-artifacts.com/gh/influxdb/telegraf/591/artifacts/0/tmp/circle-artifacts.SN2Ya2A/telegraf-linux-amd64
chmod +x /opt/telegraf/telegraf

sparrc commented 9 years ago

ps, if you want to see where this is coming from, it's our CI build page, see the "Artifacts" tab here: https://circleci.com/gh/influxdb/telegraf/591

mabushey commented 9 years ago

OK, I moved the symlink out of the way and grabbed the version posted.

#tail -f /var/log/telegraf/telegraf.log
2015/09/25 20:22:07 Starting Telegraf (version v0.1.9-18-gb3c2c8d)
...

mabushey commented 9 years ago

Another strange thing I've noticed that may or may not be related is:

I have 25 servers reporting ping times
9 of these servers are running Nginx (with the telegraf nginx plugin) and php-fpm
1 server is running Nginx (with the telegraf nginx plugin) and hhvm

The nine Nginx / php-fpm servers only log pings every 30 seconds. All the other servers, Including the Nginx/hhvm machine, log pings every 10 seconds. The config is managed by Ansible so things are consistent. I'm attaching screenshots from the same graph, I'm just selecting a single server from the legend (an Nginx/phpfpm server and the Nginx/hhvm server). 2015-09-25_nginx_hhvm_ping 2015-09-25_nginx_phpfpm_ping

mabushey commented 9 years ago

# ping google.com
64 bytes from dfw25s08-in-f14.1e100.net (216.58.218.142): icmp_seq=106 ttl=56 time=0.797 ms
64 bytes from dfw25s08-in-f14.1e100.net (216.58.218.142): icmp_seq=107 ttl=56 time=0.657 ms
64 bytes from dfw25s08-in-f14.1e100.net (216.58.218.142): icmp_seq=108 ttl=56 time=0.708 ms
64 bytes from dfw25s08-in-f14.1e100.net (216.58.218.142): icmp_seq=109 ttl=56 time=0.796 ms

# ping www.google.com
64 bytes from vl-in-f99.1e100.net (74.125.141.99): icmp_seq=21 ttl=44 time=43.0 ms
64 bytes from vl-in-f99.1e100.net (74.125.141.99): icmp_seq=22 ttl=44 time=43.1 ms
64 bytes from vl-in-f99.1e100.net (74.125.141.99): icmp_seq=23 ttl=44 time=43.1 ms
64 bytes from vl-in-f99.1e100.net (74.125.141.99): icmp_seq=24 ttl=44 time=43.0 ms

Strange that there is such different ping times between www and not having www, but the responses are from different servers. I wonder if DNS is changing...

sparrc commented 9 years ago

@mabushey were you able to get any useful logging output from the special binary? FWIW, I also saw that happen with google.com vs www.google.com, although right now they are identical for me

mabushey commented 9 years ago

It appears the problem I've been seeing is DNS related, basically flipping between www.google.com and google.com. I wonder if 8.8.8.8 would better... I'm using google.com now and all seems good. I set the Decimals to 3 under Legend Values in Grafana and now it show values like "0.689" instead of "1".

I do have a weird issue where some of my servers report their ping time every 10 seconds as per the conf file, but some of them only report every 30 seconds. The conf files are managed by Ansible so they are identical other than host name.

2015-09-29_ping_interval .

sparrc commented 9 years ago

@mabushey that's strange, I'm not sure how to explain that. If you are able to build from tip I have a change that increases the logging of telegraf, which would help to debug this.

Binaries can also be found in the Artifacts tab here: https://circleci.com/gh/influxdb/telegraf/598

mabushey commented 9 years ago

I'd rather do a git clone and compile than figure out circleci's nonsensical UI. I'm not sure what 'tip' (I don't see a branch with that name) is. Does the master branch of telegraf have the increased logging?

Thanks for all the help, I've replaced collectd on all my servers as I believe that telegraf is a better solution.

sparrc commented 9 years ago

sorry, that was a term from my mercurial days, I meant HEAD instead of tip. This is the direct link to the binary: https://circle-artifacts.com/gh/influxdb/telegraf/598/artifacts/0/tmp/circle-artifacts.bpCA9GO/telegraf-linux-amd64

And yes, the master branch has the increased logging

jginfluxdata commented 9 years ago

@mabushey - thanks for feedback on telegraf. Let me know if you'd like a hoodie and stickers sent your way as a thank you. jimmy at influxdb.com

mabushey commented 9 years ago

@sparrc You're awesome! This was reported in the log:

2015/09/30 19:40:41 Error in plugin [haproxy]: Unable to connect to haproxy server 'http://user:pass@www7.example.com:480': Get http://www7.example.com:480/;csv: dial tcp 1.2.3.4:480: i/o timeout
2015/09/30 19:40:41 Cranking default (10s) interval, gathered 78 metrics from 10 plugins in 30.012961989s

I realized I had a hardcoded hostname in the telegraf.conf Ansible template so IPTables was blocking the request. It worked on the one web node because the firewall was fine locally, and Telegraf worked correctly on the rest of the servers because Ansible didn't include the haproxy part. Thanks for the better logging!

I like this much better:

2015/09/30 19:50:55 Cranking default (10s) interval, gathered 196 metrics from 10 plugins in 16.552711ms
2015/09/30 19:51:05 Cranking default (10s) interval, gathered 196 metrics from 10 plugins in 18.087127ms

I think logging this automatically anytime the metric gathering time is greater than the interval time is critical. I added learning Go to my todo list, it seems like an awesome language.

mabushey commented 9 years ago

The server I installed the logging version of Telegraph stopped reporting mem and cpu statistics. I reinstalled the .deb of the release version, restarted, and the cpu and mem metrics came back. I hope this is only a bug in the version with the extra logging.

mabushey commented 9 years ago

I think some keys changed a bit in this version, I noticed some series now contain a port number.

mabushey commented 9 years ago

All looks good, the changing ping time does not appear to be an issue with Telegraf. Thanks again for all your debug help. :)

sparrc commented 9 years ago

@mabushey yes I've changed some series names for the sake of consistency. See the CHANGELOG.md for more details.

Sorry for the inconvenience, there will probably be more breaking changes coming in the 0.1.X release as I get some of the standards and rules for plugins cemented.

mabushey commented 9 years ago

I like making things cleaner/better and appreciate consistency. Thanks. :)

iamaliyousefi commented 7 years ago

@mabushey How could you ping multiple hosts (probably with different interfaces)? how does your telegraf configurations looks like? I tried several states but I failed. I use telegraf 1.2.0. Thanks in advance.

iamaliyousefi commented 7 years ago

I finally overcome it! Apparently there is a bug in Telegraf that does not read configurations from /etc/telegraf/telegraf.d/ directory. I restarted telegraf service, even run the command with -config-directory manually but it does not read the config directory. I put exactly the config lines in /etc/telegraf/telegraf.d/ to /etc/telegraf/telegraf.conf and everything is OK now! @mabushey Sorry for any inconvenience I caused.

influxdata / telegraf

Ping plugin providing incorrect values #222