Closed mabushey closed 9 years ago
@mabushey could you provide output from the ping command on that server? try running ping -c 10 www.google.com
and show me the output.
Personally I frequently get 4ms pings from google.com with regular consumer-class comcast service, so this doesn't sound crazily out of the ordinary to me
# ping -c 10 www.google.com
PING www.google.com (173.194.210.147) 56(84) bytes of data.
64 bytes from vn-in-f147.1e100.net (173.194.210.147): icmp_seq=1 ttl=44 time=36.9 ms
64 bytes from vn-in-f147.1e100.net (173.194.210.147): icmp_seq=2 ttl=44 time=37.0 ms
64 bytes from vn-in-f147.1e100.net (173.194.210.147): icmp_seq=3 ttl=44 time=36.9 ms
64 bytes from vn-in-f147.1e100.net (173.194.210.147): icmp_seq=4 ttl=44 time=36.9 ms
64 bytes from vn-in-f147.1e100.net (173.194.210.147): icmp_seq=5 ttl=44 time=36.9 ms
64 bytes from vn-in-f147.1e100.net (173.194.210.147): icmp_seq=6 ttl=44 time=36.9 ms
64 bytes from vn-in-f147.1e100.net (173.194.210.147): icmp_seq=7 ttl=44 time=37.0 ms
64 bytes from vn-in-f147.1e100.net (173.194.210.147): icmp_seq=8 ttl=44 time=36.9 ms
64 bytes from vn-in-f147.1e100.net (173.194.210.147): icmp_seq=9 ttl=44 time=36.9 ms
64 bytes from vn-in-f147.1e100.net (173.194.210.147): icmp_seq=10 ttl=44 time=36.9 ms
--- www.google.com ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9010ms
rtt min/avg/max/mdev = 36.913/36.973/37.097/0.199 ms
Do you have ping output from when this was running? What makes you think these values are incorrect?
Also, it seems to consistently drop across almost all of your servers, which indicates to me that telegraf is getting the correct values.
(with the exception of the "green" server, which appears to be somewhat independent)
Can I specifically log the results of the pings? The log files are almost empty:
# zcat /var/log/telegraf/telegraf.log.1.gz
2015/09/23 00:49:46 Starting Telegraf (version 0.1.8)
2015/09/23 00:49:46 Loaded outputs: influxdb
2015/09/23 00:49:46 Loaded plugins: cpu disk haproxy io mem net nginx ping swap system
2015/09/23 00:49:46 Tags enabled: host=www7
2015/09/23 02:05:36 Error in plugin [ping]: PING www.google.com (173.194.218.105) 56(84) bytes of data.
--- www.google.com ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms, exit status 1
@mabushey, There currently isn't, but you bring up a good point, we probably need multiple levels of debug logging in order to track down problems like this. Instead of just -debug
, we should have verbosity levels like -v
, -vv
, etc.
Also limiting the debug to a plugin would be helpful, something like -debug -filter ping
@mabushey one thing you could do in the meantime would be to just run ping in the background and see if it corroborates, like:
nohup ping -i <INTERVAL> -D www.google.com > /tmp/pingoutput.log &
-D adds unix timestamps
I ran a ping above and waited until I saw the issue in grafana (all servers except for one report 1ms ping time to google).
The log file, on a server that shows 1ms in the screenshot attached, shows only ping times between 38 to 41ms.
[1443134268.006129] 64 bytes from server-in-f4.1e100.net (216.58.219.228): icmp_seq=887 ttl=53 time=39.7 ms
[1443134278.016644] 64 bytes from server-in-f4.1e100.net (216.58.219.228): icmp_seq=888 ttl=53 time=40.3 ms
[1443134288.018696] 64 bytes from server-in-f4.1e100.net (216.58.219.228): icmp_seq=889 ttl=53 time=40.6 ms
[1443134298.020866] 64 bytes from server-in-f4.1e100.net (216.58.219.228): icmp_seq=890 ttl=53 time=41.5 ms
[1443134308.021290] 64 bytes from server-in-f4.1e100.net (216.58.219.228): icmp_seq=891 ttl=53 time=40.4 ms
[1443134318.022837] 64 bytes from server-in-f4.1e100.net (216.58.219.228): icmp_seq=892 ttl=53 time=40.7 ms
[1443134328.023950] 64 bytes from server-in-f4.1e100.net (216.58.219.228): icmp_seq=893 ttl=53 time=40.5 ms
[1443134338.025345] 64 bytes from server-in-f4.1e100.net (216.58.219.228): icmp_seq=894 ttl=53 time=40.7 ms
[1443134348.026521] 64 bytes from server-in-f4.1e100.net (216.58.219.228): icmp_seq=895 ttl=53 time=40.6 ms
[1443134358.027565] 64 bytes from server-in-f4.1e100.net (216.58.219.228): icmp_seq=896 ttl=53 time=40.5 ms
[1443134368.029110] 64 bytes from server-in-f4.1e100.net (216.58.219.228): icmp_seq=897 ttl=53 time=40.8 ms
@mabushey I haven't been able to reproduce any issues with the ping plugin locally, so I'd like to take Grafana out of the mix as well.
You can run the query locally or remotely using a command like this:
influx -database 'telegraf' -host 'localhost' -port '8086' -execute 'SELECT * FROM ping_average_response_ms'
This will give me the actual data in InfluxDB along with the UNIX timestamps, which can be matched up with the ping output as you've shown above
> SELECT * FROM ping_average_response_ms
name: ping_average_response_ms
------------------------------
time host url value
1443141757000000000 tyrion www.google.com 5.757
1443141762000000000 tyrion www.google.com 5.353
1443141767000000000 tyrion www.google.com 5.414
1443141772000000000 tyrion www.google.com 5.41
WHERE clause could help to find the low readings:
influx -database 'telegraf' -host 'localhost' -port '8086' -execute 'SELECT value FROM ping_average_response_ms WHERE value < 4'
mysql> SELECT FROM_UNIXTIME(1443134160) - TIMEDIFF(NOW(), UTC_TIMESTAMP) as UTCtime;
+----------------+
| UTCtime |
+----------------+
| 20150924223600 |
+----------------+
root@influxserver:~# /opt/influxdb/influx -database 'telegraf' -execute 'SELECT * FROM ping_average_response_ms WHERE time > 1443134160 AND value < 4 LIMIT 5'
name: ping_average_response_ms
------------------------------
name: ping_average_response_ms
------------------------------
time host url value
2015-09-18T18:40:50Z nottheserverIwant www.google.com 1.903
2015-09-18T18:41:00Z nottheserverIwant www.google.com 1.413
2015-09-18T18:41:10Z nottheserverIwant www.google.com 1.481
2015-09-18T18:41:20Z nottheserverIwant www.google.com 0.739
2015-09-18T18:41:30Z nottheserverIwant www.google.com 1.762
...
Apparently InfluxDB timestamps are from a timezone 6 days off.
I have 30 servers, can someone help me translate AND host="server1"
from SQL into "SQL Like"? This fake SQL syntax makes me sad. At least I have grep.
/opt/influxdb/influx -database 'telegraf' -execute 'SELECT * FROM ping_average_response_ms WHERE time > 1443109080 AND value < 4' | grep theserverIwant
...
2015-09-24T20:07:26Z theserverIwant www.google.com 0.804
2015-09-24T22:36:36Z theserverIwant www.google.com 0.722
2015-09-24T22:36:46Z theserverIwant www.google.com 0.764
2015-09-24T22:36:56Z theserverIwant www.google.com 0.666
2015-09-24T22:37:06Z theserverIwant www.google.com 0.682
2015-09-24T22:37:16Z theserverIwant www.google.com 0.771
2015-09-24T22:37:26Z theserverIwant www.google.com 0.663
2015-09-24T22:37:36Z theserverIwant www.google.com 0.709
2015-09-24T22:37:46Z theserverIwant www.google.com 0.72
2015-09-24T22:37:56Z theserverIwant www.google.com 0.615
2015-09-24T22:38:06Z theserverIwant www.google.com 0.642
2015-09-24T22:38:16Z theserverIwant www.google.com 0.757
2015-09-24T22:38:26Z theserverIwant www.google.com 0.708
2015-09-24T22:38:36Z theserverIwant www.google.com 1.016
2015-09-24T22:38:46Z theserverIwant www.google.com 0.636
2015-09-24T22:38:56Z theserverIwant www.google.com 0.918
2015-09-24T22:39:06Z theserverIwant www.google.com 0.648
2015-09-24T22:39:16Z theserverIwant www.google.com 0.658
2015-09-24T22:39:26Z theserverIwant www.google.com 0.691
2015-09-24T22:39:36Z theserverIwant www.google.com 0.647
2015-09-24T22:39:46Z theserverIwant www.google.com 0.683
2015-09-24T22:39:56Z theserverIwant www.google.com 1.661
2015-09-24T22:40:06Z theserverIwant www.google.com 0.769
2015-09-24T22:40:17Z theserverIwant www.google.com 0.789
2015-09-24T22:40:26Z theserverIwant www.google.com 1.017
2015-09-24T22:40:36Z theserverIwant www.google.com 0.711
2015-09-24T22:40:46Z theserverIwant www.google.com 0.779
2015-09-24T22:40:56Z theserverIwant www.google.com 0.656
2015-09-25T09:21:16Z theserverIwant www.google.com 0.462
...
The time 2015-09-24T22:36:36Z is a match against the PDT value of 15:36 on 2015-09-24 displayed in Grafana. Telegraf is logging incorrect data for ping times.
@mabushey re host filtering, the Influx QL sees double-quotes as an "identifier", strings need to be single-quoted like:
> SELECT * FROM ping_average_response_ms WHERE value < 5 AND host='tyrion'
see https://influxdb.com/docs/v0.9/query_language/data_exploration.html
Timestamps are also in unix nanoseconds, that is why your "InfluxDB timestamps are from a timezone 6 days off."
Re: the original issue, I agree it looks like something is off but unfortunately I still haven't been able to reproduce locally, I'll see if I can get unit tests to reproduce something
@mabushey would it be possible for you to drop in a special binary that logs pings where they're less than 2ms? You can find it here:
You can drop it in directly like this:
wget -O /opt/telegraf/telegraf https://circle-artifacts.com/gh/influxdb/telegraf/591/artifacts/0/tmp/circle-artifacts.SN2Ya2A/telegraf-linux-amd64
chmod +x /opt/telegraf/telegraf
ps, if you want to see where this is coming from, it's our CI build page, see the "Artifacts" tab here: https://circleci.com/gh/influxdb/telegraf/591
OK, I moved the symlink out of the way and grabbed the version posted.
#tail -f /var/log/telegraf/telegraf.log
2015/09/25 20:22:07 Starting Telegraf (version v0.1.9-18-gb3c2c8d)
...
Another strange thing I've noticed that may or may not be related is:
The nine Nginx / php-fpm servers only log pings every 30 seconds. All the other servers, Including the Nginx/hhvm machine, log pings every 10 seconds. The config is managed by Ansible so things are consistent. I'm attaching screenshots from the same graph, I'm just selecting a single server from the legend (an Nginx/phpfpm server and the Nginx/hhvm server).
# ping google.com
64 bytes from dfw25s08-in-f14.1e100.net (216.58.218.142): icmp_seq=106 ttl=56 time=0.797 ms
64 bytes from dfw25s08-in-f14.1e100.net (216.58.218.142): icmp_seq=107 ttl=56 time=0.657 ms
64 bytes from dfw25s08-in-f14.1e100.net (216.58.218.142): icmp_seq=108 ttl=56 time=0.708 ms
64 bytes from dfw25s08-in-f14.1e100.net (216.58.218.142): icmp_seq=109 ttl=56 time=0.796 ms
# ping www.google.com
64 bytes from vl-in-f99.1e100.net (74.125.141.99): icmp_seq=21 ttl=44 time=43.0 ms
64 bytes from vl-in-f99.1e100.net (74.125.141.99): icmp_seq=22 ttl=44 time=43.1 ms
64 bytes from vl-in-f99.1e100.net (74.125.141.99): icmp_seq=23 ttl=44 time=43.1 ms
64 bytes from vl-in-f99.1e100.net (74.125.141.99): icmp_seq=24 ttl=44 time=43.0 ms
Strange that there is such different ping times between www and not having www, but the responses are from different servers. I wonder if DNS is changing...
@mabushey were you able to get any useful logging output from the special binary? FWIW, I also saw that happen with google.com
vs www.google.com
, although right now they are identical for me
It appears the problem I've been seeing is DNS related, basically flipping between www.google.com and google.com. I wonder if 8.8.8.8 would better... I'm using google.com now and all seems good. I set the Decimals to 3 under Legend Values in Grafana and now it show values like "0.689" instead of "1".
I do have a weird issue where some of my servers report their ping time every 10 seconds as per the conf file, but some of them only report every 30 seconds. The conf files are managed by Ansible so they are identical other than host name.
.
@mabushey that's strange, I'm not sure how to explain that. If you are able to build from tip I have a change that increases the logging of telegraf, which would help to debug this.
Binaries can also be found in the Artifacts tab here: https://circleci.com/gh/influxdb/telegraf/598
I'd rather do a git clone and compile than figure out circleci's nonsensical UI. I'm not sure what 'tip' (I don't see a branch with that name) is. Does the master branch of telegraf have the increased logging?
Thanks for all the help, I've replaced collectd on all my servers as I believe that telegraf is a better solution.
sorry, that was a term from my mercurial days, I meant HEAD instead of tip. This is the direct link to the binary: https://circle-artifacts.com/gh/influxdb/telegraf/598/artifacts/0/tmp/circle-artifacts.bpCA9GO/telegraf-linux-amd64
And yes, the master branch has the increased logging
@mabushey - thanks for feedback on telegraf. Let me know if you'd like a hoodie and stickers sent your way as a thank you. jimmy at influxdb.com
@sparrc You're awesome! This was reported in the log:
2015/09/30 19:40:41 Error in plugin [haproxy]: Unable to connect to haproxy server 'http://user:pass@www7.example.com:480': Get http://www7.example.com:480/;csv: dial tcp 1.2.3.4:480: i/o timeout
2015/09/30 19:40:41 Cranking default (10s) interval, gathered 78 metrics from 10 plugins in 30.012961989s
I realized I had a hardcoded hostname in the telegraf.conf Ansible template so IPTables was blocking the request. It worked on the one web node because the firewall was fine locally, and Telegraf worked correctly on the rest of the servers because Ansible didn't include the haproxy part. Thanks for the better logging!
I like this much better:
2015/09/30 19:50:55 Cranking default (10s) interval, gathered 196 metrics from 10 plugins in 16.552711ms
2015/09/30 19:51:05 Cranking default (10s) interval, gathered 196 metrics from 10 plugins in 18.087127ms
I think logging this automatically anytime the metric gathering time is greater than the interval time is critical. I added learning Go to my todo list, it seems like an awesome language.
The server I installed the logging version of Telegraph stopped reporting mem and cpu statistics. I reinstalled the .deb of the release version, restarted, and the cpu and mem metrics came back. I hope this is only a bug in the version with the extra logging.
I think some keys changed a bit in this version, I noticed some series now contain a port number.
All looks good, the changing ping time does not appear to be an issue with Telegraf. Thanks again for all your debug help. :)
@mabushey yes I've changed some series names for the sake of consistency. See the CHANGELOG.md for more details.
Sorry for the inconvenience, there will probably be more breaking changes coming in the 0.1.X release as I get some of the standards and rules for plugins cemented.
I like making things cleaner/better and appreciate consistency. Thanks. :)
@mabushey How could you ping multiple hosts (probably with different interfaces)? how does your telegraf configurations looks like? I tried several states but I failed. I use telegraf 1.2.0. Thanks in advance.
I finally overcome it! Apparently there is a bug in Telegraf that does not read configurations from /etc/telegraf/telegraf.d/ directory. I restarted telegraf service, even run the command with -config-directory manually but it does not read the config directory. I put exactly the config lines in /etc/telegraf/telegraf.d/ to /etc/telegraf/telegraf.conf and everything is OK now! @mabushey Sorry for any inconvenience I caused.
From /etc/opt/telegraf/telegraf.conf:
I'm not sure if this is a bug in Telegraf or the cloud provider has some machine that's responding locally to ping requests to google.com. When I hover over the purple area at the bottom 90% of the way to the right they are mostly 1ms with an occasional 2ms or 4ms.