sensu-plugins / sensu-plugins-http

This plugin provides native HTTP instrumentation for monitoring and metrics collection, including: response code, JSON response, HTTP last modified, SSL expiry, and metrics via `curl`.
http://sensu-plugins.io
MIT License
30 stars 97 forks source link

Improve Timeout handling and improves obscuring of DNS failure messages being swallowed by the base timeout (circuit breaker) #123

Closed johanek closed 5 years ago

johanek commented 5 years ago

…ture Net::HTTP errors

Pull Request Checklist

Related to comments in #82

General

~- [ ] Update README with any necessary configuration snippets~

~- [ ] Binstubs are created if needed~

Purpose

Improve error reporting related to Net::HTTP timeouts

majormoses commented 5 years ago

Thanks for your contribution to Sensu plugins! Without people like you submitting PRs we couldn't run the project. I will review it shortly.

johanek commented 5 years ago

Hi, thanks for looking at this so quickly.

I don't think this is a breaking change, the timeout command line option is still being used to set the Net::HTTP timeout options.

Updated the changelog as requested

majormoses commented 5 years ago

I need to sit down and look at it again to eval it for breaking changes. Will try to get to it tonight.

majormoses commented 5 years ago

First off I want to thank you for your time and apologies it took me a couple of days to get back to this.

It's definitely a breaking change but one I agree that we should make, now the real question is how to get there. In it's current form it would timeout at the timeout threshold (defaults to 15) with the proposed changes it could take much longer to fail as each of http.read_timeout, http.open_timeout, http.ssl_timeout, http.continue_timeout, and http.keep_alive_timeout are all individually being set to the same value. As it could easily take 30 seconds overall to fail. I know I have personally configured http checks to run at orgs every 15 seconds, changing this could lead to multiple instances of these checks piling up which could exacerbate the situation. Perhaps the best path forward is to deprecate the --timeout option in favor of unique arguments for each of those various http timeouts so that they can be controlled independently. Alternatively we could keep the main timeout one as a heavy handed circuit breaker and introduce options for each of the timeouts with much lower values.

johanek commented 5 years ago

OK, I agree with your point about the changed behaviour.

Keeping Timeout.timeout() will always mask the DNS failures, as these are not caught by the Net::HTTP open_timeout. I tested setting the open timeout to 1 second and confirmed this.

So the only way forward I see is to set explicit open and read timeout values as unique CLI options. Let me know what you think and I'll implement it.

majormoses commented 5 years ago

Ya I think we should expose those options, we could either do this as one or more changes

Keeping Timeout.timeout() will always mask the DNS failures

We could use the ruby resolver rather than relying on the OS resolver so we can rescue those. We could expose the DNS server to use so that it could be set to still use the local resolver that may be in place (such as dnsmasq running as a cache and forward lookup in front of your service discovery such as consul) which would allow us to keep the high level circuit breaker, I am not opposed to removing it entirely I just want to find a path forward that serves everyone's needs. Documentation: https://ruby-doc.org/stdlib-2.4.0/libdoc/resolv/rdoc/Resolv/DNS.html But yes I agree that anytime you use Timeout.timout() it will obscure the real error maybe we could add a --debug flag that turns off that timeout :thinking: actually once they are separate flags we should be good because you just need to set the timeout to 0 and it will sit around forever until wither an exception is thrown or it completes successfully.

johanek commented 5 years ago

Hi,

That's really interesting that you can use the ruby resolver instead, and set the timeouts via that. I didn't know you could do that.

So I've updated the PR accordingly to use that method, and to add the open and read timeout values via CLI options. If the DNS request times out, we can't catch that error directly as it's caught by Net::HTTP. But the resulting error is much more useful for debugging:

CheckHttp CRITICAL: Request error: Failed to open TCP connection to testing.com:443 (Hostname not known: testing.com)

I set the DNS timeout to 800ms, so that it's always lower than the lowest value for --timeout. But perhaps this too should be configurable?

johanek commented 5 years ago

The changed resolver behaviour has broken some tests, I'll look into that

majormoses commented 5 years ago

You can rescue the ruby DNS timeout via https://ruby-doc.org/stdlib-2.4.0/libdoc/resolv/rdoc/Resolv/ResolvTimeout.html additionally you can use https://ruby-doc.org/stdlib-2.4.0/libdoc/resolv/rdoc/Resolv/ResolvError.html as a more generic DNS resolution exception class.

majormoses commented 5 years ago

It also appears that this breaks DNS resolution according to the tests which is curious because it mostly uses localhost, I have not used the DefaultResolver before it feels like it should work from a quick test locally:

$ irb
irb(main):001:0> require 'resolv-replace'
=> true
irb(main):002:0> dns_resolver = Resolv::DNS.new
=> #<Resolv::DNS:0x0055aa6d6800e8 @mutex=#<Thread::Mutex:0x0055aa6d6800c0>, @config=#<Resolv::DNS::Config:0x0055aa6d680098 @mutex=#<Thread::Mutex:0x0055aa6d680070>, @config_info=nil, @initialized=nil, @timeouts=nil>, @initialized=nil>
irb(main):003:0> dns_resolver.timeouts = 0.001
=> 0.001
irb(main):004:0> Resolv::DefaultResolver.replace_resolvers([dns_resolver])
=> [#<Resolv::DNS:0x0055aa6d6800e8 @mutex=#<Thread::Mutex:0x0055aa6d6800c0>, @config=#<Resolv::DNS::Config:0x0055aa6d680098 @mutex=#<Thread::Mutex:0x0055aa6d680070>, @config_info=nil, @initialized=nil, @timeouts=[0.001]>, @initialized=nil>]
irb(main):006:0> Resolv.getaddress('localhost')
=> "127.0.0.1"

I will need to take a closer look at why its failing in the tests.

majormoses commented 5 years ago

I was not able to replicate this behavior in a container but I did notice that in the container it returned an ipv6 instead of an ipv4 address:

$ docker run -it ruby:2.4.1-slim
irb(main):001:0> require 'resolv-replace'
=> true
irb(main):002:0> dns_resolver = Resolv::DNS.new
=> #<Resolv::DNS:0x00000001cc7698 @mutex=#<Thread::Mutex:0x00000001cc7670>, @config=#<Resolv::DNS::Config:0x00000001cc7648 @mutex=#<Thread::Mutex:0x00000001cc7620>, @config_info=nil, @initialized=nil, @timeouts=nil>, @initialized=nil>
irb(main):003:0> dns_resolver.timeouts = 0.001
=> 0.001
irb(main):004:0> Resolv.getaddress('localhost')
=> "::1"

I don't think this is an ipv4 vs 6 issue but I will need to do some more testing to figure out whats going on.

majormoses commented 5 years ago

Interestingly enough it seems to fail with the same error about localhost not being resolved even when specifying the ipv4 loopback address:

kitchen@27364eb91750:~$ /usr/local/bin/check-http.rb --url http://127.0.0.1/okay
Check failed to run: no address for localhost, ["/usr/local/lib/ruby/2.4.0/resolv.rb:95:in `getaddress'", "/usr/local/lib/ruby/2.4.0/resolv.rb:45:in `getaddress'", "/usr/local/lib/ruby/gems/2.4.0/gems/sensu-plugins-http-4.0.0/bin/check-http.rb:270:in `run'", "/usr/local/lib/ruby/gems/2.4.0/gems/sensu-plugin-3.0.1/lib/sensu-plugin/cli.rb:57:in `block in <class:CLI>'"]

Outside of the code it works as expected:

kitchen@27364eb91750:~$ curl -I localhost/okay
HTTP/1.1 200 OK
Server: nginx/1.6.2
Date: Tue, 15 Jan 2019 05:18:57 GMT
Content-Type: application/octet-stream
Content-Length: 0
Connection: keep-alive
majormoses commented 5 years ago

I need to step away to regain some sanity but I am not 100% sure if its caused by the tests being run in docker with a broken ipv6 or if it breaks with all ipv6 setups. I assume its the former, I don't really do anything with ipv6 so I am not sure where to go from here. If you have a hostname that resolves to ipv6 and can validate that it works that would be appreciated and then I can figure out how to disable ipv6 in the container (probably easiest would be to just comment it out like I did while messing around).

kitchen@27364eb91750:~$ ping6 localhost
PING localhost (::1): 56 data bytes
ping6: sending packet: Cannot assign requested address
kitchen@27364eb91750:~$ ping localhost
PING localhost (127.0.0.1): 56 data bytes
64 bytes from 127.0.0.1: icmp_seq=0 ttl=64 time=0.037 ms
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.052 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.060 ms
^C--- localhost ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.037/0.050/0.060/0.000 ms
# I commented the ipv6 addresses and it worked
kitchen@27364eb91750:~$ grep -rn '::1' /etc/hosts
2:#::1  localhost ip6-localhost ip6-loopback
5:ff02::1   ip6-allnodes

kitchen@27364eb91750:~$ /usr/local/bin/check-http.rb --url http://localhost/okay
"127.0.0.1"
"127.0.0.1"
CheckHttp OK: 200, 0 bytes
majormoses commented 5 years ago

Here is the current diff of my changes/debug code:

$ git diff
diff --git a/bin/check-http.rb b/bin/check-http.rb
index cda472b..5d0fd0b 100755
--- a/bin/check-http.rb
+++ b/bin/check-http.rb
@@ -264,14 +264,26 @@ class CheckHttp < Sensu::Plugin::Check::CLI
     end

     # Use Ruby DNS Resolver and set DNS resolution timeout to 800ms
+    hosts_resolver = Resolv::Hosts.new
     dns_resolver = Resolv::DNS.new
     dns_resolver.timeouts = 0.8
-    Resolv::DefaultResolver.replace_resolvers([dns_resolver])
+    Resolv::DefaultResolver.replace_resolvers(
+      [
+        hosts_resolver,
+        dns_resolver,
+      ]
+    )
+    resolved_hostname = Resolv.getaddress(config[:host])
+    p resolved_hostname

     begin
       Timeout.timeout(config[:timeout]) do
-        acquire_resource
+        acquire_resource(resolved_hostname)
       end
+    rescue Resolv::ResolvTimeout => e
+      critical "Unable to resolve DNS in a timely manner: #{e}"
+    rescue Resolv::ResolvError => e
+      critical "Unable to resolve DNS: #{e}"
     rescue Net::OpenTimeout
       critical 'Request timed out opening connection'
     rescue Net::ReadTimeout
@@ -283,23 +295,26 @@ class CheckHttp < Sensu::Plugin::Check::CLI
     end
   end

-  def acquire_resource
+  def acquire_resource(endpoint)
     http = nil
-
+    # if endpoint.start_with?('::')
+    #   endpoint = "\'[#{endpoint}]\'"
+    # end
+    p endpoint
     if config[:no_proxy]
-      http = Net::HTTP.new(config[:host], config[:port], nil, nil)
+      http = Net::HTTP.new(endpoint, config[:port], nil, nil)
     elsif config[:proxy_url]
       proxy_uri = URI.parse(config[:proxy_url])
       if proxy_uri.host.nil?
         unknown 'Invalid proxy url specified'
       end
       http = if proxy_uri.user && proxy_uri.password
-               Net::HTTP.new(config[:host], config[:port], proxy_uri.host, proxy_uri.port, proxy_uri.user, proxy_uri.password)
+               Net::HTTP.new(endpoint, config[:port], proxy_uri.host, proxy_uri.port, proxy_uri.user, proxy_uri.password)
              else
-               Net::HTTP.new(config[:host], config[:port], proxy_uri.host, proxy_uri.port)
+               Net::HTTP.new(endpoint, config[:port], proxy_uri.host, proxy_uri.port)
              end
     else
-      http = Net::HTTP.new(config[:host], config[:port])
+      http = Net::HTTP.new(endpoint, config[:port])
     end
     http.read_timeout = config[:read_timeout]
     http.open_timeout = config[:open_timeout]
diff --git a/test/fixtures/bootstrap.sh b/test/fixtures/bootstrap.sh
index 8816812..048dd01 100755
--- a/test/fixtures/bootstrap.sh
+++ b/test/fixtures/bootstrap.sh
@@ -37,6 +37,7 @@ fi
 echo "
   server {
     listen 80;
+    listen [::]:80 ipv6only=on;

     location /okay {
       limit_except GET {
johanek commented 5 years ago

I started working on this but ran out of time, I won't have a chance to pick it up again for a couple of days. But I commited and pushed the current WIP

I've also added a 'Host' header to Net::HTTP requests as otherwise the request may not succeed for a virtual host, given we are now making the HTTP request via IP address.

I also had problems with localhost resolving to ::1 and the rests failing - I didn't get to the bottom of this, but I suspect this may be a docker for mac issue. I wanted to see what happened when the tests were run by travis. I "fixed" them by updating /etc/hosts in the docker container, and made sure that the hostname localhost is only used for 127.0.0.1 and not also for ::1.

The test I added to validate DNS timeouts isn't working, the resolution doesn't timeout because the lookup for localhost comes from the host resolver. I tried using google.com and it succeeded too. How can we simulate a DNS timeout in these conditions? I was looking at using tc to simulate a network delay but this requires more config to work successfully within docker containers.

majormoses commented 5 years ago

Probably need to use some kind of unresponsive server for example blackhole.webpagetest.org see: https://serverfault.com/questions/776049/how-to-simulate-dns-server-response-timeout

johanek commented 5 years ago

Hi,

I've made some more updates:

Updated tests to use 127.0.0.1 instead of localhost. This is to get around the ruby resolver (correctly) preferring the ipv6 localhost address over ipv4. This could also be solved by updating /etc/hosts within the container to only use localhost for the ipv4 address.

I removed the suggested step to manually resolve the hostname and pass the IP address to acquire_resource. This didn't work with https connections - establishing the SSL connection requires the hostname to be used so it can be compared against the certificate. Otherwise the check would fail with:

CheckHttp CRITICAL: Request error: SSL_connect returned=1 errno=0 state=error: certificate verify failed

So I don't think we can explicitly capture the DNS lookup failures. However, the error returned from the check is still useful and incidicative of the problem, i.e.

CheckHttp CRITICAL: Request error: Failed to open TCP connection to blackhole.webpagetest.org

The DNS timeout test is working when setting a suitably low resolution timeout. I've also updated the changelog as requested.

I think this is the best we can do with this, but perhaps there's a better solution for the localhost issue with the tests.

Thanks!

majormoses commented 5 years ago

@johanek awesome I will take a closer look this evening, the other option I was thinking about was updating the bootstrap script to run a sed to comment out that line from /etc/hosts but I think I like your solution better. I think we probably want to put a comment there to indicate this as a year from now there is about a 3% chance I will remember why we needed to do that. I can add that comment for you if we are good to go.

johanek commented 5 years ago

I'm happy with where this is now, thanks for your feedback and suggestions to improve my original request 👍

majormoses commented 5 years ago

@johanek thanks for all your awesome work and being willing to explore the better vs the easier path with me. Apologies on the delay for getting this merged and released, I had a lot going on during that time frame and things should be back to whatever "normal" means. I have cut a release and when it makes it through the CI/CD pipeline and creates an artifact I will update here.

johanek commented 5 years ago

Thanks for your assistance @majormoses, it was a pleasure working through it with you 😄

majormoses commented 5 years ago

released: https://rubygems.org/gems/sensu-plugins-http/versions/4.1.0