Better ping failure handling

jeff1985 commented 10 years ago

Over a long run, individual ping requests can fail even though the server is operating normally. I'd expect heartbeat to retry a failed request and to switch only if the downtime continues or the number of lost packets is high.

Just an example to make the problem clearer: I'm currently running heartbeat with 6 failover IPs. Nearly every day, one of the IPs gets switched while the servers are operating normally.

So do you have an idea of a possible implementation?

mrkamel commented 10 years ago

Hi, you are free to change the config options timeout and tries to match your needs. The defaults are timeout: 10 and tries: 3, so heartbeat must fail to reach a server for 30 seconds before switching failover IPs. If you set tries: 6, heartbeat will check for 60 seconds before switching, etc.

Thus, you should raise the thresholds if you think it switches too frequently.

jeff1985 commented 10 years ago

As I understand it, increasing the tries option to 6 would only make my problem worse.

Please refer to your implementation:

def ping(ip = ping_ip)
    `ping -W #{timeout} -c #{tries} #{ip}`

    $?.success?
end

So the tries option is passed to the -c parameter:

 -c count      Stop after sending count ECHO_REQUEST packets. 
With deadline option, ping waits for count ECHO_REPLY packets, until the timeout expires.

The exit status is defined as:

If  ping  does  not  receive any reply packets at all it will exit with code 1. 
If a packet count and deadline are both specified, and fewer than count 
packets are received by the time the deadline has arrived, it will also 
exit with code 1.  On other error it exits with code 2. Otherwise it exits 
with code 0. 
This makes it possible to use the exit code to see if a host is alive or not.

So this means that if I specify tries: 100 and the server sends back 99 replies, heartbeat would think the IP is down.

What do you think about replacing your ping implementation with the following:

def ping(ip = ping_ip)
    `ping -w #{timeout} #{ip}`

    $?.success?
end

mrkamel commented 10 years ago

The implementation uses the capital -W option. So we specify a per-reply timeout, not a deadline, and ping exits non-zero only if it does not get any valid response at all:

irb> `ping -W 10 -c 3 www.google.de`
=> "...  0% packet loss ..."
irb> $?.success?
=> true
irb> `ping -W 10 -c 3 www.google.de`
=> "... 66% packet loss ..."
irb> $?.success?
=> true
irb> `ping -w 10 -c 3 www.google.de`
=> "... 0% packet loss..."
irb> $?.success?
=> true
irb> `ping -w 10 -c 3 www.google.de`
=> "... 90% packet loss ..."
irb> $?.success?
=> false
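
To make the difference between the two modes concrete, here is a minimal sketch that models only the exit-status rules quoted from the man page earlier in this thread. It does not run ping. The method name modeled_ping_exit_status and its keyword arguments are made up for illustration, and the "other error" case (exit code 2) is left out.

```ruby
# Illustration only: models the exit-status rules from the man page excerpt.
# `deadline: true` stands for running ping with -w; `deadline: false` for
# running it without a deadline (for example with only -W and -c).
def modeled_ping_exit_status(count:, received:, deadline: false)
  # No reply at all: exit code 1 in every mode.
  return 1 if received.zero?

  # Count and deadline both given, fewer than `count` replies: exit code 1.
  return 1 if deadline && received < count

  # Otherwise at least one reply is enough: exit code 0.
  0
end

# One reply out of three, no deadline (like -W 10 -c 3): treated as alive.
puts modeled_ping_exit_status(count: 3, received: 1)

# One reply out of three, with deadline (like -w 10 -c 3): treated as down.
puts modeled_ping_exit_status(count: 3, received: 1, deadline: true)
```

Under this model, the tries: 100 scenario with 99 replies only counts as a failure when a deadline is in effect, which matches the irb session above.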

mrkamel commented 10 years ago

Since ping emits one packet per second (not one every timeout seconds, as I stated above), and to avoid further confusion (compare #2), I'll change the implementation to:

  def ping(ip = ping_ip)
    tries.times.any? do |i|
      `ping -W #{timeout} -i 1 #{ip}`

      $logger.info("#{ping_ip} is down, check #{i + 1}/#{tries}.") unless $?.success?

      $?.success?
    end
  end

This will also make heartbeat's behaviour more transparent in the logs.

mrkamel commented 10 years ago

Sorry, typo: it should be -c 1, not -i 1:

  def ping(ip = ping_ip)
    tries.times.any? do |i|
      `ping -W #{timeout} -c 1 #{ip}`

      $logger.info("#{ping_ip} is down, check #{i + 1}/#{tries}.") unless $?.success?

      $?.success?
    end
  end
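
The retry behaviour of the corrected version can be tried out without touching the network. The sketch below is not heartbeat's code: reachable? and the probe lambda are hypothetical stand-ins, where probe plays the role of a single `ping -W timeout -c 1` call and returns true on success.

```ruby
# Hypothetical helper mirroring the `tries.times.any?` pattern above.
# `probe` is called with the zero-based attempt index and must return
# true if that single check succeeded.
def reachable?(tries:, probe:)
  tries.times.any? do |i|
    ok = probe.call(i)

    # Log each failed check, similar to the $logger.info line above.
    warn "check #{i + 1}/#{tries} failed" unless ok

    ok
  end
end

# A probe that fails once and then succeeds: `any?` stops at the first
# success, so the third attempt is never made.
attempts = []
flaky = ->(i) { attempts << i; i >= 1 }
puts reachable?(tries: 3, probe: flaky)
puts attempts.inspect

# A probe that never succeeds: all three attempts run before giving up.
puts reachable?(tries: 3, probe: ->(_i) { false })
```

Because each attempt waits at most timeout seconds for its single reply, an unreachable host should be reported as down after roughly tries × timeout seconds, plus process start-up overhead.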

mrkamel commented 10 years ago

Thanks for opening this issue. Hope the change works fine for you as well.

jeff1985 commented 10 years ago

OK, thanks a lot. It seems I mixed up the -w and -W options. Thanks for pointing this out! I'll test the new implementation. It's a good idea to make it more transparent!

mrkamel / heartbeat

Better ping failure handling #5