jeff1985 closed 10 years ago
Hi, you are free to change the config options timeout and tries to match your needs. The defaults are timeout: 10 and tries: 3, such that heartbeat must fail to reach a server for 30 seconds before switching failover IPs. If you set tries: 6, heartbeat will check for 60 seconds before switching, etc.
Thus, you should increase the thresholds if you think it switches too frequently.
In my understanding, increasing the tries option to 6 would only make my problem worse.
Please refer to your implementation:
def ping(ip = ping_ip)
  `ping -W #{timeout} -c #{tries} #{ip}`
  $?.success?
end
So the tries option is passed to the -c parameter:
-c count Stop after sending count ECHO_REQUEST packets.
With deadline option, ping waits for count ECHO_REPLY packets, until the timeout expires.
The exit status is defined as follows:
If ping does not receive any reply packets at all it will exit with code 1.
If a packet count and deadline are both specified, and fewer than count
packets are received by the time the deadline has arrived, it will also
exit with code 1. On other error it exits with code 2. Otherwise it exits
with code 0.
This makes it possible to use the exit code to see if a host is alive or not.
So this means that if I specify tries: 100 and the server sends back 99 replies, heartbeat would think the IP is down.
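To make the exit codes quoted above easier to reason about, here is a minimal sketch that maps them to symbols. The helper name `classify_ping_exit` is hypothetical and not part of heartbeat. It only assumes the three exit codes described in the man page excerpt.

```ruby
# Hypothetical helper (not part of heartbeat): maps ping's documented exit
# codes to a symbol, so "no reply" and "other error" are not lumped together
# the way a plain $?.success? check does.
def classify_ping_exit(exitstatus)
  case exitstatus
  when 0 then :alive     # ping considered the run successful
  when 1 then :no_reply  # no replies, or fewer than count before the deadline
  when 2 then :error     # any other error
  else :unknown          # not covered by the man page excerpt
  end
end

# After a real call such as `ping ...`, you would pass $?.exitstatus.
# Literal values are used here so the example runs without network access.
puts classify_ping_exit(0)
puts classify_ping_exit(1)
```

Distinguishing `:no_reply` from `:error` could help avoid switching IPs because of a local problem rather than a remote outage.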
What do you think about exchanging your ping implementation with the following:
def ping(ip = ping_ip)
  `ping -w #{timeout} #{ip}`
  $?.success?
end
The implementation uses the capital -W option. So we specify a timeout, not a deadline, such that ping only exits non-zero if it does not get any valid response:
irb> `ping -W 10 -c 3 www.google.de`
=> "... 0% packet loss ..."
irb> $?.success?
=> true
irb> `ping -W 10 -c 3 www.google.de`
=> "... 66% packet loss ..."
irb> $?.success?
=> true
irb> `ping -w 10 -c 3 www.google.de`
=> "... 0% packet loss..."
irb> $?.success?
=> true
irb> `ping -w 10 -c 3 www.google.de`
=> "... 90% packet loss ..."
irb> $?.success?
=> false
As ping emits 1 packet each second (not every timeout seconds as stated above) and to avoid further confusion (compare #2), I'll change the implementation to:
def ping(ip = ping_ip)
  tries.times.any? do |i|
    `ping -W #{timeout} -i 1 #{ip}`
    $logger.info("#{ping_ip} is down, check #{i + 1}/#{tries}.") unless $?.success?
    $?.success?
  end
end
This will as well make heartbeat's behaviour more transparent within the logs.
sry, typo:
def ping(ip = ping_ip)
  tries.times.any? do |i|
    `ping -W #{timeout} -c 1 #{ip}`
    $logger.info("#{ping_ip} is down, check #{i + 1}/#{tries}.") unless $?.success?
    $?.success?
  end
end
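The retry logic above can be tried out without sending real ICMP packets by injecting the probe as a block. This is only a sketch: `reachable_within?` is a hypothetical name, and the simulated results stand in for real ping calls.

```ruby
# Sketch of the retry loop with the probe injected as a block.
# `reachable_within?` is a hypothetical name, not heartbeat's API.
# Returns true as soon as one probe succeeds, false if all `tries` fail.
def reachable_within?(tries)
  tries.times.any? do |i|
    ok = yield(i)
    puts "check #{i + 1}/#{tries} failed" unless ok
    ok
  end
end

# Simulated probe: fails twice, then succeeds on the third check.
results = [false, false, true]
puts reachable_within?(3) { |i| results[i] }

# With a real probe the block would wrap the command, for example:
#   reachable_within?(tries) do
#     system("ping", "-W", timeout.to_s, "-c", "1", ip, out: File::NULL)
#   end
```

Because `any?` stops at the first success, a single lost packet no longer causes a switch. With the defaults (timeout: 10, tries: 3), a host that never answers would be declared down after roughly 30 seconds, assuming each single-packet probe waits the full timeout.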
Thanks for opening this issue. Hope the change works fine for you as well.
OK, thanks a lot. It seems that I mixed up the -w and -W options. Thanks for pointing this out! I'll test the new implementation. It's a good idea to make it more transparent!
While running for a long time, ping requests might fail even if your server is operating normally. I'd expect heartbeat to retry a failed request and switch only if the downtime continues or if the number of lost packets is high.
Just an example for better understanding of the problem: I'm currently running heartbeat with 6 failover IPs. Nearly every day I get one of the IPs switched while the servers are operating normally.
So do you have an idea of a possible implementation?
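One possible direction, sketched under assumptions rather than taken from heartbeat itself: read the loss percentage from ping's summary line and switch only above a threshold. The helper names and the 50% default are hypothetical. The regular expression assumes the "N% packet loss" wording seen in the ping output quoted in this thread.

```ruby
# Hypothetical helper: extract the loss percentage from ping's summary line,
# e.g. "3 packets transmitted, 1 received, 66% packet loss, time 2003ms".
# Returns a Float, or nil if no summary line is found.
def packet_loss_percent(ping_output)
  match = ping_output.match(/(\d+(?:\.\d+)?)% packet loss/)
  match && match[1].to_f
end

# Treat the host as down only when loss reaches the threshold (an assumed
# default of 50%), or when the output could not be parsed at all.
def down?(ping_output, max_loss: 50.0)
  loss = packet_loss_percent(ping_output)
  loss.nil? || loss >= max_loss
end

# Sample summary lines stand in for real ping output.
puts down?("3 packets transmitted, 2 received, 33% packet loss, time 2003ms")
puts down?("3 packets transmitted, 1 received, 66% packet loss, time 2003ms")
```

Treating unparseable output as down is a conservative choice. Whether that is right depends on how costly a false switch is compared to a missed outage.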