Open chrisrogerson opened 2 years ago
Having multiple IPs for a single check is useful and practical. But I want to better understand the failure you are describing, as it shouldn't happen. I have used it in environments with more than 50 checks and frequent updates, and I never noticed this. Moreover, we update the prefix file atomically, so there is no way that BIRD will see more than one view of the content at a given time.
Could you please share some logs from when the problem occurred?
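For readers unfamiliar with atomic file updates, here is a minimal sketch of the usual write-then-rename pattern. It is an illustration only, not the project's actual code; the function name `write_atomically` and the file name used in the example are assumptions.

```python
import os
import tempfile


def write_atomically(path, content):
    """Write content to path so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the final rename
    # stays on one filesystem, which is what makes it atomic on POSIX.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace swaps the new file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temporary file if anything went wrong before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    demo_file = os.path.join(demo_dir, "anycast-prefixes.conf")  # hypothetical name
    write_atomically(demo_file, "# prefixes would go here\n")
    print(open(demo_file).read())
```

A reader that opens the file at any moment gets one complete version, never a half-written one.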
I use a single server as an anycast backup for the same service run on multiple addresses (AS path prepending on the peered router makes this server less preferred). I run the same script in multiple checks, one per IP address (8 IPv4 and currently one IPv6). When that script fails, all 9 checks fail and the server has to run the "birdc configure" command 9 times at the same time, and that fails. This also happens when the checks recover and attempt to re-add the addresses. If I could add multiple IPs to a single check in the anycast-healthchecker config, it would likely resolve this issue for me, as the birdc configure command would run only once when the check fails or recovers rather than 9 times all at once.
We run birdc configure only once at a given time, not multiple times. There is a queue where service checks put their results, and the main thread picks up items one by one; in this way we ensure that we run birdc configure only once at any given time. We may run it multiple times within a few seconds, but that hasn't been a problem.
Could you please share some logs where you notice multiple invocations of birdc configure?
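To make the queue design concrete, here is a minimal sketch of several check threads feeding a single consumer. It is a simplified illustration, not the project's actual code: the names `service_check`, `reconfigure_bird`, and `main_loop` are hypothetical, the prefixes are documentation placeholders, and the BIRD call is replaced by a stand-in that only records what it was asked to do.

```python
import queue
import threading

results = queue.Queue()
reconfigure_calls = []


def service_check(name, prefix, action):
    # Each check thread only reports its result; it never touches BIRD.
    results.put((name, prefix, action))


def reconfigure_bird(prefix, action):
    # Stand-in for updating the prefix file and running `birdc configure`.
    reconfigure_calls.append((prefix, action))


def main_loop(expected_items):
    # A single consumer handles items one by one, so reconfigurations
    # are serialized even when many checks report at the same moment.
    for _ in range(expected_items):
        _name, prefix, action = results.get()
        reconfigure_bird(prefix, action)
        results.task_done()


checks = [
    ("CHECK1", "192.0.2.1/32", "del"),
    ("CHECK2", "192.0.2.2/32", "del"),
    ("CHECK3", "2001:db8::1/128", "del"),
]
threads = [threading.Thread(target=service_check, args=c) for c in checks]
for t in threads:
    t.start()
main_loop(len(checks))
for t in threads:
    t.join()
print(len(reconfigure_calls), "reconfigurations, run one after another")
```

With this shape, three checks failing at once still produce three reconfigurations, but they run back to back rather than concurrently.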
The issue is that I am not running that command multiple times in a few seconds. I am running it multiple times in one second. I have pasted sanitized logs below where you can see three checks removing their addresses at once. When this occurs, the first two checks will get removed but the third will not. I have tried reordering the checks to see if that matters, and it is always the third check that does not withdraw its address. I can manually run the "birdc reconfigure" command after this and have the address withdrawn. This is why I would like to attach multiple addresses to a single check.
Sanitized Logs:
2022-04-01 11:03:46,665 anycast-healthchecker[2745600] INFO MainThread returned an item from the queue for CHECK1 with IP prefix [IPv4 Address 1] and action to delete from Bird configuration
2022-04-01 11:03:46,666 anycast-healthchecker[2745600] INFO MainThread withdrawing [IPv4 Address 1] for CHECK1
2022-04-01 11:03:46,666 anycast-healthchecker[2745600] DEBUG MainThread going to write to /etc/bird/1648829026.666751
2022-04-01 11:03:46,667 anycast-healthchecker[2745600] INFO MainThread Bird configuration for IPv4 is updated
2022-04-01 11:03:46,668 anycast-healthchecker[2745600] INFO MainThread reconfiguring BIRD by running sudo /usr/sbin/birdc configure
2022-04-01 11:03:46,675 anycast-healthchecker[2745600] INFO MainThread reconfigured BIRD daemon
2022-04-01 11:03:46,675 anycast-healthchecker[2745600] INFO MainThread returned an item from the queue for CHECK2 with IP prefix [IPv4 Address 2] and action to delete from Bird configuration
2022-04-01 11:03:46,675 anycast-healthchecker[2745600] INFO MainThread withdrawing [IPv4 Address 2] for CHECK2
2022-04-01 11:03:46,675 anycast-healthchecker[2745600] DEBUG MainThread going to write to /etc/bird/1648829026.675965
2022-04-01 11:03:46,676 anycast-healthchecker[2745600] INFO MainThread Bird configuration for IPv4 is updated
2022-04-01 11:03:46,676 anycast-healthchecker[2745600] WARNING MainThread Bird configuration doesn't have IP prefixes for any of the services we monitor! It means local node doesn't receive any traffic
2022-04-01 11:03:46,676 anycast-healthchecker[2745600] INFO MainThread reconfiguring BIRD by running sudo /usr/sbin/birdc configure
2022-04-01 11:03:46,682 anycast-healthchecker[2745600] INFO MainThread reconfigured BIRD daemon
2022-04-01 11:03:46,682 anycast-healthchecker[2745600] INFO MainThread returned an item from the queue for CHECK3 with IP prefix [IPv6 Address 1] and action to delete from Bird configuration
2022-04-01 11:03:46,682 anycast-healthchecker[2745600] INFO MainThread withdrawing [IPv6 Address 1] for CHECK3
2022-04-01 11:03:46,682 anycast-healthchecker[2745600] DEBUG MainThread going to write to /etc/bird/1648829026.6829078
2022-04-01 11:03:46,683 anycast-healthchecker[2745600] INFO MainThread Bird configuration for IPv6 is updated
2022-04-01 11:03:46,683 anycast-healthchecker[2745600] WARNING MainThread Bird configuration doesn't have IP prefixes for any of the services we monitor! It means local node doesn't receive any traffic
2022-04-01 11:03:46,683 anycast-healthchecker[2745600] INFO MainThread reconfiguring BIRD by running sudo /usr/sbin/birdc configure
2022-04-01 11:03:46,689 anycast-healthchecker[2745600] INFO MainThread reconfigured BIRD daemon
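To quantify how close together the reconfigurations are, the timestamps of the three "reconfiguring BIRD" lines above can be compared. The snippet below is a small sketch for that purpose; the helper names `reconfigure_times` and `gaps_ms` are made up for this example, and it assumes the timestamp always occupies the first 23 characters of a line, as it does in these logs.

```python
from datetime import datetime, timedelta

# The three "reconfiguring BIRD" lines copied from the logs above.
LOG_LINES = [
    "2022-04-01 11:03:46,668 anycast-healthchecker[2745600] INFO MainThread reconfiguring BIRD by running sudo /usr/sbin/birdc configure",
    "2022-04-01 11:03:46,676 anycast-healthchecker[2745600] INFO MainThread reconfiguring BIRD by running sudo /usr/sbin/birdc configure",
    "2022-04-01 11:03:46,683 anycast-healthchecker[2745600] INFO MainThread reconfiguring BIRD by running sudo /usr/sbin/birdc configure",
]


def reconfigure_times(lines):
    """Return the timestamp of every line that announces a reconfiguration."""
    times = []
    for line in lines:
        if "reconfiguring BIRD" in line:
            # "YYYY-MM-DD HH:MM:SS,mmm" is exactly 23 characters long.
            times.append(datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f"))
    return times


def gaps_ms(times):
    """Return the whole milliseconds between consecutive timestamps."""
    one_ms = timedelta(milliseconds=1)
    return [(later - earlier) // one_ms for earlier, later in zip(times, times[1:])]


print(gaps_ms(reconfigure_times(LOG_LINES)))  # [8, 7]
```

So the three invocations are sequential, but only 8 ms and 7 ms apart.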
Thanks for the log. What do you see in the BIRD log? Have you seen this problem only with the IPv6 address?
For debugging purposes, can you add the following two lines before https://github.com/unixsurfer/anycast_healthchecker/blob/master/anycast_healthchecker/healthchecker.py#L220:
import time
time.sleep(1)
I am curious to see if that makes any difference.
Can you also try setting splay_startup but without the above code change?
This is the BIRD log from that time:
2022-04-01 11:03:46.674
This does not affect only IPv6; it affects any check after the first two, as ordered in the configuration, regardless of IP version. I just happened to have the IPv6 check third in order in that log.
If we are being honest, I have no idea how to implement the debug you have suggested.
I did apply splay_startup = 50 (no idea what the units are) and that resolved the issue I was having.
Actually, it would seem that the splay_startup setting only randomizes the failure. Depending on the amount of splay between tests, it may or may not correct the issue.
From the README:
splay_startup Unset by default
The maximum time to delay the startup of service checks. You can use either integer or floating-point number as a value.
In order to avoid launching all checks at the same time after anycast-healthchecker is started, we can delay the 1st check in a random way. This can be useful in cases where we have a lot of service checks and launching them all at the same time can overload the system. We randomize the delay of the 1st check for each service, and splay_startup sets the maximum time we can delay that 1st check.
The unit is seconds; it is a doc bug that we don't mention it :-)
At least you now have a workaround.
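As a rough illustration of what the README describes, the delay for each check's first run could be drawn like this. This is a sketch under assumptions, not the project's actual code: the function name `first_check_delay` is hypothetical, and a uniform distribution is assumed because the README only says the delay is random with splay_startup as its maximum.

```python
import random


def first_check_delay(splay_startup=None, rng=random):
    """Return how many seconds to wait before a check's first run."""
    if splay_startup is None:
        # Unset by default: every check starts immediately.
        return 0.0
    # Assumed uniform draw between 0 and splay_startup seconds.
    return rng.uniform(0, float(splay_startup))


if __name__ == "__main__":
    for name in ("CHECK1", "CHECK2", "CHECK3"):
        print(name, "starts after", round(first_check_delay(50), 2), "seconds")
```

Because each check gets an independent random delay, two checks can still land close together by chance, which would be consistent with the workaround helping only some of the time.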
Just to add a +1 to allowing a single check to affect multiple prefixes: it feels somewhat wasteful to health-check the same thing once per prefix.
I have been using this project for a while with BIRD2 for a single IPv4 and a single IPv6 address, using separate checks. I now need to add more addresses. I have tried using more checks and found that past 2 checks, the program doesn't withdraw and add the routes properly. This seems to be due to the "birdc configure" command being run 3 times in very quick succession, as my anycast-prefixes files do update properly. I would like to be able to add all of the addresses (I could end up with 12 or more on a single server) to a single check and have it run birdc configure only once to rectify this, but I can't figure out how to do this.
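One way to picture the requested behavior is to batch every pending result and reconfigure once per batch. The sketch below is purely hypothetical and is not how anycast-healthchecker works or a proposed patch: the helpers `drain` and `apply_batch` are invented for illustration, and the prefixes are documentation placeholders.

```python
import queue


def drain(pending):
    """Return every item currently waiting in the queue without blocking."""
    items = []
    while True:
        try:
            items.append(pending.get_nowait())
        except queue.Empty:
            return items


def apply_batch(prefixes, items):
    """Apply add/del actions to the prefix set; return True if it changed."""
    before = set(prefixes)
    for prefix, action in items:
        if action == "add":
            prefixes.add(prefix)
        elif action == "del":
            prefixes.discard(prefix)
    return prefixes != before


pending = queue.Queue()
announced = {"192.0.2.1/32", "192.0.2.2/32", "2001:db8::1/128"}
for prefix in sorted(announced):
    pending.put((prefix, "del"))

reconfigure_count = 0
if apply_batch(announced, drain(pending)):
    # Stand-in for a single `birdc configure` covering all three changes.
    reconfigure_count += 1
print(reconfigure_count, sorted(announced))  # 1 []
```

Three withdrawals arriving together then result in one reconfiguration instead of three.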