unixsurfer / anycast_healthchecker

A healthchecker for Anycasted services
Apache License 2.0

Cannot remove multiple IP addresses in a single check #30

Open chrisrogerson opened 2 years ago

chrisrogerson commented 2 years ago

I have been implementing this project for a while with BIRD2 for a single IPv4 and IPv6 IP address using separate checks. I have now run into a need to add more addresses. I have tried using more checks and found that, past 2 checks, the program doesn't withdraw and add the route properly. This seems to be due to the "birdc configure" command being run 3 times in very quick succession, as my anycast-prefixes files do update properly. I would like to be able to add all of the addresses (I could end up with 12 or more on a single server) to a single check and have it run birdc configure only once to rectify this, but I can't seem to figure out how to do this.

unixsurfer commented 2 years ago

Having multiple IPs for a single check is something useful and practical. But I want to better understand the failure you are describing, as it shouldn't happen. I have used it in environments with more than 50 checks and frequent updates and I never noticed this. Moreover, we update the prefix file atomically, so there is no way that bird will see more than one view of the content at a given time.

Could you please share some logs from when the problem occurred?
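For readers unfamiliar with atomic file updates, the idea mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `write_prefixes_atomically` and the one-prefix-per-line file format are assumptions made for the example. The content is written to a temporary file in the same directory and then renamed over the target, so a reader sees either the old file or the new one.

```python
import os
import tempfile


def write_prefixes_atomically(path, prefixes):
    """Replace the file at `path` so readers see old or new content, never a mix.

    Hypothetical helper; the one-prefix-per-line format is a simplification.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the final rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            for prefix in prefixes:
                handle.write(prefix + "\n")
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        os.unlink(tmp_path)  # do not leave the temporary file behind
        raise
```

Calling `write_prefixes_atomically("/tmp/anycast-prefixes.conf", ["192.0.2.1/32"])` would replace that file in one step. Atomic replacement only guarantees a consistent file; it says nothing about how quickly successive reconfigurations are processed.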

chrisrogerson commented 2 years ago

I use a single server as an anycast backup for the same service run on multiple addresses (AS path prepending in the peered router makes this server less preferred). I run the same script for multiple checks for multiple IP addresses (8 IPv4 and currently one IPv6). When that script fails, the server has 9 checks fail and has to run the "birdc configure" command 9 times at the same time, and that fails. This also happens when the checks recover and attempt to re-add the addresses. If I could add multiple IPs to a single check in the anycast-healthchecker config, it would likely resolve this issue for me, as it would run the birdc configure command only once when the check fails or recovers rather than 9 times all at once.

unixsurfer commented 2 years ago

We run birdc configure only once at a given time, not multiple times. There is a queue where service checks put their results, and the main thread picks up items one by one, so in this way we ensure that birdc configure never runs more than once at a given time. We may run it multiple times within a few seconds, but that hasn't been a problem.

Could you please share some logs where you notice multiple invocations of birdc configure?
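The queue arrangement described above can be sketched roughly like this. It is an illustration of the general pattern only, not the code in healthchecker.py: `service_check`, `main_loop` and the `reconfigure` callback are hypothetical names, and the prefixes come from the documentation range.

```python
import queue
import threading


def service_check(name, prefix, results):
    """Stand-in for a check thread: report that `prefix` must be withdrawn."""
    results.put((name, prefix, "del"))


def main_loop(results, expected, reconfigure):
    """Consume results one at a time, so `reconfigure` never runs concurrently."""
    handled = []
    while len(handled) < expected:
        name, prefix, action = results.get()  # blocks until an item arrives
        reconfigure(prefix, action)           # one invocation per queued item
        handled.append(name)
    return handled


results = queue.Queue()
calls = []
threads = [
    threading.Thread(
        target=service_check,
        args=(f"CHECK{i}", f"192.0.2.{i}/32", results),
    )
    for i in (1, 2, 3)
]
for thread in threads:
    thread.start()
handled = main_loop(
    results, expected=3, reconfigure=lambda p, a: calls.append((p, a))
)
for thread in threads:
    thread.join()
```

In this sketch the three results are handled strictly in sequence, but nothing spaces them out in time: if all checks fail together, the reconfigure callback still runs three times back to back.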

chrisrogerson commented 2 years ago

The issue is that I am not running that command multiple times in a few seconds. I am running it multiple times in one second. I have pasted sanitized logs below where you can see three checks removing their addresses at once. When this occurs, the first two addresses get removed but the third does not. I have tried reordering the checks to see if that matters, and it is always the third check that does not withdraw its address. I can manually run the "birdc reconfigure" command after this and have the address withdrawn. This is why I would like to attach multiple addresses to a single check.

Sanitized Logs:

    2022-04-01 11:03:46,665 anycast-healthchecker[2745600] INFO MainThread returned an item from the queue for CHECK1 with IP prefix [IPv4 Address 1] and action to delete from Bird configuration
    2022-04-01 11:03:46,666 anycast-healthchecker[2745600] INFO MainThread withdrawing [IPv4 Address 1] for CHECK1
    2022-04-01 11:03:46,666 anycast-healthchecker[2745600] DEBUG MainThread going to write to /etc/bird/1648829026.666751
    2022-04-01 11:03:46,667 anycast-healthchecker[2745600] INFO MainThread Bird configuration for IPv4 is updated
    2022-04-01 11:03:46,668 anycast-healthchecker[2745600] INFO MainThread reconfiguring BIRD by running sudo /usr/sbin/birdc configure
    2022-04-01 11:03:46,675 anycast-healthchecker[2745600] INFO MainThread reconfigured BIRD daemon
    2022-04-01 11:03:46,675 anycast-healthchecker[2745600] INFO MainThread returned an item from the queue for CHECK2 with IP prefix [IPv4 Address 2] and action to delete from Bird configuration
    2022-04-01 11:03:46,675 anycast-healthchecker[2745600] INFO MainThread withdrawing [IPv4 Address 2] for CHECK2
    2022-04-01 11:03:46,675 anycast-healthchecker[2745600] DEBUG MainThread going to write to /etc/bird/1648829026.675965
    2022-04-01 11:03:46,676 anycast-healthchecker[2745600] INFO MainThread Bird configuration for IPv4 is updated
    2022-04-01 11:03:46,676 anycast-healthchecker[2745600] WARNING MainThread Bird configuration doesn't have IP prefixes for any of the services we monitor! It means local node doesn't receive any traffic
    2022-04-01 11:03:46,676 anycast-healthchecker[2745600] INFO MainThread reconfiguring BIRD by running sudo /usr/sbin/birdc configure
    2022-04-01 11:03:46,682 anycast-healthchecker[2745600] INFO MainThread reconfigured BIRD daemon
    2022-04-01 11:03:46,682 anycast-healthchecker[2745600] INFO MainThread returned an item from the queue for CHECK3 with IP prefix [IPv6 Address 1] and action to delete from Bird configuration
    2022-04-01 11:03:46,682 anycast-healthchecker[2745600] INFO MainThread withdrawing [IPv6 Address 1] for CHECK3
    2022-04-01 11:03:46,682 anycast-healthchecker[2745600] DEBUG MainThread going to write to /etc/bird/1648829026.6829078
    2022-04-01 11:03:46,683 anycast-healthchecker[2745600] INFO MainThread Bird configuration for IPv6 is updated
    2022-04-01 11:03:46,683 anycast-healthchecker[2745600] WARNING MainThread Bird configuration doesn't have IP prefixes for any of the services we monitor! It means local node doesn't receive any traffic
    2022-04-01 11:03:46,683 anycast-healthchecker[2745600] INFO MainThread reconfiguring BIRD by running sudo /usr/sbin/birdc configure
    2022-04-01 11:03:46,689 anycast-healthchecker[2745600] INFO MainThread reconfigured BIRD daemon

unixsurfer commented 2 years ago

Thanks for the log. What do you see in the bird log? Have you seen this problem only with the IPv6 address?

For debugging purposes, can you add the following before https://github.com/unixsurfer/anycast_healthchecker/blob/master/anycast_healthchecker/healthchecker.py#L220

import time
time.sleep(1)

I am curious to see if that makes any difference.

Can you also try setting splay_startup but without the above code change?

chrisrogerson commented 2 years ago

This is the BIRD log from that time:

    2022-04-01 11:03:46.674 Reconfiguring
    2022-04-01 11:03:46.674 Reconfigured
    2022-04-01 11:03:46.681 Reconfiguring
    2022-04-01 11:03:46.681 Reloading channel SW5LAB.ipv4
    2022-04-01 11:03:46.681 Reconfigured
    2022-04-01 11:03:46.688 Reconfiguring
    2022-04-01 11:03:46.688 Reloading channel SW5LAB.ipv4
    2022-04-01 11:03:46.688 Reconfigured

This does not just affect IPv6; it affects any checks after the first two as ordered in the configuration, regardless of IP version. I just happened to have the IPv6 check as the third in order in that log.

If we are being honest, I have no idea how to implement the debug you have suggested.

I did apply splay_startup = 50 (no idea what the units are) and that resolved the issue I was having.

chrisrogerson commented 2 years ago

Actually, it would seem that the splay_startup setting only has the effect of randomizing the failure. Depending on the amount of splay between checks, it may or may not correct the issue.

unixsurfer commented 2 years ago

From the README:


    splay_startup Unset by default

The maximum time to delay the startup of service checks. You can use either integer or floating-point number as a value.

In order to avoid launching all checks at the same time, after anycast-healthchecker is started, we can delay the 1st check in a random way. This can be useful in cases where we have a lot of service checks and launching all of them at the same time can overload the system. We randomize the delay of the 1st check for each service and splay_startup sets the maximum time we can delay that 1st check.

The unit is seconds; it is a doc bug that we don't mention it :-)

At least you now have a workaround.
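As a rough illustration of what a startup splay does, the sketch below draws a random delay between zero and the configured maximum for each service. The helper `startup_delay` is a hypothetical name, and the use of a uniform distribution is an assumption; the README only says the delay is randomized up to splay_startup.

```python
import random


def startup_delay(splay_startup=None, rng=random):
    """Return the delay in seconds before a service's first check.

    Hypothetical helper: `None` models the "unset by default" case,
    and the uniform distribution is an assumption for illustration.
    """
    if splay_startup is None:  # unset: start immediately
        return 0.0
    return rng.uniform(0, float(splay_startup))


# Nine checks with splay_startup = 50 each get an independent delay.
delays = [startup_delay(50) for _ in range(9)]
```

Because each delay is drawn independently, nothing prevents two checks from landing close together, which would be consistent with the observation above that the splay sometimes helps and sometimes does not.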

danpoltawski commented 2 years ago

Just to add a +1 to allowing a single check to affect multiple prefixes. It feels somewhat wasteful to health-check the same thing once per prefix.
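One way the requested behavior could look, sketched under stated assumptions: pending results are drained from a queue in one go, all prefix changes are applied together, and the reconfigure step runs once per batch. This is a hypothetical design, not something anycast-healthchecker is known to implement; `drain`, `apply_batch` and the `reconfigure` callback are names invented for the example.

```python
import queue


def drain(results):
    """Collect every item currently in the queue without blocking."""
    items = []
    while True:
        try:
            items.append(results.get_nowait())
        except queue.Empty:
            return items


def apply_batch(prefixes, items, reconfigure):
    """Apply all pending add/del actions, then reconfigure once if anything changed."""
    updated = set(prefixes)
    for prefix, action in items:
        if action == "add":
            updated.add(prefix)
        else:
            updated.discard(prefix)
    if updated != set(prefixes):
        reconfigure()  # a single invocation for the whole batch
    return updated


announced = {"192.0.2.1/32", "192.0.2.2/32", "2001:db8::1/128"}
results = queue.Queue()
for prefix in sorted(announced):
    results.put((prefix, "del"))

calls = []
remaining = apply_batch(announced, drain(results), lambda: calls.append("configure"))
```

Here three withdrawals result in one reconfigure call instead of three. Whether batching would actually avoid the failure reported in this issue is untested.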