IRRd connection failure handling

bluikko commented 3 years ago

arouteserver could fail over to a secondary IRRd when the default IRRd rr.ntt.net is not accessible or not responsive.

In the past, rr.ntt.net spent half a day in a state where connections were accepted and queries could be sent, but a response was never received (a problem with the IRRd at NTT). The problem seems to be exacerbated by bgpq4 not having proper failure handling or timeouts in such a state: it took an excessive number of minutes (dozens?) for bgpq4 to time out (I am not 100% sure the queries did time out; maybe @job has some insight into this).

While rr.ntt.net was unresponsive, it also came to light that there is a secondary public IRRd, rr1.ntt.net (which also includes IPv6 support!), so it could be possible to fail over from the primary IRRd to the secondary IRRd if the former has a problem.

pierky commented 3 years ago

Hello @bluikko,

in 1a7fdfaffda6dda690a05bf9089b3d5330d04eff I've introduced a mechanism that monitors the execution time of bgpq3/bgpq4 and kills the sub-process when it seems stuck. The timeout can be set in the program's configuration file (arouteserver.yml, bgpq3_timeout setting). Also, the setting where the IRRd host is configured (bgpq3_host) now accepts a list of hosts; when a query fails (either because of a timeout or other issues), the next host in that list is used. If all the hosts in the list fail, the process is aborted.
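To illustrate the idea, here is a minimal sketch of a timeout plus host failover loop. It is not the code from the commit above: the names run_with_failover and bgpq4_cmd are hypothetical, and the bgpq4 arguments are simply the ones from the timing test mentioned later in this thread.

```python
import subprocess


def run_with_failover(hosts, build_cmd, timeout):
    """Try each host in order; return (host, stdout) for the first success.

    hosts:     list of IRRd host names, tried in the given order
    build_cmd: callable that returns the argv list for a given host
    timeout:   seconds to wait before the sub-process is killed
    """
    errors = []
    for host in hosts:
        cmd = build_cmd(host)
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # when the timeout elapses before the command completes.
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            errors.append(f"{host}: timed out after {timeout}s")
            continue
        except OSError as exc:
            # For example, the executable could not be found.
            errors.append(f"{host}: {exc}")
            continue
        if result.returncode == 0:
            return host, result.stdout
        errors.append(f"{host}: exit code {result.returncode}")
    # Every host in the list failed: abort.
    raise RuntimeError("all IRRd hosts failed: " + "; ".join(errors))


def bgpq4_cmd(host):
    # Hypothetical helper: same arguments as the timing test in this thread.
    return ["bgpq4", "-h", host, "-S", "RADB", "-3", "-j", "-4", "-A",
            "-l", "prefix_list", "AS-HURRICANE"]
```

With bgpq4 installed, it could be called as run_with_failover(["rr.ntt.net", "rr1.ntt.net"], bgpq4_cmd, timeout=120), which moves on to rr1.ntt.net only if the query against rr.ntt.net times out or fails.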

The sum of these 2 mechanisms should provide a solution to the issues that you've mentioned. I'd like to hear your feedback on it.

Also, I'm not too sure about the default timeout I'm proposing. At the moment I've set it to 2 minutes, which I think should be enough to complete queries against big data sets. What are your thoughts on this value?

I've used time bgpq4 -h rr.ntt.net -S RADB -3 -j -4 -A -l prefix_list AS-HURRICANE (so, a query against HE's rset) to get an idea of how long a big query could take, which gave me results in the range 7-100 seconds.
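For anyone who wants to repeat that measurement several times, a small sketch like the following could collect the fastest and slowest run. The function name time_command and the default number of runs are assumptions made for illustration.

```python
import subprocess
import time


def time_command(cmd, runs=3):
    """Run cmd several times; return (fastest, slowest) wall-clock seconds."""
    durations = []
    for _ in range(runs):
        start = time.monotonic()
        # Output is discarded: only the elapsed time matters here.
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        durations.append(time.monotonic() - start)
    return min(durations), max(durations)
```

Passing it the bgpq4 argument list from the command above (as a Python list of strings) would give a rough range to compare against a candidate timeout.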

bluikko commented 3 years ago

Sounds good to me! If there is a release candidate I can try to test it - but I don't have a good idea of how to properly replicate the IRRd failure that happened earlier.

I tested the query and consistently get under 20 seconds. That is the largest AS-set I am aware of, so 2 minutes sounds reasonable; better to have a timeout that is too large than one that is too small.

pierky commented 3 years ago

Thanks for the feedback @bluikko, and for volunteering to test the candidate release.

I've pushed https://test.pypi.org/project/arouteserver/1.11.0a1/ and the 1.11.0-alpha1 tag on DockerHub. Instructions on how to install alpha pre-releases can be found at https://arouteserver.readthedocs.io/en/latest/INSTALLATION.html#development-and-pre-release-versions.

bluikko commented 3 years ago

Tested the failover mechanism and it looks good to me.

pierky commented 3 years ago

Thanks @bluikko, I've just pushed the latest changes to master. The CI/CD pipeline should complete in 1 hour, and if everything goes well 1.11.0 should be out with this new feature.

pierky / arouteserver

IRRd connection failure handling #85