bluikko closed 3 years ago
Hello @bluikko,
in 1a7fdfaffda6dda690a05bf9089b3d5330d04eff I've introduced a mechanism that monitors the execution time of bgpq3/bgpq4 and kills the sub-process when it seems stuck. The timeout can be set in the program's configuration file (arouteserver.yml, bgpq3_timeout setting). Also, the setting where the IRRd host is configured (bgpq3_host) now accepts a list of hosts; when a query fails (either because of a timeout or other issues), the next host in that list is used. If all the hosts in the list time out, the process is aborted.
Together, these two mechanisms should provide a solution to the issues that you've mentioned. I'd like to hear your feedback on it.
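To make the behaviour concrete, here is a minimal sketch of the timeout-plus-failover idea. It is not the code from the commit: the function names query_with_failover, bgpq4_cmd and stand_in_cmd are hypothetical, and it assumes the timeout is expressed in seconds. The bgpq4 flags are the ones used elsewhere in this thread.

```python
import subprocess
import sys


def query_with_failover(hosts, build_cmd, timeout):
    """Try each host in order; return (host, stdout) for the first success.

    build_cmd is a callable that turns a host name into an argv list.
    """
    errors = []
    for host in hosts:
        try:
            # subprocess.run kills the child process and raises
            # TimeoutExpired once `timeout` seconds have elapsed.
            result = subprocess.run(
                build_cmd(host),
                capture_output=True,
                text=True,
                timeout=timeout,
                check=True,
            )
            return host, result.stdout
        except subprocess.TimeoutExpired:
            errors.append(f"{host}: timed out after {timeout}s")
        except (subprocess.CalledProcessError, OSError) as exc:
            # Non-zero exit status or missing executable: try the next host.
            errors.append(f"{host}: {exc}")
    # Every host failed, so give up instead of hanging.
    raise RuntimeError("all hosts failed: " + "; ".join(errors))


def bgpq4_cmd(host):
    # Same flags as the test query quoted in this thread.
    return ["bgpq4", "-h", host, "-S", "RADB", "-3", "-j", "-4", "-A",
            "-l", "prefix_list", "AS-HURRICANE"]


def stand_in_cmd(host):
    # Hypothetical stand-in so the sketch runs without bgpq4 installed:
    # the host named "stuck" hangs, any other host answers immediately.
    script = "import time; time.sleep(60)" if host == "stuck" else "print('ok')"
    return [sys.executable, "-c", script]
```

With bgpq4 installed, a call such as query_with_failover(["rr.ntt.net", "rr1.ntt.net"], bgpq4_cmd, timeout=120) would mirror the 2-minute default discussed here; stand_in_cmd only exists to exercise the failover path locally.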
Also, I'm not too sure about the default timeout I'm proposing. At the moment I've set 2 minutes, which I think should be enough to complete queries against big data-sets. What are your thoughts on this value?
I've used time bgpq4 -h rr.ntt.net -S RADB -3 -j -4 -A -l prefix_list AS-HURRICANE (so, a query against HE's AS-set) to get an idea of how long a big query could take, which gave me results in the range 7-100 seconds.
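For anyone who wants to repeat the measurement several times, a small sketch along these lines could help. The helper time_command and the NOOP_CMD placeholder are hypothetical names, not part of arouteserver; the idea is only to collect wall-clock durations and compare the slowest run against the chosen timeout.

```python
import subprocess
import sys
import time

# Stand-in command so the sketch runs without bgpq4 installed.
NOOP_CMD = [sys.executable, "-c", "pass"]


def time_command(cmd, runs=3):
    """Run cmd `runs` times; return each run's wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded: only the elapsed time matters here.
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        durations.append(time.perf_counter() - start)
    return durations
```

Replacing NOOP_CMD with the bgpq4 command line above and looking at max(time_command(...)) gives a rough upper bound to compare against the timeout.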
Sounds good to me! If there is a release candidate I can try to test it, but I don't have a good idea of how to properly replicate the IRRd failure that happened earlier.
I tested the query and consistently get results in under 20 seconds. That is the largest AS-set I am aware of, so 2 minutes sounds reasonable; better to have a timeout that is too large than one that is too small.
Thanks for the feedback @bluikko, and for volunteering to test the candidate release.
I've pushed https://test.pypi.org/project/arouteserver/1.11.0a1/ and the 1.11.0-alpha1 tag on DockerHub. Instructions on how to install alpha pre-releases can be found at https://arouteserver.readthedocs.io/en/latest/INSTALLATION.html#development-and-pre-release-versions.
Tested the failover mechanism and it looks good to me.
Thanks @bluikko, I've just pushed the latest changes to master. The CI/CD pipeline should complete in 1 hour and, if everything goes well, 1.11.0 should be out with this new feature.
arouteserver could fail over to a secondary IRRd when the default IRRd rr.ntt.net is not accessible or not responsive.
In the past, rr.ntt.net spent half a day in a state where connections were accepted and queries could be sent, but a response was never received (a problem with the IRRd at NTT). The problem seems to be exacerbated by bgpq4 not having proper failure handling/timeouts in such a state: it took an excessive number of minutes (dozens?) for bgpq4 to time out (I am not 100% sure the queries did time out; maybe @job has some insight into this).
While rr.ntt.net was unresponsive it was also revealed that there is a secondary public IRRd, rr1.ntt.net (which also includes IPv6 support!), so it could be possible to fail over from the primary IRRd to the secondary IRRd in case of a problem with the former.