pierky / arouteserver

A tool to automatically build (and test) feature-rich configurations for BGP route servers.
https://arouteserver.readthedocs.org/
GNU General Public License v3.0

Increase timeout for bgpq whois queries #93

Closed bluikko closed 2 years ago

bluikko commented 2 years ago

It looks like the performance of rr.ntt.net/rr1.ntt.net varies. Earlier there was no problem looking up the two very large AS-SETs AS-HURRICANE/AS-HURRICANEV6, but now the queries time out. The 120-second timeout used to leave what looked like a plentiful margin, but this shows the margin is insufficient.
Other queries are OK.

To be on the safe side, I suggest increasing the bgpq timeout by 200%.

ARouteServer 2021-12-28 04:08:57,050 WARNING bgpq4 timed out while running the following command: 'bgpq4 -h rr.ntt.net -S RADB,RIPE,APNIC,AFRINIC,ARIN,NTTCOM,ALTDB,BBOI,BELL,JPIRR,LEVEL3,RADB,RGNET,TC -3 -j -4 -A -l prefix_list -R 32 AS-HURRICANE AS-HURRICANEV6' The host rr.ntt.net will not be used for the next IRR queries. The timeout is 120 seconds; to modify it, please edit the program's configuration file (usually arouteserver.yml) and change the 'bgpq3_timeout' setting. - Another attempt will be performed using the next host in the list.
ARouteServer 2021-12-28 04:10:57,118 ERROR Error while retrieving IPv4 prefixes from RADB::AS-HURRICANE, RADB::AS-HURRICANEV6 for client AS6939_1, client AS6939_2: Can't get authorized prefix list for RADB::AS-HURRICANE, RADB::AS-HURRICANEV6 IPv4: bgpq4 timed out while running the following command: 'bgpq4 -h rr1.ntt.net -S RADB,RIPE,APNIC,AFRINIC,ARIN,NTTCOM,ALTDB,BBOI,BELL,JPIRR,LEVEL3,RADB,RGNET,TC -3 -j -4 -A -l prefix_list -R 32 AS-HURRICANE AS-HURRICANEV6' The host rr1.ntt.net will not be used for the next IRR queries. The timeout is 120 seconds; to modify it, please edit the program's configuration file (usually arouteserver.yml) and change the 'bgpq3_timeout' setting. - No more attempts will be performed, all the hosts in the list failed.
ARouteServer 2021-12-28 04:10:57,118 ERROR Error while retrieving IPv6 prefixes from RADB::AS-HURRICANE, RADB::AS-HURRICANEV6 for client AS6939_1, client AS6939_2: Can't get authorized prefix list for RADB::AS-HURRICANE, RADB::AS-HURRICANEV6 IPv6: All the IRRD hosts timed out so far; there are no more hosts to use to perform the IRR queries.
ARouteServer 2021-12-28 04:10:57,119 ERROR Enricher 'IRRdb prefixes' completed with errors after 244 seconds

Question: can bgpq3_timeout be set in an environment variable? That would be convenient for the arouteserver Docker image. (Never mind, mounting a customized arouteserver.yml is just as easy.)

Edit: another option would be to increase the timeout exponentially on each fallback when multiple whois servers are configured.
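
The fallback idea above can be sketched roughly as follows. This is only an illustration, not how ARouteServer handles timeouts: `backoff_timeouts`, `query_with_fallback`, the `run_query` callable and the default factor of 2 are all hypothetical names and values chosen for the example. Only the 120-second base and the two host names come from the log above.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def backoff_timeouts(
    base: float, hosts: Sequence[str], factor: float = 2.0
) -> List[Tuple[str, float]]:
    """Pair each host with a timeout that grows by `factor` on every fallback."""
    return [(host, base * factor ** attempt) for attempt, host in enumerate(hosts)]


def query_with_fallback(
    hosts: Sequence[str],
    base: float,
    run_query: Callable[[str, float], object],
    factor: float = 2.0,
) -> Optional[object]:
    """Try each host in order; return the first result, or None if all time out.

    `run_query(host, timeout)` is a hypothetical callable that performs the
    IRR query and raises TimeoutError when the timeout expires.
    """
    for host, timeout in backoff_timeouts(base, hosts, factor):
        try:
            return run_query(host, timeout)
        except TimeoutError:
            continue  # move on to the next host with a longer timeout
    return None


if __name__ == "__main__":
    # 120 s for the first host, 240 s for the second (assumed factor of 2).
    print(backoff_timeouts(120, ["rr.ntt.net", "rr1.ntt.net"]))
```

With two hosts and a factor of 2, the second host would get 240 seconds, so the scheme only helps when the slowdown is moderate.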

bluikko commented 2 years ago

Seems that the performance of both (?!) rr.ntt.net and rr1.ntt.net has tanked - even a timeout of 360 seconds is not enough. It is interesting that both of the servers show the same behavior. I wonder if there's some issue that affects both of the servers.

As a test I increased bgpq3_timeout to 1200 and even that is not enough. The servers must have issues.

While looking at this I noticed that the argument -R 32 is passed to bgpq4 when retrieving IPv4 prefixes. I wonder if this might be a place for optimization: could cfg.filtering.ipv4_pref_len.max be used for -R instead, since prefixes longer than that would not be allowed anyway? The same would apply to IPv6 prefixes. (Note: I know nothing about how this works internally in arouteserver, so maybe this is a bad idea.)
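
A minimal sketch of that suggestion, assuming the configured maximum prefix length is simply passed in as an integer. `bgpq_max_len_args` is a hypothetical helper, not part of ARouteServer, and the IPv6 fallback of 128 is an assumption based on the address length; only `-R 32` for IPv4 appears in the logs above. Whether a smaller `-R` value actually shortens the query is not established here.

```python
from typing import List, Optional

# Full address length per IP version; used when no maximum is configured.
# The IPv6 value is an assumption: only "-R 32" for IPv4 is seen in the logs.
FULL_LENGTH = {4: 32, 6: 128}


def bgpq_max_len_args(ip_version: int, max_pref_len: Optional[int] = None) -> List[str]:
    """Build the '-R <len>' arguments for a bgpq4 call.

    `max_pref_len` stands in for a setting such as
    cfg.filtering.ipv4_pref_len.max; None keeps the current behaviour.
    """
    full_len = FULL_LENGTH[ip_version]
    length = full_len if max_pref_len is None else min(max_pref_len, full_len)
    return ["-R", str(length)]


if __name__ == "__main__":
    print(bgpq_max_len_args(4))       # current behaviour for IPv4
    print(bgpq_max_len_args(4, 24))   # capped at a configured maximum of /24
```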

Edit: finally it completed, with:

INFO Enricher 'IRRdb prefixes' completed successfully after 1395 seconds

job commented 2 years ago

Are you using bgpq3 or bgpq4?

bluikko commented 2 years ago

> Are you using bgpq3 or bgpq4?

It is the default bgpq in the arouteserver Docker image. It should be bgpq4, since a week ago the same run took less than a minute.

job commented 2 years ago

Not exactly addressing your question, but have you considered running your own IRRd instance? http://irrd.net/ - that way you create a local copy of the IRR data that you can hammer

bluikko commented 2 years ago

> Not exactly addressing your question, but have you considered running your own IRRd instance? https://irrd.net/ - that way you create a local copy of the IRR data that you can hammer

Yes, every time there is an issue with the default servers... but right now I do not have 300 GB of SSD space available.

pierky commented 2 years ago

Hi @bluikko, thanks for the details here.

Not sure that using a backoff value to increase the timeout when different servers are used would help much. Ideally, when a different server is attempted after a previous one failed, the original value of timeout should be enough to answer the query. If the issue is within the server itself, switching to a new one should be the solution, and having a longer timeout should be irrelevant. Of course, I see how the reality may be different than that 😉

Also, in this case the backoff multiplier wouldn't have helped to reach the final target value of 1400 seconds (unless we used a ~12x multiplier, which seems a bit aggressive 😆).
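
The arithmetic behind that estimate can be checked with a few lines. This is a rough sketch: `required_factor` is a hypothetical helper, and it treats the 1395-second enricher run reported earlier in the thread as an upper bound on the time the slow query needed, which is an assumption.

```python
def required_factor(base: float, target: float, attempts: int) -> float:
    """Smallest multiplier f such that base * f ** (attempts - 1) >= target.

    `attempts` is the number of hosts tried in sequence; the last one gets
    the longest timeout.
    """
    if attempts < 2:
        raise ValueError("a multiplier only matters with at least two attempts")
    return (target / base) ** (1.0 / (attempts - 1))


if __name__ == "__main__":
    # 120 s base timeout and two hosts, as in the log; 1395 s observed run.
    print(round(required_factor(120, 1395, 2), 2))
```

With only two hosts in the list, the second attempt would need a timeout about 11.6 times the base value, which matches the ~12x figure.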

I'll look into the possible optimisation of changing the -R 32, but I suspect it wouldn't help in similar circumstances.

bluikko commented 2 years ago

@pierky True. Do the servers work for you?

bluikko commented 2 years ago

The NTT IRR server performance seems to be back to usual:

INFO Enricher 'IRRdb prefixes' completed successfully after 245 seconds

pierky commented 2 years ago

Hello @bluikko,

thanks for the update.

I've tried running bgpq4 with -R 32 (as ARouteServer does) and also with -R 24, and both queries took the same time to complete. Since changing -R to <max_pref_len> is unlikely to improve performance, I'm inclined to keep the current implementation as it is for the time being.
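
A comparison like this one could be reproduced with a small timing script. The sketch below is not the procedure actually used; `time_command` and `bgpq4_command` are hypothetical helpers. The bgpq4 flags are copied from the command in the log at the top of this issue, with the `-S` source list shortened for the example, and the script needs bgpq4 installed to run the real queries.

```python
import subprocess
import time
from typing import List, Optional, Sequence


def time_command(cmd: Sequence[str], timeout: float) -> Optional[float]:
    """Run `cmd`, discard its output and return the elapsed seconds.

    Returns None if the command does not finish within `timeout` seconds.
    Raises FileNotFoundError if the executable is not installed.
    """
    start = time.perf_counter()
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=False,
        )
    except subprocess.TimeoutExpired:
        return None
    return time.perf_counter() - start


def bgpq4_command(host: str, sources: str, max_len: int, as_sets: Sequence[str]) -> List[str]:
    """Build an IPv4 bgpq4 command line using the flags seen in the log."""
    return [
        "bgpq4", "-h", host, "-S", sources,
        "-3", "-j", "-4", "-A", "-l", "prefix_list",
        "-R", str(max_len), *as_sets,
    ]


if __name__ == "__main__":
    for max_len in (32, 24):
        cmd = bgpq4_command("rr.ntt.net", "RADB,RIPE", max_len, ["AS-HURRICANE"])
        print(max_len, time_command(cmd, timeout=1200))
```

A single run per value says little when server load varies as much as described in this thread, so repeating each measurement a few times would give a fairer comparison.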

I'm going to close the ticket, but please feel free to reopen it should you have any further concerns or comments in this regard. Thanks again.