inex / IXP-Manager

Full stack web application powering peering at over 200 Internet Exchange Points (IXPs) globally.
https://www.ixpmanager.org/
GNU General Public License v2.0

DB-timeouts when a customer has 10s of thousands of prefixes #220

Closed rowanthorpe closed 9 years ago

rowanthorpe commented 9 years ago

We just added a customer whose AS-SET expands to a set of AS numbers which result in tens-of-thousands of route-objects from the various registries. When that customer is set as a route-server client with strict filtering enabled, the update-prefixes-and-asns cronjob takes so long (minutes) on that customer that our DB (MySQL) has a connection timeout, even when we extend MySQL's connection timeout limit to risky values. It appears (but I haven't dug into code to confirm yet) that ixptool.php -a 'irrdb-cli.update-prefix-db' and/or ixptool.php -a 'irrdb-cli.update-asn-db' hold the db-connection open throughout processing of each customer. If that is the case then in our situation the sensible/safe alternative is to close/reopen the db-handle during remote lookup of prefixes per-customer (in the meantime we have disabled strict filtering for that customer as a workaround). I hope this is a sane proposal, and I will try to see if this is easy to implement, but it may go too deep in core-code for me to do something with, without a lot of hair-pulling...
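The close/reopen idea proposed above can be sketched roughly as follows. This is an illustration in Python, not IXP Manager's PHP code: `sqlite3` stands in for MySQL, and `slow_irrdb_lookup`, `update_prefixes`, the `prefixes` table and the AS-SET names are all hypothetical. The point is only that no database handle is held open while the slow registry lookup runs.

```python
import os
import sqlite3
import tempfile
import time


def slow_irrdb_lookup(as_set):
    """Hypothetical stand-in for a remote IRRDB query that may take minutes."""
    time.sleep(0.01)  # pretend this is the slow part
    return ["192.0.2.0/24", "198.51.100.0/24"]


def update_prefixes(db_path, customers):
    """Open a short-lived connection per customer, only after the slow lookup."""
    for name, as_set in customers:
        # No database handle is open while waiting on the registry.
        prefixes = slow_irrdb_lookup(as_set)

        conn = sqlite3.connect(db_path)
        try:
            with conn:  # one transaction per customer, committed on success
                conn.execute("DELETE FROM prefixes WHERE customer = ?", (name,))
                conn.executemany(
                    "INSERT INTO prefixes (customer, prefix) VALUES (?, ?)",
                    [(name, p) for p in prefixes],
                )
        finally:
            conn.close()


def demo():
    """Create a throwaway database, run the update, and count stored rows."""
    db_path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE prefixes (customer TEXT, prefix TEXT)")
    conn.commit()
    conn.close()

    update_prefixes(db_path, [("AAAA", "AS-EXAMPLE-A"), ("BBBB", "AS-EXAMPLE-B")])

    conn = sqlite3.connect(db_path)
    count = conn.execute("SELECT COUNT(*) FROM prefixes").fetchone()[0]
    conn.close()
    return count
```

The trade-off is one extra connect per customer, which is usually negligible next to a lookup that takes minutes.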

barryo commented 9 years ago

What's the actual error message? After how many seconds does the timeout occur?

nickhilliard commented 9 years ago

this seems a bit strange. the largest as-set we pull at inex has ~300k entries and it works fine.

barryo commented 9 years ago

Also, so we can replicate, what's the ASN/AS-SET? Send offlist if you prefer.

rowanthorpe commented 9 years ago

@barryo As for the error messages - the cronjob in normal mode was just sending us emails with the following standard opaque MySQL error-text each time it ran:

PHP Warning:  PDO::beginTransaction(): MySQL server has gone away in
/usr/share/php/Doctrine/DBAL/Connection.php on line 959
PHP Warning:  PDO::beginTransaction(): Error reading result set's header in
/usr/share/php/Doctrine/DBAL/Connection.php on line 959
ERROR: SQLSTATE[HY000]: General error: 2006 MySQL server has gone away

and after going through the familiar rigmarole of eliminating other possible causes of that error, I then ran the ixptool update commands with the verbose optflag, which always produced output of the form shown below. Note that between printing the last debug output for the problematic customer and outputting the PHP error there would always be a pause of several minutes (from memory it was about 2 minutes, which makes sense because we have connect_timeout = 120 in our mysql config, so I would bet money it was exactly 120 seconds; if you want to know the exact timeout anyway, let me know and I'll ask someone to set up the equivalent situation on our staging-server/RS and run it there):

>>>> Processing AAAA: [IPv4: IRRDB ###; 0 stale; 0 new; DB updated] [IPv6: IRRDB ###; 0 stale; 0 new; DB updated]
>>>> Processing BBBB: [IPv4: IRRDB ###; 0 stale; 0 new; DB updated] [IPv6: IRRDB ###; 0 stale; 0 new; DB updated]
>>>> ....
>>>> ..[snip]..
>>>> ....
>>>> Processing JJJJ: [IPv4: IRRDB ###; 0 stale; 0 newPHP Warning: PDO::beginTransaction(): MySQL server has gone away in /usr/share/php/Doctrine/DBAL/Connection.php on line 959
>>>> PHP Warning:  PDO::beginTransaction(): Error reading result set's header in /usr/share/php/Doctrine/DBAL/Connection.php on line 959
>>>> ERROR: SQLSTATE[HY000]: General error: 2006 MySQL server has gone away

I don't know if it matters to send the ASN/AS-SET offlist or not, so just in case, I will do it that way.

rowanthorpe commented 9 years ago

I just made another discovery while getting the ASN, which I think shifts the troubleshooting goalposts a bit, but either way I think still indicates a "bug" (even if just a bug in the handling of a misconfiguration). The customer we added hasn't yet given us the exact ASN we should use (which will resolve to exactly one prefix), so in the meantime a more generic catchall ASN for them was entered (which we now realise resolves to many thousands of prefixes). The max-prefixes for the customer had been entered as 10 in the ixp-m interface though (in anticipation of receiving the single prefix). I would guess that is at least related to the reason for ixptool timing out, but even if that is the case I think it should at least fail instantly and noisily - indicating having hit the prefix-limit - rather than freezing silently until the database times out. I'll still send the ASN now in case that is a red herring though.

barryo commented 9 years ago

@rowanthorpe prefix limits are not considered when polling IRRDBs; they are only used for generating router configurations.

The error in question:

ERROR: SQLSTATE[HY000]: General error: 2006 MySQL server has gone away

indicates a PHP MySQL client issue, and one we have never seen before. In fact, with our own install of ~60 members using route servers (plus other larger installs), this does not occur. The references below offer some ways of catching the exception and reconnecting under pdo_mysql, but this has been solved internally in Doctrine 2.5.
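As a rough illustration of the catch-and-reconnect pattern mentioned above, here is a minimal sketch in Python. It is not Doctrine's or pdo_mysql's API: `ConnectionLost`, `FlakyConnection` and `run_with_reconnect` are hypothetical names, and the toy connection simply fails once to simulate a handle the server dropped while the client was idle.

```python
class ConnectionLost(Exception):
    """Hypothetical error standing in for MySQL error 2006 (server has gone away)."""


class FlakyConnection:
    """Toy connection that can fail once, to exercise the retry path."""

    def __init__(self, fail_first=False):
        self.fail_first = fail_first
        self.rows = []

    def execute(self, row):
        if self.fail_first:
            self.fail_first = False
            raise ConnectionLost("2006 MySQL server has gone away")
        self.rows.append(row)


def run_with_reconnect(connect, work, max_attempts=2):
    """Run work(conn); on ConnectionLost, open a fresh connection and retry."""
    conn = connect()
    for attempt in range(1, max_attempts + 1):
        try:
            return work(conn)
        except ConnectionLost:
            if attempt == max_attempts:
                raise  # give up noisily rather than loop forever
            conn = connect()


def demo():
    opened = []

    def connect():
        # The first connection is stale; later ones are healthy.
        conn = FlakyConnection(fail_first=(len(opened) == 0))
        opened.append(conn)
        return conn

    def work(conn):
        conn.execute("192.0.2.0/24")
        return conn.rows

    rows = run_with_reconnect(connect, work)
    return len(opened), rows
```

A retry like this is only safe when `work` restarts its whole unit of work (for example, one full transaction) on the new connection, since anything uncommitted on the lost connection is gone.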

One option you could try is to set mysqli.reconnect = On in php.ini and switch to mysqli in application.ini:

-resources.doctrine2.connection.options.driver = 'pdo_mysql'
+resources.doctrine2.connection.options.driver = 'mysqli'

This is probably a won't fix for us because:

References: