lilmuckers / magento-lilmuckers_queue

A generic multi-adapter queuing system for magento.
MIT License

Database connection timeouts when called via shell/queue.php #22

Closed: geerlingguy closed this issue 7 years ago

geerlingguy commented 7 years ago

For my site, I have multiple queue workers using Beanstalkd as a queue backend, and I start them with shell/queue.php --watch [queue-name].

These queues seem to work fine, but after some time (I'm guessing after MySQL's wait_timeout has been reached on the database connection), there's an exception logged:

Next Zend_Db_Statement_Exception: SQLSTATE[HY000]: General error: 2006 MySQL server has gone away in /var/www/html/lib/Zend/Db/Statement/Pdo.php:241
Stack trace:
#0 /var/www/html/lib/Varien/Db/Statement/Pdo/Mysql.php(110): Zend_Db_Statement_Pdo->_execute(Array)
#1 /var/www/html/app/code/core/Zend/Db/Statement.php(291): Varien_Db_Statement_Pdo_Mysql->_execute(Array)
#2 /var/www/html/lib/Zend/Db/Adapter/Abstract.php(479): Zend_Db_Statement->execute(Array)
#3 /var/www/html/lib/Zend/Db/Adapter/Pdo/Abstract.php(238): Zend_Db_Adapter_Abstract->query('SELECT COUNT(*)...', Array)
#4 /var/www/html/lib/Varien/Db/Adapter/Pdo/Mysql.php(428): Zend_Db_Adapter_Pdo_Abstract->query('SELECT COUNT(*)...', Array)
#5 /var/www/html/lib/Zend/Db/Adapter/Abstract.php(825): Varien_Db_Adapter_Pdo_Mysql->query('SELECT COUNT(*)...', Array)
#6 /var/www/html/lib/Varien/Data/Collection/Db.php(225): Zend_Db_Adapter_Abstract->fetchOne(Object(Varien_Db_Select), Array)
#7 /var/www/html/app/code/local/Model/Event.php(122): Varien_Data_Collection_Db->getSize()
#8 /var/www/html/app/code/local/Model/Queue/ReportWorker.php(52): Model_Event->sendEvents(1)
#9 /var/www/html/app/code/community/Lilmuckers/Queue/Model/Queue/Task.php(427): Model_Queue_ReportWorker->sendEvents(Object(Lilmuckers_Queue_Model_Queue_Task))
#10 /var/www/html/app/code/community/Lilmuckers/Queue/Model/Queue/Abstract.php(214): Lilmuckers_Queue_Model_Queue_Task->run()
#11 /var/www/html/shell/queue.php(194): Lilmuckers_Queue_Model_Queue_Abstract->runTask(Object(Lilmuckers_Queue_Model_Queue_Task))
#12 /var/www/html/shell/queue.php(60): Lilmuckers_Shell_Queue->_watch(Array)
#13 /var/www/html/shell/queue.php(231): Lilmuckers_Shell_Queue->run()
#14 {main}

It seems there was a fix for this issue in 2014 (see these lines specifically, which moved the db connection handling outside of the while() loop: https://github.com/lilmuckers/magento-lilmuckers_queue/commit/715c75a786e918024a6b97bb6fdf1eea350ce315#diff-06ed6044f3f923b9c36f7a31eb80e753R180), as part of issue #5: https://github.com/lilmuckers/magento-lilmuckers_queue/issues/5

The code for the db connection handling seems to have moved back inside the while() loop since that fix was applied (see: https://github.com/lilmuckers/magento-lilmuckers_queue/blob/master/shell/queue.php#L183-L210), but I'm wondering if that was an accidental reversion?

Or might I just be reading the new code wrong?
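
For clarity, this is roughly the old pattern I'm worried has come back - a hand-written sketch for this issue, not the actual shell/queue.php source, with waitForNextJob() and runJob() as hypothetical stand-ins for the beanstalkd reserve and the Lilmuckers task runner:

<?php
// Sketch only: "connection handling inside the while() loop", meaning the
// adapter gets closed (and lazily reopened by the next query) on every pass.
require_once 'app/Mage.php';
Mage::app();

$adapter = Mage::getSingleton('core/resource')->getConnection('core_write');

while (true) {
    $job = waitForNextJob();      // hypothetical: blocks until beanstalkd hands over a task
    $adapter->closeConnection();  // inside the loop: the connection is flipped every iteration
    runJob($job);                 // hypothetical: the first query in here reconnects lazily
}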

lilmuckers commented 7 years ago

So the old code closed the connection as soon as the process ran - and then tried a reconnect when the process failed. The problem with that was that every iteration through the while loop (depending on the beanstalkd timeouts) was closing and reopening the DB connection - which caused issues further down the pipeline with MySQL once you scaled out the workers. This also happened whenever a job finished, so if you had dozens of really fast jobs in there (say you're sending a newsletter email to 200 people) you'd end up blocking the connections to MySQL. Basically it caused the main Magento stack to start shitting itself.

The most recent change was watching for these disconnect events, throwing an exception for them, and reconnecting based on that exception event. The idea there being that instead of constantly flipping the MySQL connection, it's only flipped when absolutely required. Additionally, there was an issue where Magento would periodically think that a connection was still open even though it had dropped - so the explicit closeDbConnection logic was put in to reset the Magento db state.

The exception is logged either way - so is the process failing, or is it just filling up the exception logs and proceeding as expected? If it's dropping the MySQL connection completely, the solution could be to adjust how it's catching the "went away" exception and triggering the reconnect logic.
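
To make that concrete, here's a rough sketch of the reconnect strategy I'm describing - assumptions only, not a copy of the module's code, and waitForNextJob()/runJob() are hypothetical stand-ins for the blocking reserve and the task runner:

<?php
// Sketch: exception-driven reconnect. The connection is left alone until MySQL
// reports "server has gone away" (error 2006); only then is Magento's stale
// handle closed so the next query can reconnect lazily.
require_once 'app/Mage.php';
Mage::app();

$resource = Mage::getSingleton('core/resource');

while (true) {
    $job = waitForNextJob();       // hypothetical blocking reserve
    try {
        runJob($job);              // hypothetical task runner
    } catch (Zend_Db_Statement_Exception $e) {
        if (strpos($e->getMessage(), 'server has gone away') === false) {
            throw $e;              // anything else is a genuine failure
        }
        Mage::logException($e);    // this is the entry that shows up in the logs
        $resource->getConnection('core_write')->closeConnection();
        runJob($job);              // retry: the first query reopens the connection
    }
}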

I hope this helps

geerlingguy commented 7 years ago

After observing a few threads that I ran separately, it seems the code is doing what it should: after wait_timeout is reached, I see an exception in the log, and I can see in the database that a new connection has been set up (using SHOW FULL PROCESSLIST;).
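
As an aside, the same check can be scripted from the Magento side - an illustrative sketch only, not part of the module:

<?php
// Print wait_timeout and the number of open MySQL threads; the scripted
// equivalent of running SHOW FULL PROCESSLIST by hand.
require_once 'app/Mage.php';
Mage::app();

$db = Mage::getSingleton('core/resource')->getConnection('core_read');

$timeout = $db->fetchRow("SHOW VARIABLES LIKE 'wait_timeout'");
$threads = $db->fetchAll('SHOW FULL PROCESSLIST');

printf("wait_timeout = %s seconds, open threads = %d\n", $timeout['Value'], count($threads));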

Since we have a number of workers per queue, and our wait_timeout is 3600 seconds (60 minutes), there are a number of these exceptions logged.

Do you recommend setting a much higher wait_timeout, or are the exceptions mostly harmless in your experience?

lilmuckers commented 7 years ago

Unless I'm misunderstanding, it sounds like it's doing precisely what is expected, i.e. connecting, processing, and then, when the connection dies, reconnecting.

Extending the wait_timeout on the MySQL side is, in my experience, not the best idea, because most of your DB load (at scale) will be your customer connections - and a longer wait_timeout will result in your connection count staying at an unreasonably high level, requiring more resources thrown at your SQL cluster (or an increase to the kernel ulimit, which has its own concerns). EDIT: connections will only stay open longer if there are bugs in the code - flawless code won't cause the issue - but in my experience there's not really such a thing as "flawless" Magento code.

That error is an expected error, because the connection is potentially idling for a long time while processing tasks and waiting for queue items and such, which is why it's meant to recover "invisibly" from it (though, as I say, it does log the error).

geerlingguy commented 7 years ago

Yeah, I did some more testing with the queue watchers today, and can confirm that when I start them all at once, I see none of the 'gone away' errors for the full wait_timeout period (60 minutes in my case). After that, I start seeing individual workers throw the 'gone away' exception here and there, since some queues are idle that whole time, while others get jobs to process and so don't hit wait_timeout until they're idle later.

I'm going to close this issue, as I think it's more of a 'known non-bug'. I do wonder, though, whether there's any way it could be handled more gracefully, since any persistent queue watcher that isn't restarted regularly will run into wait_timeout at some point (assuming the queue doesn't get any jobs within that period). With MySQL's default of 28800 seconds (8 hours), it's likely that most queues would get a job within any given 8-hour window (more so than with a 1-hour window), so they wouldn't run into the exception as often.