pkp / ots

PKP XML Parsing Service
GNU General Public License v3.0

Queue died outright recently on server, unclear why #57

Closed axfelix closed 8 years ago

axfelix commented 8 years ago

This is a new error: now that LibreOffice is no longer hanging, all the queues died instead :)

From /var/apache2/error.log:

[Wed Jan 27 15:05:10 2016] [error] [client 54.67.38.74] PHP Fatal error: Uncaught exception 'Zend\Uri\Exception\InvalidUriPartException' with message 'Host "%s" is not valid or is not accepted by Zend\Uri\Http' in /var/www/vendor/zendframework/zendframework/library/Zend/Uri/Uri.php:746\nStack trace:\n#0 /var/www/vendor/zendframework/zendframework/library/Zend/Http/PhpEnvironment/Request.php(306): Zend\Uri\Uri->setHost('%s')\n#1 /var/www/vendor/zendframework/zendframework/library/Zend/Http/PhpEnvironment/Request.php(82): Zend\Http\PhpEnvironment\Request->setServer(Object(Zend\Stdlib\Parameters))\n#2 /var/www/vendor/zendframework/zendframework/library/Zend/Mvc/Service/RequestFactory.php(32): Zend\Http\PhpEnvironment\Request->__construct()\n#3 [internal function]: Zend\Mvc\Service\RequestFactory->createService(Object(Zend\ServiceManager\ServiceManager), 'request', 'Request')\n#4 /var/www/vendor/zendframework/zendframework/library/Zend/ServiceManager/ServiceManager.php(905): call_user_func(Array, Object(Zend\ServiceManager\ServiceManager), 'request', 'Request')\n#5 /var/www/vendor/zendframew in /var/www/vendor/zendframework/zendframework/library/Zend/ServiceManager/ServiceManager.php on line 912

Not sure what's going on there, will look into it, restarted gracefully (manually) for now.

axfelix commented 8 years ago

Seeing periodic queue hangs on the Bibtexreferences module too. I wonder if this has anything to do with that module making external API calls (are we somehow having socket issues? That seems unlikely). The queues dying outright is still uncommon, though; mostly we're seeing hangs without any output. So the debug log above may be of particular interest if it's the only time we got good logs from this or a related issue...

axfelix commented 8 years ago

Magic cronjob to resolve sticky queues for the time being:

if [[ $(mysql -uxmlps -p xmlps -e "select * from job where status=0 and inputFileFormat!=0 and creationDate > (UNIX_TIMESTAMP() - 600);") ]]; then /var/www/start_queues.sh; fi

jalperin commented 8 years ago

Shouldn't the issue remain open until someone investigates and resolves the cause?

axfelix commented 8 years ago

yeah

axfelix commented 8 years ago

I'm tweaking that cron further, by the way ... had it kill a job during a live demo I gave last week :)

axfelix commented 8 years ago

current cron, not perfect:

*/10 * * * * bash -c 'if [[ $(mysql -uxmlps -p xmlps -e "select * from job where status=0 and creationDate < (UNIX_TIMESTAMP() - 1200);") ]]; then if [[ $(mysql -uxmlps -p xmlps -e "select * from job where status=0 and creationDate > (UNIX_TIMESTAMP() - 600);") ]]; then :; else /var/www/start_queues.sh; killall -o 5m soffice.bin; fi; fi'
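For readability, the one-line cron above could be split into a small script. This is only a sketch of the same logic, not the script actually deployed: the function names `count_pending`, `should_restart` and `watchdog` are made up for illustration, and it assumes the MySQL password is supplied through an option file such as `~/.my.cnf` rather than on the command line.

```bash
#!/usr/bin/env bash
# Sketch of the watchdog logic from the cron line, split into functions.
# Assumption: the MySQL password comes from an option file (e.g. ~/.my.cnf).

# Count pending jobs (status=0) that also match an extra SQL condition.
count_pending() {
    condition="$1"
    mysql -uxmlps -N -B xmlps \
        -e "SELECT COUNT(*) FROM job WHERE status=0 AND ${condition};"
}

# Succeed only when stale pending jobs exist and no recent ones do,
# i.e. the queue looks stuck rather than merely busy.
should_restart() {
    stale="${1:-0}"
    recent="${2:-0}"
    [ "$stale" -gt 0 ] && [ "$recent" -eq 0 ]
}

# Restart the queues and kill old LibreOffice processes if the queue is stuck.
watchdog() {
    stale=$(count_pending "creationDate < (UNIX_TIMESTAMP() - 1200)")
    recent=$(count_pending "creationDate > (UNIX_TIMESTAMP() - 600)")
    if should_restart "$stale" "$recent"; then
        /var/www/start_queues.sh
        killall -o 5m soffice.bin
    fi
}
```

The cron entry would then source this file and call `watchdog` every ten minutes. Keeping the decision in `should_restart` makes it possible to check the thresholds without touching the database.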

kaschioudi commented 8 years ago

@axfelix: I looked into this. Here is what I suggest for collecting more data to diagnose the issue.

Apply the https://github.com/pkp/xmlps/commit/83f37d87aa7f8fcd554d99ed123c05e069ea7e56 patch so that we can monitor queue activity. This will create a log file at /var/local/queue_debug.out with output like this:

[Mon, 14 Mar 2016 12:09:53 -0400] QUEUE => docx for JOB => 1127 under PID => 31098

Make changes to the system to create core dumps of unlimited size:

ulimit -c unlimited
install -m 1777 -d /var/local/dumps
echo "/var/local/dumps/core.%e.%p"> /proc/sys/kernel/core_pattern

(or modify /etc/sysctl.conf to make the change persistent over reboots)

Create restart_queues.sh based on start_queues.sh, but kill processes using kill -ABRT $pid. This will generate core.php. files in /var/local/dumps/

Finally, modify the cron to call restart_queues.sh and also log the result of the SQL command to a file.
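The kill step in restart_queues.sh could look roughly like the sketch below. The contents of start_queues.sh are not shown in this thread, so the helper name `abort_workers` and the way the PIDs are found are assumptions for illustration only.

```bash
#!/usr/bin/env bash
# Sketch of the "kill with SIGABRT" step for a hypothetical restart_queues.sh.
# A process killed by SIGABRT leaves a core file only if core dumps are
# enabled for it (ulimit -c) and core_pattern points to a writable directory.

# Send SIGABRT to every PID passed as an argument.
abort_workers() {
    for pid in "$@"; do
        kill -ABRT "$pid" 2>/dev/null || echo "could not signal PID $pid" >&2
    done
}

# Hypothetical usage, assuming the queue workers are php processes
# owned by www-data:
#   abort_workers $(pgrep -u www-data php)
```

After the workers are aborted, the script would start the queues again the same way start_queues.sh does.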

axfelix commented 8 years ago

Sounds wise! My most recent cron has finally (knock on wood) been doing a good job of keeping things moving, but this is a much better solution to diagnose the issue. Will implement and report back.

axfelix commented 8 years ago

Still having trouble getting output from this, and don't think I'm encountering user permissions issues. cron is still doing its job though.

kaschioudi commented 8 years ago

If the PHP processes run as www-data, the core file size limit for that user needs to be changed.

Either in the /etc/security/limits.conf file:

www-data hard core unlimited

or as root, you can run

su - www-data
ulimit -c unlimited
ulimit -c
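
To check whether the new limit actually reached an already running worker, one option on Linux is to read that process's entry under /proc. This is a sketch, and the helper name `core_limits_of` is made up for illustration.

```bash
#!/usr/bin/env bash
# Print the soft and hard core-file-size limits of a running process (Linux).
# /proc/<pid>/limits has a line such as:
#   Max core file size        0                    unlimited            bytes
core_limits_of() {
    pid="$1"
    awk '/^Max core file size/ {print "soft=" $5, "hard=" $6}' "/proc/${pid}/limits"
}

# Example: show the limits of the current shell.
core_limits_of "$$"
```

A soft limit of 0 on a worker would explain why no core files appear even though the hard limit has been raised.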

axfelix commented 8 years ago

Closing because the cron has been working for months and the remaining "hangs" which are fixed by it are all in dependency libraries rather than our code.