quru / qis

Dynamic image server for web and print
https://quruimageserver.com
GNU Affero General Public License v3.0

HTTP error 500s sometimes returned after an Apache graceful restart #5

Closed fozcode closed 6 years ago

fozcode commented 6 years ago

I'm copying this issue here from Quru's internal bug tracker as it documents some interesting behaviour.

Seen in live today when it ruined a PDF report. From qis_access_log:

109.111.129.21 [08/Dec/2017:03:50:46 +0000] "GET /image?src=Images/live/2016-06/17/9226877-587-1.jpg.tif&top=0&left=0&right=1&bottom=1&fill=&angle=0&tmp=ConditionReport HTTP/1.1" 500 620 - 132606 - "Prince/10 (www.princexml.com)" "-"

There is no error in qis.log, which indicates that the failure was not in the Python code. However, qis.log does show that the httpd processes were being restarted at that time. Checking the cron log:

Dec 8 03:47:01 images7 run-parts(/etc/cron.daily)[8987]: starting logrotate
Dec 8 03:49:54 images7 run-parts(/etc/cron.daily)[11603]: finished logrotate

So yes, a graceful restart was in progress at that time. Checking the Apache error log:

[Fri Dec 08 03:50:46 2017] [error] [client 109.111.129.21] Premature end of script headers: runserver.wsgi
[Fri Dec 08 03:50:47 2017] [notice] child pid 11703 exit signal Segmentation fault (11)

We can see that an httpd process crashed at the same time, handling a request from the same client IP. Bingo.

It's going to be hard to determine whether this is an issue with mod_wsgi on startup or with something the QIS code is doing, e.g. when it forks off the auxiliary processes on the first request. The latter seems more likely, and can be tested since the problem is reproducible. We may also need to try the latest mod_wsgi instead of the old version shipped in the standard o/s packages.

fozcode commented 6 years ago

More testing was done on RHEL 7.4 / QIS v3.0.1 / Python 3.5 / Apache 2.4.6 / mod_wsgi 4.6.2, running the QIS benchmark tool while repeatedly telling Apache to restart gracefully.

Typically the graceful restart works OK, but occasionally the mod_wsgi child processes die. Here is a typical part of the latest run:

[Tue May 01 14:38:27.537524 2018] [mpm_prefork:notice] [pid 9548] AH00171: Graceful restart requested, doing restart
[Tue May 01 14:38:29.138877 2018] [mpm_prefork:notice] [pid 9548] AH00163: Apache/2.4.6 (Red Hat Enterprise Linux) OpenSSL/1.0.2k-fips mod_wsgi/4.6.2 Python/3.5 configured -- resuming normal operations
[Tue May 01 14:38:29.138918 2018] [core:notice] [pid 9548] AH00094: Command line: '/usr/sbin/httpd -D FOREGROUND'
[Tue May 01 14:38:35.943454 2018] [wsgi:error] [pid 10549] A logging server is already running.
[Tue May 01 14:38:35.946140 2018] [wsgi:error] [pid 10551] A logging server is already running.
[Tue May 01 14:38:36.359267 2018] [wsgi:error] [pid 10561] A logging server is already running.
[Tue May 01 14:38:36.474497 2018] [wsgi:error] [pid 10566] A logging server is already running.

This is a successful restart, showing that 4 mod_wsgi child processes were launched (10549, 10551, 10561, 10566), which matches the configuration. I recorded 4 graceful restarts in a row without seeing any problem. Then on the 5th restart we see this:

[Tue May 01 14:40:09.688717 2018] [mpm_prefork:notice] [pid 9548] AH00171: Graceful restart requested, doing restart
[Tue May 01 14:40:12.732821 2018] [wsgi:error] [pid 12272] [client 127.0.0.1:40070] Truncated or oversized response headers received from daemon process 'qis': /opt/qis/src/wsgi/runserver.wsgi, referer: qis/benchmark
[Tue May 01 14:40:12.737602 2018] [wsgi:error] [pid 12307] [client 127.0.0.1:40068] Truncated or oversized response headers received from daemon process 'qis': /opt/qis/src/wsgi/runserver.wsgi, referer: qis/benchmark
[Tue May 01 14:40:12.737854 2018] [wsgi:error] [pid 12284] [client 127.0.0.1:40078] Truncated or oversized response headers received from daemon process 'qis': /opt/qis/src/wsgi/runserver.wsgi, referer: qis/benchmark
[Tue May 01 14:40:12.763233 2018] [wsgi:error] [pid 12270] [client 127.0.0.1:40024] Truncated or oversized response headers received from daemon process 'qis': /opt/qis/src/wsgi/runserver.wsgi, referer: qis/benchmark
[Tue May 01 14:40:12.768543 2018] [wsgi:error] [pid 12271] [client 127.0.0.1:40028] Truncated or oversized response headers received from daemon process 'qis': /opt/qis/src/wsgi/runserver.wsgi, referer: qis/benchmark
[Tue May 01 14:40:12.768744 2018] [wsgi:error] [pid 12318] [client 127.0.0.1:39998] Truncated or oversized response headers received from daemon process 'qis': /opt/qis/src/wsgi/runserver.wsgi, referer: qis/benchmark
[Tue May 01 14:40:12.768947 2018] [wsgi:error] [pid 12297] [client 127.0.0.1:40066] Truncated or oversized response headers received from daemon process 'qis': /opt/qis/src/wsgi/runserver.wsgi, referer: qis/benchmark
[Tue May 01 14:40:12.839813 2018] [mpm_prefork:notice] [pid 9548] AH00163: Apache/2.4.6 (Red Hat Enterprise Linux) OpenSSL/1.0.2k-fips mod_wsgi/4.6.2 Python/3.5 configured -- resuming normal operations
[Tue May 01 14:40:12.839853 2018] [core:notice] [pid 9548] AH00094: Command line: '/usr/sbin/httpd -D FOREGROUND'
[Tue May 01 14:40:18.834357 2018] [wsgi:error] [pid 12854] A logging server is already running.
[Tue May 01 14:40:18.839752 2018] [wsgi:error] [pid 12855] A logging server is already running.
[Tue May 01 14:40:19.947014 2018] [wsgi:error] [pid 12881] A logging server is already running.
[Tue May 01 14:40:20.088385 2018] [wsgi:error] [pid 12888] A logging server is already running.

Here we can see that 7 daemon processes crashed, which is odd given that only 4 were configured. The figure of 7 is repeatable. It's not obvious from the logs whether the old processes crash while shutting down, or whether there are 7 crashes during startup. The latter makes more sense if mod_wsgi keeps starting new instances as the crashes occur, until 4 child processes are up and stable. The end result is that we do get 4 running child processes and traffic continues as normal. The 7 server-side crashes, however, caused 7 client-side error 500s, and these are the cause of the condition report errors in the ticket description. Here is the client-side log:

ERROR 11025 - HTTP Error 500: Internal Server Error
LOG 11025 - 500 - http://127.0.0.1/image?src=test_images/sunrise1.jpg&width=150
ERROR 11018 - HTTP Error 500: Internal Server Error
LOG 11018 - 500 - http://127.0.0.1/image?src=test_images/sunrise1.jpg&width=150
ERROR 11020 - HTTP Error 500: Internal Server Error
LOG 11020 - 500 - http://127.0.0.1/image?src=test_images/sunrise1.jpg&width=150&recache=1
ERROR 11019 - HTTP Error 500: Internal Server Error
LOG 11019 - 500 - http://127.0.0.1/image?src=test_images/sunrise1.jpg&fill=FFFFFF&bottom=0.999999999999&height=480&width=640&recache=1
ERROR 11028 - HTTP Error 500: Internal Server Error
LOG 11028 - 500 - http://127.0.0.1/image?src=test_images/sunrise1.jpg&fill=FFFFFF&bottom=0.999999999999&height=480&width=640
ERROR 11022 - HTTP Error 500: Internal Server Error
LOG 11022 - 500 - http://127.0.0.1/image?src=test_images/sunrise2.jpg&height=305&width=232&fill=ffffff&recache=1
ERROR 11024 - HTTP Error 500: Internal Server Error
LOG 11024 - 500 - http://127.0.0.1/image?src=test_images/sunrise1.jpg&width=150

fozcode commented 6 years ago

Same result using the event MPM instead of prefork.

fozcode commented 6 years ago

Adding extra startup logging (around the loading of qismagick.so) suggests that the crashes are happening during the shutdown of the old mod_wsgi processes.

[Tue May 01 15:24:31.888513 2018] [mpm_event:notice] [pid 18243:tid 139706484263040] AH00493: SIGUSR1 received. Doing graceful restart

[Tue May 01 15:24:34.952189 2018] [wsgi:error] [pid 19161:tid 139706067044096] [client 127.0.0.1:41742] Truncated or oversized response headers received from daemon process 'qis': /opt/qis/src/wsgi/runserver.wsgi, referer: qis/benchmark
[snip - more errors]

[Tue May 01 15:24:35.027561 2018] [mpm_event:notice] [pid 18243:tid 139706484263040] AH00489: Apache/2.4.6 (Red Hat Enterprise Linux) OpenSSL/1.0.2k-fips mod_wsgi/4.6.2 Python/3.5 configured -- resuming normal operations
[Tue May 01 15:24:35.027633 2018] [core:notice] [pid 18243:tid 139706484263040] AH00094: Command line: '/usr/sbin/httpd -D FOREGROUND'

[Tue May 01 15:24:40.954192 2018] [wsgi:error] [pid 19270:tid 139706484263040] EXTRA DEBUG 19270 is loading qismagick.so
[Tue May 01 15:24:41.069715 2018] [wsgi:error] [pid 19270:tid 139706484263040] EXTRA DEBUG 19270 loaded qismagick.so OK
[snip - more processes start up ok]

[Tue May 01 15:24:42.587939 2018] [wsgi:error] [pid 19484:tid 139706484263040] A logging server is already running.

fozcode commented 6 years ago

There is a mod_wsgi configuration option that I'm not currently setting, graceful-timeout=sss, which sets how long Apache waits for a child process to finish its current request when doing a graceful restart. See: http://modwsgi.readthedocs.io/en/develop/configuration-directives/WSGIDaemonProcess.html

But now some bad news. That option only affects mod_wsgi's own internal restarts, not a restart of the whole Apache service as triggered here by logrotate.

Now some worse news. According to the documentation for shutdown-timeout=sss:

If this option is not defined, then the shutdown timeout will be set to 5 seconds. Note that this option does not change the shutdown timeout applied to daemon processes when Apache itself is being stopped or restarted. That timeout value is defined internally to Apache as 3 seconds and cannot be overridden.
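For completeness, both timeout options are set on the WSGIDaemonProcess directive. A sketch of the relevant httpd configuration follows; the process/thread counts and timeout values here are illustrative placeholders, not QIS's actual settings:

```apache
# Illustrative only - counts and timeouts below are placeholders.
WSGIDaemonProcess qis processes=4 threads=15 \
    graceful-timeout=15 \
    shutdown-timeout=10
WSGIScriptAlias / /opt/qis/src/wsgi/runserver.wsgi process-group=qis
```

Per the documentation quoted above, though, neither option changes the fixed 3-second timeout applied when Apache itself is stopped or restarted.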

In our logs, what is the time between the restart and the crashes? 15:24:31.8 to 15:24:34.9 = 3.1 seconds. And earlier? 14:40:09.6 to 14:40:12.7 = 3.1 seconds. It would seem that the processes are not actually crashing, but are being forcibly killed by Apache while still servicing a request. The graceful restart completes without error when none of the QIS processes is handling a long (>3 second) request at the time.
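The delta can be checked mechanically; a quick sketch using Python's datetime to parse the Apache timestamps quoted above:

```python
from datetime import datetime

# Apache error_log timestamps use this strftime format.
FMT = "%a %b %d %H:%M:%S.%f %Y"

restart = datetime.strptime("Tue May 01 15:24:31.888513 2018", FMT)
crash = datetime.strptime("Tue May 01 15:24:34.952189 2018", FMT)

# Time between the graceful restart notice and the first crash.
print(round((crash - restart).total_seconds(), 1))  # prints 3.1
```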

Surely we can change that 3 seconds? According to the mod_wsgi author at: https://stackoverflow.com/questions/46493776/how-to-do-graceful-application-shutdown-from-mod-wsgi

Apache will still forcibly kill off the managed mod_wsgi daemon processes after 3 seconds and there isn't (or wasn't last time looked) a way to override how Apache manages those processes, to enable a more graceful shutdown on the managed daemon processes. So graceful stop signal doesn't change anything when using daemon mode.

Oh.

The question is if you really need to go to these lengths. Requests will inevitably fail anyway and users have to deal with that, so often interruption of a handful of requests on a restart is not a big deal.

Right.

So basically what we're seeing here is expected behaviour, and Apache's "graceful" restart is not in fact very graceful at all.

We can go to great lengths to implement our own graceful restart, e.g. by overriding the default logrotate postrotate action, or we can take Graham Dumpleton's advice and handle it on the client side, where, as he says, failures should be expected anyway.
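One server-side option worth noting: logrotate's copytruncate mode rotates the logs without reloading Apache at all, sidestepping the restart entirely (at the cost of possibly losing a few log lines during the copy-and-truncate window). A sketch of an overridden /etc/logrotate.d/httpd, assuming the stock RHEL log paths:

```
# Sketch only - uses copytruncate so no httpd reload is needed.
# Assumes the default RHEL log location.
/var/log/httpd/*log {
    missingok
    notifempty
    copytruncate
    delaycompress
}
```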

Python's requests package can be configured to do automatic retries (it does not retry failed responses by default), but in this case the client is PrinceXML. Can PrinceXML automatically retry fetching remote assets? A quick Google suggests "not currently", so we may need to submit a feature request for this.
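When you do control the client, the fix is small; a minimal sketch of the kind of retry-with-backoff wrapper PrinceXML would need built in (this helper is hypothetical, not part of QIS, and uses only the Python standard library):

```python
import time
import urllib.request
from urllib.error import HTTPError

def fetch_with_retries(url, attempts=3, backoff=0.5,
                       opener=urllib.request.urlopen):
    """Fetch url, retrying transient HTTP 5xx errors with linear back-off.

    `opener` is injectable for testing; defaults to urllib.request.urlopen.
    """
    for attempt in range(1, attempts + 1):
        try:
            with opener(url) as resp:
                return resp.read()
        except HTTPError as e:
            # 4xx errors and the final attempt are not retried.
            if e.code < 500 or attempt == attempts:
                raise
            time.sleep(backoff * attempt)
```

With this in place, the transient 500s seen during a graceful restart would be absorbed by the retry loop rather than ruining the generated report.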

I'm going to close this issue as "won't fix", though what I mean is it can't currently be fixed in a reasonable way. There's always uwsgi instead of mod_wsgi, and then of course there's nginx...