mod_wsgi Daemon process deadlock timer expired, stopping process

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. I am running a load test with high concurrency against the server: 50
users access the system concurrently, each making 50 requests over a period
of time, with a think time of 2 seconds between requests.
2. I am using the configuration directives below to run the Django
application under mod_wsgi:
    WSGIDaemonProcess test user=webservd group=webservd processes=1 threads=150
    WSGIProcessGroup test

What is the expected output? What do you see instead?
I expect the mod_wsgi daemon process to keep running rather than being shut
down and respawned. But after serving around 1060 requests, issued over a
period of 7 mins by the concurrent users, the process gets killed and a new
process gets started. Before it gets killed there is no spike in the
response time of the requests. In fact, the average response time of a
request is 0.15 secs. I see the messages below in the Apache error logs
before the process is restarted.

Around 300 lines of the following message:

[Sat Oct 10 21:29:18 2009] [warn] (128)Network is unreachable: connect to listener on [::]:9333

These were followed by the lines below (with some repetitions of the message
above as well):

[Sat Oct 10 12:10:44 2009] [info] mod_wsgi (pid=10824): Daemon process deadlock timer expired, stopping process 'test'.
[Sat Oct 10 12:10:44 2009] [info] mod_wsgi (pid=10824): Shutdown requested 'test'.
[Sat Oct 10 21:40:48 2009] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 8 children, there are 0 idle, and 61 total children.......

We have an app-level cache that gets created on server startup and gets
refreshed every 30 mins. We want to have a single daemon process to serve
the requests, instead of multiple daemon processes.

What version of the product are you using? On what operating system?
mod_wsgi v2.2 and Solaris 10

Please provide any additional information below.

Original issue reported on code.google.com by windson.rangavajhala on 11 Oct 2009 at 6:29

GoogleCodeExporter commented 9 years ago

Upgrade to the latest version of mod_wsgi and see whether you can reproduce
the problem there. Version 2.2 is quite old.

It is also highly unlikely that you need 150 threads in the process. If your
typical request takes only 0.15 seconds, then 150 threads could in theory
handle up to about 1,000 requests/second (150 / 0.15), which I would imagine
your benchmarking shows you are not achieving anyway. As such, the high
number of unnecessary threads you are using is resulting in the background
thread which checks whether a Python deadlock has occurred being starved, so
it never gets a chance to execute.
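
The capacity argument above can be worked through numerically. The sketch below is only a back-of-the-envelope model, not a measurement: it assumes the 50 test users form a closed loop (each waits for a response, then pauses for the 2 second think time), ignores queueing and Apache overhead, and uses Little's law to estimate how many threads are busy on average. The function names are made up for this illustration.

```python
def max_throughput(threads, service_time_s):
    """Upper bound on requests/second if every thread is always busy."""
    return threads / service_time_s


def offered_load(users, think_time_s, service_time_s):
    """Requests/second from closed-loop users who wait for each response
    and then pause for think_time_s before sending the next request."""
    return users / (think_time_s + service_time_s)


def threads_in_use(request_rate, service_time_s):
    """Little's law: average number of requests in progress at once."""
    return request_rate * service_time_s


# Figures taken from the report: 150 threads, 0.15 s per request,
# 50 users with a 2 s think time.
capacity = max_throughput(150, 0.15)
load = offered_load(50, 2.0, 0.15)
busy = threads_in_use(load, 0.15)

print("capacity:                %.0f req/s" % capacity)
print("offered load:            %.1f req/s" % load)
print("threads busy on average: %.1f" % busy)
```

Under these assumptions the test offers roughly 23 requests/second and keeps fewer than 4 threads busy on average, which is why a much smaller thread count should sustain the same throughput.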

I will look at the mechanics of how the deadlock detection is done and see
what can be done for the case where the thread itself is being starved, as
opposed to a real deadlock being detected. I suspect, though, that if you
simply drop the number of threads you will find you can still handle the same
throughput and the problem will possibly go away.
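
To make the distinction between a real deadlock and a starved checker thread concrete, here is a generic heartbeat watchdog in Python. It is an illustration of the general pattern only and is not mod_wsgi's implementation (which is C code not shown in this thread); the class name, method names and injectable clock are all assumptions made for the example.

```python
import time


class HeartbeatWatchdog:
    """Generic heartbeat watchdog (illustrative, not mod_wsgi's code)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_beat = clock()

    def beat(self):
        # Called by the heartbeat thread every time it manages to run.
        self.last_beat = self.clock()

    def expired(self):
        # Called by a separate monitor. True once no heartbeat has been
        # recorded for longer than the timeout.
        return self.clock() - self.last_beat > self.timeout_s


# A fake clock makes the behaviour easy to follow without waiting.
fake_now = [0.0]
dog = HeartbeatWatchdog(300, clock=lambda: fake_now[0])

fake_now[0] = 299.0
print(dog.expired())  # False: still inside the 300 s window

fake_now[0] = 301.0
print(dog.expired())  # True: no heartbeat for more than 300 s

dog.beat()
print(dog.expired())  # False: the heartbeat reset the timer
```

A watchdog of this shape cannot tell why the heartbeat stopped: a true deadlock and a heartbeat thread that simply never gets scheduled look identical to the monitor.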

Even so, this doesn't all sound right, as it would mean you have managed to
starve the deadlock thread for 5 minutes, which is highly unlikely. It is
more likely that the problem is that you are using mod_wsgi 2.2, as there
were issues with configuration corruption in older versions of mod_wsgi which
have been fixed since that version.

BTW, you do not need to set 'processes=1', as it defaults to that, and using
the 'processes' option, even with the value '1', has consequences for the
value of 'wsgi.multiprocess'. This isn't relevant to your issue, but don't
set it all the same.
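
One way to see the effect on 'wsgi.multiprocess' is to serve a tiny WSGI application that echoes the flags it receives, once with 'processes=1' in the WSGIDaemonProcess directive and once without it. This is a minimal sketch using only the standard WSGI interface; the output format is arbitrary.

```python
def application(environ, start_response):
    """Report the concurrency flags the WSGI server passed in."""
    body = "wsgi.multiprocess=%s\nwsgi.multithread=%s\n" % (
        bool(environ.get("wsgi.multiprocess")),
        bool(environ.get("wsgi.multithread")),
    )
    data = body.encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(data))),
    ])
    return [data]
```

Code that keeps per-process state, such as an in-memory cache, sometimes checks 'wsgi.multiprocess' to decide whether that state can be trusted to be the only copy, which is why the flag is worth getting right.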

The network unreachable problem isn't related to mod_wsgi but is likely from
some other Apache module you are using, as mod_wsgi doesn't try to connect on
INET ports. It uses UNIX sockets, which are identified by a filesystem path
and not a port number.

Original comment by Graham.Dumpleton@gmail.com on 11 Oct 2009 at 10:42

GoogleCodeExporter commented 9 years ago

Hi Graham,

We reduced the number of threads to 15 and performance has certainly
improved, but the deadlock issue has not gone away. Did you get a chance to
look into why the deadlock timer would have expired?

************************************************************************
Some useful info for you:

This happens only when there are database updates involved in our test
requests. Here are the scenarios.

# of threads | Process lifetime | Total hits processed     | Processed hits
             |                  | before process is killed | having DB updates
-------------+------------------+--------------------------+------------------
15           | 5 mins           | 3,940                    | 206
30           | 8 mins           | 14,022                   | 566
60           | 4.5 mins         | 2,815                    | 128

What is the recommended number of threads, and the upper limit, for a
high-traffic production scenario?

************************************************************************

We tried to install mod_wsgi 2.6 (building from the source in the downloads
section of your mod_wsgi site), but could not get it installed on our machine
because of the following issue. We made sure the CC environment variable was
set to the location of the appropriate cc compiler, but the build still
failed.

/my-home/mod_wsgi-2.6> make install
/usr/apache2/bin/apxs -c -I/usr/local/python-2.5.1/include/python2.5 -I/usr/local/pkgs/studio-12.1/include -DNDEBUG  mod_wsgi.c -fast -xarch=sse3 -m64 -g -L/usr/local/pkgs/studio-12.1/lib -R/usr/local/pkgs/studio-12.1/lib -B direct -z text -lpython2.5 -lresolv -lcurses -lsocket -lnsl -lrt -ldl -lm
/var/apache2/build/libtool --silent --mode=compile cc -prefer-pic -O -xarch=386 -xchip=pentium -xspace -Xa -xildoff -xO4 -DSSL_EXPERIMENTAL -DSSL_ENGINE -DSOLARIS2=10 -D_POSIX_PTHREAD_SEMANTICS -D_REENTRANT -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64  -I/usr/apache2/include  -I/usr/apache2/include  -I/usr/apache2/include -I/usr/sfw/include -I/usr/local/python-2.5.1/include/python2.5 -I/usr/local/pkgs/studio-12.1/include -DNDEBUG  -c -o mod_wsgi.lo mod_wsgi.c && touch mod_wsgi.slo
/var/apache2/build/libtool: line 1282: cc: command not found
apxs:Error: Command failed with rc=65536
.
*** Error code 1
make: Fatal error: Command failed for target `mod_wsgi.la'

It would be really great if you could share your thoughts as soon as possible.

Thanks,
Pavan

Original comment by windson.rangavajhala on 16 Oct 2009 at 2:57

GoogleCodeExporter commented 9 years ago

The apxs program is going to ignore the CC variable. It will use whatever
compiler was used when Apache was configured and built. If that was Sun C and
you are trying to get it to use gcc, it will not work. You cannot just edit
apxs to make it use gcc either, as it is preconfigured with Sun C command
line options. If the only issue is finding the 'cc' command, add the bin
directory containing it to your PATH.

After looking at the deadlock timeout code, thread starvation is unlikely to
be the issue. I would still be concerned about memory corruption due to that
old mod_wsgi version.

BTW, are you 100% sure your code is thread safe? Do you use third party
Python packages with C extension components? Some of those commonly used with
Django aren't thread safe, such as GeoDjango (??).
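
Since the report mentions an app-level cache that is refreshed every 30 minutes, that cache is one place where thread safety matters once many threads share a single daemon process. The application's actual cache code is not shown in this thread, so the sketch below is purely hypothetical: `RefreshingCache`, `loader` and `max_age_s` are invented names, and holding the lock for the whole reload is a deliberate simplification.

```python
import threading
import time


class RefreshingCache:
    """Hypothetical process-wide cache that reloads itself when stale."""

    def __init__(self, loader, max_age_s=30 * 60, clock=time.monotonic):
        self._loader = loader          # callable that builds fresh data
        self._max_age_s = max_age_s    # 30 minutes, as in the report
        self._clock = clock
        self._lock = threading.Lock()
        self._data = None
        self._loaded_at = None

    def get(self):
        # The lock ensures only one thread reloads at a time and that no
        # thread ever sees a half-built cache.
        with self._lock:
            now = self._clock()
            stale = (self._loaded_at is None
                     or now - self._loaded_at >= self._max_age_s)
            if stale:
                self._data = self._loader()
                self._loaded_at = now
            return self._data
```

The trade-off is that every request thread blocks while a reload is in progress. If the reload includes slow database queries, a variant that builds the new data outside the lock and only swaps the reference under it would keep requests flowing.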

As for how many threads you need, that depends on how long your requests take
and, in practice, how many concurrent requests and requests/second you will
actually handle. There was the start of a discussion about this at:

http://groups.google.com/group/modwsgi/browse_frm/thread/3f69f70d8a0e087b

One could use that test script with a sleep() call standing in for the time
your requests take, to try to gauge how many threads you realistically need.
If you want more information about that, please use the mailing list rather
than this issue tracker.
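
The sleep()-based idea can be tried without Apache at all. The sketch below is not the script from the mailing list discussion linked above; it is a stand-alone approximation that uses a thread pool in place of mod_wsgi's request threads and `time.sleep()` in place of a request. Because `sleep()` releases the GIL, it models requests that mostly wait on I/O such as a database, not CPU-bound ones. All names and the chosen thread counts are assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(service_time_s):
    """Stand-in for one request: just wait for the service time."""
    time.sleep(service_time_s)


def measure_throughput(threads, requests, service_time_s):
    """Run `requests` fake requests on `threads` workers and return the
    achieved requests/second."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for _ in range(requests):
            pool.submit(fake_request, service_time_s)
    # Leaving the with-block waits for all submitted requests to finish.
    elapsed = time.monotonic() - start
    return requests / elapsed


if __name__ == "__main__":
    # 0.15 s matches the average response time quoted in the report.
    for threads in (1, 5, 15):
        rate = measure_throughput(threads, requests=30, service_time_s=0.15)
        print("%3d threads: %.1f req/s" % (threads, rate))
```

Comparing the measured rates with the request rate the site actually needs gives a rough lower bound on the thread count. Real requests contend for the GIL and the database, so treat the result as optimistic.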

At this point, the only other thing I can suggest is that you set the
deadlock-timeout option on the WSGIDaemonProcess directive to something
higher than 300 seconds and see if it makes a difference.

I just cannot find anything wrong with the code, and this feature has never
troubled anyone else before.

Original comment by Graham.Dumpleton@gmail.com on 17 Oct 2009 at 11:01

GoogleCodeExporter commented 9 years ago

Closing at this point, as there has been no further feedback and I couldn't
see any issue with mod_wsgi. It is possibly the configuration corruption
issue that was fixed in a newer version of mod_wsgi than the one used.

Original comment by Graham.Dumpleton@gmail.com on 22 Nov 2009 at 3:17

Changed state: WontFix
