Closed GoogleCodeExporter closed 9 years ago
Out of curiosity: could you repeat the test with different number of serving
threads (-num_threads), please?
Original comment by valenok
on 3 Oct 2010 at 9:32
Looks like the default value is -t 10.
With -t 5, the problem reproduced about the same.
With -t 20 and -t 50, the problem was much worse. Delays were visible within
the first 2000 requests and it timed out (I think ab gives up at 30 seconds) at
16K requests in both cases.
With -t 1, there were no major delays until the 16K mark again (hmm...), but it
timed out again.
The number of requests completed before timeout in each of those tests was
suspiciously similar: 16374, 16341, 16371.
Original comment by eric.sco...@gmail.com
on 3 Oct 2010 at 10:54
Did a bit more testing this afternoon/evening, and it appears this failure mode
is related to the most intense loads. I used another machine to test mongoose
running on the MacPro, basically using my WiFi network to introduce some
network latency. In this configuration, I was able to run it through a test of
1 million requests without this kind of delay. (Worst response time was just
under 1 second. Mean was 3.6 ms.)
Kudos on memory management, BTW. Activity Monitor shows absolutely consistent
memory consumption (1.8 MB real memory, 96.1 MB virtual memory) throughout this
test.
Original comment by eric.sco...@gmail.com
on 4 Oct 2010 at 3:59
More info: Repro'd this again on my laptop using a slightly hacked up version
of the chat server (attempting to remove file I/O from the equation).
All of the worker threads were stuck on the pthread_cond_wait in
consume_socket().
The master thread was waiting on the select() call in master_thread().
Original comment by eric.sco...@gmail.com
on 5 Oct 2010 at 1:01
I can see it as well.
Master thread blocks in select() because the set of file descriptors to watch
(read_set) is empty - all zeros. Looking.
Original comment by valenok
on 5 Oct 2010 at 6:37
I suspect we're hitting the wall on file descriptors. I encountered the same
result using Apache as the web server, but it went away when I used ab's keep
alive (-k) option. Went back to Mongoose and tried enabling keep-alive in both
Mongoose and ab, but the problem persisted.
I've tried using ulimit -n to increase the file descriptor limit, but it had no
apparent effect.
This seems fairly similar:
http://serverfault.com/questions/10852/what-limits-the-maximum-number-of-connect
ions-on-a-linux-server, except that the Mac seems to stall at 16K requests
instead of the 11K reported for Linux in that article.
Original comment by eric.sco...@gmail.com
on 5 Oct 2010 at 4:38
I doubt this is file descriptor issue.
Mongoose is using limited number of serving threads, and each thread opens one
file. More than that, when serving threads are stuck in consume_socket(), that
means they are waiting for the new connection, meaning they don't have any file
descriptors opened.
I'll get back to this later today.
Will try running under valgrind with memory checker on.
Original comment by valenok
on 5 Oct 2010 at 4:43
Yes, but FDs are also used for TCP sockets, right? I was assuming that was the
issue. In the Linux article, they say that FDs are sometimes held open by the
system after the socket is closed. Not sure if this is true on Mac also, but I
wouldn't be surprised.
I note that the problem seems to go away when I tested from one machine to
another, using the reduced bandwidth of WiFi instead of localhost as a way to
slow down the request stream.
I don't think that mongoose is leaking memory. Activity Monitor on the Mac
showed VM and real mem consumption were absolutely stable (and remarkably low
IMHO!) throughout the test. I suppose it's possible it's leaking something
else; that wouldn't have shown up with my testing.
Original comment by eric.sco...@gmail.com
on 5 Oct 2010 at 7:06
Okay, I thought that was a problem in my environment (Linux 2.6) with my
special case of implementation. I'm using Mongoose as an embedded webserver for
my own application.
File System is not accessed in my application. Everything works in memory with
connect to a MySQL database (which is hosted on another physical machine) and
heavy caching of the retreived information.
Everything works fine as long as I hit it with less than 3000 requests. When I
increase the number everything goes sloooooow.
Original comment by Bla...@gmail.com
on 6 Oct 2010 at 10:34
It looks like it stalls surely after ~16000 requests. Should I say, after 16384
requests?
Master thread is in select(), worker threads are waiting for sockets..
I'm a bit lost.
Original comment by valenok
on 11 Oct 2010 at 10:16
I'm still looking at the possibility that this is a system-imposed limitation.
(Ports remain allocated for a brief interval following closure.)
http://www.brianp.net/2008/10/03/changing-the-length-of-the-time_wait-state-on-m
ac-os-x/ would suggest that. Been away from my Macs for a brief vacation, but
will try twiddling these settings and report back when I return later tonight.
Original comment by eric.sco...@gmail.com
on 11 Oct 2010 at 10:41
As noted in my comment #2, the number of requests completed before timeout
occurs is always just slightly less than 16384, which makes sense if a few
other ports in the transient range are in use by other processes.
Original comment by eric.sco...@gmail.com
on 11 Oct 2010 at 10:45
Just tried that.
sysctl -w net.inet.ip.portrange.hifirst=30000 - had no effect.
sysctl -w net.inet.tcp.msl=1000 - had an effect, now "ab -c 20 -n 30000
127.0.0.1:8080/mongoose.c" works reliably without being stuck after 16000
requests.
Original comment by valenok
on 11 Oct 2010 at 10:51
Bumped up sysctl -w net.inet.tcp.msl=100000, and mongoose now stalls again
after ~16k requests.
Original comment by valenok
on 11 Oct 2010 at 10:53
It looks like this question
http://stackoverflow.com/questions/1216267/ab-program-freezes-after-lots-of-requ
ests-why also talks about the same issue.
Original comment by valenok
on 11 Oct 2010 at 11:08
Pulled your latest changes (including the Oct 4 patch for keep_alive). With
keep_alive enabled on both mongoose and ab, I just pushed it through a million
requests on my 8-core Mac Pro without any hiccups. Actually, it was scorching
fast: 20K requests per second in the heaviest test I've attempted so far (!).
$ /usr/sbin/ab -k -c 10 -n 1000000 http://127.0.0.1:8080/LICENSE
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient)
Completed 100000 requests
Completed 200000 requests
Completed 300000 requests
Completed 400000 requests
Completed 500000 requests
Completed 600000 requests
Completed 700000 requests
Completed 800000 requests
Completed 900000 requests
Completed 1000000 requests
Finished 1000000 requests
Server Software:
Server Hostname: 127.0.0.1
Server Port: 8080
Document Path: /LICENSE
Document Length: 1062 bytes
Concurrency Level: 10
Time taken for tests: 49.132 seconds
Complete requests: 1000000
Failed requests: 0
Write errors: 0
Keep-Alive requests: 1000000
Total transferred: 1280000000 bytes
HTML transferred: 1062000000 bytes
Requests per second: 20353.41 [#/sec] (mean)
Time per request: 0.491 [ms] (mean)
Time per request: 0.049 [ms] (mean, across all concurrent requests)
Transfer rate: 25441.76 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 0
Processing: 0 0 0.1 0 1
Waiting: 0 0 0.1 0 1
Total: 0 0 0.1 0 1
Percentage of the requests served within a certain time (ms)
50% 0
66% 1
75% 1
80% 1
90% 1
95% 1
98% 1
99% 1
100% 1 (longest request)
I think this bug can be closed. Thank you for taking the time to investigate
and for the keep_alive patch.
Original comment by eric.sco...@gmail.com
on 12 Oct 2010 at 6:06
But without keep-alive, the issue still exists!
I've tried to play with SO_LINGER option when closing a socket,
and it appears to fix the issue even without messing with sysctl variables.
Original comment by valenok
on 12 Oct 2010 at 9:16
Having read more of the articles on the care and feeding of ephemeral ports, I
think that's a change you shouldn't put into production. (See the Stack
Overflow article you cited in #15 and also
http://queue.acm.org/detail.cfm?id=1851338 and
http://stackoverflow.com/questions/1410723/is-there-a-way-to-reopen-a-socket)
All of these sources point to a risk of confusing stray packets from an old
connection on that port # with a new one if you twiddle either SO_LINGER or
net.inet.tcp.msl.
I think the correct behavior is to encourage the use of keep_alive in
higher-traffic situations. Perhaps it should be enabled by default.
Note that using ab against Apache itself yields the same result with keep_alive
disabled (my comment #6).
P.S. With keep_alive enabled, ab + mongoose ran successfully through a test of
100 million requests overnight. There was a very, very slight memory accretion
in the mongoose process (a few K bytes), but otherwise, the 21K requests/sec
pace held up throughout the test.
Original comment by eric.sco...@gmail.com
on 12 Oct 2010 at 1:12
I believe that I am also affected by this issue.
I have a fairly hard-hit CGI script which generates map tiles. It's not
unusual for 60 to 120 requests to arrive from a single client at a time. I
decided to switch from Apache to Mongoose to serve this script for efficiency
reasons (it didn't seem reasonable to spin up a mass pool of Apaches with PHP
and a host of other modules to serve a script which doesn't depend on any of
that stuff).
In any case, after a certain amount of time, Mongoose slows way down or quits
serving pages altogether (leaving a great deal of connections in the SYN_RECV
state on the server side). I'm only showing about 500 connections in TIME_WAIT
when this happens. Keepalives are enabled (and I've patched the source with
the Oct 4 keepalive patch). Mongoose's CPU time also drops to nothing in this
state.
I'll try to collect more information over the next several days and post it
here. Is there anything specific you'd like me to try or record?
Original comment by larry.wi...@gmail.com
on 21 Nov 2010 at 2:48
I've submitted
https://code.google.com/p/mongoose/source/detail?r=2550b2463d036674b4d050166b529
24a13ae7314 that should help.
Please verify.
Original comment by valenok
on 24 Jun 2011 at 7:33
Original issue reported on code.google.com by
eric.sco...@gmail.com
on 3 Oct 2010 at 7:55Attachments: