Mongoose stalls for 1-12 seconds under heavy load from Apache Bench on Mac

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?

1. Download Mongoose 2.11 on Mac OS 10.6.4. Build it using "make mac" and run 
it with "./mongoose" without moving it to any other location (i.e. inside its 
own source directory).
2. (optional) Launch Instruments (Mac developer tool), choose the CPU Sampler 
tool, and attach it to the mongoose process.
3. Open another Terminal window and enter the following: "ab -n 20000 
http://127.0.0.1:8080/LICENSE".

What is the expected output? What do you see instead?

If run with a sufficiently small number of requests (-n 4000 instead of -n 
20000, for example), ab reports that all requests are completed within a few 
milliseconds (typically <5 ms, even on low-end Mac laptop hardware).

Expected: This result continues or degrades gradually as load increases.

Actual: At some number of requests (appears to be between 5000 or 20000, 
depending on hardware capacity), Mongoose stalls completely and one request 
will take anywhere from 1 to 12 seconds (!) to complete. Instruments shows that 
activity in the Mongoose process goes almost completely dead during that time. 
(See attached screen shot from Instruments. The highlighted region corresponds 
to one of these stalls.)

See also attached file ab-output.txt, which contains one non-stalling run (10K 
requests) and a subsequent run with several stalls (20K requests). The attached 
Instruments screen shot was taken during the second run.

What version of the product are you using? On what operating system?

Mongoose 2.11
Mac OS 10.6.4

Tested on 13" 2009 model MacBook (not Pro) with 2GB RAM and 2008 model 8-core 
Mac Pro with 6GB RAM. Attached sample is from the Mac Pro.

Please provide any additional information below.

Original issue reported on code.google.com by eric.sco...@gmail.com on 3 Oct 2010 at 7:55

Attachments:

[Screen shot 2010-10-03 at 12.38.56.png](https://storage.googleapis.com/google-code-attachments/mongoose/issue-182/comment-0/Screen shot 2010-10-03 at 12.38.56.png)
ab-output.txt

GoogleCodeExporter commented 9 years ago

Out of curiosity: could you repeat the test with different number of serving 
threads (-num_threads), please?

Original comment by valenok on 3 Oct 2010 at 9:32

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Looks like the default value is -t 10.

With -t 5, the problem reproduced about the same.

With -t 20 and -t 50, the problem was much worse. Delays were visible within 
the first 2000 requests and it timed out (I think ab gives up at 30 seconds) at 
16K requests in both cases.

With -t 1, there were no major delays until the 16K mark again (hmm...), but it 
timed out again.

The number of requests completed before timeout in each of those tests was 
suspiciously similar: 16374, 16341, 16371.

Original comment by eric.sco...@gmail.com on 3 Oct 2010 at 10:54

GoogleCodeExporter commented 9 years ago

Did a bit more testing this afternoon/evening, and it appears this failure mode 
is related to the most intense loads. I used another machine to test mongoose 
running on the MacPro, basically using my WiFi network to introduce some 
network latency. In this configuration, I was able to run it through a test of 
1 million requests without this kind of delay. (Worst response time was just 
under 1 second. Mean was 3.6 ms.)

Kudos on memory management, BTW. Activity Monitor shows absolutely consistent 
memory consumption (1.8 MB real memory, 96.1 MB virtual memory) throughout this 
test.

Original comment by eric.sco...@gmail.com on 4 Oct 2010 at 3:59

GoogleCodeExporter commented 9 years ago

More info: Repro'd this again on my laptop using a slightly hacked up version 
of the chat server (attempting to remove file I/O from the equation).

All of the worker threads were stuck on the pthread_cond_wait in 
consume_socket().

The master thread was waiting on the select() call in master_thread().

Original comment by eric.sco...@gmail.com on 5 Oct 2010 at 1:01

GoogleCodeExporter commented 9 years ago

I can see it as well.
Master thread blocks in select() because the set of file descriptors to watch 
(read_set) is empty - all zeros. Looking.

Original comment by valenok on 5 Oct 2010 at 6:37

GoogleCodeExporter commented 9 years ago

I suspect we're hitting the wall on file descriptors. I encountered the same 
result using Apache as the web server, but it went away when I used ab's keep 
alive (-k) option. Went back to Mongoose and tried enabling keep-alive in both 
Mongoose and ab, but the problem persisted.

I've tried using ulimit -n to increase the file descriptor limit, but it had no 
apparent effect.

This seems fairly similar: 
http://serverfault.com/questions/10852/what-limits-the-maximum-number-of-connect
ions-on-a-linux-server, except that the Mac seems to stall at 16K requests 
instead of the 11K reported for Linux in that article.

Original comment by eric.sco...@gmail.com on 5 Oct 2010 at 4:38

GoogleCodeExporter commented 9 years ago

I doubt this is file descriptor issue.
Mongoose is using limited number of serving threads, and each thread opens one 
file. More than that, when serving threads are stuck in consume_socket(), that 
means they are waiting for the new connection, meaning they don't have any file 
descriptors opened.
I'll get back to this later today.
Will try running under valgrind with memory checker on.

Original comment by valenok on 5 Oct 2010 at 4:43

GoogleCodeExporter commented 9 years ago

Yes, but FDs are also used for TCP sockets, right? I was assuming that was the 
issue. In the Linux article, they say that FDs are sometimes held open by the 
system after the socket is closed. Not sure if this is true on Mac also, but I 
wouldn't be surprised.

I note that the problem seems to go away when I tested from one machine to 
another, using the reduced bandwidth of WiFi instead of localhost as a way to 
slow down the request stream.

I don't think that mongoose is leaking memory. Activity Monitor on the Mac 
showed VM and real mem consumption were absolutely stable (and remarkably low 
IMHO!) throughout the test. I suppose it's possible it's leaking something 
else; that wouldn't have shown up with my testing.

Original comment by eric.sco...@gmail.com on 5 Oct 2010 at 7:06

GoogleCodeExporter commented 9 years ago

Okay, I thought that was a problem in my environment (Linux 2.6) with my 
special case of implementation. I'm using Mongoose as an embedded webserver for 
my own application.

File System is not accessed in my application. Everything works in memory with 
connect to a MySQL database (which is hosted on another physical machine) and 
heavy caching of the retreived information.

Everything works fine as long as I hit it with less than 3000 requests. When I 
increase the number everything goes sloooooow.

Original comment by Bla...@gmail.com on 6 Oct 2010 at 10:34

GoogleCodeExporter commented 9 years ago

It looks like it stalls surely after ~16000 requests. Should I say, after 16384 
requests?
Master thread is in select(), worker threads are waiting for sockets..
I'm a bit lost.

Original comment by valenok on 11 Oct 2010 at 10:16

GoogleCodeExporter commented 9 years ago

I'm still looking at the possibility that this is a system-imposed limitation. 
(Ports remain allocated for a brief interval following closure.)

http://www.brianp.net/2008/10/03/changing-the-length-of-the-time_wait-state-on-m
ac-os-x/ would suggest that. Been away from my Macs for a brief vacation, but 
will try twiddling these settings and report back when I return later tonight.

Original comment by eric.sco...@gmail.com on 11 Oct 2010 at 10:41

GoogleCodeExporter commented 9 years ago

As noted in my comment #2, the number of requests completed before timeout 
occurs is always just slightly less than 16384, which makes sense if a few 
other ports in the transient range are in use by other processes.

Original comment by eric.sco...@gmail.com on 11 Oct 2010 at 10:45

GoogleCodeExporter commented 9 years ago

Just tried that.
sysctl -w net.inet.ip.portrange.hifirst=30000   - had no effect.
sysctl -w net.inet.tcp.msl=1000    - had an effect, now "ab -c 20 -n 30000 
127.0.0.1:8080/mongoose.c" works reliably without being stuck after 16000 
requests.

Original comment by valenok on 11 Oct 2010 at 10:51

GoogleCodeExporter commented 9 years ago

Bumped up sysctl -w net.inet.tcp.msl=100000, and mongoose now stalls again 
after ~16k requests.

Original comment by valenok on 11 Oct 2010 at 10:53

GoogleCodeExporter commented 9 years ago

It looks like this question 
http://stackoverflow.com/questions/1216267/ab-program-freezes-after-lots-of-requ
ests-why also talks about the same issue.

Original comment by valenok on 11 Oct 2010 at 11:08

GoogleCodeExporter commented 9 years ago

Pulled your latest changes (including the Oct 4 patch for keep_alive). With 
keep_alive enabled on both mongoose and ab, I just pushed it through a million 
requests on my 8-core Mac Pro without any hiccups. Actually, it was scorching 
fast: 20K requests per second in the heaviest test I've attempted so far (!).

$ /usr/sbin/ab -k -c 10 -n 1000000 http://127.0.0.1:8080/LICENSE
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 100000 requests
Completed 200000 requests
Completed 300000 requests
Completed 400000 requests
Completed 500000 requests
Completed 600000 requests
Completed 700000 requests
Completed 800000 requests
Completed 900000 requests
Completed 1000000 requests
Finished 1000000 requests

Server Software:        
Server Hostname:        127.0.0.1
Server Port:            8080

Document Path:          /LICENSE
Document Length:        1062 bytes

Concurrency Level:      10
Time taken for tests:   49.132 seconds
Complete requests:      1000000
Failed requests:        0
Write errors:           0
Keep-Alive requests:    1000000
Total transferred:      1280000000 bytes
HTML transferred:       1062000000 bytes
Requests per second:    20353.41 [#/sec] (mean)
Time per request:       0.491 [ms] (mean)
Time per request:       0.049 [ms] (mean, across all concurrent requests)
Transfer rate:          25441.76 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:     0    0   0.1      0       1
Waiting:        0    0   0.1      0       1
Total:          0    0   0.1      0       1

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      1
  75%      1
  80%      1
  90%      1
  95%      1
  98%      1
  99%      1
 100%      1 (longest request)

I think this bug can be closed. Thank you for taking the time to investigate 
and for the keep_alive patch.

Original comment by eric.sco...@gmail.com on 12 Oct 2010 at 6:06

GoogleCodeExporter commented 9 years ago

But without keep-alive, the issue still exists!
I've tried to play with SO_LINGER option when closing a socket,
and it appears to fix the issue even without messing with sysctl variables.

Original comment by valenok on 12 Oct 2010 at 9:16

GoogleCodeExporter commented 9 years ago

Having read more of the articles on the care and feeding of ephemeral ports, I 
think that's a change you shouldn't put into production. (See the Stack 
Overflow article you cited in #15 and also 
http://queue.acm.org/detail.cfm?id=1851338 and 
http://stackoverflow.com/questions/1410723/is-there-a-way-to-reopen-a-socket) 
All of these sources point to a risk of confusing stray packets from an old 
connection on that port # with a new one if you twiddle either SO_LINGER or 
net.inet.tcp.msl.

I think the correct behavior is to encourage the use of keep_alive in 
higher-traffic situations. Perhaps it should be enabled by default.

Note that using ab against Apache itself yields the same result with keep_alive 
disabled (my comment #6).

P.S. With keep_alive enabled, ab + mongoose ran successfully through a test of 
100 million requests overnight. There was a very, very slight memory accretion 
in the mongoose process (a few K bytes), but otherwise, the 21K requests/sec 
pace held up throughout the test.

Original comment by eric.sco...@gmail.com on 12 Oct 2010 at 1:12

GoogleCodeExporter commented 9 years ago

I believe that I am also affected by this issue.

I have a fairly hard-hit CGI script which generates map tiles.  It's not 
unusual for 60 to 120 requests to arrive from a single client at a time.  I 
decided to switch from Apache to Mongoose to serve this script for efficiency 
reasons (it didn't seem reasonable to spin up a mass pool of Apaches with PHP 
and a host of other modules to serve a script which doesn't depend on any of 
that stuff).  

In any case, after a certain amount of time, Mongoose slows way down or quits 
serving pages altogether (leaving a great deal of connections in the SYN_RECV 
state on the server side).  I'm only showing about 500 connections in TIME_WAIT 
when this happens.  Keepalives are enabled (and I've patched the source with 
the Oct 4 keepalive patch).  Mongoose's CPU time also drops to nothing in this 
state.

I'll try to collect more information over the next several days and post it 
here.  Is there anything specific you'd like me to try or record?

Original comment by larry.wi...@gmail.com on 21 Nov 2010 at 2:48

GoogleCodeExporter commented 9 years ago

I've submitted 
https://code.google.com/p/mongoose/source/detail?r=2550b2463d036674b4d050166b529
24a13ae7314 that should help.
Please verify.

Original comment by valenok on 24 Jun 2011 at 7:33

Changed state: Fixed

tst2005googlecode / mongoose

Mongoose stalls for 1-12 seconds under heavy load from Apache Bench on Mac #182