Unfortunately your test does not seem to run with Internet Explorer 8, and
Internet Explorer 9 does not run on Windows XP.
So I made a modified version of the test which runs in IE8 (and many other
browsers) and does not require a debug console in the browser. Furthermore, it
uses a CGI script "echo.cgi" in the same folder instead of a callback, so it
can be tested with an unmodified mongoose. It actually does not matter what the
CGI script sends, as long as it is a valid JSON object.
The test HTML is attached.
I tried it with 8 different browsers and the server running on both Win32 and
Linux, localhost and remote, with rather bad results: all major browsers had
trouble with the async POST requests. Browsers based on Presto (Opera) and
Trident (IE) had a lot of problems, but I also saw trouble with WebKit (Safari,
Chrome, ..) and sometimes Gecko (Firefox) browsers. Since the error can be
reproduced with different web engines, and with CGI as well as with callbacks,
it is very likely an issue in the mongoose core implementation.
@nullable: Could you please have a quick look at whether the essential meaning
of your test and mine is still the same (if not, I will open another issue for
this), and maybe try it with IE9?
Original comment by bel2...@gmail.com
on 2 May 2012 at 12:53
Attachments:
> @nullable: Could you please have a quick look at whether the essential
meaning of your test and mine is still the same (if not, I will open another
issue for this), and maybe try it with IE9?
Your test HTML shows (as I expected) the same problem. Thank you for
investigating this issue.
I forgot to say that I also tested with the precompiled binary from this site
(v3.1) in a similar manner, and it shows the same problems. Could you perhaps
attach the CGI script as well?
BTW: IE9 can emulate IE8 and IE7.. :-)
The more CPU cores are available, the more pronounced the problem becomes. On
my Core 2 Duo it shows up relatively seldom - but on another machine with 8
cores it shows up very often. Definitely a multithreading issue.
Original comment by nullable...@gmail.com
on 2 May 2012 at 7:28
> Could you perhaps attach the CGI script as well?
All files required for the test are here:
http://code.google.com/r/bel2125-mongoose/source/browse#hg%2Ftest%2Fajax
In case I make any improvements to the test, they will be available there as
well.
The problem also occurs on single core ARM CPUs, and forcing the server to a
single CPU in Win32 (by SetProcessAffinityMask(GetCurrentProcess(), 1);) does
not seem to make it better. So it is at least not a "multicore only" issue.
Reducing the number of worker threads to one does not eliminate the problem
either, so I think it must be somewhere in the interaction between the master
thread and the worker threads and/or the sockets.
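For context, the handoff between the master thread and the worker threads works roughly like the following sketch. This is a simplified reconstruction for illustration only, not the actual mongoose.c source; the names mirror the "queued socket" / "grabbed socket" debug output seen in the log.

/* Simplified reconstruction of the master/worker socket handoff
 * (illustrative only, not the actual mongoose.c code). */
#include <pthread.h>

#define QUEUE_SIZE 20

static int queue[QUEUE_SIZE];  /* accepted sockets waiting for a worker */
static int head, tail;         /* produce/consume positions */
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;

/* Master thread: every accept()ed socket is queued here ("queued socket"). */
static void produce_socket(int sock) {
    pthread_mutex_lock(&mutex);
    while (head - tail >= QUEUE_SIZE)
        pthread_cond_wait(&not_full, &mutex);
    queue[head++ % QUEUE_SIZE] = sock;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&mutex);
}

/* Worker threads: block here until the master queues a socket
 * ("grabbed socket"). */
static int consume_socket(void) {
    int sock;
    pthread_mutex_lock(&mutex);
    while (head == tail)
        pthread_cond_wait(&not_empty, &mutex);
    sock = queue[tail++ % QUEUE_SIZE];
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&mutex);
    return sock;
}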
The result of a test with IE8, just one worker thread and the process affinity
set to one core is attached as an image. The server log that belongs to the
image is also attached. The test was performed on localhost on a WinXP machine
which is fast and idle enough that there should be no timeouts.
The log shows several errors:
- The test makes 10 async requests, but according to the
log there are just 5 accesses to the resource
- "accepted socket", "queued socket" and "grabbed socket"
occurs 9 times each
- The log shows one error 577 and one error 400
It seems that the POST data does not reach the right socket?
Original comment by bel2...@gmail.com
on 3 May 2012 at 7:29
Attachments:
Thanks for looking at this!
I am repeating bel2125's tests on Mac OS, and don't see errors.
Let's narrow down the problem -- are we seeing this only with mongoose running
on Windows?
Do you guys see any trouble when mongoose is running on Linux/BSD/Mac?
Original comment by valenok
on 5 May 2012 at 4:41
The errors can be observed on the following operating systems:
- Win7 (original post from nullable with IE9)
- WinXP (dual core x86 with my modified test; IE8 and other browsers)
- Linux (dual core x86, Debian V6)
- Linux (single core ARM)
I did the last three tests and found that how often the problem occurs seems to
depend not on the platform but on the browser.
Opera does not seem to understand the test at all, so I got only errors from
that browser (or I did something wrong with it). Internet Explorer (I used 8,
nullable used 9) showed a lot of errors, and Safari and Chrome did not do well
either. Firefox worked well with the 10 test cases, but instead of the "done"
message below the result table it keeps showing "running" - other browsers do
not. Embedded browsers and Aurora seem to work better. However, it is not an
"Internet Explorer only" problem, it is just more frequent there - if you
repeat the test often enough you can see the problem with any major browser.
I also tested with Lunascape, a browser that supports three different web
engines, and found the same results.
In all my tests the server was running on real machines, not VMs. A virtual
machine may show a different timing, so it is probably not ideal for this kind
of test.
I do not have a MacOS computer available so I did not test it there.
Original comment by bel2...@gmail.com
on 5 May 2012 at 9:59
I changed my test a bit by giving each request a different URL - but still just
returning the POST data. I attached the original debug log of a session with a
failure (one timeout) - and the same log grouped by requests.
What seems strange to me are the different get_request_len values:
2917 (/echo_x) or 2918 (/echo_xy) seem to be the correct lengths, but there are
also 2707 (/echo_1), 415 and 416.
I attached my changed test again, just for completeness. There are some other
small changes, such as no longer reporting to the browser's debug console but
displaying the result on the web page.
Original comment by nullable...@gmail.com
on 6 May 2012 at 4:06
Attachments:
Unfortunately your test again does not run with Internet Explorer 8, which is
the most recent version available for WinXP ("Object required, line 77, row 1").
Also, my Qt browser gets stuck after the first of ten retries.
Since my test shows basically the same phenomena, I continued using mine.
In my CGI script I tried to store the POST data in a file, just for analysis,
but when I do this the test seems to work all the time. Writing to files
changes the timing or leads to a serialization of the requests at the file
system level.
Certainly this cannot be a solution to the issue, it is only instrumentation.
Furthermore, the problem seems to depend on the size of the POST data. On the
other hand, the reply data does not seem to be relevant.
Original comment by bel2...@gmail.com
on 7 May 2012 at 7:17
Attachments:
Have you made similar observations about the lengths stated in the debug
output of get_request_len(..)?
I have noticed the same about the size of the POST data.
Original comment by nullable...@gmail.com
on 7 May 2012 at 8:25
I made a simplified test with reduced data, basically just the ID:
data: { 'id' : id }
In this test I always get a CONTENT_LENGTH of 4 bytes (5 bytes for id=10), but
there seems to be some error when receiving the POST data in the CGI script. I
did not analyze that in more detail yet, since my method of just writing all
the data to a file in the CGI script seems to change the timing too much.
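For reference, a stand-alone C program with the same observable behavior as the echo.cgi used here might look like the sketch below. This is only an assumed equivalent (the actual echo.cgi in the test repository is a script with a doLogging switch): it reads CONTENT_LENGTH, consumes the POST body from stdin and replies with a small, valid JSON object, which is all the test requires.

/* Hypothetical stand-alone equivalent of echo.cgi (illustrative only). */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const char *cl = getenv("CONTENT_LENGTH");
    long expected = cl ? strtol(cl, NULL, 10) : 0;
    long received = 0;
    char *buf = (expected > 0) ? malloc((size_t)expected) : NULL;

    if (buf != NULL)
        received = (long)fread(buf, 1, (size_t)expected, stdin);
    /* CGI response: headers, blank line, then the body. */
    printf("Status: 200 OK\r\n");
    printf("Content-Type: application/json\r\n\r\n");
    printf("{ \"expected\": %ld, \"received\": %ld }\r\n", expected, received);
    free(buf);
    return 0;
}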
Original comment by bel2...@gmail.com
on 7 May 2012 at 2:12
I made an additional observation:
I set num_threads to 1 and tried to load the initial web page
(http://localhost:8080/test.html) of the second test I provided.
Now the following files should be transferred:
- test.html
- jquery.js
- favicon.ico -> Not Found
If I set a breakpoint in consume_socket(..) and "synchronize" the calls this
way, all is fine. If I don't have a breakpoint, the client thread hangs
immediately in the recv(..) call of pull(..). It can't serve the files.
1. Can someone confirm this behaviour?
2. Is it perhaps related to the bug we are discussing here?
Original comment by nullable...@gmail.com
on 8 May 2012 at 6:54
I have some more test results, with the test scripts from
http://code.google.com/r/bel2125-mongoose/source/browse#hg%2Ftest%2Fajax,
version rcf8cef88d4ba. In contrast to the earlier tests, the test data is no
longer generated but specified explicitly in test.html.
In all tests the "async POST" button was used - then nothing was done for about
a minute - then "async POST" was pressed again ...
In case the test is used unmodified (test.html contains six lines of
'longText', echo.cgi does not write the data to a file), almost no test works.
One has to try very often to find a single "success" in the test log.
-> see bad_case.log
In case the data is stored to a file (change line 47 in echo.cgi from doLogging
= false to doLogging = true), the POST requests work well. I have not observed
any errors yet.
-> see good_case.log
In case the data is not stored to a file but is very short (comment out all
"longText" lines in test.html, i.e. lines 52-57), it works much better. The
data is now basically just "id:1" and "async:true". The "errors" are relatively
rare, but they are still there.
-> see short_post_data.log
Maybe the result depends on the fragmentation of the POST data - at least it
depends on the size. Indeed, when looking at the logs, the lengths differ much
more in the bad case.
Original comment by bel2...@gmail.com
on 8 May 2012 at 8:00
Attachments:
> 1. Can someone confirm this behaviour?
I never had problems with any browser when loading the test page (test.html +
jquery.js), not for t=1, not for t=10 - not with my test, not with yours. But I
can only use IE8 - IE8 is not able to execute your scripts.
In my test, threads are sometimes blocked for quite a while, and if only one is
available, of course the entire server is blocked. I made most of the tests
with my own clone (http://code.google.com/r/bel2125-mongoose/); besides some
fixes for keep-alive, it can be configured to close a socket forcefully when it
is idle for 2 minutes. This way the server is never blocked completely for a
longer time.
Original comment by bel2...@gmail.com
on 8 May 2012 at 9:38
I'm a bit confused... I noticed that the problem is tightly coupled with the
number of CPUs you have. On my machine with 8 CPUs I have big problems, as
described, even when simply loading the files (html, js, ..) without executing
the JavaScript part. And even once they have loaded successfully, the
JavaScript tests fail for the most part.
This is much less the case on my 2-CPU machine.
To get a bit deeper into this, I made a shortcut and processed every request
directly on the master thread. I was very surprised to see the same problems
even with only one (the master) thread! I attached my changed code (which is
marked with a "DEBUG" comment). Unfortunately I have not yet converted the
JavaScript to run on IE8.
I hope I didn't mess up the code and my change is valid - otherwise let me
know.
Original comment by nullable...@gmail.com
on 9 May 2012 at 4:13
Attachments:
I always had the problem on single core machines as well.
At first glance your code changes look OK to me, but I have not tested them yet.
I made a test with the standard server and 10 threads using the "fiddler2" web
debugger. It can be used on the client PC to log the communication between
server and client (probably not for localhost). One can clearly see that in
some cases the server sends a 577 error (see attachment).
@nullable: Maybe you could provide such a log for your case too? It would be
interesting whether your test on your machine with your IE also yields a 577
server response.
Original comment by bel2...@gmail.com
on 10 May 2012 at 1:14
Attachments:
Here are some test results:
With Fiddler and the browser running on a VM:
Tests from comment 6: no problems, always 200 OK
Tests from comment 13: no problems, always 200 OK
Without Fiddler - browser running on a VM:
Tests from comment 6: often timeouts
Tests from comment 13: often timeouts
With Fiddler, browser and server running on the same machine (not a VM):
Tests from comment 6: no problems, always 200 OK
Tests from comment 13: no problems, always 200 OK
Without Fiddler - browser and server running on the same machine (not a VM):
Tests from comment 6: often timeouts, some errors (XHR status code 12152)
Tests from comment 13: mostly could not load test.html and jquery.js; if
loaded: often timeouts, some errors (XHR status code 12152)
I never got error 577. XHR status code 12152 seems to come from IE when the
response is not valid.
Original comment by nullable...@gmail.com
on 11 May 2012 at 7:37
Hmm. Then the Fiddler web debugging proxy changes the timing too much on your
machine.
Did you see any error codes in the server debug log?
on 11 May 2012 at 11:30
The server debug log shows 404 (after the timeout) - I guess that is when the
Ajax timeout closes the connection, but the request was not fully received.
On Monday I will try the same tests with mongoose versions 3.1 and 2.12.
Original comment by nullable...@gmail.com
on 12 May 2012 at 11:45
I have now tested versions 3.2, 3.1 and 2.12 - the sources and binaries are
contained in tests.zip.
The reports (server debug log and a screenshot of the browser window after a
test) are attached.
The tests were made on my 8 CPU machine with Win7 64bit using IE9.
All versions show the problem.
Original comment by nullable...@gmail.com
on 14 May 2012 at 7:27
Attachments:
With some instrumentation I found that the errors arise somewhere in push and
pull.
That there were no 577 errors in the test performed by nullable is because 577s
are set in "forward_body_data" only, which is called for CGI pages but not for
callbacks. Still the source of these errors could be the same, in the push/pull
functions.
My push and pull functions with additional traces can be found as an attachment
(together with forward_body_data), as well as the debug traces.
Still this does not explain WHY the functions fwrite and recv sometimes fail in
push and pull.
Original comment by bel2...@gmail.com
on 16 May 2012 at 6:41
Attachments:
Since it is not really possible to see inside Internet Explorer, I made a test
client to reproduce this issue. It can be found here:
http://code.google.com/r/bel2125-mongoose/source/browse/#hg%2Ftestclient
The test client sends multiple parallel requests to the web server, and indeed
the issue can be reproduced.
It turns out that there are errors not only in simultaneous POST requests, but
also in GET requests (although not as frequent). I made tests with GET requests
(0 bytes of data) as well as POST requests with 3, 1000 and 10000 bytes of
data. The more data is sent with the POST requests, the higher the number of
errors - but there are errors basically everywhere, even for GET.
/ajax/echo.cgi GET
CPUs (bit masks): process=3, system=3
Preparing test ... OK: 77 bytes of data
Starting multi client test: 20 cycles, 20 clients each
Test cylce 1 completed
Error: Client 2002 got 0 bytes instead of 77
Error: Client 2018 got 0 bytes instead of 77
Test cylce 2 completed
Test cylce 3 completed
Test cylce 4 completed
...
/ajax/echo.cgi POST 3
CPUs (bit masks): process=3, system=3
Preparing test ... OK: 95 bytes of data
Starting multi client test: 20 cycles, 20 clients each
Error: Client 1005 got 0 bytes instead of 95
Test cylce 1 completed
Test cylce 2 completed
Test cylce 3 completed
Test cylce 4 completed
Test cylce 5 completed
Error: Client 6007 got 0 bytes instead of 95
Error: Client 6013 got 0 bytes instead of 95
Test cylce 6 completed
...
/ajax/echo.cgi POST 1000
CPUs (bit masks): process=3, system=3
Preparing test ... OK: 98 bytes of data
Starting multi client test: 20 cycles, 20 clients each
Error: Client 1000 got 33 bytes instead of 98
Test cylce 1 completed
Test cylce 2 completed
Error: Client 3002 got 33 bytes instead of 98
Error: Client 3006 got 0 bytes instead of 98
Test cylce 3 completed
Test cylce 4 completed
Test cylce 5 completed
Test cylce 6 completed
...
/ajax/echo.cgi POST 10000
CPUs (bit masks): process=3, system=3
Preparing test ... OK: 99 bytes of data
Starting multi client test: 20 cycles, 20 clients each
Error: Client 1000 got 33 bytes instead of 99
Error: Client 1003 got 33 bytes instead of 99
Error: Client 1007 got 33 bytes instead of 99
Test cylce 1 completed
Error: Client 2000 got 33 bytes instead of 99
Error: Client 2002 got 33 bytes instead of 99
Error: Client 2001 got 33 bytes instead of 99
Error: Client 2008 got 33 bytes instead of 99
Error: Client 2006 got 33 bytes instead of 99
Test cylce 2 completed
Thread 3000 did not finish!
Test cylce 3 completed
Test cylce 4 completed
Error: Client 5002 got 33 bytes instead of 99
Thread 5000 did not finish!
Test cylce 5 completed
Error: Client 6004 got 33 bytes instead of 99
Error: Client 6006 got 33 bytes instead of 99
Error: Client 6001 got 33 bytes instead of 99
Error: Client 6008 got 33 bytes instead of 99
Error: Client 6010 got 33 bytes instead of 99
Error: Client 6012 got 33 bytes instead of 99
Error: Client 6018 got 33 bytes instead of 99
Thread 6000 did not finish!
Test cylce 6 completed
Error: Client 7001 got 33 bytes instead of 99
Error: Client 7006 got 33 bytes instead of 99
Error: Client 7003 got 33 bytes instead of 99
Error: Client 7005 got 33 bytes instead of 99
Error: Client 7007 got 33 bytes instead of 99
Error: Client 7009 got 33 bytes instead of 99
Error: Client 7011 got 33 bytes instead of 99
Test cylce 7 completed
Error: Client 8005 got 33 bytes instead of 99
Error: Client 8003 got 33 bytes instead of 99
Error: Client 8000 got 33 bytes instead of 99
Error: Client 8001 got 33 bytes instead of 99
Error: Client 8002 got 33 bytes instead of 99
Error: Client 8004 got 33 bytes instead of 99
Error: Client 8006 got 33 bytes instead of 99
Test cylce 8 completed
Error: Client 9001 got 33 bytes instead of 99
Error: Client 9003 got 33 bytes instead of 99
Error: Client 9002 got 33 bytes instead of 99
Error: Client 9000 got 33 bytes instead of 99
Error: Client 9004 got 33 bytes instead of 99
Error: Client 9008 got 33 bytes instead of 99
Test cylce 9 completed
Error: Client 10002 got 33 bytes instead of 99
Error: Client 10000 got 0 bytes instead of 99
Error: Client 10001 got 33 bytes instead of 99
Error: Client 10004 got 33 bytes instead of 99
Test cylce 10 completed
Error: Client 11003 got 33 bytes instead of 99
Error: Client 11001 got 33 bytes instead of 99
Test cylce 11 completed
Error: Client 12000 got 33 bytes instead of 99
Error: Client 12018 got 33 bytes instead of 99
Test cylce 12 completed
…
Original comment by bel2...@gmail.com
on 21 May 2012 at 9:50
I modified the test so that it runs automatically and tests POST requests with
different content lengths. The test case can be found at
http://code.google.com/r/bel2125-mongoose/source/browse/#hg%2Ftestclient as
version r7ffffb3606d1.
The test logs are attached.
Errors can be found for ANY content length, they are just more frequent for
larger POST requests. There are at least two kinds of errors: one with a 33
byte response, another with no response at all (0 bytes).
Original comment by bel2...@gmail.com
on 22 May 2012 at 10:04
Attachments:
When looking closer at the log, it seems like the misbehavior of mongoose is
not limited to parallel requests from many clients.
There are also several error situations with just one client making a POST
request, e.g.,
-
http://code.google.com/p/mongoose/issues/attachmentText?id=349&aid=3490021001&name=testrun1.txt&token=cKV2NU5T5QLIZoW8EHzi4QtEju0%3A1337681079528#1216
-
http://code.google.com/p/mongoose/issues/attachmentText?id=349&aid=3490021001&name=testrun1.txt&token=cKV2NU5T5QLIZoW8EHzi4QtEju0%3A1337681079528#2305
-
http://code.google.com/p/mongoose/issues/attachmentText?id=349&aid=3490021001&name=testrun1.txt&token=cKV2NU5T5QLIZoW8EHzi4QtEju0%3A1337681079528#3403
-
http://code.google.com/p/mongoose/issues/attachmentText?id=349&aid=3490021001&name=testrun1.txt&token=cKV2NU5T5QLIZoW8EHzi4QtEju0%3A1337681079528#4498
-
http://code.google.com/p/mongoose/issues/attachmentText?id=349&aid=3490021002&name=testrun2.txt&token=JO8nywuKrKMn8rD9oR60Kq_404I%3A1337681079528#432
-
http://code.google.com/p/mongoose/issues/attachmentText?id=349&aid=3490021002&name=testrun2.txt&token=JO8nywuKrKMn8rD9oR60Kq_404I%3A1337681079528#1466
In other words, mongoose is unreliable in handling POST requests in all
situations; more threads and a larger amount of data just increase the
frequency of the errors. However, this error may strike even a single client
with a very simple request.
Original comment by bel2...@gmail.com
on 22 May 2012 at 10:27
This is bad news... do you know whether this also applies to GET requests in
the way stated in comment 22?
Original comment by nullable...@gmail.com
on 22 May 2012 at 12:16
[deleted comment]
The POST test above was running for several hours, and then it took some time
to evaluate it and to ensure the test was running correctly. I am doing the
same for GET requests, but it's not completed yet.
In contrast to POST requests, GET requests may not have any body data. Only the
size of the request head and the length of an optional query string may vary.
Some extra data in the request head and an additional query string may actually
be used for GET, POST and any other request.
In contrast to increasing the length of the POST body data, increasing the
length of the head data and using longer query strings does not seem to yield a
significant increase in the frequency of this error. Query strings up to
somewhere between 3 and 4 kB can be used - for longer strings the server does
not seem to work at all, but 3 kB is well above the limit for Internet Explorer
(2083 bytes according to
http://en.wikipedia.org/wiki/Query_string#Compatibility_issues), so this should
be sufficient. Adding some 10 kB of useless data fields in the request header
does not seem to cause the same issues as adding the same amount of body data.
To find out if at least a non-concurrent GET request is 100% reliable, I will
first run a long-term test (for several hours) with small request headers and
just one client thread.
Original comment by bel2...@gmail.com
on 22 May 2012 at 4:07
I made another modified version of the test which performs 1 GET request every
10 seconds (revision 8fabfb035854,
http://code.google.com/r/bel2125-mongoose/source/browse/testclient/testclient.c?r=8fabfb035854746c902d23e2981a4ef80de04304).
The result after 3000 tests (about 9 hours) is:
GET  998  2  (998 of 1000 OK, two cases of "Client 0 got %u bytes instead of 77")
GET 1000  0  (1000 of 1000 OK)
GET  998  2  (998 of 1000 OK, two cases of "Client 0 got %u bytes instead of 77")
... that means 4 of 3000 GET requests got lost (0.133%).
This is about the same ratio as for concurrent POST requests with less than a
few hundred bytes of body data (see the first 10 lines of testclient.log in
comment 21). There, 14 of 10000 small POST requests got lost (0.14%).
So there seems to be a basic flaw that hits about 0.14% of all requests, no
matter whether they are GET or POST requests and no matter how large they are
("got 0 bytes instead of xx"). In addition there is a second error that depends
on the POST data size ("got 33 bytes instead of xx").
Original comment by bel2...@gmail.com
on 23 May 2012 at 7:06
Some more test results:
A non-concurrent POST request with 0 bytes of data (test version 478bda26593b):
OK    Errors
1000  0
999   1
999   1
994   6
... not much difference from the GET request.
Original comment by bel2...@gmail.com
on 25 May 2012 at 6:38
Since both GET and POST requests to CGI-generated content are unreliable, I
tested a GET request to static content (test version 5af5ab6c3e45):
Neither the single-client nor the 20-client concurrent test showed any losses
in 3000 (single client) and 33000 (multi-client) test requests.
In order to check whether the flaw is in mongoose or in the CGI process, I
tested with a different CGI process. In fact I modified the line
if (CreateProcessA(NULL, cmdline, NULL, NULL, TRUE,
to
if (CreateProcessA(NULL, "cmd /c type reply.txt", NULL, NULL, TRUE,
The result is the same as for the CGI process before:
Multiclient test:
997 3
996 4
992 8
996 4
998 2
1000 0
999 1
989 11
994 6
single client test:
CPUs (bit masks): process=3, system=3
Error: Client 1 got 0 bytes instead of 3
Error: Client 1 got 0 bytes instead of 3
Error: Client 1 got 0 bytes instead of 3
Error: Client 1 got 0 bytes instead of 3
Error: Client 1 got 0 bytes instead of 3
test cycle 0: 995 good, 5 bad
Error: Client 1 got 0 bytes instead of 3
Error: Client 1 got 0 bytes instead of 3
Error: Client 1 got 0 bytes instead of 3
Error: Client 1 got 0 bytes instead of 3
Error: Client 1 got 0 bytes instead of 3
Error: Client 1 got 0 bytes instead of 3
Error: Client 1 got 0 bytes instead of 3
test cycle 1: 993 good, 7 bad
Error: Client 1 got 0 bytes instead of 3
Error: Client 1 got 0 bytes instead of 3
Error: Client 1 got 0 bytes instead of 3
Error: Client 1 got 0 bytes instead of 3
Error: Client 1 got 0 bytes instead of 3
Error: Client 1 got 0 bytes instead of 3
Error: Client 1 got 0 bytes instead of 3
Error: Client 1 got 0 bytes instead of 3
Error: Client 1 got 0 bytes instead of 3
Error: Client 1 got 0 bytes instead of 3
Error: Client 1 got 0 bytes instead of 3
test cycle 2: 939 good, 11 bad
Original comment by bel2...@gmail.com
on 25 May 2012 at 7:01
Attachments:
All my tests before used either CGI (GET, POST) or static content (GET).
I modified the test to check callback-generated content
(http://code.google.com/r/bel2125-mongoose/source/browse/testclient/testclient.c?r=334dd2cf9681994ee69e66646ea3ed34ab118a3a). The callback is here:
http://code.google.com/r/bel2125-mongoose/source/browse/main.c?r=4d4559368c99c17a003456a4767fdee2b265fee1
It turns out that both POST and GET requests to this callback are reliable ...
DATASIZE OK FAILED
1 1000 0
2 1000 0
4 1000 0
8 1000 0
16 1000 0
32 1000 0
64 1000 0
128 1000 0
256 1000 0
512 1000 0
1024 1000 0
2048 1000 0
4096 1000 0
8192 1000 0
16384 1000 0
32768 1000 0
65536 1000 0
131072 1000 0
262144 all failed
... until a POST content length of about 200 kB. So this must be a third error.
Original comment by bel2...@gmail.com
on 29 May 2012 at 6:59
TL;DR:
major offender is incorrect 'graceful close', but that doesn't seem to cover it
all. Yet.
.....................
/TL;DR:
Bugger: old message doesn't show up in here. Anyway, not really important as I
was able to reproduce the issue with static content and callback + POST/GET.
There's more than one problem here - including issues in the test app
'testclient' itself - but the major biggy is incorrect handling of the
'graceful close', at least on Windows (tested on Win7/64/Winsock2).
I have a mongoose clone over at https://github.com/GerHobbelt/mongoose/network
where branch bel2125-multitransfer-test now contains code that at least
resolves this issue in mongoose.
Doesn't mean everything is well, as I still get some odd responses which are
very timing dependent, so more analysis is being done.
Anyway, relevant URLs and code at
https://github.com/GerHobbelt/mongoose/blob/c69da4db0f2fc6b0d60d984e6c839504fc12421a/mongoose.c#L4920
- note the MSDN and particularly the netherlabs URL in the comments there if
you want/need more in-depth intel on this thing.
.......................
Other issues which popped up:
- testclient would lock up for large-and-beyond POSTs, etc. due to the TCP
buffers filling up and not getting read (pulled) at the other end. This goes
for both the client and the server side of the /_echo code.
( https://github.com/GerHobbelt/mongoose/commits/bel2125-multitransfer-test --
work started about near commit c81c3d / 29 May 2012)
--> fixed by having the client 'slurp' content while it is still sending POST
data and the server main.c code fetching data in chunks and spitting it back
again (see the sketch after this list).
- testclient code has been overhauled to reduce CPU load; even very small
transfers would max out the CPUs.
- UNRESOLVED: 'testclient 35 1 .' (that'd be 1 thread testing POST /_echo with
N bytes) _can_ quit prematurely due to 'no bytes received'; this is an as yet
unresolved timing issue that is related to the 'graceful close' issue that's
the major offender. Exact cause undetermined; (of course) it does not exhibit
this behaviour in the debugger; the effect is clearly very timing dependent.
- UNRESOLVED: 'testclient' running the entire test set will sometimes show a
lockup, which lasts until timeout; exact cause not yet determined.
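The 'slurp while sending' fix mentioned in the first item can be pictured with the following sketch. It is illustrative only (echo_exchange, body and reply are made-up names, not the actual testclient code): the client stays ready to read the echoed reply even while its own send() of the POST body is still in progress, so full TCP buffers on either side cannot deadlock the exchange.

#include <stddef.h>
#ifdef _WIN32
#include <winsock2.h>
#else
#include <sys/select.h>
#include <sys/socket.h>
typedef int SOCKET;
#endif

/* Illustrative sketch only: exchange a request body against an echoed reply
 * without ever blocking one-sidedly. Returns bytes received, or -1 on error. */
static int echo_exchange(SOCKET sock, const char *body, size_t body_len,
                         char *reply, size_t reply_size) {
    size_t sent = 0, rcvd = 0;
    fd_set rfds, wfds;
    while (sent < body_len || rcvd < reply_size) {
        FD_ZERO(&rfds);
        FD_ZERO(&wfds);
        if (rcvd < reply_size)
            FD_SET(sock, &rfds);     /* read while room remains */
        if (sent < body_len)
            FD_SET(sock, &wfds);     /* write only while data remains */
        if (select((int)sock + 1, &rfds, &wfds, NULL, NULL) <= 0)
            return -1;
        if (FD_ISSET(sock, &wfds)) {
            int n = send(sock, body + sent, (int)(body_len - sent), 0);
            if (n <= 0)
                return -1;
            sent += (size_t)n;
        }
        if (FD_ISSET(sock, &rfds)) {
            int n = recv(sock, reply + rcvd, (int)(reply_size - rcvd), 0);
            if (n <= 0)
                break;               /* 0 = orderly close by the peer */
            rcvd += (size_t)n;
        }
    }
    return (int)rcvd;
}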
Original comment by ger.hobbelt
on 31 May 2012 at 8:13
oh-kay. Turned out the remaining 'odd behaviour' items over here were all due
to bugs in testclient - all my own mistakes (heck, I altered it so much I can't
blame anyone else :-) ).
Anyway, all tests in testclient PASS for
https://github.com/GerHobbelt/mongoose/commits/bel2125-multitransfer-test ,
latest commit.
TODO: cherry-pick/backport and then push back into gc/hg, so this can be turned
into a pull request.
Original comment by ger.hobbelt
on 31 May 2012 at 12:35
Thank you for your work on this topic. Since you made quite a few changes, it
will take me a while to really look through the code.
I know that the testclient uses a somewhat unusual way of closing the
connection, basically waiting for 10 seconds if no more data is received. On
the one hand, if the server does not respond well, the client will not hang; on
the other hand, I think it is realistic that a user presses reload or close if
there is no response for 10 seconds anyway. I also thought that I should not
have any problem with the Nagle algorithm, since it will send any remaining
data after some 100 ms anyway - maybe this was not a correct assumption?
In any case, the problem cannot be in the testclient alone, since the entire
issue started with missing responses to POST requests within Internet Explorer.
I just used the testclient to see a bit deeper inside, deeper than is possible
with Internet Explorer. Especially since web debugging tools like Fiddler alter
the behavior of the system too much - maybe by serializing the requests?
So there must be at least some issue in the server code.
I have seen that there are quite some differences between your version of
mongoose.c and mine, and even more differences to the mainline. I think many of
them are structure and coding style. What would you think is the main
functional difference with regard to this issue?
Original comment by bel2...@gmail.com
on 4 Jun 2012 at 4:02
Grmbl >:-(( I had written a reply explaining what's changed in testclient and
why, plus the changes in the test code in the mongoose event callback, then
Safari crashed and it was bye bye.
Anyway, bottom line:
- TCP is a stream protocol (and yes, I had to dust off my W. Richard Stevens
books and be reminded of that) and Nagle in combination with HTTP is not a fix,
but rather bad duct tape for a botched interchange. There's no 'end of message
flush moment' in a stream protocol other than close (or more properly:
shutdown(SHUT_WR), so the other side knows you're done as a client), then wait
for the server to see this (instantly! no Nagle needed!) and have it do the
same (shutdown(SHUT_WR) or a full 'graceful' close). And the graceful close is
botched in mongoose... (see the sketch after this list)
- testclient would cause a connection and thus thread lockup in mongoose - this
happened if the data sent is larger than the SO_RCVBUF of the client plus the
SO_SNDBUF and internal buffer sizes of mongoose: the client blocks in send()
because it is not done yet, while mongoose has already got a return from recv()
and jumped to the event callback code, which does further recv()s, which is
fine, but it will also try to send() the data received so far as part of the
echo code, and since testclient isn't fetching any data yet (still busy
blocking in its large send()), the buffers fill up and then both lock up in
blocking calls: testclient isn't done sending yet, while mongoose needs to get
its written data out the door because everything is full.
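For reference, the 'graceful close' dance referred to above can be sketched as follows. This is a minimal illustration along the lines of the SO_LINGER article linked below, not the actual mongoose fix; a real server would also bound the drain loop with a receive timeout so a stalled client cannot hang the thread.

#ifdef _WIN32
#include <winsock2.h>
#define SHUT_WR SD_SEND
#else
#include <sys/socket.h>
#include <unistd.h>
typedef int SOCKET;
#define closesocket(s) close(s)
#endif

/* Minimal sketch of a graceful close (illustrative only). */
static void graceful_close(SOCKET sock) {
    char drain[1024];
    /* 1. Tell the peer we will send no more data. */
    shutdown(sock, SHUT_WR);
    /* 2. Drain whatever is still in flight until the peer closes its side
     *    too (recv() returns 0) or an error occurs; closing with unread data
     *    pending can trigger an RST that discards data the peer has not
     *    read yet. */
    while (recv(sock, drain, sizeof(drain), 0) > 0)
        continue;
    /* 3. Only now actually close the descriptor. */
    closesocket(sock);
}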
If you haven't got the Stevens books, here are a few links which got me going
the right way (contrasting with a lot of 'disable Nagle' distraction on the
Net):
http://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable
http://www.techrepublic.com/article/tcpip-options-for-high-performance-data-transmission/1050878
More to follow later. Getting late again :-(
Original comment by ger.hobbelt
on 4 Jun 2012 at 10:22
And in answer to what the main functional differences are:
- main.c (hence mongoose) has been changed in the echo event callback to listen
for both remaining incoming data and send buffer availability simultaneously,
as you can't predict how much data is incoming (read: waiting in the TCP stack
buffers; you only know the _total_ when it's a _proper_ HTTP/1.1 client
transmit, but that doesn't help you, as you need to be able to send when
possible and read when required, or you'd get the previously mentioned
connection lockup again)
- mongoose.c has corrected graceful close logic + corrected should_keep_alive
logic (yes, I am aggressive in closing a connection, in that no one can
guarantee full awareness of server state when a 5xx error happens -- which in
original mongoose also means a status of -1 can arrive at should_keep_alive).
Thank you for reminding me about the should_keep_alive logic; mine was still
flawed, and fatally so in the case of a 401! That's corrected:
https://github.com/GerHobbelt/mongoose/blob/master/mongoose.c#L1383
- testclient now acts as a correct client by sending the request, which will be
followed by a close anyway since we send Connection: close, so we know what to
do: shutdown(write side) and then fetch the remainder of the server response.
I have to go back in and create a mode in testclient where it is BADLY behaved
by NOT doing a shutdown(WR); not doing that would make the TCP stack at the
client side wait before sending the request, thus leading to slow performance.
(When HTTP requests are pipelined they shouldn't depend on the previous
responses in the pipeline, as the stream protocol gives no guarantees
whatsoever about when data will arrive, so you can't have a 'message-based'
approach - send request, wait for the response to complete, send another
request - on a single TCP connection. That mistake is made very often, and
Nagle is then used to 'fix' that mistake.)
- mongoose.c also has timeouts (SO_RCVTIMEO et al.) set for the _listening_
socket (and then sets them again on the socket produced by accept(), in case
the stack doesn't conform to spec: options of a listening socket propagate into
the sockets produced by accept()). Original mongoose didn't do this and just
set timeouts on the accepted ones. A minor difference and probably not relevant
to this issue (see the sketch after this list).
- also note the comments at the graceful close site: sockets are explicitly
switched back to BLOCKING mode before doing the graceful close dance; as the
MSDN docs and others say: don't do this on a non-blocking socket.
The rest of the edits are probably irrelevant to this case.
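The timeout setup described in the fourth item might look like the following sketch. Illustrative only: the function name and the 30-second value are assumptions, not the actual mongoose code.

#ifdef _WIN32
#include <winsock2.h>
#else
#include <sys/socket.h>
typedef int SOCKET;
#endif

/* Sketch: set rcv/snd timeouts on the listener, then re-apply them on the
 * accepted socket in case the stack does not propagate listener options. */
static SOCKET accept_with_timeouts(SOCKET listener) {
#ifdef _WIN32
    DWORD tv = 30000;              /* Winsock wants milliseconds */
#else
    struct timeval tv = { 30, 0 }; /* POSIX wants a struct timeval */
#endif
    SOCKET conn;
    setsockopt(listener, SOL_SOCKET, SO_RCVTIMEO, (const char *)&tv, sizeof(tv));
    setsockopt(listener, SOL_SOCKET, SO_SNDTIMEO, (const char *)&tv, sizeof(tv));
    conn = accept(listener, NULL, NULL);
    if (conn != (SOCKET)-1) {      /* INVALID_SOCKET on Winsock */
        /* ... and again, because the propagation is not guaranteed. */
        setsockopt(conn, SOL_SOCKET, SO_RCVTIMEO, (const char *)&tv, sizeof(tv));
        setsockopt(conn, SOL_SOCKET, SO_SNDTIMEO, (const char *)&tv, sizeof(tv));
    }
    return conn;
}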
Original comment by ger.hobbelt
on 4 Jun 2012 at 10:49
The source of a minimal Qt WebKit based browser is available here:
http://code.google.com/r/bel2125-mongoose/source/browse#hg%2FBrowser
Do not think of it as a fully featured browser that could be a replacement for
your favorite PC/Mac/desktop browser. It could be a starting point for building
a browser for an embedded device, e.g., on one of those embedded-something
boards you may get at an electronics components store for less than 50 $ or €.
It could also be a starting point for a browser window you may add to an
application, e.g., to show help pages or dynamic content generated by that
application. It does not have a menu or navigation bar; just specify the
starting URL on the command line. It does not contain a window resize handler,
so the size is fixed and can be set from the command line by using -w###x###
(e.g., -w600x400).
Original comment by bel2...@gmail.com
on 8 Jun 2012 at 6:17
Studying the links from ger.hobbelt (comment 33), it seems that TCP has quite a
few properties that are undesirable for HTTP. After reading them, I felt I had
to write down some important use cases one would have for an HTTP server:
U1) the classical interactive use by a human user
U2) as a protocol for devices that interchange information (no humans)
U3) use in a bad network (mobile internet in rural areas)
U4) exposure to the internet (an unfriendly to vicious environment)
U5) use it mainly in keep-alive mode
U6) use it on embedded devices (small memory, non-standard operating systems)
U7) use it embedded in other applications
U1 means it has to be reasonably responsive. There must not be long or frequent
delays in interactive user sessions.
U2 means it has to be reliable. A user may tolerate a loading fault once in a
while and push the reload button; a fully automatic client may not.
U3 means not every connection is shut down nicely, sometimes the network is
rather slow, sometimes there are connection dropouts for a few seconds and
sometimes it breaks down completely for an hour, because of weather conditions.
U4 means that you have to expect unknown and badly behaved clients. Probably
not attacks from experienced hackers (I would not use it as a government,
banking or large corporate web server anyway), but it should not crash when
some kid who happened to find the IP address fiddles around with the server.
Anyway, not only in the internet environment and bad networks, but as a general
design principle, a server should be robust against rude clients. Not every
rude behavior is an attack.
U5: keep-alive mode is recommended for HTTP/1.1 and all browsers that I know
use it (that my testclient used "Connection: close" was just meant to simplify
the situation). Multiple requests reuse the same connection, therefore one
cannot shut down the socket after the first request. Regarding this, I need to
study whether Ajax (http://en.wikipedia.org/wiki/Ajax_(programming)) actually
uses the same connection for every request or opens a different one. Remember,
the entire issue started with Ajax POST requests. In case they reuse the
connection, Ajax is doing exactly what should not be done according to the
above considerations: the second Ajax request may depend on the data received
from the first Ajax request (it does not in the test, but often it will -
depending on the web site content).
U6 means that I must consider situations other than a recent desktop Windows
and a major Linux distribution as well. There might be some differences in the
TCP/IP stack implementations, and there are no books, mailing lists, experts,
.. available on the details. In addition, one must try to keep everything
"small".
U7 means one must not change any setting that affects the entire process. Of
course one must not do anything that makes the process crash or hang.
Ideally all of the above requirements are fulfilled at the same time. Is that
unrealistic? I hope not. However, I am aware that some of them cannot be
fulfilled by using TCP in the friendliest way possible. That is where my
timeouts and ungraceful close after some seconds came from.
Original comment by bel2...@gmail.com
on 8 Jun 2012 at 7:47
I started to merge some of the fixes by ger.hobbelt into my clone. A first
merge is available at
http://code.google.com/r/bel2125-mongoose/source/browse#hg, but that work is
not yet finished. I was not able to merge everything, because there are quite a
few differences, some functions are not available on all my platforms, and I
simply have not yet studied every difference. I am not sure whether I have
already merged the most important differences.
I tried the shutdown code from ger.hobbelt in mongoose, but not in testclient
(a server should not rely on having only nice clients, see comment 36). Still,
I added a shutdown call.
With this merge I repeated the test, and found that there are still errors when
using mongoose and testclient for 2000 byte POST requests to CGI, as well as
when using the Ajax test with Internet Explorer 8.
One of the documents cited in comment 33 mentions that the Apache web server
disables Nagle on all sockets. Ignoring all warnings, I just tried that and
added
{int disable_Nagle = 1; setsockopt(sock, IPPROTO_TCP, TCP_NODELAY,
(char*)&disable_Nagle, sizeof(disable_Nagle));}
in mongoose.c (where I set the socket timeout). Doing that in the merged
version did not fix the errors, neither for the testclient nor for the Ajax
POST test. However, it seems to work if both mongoose and testclient disable
Nagle for every socket (no errors in a quick test). Anyhow, this is no fix,
because the reference is Internet Explorer, not testclient.
Original comment by bel2...@gmail.com
on 8 Jun 2012 at 8:09
I think by now I have merged all the essential parts of mongoose.c from
gerhobbelt-mongoose-issue-360 into my clone. Unfortunately, not all the errors
are completely gone. There are still some errors with testclient (which I did
not merge completely). But, much more importantly, the POST test issues with
Internet Explorer/Trident and WebKit/QtBrowser are still there (see images).
Original comment by bel2...@gmail.com
on 12 Jun 2012 at 10:12
.. here are the images mentioned in comment 38
Original comment by bel2...@gmail.com
on 12 Jun 2012 at 10:13
Attachments:
I think I've nailed the POST issue as well now; at least it tests OK (success)
with IE8/IE9/FF with mongoose built on Win7/64/MSVC2010, Win7/32/MinGW and
Ubuntu10/gcc4.
http://code.google.com/r/gerhobbelt-issue349/
The POST issue consists of several issues, depending on where you are,
platform-wise.
1) On Windows, you're basically screwed if the CGI interpreter is in a path
with spaces in it (e.g. c:\Program Files (x86)\Lua\5.1\lua.exe). Fixed by
properly quoting the command line.
2) Any CGI run will fail when your mongoose + website and CGI interpreter are
on different drives and document_root is a relative path (issue 364: fixed).
3) When CGI scripts are in DOS format (CRLF instead of LF in the script file
itself), Linux at least will SILENTLY fail: nothing to stderr, nothing to
stdout.
4) The echo.cgi script ignores content data that's part of the request,
resulting in a 'broken pipe' as soon as you forward that data to the script,
because the script has already terminated once it found it had enough input
from the HTTP headers alone. (You WANT forward_body_data() to rigorously check
I/O operations, because that's what you want for PUT, but in the case of a
GET+content or POST+content and a CGI script that ignores the incoming data
anyway, you DON'T want the same code to send a 5xx response.) Fixed.
5) When you've got a persistent connection and CGI scripts or NEW_REQUEST
callback handlers or whatnot IGNORING the request content, i.e. not
mg_read()ing ALL the request content, then for 'large-ish and beyond' content
sizes the discard function would fail horribly, as it believed it only had to
adjust the buffer. Fixed by simply letting mg_read() handle it; replicating the
logic of mg_read() -- which returns 0 when the request content has been read in
its entirety -- was found to be too much hassle: it's subtle, so replicating it
elsewhere in the same code is begging for bugs. Fixed (see the sketch after
this list).
6) The test.html page would b0rk fatally with IE8/9, since IE doesn't define
the 'console' object/methods when you're not running the IE debugger (F12).
Note for the wicked: once you've invoked the [F12] debugger and closed it
again, the console object remains available. Nevertheless, the JS code now
checks for a valid console instance before console.log()ging. Fixed.
7) To work, you need all the #360 fixes as well (which concern themselves with
TCP connectivity in an HTTP scenario: graceful close et al.). These fixes are
all included in the issue349 clone.
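Point 5 can be illustrated with a short loop over the public mg_read() API. This is a sketch of the idea, not the actual patch, and the helper name is made up:

#include <stddef.h>
#include "mongoose.h"

/* Sketch: drain whatever request body the handler did not consume, so a
 * keep-alive connection is positioned at the start of the next request.
 * mg_read() returns 0 once the request content has been read entirely. */
static void discard_remaining_content(struct mg_connection *conn) {
    char scratch[4096];
    while (mg_read(conn, scratch, sizeof(scratch)) > 0) {
        /* intentionally empty: bytes are read and dropped */
    }
}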
Original comment by ger.hobbelt
on 13 Jun 2012 at 7:10
Hi,
I tried to run the issue349 clone against my tests provided in comment 18
(http://code.google.com/p/mongoose/issues/detail?id=349#c18).
Unfortunately I got a linker error until I changed the usage of mg_cry(..) in
mg_start(..) to cry(..) (line 5197 of mongoose.c).
After that, the tests are still failing, either with error 12152 or when the
assert in the user callback kicks in because mg_read(..) returned -1.
Note that these tests make no use of CGI or console.log(..).
Original comment by nullable...@gmail.com
on 17 Jun 2012 at 8:50
Weird; I must've done something wrong then in the last commits; going to fetch
& run this on a new machine. Hold on...
Original comment by ger.hobbelt
on 17 Jun 2012 at 9:55
Confirmed; mg_cry-->cry edit applied. The tests from #18 (at least the #18
test.html + ajax code; note the IE8 corrections in there) are located in the
issue349 clone in the ./test/ajax/ subdirectory - tested just now with an MSVC
build: these fail when lua isn't in the expected (UNIXy) spot; run
mongoose -I "'C:\Program Files (x86)\Lua\5.1\lua.exe'"
after installing LuaForWindows (5.1.4) for all test/ajax/test.html tests to
succeed.
(Note the single quotes! This is to allow -I to accept interpreter paths with
embedded spaces and/or interpreter command-line arguments, without the Windows
cmd screwing it up for you.)
Will look into the C code included in #18 test.zip now...
Original comment by ger.hobbelt
on 17 Jun 2012 at 11:14
:-( I notice that my ajax/test.html isn't your #18 www/test.html in that #18
zip.
Original comment by ger.hobbelt
on 17 Jun 2012 at 11:21
Indeed it's not the same. For my tests (of comment #18) I just copied
mongoose.h, mongoose.c and mongoose_sys_porting.h of the issue349 clone into a
copy of echo-v3.2 (from the tests.zip). So, I went the other way round.
Original comment by nullable...@gmail.com
on 17 Jun 2012 at 11:32
Confirmed the 12152 errors for the test.zip-included test.html code on IE9.
The 'funny thing' is FF and the others don't error out.
Did some hair pulling (and renamed your test.zip:www/test.html to test_18.html
and dropped it into my issue349 repo), and found this one:
http://stackoverflow.com/questions/3731420/why-does-ie-issue-random-xhr-408-12152-responses-using-jquery-post
Did some more testing with the /echo handler code ripped from your main and
found:
1- with the handler code as-is, the issue of hanging POST requests and random
12152 errors recurs. Hence: issue reproducible.
2- Adding a 'Connection: close' to the response output 'fixes' the issue as per
the stackoverflow suggestion. But that's not good enough, of course ;-) because
the explanation there is sound and points straight at IE+WinSock themselves. A
network sniff confirms this: the 12152 items are 'correct' on the wire, so the
browser screws up.
3- It turns out that the 12152, which means something like 'malformed HTTP
response', is IE9 saying it _demands_ a Connection: header: when the code is
changed to send 'Connection: keep-alive', IE9 suddenly always succeeds, at
least on this machine. (Win7/64 with latest MS updates)
(BTW/OT: if you want the access log to log the 'correct' status, you might want
to set 'request_info->status_code = 200' in any custom event handler. Otherwise
the access log is filled with lines listing their status code as '-1'.)
issue349 clone to be updated shortly...
Original comment by ger.hobbelt
on 17 Jun 2012 at 12:45
issue349 clone and github repo 'master' branch updated. But take note: it gets
weirder and weirder...
My issue349 hg clone ONLY WORKS OK when you always follow the stackoverflow
approach, i.e. Connection: close.
Meanwhile the main github master branch /seemed/ to work OK with the very same
test_18.html (renamed from your #18 www/test.html) and Connection: Keep-Alive.
However, when hammering [F5] and the [test] button, I can only state that the
12152 errors are gone, but you still get 'pending' issues in IE9 for arbitrary
requests (including serving 'static pages' such as the jquery.js file).
Network sniffing shows that IE9 somehow opens connections without sending any
request through them, so you get to see a mongoose error log entry about
'[Error 580: Mongoose Internal Server Error or client closed connection
prematurely]' while IE9 waits for about 5 seconds and then retries - with
success for the static pages.
The POSTs of course fail when this happens to them, as there's no retry
mechanism there.
This is an IE9-specific (or at least IE*; haven't tested with IE8, etc. yet)
issue, and now that the 12152 errors are gone over here, I still remain with
those pesky (Pending...) issues, which seem to be entirely client-side now.
I must be missing something very subtle about IE9's HTTP/1.1 keep-alive
logic... :-S
Original comment by ger.hobbelt
on 17 Jun 2012 at 1:38
Updated the issue349 clone after extensive testing with another Win7/64/IE9
machine.
Note that commit
http://code.google.com/r/gerhobbelt-issue349/source/detail?r=2722e77ff55c642877b79085976925b57e1749d2
is the last one exhibiting the IE9 12152/pending/timeout issue: you can toggle
the #if 0 in the next commit,
http://code.google.com/r/gerhobbelt-issue349/source/detail?r=21ff9988b9bd2d668c00af91df4afc47be51aaf0,
to observe the issue going away (#if 0) and recurring (#if 01).
It boils down to this: when you send IE9 a response with Connection: keep-alive
yet close the connection immediately afterwards, IE9 will get confused and
throw the 12152 tantrum at random intervals. When you keep the connection open
to serve the next request like your Connection: header said you would,
everything succeeds... with a little snag:
http://support.microsoft.com/kb/813827 : note the mention of the '1 minute' in
there.
This (combined with any socket timeout set up on the mongoose/server side)
determines how long your threads will be locked up by that single IE9 browser:
in my case I see 3..4 threads used in parallel to serve a single IE9 browser,
so at 10 threads you can DoS a mongoose with a few IE9 browsers pretty darn
quickly.
The solution to *that* problem would be to push inactive (i.e. waiting)
kept-alive connections back onto the queue, but that work is yet to be done
(see the sketch below). I hope to get it done in the issue349 clone somewhere
today. Till then, that last commit should be the point to verify my tests, work
and observations.
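A conceptual sketch of that push-back idea follows. Every identifier here (queue_pop, conn_t, process_one_request, ..) is made up for illustration; the real implementation is in the commits linked in the next comment.

/* Conceptual worker loop: after serving a request on a keep-alive
 * connection, poll briefly; if no follow-up request is ready, hand the
 * connection back to the shared queue instead of blocking in recv(). */
static int request_ready(SOCKET sock, int timeout_ms) {
    fd_set rfds;
    struct timeval tv = { timeout_ms / 1000, (timeout_ms % 1000) * 1000 };
    FD_ZERO(&rfds);
    FD_SET(sock, &rfds);
    return select((int)sock + 1, &rfds, NULL, NULL, &tv) > 0;
}

static void worker_loop(server_t *srv) {
    for (;;) {
        conn_t *conn = queue_pop(&srv->queue);  /* blocks until work exists */
        do {
            process_one_request(conn);          /* read, handle, respond */
        } while (!conn->must_close && request_ready(conn->sock, 50));
        if (conn->must_close)
            close_connection(conn);             /* graceful close + free */
        else
            queue_push(&srv->queue, conn);      /* park the idle keep-alive
                                                   connection; any thread may
                                                   resume it later */
    }
}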
Original comment by ger.hobbelt
on 18 Jun 2012 at 11:52
Implemented and tested the 'push back idle connections, so HTTP keep-alive yet
idle connections don't occupy mongoose threads for a long time' solution in the
github repo (master branch):
https://github.com/GerHobbelt/mongoose/commit/d76e6f13b3c1e93cb7cea0eb13ae253d2df5c45f
+
https://github.com/GerHobbelt/mongoose/commit/c92ace3916a1afe5293198462f2b524d46d2084d
Tested on Win7/64 with IE9 and the latest FF with the bel2125 #18 test page and
the /echo server-side event handler (in the github repo these are
test/ajax/test_18.html and examples/bel2125-issue349.c respectively): all tests
succeed (and quickly! :-) ), even when you hammer the mongoose server.
Tested with both 1(!) and 5 thread mongoose servers, with keep_alive support
enabled.
Will backport to the hg issue349 clone later.
N.B.: the consequences of this 'test & swap' are as follows:
+ well-behaved browsers/clients with keep-alive no longer occupy threads for a
long time (~ 1 minute), so all incoming connections / requests get served ASAP.
+ a few threads can serve *many* connections in parallel: requests are served
in round-robin fashion (i.e. we all get served more slowly together when the
load grows)
+ INIT_CONNECTION and EXIT_CONNECTION event callbacks happen when they should
+ you can bind user data to the 'request_info' record for a particular
connection: it will survive idle periods of that connection.
- as a test&swap (think of it as a 'task switch') can occur after having served
a request, do NOT expect subsequent requests on the same connection to be
processed by the very same thread: this is by design, as another interleaved
request can be long-running all by itself, e.g. when it invokes special user
code in the NEW_REQUEST event callback.
Naive user event callback implementations may suffer from the latter point, as
old assumptions won't hold.
Original comment by ger.hobbelt
on 19 Jun 2012 at 2:17
http://code.google.com/r/gerhobbelt-issue349/source/detail?r=2936d125aed5cc31244c6e075435c84aa4a5f2cc
(Tested to compile and run with the MSVC2010 solution; the test_18.html test
passes as with the github version.)
Original comment by ger.hobbelt
on 19 Jun 2012 at 2:43
Original issue reported on code.google.com by
nullable...@gmail.com
on 26 Apr 2012 at 12:28
Attachments: