Damn, it's hard to reproduce. Here's binary data that works for me: dump.zip
Damn it, even a log level change causes this bug to change its behaviour...
Well, judging by the log, the next pipelined request came in before the file issued by the previous request was completely sent -> lws_http_transaction_completed() wasn't called, the fragments weren't reset -> the next pipelined request was read on top of the previous request's headers.
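For context, the normal sequencing in a server HTTP callback looks roughly like this (a sketch modelled on the lws test-server pattern, not a fix; the file name is just this thread's example). lws_http_transaction_completed() is what resets the ah/fragments for the next pipelined request, so it must only run once the previous response has fully gone out:

#include <libwebsockets.h>

static int
callback_http(struct lws *wsi, enum lws_callback_reasons reason,
              void *user, void *in, size_t len)
{
        int n;

        switch (reason) {
        case LWS_CALLBACK_HTTP:
                /* may complete immediately (n > 0) or later, via
                 * LWS_CALLBACK_HTTP_FILE_COMPLETION (n == 0) */
                n = lws_serve_http_file(wsi, "favicon.ico",
                                        "image/x-icon", NULL, 0);
                if (n < 0 || (n > 0 && lws_http_transaction_completed(wsi)))
                        return -1;
                break;

        case LWS_CALLBACK_HTTP_FILE_COMPLETION:
                /* only now is it safe to reset for the next pipelined
                 * request; the description above suggests this ordering
                 * is what broke */
                if (lws_http_transaction_completed(wsi))
                        return -1;
                break;

        default:
                break;
        }

        return 0;
}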
Does the patch above help?
No more "Duplicated method" warning, but it gets strange, when log level is at 7 or less there is "forbidding on uri sanitation" warning given, but when I'm trying to log it at least on debug level (15) then there is no errors and traffic is processing a lot slower, so it's hard to tell what is happening exactly...
This (which goes out as one pipelined request)
$ curl http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico 
http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico > /dev/null
doesn't have any problem here. It didn't have any problem before the last patch either, FWIW.
100 x pipelined http://localhost:7681/libwebsockets.org-logo.png
with larger payload is also OK. Doing 100 x curl favicon.ico from a machine on another continent doesn't show up any problem.
I'm actually testing this just with a buffer filled with a repeated single HTTP request, sent via send() in a blocking loop, if that matters. The fact that this bug is not reproducible even on my machine when the log level is high enough should tell us something; I guess it takes some special condition, e.g. a specific case of fragmentation or something like that.
Maybe if I dump every result of each recv() call and then replay that exact data and call sequence, but with a higher log level, it will reveal something. Or maybe you have some better ideas?
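Something like this wrapper would capture it (a sketch; the helper name and the length-prefixed dump format are made up):

#include <stdio.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>

static FILE *dump_fp; /* opened once, e.g. fopen("recv_dump.bin", "wb") */

/* call this everywhere the server would call recv(), so the dump records
 * the exact chunk boundaries, not just the byte stream -- fragmentation
 * is the suspect here */
static ssize_t
recv_and_dump(int fd, void *buf, size_t len, int flags)
{
        ssize_t n = recv(fd, buf, len, flags);

        if (n > 0 && dump_fp) {
                uint32_t n32 = (uint32_t)n;

                fwrite(&n32, sizeof(n32), 1, dump_fp);
                fwrite(buf, 1, (size_t)n, dump_fp);
                fflush(dump_fp);
        }

        return n;
}

A replay harness then reads the file back chunk by chunk and feeds each one in with exactly the same sizes and ordering.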
I can't reproduce the problem.... it doesn't mean the problem doesn't exist.
Using your test packet doesn't show up any problem (with default log level) here either.
$ cat ~/dump.bin | nc localhost 7681
That sends the whole test packet the first time, still no problem.
If you can casually reproduce the issue, you will have to stick some debugging around forbid:
in parsers.c:953, find out who jumped there and what made him feel that was a good idea. That logging will happen after the problem has already hit, so it shouldn't affect reproducibility on your end.
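Concretely, something as dumb as this at each jump site would do (illustrative fragment, not actual lws code; assumes <stdio.h> is visible in parsers.c):

        /* at each "goto forbid" site, e.g. parsers.c:833 */
        fprintf(stderr, "jumping to forbid from %s:%d\n", __FILE__, __LINE__);
        goto forbid;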
Okay, I'll try some of that right now.
It jumps to forbid from parsers.c:833, and I actually noticed that all the frag_index'es are zero.
Might be interesting to instrument server.c::1288 lws_handshake_server() who calls lws_parse()... if he errors out, dump a copy of *buf contents from entry and see if what it tries to parse is sensible.
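Something along these lines inside lws_handshake_server() (a sketch; the surrounding per-byte parse loop is paraphrased and the error path is elided):

        /* snapshot what *buf pointed at on entry */
        unsigned char obuf[256];
        size_t olen = len < sizeof(obuf) ? len : sizeof(obuf);

        memcpy(obuf, *buf, olen);

        /* ... existing loop feeding bytes to the parser ... */
        if (lws_parse(wsi, *(*buf)++)) {
                lwsl_hexdump(obuf, olen); /* what we were asked to parse */
                /* ... existing error handling ... */
        }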
0x642e32 "1\r\nConnection: keep-alive\r\nHost: 192.168.137.27\r\n\r\nGET /favicon.ico HTTP/1.1\r\nConnection: keep-alive\r\nHost: 192.168.137.27\r\n\r\nGET /favicon.ico HTTP/1.1\r\nConnection: keep-alive\r\nHost: 192.168.137.27\r\n\r"...
It's more useful if you dump the *buf as it was on entry to lws_handshake_server(), if lws_parse() fails. Then we can see what it was trying to parse.
So you mean obuf then? 0x642e31 ".1\r\nConnection: keep-alive\r\nHost: 192.168.137.27\r\n\r\nGET /favicon.ico HTTP/1.1\r\nConnection: keep-alive\r\nHost: 192.168.137.27\r\n\r\nGET /favicon.ico HTTP/1.1\r\nConnection: keep-alive\r\nHost: 192.168.137.27\r\n"...
So it was given just the one char '.'... we also know that if it got to parsers.c:833 it did not have any method yet... either something reset the ah while it was still in use, or we started at an unreasonable place in the rx.
Will the problem keep coming if you add logging to _lws_header_table_reset() parsers.c:64, and some log to say we entered lws_handshake_server(), so we can understand the ordering a bit?
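I.e. one trace line at the top of each (fragments; plain fprintf keeps the logging path minimal):

        /* at the top of _lws_header_table_reset() in parsers.c */
        fprintf(stderr, "_lws_header_table_reset\n");

        /* at the top of lws_handshake_server() in server.c */
        fprintf(stderr, "lws_handshake_server\n");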
Weird thing: when I used lwsl_warn there was no error, but after I switched to fprintf it did trigger forbid. log.txt
I tried changing the log emit function to the default stderr one (test-server.c:340 -> lws_set_log_level(debug_level, NULL);), and the problem didn't start appearing here. Can you see if that lets you dump logs?
Basically the successful ones have this pattern
_lws_header_table_reset
lws_handshake_server
_lws_header_table_reset
but the failing one is only
lws_handshake_server
It should be OK, since there was already one reset before it was called. But it suggests that's part of a different behaviour that might be the real problem.
Hmm, no, it stops reproducing when I do that. I even tried replacing every lwsl_* define with a plain fprintf, but no luck... I think it's time to dump recv().
What exactly am I looking at?
This is the data exactly as it was received from the socket when the bug was triggered, from the start up to the very moment of the breakpoint. Now I just emulate each and every result of a recv() call exactly as it was in the previous session, but with log level 65535, and I'll be damned, it worked. log.txt
Still... it has something to do with the split point... packet 76 ends with
GET /favicon.ico HTTP/1
and then naturally packet 77 begins with
.1
So we learned we blow up on fragmentation of the first line... lws has bytewise parsers for everything; it should not matter to it how the stuff is fragmented.
Is it possible the logging sensitivity is to do with the kernel having time to coalesce the two packets? Can you check that?
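One way to probe it is to control Nagle on the sending side and compare: with TCP_NODELAY set, the sender won't coalesce the two small writes; with it clear, the kernel is free to merge them (a sketch; error handling elided, and note the receiver can still return back-to-back segments in one recv(), so this only removes one source of variation):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* on = 1 forbids sender-side coalescing of small writes, on = 0 allows it */
static int
set_nodelay(int fd, int on)
{
        return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}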
Trying to duplicate this standalone, as expected it is immune to fragmentation. In one terminal
$ nc -l -k 7680 | nc localhost 7681
then in another (quickly, before it times out)
$ echo -n "GET /favicon.ico HTTP/1" | nc localhost 7680 ; echo -n -e ".1\xa\xd" | nc localhost 7680
gets me
[2017/06/27 14:57:17:0485] DEBUG: lws_read: incoming len 23 state 0
[2017/06/27 14:57:17:0485] PARSER: issuing 23 bytes to parser
[2017/06/27 14:57:17:0485] PARSER: WSI_TOKEN_NAME_PART 'G' (mode=0)
[2017/06/27 14:57:17:0485] PARSER: WSI_TOKEN_NAME_PART 'E' (mode=0)
[2017/06/27 14:57:17:0485] PARSER: WSI_TOKEN_NAME_PART 'T' (mode=0)
[2017/06/27 14:57:17:0485] PARSER: WSI_TOKEN_NAME_PART ' ' (mode=0)
[2017/06/27 14:57:17:0485] PARSER: known hdr 0
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) '/'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) 'f'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) 'a'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) 'v'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) 'i'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) 'c'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) 'o'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) 'n'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) '.'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) 'i'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) 'c'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) 'o'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) ' '
[2017/06/27 14:57:17:0486] PARSER: WSI_TOK_(15) 'H'
[2017/06/27 14:57:17:0486] PARSER: WSI_TOK_(15) 'T'
[2017/06/27 14:57:17:0486] PARSER: WSI_TOK_(15) 'T'
[2017/06/27 14:57:17:0486] PARSER: WSI_TOK_(15) 'P'
[2017/06/27 14:57:17:0486] PARSER: WSI_TOK_(15) '/'
[2017/06/27 14:57:17:0486] PARSER: WSI_TOK_(15) '1'
[2017/06/27 14:57:17:0486] DEBUG: lws_read: thinks we have used 0
[2017/06/27 14:57:17:0486] INFO: lws_read: read_ok, used 23
[2017/06/27 14:57:17:0486] DEBUG: lws_server_socket_service: wsi 0x258f070: ah read rxpos 23, rxlen 23
[2017/06/27 14:57:17:0615] DEBUG: lws_read: incoming len 4 state 0
[2017/06/27 14:57:17:0616] PARSER: issuing 4 bytes to parser
[2017/06/27 14:57:17:0616] PARSER: WSI_TOK_(15) '.'
[2017/06/27 14:57:17:0616] PARSER: WSI_TOK_(15) '1'
[2017/06/27 14:57:17:0616] PARSER: WSI_TOK_(15) '\x0a'
[2017/06/27 14:57:17:0616] PARSER: WSI_TOK_(15) '\x0d'
[2017/06/27 14:57:17:0616] PARSER: *
[2017/06/27 14:57:17:0616] DEBUG: lws_read: thinks we have used 0
[2017/06/27 14:57:17:0616] INFO: lws_read: read_ok, used 4
[2017/06/27 14:57:17:0616] DEBUG: lws_server_socket_service: wsi 0x258f070: ah read rxpos 4, rxlen 4
[2017/06/27 14:57:20:0198] NOTICE: wsi 0x258f070: TIMEDOUT WAITING on 3 (did hdr 0, ah 0x23b1480, wl 0, pfd events 1) 1498546640 vs 1498546639
Without the logging he just shows timeout, ie, no failure at parse.
I'll try to fake up the same pipelining as well.
Hmm, I'll try, but how am I supposed to do this exactly? Here's a log with log level 65535 and input from a real socket; no forbid was triggered. log.txt.zip
I just mean do the same recv info dump you did before, but with logging enabled... are the 76th and 77th packets still fragmented in the same place?
No, it seems at some point something went differently. recv_dump_no_bug.zip
Wait a minute, what the hell, there's a bunch of zeros at the end of the first file.
FWIW I found I can spam endless amounts of fragmented and unfragmented requests on one connection with no problems.
$ while [ 1 ] ; do echo -n -e "GET /favicon.ico HTTP/1.1\x0d\x0a\x0d\x0aGET /favicon.ico HTTP/1" | nc localhost 7680 ; echo -n -e ".1\x0d\x0a\x0d\x0a" | nc localhost 7680 ; done
So there are zeros at the end... the file size is 2KB though.
Damn, I used the wrong variable in the dump code; I'm redumping both cases right now.
Hmmm, in the buggy dump the earlier pkt #82 also begins with ".1", but that has no problem. Presumably it blows up on pkt 157, the last one, which begins with ".1" again.
pkts 82 and 157 are the same in both dumps. So it's not to do with timing or size of the rx.
And the horrible truth is, they are exactly the same in every file, but one of them did trigger the bug and the other did not...
If it's not timing it's usually stack.
And yet it has a pattern: obuf started with '.' every time. So it's not random, I suppose. What about the stack? Does it get corrupted, or is there an uninitialized variable somewhere that derives its value from a dirty stack? Without logging this error is incredibly stable, every time.
The stack contents are not actually random either... it's not just inherited data but also, for example, pointer writes behind stack buffers clobbering the caller's data, which is only noticed after it pops back up to the parent and the parent uses it. Other than this suspicious ".1" business it acts very nondeterministically... if there's only one thread and it's not network timing, then that's actually quite hard to achieve.
There's also the fact I have not been able to trigger it at all.... it could be specific to platform or toolchain (I am on x86_64 Fedora 26 / gcc 7.1.1).
Ubuntu 16.04.2 LTS gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
usleep(20 * 1000); before the recv() cures this as well.
Maybe it has something to do with the way my lws is configured; I'll leave it here just in case:
-DCMAKE_BUILD_TYPE=DEBUG -DLIB_SUFFIX=64 -DCMAKE_CXX_FLAGS=-m64 -DCMAKE_C_FLAGS=-m64 \
-DLWS_WITH_SHARED=OFF \
-DLWS_WITH_SSL=OFF \
-DLWS_USE_BUNDLED_ZLIB_DEFAULT=OFF \
-DLWS_WITH_ZLIB=OFF \
-DLWS_WITHOUT_CLIENT=ON \
-DLWS_WITHOUT_TEST_SERVER_EXTPOLL=ON \
-DLWS_WITHOUT_TEST_PING=ON \
-DLWS_WITHOUT_TEST_ECHO=ON \
-DLWS_WITHOUT_TEST_CLIENT=ON \
-DLWS_WITHOUT_TEST_FRAGGLE=ON \
-DLWS_WITHOUT_EXTENSIONS=ON \
-DLWS_WITH_RANGES=OFF \
-DLWS_WITH_ZIP_FOPS=0
Those options (including the m64) don't change anything here. Is your platform x86_64 or 32-bit though?
Did you remove this abortive patch from Sunday about early ah detach? It's removed in master.
I just cloned the master branch when we started. Ubuntu is 64-bit.
Reviewing what we found so far, the evidence points to an extra call to lws_header_table_reset().
When we get the headers split between two packets it's supposed to do this.... this is taken from your log line 6148
_lws_header_table_reset
lws_handshake_server
_lws_header_table_reset
_lws_header_table_reset
lws_handshake_server
lws_handshake_server
The first three lines do the whole headers in one go. On the last three, it first clears down the ah and starts to parse the last part of the current packet; then it runs out, and the last line is it resuming the parse from the next packet. There is no problem.
But just before it blows the pattern is (line 6226 in your log)
_lws_header_table_reset
lws_handshake_server
_lws_header_table_reset
_lws_header_table_reset
lws_handshake_server
_lws_header_table_reset
lws_handshake_server
There is an extra ah reset in between, which kills the parser state. When we resume on the next packet, the parser has gone back to zero and is looking for the method, and gets ".", which fails out since it doesn't match any method.
The only non-client guys who reset the ah are:
lws_header_table_attach(), if you call it when you already have an ah, or when you get one attached for the first time;
lws_header_table_detach(), if it decides to give the ah to someone on the waiting list;
lws_http_transaction_completed().
Can you add logs to those three and try the test logging lws_handshake_server() and _lws_header_table_reset() again? The first two shouldn't get called past the start and end, but lws_http_transaction_completed() might be too chatty for the bug. If so you can try swallowing the first n of them until we just get the ones near the bug.
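For the chatty one, a counter like this at the top of lws_http_transaction_completed() would do (fragment; the threshold is arbitrary, tune it so only the entries near the bug print):

        static int txn_count;

        if (++txn_count > 100)
                fprintf(stderr, "lws_http_transaction_completed #%d\n",
                        txn_count);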
Thanks... that gave me an idea of the root cause, but I have to do other stuff now. I'll study it later or tomorrow and reply then.
Okay. And thank you too, this was interesting to investigate :) I need to go to sleep soon anyway.
Please update against master and see if there's any change.
I cloned the master branch again, since I mutilated too much code yesterday, and... "forbidding on uri sanitation" again.
It's still coming at the same point, ie, when we get a packet starting ".1"?
I checked I can spam that endlessly still without problems.
If it's still coming on ".1", how about the test from yesterday that ends with
_lws_header_table_reset
lws_handshake_server
_lws_header_table_reset
_lws_header_table_reset
lws_handshake_server
_lws_header_table_reset
lws_handshake_server
it's still the same?
If pipelined requests are sent in a glued chunk larger than 2048 bytes, the connection stops processing and the warning "Duplicated method" is given.
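A standalone repro along the lines described might look like this (a sketch; the host, port and headers are the test values from this thread, and the 2048 threshold comes straight from the report):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        const char *req = "GET /favicon.ico HTTP/1.1\r\n"
                          "Connection: keep-alive\r\n"
                          "Host: 127.0.0.1\r\n\r\n";
        char buf[4096];
        size_t used = 0;
        struct sockaddr_in sa;
        int fd;

        /* glue pipelined requests until the chunk exceeds 2048 bytes */
        while (used <= 2048 && used + strlen(req) < sizeof(buf)) {
                memcpy(buf + used, req, strlen(req));
                used += strlen(req);
        }

        fd = socket(AF_INET, SOCK_STREAM, 0);
        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_port = htons(7681);
        sa.sin_addr.s_addr = inet_addr("127.0.0.1");

        if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
                perror("connect");
                return 1;
        }

        /* the whole glued chunk in one blocking send() */
        if (send(fd, buf, used, 0) < 0)
                perror("send");

        close(fd);
        return 0;
}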