warmcat / libwebsockets

canonical libwebsockets.org networking library
https://libwebsockets.org

pipelined requests bug #946

Closed rickynelson1939 closed 7 years ago

rickynelson1939 commented 7 years ago

if pipelined requests are sent in a glued chunk larger than 2048 bytes, the connection stops processing and the warning "Duplicated method" is given

rickynelson1939 commented 7 years ago

damn, it's hard to reproduce; here's binary data that works for me: dump.zip

rickynelson1939 commented 7 years ago

damn it, even a log level change causes this bug to change its behaviour...

rickynelson1939 commented 7 years ago

log.txt

rickynelson1939 commented 7 years ago

well, judging by the log, the next pipelined request came before the file issued by the previous request was completely sent -> "lws_http_transaction_completed" wasn't called, the fragments weren't reset -> the next pipelined request was read on top of the previous request's headers

lws-team commented 7 years ago

Does the patch above help?

rickynelson1939 commented 7 years ago

No more "Duplicated method" warning, but it gets strange: when the log level is at 7 or less there is a "forbidding on uri sanitation" warning, but when I try to log at least at debug level (15) there are no errors and the traffic processes a lot slower, so it's hard to tell what is happening exactly...

lws-team commented 7 years ago

This (which goes out as one pipelined request)

$ curl http://localhost:7681/favicon.ico  http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico  http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico  http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico  http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico 
http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico http://localhost:7681/favicon.ico > /dev/null

doesn't have any problem here. It didn't have any problem before the last patch either, FWIW.

100 x pipelined http://localhost:7681/libwebsockets.org-logo.png with larger payload is also OK. Doing 100 x curl favicon.ico from a machine on another continent doesn't show up any problem.

rickynelson1939 commented 7 years ago

I'm actually testing this just with a buffer filled with a repeated single http request, sent via "send" in a blocking loop, if that matters. The fact that this bug is not reproducible even on my machine if the log level is high enough should tell us something; I guess it takes some special condition, e.g. a specific case of fragmentation or something like that.
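
Roughly like this, if it matters (a simplified sketch of my harness, not the exact code; the request matches the dumps below, the repeat count and sizes are illustrative):

#include <arpa/inet.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* single pipelined request, repeated back to back */
    static const char req[] =
        "GET /favicon.ico HTTP/1.1\r\n"
        "Connection: keep-alive\r\n"
        "Host: 192.168.137.27\r\n\r\n";
    char buf[100 * (sizeof(req) - 1)], rx[2048];
    struct sockaddr_in sa;
    int i, fd;

    for (i = 0; i < 100; i++) /* glue 100 copies into one chunk */
        memcpy(buf + i * (sizeof(req) - 1), req, sizeof(req) - 1);

    fd = socket(AF_INET, SOCK_STREAM, 0);
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(7681);
    inet_pton(AF_INET, "127.0.0.1", &sa.sin_addr);
    if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
        return 1;

    /* one glued send(), well over the 2048-byte threshold */
    send(fd, buf, sizeof(buf), 0);

    while (recv(fd, rx, sizeof(rx), 0) > 0)
        ; /* drain responses in a blocking loop */

    close(fd);
    return 0;
}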

rickynelson1939 commented 7 years ago

Maybe if I dump the result of every recv call and then replay this exact data and call sequence, but with a higher log level, it will reveal something. Or maybe you have some better ideas?
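
Something along these lines (a sketch only; the wrapper name and file layout are made up):

#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

static int chunk_no;

static ssize_t recv_and_dump(int fd, void *buf, size_t len, int flags)
{
    ssize_t n = recv(fd, buf, len, flags);
    char name[64];
    FILE *f;

    if (n > 0) {
        snprintf(name, sizeof(name), "recv_dump/%04d.bin", chunk_no++);
        f = fopen(name, "wb");
        if (f) {
            /* write n (bytes received), not len (buffer size) */
            fwrite(buf, 1, (size_t)n, f);
            fclose(f);
        }
    }

    return n;
}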

lws-team commented 7 years ago

I can't reproduce the problem.... it doesn't mean the problem doesn't exist.

Using your test packet doesn't show up any problem (with default log level) here either.

$ cat ~/dump.bin | nc localhost 7681

That sends the whole test packet the first time, still no problem.

If you can casually reproduce the issue, you will have to stick some debugging around forbid: in parsers.c: 953, find out who jumped there and what made him feel that was a good idea. That logging will happen after the problem comes so it shouldn't affect reproducibility on your end.
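
For example something like this (just a sketch): replace each "goto forbid;" in parsers.c with a macro that records the jump site first:

#include <stdio.h>

/* drop-in replacement for each "goto forbid;" site, so the log
 * shows who decided to bail (macro name is made up) */
#define GOTO_FORBID do { \
        fprintf(stderr, "forbid from parsers.c:%d\n", __LINE__); \
        goto forbid; \
    } while (0)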

rickynelson1939 commented 7 years ago

Okay, I'll try some of that right now.

rickynelson1939 commented 7 years ago

it jumps to forbid from parsers.c:833, and I noticed that all the frag_index values are zero

lws-team commented 7 years ago

Might be interesting to instrument server.c::1288 lws_handshake_server() who calls lws_parse()... if he errors out, dump a copy of *buf contents from entry and see if what it tries to parse is sensible.
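
Something like this, sketched from memory (the locals are illustrative; the point is to snapshot the input on entry and hexdump it only if lws_parse() fails):

/* inside lws_handshake_server(), around the lws_parse() loop */
unsigned char *obuf = *buf; /* input as it was on entry */
size_t olen = len;

while (len--)
    if (lws_parse(wsi, *(*buf)++)) {
        lwsl_err("lws_parse failed, input from entry was:\n");
        lwsl_hexdump(obuf, olen);
        goto bail_nuke_ah; /* whatever the existing error path is */
    }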

rickynelson1939 commented 7 years ago

0x642e32 "1\r\nConnection: keep-alive\r\nHost: 192.168.137.27\r\n\r\nGET /favicon.ico HTTP/1.1\r\nConnection: keep-alive\r\nHost: 192.168.137.27\r\n\r\nGET /favicon.ico HTTP/1.1\r\nConnection: keep-alive\r\nHost: 192.168.137.27\r\n\r"...

lws-team commented 7 years ago

It's more useful if you dump the *buf as it was on entry to lws_handshake_server(), if lws_parse() fails. Then we can see what it was trying to parse.

rickynelson1939 commented 7 years ago

so you mean obuf then? 0x642e31 ".1\r\nConnection: keep-alive\r\nHost: 192.168.137.27\r\n\r\nGET /favicon.ico HTTP/1.1\r\nConnection: keep-alive\r\nHost: 192.168.137.27\r\n\r\nGET /favicon.ico HTTP/1.1\r\nConnection: keep-alive\r\nHost: 192.168.137.27\r\n"...

lws-team commented 7 years ago

So it was given just the one char '.'... we also know if it got to parsers.c: 833 it did not have any method yet... either something reset the ah while it was still in use or we start at an unreasonable place in the rx.

Will the problem keep coming if you add logging to _lws_header_table_reset() parsers.c:64, and some log to say we entered lws_handshake_server(), so we can understand the ordering a bit?
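
I mean just one-line breadcrumbs on entry, something like this (signatures approximate):

void _lws_header_table_reset(struct allocated_headers *ah)
{
    fprintf(stderr, "_lws_header_table_reset\n"); /* plain stderr */
    /* ... existing body unchanged ... */
}

int lws_handshake_server(struct lws *wsi, unsigned char **buf, size_t len)
{
    fprintf(stderr, "lws_handshake_server\n");
    /* ... existing body unchanged ... */
}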

rickynelson1939 commented 7 years ago

weird thing: if I used lwsl_warn there was no error, but after I used fprintf it did trigger forbid log.txt

lws-team commented 7 years ago

I tried changing the log emit function to the default stderr (test-server.c 340 -> lws_set_log_level(debug_level, NULL);) , the problem didn't start appearing. Can you see if that lets you dump logs?

Basically the successful ones have this pattern

_lws_header_table_reset
lws_handshake_server
_lws_header_table_reset

but the failing one is only

lws_handshake_server

It should be OK, since there was already one reset before it was called. But it suggests it's part of a different behaviour that might be the real problem.

rickynelson1939 commented 7 years ago

Hmm, no, it stops working when I do that. I even tried replacing every lwsl_* with a plain fprintf define, but no luck... I think it's time to dump recv

rickynelson1939 commented 7 years ago

recv_dump.zip

lws-team commented 7 years ago

What exactly am I looking at?

rickynelson1939 commented 7 years ago

this is the data exactly as it was received from the socket when the bug was triggered, from the start to the very moment of the breakpoint. Now I just replayed each and every result of each recv call exactly as it was in the previous session, but with log level 65535, and I'll be damned, it worked log.txt

lws-team commented 7 years ago

Still... it's something to do with something... packet 76 ends with

GET /favicon.ico HTTP/1

and then naturally packet 77 begins with

.1

so we learned it blows up on fragmentation of the first line... lws has bytewise parsers for everything, it should not matter to it how the stuff is fragmented.

Is it possible the logging sensitivity is to do with the kernel having time to coalesce the two packets? Can you check that?

lws-team commented 7 years ago

Trying to duplicate this standalone, as expected it is immune to fragmentation. In one terminal

 $  nc -l -k 7680 | nc localhost 7681

then in another (quickly, before it times out)

 $ echo -n "GET /favicon.ico HTTP/1" | nc localhost 7680 ; echo -n -e ".1\xa\xd" | nc localhost 7680

gets me

[2017/06/27 14:57:17:0485] DEBUG: lws_read: incoming len 23  state 0
[2017/06/27 14:57:17:0485] PARSER: issuing 23 bytes to parser
[2017/06/27 14:57:17:0485] PARSER: WSI_TOKEN_NAME_PART 'G' (mode=0)
[2017/06/27 14:57:17:0485] PARSER: WSI_TOKEN_NAME_PART 'E' (mode=0)
[2017/06/27 14:57:17:0485] PARSER: WSI_TOKEN_NAME_PART 'T' (mode=0)
[2017/06/27 14:57:17:0485] PARSER: WSI_TOKEN_NAME_PART ' ' (mode=0)
[2017/06/27 14:57:17:0485] PARSER: known hdr 0
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) '/'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) 'f'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) 'a'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) 'v'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) 'i'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) 'c'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) 'o'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) 'n'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) '.'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) 'i'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) 'c'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) 'o'
[2017/06/27 14:57:17:0485] PARSER: WSI_TOK_(0) ' '
[2017/06/27 14:57:17:0486] PARSER: WSI_TOK_(15) 'H'
[2017/06/27 14:57:17:0486] PARSER: WSI_TOK_(15) 'T'
[2017/06/27 14:57:17:0486] PARSER: WSI_TOK_(15) 'T'
[2017/06/27 14:57:17:0486] PARSER: WSI_TOK_(15) 'P'
[2017/06/27 14:57:17:0486] PARSER: WSI_TOK_(15) '/'
[2017/06/27 14:57:17:0486] PARSER: WSI_TOK_(15) '1'
[2017/06/27 14:57:17:0486] DEBUG: lws_read: thinks we have used 0
[2017/06/27 14:57:17:0486] INFO: lws_read: read_ok, used 23
[2017/06/27 14:57:17:0486] DEBUG: lws_server_socket_service: wsi 0x258f070: ah read rxpos 23, rxlen 23
[2017/06/27 14:57:17:0615] DEBUG: lws_read: incoming len 4  state 0
[2017/06/27 14:57:17:0616] PARSER: issuing 4 bytes to parser
[2017/06/27 14:57:17:0616] PARSER: WSI_TOK_(15) '.'
[2017/06/27 14:57:17:0616] PARSER: WSI_TOK_(15) '1'
[2017/06/27 14:57:17:0616] PARSER: WSI_TOK_(15) '\n'
[2017/06/27 14:57:17:0616] PARSER: WSI_TOK_(15) '\r'
[2017/06/27 14:57:17:0616] PARSER: *
[2017/06/27 14:57:17:0616] DEBUG: lws_read: thinks we have used 0
[2017/06/27 14:57:17:0616] INFO: lws_read: read_ok, used 4
[2017/06/27 14:57:17:0616] DEBUG: lws_server_socket_service: wsi 0x258f070: ah read rxpos 4, rxlen 4
[2017/06/27 14:57:20:0198] NOTICE: wsi 0x258f070: TIMEDOUT WAITING on 3 (did hdr 0, ah 0x23b1480, wl 0, pfd events 1) 1498546640 vs 1498546639

Without the logging he just shows timeout, ie, no failure at parse.

I'll try to fake up the same pipelining as well.

rickynelson1939 commented 7 years ago

Hmm, I'll try, but how exactly am I supposed to do this? Here's a log with log level 65535 and input from a real socket; no forbid was triggered log.txt.zip

lws-team commented 7 years ago

I just mean do the same recv dump you did before, but with logging enabled... are the 76th and 77th packets still fragmented in the same place?

rickynelson1939 commented 7 years ago

no, it seems at some point something went differently recv_dump_no_bug.zip

rickynelson1939 commented 7 years ago

wait a minute, what the hell, there's a bunch of zeros at the end of the first file

lws-team commented 7 years ago

FWIW I found I can spam endless amounts of fragmented and unfragmented requests on one connection with no problems.

 $ while [ 1 ] ; do echo -n -e "GET /favicon.ico HTTP/1.1\x0d\x0a\x0d\x0aGET /favicon.ico HTTP/1" | nc localhost 7680 ; echo -n -e ".1\x0d\x0a\x0d\x0a" | nc localhost 7680 ; done

lws-team commented 7 years ago

wait a minute, what the hell, there's a bunch of zeros at the end of the first file

So there is... the file size is 2KB though

rickynelson1939 commented 7 years ago

damn, I used the wrong variable in the dump code, I'll redump both cases right now
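
For the record, it was the classic one (variable names approximate):

fwrite(buf, 1, len, f); /* wrong: len is the 2048-byte buffer size,
                         * so the tail gets padded with stale bytes */
fwrite(buf, 1, n, f);   /* right: n is what recv() actually returned */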

rickynelson1939 commented 7 years ago

recv_dump_with_bug.zip recv_dump_no_bug.zip

lws-team commented 7 years ago

Hmmm, in the buggy dump the earlier pkt #82 also begins with ".1" but has no problem. Presumably it blows up on pkt 157, the last one, which begins again with ".1"

pkts 82 and 157 are the same in both dumps. So it's not to do with timing or size of the rx.

rickynelson1939 commented 7 years ago

and the horrible truth is, they are exactly the same in every file, but one of them triggered the bug and the other did not...

lws-team commented 7 years ago

If it's not timing it's usually stack.

rickynelson1939 commented 7 years ago

And yet it has a pattern: obuf started with '.' every time. So it's not random, I suppose. What about the stack? Does it get corrupted? Or is there an uninitialized variable somewhere that derives its value from a dirty stack? Without logging this error is incredibly stable, every time.

lws-team commented 7 years ago

The stack contents are not actually random either... it's not just inherited data but also, for example, pointer writes behind stack buffers stomping on the caller's data, which is only noticed after it pops back up to the parent and the parent uses it. Other than this suspicious .1 business it acts very nondeterministically.... if there's only one thread and it's not network timing, then that's actually quite hard to achieve.
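
A toy example of that failure mode, nothing to do with lws itself:

#include <stdio.h>
#include <string.h>

static void callee(void)
{
    char buf[8];

    /* 24 bytes past the end: may land in the caller's frame */
    memset(buf, 0, 32);
}

int main(void)
{
    int state = 42;

    callee();
    /* undefined behaviour: may print 0, or still 42 -- which is why
     * such bugs look nondeterministic and move around when unrelated
     * things (like logging) change the frame layout */
    printf("state = %d\n", state);

    return 0;
}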

There's also the fact I have not been able to trigger it at all.... it could be specific to platform or toolchain (I am on x86_64 Fedora 26 / gcc 7.1.1).

rickynelson1939 commented 7 years ago

Ubuntu 16.04.2 LTS gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)

rickynelson1939 commented 7 years ago

adding usleep(20 * 1000); before recv cures this as well

rickynelson1939 commented 7 years ago

maybe it has something to do with the way my lws is configured, I'll leave it here, just in case:

-DCMAKE_BUILD_TYPE=DEBUG -DLIB_SUFFIX=64 -DCMAKE_CXX_FLAGS=-m64 -DCMAKE_C_FLAGS=-m64 \
-DLWS_WITH_SHARED=OFF \
-DLWS_WITH_SSL=OFF \
-DLWS_USE_BUNDLED_ZLIB_DEFAULT=OFF \
-DLWS_WITH_ZLIB=OFF \
-DLWS_WITHOUT_CLIENT=ON \
-DLWS_WITHOUT_TEST_SERVER_EXTPOLL=ON \
-DLWS_WITHOUT_TEST_PING=ON \
-DLWS_WITHOUT_TEST_ECHO=ON \
-DLWS_WITHOUT_TEST_CLIENT=ON \
-DLWS_WITHOUT_TEST_FRAGGLE=ON \
-DLWS_WITHOUT_EXTENSIONS=ON \
-DLWS_WITH_RANGES=OFF \
-DLWS_WITH_ZIP_FOPS=0

lws-team commented 7 years ago

Those options (including the m64) don't change anything here. Is your platform x86_64 or 32-bit though?

Did you remove this abortive patch from Sunday about early ah detach? It's removed in master.

rickynelson1939 commented 7 years ago

I just cloned master branch when we started. Ubuntu is 64 bit

lws-team commented 7 years ago

Reviewing what we found so far, the evidence points to an extra call to lws_header_table_reset().

When we get the headers split between two packets it's supposed to do this.... this is taken from your log line 6148

_lws_header_table_reset
lws_handshake_server
_lws_header_table_reset
_lws_header_table_reset
lws_handshake_server
lws_handshake_server

The first three lines do the whole headers in one go. On the last three, it first clears down the ah and starts to parse the last part of the current packet; then it runs out, and the last line is it resuming the parse from the next packet. There is no problem.

But just before it blows the pattern is (line 6226 in your log)

_lws_header_table_reset
lws_handshake_server
_lws_header_table_reset
_lws_header_table_reset
lws_handshake_server
_lws_header_table_reset
lws_handshake_server

There is an extra ah reset in between, which kills the parser state. When we resume on the next packet, the parser has gone back to zero and is looking for the method, and gets ".", which fails out since it doesn't match any method.
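
In other words the invariant being broken is roughly this (a sketch only, with an invented field name; not the actual fix):

/* the ah must not be reset while the parser still holds state for a
 * partially-received start line; "parsing_in_progress" is made up */
if (!wsi->parsing_in_progress)
    _lws_header_table_reset(ah);
/* else: resetting here zeroes the parser state, so the next packet's
 * leading ".1" is read as the start of a method and hits forbid */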

The only non-client guys who reset the ah are

Can you add logs to those three and re-run the test, logging lws_handshake_server and _lws_header_table_reset again? The first two shouldn't get called past the start and end, but lws_http_transaction_completed() might be too chatty for the bug. If so, you can try swallowing the first n of them until we just get the ones near the bug.

rickynelson1939 commented 7 years ago

log.txt

lws-team commented 7 years ago

Thanks... that gave me an idea of the root cause, but I have to do other stuff now. I'll study it later or tomorrow and reply then.

rickynelson1939 commented 7 years ago

Okay. And thank you too, this was interesting to investigate :) I need to go to sleep soon anyway.

lws-team commented 7 years ago

Please update against master and see if there's any change.

rickynelson1939 commented 7 years ago

I cloned the master branch again, since I mutilated too much code yesterday, and... "forbidding on uri sanitation" again

lws-team commented 7 years ago

It's still coming at the same point, ie, when we get a packet starting ".1"?

I checked I can spam that endlessly still without problems.

If it's still coming on ".1", how about the test from yesterday that ends with

_lws_header_table_reset
lws_handshake_server
_lws_header_table_reset
_lws_header_table_reset
lws_handshake_server
_lws_header_table_reset
lws_handshake_server

it's still the same?