Closed alexgille closed 8 years ago
Hm... what we do in Linux with that kind of situation is run under valgrind. So if you normally run sudo lwsws
, you would run sudo valgrind lwsws
.
He filters tracks memory accesses and reports problems with a backtrace.
According to this
http://stackoverflow.com/questions/413477/is-there-a-good-valgrind-substitute-for-windows
windows tools for the same are inferior and you can use wine to run your windows binary with valgrind on Linux.
Rather than asking me to guess, running valgrind and get a backtrace for the first bad access is a hugely better bet.
I would first of all suggest building a debug build in visual studio and pressing "break" when you get that exception to get a stacktrace where this happens.
Also regarding valgrind alternatives I'd suggest http://drmemory.org/ ... which works on Windows (and Linux)... it is free and open source.
@JoakimSoderberg I think the issue is the error he gets is quite disconnected from the actual cause... much later it notices that a malloc header is corrupted. But where it notices that is unrelated to what trashed it. (IIUI anyway). Drmemory sounds pretty good though, thanks.
Yea sure, but that would still be my first step at least :)
Next step is Dr. Memory... I've used it succesfully before. But I turned it of for my projects in continuous integration because it was a bit flaky at times. It is being developed quite actively so it has probably improved a lot since I use it last :)
It would be nice to have some unit tests for libwebsockets running through valgrind / drmemory though in the CI pipeline to catch these things ... One idea would be to create a simple script with a test server and test client running something and terminating after a while, must not be a "proper" unit test. Or simply run the authobahn test through valgrind. A similar thing I did for my project:
https://github.com/JoakimSoderberg/libws/blob/master/.travis.yml#L17 <-- installing autobahn https://github.com/JoakimSoderberg/libws/blob/master/.travis-ci-autobahn.sh#L6 <-- Running fuzzing server and later connecting client to it
But yea, just an idea... I don't have time to implement that myself, just putting it out there :)
@JoakimSoderberg I'm already building on Debug and running the executable in VS2015 debugger, otherwhise it's impossible to know what caused the crash. When I manage to have the exception raised, the VS debugger only gives me hexadecimal memory address where it was raised, in which thread, in which dll, but not at which code line nor on which variable/resource.
@lws-team It looks like the server freed a resource but still references thus uses it causing the crash...
I did a benchmark with Apache ab.exe to stress-test my server, asking 10,000 parallel connections to send 10,000 requests each, and it worked incredibly fine without complaining. No crash, no memory leaks, no process memory growing.
I will try DrMemory and get back to you with more hints.
Well that's good to hear anyway.
The error you're dealing with is not merely a 'reference'... something has written behind its allocation or beyond its allocation to crap up someone else's heap management area, that's what it means by "heap corruption".
Valgrind will definitely catch such problems and give you a backtrace to who crapped on the memory, which is very valuable under these conditions. If drmemory can do the same, you should be able to pinpoint who did the bad write in the first place.
Here is a first error I have in Dr. Memory after stress-test :
~~Dr.M~~
~~Dr.M~~ Error #16: UNADDRESSABLE ACCESS: reading 0x00d2ffec-0x00d2ffee 2 byte(s)
~~Dr.M~~ # 0 websockets.dll!lws_get_peer_addresses [c:\tools\libwebsockets-2.0.0\lib\libwebsockets.c:751]
~~Dr.M~~ # 1 websockets.dll!lws_close_reason [c:\tools\libwebsockets-2.0.0\lib\libwebsockets.c:1373]
~~Dr.M~~ # 2 websockets.dll!lws_generate_client_handshake [c:\tools\libwebsockets-2.0.0\lib\client.c:918]
~~Dr.M~~ # 3 libuv.dll!uv_timer_get_repeat +0x10b (0x0ffd6f7c <libuv.dll+0x16f7c>)
~~Dr.M~~ # 4 libuv.dll!uv_run +0x15f (0x0ffc44e0 <libuv.dll+0x44e0>)
~~Dr.M~~ # 5 KERNEL32.dll!BaseThreadInitThunk +0x23 (0x75bb38f4 <KERNEL32.dll+0x138f4>)
~~Dr.M~~ Note: @0:08:51.810 in thread 7804
~~Dr.M~~ Note: instruction: movsx 0x04(%eax,%ecx,8) -> %eax
It's a bit confusing.... lws_close_reason() does not call lws_get_peer_address()
LWS_VISIBLE LWS_EXTERN void
lws_close_reason(struct lws *wsi, enum lws_close_status status,
unsigned char *buf, size_t len)
{
unsigned char *p, *start;
int budget = sizeof(wsi->u.ws.ping_payload_buf) - LWS_PRE;
assert(wsi->mode == LWSCM_WS_SERVING || wsi->mode == LWSCM_WS_CLIENT);
start = p = &wsi->u.ws.ping_payload_buf[LWS_PRE];
*p++ = (((int)status) >> 8) & 0xff;
*p++ = ((int)status) & 0xff;
if (buf)
while (len-- && p < start + budget)
*p++ = *buf++;
wsi->u.ws.close_in_ping_buffer_len = p - start;
}
Nor does lws_generate_client_handshake() call lws_close_reason()...
Something seems wrong: https://github.com/warmcat/libwebsockets/blob/v2.0-stable/lib/libwebsockets.c#L1373
I am assuming you are using v2.0? since your directory is named libwebsockets-2.0.0
But line 1373 in libwebsockets.c is not lws_close_reason
but rather _lws_rx_flow_control
Maybe completly clean your build dir and regenerate the cmake to make sure you're not using some garbage files?
Yeah I checked on my version of libwebsockets and it's the same oddity...
Here is the complete Dr. Memory report.
@JoakimSoderberg I'm downloading the latest version on master and build it now. My version is 2.0.0 from master, downloaded a week ago.
Valgrind can't deal with OpenSSL either, these complaints seem related to that. Aside from fixing whatever source level debug problem you have, maybe try and see if any problems are still coming with SSL disabled (it means comment the cert/key things in the conf)
I'm not currently using/supporting SSL. You mean commenting "init-ssl": "yes"
in global conf file?
Yes, sorry, if it's enabled.
Now init-ssl is commented. With latest version, the callstack for error #16 is not the same :
Error #16: UNADDRESSABLE ACCESS beyond heap bounds: reading 0x0219ffcc-0x0219ffce 2 byte(s)
# 0 websockets.dll!lws_service_timeout_check [c:\tools\libwebsockets-2.0.0\lib\service.c:332]
# 1 websockets.dll!lws_service_fd_tsi [c:\tools\libwebsockets-2.0.0\lib\service.c:701]
# 2 websockets.dll!lws_uv_timeout_cb [c:\tools\libwebsockets-2.0.0\lib\libuv.c:141]
# 3 libuv.dll!uv_timer_get_repeat +0x10b (0x0ff86f7c <libuv.dll+0x16f7c>)
# 4 libuv.dll!uv_run +0x15f (0x0ff744e0 <libuv.dll+0x44e0>)
# 5 websockets.dll!lws_libuv_run [c:\tools\libwebsockets-2.0.0\lib\libuv.c:358]
# 6 KERNEL32.dll!BaseThreadInitThunk +0x23 (0x75bb38f4 <KERNEL32.dll+0x138f4>)
Note: @0:02:49.986 in thread 2848
Note: instruction: movsx 0x04(%edx,%eax,8) -> %eax
And some other interesting errors :
Error #19: HANDLE LEAK: KERNEL handle 0x00000124 and 9999 similar handle(s) were opened but not closed:
# 0 system call NtCreateFile
# 1 MSWSOCK.dll!Tcpip6_WSHGetSocketInformation +0x540 (0x73e99cd1 <MSWSOCK.dll+0x9cd1>)
# 2 MSWSOCK.dll!sethostname +0x3e41 (0x73ead022 <MSWSOCK.dll+0x1d022>)
# 3 WS2_32.dll!accept +0x87 (0x77286d88 <WS2_32.dll+0x16d88>)
# 4 websockets.dll!lws_server_socket_service [c:\tools\libwebsockets-2.0.0\lib\server.c:1808]
# 5 websockets.dll!lws_service_fd_tsi [c:\tools\libwebsockets-2.0.0\lib\service.c:786]
# 6 websockets.dll!lws_service_fd [c:\tools\libwebsockets-2.0.0\lib\service.c:1086]
# 7 websockets.dll!lws_io_cb [c:\tools\libwebsockets-2.0.0\lib\libuv.c:101]
# 8 libuv.dll!uv_key_set +0x37f (0x0ff7e560 <libuv.dll+0xe560>)
# 9 libuv.dll!uv_async_send +0x377 (0x0ff73b78 <libuv.dll+0x3b78>)
#10 libuv.dll!uv_run +0x166 (0x0ff744e7 <libuv.dll+0x44e7>)
#11 websockets.dll!lws_libuv_run [c:\tools\libwebsockets-2.0.0\lib\libuv.c:358]
#12 KERNEL32.dll!BaseThreadInitThunk +0x23 (0x75bb38f4 <KERNEL32.dll+0x138f4>)
Note: @0:02:58.992 in thread 2848
Error #20: HANDLE LEAK: KERNEL handle 0x000001ec and 7297 similar handle(s) were opened but not closed:
# 0 system call NtCreateEvent
# 1 avghookx.dll!? +0x0 (0x72d528ab <avghookx.dll+0x28ab>)
# 2 KERNELBASE.dll!CreateEventExW +0x65 (0x7753cba6 <KERNELBASE.dll+0xbcba6>)
# 3 KERNELBASE.dll!CreateEventA +0x32 (0x7753cb03 <KERNELBASE.dll+0xbcb03>)
# 4 WS2_32.dll!WSACreateEvent +0xc (0x7728695d <WS2_32.dll+0x1695d>)
# 5 websockets.dll!lws_plat_insert_socket_into_fds [c:\tools\libwebsockets-2.0.0\lib\lws-plat-win.c:410]
# 6 websockets.dll!insert_wsi_socket_into_fds [c:\tools\libwebsockets-2.0.0\lib\pollfd.c:151]
# 7 websockets.dll!lws_adopt_socket_vhost [c:\tools\libwebsockets-2.0.0\lib\server.c:1464]
# 8 websockets.dll!lws_server_socket_service [c:\tools\libwebsockets-2.0.0\lib\server.c:1844]
# 9 websockets.dll!lws_service_fd_tsi [c:\tools\libwebsockets-2.0.0\lib\service.c:786]
#10 websockets.dll!lws_service_fd [c:\tools\libwebsockets-2.0.0\lib\service.c:1086]
#11 websockets.dll!lws_io_cb [c:\tools\libwebsockets-2.0.0\lib\libuv.c:101]
#12 libuv.dll!uv_key_set +0x37f (0x0ff7e560 <libuv.dll+0xe560>)
#13 libuv.dll!uv_async_send +0x377 (0x0ff73b78 <libuv.dll+0x3b78>)
#14 libuv.dll!uv_run +0x166 (0x0ff744e7 <libuv.dll+0x44e7>)
#15 websockets.dll!lws_libuv_run [c:\tools\libwebsockets-2.0.0\lib\libuv.c:358]
#16 KERNEL32.dll!BaseThreadInitThunk +0x23 (0x75bb38f4 <KERNEL32.dll+0x138f4>)
Note: @0:02:58.992 in thread 2848
Note: handles created with the same callstack are closed here:
Note: # 0 system call NtClose
Note: # 1 KERNELBASE.dll!CloseHandle +0x57 (0x7752e0d8 <KERNELBASE.dll+0xae0d8>)
Note: # 2 WS2_32.dll!WSACloseEvent +0x18 (0x77286909 <WS2_32.dll+0x16909>)
Note: # 3 websockets.dll!lws_plat_delete_socket_from_fds [c:\tools\libwebsockets-2.0.0\lib\lws-plat-win.c:421]
Note: # 4 websockets.dll!remove_wsi_socket_from_fds [c:\tools\libwebsockets-2.0.0\lib\pollfd.c:233]
Note: # 5 websockets.dll!lws_close_free_wsi [c:\tools\libwebsockets-2.0.0\lib\libwebsockets.c:423]
Note: # 6 websockets.dll!lws_service_timeout_check [c:\tools\libwebsockets-2.0.0\lib\service.c:351]
Note: # 7 websockets.dll!lws_service_fd_tsi [c:\tools\libwebsockets-2.0.0\lib\service.c:701]
Note: # 8 websockets.dll!lws_uv_timeout_cb [c:\tools\libwebsockets-2.0.0\lib\libuv.c:141]
Note: # 9 libuv.dll!uv_timer_get_repeat +0x10b (0x0ff86f7c <libuv.dll+0x16f7c>)
Note: #10 libuv.dll!uv_run +0x15f (0x0ff744e0 <libuv.dll+0x44e0>)
Note: #11 websockets.dll!lws_libuv_run [c:\tools\libwebsockets-2.0.0\lib\libuv.c:358]
Note: #12 KERNEL32.dll!BaseThreadInitThunk +0x23 (0x75bb38f4 <KERNEL32.dll+0x138f4>)
Then handle leaks might be caused by the crash...
Full report : results_latest.txt
I found out that in certain benchmark configurations the stress-test will not crash the executable, and in other it will crash in less than a minute.
Now I can reproduce this issue everytime with :
ab.exe -n 20000 -c 10000 http://127.0.0.1/index.html
'crash' doesn't give me any clue. Where does it crash.
I looked at the first error given,it's an lwsl_notice with some args.
I pushed a patch (on master) in case the sockfd is invalid, but that'd be a read, not a write causing heap corruption.
I dunno how to interpret the other ones, valgrind doesn't have similar complaints.
By 'crash' I mean the console closes and disappear without saying anything while running. I'm short of solutions now...
The windows stuff is otherwise acting so like the Linux build now, I start to wonder if it's on my side. But I can do the ab test
$ ab -n 20000 -c 10000 http://127.0.0.1:7681/test.html
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient)
Completed 2000 requests
Completed 4000 requests
Completed 6000 requests
Completed 8000 requests
Completed 10000 requests
Completed 12000 requests
Completed 14000 requests
Completed 16000 requests
Completed 18000 requests
Completed 20000 requests
Finished 20000 requests
Server Software:
Server Hostname: 127.0.0.1
Server Port: 7681
Document Path: /test.html
Document Length: 18662 bytes
Concurrency Level: 10000
Time taken for tests: 4.578 seconds
Complete requests: 20000
Failed requests: 0
Total transferred: 376020000 bytes
HTML transferred: 373240000 bytes
Requests per second: 4368.93 [#/sec] (mean)
Time per request: 2288.889 [ms] (mean)
Time per request: 0.229 [ms] (mean, across all concurrent requests)
Transfer rate: 80215.14 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 266 413.9 68 3004
Processing: 63 1669 697.8 1836 2744
Waiting: 59 1668 698.1 1836 2744
Total: 366 1934 902.1 1837 4300
Percentage of the requests served within a certain time (ms)
50% 1837
66% 2056
75% 2290
80% 2370
90% 3314
95% 3568
98% 3729
99% 3739
100% 4300 (longest request)
The basic problem is the 'crash' is actually a much later abort when the malloc subsystem checks and sees, eg, sentinel bytes before the start of the allocated region got overwritten. But that situation came about because of a much earlier write.that is actually the cause. So it can't be attacked from the abort end - it's unrelated in time.
It could still be a generic library issue that only your setup reveals, or something in the win plat part, but since you have the symptom we have to try to catch where it goes wrong on your side. If you have access to a linux box, you might try the wine valgrind thing, in my experience valgrind is really good at finding and reporting these kinds of problem.
Hi Andy,
The windows stuff is otherwise acting so like the Linux build now
Is it related to a new commit? If yes I'll test it and come back to you with my results.
I wish I could go deeper in the code to find out what's going wrong but as it appears to happen beyond the limit of my code, and as it looks like it's not coming from my usage of the lib (see code above), I'm kind of blind trying to find a solution. Do you have a Win32 platform for testing your builds at libwebsockets? Did you run the ab benchmark on Win32 or Linux?
I'm setting up a VBox + Linux testing environment to follow your guidance and tell you if I find something special.
I mean that you are able to otherwise use the library on win32 after the fixes from a couple of weeks ago.
I only build-test it on windows via Appveyor, I don't have any windows boxes here.
I ran ab on Linux, because that's what I develop on. There are no valgrind errors or heap corruption (glibc, who provides the malloc, also looks for these).
If you had this problem on Linux, I would really expect valgrind to find and report the root cause, so combined with your ability to reproduce it, I'm hoping wine + valgrind can do the same trick.
Has anything happened about this in the meanwhile?
Well, if there's any new info, please reopen
Sorry for the late reply, I've been in a rush for weeks. I will not be able to do the wine + valgrind trick on a win32 VBox for the moment as I'm short of time. I think libwebsocket is a promissing library that deserve a full Windows support in the medium term.
Mmm well, the reason it's like it is for windows is basically driveby level of interest on that platform.
Hi @lws-team,
I'm experiencing a fatal Heap Corruption Exception (0xC0000374) on my Win32 executable.
The exception is raised unexpectedly after an HTTP request to load web content (html, js, css); no web socket is involved. The behaviour is unpredictable; sometimes it's raised on an HTTP request after 5 minutes of execution, sometime after an hour.
The only relation I've found so far with HTTP requests is that there's a higher probability to reproduce this issue (and have a heap corrupted) when the server waited a long time without answering to any HTTP request, but it is still impossible to reproduce it everytime.
My program crashes without any console output when run in regular console, but raises the exception when run in a Visual Studio 2015 debugger. I don't have a strong Win32-debugging knowledge so the issue might be on my side (e.g. the way I use libwebsockets).
To help you :
conf files
main.c - slightly different from lwsws.c
console output
Regards