warmcat / libwebsockets

canonical libwebsockets.org networking library
https://libwebsockets.org
Other
4.78k stars 1.49k forks source link

[Win10] Heap Corruption Exception (0xC0000374) on HTTP request with lwsws #555

Closed alexgille closed 8 years ago

alexgille commented 8 years ago

Hi @lws-team,

I'm experiencing a fatal Heap Corruption Exception (0xC0000374) on my Win32 executable.

3683 clip_image001_394cb0f4

The exception is raised unexpectedly after an HTTP request to load web content (html, js, css); no web socket is involved. The behaviour is unpredictable; sometimes it's raised on an HTTP request after 5 minutes of execution, sometime after an hour.

The only relation I've found so far with HTTP requests is that there's a higher probability to reproduce this issue (and have a heap corrupted) when the server waited a long time without answering to any HTTP request, but it is still impossible to reproduce it everytime.

My program crashes without any console output when run in regular console, but raises the exception when run in a Visual Studio 2015 debugger. I don't have a strong Win32-debugging knowledge so the issue might be on my side (e.g. the way I use libwebsockets).

To help you :

conf files

// conf
{
    "global": {
        "count-threads": "1",
        "server-string": "lwsws",
        "init-ssl": "yes"
    }
}

// vhost
{
    "vhosts": [{
        "name": "localhost",
        "port": "80",
        "mounts": [{
            "mountpoint": "/",
            "origin": "file://C:/dev/WebDisplayCtrlSolution/WebSources/",
            "default": "index.html"
        }],
        "ws-protocols": [{
            "wdc_protocol": {
                "default": "1"
            }
        }]
    }]
}

main.c - slightly different from lwsws.c

#include <signal.h>
#include <libwebsockets.h>

static struct lws_context *context;

#define LWSWS_CONFIG_STRING_SIZE (32 * 1024)

static const char * const plugin_dirs[] = {
    "C:/dev/WebDisplayCtrlSolution/Debug/plugins",
    NULL
};

void signal_cb(int signum)
{
    lwsl_err("Signal %d caught, exiting...\n", signum);
    lws_libuv_stop(context);
}

int _tmain(int argc, _TCHAR* argv[])
{
    struct lws_context_creation_info info;
    char *cs;
    int opts = 0, cs_len = LWSWS_CONFIG_STRING_SIZE - 1;

    int debug_level = 7;
    char config_dir[128];
    char *config_strings;

    memset(&info, 0, sizeof info);
    strcpy(config_dir, "C:/dev/WebDisplayCtrlSolution/Debug");

    lws_set_log_level(debug_level, lwsl_emit_syslog);

    lwsl_notice("lwsws libwebsockets web server - license GPL2.1\n");
    lwsl_notice("(C) Copyright 2010-2016 Andy Green <andy@warmcat.com>\n");

    cs = config_strings = (char *) malloc(LWSWS_CONFIG_STRING_SIZE);
    if (!config_strings) {
        lwsl_err("Unable to allocate config strings heap\n");
        return -1;
    }

    signal(SIGINT, signal_cb);

    info.max_http_header_pool = 16;
    info.options = opts | LWS_SERVER_OPTION_VALIDATE_UTF8 |
        LWS_SERVER_OPTION_EXPLICIT_VHOSTS |
        LWS_SERVER_OPTION_LIBUV;

    info.plugin_dirs = plugin_dirs;
    lwsl_notice("Using config dir: \"%s\"\n", config_dir);

    if (lwsws_get_config_globals(&info, config_dir, &cs, &cs_len)) {
        free(config_strings);
        return 1;
    }

    context = lws_create_context(&info);
    if (context == NULL) {
        lwsl_err("libwebsocket init failed\n");
        free(config_strings);
        return 1;
    }

    if (!lwsws_get_config_vhosts(context, &info, config_dir, &cs, &cs_len)) {
           // no libuv signal handler
        lws_uv_sigint_cfg(context, 0, NULL);
        lws_uv_initloop(context, NULL, 0);

        lws_libuv_run(context, 0);
    }

    lws_context_destroy(context);
    free(config_strings);

    fprintf(stderr, "lwsws exited cleanly\n");

    return 0;
}

console output

[2016/05/03 11:40:58:1468] NOTICE: lwsws libwebsockets web server - license GPL2.1
[2016/05/03 11:40:58:1473] NOTICE: (C) Copyright 2010-2016 Andy Green <andy@warmcat.com>
[2016/05/03 11:40:58:1473] NOTICE: Using config dir: "C:/dev/WebDisplayCtrlSolution/Debug"
[2016/05/03 11:40:58:1548] NOTICE: Initial logging level 7
[2016/05/03 11:40:58:1548] NOTICE: Libwebsockets version: 2.0.0 desktop-kevdfjd\cm3@DESKTOP-KEVDFJD-
[2016/05/03 11:40:58:1553] NOTICE: IPV6 not compiled in
[2016/05/03 11:40:58:1553] NOTICE: libev support not compiled in
[2016/05/03 11:40:58:1553] NOTICE: libuv support compiled in and enabled
[2016/05/03 11:40:58:1553] NOTICE:  Threads: 1 each 30000 fds
[2016/05/03 11:40:58:1573] NOTICE:   Plugins:
[2016/05/03 11:40:58:1573] NOTICE:   Scanning C:/dev/WebDisplayCtrlSolution/Debug/plugins
[2016/05/03 11:40:58:1578] NOTICE:    protocol_dumb_increment.dll
[2016/05/03 11:40:58:1593] NOTICE:    WDCProtocol.dll
[2016/05/03 11:40:58:1644] NOTICE:  Compiled with OpenSSL support
[2016/05/03 11:40:58:1754] NOTICE:  mem: per-conn:          656 bytes + protocol rx buf
[2016/05/03 11:40:58:1889] NOTICE:  canonical_hostname = DESKTOP-KEVDFJD
[2016/05/03 11:40:58:1894] ERR: ----file://C:/dev/WebDisplayCtrlSolution/WebSources/
[2016/05/03 11:40:58:1894] NOTICE:   adding protocol wdc_protocol
[2016/05/03 11:40:58:1899] NOTICE: Creating Vhost 'localhost' port 80, 2 protocols
[2016/05/03 11:40:58:1899] NOTICE:    mounting file://C:/dev/WebDisplayCtrlSolution/WebSources/ to /
[2016/05/03 11:40:58:1899] NOTICE:  Using non-SSL mode
[2016/05/03 11:40:58:7273] NOTICE: lws_protocol_init
[2016/05/03 11:40:58:7279] NOTICE:     vh localhost prot wdc_protocol opt default
[2016/05/03 11:40:58:7284] NOTICE: Setting default protocol for vh localhost to wdc_protocol
[2016/05/03 11:40:58:7284] NOTICE: ### protocol init

Regards

lws-team commented 8 years ago

Hm... what we do in Linux with that kind of situation is run under valgrind. So if you normally run sudo lwsws, you would run sudo valgrind lwsws.

He filters tracks memory accesses and reports problems with a backtrace.

According to this

http://stackoverflow.com/questions/413477/is-there-a-good-valgrind-substitute-for-windows

windows tools for the same are inferior and you can use wine to run your windows binary with valgrind on Linux.

Rather than asking me to guess, running valgrind and get a backtrace for the first bad access is a hugely better bet.

JoakimSoderberg commented 8 years ago

I would first of all suggest building a debug build in visual studio and pressing "break" when you get that exception to get a stacktrace where this happens.

Also regarding valgrind alternatives I'd suggest http://drmemory.org/ ... which works on Windows (and Linux)... it is free and open source.

lws-team commented 8 years ago

@JoakimSoderberg I think the issue is the error he gets is quite disconnected from the actual cause... much later it notices that a malloc header is corrupted. But where it notices that is unrelated to what trashed it. (IIUI anyway). Drmemory sounds pretty good though, thanks.

JoakimSoderberg commented 8 years ago

Yea sure, but that would still be my first step at least :)

Next step is Dr. Memory... I've used it succesfully before. But I turned it of for my projects in continuous integration because it was a bit flaky at times. It is being developed quite actively so it has probably improved a lot since I use it last :)

It would be nice to have some unit tests for libwebsockets running through valgrind / drmemory though in the CI pipeline to catch these things ... One idea would be to create a simple script with a test server and test client running something and terminating after a while, must not be a "proper" unit test. Or simply run the authobahn test through valgrind. A similar thing I did for my project:

https://github.com/JoakimSoderberg/libws/blob/master/.travis.yml#L17 <-- installing autobahn https://github.com/JoakimSoderberg/libws/blob/master/.travis-ci-autobahn.sh#L6 <-- Running fuzzing server and later connecting client to it

But yea, just an idea... I don't have time to implement that myself, just putting it out there :)

alexgille commented 8 years ago

@JoakimSoderberg I'm already building on Debug and running the executable in VS2015 debugger, otherwhise it's impossible to know what caused the crash. When I manage to have the exception raised, the VS debugger only gives me hexadecimal memory address where it was raised, in which thread, in which dll, but not at which code line nor on which variable/resource.

@lws-team It looks like the server freed a resource but still references thus uses it causing the crash...

alexgille commented 8 years ago

I did a benchmark with Apache ab.exe to stress-test my server, asking 10,000 parallel connections to send 10,000 requests each, and it worked incredibly fine without complaining. No crash, no memory leaks, no process memory growing.

I will try DrMemory and get back to you with more hints.

lws-team commented 8 years ago

Well that's good to hear anyway.

The error you're dealing with is not merely a 'reference'... something has written behind its allocation or beyond its allocation to crap up someone else's heap management area, that's what it means by "heap corruption".

Valgrind will definitely catch such problems and give you a backtrace to who crapped on the memory, which is very valuable under these conditions. If drmemory can do the same, you should be able to pinpoint who did the bad write in the first place.

alexgille commented 8 years ago

Here is a first error I have in Dr. Memory after stress-test :

~~Dr.M~~
~~Dr.M~~ Error #16: UNADDRESSABLE ACCESS: reading 0x00d2ffec-0x00d2ffee 2 byte(s)
~~Dr.M~~ # 0 websockets.dll!lws_get_peer_addresses       [c:\tools\libwebsockets-2.0.0\lib\libwebsockets.c:751]
~~Dr.M~~ # 1 websockets.dll!lws_close_reason             [c:\tools\libwebsockets-2.0.0\lib\libwebsockets.c:1373]
~~Dr.M~~ # 2 websockets.dll!lws_generate_client_handshake [c:\tools\libwebsockets-2.0.0\lib\client.c:918]
~~Dr.M~~ # 3 libuv.dll!uv_timer_get_repeat              +0x10b    (0x0ffd6f7c <libuv.dll+0x16f7c>)
~~Dr.M~~ # 4 libuv.dll!uv_run                           +0x15f    (0x0ffc44e0 <libuv.dll+0x44e0>)
~~Dr.M~~ # 5 KERNEL32.dll!BaseThreadInitThunk           +0x23     (0x75bb38f4 <KERNEL32.dll+0x138f4>)
~~Dr.M~~ Note: @0:08:51.810 in thread 7804
~~Dr.M~~ Note: instruction: movsx  0x04(%eax,%ecx,8) -> %eax
lws-team commented 8 years ago

It's a bit confusing.... lws_close_reason() does not call lws_get_peer_address()

LWS_VISIBLE LWS_EXTERN void
lws_close_reason(struct lws *wsi, enum lws_close_status status,
         unsigned char *buf, size_t len)
{
    unsigned char *p, *start;
    int budget = sizeof(wsi->u.ws.ping_payload_buf) - LWS_PRE;

    assert(wsi->mode == LWSCM_WS_SERVING || wsi->mode == LWSCM_WS_CLIENT);

    start = p = &wsi->u.ws.ping_payload_buf[LWS_PRE];

    *p++ = (((int)status) >> 8) & 0xff;
    *p++ = ((int)status) & 0xff;

    if (buf)
        while (len-- && p < start + budget)
            *p++ = *buf++;

    wsi->u.ws.close_in_ping_buffer_len = p - start;
}
lws-team commented 8 years ago

Nor does lws_generate_client_handshake() call lws_close_reason()...

JoakimSoderberg commented 8 years ago

Something seems wrong: https://github.com/warmcat/libwebsockets/blob/v2.0-stable/lib/libwebsockets.c#L1373

I am assuming you are using v2.0? since your directory is named libwebsockets-2.0.0

But line 1373 in libwebsockets.c is not lws_close_reason but rather _lws_rx_flow_control

Maybe completly clean your build dir and regenerate the cmake to make sure you're not using some garbage files?

alexgille commented 8 years ago

Yeah I checked on my version of libwebsockets and it's the same oddity...

Here is the complete Dr. Memory report.

@JoakimSoderberg I'm downloading the latest version on master and build it now. My version is 2.0.0 from master, downloaded a week ago.

lws-team commented 8 years ago

Valgrind can't deal with OpenSSL either, these complaints seem related to that. Aside from fixing whatever source level debug problem you have, maybe try and see if any problems are still coming with SSL disabled (it means comment the cert/key things in the conf)

alexgille commented 8 years ago

I'm not currently using/supporting SSL. You mean commenting "init-ssl": "yes" in global conf file?

lws-team commented 8 years ago

Yes, sorry, if it's enabled.

alexgille commented 8 years ago

Now init-ssl is commented. With latest version, the callstack for error #16 is not the same :

Error #16: UNADDRESSABLE ACCESS beyond heap bounds: reading 0x0219ffcc-0x0219ffce 2 byte(s)
# 0 websockets.dll!lws_service_timeout_check [c:\tools\libwebsockets-2.0.0\lib\service.c:332]
# 1 websockets.dll!lws_service_fd_tsi       [c:\tools\libwebsockets-2.0.0\lib\service.c:701]
# 2 websockets.dll!lws_uv_timeout_cb        [c:\tools\libwebsockets-2.0.0\lib\libuv.c:141]
# 3 libuv.dll!uv_timer_get_repeat          +0x10b    (0x0ff86f7c <libuv.dll+0x16f7c>)
# 4 libuv.dll!uv_run                       +0x15f    (0x0ff744e0 <libuv.dll+0x44e0>)
# 5 websockets.dll!lws_libuv_run            [c:\tools\libwebsockets-2.0.0\lib\libuv.c:358]
# 6 KERNEL32.dll!BaseThreadInitThunk       +0x23     (0x75bb38f4 <KERNEL32.dll+0x138f4>)
Note: @0:02:49.986 in thread 2848
Note: instruction: movsx  0x04(%edx,%eax,8) -> %eax

And some other interesting errors :

Error #19: HANDLE LEAK: KERNEL handle 0x00000124 and 9999 similar handle(s) were opened but not closed:
# 0 system call NtCreateFile      
# 1 MSWSOCK.dll!Tcpip6_WSHGetSocketInformation  +0x540    (0x73e99cd1 <MSWSOCK.dll+0x9cd1>)
# 2 MSWSOCK.dll!sethostname                     +0x3e41   (0x73ead022 <MSWSOCK.dll+0x1d022>)
# 3 WS2_32.dll!accept                           +0x87     (0x77286d88 <WS2_32.dll+0x16d88>)
# 4 websockets.dll!lws_server_socket_service     [c:\tools\libwebsockets-2.0.0\lib\server.c:1808]
# 5 websockets.dll!lws_service_fd_tsi            [c:\tools\libwebsockets-2.0.0\lib\service.c:786]
# 6 websockets.dll!lws_service_fd                [c:\tools\libwebsockets-2.0.0\lib\service.c:1086]
# 7 websockets.dll!lws_io_cb                     [c:\tools\libwebsockets-2.0.0\lib\libuv.c:101]
# 8 libuv.dll!uv_key_set                        +0x37f    (0x0ff7e560 <libuv.dll+0xe560>)
# 9 libuv.dll!uv_async_send                     +0x377    (0x0ff73b78 <libuv.dll+0x3b78>)
#10 libuv.dll!uv_run                            +0x166    (0x0ff744e7 <libuv.dll+0x44e7>)
#11 websockets.dll!lws_libuv_run                 [c:\tools\libwebsockets-2.0.0\lib\libuv.c:358]
#12 KERNEL32.dll!BaseThreadInitThunk            +0x23     (0x75bb38f4 <KERNEL32.dll+0x138f4>)
Note: @0:02:58.992 in thread 2848

Error #20: HANDLE LEAK: KERNEL handle 0x000001ec and 7297 similar handle(s) were opened but not closed:
# 0 system call NtCreateEvent      
# 1 avghookx.dll!?                               +0x0      (0x72d528ab <avghookx.dll+0x28ab>)
# 2 KERNELBASE.dll!CreateEventExW                +0x65     (0x7753cba6 <KERNELBASE.dll+0xbcba6>)
# 3 KERNELBASE.dll!CreateEventA                  +0x32     (0x7753cb03 <KERNELBASE.dll+0xbcb03>)
# 4 WS2_32.dll!WSACreateEvent                    +0xc      (0x7728695d <WS2_32.dll+0x1695d>)
# 5 websockets.dll!lws_plat_insert_socket_into_fds [c:\tools\libwebsockets-2.0.0\lib\lws-plat-win.c:410]
# 6 websockets.dll!insert_wsi_socket_into_fds     [c:\tools\libwebsockets-2.0.0\lib\pollfd.c:151]
# 7 websockets.dll!lws_adopt_socket_vhost         [c:\tools\libwebsockets-2.0.0\lib\server.c:1464]
# 8 websockets.dll!lws_server_socket_service      [c:\tools\libwebsockets-2.0.0\lib\server.c:1844]
# 9 websockets.dll!lws_service_fd_tsi             [c:\tools\libwebsockets-2.0.0\lib\service.c:786]
#10 websockets.dll!lws_service_fd                 [c:\tools\libwebsockets-2.0.0\lib\service.c:1086]
#11 websockets.dll!lws_io_cb                      [c:\tools\libwebsockets-2.0.0\lib\libuv.c:101]
#12 libuv.dll!uv_key_set                         +0x37f    (0x0ff7e560 <libuv.dll+0xe560>)
#13 libuv.dll!uv_async_send                      +0x377    (0x0ff73b78 <libuv.dll+0x3b78>)
#14 libuv.dll!uv_run                             +0x166    (0x0ff744e7 <libuv.dll+0x44e7>)
#15 websockets.dll!lws_libuv_run                  [c:\tools\libwebsockets-2.0.0\lib\libuv.c:358]
#16 KERNEL32.dll!BaseThreadInitThunk             +0x23     (0x75bb38f4 <KERNEL32.dll+0x138f4>)
Note: @0:02:58.992 in thread 2848
Note: handles created with the same callstack are closed here:
Note: # 0 system call NtClose            
Note: # 1 KERNELBASE.dll!CloseHandle                   +0x57     (0x7752e0d8 <KERNELBASE.dll+0xae0d8>)
Note: # 2 WS2_32.dll!WSACloseEvent                     +0x18     (0x77286909 <WS2_32.dll+0x16909>)
Note: # 3 websockets.dll!lws_plat_delete_socket_from_fds [c:\tools\libwebsockets-2.0.0\lib\lws-plat-win.c:421]
Note: # 4 websockets.dll!remove_wsi_socket_from_fds     [c:\tools\libwebsockets-2.0.0\lib\pollfd.c:233]
Note: # 5 websockets.dll!lws_close_free_wsi             [c:\tools\libwebsockets-2.0.0\lib\libwebsockets.c:423]
Note: # 6 websockets.dll!lws_service_timeout_check      [c:\tools\libwebsockets-2.0.0\lib\service.c:351]
Note: # 7 websockets.dll!lws_service_fd_tsi             [c:\tools\libwebsockets-2.0.0\lib\service.c:701]
Note: # 8 websockets.dll!lws_uv_timeout_cb              [c:\tools\libwebsockets-2.0.0\lib\libuv.c:141]
Note: # 9 libuv.dll!uv_timer_get_repeat                +0x10b    (0x0ff86f7c <libuv.dll+0x16f7c>)
Note: #10 libuv.dll!uv_run                             +0x15f    (0x0ff744e0 <libuv.dll+0x44e0>)
Note: #11 websockets.dll!lws_libuv_run                  [c:\tools\libwebsockets-2.0.0\lib\libuv.c:358]
Note: #12 KERNEL32.dll!BaseThreadInitThunk             +0x23     (0x75bb38f4 <KERNEL32.dll+0x138f4>)

Then handle leaks might be caused by the crash...

alexgille commented 8 years ago

Full report : results_latest.txt

alexgille commented 8 years ago

I found out that in certain benchmark configurations the stress-test will not crash the executable, and in other it will crash in less than a minute.

Now I can reproduce this issue everytime with : ab.exe -n 20000 -c 10000 http://127.0.0.1/index.html

lws-team commented 8 years ago

'crash' doesn't give me any clue. Where does it crash.

I looked at the first error given,it's an lwsl_notice with some args.

I pushed a patch (on master) in case the sockfd is invalid, but that'd be a read, not a write causing heap corruption.

I dunno how to interpret the other ones, valgrind doesn't have similar complaints.

alexgille commented 8 years ago

By 'crash' I mean the console closes and disappear without saying anything while running. I'm short of solutions now...

lws-team commented 8 years ago

The windows stuff is otherwise acting so like the Linux build now, I start to wonder if it's on my side. But I can do the ab test

$ ab -n 20000 -c 10000 http://127.0.0.1:7681/test.html
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 2000 requests
Completed 4000 requests
Completed 6000 requests
Completed 8000 requests
Completed 10000 requests
Completed 12000 requests
Completed 14000 requests
Completed 16000 requests
Completed 18000 requests
Completed 20000 requests
Finished 20000 requests

Server Software:        
Server Hostname:        127.0.0.1
Server Port:            7681

Document Path:          /test.html
Document Length:        18662 bytes

Concurrency Level:      10000
Time taken for tests:   4.578 seconds
Complete requests:      20000
Failed requests:        0
Total transferred:      376020000 bytes
HTML transferred:       373240000 bytes
Requests per second:    4368.93 [#/sec] (mean)
Time per request:       2288.889 [ms] (mean)
Time per request:       0.229 [ms] (mean, across all concurrent requests)
Transfer rate:          80215.14 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0  266 413.9     68    3004
Processing:    63 1669 697.8   1836    2744
Waiting:       59 1668 698.1   1836    2744
Total:        366 1934 902.1   1837    4300

Percentage of the requests served within a certain time (ms)
  50%   1837
  66%   2056
  75%   2290
  80%   2370
  90%   3314
  95%   3568
  98%   3729
  99%   3739
 100%   4300 (longest request)

The basic problem is the 'crash' is actually a much later abort when the malloc subsystem checks and sees, eg, sentinel bytes before the start of the allocated region got overwritten. But that situation came about because of a much earlier write.that is actually the cause. So it can't be attacked from the abort end - it's unrelated in time.

It could still be a generic library issue that only your setup reveals, or something in the win plat part, but since you have the symptom we have to try to catch where it goes wrong on your side. If you have access to a linux box, you might try the wine valgrind thing, in my experience valgrind is really good at finding and reporting these kinds of problem.

alexgille commented 8 years ago

Hi Andy,

The windows stuff is otherwise acting so like the Linux build now

Is it related to a new commit? If yes I'll test it and come back to you with my results.

I wish I could go deeper in the code to find out what's going wrong but as it appears to happen beyond the limit of my code, and as it looks like it's not coming from my usage of the lib (see code above), I'm kind of blind trying to find a solution. Do you have a Win32 platform for testing your builds at libwebsockets? Did you run the ab benchmark on Win32 or Linux?

I'm setting up a VBox + Linux testing environment to follow your guidance and tell you if I find something special.

lws-team commented 8 years ago

I mean that you are able to otherwise use the library on win32 after the fixes from a couple of weeks ago.

I only build-test it on windows via Appveyor, I don't have any windows boxes here.

I ran ab on Linux, because that's what I develop on. There are no valgrind errors or heap corruption (glibc, who provides the malloc, also looks for these).

If you had this problem on Linux, I would really expect valgrind to find and report the root cause, so combined with your ability to reproduce it, I'm hoping wine + valgrind can do the same trick.

lws-team commented 8 years ago

Has anything happened about this in the meanwhile?

lws-team commented 8 years ago

Well, if there's any new info, please reopen

alexgille commented 8 years ago

Sorry for the late reply, I've been in a rush for weeks. I will not be able to do the wine + valgrind trick on a win32 VBox for the moment as I'm short of time. I think libwebsocket is a promissing library that deserve a full Windows support in the medium term.

lws-team commented 8 years ago

Mmm well, the reason it's like it is for windows is basically driveby level of interest on that platform.