Closed IvanKuchin closed 7 years ago
OK. I appreciate the reproducer.
I will have a look at this over the weekend.
It is surprising wrt Autobahn. When I did the last fix I updated the logs on libwebsockets.org from autobahn, they are all fine.
https://libwebsockets.org/clients/index.html
Autobahn strictly uses echo semantics though. And it always takes data as fast as it provides it. So there are probably significant gaps between what that synthetic test actually stresses and what might have trouble hiding in the real world.
I started looking at this and noticed that the fixes on the server side rx state machine did not get added to the client side one... since the communication on the one connection stalls it can be either side responsible.
I pushed a patch aligning the client rx state machine with the server one on v2.1-stable, v2.2-stable and master, can you see if this makes any difference before I go any further into it?
This probably sheds light on why Autobahn is happy as well, we only test the server side of lws on it.
Will run tests tomorrow
I got the stable-2.2.0 and it still stuck.
I was thinking how to makes reproducing more easy than described in my first message. I ended up with far more easy procedure based on investigation of my team. We tested a lot of files and found a pattern , if take max compressed archive with lot of repeatable patterns in archived files, it brings the issue on the spot.
Send "killer string" to socket_status:
test-server ----------- Killer string 1:
socket_status.send("{ \"RequestType\":\"SendImage\", \"RequestID\":\"283463389\", \"toType\":\"toUser\", \"toID\":\"1036\", \"fileType\":\"image/jpeg\", \"blob\":\"\"}")
----------- Killer string 2:
socket_status.send("{ \"RequestType\":\"SendImage\", \"RequestID\":\"788346414\", \"toType\":\"toUser\", \"toID\":\"1036\", \"fileType\":\"image/jpeg\", \"blob\":\"\"}")
As soon as "killer string" sent to server it stuck and not doing anything anymore. Behavior very similar to what we get in production:
lwsts[30119]: pmd test: RX len 41
jj1Brv6KAPPf+FQwX76jdeJvFWp6xe39gNNW5ljgi
lwsts[30119]: pmd test: RX len 127
aGAMWG0RIqk7jnLA10Wg+F73RZ7SWfxVqmoJa2IsvJn8tY5CGBEpVFA3gDbxgYzxmugooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiii
C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C
- the only way to kill server
`ikuchin@dev:~$ killall libwebsockets-test-server`
**test-fraggle**
Test with huge input form client to server also enforce test-fraggile to stop handling any packets/requests.
May I ask you couple questions about socket operations ? (I know this is not the right place for teaching, but give me a chance...)
We look for a way sending files over WebSocket. We can't use the XMLHttpRequest, because of the state-machine and proprietary-protocol used for message exchange.
Also, I tried to send non-Unicode symbols (for ex: russian alphabet) over ws and some symbols get broken. Therefore we encode files to base64 and sending them.
From WS architecture point: what is the right approach to send files over ws ? Is it possible to send binary/non-utf8 data to WebSocket ?
accordingly to callgrind all the costs after stuck associated with lib/service.c:lws_service_flag_pending
ikuchin@dev:/usr/src/libwebsockets-2.2.0-20170325/build_debug/bin$ callgrind_annotate callgrind.out.1562
--------------------------------------------------------------------------------
Profile data file 'callgrind.out.1562' (creator: callgrind-3.11.0)
--------------------------------------------------------------------------------
I1 cache:
D1 cache:
LL cache:
Timerange: Basic block 0 - 472799998
Trigger: Program termination
Profiled target: ./libwebsockets-test-server (PID 1562, part 1)
Events recorded: Ir
Events shown: Ir
Event sort order: Ir
Thresholds: 99
Include dirs:
User annotated:
Auto-annotation: off
--------------------------------------------------------------------------------
Ir
--------------------------------------------------------------------------------
10,401,599,950 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir file:function
--------------------------------------------------------------------------------
10,401,599,950 /usr/src/libwebsockets-2.2.0-20170325/lib/service.c:lws_service_flag_pending [/usr/src/libwebsockets-2.2.0-20170325/build_debug/bin/libwebsockets-test-server]
Ok... I will try it in the morning.
Ws defines two message types, TEXT and BINARY. TEXT must carry UTF-8. BINARY has no restriction... this is the meaning of the flag you give lws_write().
It should be fixed... I added all three "strings of death" to the test HTML and it can pass them OK.
Patch added on v2.0-stable. v2.1-stable, v2.2-stable and master.
Thank you for fixing.
I run huge-file-testing on test-fraggle ..... as soon as WebSocket client-initiated-closed, libwebsocket-test-fraggle drops "segmentation fault".
Does it looks like an issue ? I can share core-dump.
*** Error in `./libwebsockets-test-fraggle': munmap_chunk(): invalid pointer: 0x0000000000b5e0e0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fd5c167e7e5]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x1a8)[0x7fd5c168aae8]
./libwebsockets-test-fraggle[0x414520]
./libwebsockets-test-fraggle[0x41454e]
./libwebsockets-test-fraggle[0x406b19]
./libwebsockets-test-fraggle[0x407e18]
./libwebsockets-test-fraggle[0x407d4a]
./libwebsockets-test-fraggle(lws_service_fd_tsi+0xd1b)[0x40c404]
./libwebsockets-test-fraggle(_lws_plat_service_tsi+0x39c)[0x41ed83]
./libwebsockets-test-fraggle(lws_plat_service+0x25)[0x41ee29]
./libwebsockets-test-fraggle(lws_service+0x20)[0x40c482]
./libwebsockets-test-fraggle[0x406964]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7fd5c1627830]
./libwebsockets-test-fraggle(_start+0x29)[0x405d19]
======= Memory map: ========
00400000-00438000 r-xp 00000000 fc:00 927450 /home/ikuchin/src/libwebsockets-2.2.0/libwebsockets/build_debug/bin/libwebsockets-test-fraggle
00637000-00638000 r--p 00037000 fc:00 927450 /home/ikuchin/src/libwebsockets-2.2.0/libwebsockets/build_debug/bin/libwebsockets-test-fraggle
00638000-00639000 rw-p 00038000 fc:00 927450 /home/ikuchin/src/libwebsockets-2.2.0/libwebsockets/build_debug/bin/libwebsockets-test-fraggle
00b5d000-00b8c000 rw-p 00000000 00:00 0 [heap]
7fd5c11ed000-7fd5c1203000 r-xp 00000000 fc:00 783667 /lib/x86_64-linux-gnu/libgcc_s.so.1
7fd5c1203000-7fd5c1402000 ---p 00016000 fc:00 783667 /lib/x86_64-linux-gnu/libgcc_s.so.1
7fd5c1402000-7fd5c1403000 rw-p 00015000 fc:00 783667 /lib/x86_64-linux-gnu/libgcc_s.so.1
7fd5c1403000-7fd5c1406000 r-xp 00000000 fc:00 783666 /lib/x86_64-linux-gnu/libdl-2.23.so
7fd5c1406000-7fd5c1605000 ---p 00003000 fc:00 783666 /lib/x86_64-linux-gnu/libdl-2.23.so
7fd5c1605000-7fd5c1606000 r--p 00002000 fc:00 783666 /lib/x86_64-linux-gnu/libdl-2.23.so
7fd5c1606000-7fd5c1607000 rw-p 00003000 fc:00 783666 /lib/x86_64-linux-gnu/libdl-2.23.so
7fd5c1607000-7fd5c17c6000 r-xp 00000000 fc:00 783670 /lib/x86_64-linux-gnu/libc-2.23.so
7fd5c17c6000-7fd5c19c6000 ---p 001bf000 fc:00 783670 /lib/x86_64-linux-gnu/libc-2.23.so
7fd5c19c6000-7fd5c19ca000 r--p 001bf000 fc:00 783670 /lib/x86_64-linux-gnu/libc-2.23.so
7fd5c19ca000-7fd5c19cc000 rw-p 001c3000 fc:00 783670 /lib/x86_64-linux-gnu/libc-2.23.so
7fd5c19cc000-7fd5c19d0000 rw-p 00000000 00:00 0
7fd5c19d0000-7fd5c19e8000 r-xp 00000000 fc:00 783682 /lib/x86_64-linux-gnu/libpthread-2.23.so
7fd5c19e8000-7fd5c1be7000 ---p 00018000 fc:00 783682 /lib/x86_64-linux-gnu/libpthread-2.23.so
7fd5c1be7000-7fd5c1be8000 r--p 00017000 fc:00 783682 /lib/x86_64-linux-gnu/libpthread-2.23.so
7fd5c1be8000-7fd5c1be9000 rw-p 00018000 fc:00 783682 /lib/x86_64-linux-gnu/libpthread-2.23.so
7fd5c1be9000-7fd5c1bed000 rw-p 00000000 00:00 0
7fd5c1bed000-7fd5c1e07000 r-xp 00000000 fc:00 784742 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
7fd5c1e07000-7fd5c2006000 ---p 0021a000 fc:00 784742 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
7fd5c2006000-7fd5c2022000 r--p 00219000 fc:00 784742 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
7fd5c2022000-7fd5c202e000 rw-p 00235000 fc:00 784742 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
7fd5c202e000-7fd5c2031000 rw-p 00000000 00:00 0
7fd5c2031000-7fd5c208f000 r-xp 00000000 fc:00 784746 /lib/x86_64-linux-gnu/libssl.so.1.0.0
7fd5c208f000-7fd5c228f000 ---p 0005e000 fc:00 784746 /lib/x86_64-linux-gnu/libssl.so.1.0.0
7fd5c228f000-7fd5c2293000 r--p 0005e000 fc:00 784746 /lib/x86_64-linux-gnu/libssl.so.1.0.0
7fd5c2293000-7fd5c229a000 rw-p 00062000 fc:00 784746 /lib/x86_64-linux-gnu/libssl.so.1.0.0
7fd5c229a000-7fd5c22b3000 r-xp 00000000 fc:00 783603 /lib/x86_64-linux-gnu/libz.so.1.2.8
7fd5c22b3000-7fd5c24b2000 ---p 00019000 fc:00 783603 /lib/x86_64-linux-gnu/libz.so.1.2.8
7fd5c24b2000-7fd5c24b3000 r--p 00018000 fc:00 783603 /lib/x86_64-linux-gnu/libz.so.1.2.8
7fd5c24b3000-7fd5c24b4000 rw-p 00019000 fc:00 783603 /lib/x86_64-linux-gnu/libz.so.1.2.8
7fd5c24b4000-7fd5c24da000 r-xp 00000000 fc:00 783373 /lib/x86_64-linux-gnu/ld-2.23.so
7fd5c26c7000-7fd5c26cc000 rw-p 00000000 00:00 0
7fd5c26d6000-7fd5c26d9000 rw-p 00000000 00:00 0
7fd5c26d9000-7fd5c26da000 r--p 00025000 fc:00 783373 /lib/x86_64-linux-gnu/ld-2.23.so
7fd5c26da000-7fd5c26db000 rw-p 00026000 fc:00 783373 /lib/x86_64-linux-gnu/ld-2.23.so
7fd5c26db000-7fd5c26dc000 rw-p 00000000 00:00 0
7fffacff4000-7fffad015000 rw-p 00000000 00:00 0 [stack]
7fffad050000-7fffad052000 r--p 00000000 00:00 0 [vvar]
7fffad052000-7fffad054000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Aborted (core dumped)
Yes it's an 'issue' one way or another, either from fraggle or lws. But your backtrace is not useful.
Can you do cmake .. -DCMAKE_BUILD_TYPE=DEBUG
and re-make / retry running under valgrind?
Testing the test server under valgrind here, simply using the 'pmd test' button to send the strings-of-death is completely clean. So it needs something more.
Our app-server can't start with the same stack. Before last change - works perfectly,
I diff'ed my chat-server(based on lws-server-pthread 2.0.x) server and the one from 2.2.0 distributive - they are the same except
1) dumb_increment_protocol frame size (my - 32, yours - 10)
2) my is disabled http_protocol (from enum demo_protocols
and static struct lws_protocols protocols
)
I tried to build lws-server-pthread w/o http and it works.
Here is the valgrind and gdb traceback's. Hope it help. If you wish I can grant you the access to my server. You'll take whatever you need.
--------- valgrind (look into lws_service_fd_tsi : service.c:728)
ikuchin@dev:~/src/dev.connme.ru$ valgrind ./chat-server -d65535
==27279== Memcheck, a memory error detector
==27279== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==27279== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==27279== Command: ./chat-server -d65535
==27279==
lwsts[27279]: libwebsockets test server pthreads - license LGPL2.1+SLE
lwsts[27279]: (C) Copyright 2010-2016 Andy Green <andy@warmcat.com>
Using resource path "/usr/local/share/libwebsockets-test-server"
lwsts[27279]: Initial logging level 65535
lwsts[27279]: Libwebsockets version: 2.2.0 ikuchin@dev.connme.ru-v2.0.0-339-g5dd1b41
lwsts[27279]: IPV6 not compiled in
lwsts[27279]: libev support not compiled in
lwsts[27279]: libuv support not compiled in
lwsts[27279]: LWS_DEF_HEADER_LEN : 4096
lwsts[27279]: LWS_MAX_PROTOCOLS : 5
lwsts[27279]: LWS_MAX_SMP : 32
lwsts[27279]: SPEC_LATEST_SUPPORTED : 13
lwsts[27279]: sizeof (*info) : 392
lwsts[27279]: SYSTEM_RANDOM_FILEPATH: '/dev/urandom'
lwsts[27279]: default timeout (secs): 20
lwsts[27279]: Threads: 1 each 1024 fds
lwsts[27279]: mem: context: 9272 bytes (5176 ctx + (1 thr x 4096))
lwsts[27279]: mem: http hdr rsvd: 30848 bytes (1 thr x (4096 + 3616) x 4))
lwsts[27279]: mem: pollfd map: 8192
lwsts[27279]: mem: platform fd map: 8192 bytes
lwsts[27279]: Compiled with OpenSSL support
lwsts[27279]: SSL disabled: no LWS_SERVER_OPTION_DO_SSL_GLOBAL_INIT
lwsts[27279]: Creating Vhost 'default' port 7681, 3 protocols, IPv6 off
lwsts[27279]: insert_wsi_socket_into_fds: 0xc2be220: tsi=0, sock=8, pos-in-fds=1
lwsts[27279]: Listening on port 7681
lwsts[27279]: LWS_MAX_EXTENSIONS_ACTIVE: 2
lwsts[27279]: mem: per-conn: 568 bytes + protocol rx buf
lwsts[27279]: canonical_hostname = dev.connme.ru
lwsts[27279]: pthread_create: 0 created
lwsts[27279]: lws_protocol_init
==27279== Thread 3:
==27279== Jump to the invalid address stated on the next line
==27279== at 0x328: ???
==27279== by 0x587A134: lws_service_fd_tsi (service.c:728)
==27279== by 0x588DC69: _lws_plat_service_tsi (lws-plat-unix.c:156)
==27279== by 0x587AEAF: lws_service_tsi (service.c:1285)
==27279== by 0x413582: thread_service(void*) (chat-server.cpp:202)
==27279== by 0x56576B9: start_thread (pthread_create.c:333)
==27279== by 0x63B682C: clone (clone.S:109)
==27279== Address 0x328 is not stack'd, malloc'd or (recently) free'd
==27279==
==27279==
==27279== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==27279== Bad permissions for mapped region at address 0x328
==27279== at 0x328: ???
==27279== by 0x587A134: lws_service_fd_tsi (service.c:728)
==27279== by 0x588DC69: _lws_plat_service_tsi (lws-plat-unix.c:156)
==27279== by 0x587AEAF: lws_service_tsi (service.c:1285)
==27279== by 0x413582: thread_service(void*) (chat-server.cpp:202)
==27279== by 0x56576B9: start_thread (pthread_create.c:333)
==27279== by 0x63B682C: clone (clone.S:109)
==27279==
==27279== HEAP SUMMARY:
==27279== in use at exit: 176,919 bytes in 48 blocks
==27279== total heap usage: 245 allocs, 197 frees, 542,065 bytes allocated
==27279==
==27279== LEAK SUMMARY:
==27279== definitely lost: 0 bytes in 0 blocks
==27279== indirectly lost: 0 bytes in 0 blocks
==27279== possibly lost: 37,358 bytes in 33 blocks
==27279== still reachable: 139,561 bytes in 15 blocks
==27279== suppressed: 0 bytes in 0 blocks
==27279== Rerun with --leak-check=full to see details of leaked memory
==27279==
==27279== For counts of detected and suppressed errors, rerun with: -v
==27279== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Killed
--------- gdb (look into lws_protocol_init)
ikuchin@dev:~/src/dev.connme.ru$ gdb ./chat-server ./core
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
...skip...
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./chat-server...done.
[New LWP 27286]
[New LWP 27285]
[New LWP 27284]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./chat-server -d65535'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000000000328 in ?? ()
[Current thread is 1 (Thread 0x7ff27f8fb700 (LWP 27286))]
(gdb) where
#0 0x0000000000000328 in ?? ()
#1 0x00007ff286ac3558 in lws_protocol_init (context=0x1d8a110)
at /usr/src/libwebsockets-2.2.0-20170326/lib/context.c:198
#2 0x00007ff286abc135 in lws_service_fd_tsi (context=0x1d8a110, pollfd=0x0, tsi=0)
at /usr/src/libwebsockets-2.2.0-20170326/lib/service.c:728
#3 0x00007ff286acfc6a in _lws_plat_service_tsi (context=0x1d8a110, timeout_ms=50, tsi=0)
at /usr/src/libwebsockets-2.2.0-20170326/lib/lws-plat-unix.c:156
#4 0x00007ff286abceb0 in lws_service_tsi (context=0x1d8a110, timeout_ms=50, tsi=0)
at /usr/src/libwebsockets-2.2.0-20170326/lib/service.c:1285
#5 0x0000000000413583 in thread_service (threadid=0x0) at chat-server.cpp:202
#6 0x00007ff286cf06ba in start_thread (arg=0x7ff27f8fb700) at pthread_create.c:333
#7 0x00007ff285fe382d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) quit
Hmmm I sincerely doubt this is anything to do with the last patch from me. At that point there are no incoming connections negotiated or pmd. You can try leave everything the same and revert the patch. But why aren't the lws test apps affected by something so basic?
I can't help you debug your app, that part is your problem. Busted things in lws are arguably my problem and I can usually help with those. However when I see pthreads in the bt I wonder if you are racing something critical during init. If it's v2.2-stable then this seems somehow related to your struct lws_protocols. But I think whatever's up with that is in your app.
Got you, understand. Do you want me to close the case or I can take the debug from server-fraggle ?
A usable bt / valgrind on that, and tell me which lws version it was using, would be good thanks. It looks like it's freeing an invalid pointer, knowing where the free() is in the code might be enough to know where to look.
2.2-stable Fraggle-server under valgrind control is going to be very slowly and not able to work appropriate time. I waited about hour and fraggle took about 2 mins to receive new chunk. Here is GDB bt/bt full/list and local variables. Let me know if this is not enough , I'll work on more attempts with valgrind. My assumption: lib/alloc.c -> _realloc(0)
Spamming session over, len = 196227. sum = 0x17DFC62
Spamming 419 random fragments
Terminating...
[2017/03/26 22:08:41:2592] NOTICE: lws_context_destroy: ctx 0x2582b50
*** Error in `./libwebsockets-test-fraggle': free(): invalid next size (normal): 0x00000000025810d0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fe76bef17e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x7fe0a)[0x7fe76bef9e0a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fe76befd98c]
./libwebsockets-test-fraggle[0x4128d7]
./libwebsockets-test-fraggle[0x412905]
./libwebsockets-test-fraggle(lws_context_destroy+0x3ec)[0x41259b]
./libwebsockets-test-fraggle[0x4050da]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7fe76be9a830]
./libwebsockets-test-fraggle(_start+0x29)[0x404449]
======= Memory map: ========
00400000-00432000 r-xp 00000000 fc:00 653035 /home/ikuchin/111111111/libwebsockets-test-fraggle
00631000-00632000 r--p 00031000 fc:00 653035 /home/ikuchin/111111111/libwebsockets-test-fraggle
00632000-00633000 rw-p 00032000 fc:00 653035 /home/ikuchin/111111111/libwebsockets-test-fraggle
02581000-025b0000 rw-p 00000000 00:00 0 [heap]
7fe764000000-7fe764021000 rw-p 00000000 00:00 0
7fe764021000-7fe768000000 ---p 00000000 00:00 0
7fe76bc64000-7fe76bc7a000 r-xp 00000000 fc:00 783667 /lib/x86_64-linux-gnu/libgcc_s.so.1
7fe76bc7a000-7fe76be79000 ---p 00016000 fc:00 783667 /lib/x86_64-linux-gnu/libgcc_s.so.1
7fe76be79000-7fe76be7a000 rw-p 00015000 fc:00 783667 /lib/x86_64-linux-gnu/libgcc_s.so.1
7fe76be7a000-7fe76c039000 r-xp 00000000 fc:00 783670 /lib/x86_64-linux-gnu/libc-2.23.so
7fe76c039000-7fe76c239000 ---p 001bf000 fc:00 783670 /lib/x86_64-linux-gnu/libc-2.23.so
7fe76c239000-7fe76c23d000 r--p 001bf000 fc:00 783670 /lib/x86_64-linux-gnu/libc-2.23.so
7fe76c23d000-7fe76c23f000 rw-p 001c3000 fc:00 783670 /lib/x86_64-linux-gnu/libc-2.23.so
7fe76c23f000-7fe76c243000 rw-p 00000000 00:00 0
7fe76c243000-7fe76c25b000 r-xp 00000000 fc:00 783682 /lib/x86_64-linux-gnu/libpthread-2.23.so
7fe76c25b000-7fe76c45a000 ---p 00018000 fc:00 783682 /lib/x86_64-linux-gnu/libpthread-2.23.so
7fe76c45a000-7fe76c45b000 r--p 00017000 fc:00 783682 /lib/x86_64-linux-gnu/libpthread-2.23.so
7fe76c45b000-7fe76c45c000 rw-p 00018000 fc:00 783682 /lib/x86_64-linux-gnu/libpthread-2.23.so
7fe76c45c000-7fe76c460000 rw-p 00000000 00:00 0
7fe76c460000-7fe76c479000 r-xp 00000000 fc:00 783603 /lib/x86_64-linux-gnu/libz.so.1.2.8
7fe76c479000-7fe76c678000 ---p 00019000 fc:00 783603 /lib/x86_64-linux-gnu/libz.so.1.2.8
7fe76c678000-7fe76c679000 r--p 00018000 fc:00 783603 /lib/x86_64-linux-gnu/libz.so.1.2.8
7fe76c679000-7fe76c67a000 rw-p 00019000 fc:00 783603 /lib/x86_64-linux-gnu/libz.so.1.2.8
7fe76c67a000-7fe76c6a0000 r-xp 00000000 fc:00 783373 /lib/x86_64-linux-gnu/ld-2.23.so
7fe76c88e000-7fe76c892000 rw-p 00000000 00:00 0
7fe76c89c000-7fe76c89f000 rw-p 00000000 00:00 0
7fe76c89f000-7fe76c8a0000 r--p 00025000 fc:00 783373 /lib/x86_64-linux-gnu/ld-2.23.so
7fe76c8a0000-7fe76c8a1000 rw-p 00026000 fc:00 783373 /lib/x86_64-linux-gnu/ld-2.23.so
7fe76c8a1000-7fe76c8a2000 rw-p 00000000 00:00 0
7fff2bcb5000-7fff2bcd6000 rw-p 00000000 00:00 0 [stack]
7fff2bdbf000-7fff2bdc1000 r--p 00000000 00:00 0 [vvar]
7fff2bdc1000-7fff2bdc3000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Aborted (core dumped)
#0 0x00007fe76beaf428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007fe76beaf428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007fe76beb102a in __GI_abort () at abort.c:89
#2 0x00007fe76bef17ea in __libc_message (do_abort=do_abort@entry=2,
fmt=fmt@entry=0x7fe76c00a2e0 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
#3 0x00007fe76bef9e0a in malloc_printerr (ar_ptr=<optimized out>, ptr=<optimized out>,
str=0x7fe76c00a438 "free(): invalid next size (normal)", action=3) at malloc.c:5004
#4 _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:3865
#5 0x00007fe76befd98c in __GI___libc_free (mem=<optimized out>) at malloc.c:2966
#6 0x00000000004128d7 in _realloc (ptr=0x25810d0, size=0)
at /usr/src/libwebsockets-2.2.0-20170326/lib/alloc.c:60
#7 0x0000000000412905 in lws_realloc (ptr=0x25810d0, size=0)
at /usr/src/libwebsockets-2.2.0-20170326/lib/alloc.c:68
#8 0x000000000041259b in lws_context_destroy (context=0x2582b50)
at /usr/src/libwebsockets-2.2.0-20170326/lib/context.c:1037
#9 0x00000000004050da in main (argc=1, argv=0x7fff2bcd3758)
at /usr/src/libwebsockets-2.2.0-20170326/test-server/test-fraggle.c:380
51 static void *_realloc(void *ptr, size_t size)
52 {
53 if (size)
54 #if defined(LWS_PLAT_OPTEE)
55 return (void *)TEE_Realloc(ptr, size);
56 #else
57 return (void *)realloc(ptr, size);
58 #endif
59 else if (ptr)
60 free(ptr);
61 return NULL;
62 }
(gdb) bt full
#0 0x00007fe76beaf428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
resultvar = 0
pid = 13091
selftid = 13091
#1 0x00007fe76beb102a in __GI_abort () at abort.c:89
save_stage = 2
act = {__sigaction_handler = {sa_handler = 0x3030316364623266, sa_sigaction = 0x3030316364623266},
sa_mask = {__val = {7075830554540059952, 8223625903103763300, 3472328295963457581,
4192904167887482928, 2314885531086893104, 2314885530818453536, 2314885530818453536,
8528445641706184736, 7378697234196427620, 3919933115663279718, 7378634724483280944,
7378697629483820646, 8223625903103619126, 3472328295963457581, 4192904167887482928,
2314885531086893104}}, sa_flags = 538976288, sa_restorer = 0x68}
sigs = {__val = {32, 0 <repeats 15 times>}}
#2 0x00007fe76bef17ea in __libc_message (do_abort=do_abort@entry=2,
fmt=fmt@entry=0x7fe76c00a2e0 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
ap = <error reading variable ap (Attempt to dereference a generic pointer.)>
fd = 6
on_2 = <optimized out>
list = <optimized out>
nlist = <optimized out>
cp = <optimized out>
written = <optimized out>
#3 0x00007fe76bef9e0a in malloc_printerr (ar_ptr=<optimized out>, ptr=<optimized out>,
str=0x7fe76c00a438 "free(): invalid next size (normal)", action=3) at malloc.c:5004
buf = "00000000025810d0"
cp = <optimized out>
ar_ptr = <optimized out>
str = 0x7fe76c00a438 "free(): invalid next size (normal)"
action = 3
#4 _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:3865
size = <optimized out>
fb = <optimized out>
nextchunk = <optimized out>
nextsize = <optimized out>
nextinuse = <optimized out>
prevsize = <optimized out>
bck = <optimized out>
fwd = <optimized out>
errstr = <optimized out>
locked = <optimized out>
#5 0x00007fe76befd98c in __GI___libc_free (mem=<optimized out>) at malloc.c:2966
ar_ptr = <optimized out>
p = <optimized out>
hook = <optimized out>
#6 0x00000000004128d7 in _realloc (ptr=0x25810d0, size=0)
at /usr/src/libwebsockets-2.2.0-20170326/lib/alloc.c:60
No locals.
#7 0x0000000000412905 in lws_realloc (ptr=0x25810d0, size=0)
at /usr/src/libwebsockets-2.2.0-20170326/lib/alloc.c:68
No locals.
#8 0x000000000041259b in lws_context_destroy (context=0x2582b50)
at /usr/src/libwebsockets-2.2.0-20170326/lib/context.c:1037
protocol = 0x632378 <protocols+56>
pt = 0x2582c30
vh = 0x0
wsi = {u = {http = {ah = 0x0, ah_wait_list = 0x0, preamble_rx = 0x0, stash = 0x0, preamble_rx_len = 0,
new_wsi_list = 0x0, filepos = 0, filelen = 0, fop_fd = 0x0, range = {start = 0, end = 0,
extent = 0, agg = 0, budget = 0, buf = '\000' <repeats 127 times>, pos = 0,
state = LWSRS_NO_ACTIVE_RANGE, start_valid = 0 '\000', end_valid = 0 '\000', ctr = 0 '\000',
count_ranges = 0 '\000', did_try = 0 '\000', inside = 0 '\000', send_ctr = 0 '\000'},
multipart_content_type = '\000' <repeats 63 times>, request_version = HTTP_VERSION_1_0,
connection_type = HTTP_CONNECTION_CLOSE, content_length = 0, content_remain = 0}, hdr = {ah = 0x0,
ah_wait_list = 0x0, preamble_rx = 0x0, stash = 0x0, preamble_rx_len = 0, ups = URIPS_IDLE,
ues = URIES_IDLE, lextable_pos = 0, current_token_limit = 0, esc_stash = 0 '\000',
post_literal_equal = 0 '\000', parser_state = 0 '\000'}, ws = {hdr = {ah = 0x0,
ah_wait_list = 0x0, preamble_rx = 0x0, stash = 0x0, preamble_rx_len = 0, ups = URIPS_IDLE,
ues = URIES_IDLE, lextable_pos = 0, current_token_limit = 0, esc_stash = 0 '\000',
post_literal_equal = 0 '\000', parser_state = 0 '\000'}, rx_ubuf = 0x0, rx_ubuf_alloc = 0,
rx_draining_ext_list = 0x0, tx_draining_ext_list = 0x0, time_next_ping_check = 0,
rx_packet_length = 0, rx_ubuf_head = 0, mask = "\000\000\000",
ping_payload_buf = '\000' <repeats 140 times>, ping_payload_len = 0 '\000', mask_idx = 0 '\000',
opcode = 0 '\000', rsv = 0 '\000', rsv_first_msg = 0 '\000', close_in_ping_buffer_len = 0 '\000',
utf8 = 0 '\000', stashed_write_type = 0 '\000', tx_draining_stashed_wp = 0 '\000', final = 0,
frame_is_binary = 0, all_zero_nonce = 0, this_frame_masked = 0, inside_frame = 0,
clean_buffer = 0, payload_is_close = 0, ping_pending_flag = 0, continuation_possible = 0,
owed_a_fin = 0, check_utf8 = 0, defeat_check_utf8 = 0, pmce_compressed_message = 0,
stashed_write_pending = 0, rx_draining_ext = 0, tx_draining_ext = 0, send_check_ping = 0}},
pending_timeout_limit = 0, context = 0x2582b50, vhost = 0x258f750, parent = 0x0, child_list = 0x0,
sibling_list = 0x0, protocol = 0x632340 <protocols>, same_vh_protocol_prev = 0x0,
same_vh_protocol_next = 0x0, timeout_list = 0x0, timeout_list_prev = 0x0, user_space = 0x0,
rxflow_buffer = 0x0, trunc_alloc = 0x0, active_extensions = {0x0, 0x0}, act_ext_user = {0x0, 0x0},
desc = {sockfd = 0, filefd = 0}, position_in_fds_table = 0, rxflow_len = 0, rxflow_pos = 0,
trunc_alloc_len = 0, trunc_offset = 0, trunc_len = 0, chunk_remaining = 0, cache_secs = 0,
hdr_parsing_completed = 0, http2_substream = 0, listener = 0, user_space_externally_allocated = 0,
socket_is_permanently_unusable = 0, rxflow_change_to = 0, more_rx_waiting = 0, conn_stat_done = 0,
---Type <return> to continue, or q <return> to quit---
cache_reuse = 0, cache_revalidate = 0, cache_intermediaries = 0, favoured_pollin = 0,
sending_chunked = 0, already_did_cce = 0, told_user_closed = 0, do_ws = 0, chunked = 0,
client_rx_avail = 0, client_http_body_pending = 0, extension_data_pending = 0, c_port = 0,
count_act_ext = 0 '\000', ietf_spec_revision = 0 '\000', mode = 0 '\000', state = 0 '\000',
state_pre_close = 0 '\000', lws_rx_parse_state = 0 '\000', rx_frame_type = 0 '\000',
pending_timeout = 0 '\000', pps = 0 '\000', tsi = 0 '\000', protocol_interpret_idx = 0 '\000',
redirects = 0 '\000', chunk_parser = 0 '\000', reason_bf = 0 '\000'}
n = 0
m = -1
__func__ = "lws_context_destroy"
#9 0x00000000004050da in main (argc=1, argv=0x7fff2bcd3758)
at /usr/src/libwebsockets-2.2.0-20170326/test-server/test-fraggle.c:380
n = 0
port = 7681
use_ssl = 0
context = 0x2582b50
opts = 0
interface_name = '\000' <repeats 127 times>
ads_port = ".N=\366\000\000\000\000\261$@\000\000\000\000\000\377\377\377\377", '\000' <repeats 12 times>, "\020\244\350k\347\177\000\000\000\020\211l\347\177\000\000\210\032\212l\347\177\000\000\240\065\315+\377\177\000\000\000\000\000\000\000\000\000\000\220\065\315+\377\177\000\000&\260be\000\000\000\000GZ\000l\347\177\000\000\377\377\377\377\000\000\000\000\000\000\200\000\000\000\000\000\200\022\334+\377\177\000\000\000\027\212l\347\177\000\000\b\364El\347\177\000\000x!c\000\000\000\000\000 D@\000\000\000\000\000P7\315+\377\177", '\000' <repeats 26 times>, "\366\230hl\347\177\000\000\001", '\000' <repeats 17 times>...
iface = 0x0
wsi = 0x7fe76c245fc5
address = 0x0
server_port = 7681
info = {port = 7681, iface = 0x0, protocols = 0x632340 <protocols>, extensions = 0x427240 <exts>,
token_limits = 0x0, ssl_private_key_password = 0x0, ssl_cert_filepath = 0x0,
ssl_private_key_filepath = 0x0, ssl_ca_filepath = 0x0, ssl_cipher_list = 0x0,
http_proxy_address = 0x0, http_proxy_port = 0, gid = -1, uid = -1, options = 0, user = 0x0,
ka_time = 0, ka_probes = 0, ka_interval = 0, provided_client_ssl_ctx = 0x0, max_http_header_data = 0,
max_http_header_pool = 0, count_threads = 0, fd_limit_per_thread = 0, timeout_secs = 0,
ecdh_curve = 0x0, vhost_name = 0x0, plugin_dirs = 0x0, pvo = 0x0, keepalive_timeout = 0,
log_filepath = 0x0, mounts = 0x0, server_string = 0x0, pt_serv_buf_size = 0,
max_http_header_data2 = 0, ssl_options_set = 0, ssl_options_clear = 0, ws_ping_pong_interval = 0,
headers = 0x0, reject_service_keywords = 0x0, external_baggage_free_on_destroy = 0x0, fops = 0x0,
simultaneous_ssl_restriction = 0, _unused = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}
Could you advice how to correctly remove lws instance, if CMAKE_INSTALL_PREFIX:PATH=/usr/local
?
I can do it manually (install_manifest.txt), but I'm not sure that I didn't miss anything (like ld.so.cache update).
One of my thoughts is that some marks from prev installation still alive.
About remove after make install
$ sudo rm -rf `cat build/install_manifest.txt`
will remove everything (check in the txt file first to see what it will remove, it should all be reasonable, not, eg, / ^^). Other general gotchas
your shell may remember it found the file on, eg, /usr/local/bin before and cache that; you can defeat it by running eg, /usr/bin/xxx...
explicitly.
run sudo ldconfig
to update the system's understanding of what libs are found where
The bt says we're dying in context destroy, but the line no for v2.2-stable "lib/context.c:1037" is the end of that function... so no clue which free. It's dying trying to free() 0x25810d0 but there's no clue in the dumps where it got that pointer from.
If I want to reproduce this myself, what exactly do I have to do?
I didn't test the master branch with d58353f-commit.
Reproducing steps: attached array elements in random order with pacing 0.2Hz have to be sent to fraggle-protocol-server about 10-15 mins, then stop, you'll get the crash.
My script provides with simple statistics:
requestsSentCounter (img1 size: 1976597): 53
requestsSentCounter (img2 size: 1267600): 78
requestsSentCounter (img3 size: 18022): 25
requestsSentCounter (img4 size: 24630): 32
responceReceivedCounter: 3187
Do you need it ? Otherwise I can test new patches on your behalf.
About debugging.
line 1037
(gdb) l 1037
1032
1033 lws_libev_destroyloop(context, n);
1034 lws_libuv_destroyloop(context, n);
1035 lws_libevent_destroyloop(context, n);
1036
1037 lws_free_set_NULL(context->pt[n].serv_buf);
1038 if (pt->ah_pool)
1039 lws_free(pt->ah_pool);
1040 if (pt->http_header_data)
1041 lws_free(pt->http_header_data);
This is the macro running lws_realloc(xxx, 0) ./private-libwebsockets.h:#define lws_free_set_NULL(P) do { lws_realloc(P, 0); (P) = NULL; } while(0)
, all that we need to find is the value of context->pt[n].serv_buf
.
1) n = 0 (from bt full) 2) context is the pointer to struct XXXX (you know better than me) 3) we need pt[n-th] element (actually pt[0]) 4) find value of serv_buf-member. (we know that final P=0x25810d0)
Finally context->pt[n].serv_buf = ptr 0x25810d0
GDB can easily provides with memory dump (below is just an example from another core-dump, I lost the original):
(gdb) x /10xw 0x2298c30
0x2298c30: 0x00000000 0x00000000 0x00000000 0x00000000
0x2298c40: 0xffffffff 0x00000000 0x00000000 0x00000000
0x2298c50: 0x00000000 0x00000000
Hm... it's not like I don't know how to use gdb ^^
It's that the source debug line info is misleading if you don't tell me the exact version. Going by your filepaths before
/usr/src/libwebsockets-2.2.0-20170326/lib/context.c:1037
V2.2-stable takes me to the wrong place, the end of the function.
https://github.com/warmcat/libwebsockets/blob/v2.2-stable/lib/context.c#L1037
Now you mentioned you switched to master I can look in the right place.
Anyway this is some form of heap corruption, it's not complaining about the content of serv_buf, it doesn't care. It's complaining that the malloc housekeeping data held in a linked-list just before each allocation is invalid somewhere by that point. The problem happened earlier and valgrind is very good at spotting out-of-bounds writes.
If you disable extensions in cmake, ie, no pmd, still get the problem?
my fault. I forgot to check-out v2.2-stable , but the folder name brings the confusion. Previous test had run on master-branch.
Now I rerun test on v2.2-stable
I pushed a small patch that may help (or may change nothing) on v2.0-stable, v2.1-stable. v2.2-stable and master.
ext-disabled - no crash
With "small patch" integrated-in and ext-enabled - no crash.
Yay... is that it then?
seems you are right long story done. you are amazing !
Hello lws-team,
Nice to meet you again ! This time I actually can't complain on lws-library, due to issue reproducing with no error's in log. But let me rise my concern, hope you'll give me direction where should I dig in.
Our project having very intensive exchange over WebSockets in peak hours. After working few hours it comes to stuck state and can't send/receive/close connection.
Tested libs: 1) 2.1.0 with fix from my previous issue 2) 2.2.0 with fix from my previous issue
------ debug1: last lines in debug (pay attention to draining 1 and final 0) - > stuck
-------------- DEBUG2: last lines in debug (pay attention to draining 1 and final 1) -> stuck
It is easily reproduces on production environment by waiting several hours, but the repro on test-server end up with different log. To reproduce issue on test-fraggle, I took 2 big images (nothing special in there: https://github.com/IvanKuchin/lws-quality-assurance/tree/master/image_example), convert them to blob and sending to test-fraggle. I can't paste source of html here because of the volume, you can get it https://github.com/IvanKuchin/lws-quality-assurance/tree/master/src. ---------------------- Repro step
I looked through autobahn tests and they sending even bigger files over WebSocket (20 MB). This is looks strange to me.