warmcat / libwebsockets

canonical libwebsockets.org networking library
https://libwebsockets.org
Other
4.78k stars 1.49k forks source link

SMP and lws_client_connect_via_info on lws thread #1351

Closed jameschicca closed 5 years ago

jameschicca commented 6 years ago

While looking into issue https://github.com/warmcat/libwebsockets/issues/1349 for more details and attempting to reproduce I encountered this issue which might be related. It appears that there is a race condition with SMP and a failed connection startup. We are invoking lws_client_connect_via_info on the lws thread via LWS_CALLBACK_USER with lws_timed_callback_vh_protocol from the app thread. It looks like during a failed startup, the wsi thread can come around and cleanup while startup is still going.

On LWS thread 1

On LWN thread 2

On LWS thread 1

Here was the ASAN report. Note that allocation and reference of deleted pointer is from same thread (T3). T1 got a time slice and saw the socket as closed and performed cleanup.

=================================================================
==18732==ERROR: AddressSanitizer: heap-use-after-free on address 0x61400001fcb8 at pc 0x7ffff5861442 bp 0x7fffed7fccd0 sp 0x7fffed7fccc8
READ of size 8 at 0x61400001fcb8 thread T3
    #0 0x7ffff5861441 in __lws_set_timeout /home/blah/repos/babelfish/libwebsockets/lib/core/libwebsockets.c:430
    #1 0x7ffff5861782 in lws_set_timeout /home/blah/repos/babelfish/libwebsockets/lib/core/libwebsockets.c:462
    #2 0x7ffff58a4817 in lws_client_connect_2 /home/blah/repos/babelfish/libwebsockets/lib/roles/http/client/client-handshake.c:398
    #3 0x7ffff58a6677 in lws_http_client_connect_via_info2 /home/blah/repos/babelfish/libwebsockets/lib/roles/http/client/client-handshake.c:916
    #4 0x7ffff587efe6 in lws_header_table_attach /home/blah/repos/babelfish/libwebsockets/lib/roles/http/server/parsers.c:262
    #5 0x7ffff5887bea in rops_client_bind_h1 /home/blah/repos/babelfish/libwebsockets/lib/roles/h1/ops-h1.c:758
    #6 0x7ffff589ce65 in lws_client_connect_via_info /home/blah/repos/babelfish/libwebsockets/lib/core/connect.c:188
    #7 0x414a21 in InternalConnect /home/blah/repos/debesys/WebSocket.h:1315
    #8 0x414a21 in ServiceWebSocketClient /home/blah/repos/debesys/WebSocket.h:1414
    #9 0x7ffff5877d52 in lws_service_periodic_checks /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:759
    #10 0x7ffff58782fd in lws_service_fd_tsi /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:823
    #11 0x7ffff58ba761 in _lws_plat_service_tsi /home/blah/repos/babelfish/libwebsockets/lib/plat/unix/unix-service.c:169
    #12 0x7ffff5878d0e in lws_service_tsi /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:976
    #13 0x40e7fd in operator() /home/blah/repos/debesys/WebSocket.h:151
    #14 0x40e7fd in _M_invoke<> /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/functional:1700
    #15 0x40e7fd in operator() /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/functional:1688
    #16 0x40e7fd in _M_run /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/thread:115
    #17 0x7ffff539aa9f in execute_native_thread_routine ../../../../../gcc-4.9.1/libstdc++-v3/src/c++11/thread.cc:84

0x61400001fcb8 is located 120 bytes inside of 448-byte region [0x61400001fc40,0x61400001fe00)
freed by thread T1 here:
    #0 0x7ffff6d4f5c7 in __interceptor_free ../../../../gcc-4.9.1/libsanitizer/asan/asan_malloc_linux.cc:62
    #1 0x7ffff5854eed in _realloc /home/blah/repos/babelfish/libwebsockets/lib/core/alloc.c:68
    #2 0x7ffff5854f20 in lws_realloc /home/blah/repos/babelfish/libwebsockets/lib/core/alloc.c:77
    #3 0x7ffff586045f in __lws_free_wsi /home/blah/repos/babelfish/libwebsockets/lib/core/libwebsockets.c:201
    #4 0x7ffff5864435 in __lws_close_free_wsi_final /home/blah/repos/babelfish/libwebsockets/lib/core/libwebsockets.c:952
    #5 0x7ffff586416e in __lws_close_free_wsi /home/blah/repos/babelfish/libwebsockets/lib/core/libwebsockets.c:915
    #6 0x7ffff58644ff in lws_close_free_wsi /home/blah/repos/babelfish/libwebsockets/lib/core/libwebsockets.c:962
    #7 0x7ffff58788db in lws_service_fd_tsi /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:893
    #8 0x7ffff58ba761 in _lws_plat_service_tsi /home/blah/repos/babelfish/libwebsockets/lib/plat/unix/unix-service.c:169
    #9 0x7ffff5878d0e in lws_service_tsi /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:976
    #10 0x40e7fd in operator() /home/blah/repos/debesys/WebSocket.h:151
    #11 0x40e7fd in _M_invoke<> /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/functional:1700
    #12 0x40e7fd in operator() /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/functional:1688
    #13 0x40e7fd in _M_run /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/thread:115
    #14 0x7ffff539aa9f in execute_native_thread_routine ../../../../../gcc-4.9.1/libstdc++-v3/src/c++11/thread.cc:84

previously allocated by thread T3 here:
    #0 0x7ffff6d4fa96 in __interceptor_realloc ../../../../gcc-4.9.1/libsanitizer/asan/asan_malloc_linux.cc:93
    #1 0x7ffff5854ed8 in _realloc /home/blah/repos/babelfish/libwebsockets/lib/core/alloc.c:64
    #2 0x7ffff5854f4b in lws_zalloc /home/blah/repos/babelfish/libwebsockets/lib/core/alloc.c:82
    #3 0x7ffff589bec1 in lws_client_connect_via_info /home/blah/repos/babelfish/libwebsockets/lib/core/connect.c:63
    #4 0x414a21 in InternalConnect /home/blah/repos/debesys/WebSocket.h:1315
    #5 0x414a21 in ServiceWebSocketClient /home/blah/repos/debesys/WebSocket.h:1414
    #6 0x7ffff5877d52 in lws_service_periodic_checks /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:759
    #7 0x7ffff58782fd in lws_service_fd_tsi /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:823
    #8 0x7ffff58ba761 in _lws_plat_service_tsi /home/blah/repos/babelfish/libwebsockets/lib/plat/unix/unix-service.c:169
    #9 0x7ffff5878d0e in lws_service_tsi /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:976
    #10 0x40e7fd in operator() /home/blah/repos/debesys/WebSocket.h:151
    #11 0x40e7fd in _M_invoke<> /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/functional:1700
    #12 0x40e7fd in operator() /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/functional:1688
    #13 0x40e7fd in _M_run /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/thread:115
    #14 0x7ffff539aa9f in execute_native_thread_routine ../../../../../gcc-4.9.1/libstdc++-v3/src/c++11/thread.cc:84

Thread T3 created by T0 here:
    #0 0x7ffff6d1e84a in __interceptor_pthread_create ../../../../gcc-4.9.1/libsanitizer/asan/asan_interceptors.cc:183
    #1 0x7ffff539abb0 in __gthread_create /home/jshaw/dev/debesys2/ext/src/cpp/gcc-build/build/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu/bits/gthr-default.h:662
    #2 0x7ffff539abb0 in std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) ../../../../../gcc-4.9.1/libstdc++-v3/src/c++11/thread.cc:142

Thread T1 created by T0 here:
    #0 0x7ffff6d1e84a in __interceptor_pthread_create ../../../../gcc-4.9.1/libsanitizer/asan/asan_interceptors.cc:183
    #1 0x7ffff539abb0 in __gthread_create /home/jshaw/dev/debesys2/ext/src/cpp/gcc-build/build/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu/bits/gthr-default.h:662
    #2 0x7ffff539abb0 in std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) ../../../../../gcc-4.9.1/libstdc++-v3/src/c++11/thread.cc:142

SUMMARY: AddressSanitizer: heap-use-after-free /home/blah/repos/babelfish/libwebsockets/lib/core/libwebsockets.c:430 __lws_set_timeout
Shadow bytes around the buggy address:
  0x0c287fffbf40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c287fffbf50: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c287fffbf60: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c287fffbf70: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c287fffbf80: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
=>0x0c287fffbf90: fd fd fd fd fd fd fd[fd]fd fd fd fd fd fd fd fd
  0x0c287fffbfa0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c287fffbfb0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c287fffbfc0: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
  0x0c287fffbfd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c287fffbfe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Heap right redzone:      fb
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack partial redzone:   f4
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Contiguous container OOB:fc
  ASan internal:           fe
==18732==ABORTING
lws-team commented 6 years ago

Hum... so the client connection started on tsi 0 was bound to a different tsi?

IIUI insisting that client wsi started from one tsi must get bound to that tsi seems to be necessary.

There was a similar problem solved with h2 stream connections (logical child stream wsis on a single h2 network connection parent) must all be on the same tsi / pt.

It can probably figure out what tsi the client connect is called from automatically by comparing the pthread tid, and force it to be created on the same as the caller. I'll take a look at it later.

jameschicca commented 6 years ago

Can timers be scheduled for particular threads? If timers are always invoked from tsi 0 then outbound connections would only occur on tsi 0 then. Thanks for looking.

lws-team commented 6 years ago

I pushed a patch on master trying to do what I mentioned yesterday, trying to determine the tsi of the thread creating the client wsi and binding it to that transparently.

[vhost + protocol] timers don't have any natural binding to a particular thread; vhosts don't get serviced like wsi do bound to specific pts. What problem is coming, different pts are treading on each other servicing these? What about only allow tsi 0 to deal with them?

jameschicca commented 6 years ago

I am currently traveling. I pulled the latest tonight and will test it out tomorrow while traveling. Thanks

I might not be fully following but if tsi 0 only deals with timers, doesn't that mean that all outbound connections will be on tsi/thread 0 as well? For any application with multiple outbound connections, they would all be on the same thread. Is that correct? My thought process was that if the app layer could schedule timers for a particular thread then outbound connections could be spread out.

lws-team commented 6 years ago

For the sake of people finding this in google, when we say "timers" it usually is going to mean hrtimers, these are per-wsi. Their events naturally happen inside the pt / tsi thread context the wsi belongs to. So hrtimers have no special affinity and work on any service thread the wsi is bound to... the timer "belongs to the wsi" and acts normally with regard to that wsi's service thread.

Vhosts are a different animal, they are a logical grouping of wsi that might be being serviced by any service thread. A wsi gets bound to a specific service thread and stuff happens to it in that thread context. But a vhost isn't like that and consequently the vhost timers are not like that either, they don't have an association with a particular pt.

There's no way atm to target a specific pt with that callback either... as you point out the pts fight over it at the moment so we have to do something about it. Ideally user code shouldn't have to get involved in making choices about which tsi.

all outbound connections will be on tsi/thread 0 as well?

If you mean you are only creating client connections on the vhost / protocol timer callback, yes if we only let that come on tsi 0 it will end up like that.

However that's not the only use for vhost / protocol timers... I am not sure that specific problem ought to get solved by attempting to put a round-robin on the vhost / protocol timer pt choice... that won't be effective at sharing out the client wsi pt affinities. By the time the timer fires, the choice you made for which tsi to target it at may no longer be good for balancing client connections. I think because the vhosts are logically independent of the service threads, it's probably the wrong way to try to start teaching them to care about them. So solving that with only tsi 0 servicing them sounds like it should be part of the right way to me atm. At the least that will solve the thread safety issue there.

The actual problem - the only reason you care about that - is at the client connection tsi binding. With server incoming connections there is a step after the socket is accepted where it finds the most idle (by connection count) tsi and binds it to that. It feels like doing that step also with client connections is what's needed. The recent patch binding the client connection to the caller tsi is still needed to get it started.... while the caller is still in lws_client_connect_via_info() the client connection must remain bound to the caller's tsi so events happen to it sequentially. Until the dns lookup is done, actually the wsi is not really registered with a particular pt, it's done by inserting the socket fd into the pt fds tables here

https://libwebsockets.org/git/libwebsockets/tree/lib/roles/http/client/client-handshake.c#n354-383

it's just tricky to make sure that if we return to the caller at that point, there is some network event on the new tsi to trigger it to pick up the flow on the new tsi.

jameschicca commented 6 years ago

I pulled latest and got a similar crash but I did confirm (and included) your new logline.

=================================================================
==30899==ERROR: AddressSanitizer: heap-use-after-free on address 0x61400008fce0 at pc 0x7ffff587276a bp 0x7fffec7faca0 sp 0x7fffec7fac98
READ of size 8 at 0x61400008fce0 thread T66
    #0 0x7ffff5872769 in __lws_change_pollfd /home/blah/repos/babelfish/libwebsockets/lib/core/pollfd.c:384
    #1 0x7ffff5872b62 in lws_change_pollfd /home/blah/repos/babelfish/libwebsockets/lib/core/pollfd.c:415
    #2 0x7ffff58a45d0 in lws_client_connect_2 /home/blah/repos/babelfish/libwebsockets/lib/roles/http/client/client-handshake.c:444
    #3 0x7ffff58a60d4 in lws_http_client_connect_via_info2 /home/blah/repos/babelfish/libwebsockets/lib/roles/http/client/client-handshake.c:916
    #4 0x7ffff587e771 in lws_header_table_attach /home/blah/repos/babelfish/libwebsockets/lib/roles/http/server/parsers.c:262
    #5 0x7ffff588739d in rops_client_bind_h1 /home/blah/repos/babelfish/libwebsockets/lib/roles/h1/ops-h1.c:758
    #6 0x7ffff589c8ae in lws_client_connect_via_info /home/blah/repos/babelfish/libwebsockets/lib/core/connect.c:218
    #7 0x414a21 in InternalConnect /home/blah/repos/debesys/WebSocket.h:1315
    #8 0x414a21 in ServiceWebSocketClient /home/blah/repos/debesys/WebSocket.h:1414
    #9 0x7ffff5877541 in lws_service_periodic_checks /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:785
    #10 0x7ffff5877a42 in lws_service_fd_tsi /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:844
    #11 0x7ffff58ba08e in _lws_plat_service_tsi /home/blah/repos/babelfish/libwebsockets/lib/plat/unix/unix-service.c:143
    #12 0x7ffff587845d in lws_service_tsi /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:997
    #13 0x40e7fd in operator() /home/blah/repos/debesys/WebSocket.h:151
    #14 0x40e7fd in _M_invoke<> /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/functional:1700
    #15 0x40e7fd in operator() /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/functional:1688
    #16 0x40e7fd in _M_run /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/thread:115
    #17 0x7ffff5399a9f in execute_native_thread_routine ../../../../../gcc-4.9.1/libstdc++-v3/src/c++11/thread.cc:84

0x61400008fce0 is located 160 bytes inside of 448-byte region [0x61400008fc40,0x61400008fe00)
freed by thread T62 here:
    #0 0x7ffff6d4f5c7 in __interceptor_free ../../../../gcc-4.9.1/libsanitizer/asan/asan_malloc_linux.cc:62
    #1 0x7ffff585409d in _realloc /home/blah/repos/babelfish/libwebsockets/lib/core/alloc.c:68
    #2 0x7ffff58540d0 in lws_realloc /home/blah/repos/babelfish/libwebsockets/lib/core/alloc.c:77
    #3 0x7ffff585f795 in __lws_free_wsi /home/blah/repos/babelfish/libwebsockets/lib/core/libwebsockets.c:201
    #4 0x7ffff586379d in __lws_close_free_wsi_final /home/blah/repos/babelfish/libwebsockets/lib/core/libwebsockets.c:952
    #5 0x7ffff58634d6 in __lws_close_free_wsi /home/blah/repos/babelfish/libwebsockets/lib/core/libwebsockets.c:915
    #6 0x7ffff5863871 in lws_close_free_wsi /home/blah/repos/babelfish/libwebsockets/lib/core/libwebsockets.c:962
    #7 0x7ffff5878020 in lws_service_fd_tsi /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:914
    #8 0x7ffff58ba24d in _lws_plat_service_tsi /home/blah/repos/babelfish/libwebsockets/lib/plat/unix/unix-service.c:168
    #9 0x7ffff587845d in lws_service_tsi /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:997
    #10 0x40e7fd in operator() /home/blah/repos/debesys/WebSocket.h:151
    #11 0x40e7fd in _M_invoke<> /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/functional:1700
    #12 0x40e7fd in operator() /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/functional:1688
    #13 0x40e7fd in _M_run /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/thread:115
    #14 0x7ffff5399a9f in execute_native_thread_routine ../../../../../gcc-4.9.1/libstdc++-v3/src/c++11/thread.cc:84

previously allocated by thread T66 here:
    #0 0x7ffff6d4fa96 in __interceptor_realloc ../../../../gcc-4.9.1/libsanitizer/asan/asan_malloc_linux.cc:93
    #1 0x7ffff5854088 in _realloc /home/blah/repos/babelfish/libwebsockets/lib/core/alloc.c:64
    #2 0x7ffff58540fb in lws_zalloc /home/blah/repos/babelfish/libwebsockets/lib/core/alloc.c:82
    #3 0x7ffff589b688 in lws_client_connect_via_info /home/blah/repos/babelfish/libwebsockets/lib/core/connect.c:66
    #4 0x414a21 in InternalConnect /home/blah/repos/debesys/WebSocket.h:1315
    #5 0x414a21 in ServiceWebSocketClient /home/blah/repos/debesys/WebSocket.h:1414
    #6 0x7ffff5877541 in lws_service_periodic_checks /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:785
    #7 0x7ffff5877a42 in lws_service_fd_tsi /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:844
    #8 0x7ffff58ba08e in _lws_plat_service_tsi /home/blah/repos/babelfish/libwebsockets/lib/plat/unix/unix-service.c:143
    #9 0x7ffff587845d in lws_service_tsi /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:997
    #10 0x40e7fd in operator() /home/blah/repos/debesys/WebSocket.h:151
    #11 0x40e7fd in _M_invoke<> /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/functional:1700
    #12 0x40e7fd in operator() /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/functional:1688
    #13 0x40e7fd in _M_run /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/thread:115
    #14 0x7ffff5399a9f in execute_native_thread_routine ../../../../../gcc-4.9.1/libstdc++-v3/src/c++11/thread.cc:84

Thread 66

Thread 62

Thread 66

Here are the basic logs merged

2018-07-30 20:23:08.011491 | 066 | level=16 msg=__insert_wsi_socket_into_fds: 0x61400008fc40: tsi=0, sock=15, pos-in-fds=1
2018-07-30 20:23:08.011547 | 066 | level=16 msg=_realloc: size 24: ftp
2018-07-30 20:23:08.011682 | 062 | level=16 msg=_lws_change_pollfd: wsi 0x61400008fc40: fd 15 events 1 -> 1
2018-07-30 20:23:08.011755 | 066 | level=16 msg=__lws_set_timeout: 0x61400008fc40: 20 secs
2018-07-30 20:23:08.011788 | 062 | level=16 msg=Session Socket 0x61400008fc40 (fd=15) dead
2018-07-30 20:23:08.011802 | 062 | level=16 msg=0x61400008fc40: Close and handled
2018-07-30 20:23:08.011824 | 066 | level=256 msg=nonblocking connect retry (errno = 115)
2018-07-30 20:23:08.011873 | 062 | level=8 msg=__lws_close_free_wsi: 0x61400008fc40: caller: close_and_handled
lws-team commented 5 years ago

I pushed a patch on master that binds the vhost timer service to the tsi that asked for it with the call to lws_timed_callback_vh_protocol(), along with some other necessary locking fixes.