Hmm... so the client connection started on tsi 0 was bound to a different tsi?
If I understand it correctly, insisting that a client wsi started from one tsi must get bound to that tsi seems to be necessary.
A similar problem was solved with h2 stream connections: logical child stream wsis on a single h2 network connection parent must all be on the same tsi / pt.
It can probably figure out which tsi the client connect is being called from automatically by comparing the pthread tid, and force the wsi to be created on the same one as the caller. I'll take a look at it later.
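For anyone reading along, here is a minimal sketch of that idea; the per-thread table and the pt[], count_threads and service_tid names are invented purely for illustration and are not lws's actual internals:

```c
#include <pthread.h>

/*
 * Hedged sketch, not lws internals: a tiny per-service-thread table so the
 * caller's tsi can be found by comparing pthread ids, as described above.
 */
struct pt_sketch {
	pthread_t service_tid;		/* thread id of this service thread */
};

struct context_sketch {
	struct pt_sketch pt[8];		/* one entry per service thread */
	int count_threads;
};

/* Return the tsi whose service thread is the calling thread, else 0. */
static int
tsi_of_caller(const struct context_sketch *context)
{
	int n;

	for (n = 0; n < context->count_threads; n++)
		if (pthread_equal(pthread_self(), context->pt[n].service_tid))
			return n;

	return 0;	/* caller is not a service thread: default to tsi 0 */
}
```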
Can timers be scheduled for particular threads? If timers are always invoked from tsi 0, then outbound connections would only occur on tsi 0. Thanks for looking.
I pushed a patch on master doing what I mentioned yesterday: it tries to determine the tsi of the thread creating the client wsi and binds it to that transparently.
[vhost + protocol] timers don't have any natural binding to a particular thread; vhosts aren't serviced the way wsi are, bound to specific pts. What problem is showing up, different pts treading on each other servicing these? What about only allowing tsi 0 to deal with them?
I am currently traveling. I pulled the latest tonight and will test it out tomorrow while traveling. Thanks
I might not be fully following, but if tsi 0 only deals with timers, doesn't that mean all outbound connections will be on tsi/thread 0 as well? For any application with multiple outbound connections, they would all be on the same thread. Is that correct? My thought process was that if the app layer could schedule timers for a particular thread, then outbound connections could be spread out.
For the sake of people finding this in google: when we say "timers" it usually means hrtimers, which are per-wsi. Their events naturally happen inside the pt / tsi thread context the wsi belongs to. So hrtimers have no special affinity and work on whatever service thread the wsi is bound to... the timer "belongs to the wsi" and acts normally with regard to that wsi's service thread.
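As an illustration (the callback name and the 3-second interval below are arbitrary, not from this issue), arming such a per-wsi hrtimer with lws_set_timer_usecs() delivers LWS_CALLBACK_TIMER in that wsi's own pt thread context:

```c
#include <libwebsockets.h>

/*
 * Minimal sketch of a per-wsi hrtimer: the LWS_CALLBACK_TIMER event arrives
 * on whatever service thread (pt / tsi) this wsi is bound to, so no extra
 * affinity handling is needed for these.
 */
static int
callback_hrtimer_demo(struct lws *wsi, enum lws_callback_reasons reason,
		      void *user, void *in, size_t len)
{
	switch (reason) {
	case LWS_CALLBACK_CLIENT_ESTABLISHED:
		/* arm an hrtimer on this wsi for ~3s from now (usecs) */
		lws_set_timer_usecs(wsi, 3 * 1000000);
		break;

	case LWS_CALLBACK_TIMER:
		/* runs in the wsi's own pt thread context */
		lws_callback_on_writable(wsi);
		break;

	default:
		break;
	}

	return 0;
}
```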
Vhosts are a different animal: they are a logical grouping of wsi that might be being serviced by any service thread. A wsi gets bound to a specific service thread and everything that happens to it happens in that thread context, but a vhost isn't like that, and consequently the vhost timers are not like that either; they don't have an association with a particular pt.
There's no way atm to target a specific pt with that callback either... as you point out, the pts fight over it at the moment, so we have to do something about it. Ideally user code shouldn't have to get involved in making choices about which tsi.
> all outbound connections will be on tsi/thread 0 as well?
If you mean you are only creating client connections from the vhost / protocol timer callback, then yes, if we only let that come on tsi 0 it will end up like that.
However that's not the only use for vhost / protocol timers... I am not sure that specific problem ought to be solved by putting a round-robin on the vhost / protocol timer pt choice; that won't be effective at sharing out the client wsi pt affinities, because by the time the timer fires, the tsi you chose to target may no longer be a good choice for balancing client connections. Since the vhosts are logically independent of the service threads, teaching them to care about particular threads is probably the wrong direction. So solving it by having only tsi 0 service them sounds like it should be part of the right way to me atm. At the least that will solve the thread safety issue there.
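To make the case being discussed concrete, here is a hedged sketch of that usage (the address, port and timing are placeholders, not values from this issue): a vhost + protocol timer armed with lws_timed_callback_vh_protocol() fires as LWS_CALLBACK_USER, and the handler starts a client connection, so whichever pt services the timer is also where lws_client_connect_via_info() gets called.

```c
#include <string.h>
#include <libwebsockets.h>

/*
 * Hedged sketch of the pattern under discussion, not code from this issue:
 * a vhost+protocol timer is armed at protocol init, fires as
 * LWS_CALLBACK_USER, and the handler starts an outgoing client connection.
 * Address, port and the 1s delay are placeholders.
 */
static int
callback_vh_timer_demo(struct lws *wsi, enum lws_callback_reasons reason,
		       void *user, void *in, size_t len)
{
	struct lws_client_connect_info i;

	switch (reason) {
	case LWS_CALLBACK_PROTOCOL_INIT:
		/* ask for LWS_CALLBACK_USER on this vhost + protocol in 1s */
		lws_timed_callback_vh_protocol(lws_get_vhost(wsi),
					       lws_get_protocol(wsi),
					       LWS_CALLBACK_USER, 1);
		break;

	case LWS_CALLBACK_USER:
		memset(&i, 0, sizeof(i));
		i.context  = lws_get_context(wsi);
		i.address  = "127.0.0.1";	/* placeholder */
		i.port     = 7681;		/* placeholder */
		i.path     = "/";
		i.host     = i.address;
		i.origin   = i.address;
		i.protocol = lws_get_protocol(wsi)->name;

		/*
		 * Whichever pt delivered this timer callback is the thread
		 * this runs on, so the new client wsi starts out bound to
		 * that tsi.
		 */
		if (!lws_client_connect_via_info(&i))
			lwsl_err("%s: client connect failed\n", __func__);
		break;

	default:
		break;
	}

	return 0;
}
```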
The actual problem - the only reason you care about that - is at the client connection tsi binding. With incoming server connections there is a step, after the socket is accepted, where it finds the most idle (by connection count) tsi and binds the connection to that. It feels like doing that step with client connections as well is what's needed. The recent patch binding the client connection to the caller's tsi is still needed to get it started... while the caller is still in lws_client_connect_via_info(), the client connection must remain bound to the caller's tsi so events happen to it sequentially. Actually, until the dns lookup is done the wsi is not really registered with a particular pt; that registration happens when the socket fd is inserted into the pt fds tables here:
https://libwebsockets.org/git/libwebsockets/tree/lib/roles/http/client/client-handshake.c#n354-383
It's just tricky to make sure that, if we return to the caller at that point, there is some network event on the new tsi to trigger it to pick up the flow there.
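For illustration, a minimal sketch of that extra "find the most idle tsi" step; the structs and the fds_count field are invented stand-ins for whatever lws actually tracks per service thread, not its real internals:

```c
/*
 * Hedged sketch of "bind to the most idle tsi": pick the service thread
 * currently servicing the fewest fds. Names here are illustrative only.
 */
struct pt_load_sketch {
	int fds_count;		/* how many fds this pt is servicing */
};

struct context_load_sketch {
	struct pt_load_sketch pt[8];	/* one entry per service thread */
	int count_threads;
};

/* Return the tsi currently servicing the fewest fds. */
static int
least_loaded_tsi(const struct context_load_sketch *context)
{
	int n, best = 0;

	for (n = 1; n < context->count_threads; n++)
		if (context->pt[n].fds_count < context->pt[best].fds_count)
			best = n;

	return best;
}
```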
I pulled the latest and got a similar crash, but I did confirm (and included) your new logline.
=================================================================
==30899==ERROR: AddressSanitizer: heap-use-after-free on address 0x61400008fce0 at pc 0x7ffff587276a bp 0x7fffec7faca0 sp 0x7fffec7fac98
READ of size 8 at 0x61400008fce0 thread T66
#0 0x7ffff5872769 in __lws_change_pollfd /home/blah/repos/babelfish/libwebsockets/lib/core/pollfd.c:384
#1 0x7ffff5872b62 in lws_change_pollfd /home/blah/repos/babelfish/libwebsockets/lib/core/pollfd.c:415
#2 0x7ffff58a45d0 in lws_client_connect_2 /home/blah/repos/babelfish/libwebsockets/lib/roles/http/client/client-handshake.c:444
#3 0x7ffff58a60d4 in lws_http_client_connect_via_info2 /home/blah/repos/babelfish/libwebsockets/lib/roles/http/client/client-handshake.c:916
#4 0x7ffff587e771 in lws_header_table_attach /home/blah/repos/babelfish/libwebsockets/lib/roles/http/server/parsers.c:262
#5 0x7ffff588739d in rops_client_bind_h1 /home/blah/repos/babelfish/libwebsockets/lib/roles/h1/ops-h1.c:758
#6 0x7ffff589c8ae in lws_client_connect_via_info /home/blah/repos/babelfish/libwebsockets/lib/core/connect.c:218
#7 0x414a21 in InternalConnect /home/blah/repos/debesys/WebSocket.h:1315
#8 0x414a21 in ServiceWebSocketClient /home/blah/repos/debesys/WebSocket.h:1414
#9 0x7ffff5877541 in lws_service_periodic_checks /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:785
#10 0x7ffff5877a42 in lws_service_fd_tsi /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:844
#11 0x7ffff58ba08e in _lws_plat_service_tsi /home/blah/repos/babelfish/libwebsockets/lib/plat/unix/unix-service.c:143
#12 0x7ffff587845d in lws_service_tsi /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:997
#13 0x40e7fd in operator() /home/blah/repos/debesys/WebSocket.h:151
#14 0x40e7fd in _M_invoke<> /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/functional:1700
#15 0x40e7fd in operator() /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/functional:1688
#16 0x40e7fd in _M_run /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/thread:115
#17 0x7ffff5399a9f in execute_native_thread_routine ../../../../../gcc-4.9.1/libstdc++-v3/src/c++11/thread.cc:84
0x61400008fce0 is located 160 bytes inside of 448-byte region [0x61400008fc40,0x61400008fe00)
freed by thread T62 here:
#0 0x7ffff6d4f5c7 in __interceptor_free ../../../../gcc-4.9.1/libsanitizer/asan/asan_malloc_linux.cc:62
#1 0x7ffff585409d in _realloc /home/blah/repos/babelfish/libwebsockets/lib/core/alloc.c:68
#2 0x7ffff58540d0 in lws_realloc /home/blah/repos/babelfish/libwebsockets/lib/core/alloc.c:77
#3 0x7ffff585f795 in __lws_free_wsi /home/blah/repos/babelfish/libwebsockets/lib/core/libwebsockets.c:201
#4 0x7ffff586379d in __lws_close_free_wsi_final /home/blah/repos/babelfish/libwebsockets/lib/core/libwebsockets.c:952
#5 0x7ffff58634d6 in __lws_close_free_wsi /home/blah/repos/babelfish/libwebsockets/lib/core/libwebsockets.c:915
#6 0x7ffff5863871 in lws_close_free_wsi /home/blah/repos/babelfish/libwebsockets/lib/core/libwebsockets.c:962
#7 0x7ffff5878020 in lws_service_fd_tsi /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:914
#8 0x7ffff58ba24d in _lws_plat_service_tsi /home/blah/repos/babelfish/libwebsockets/lib/plat/unix/unix-service.c:168
#9 0x7ffff587845d in lws_service_tsi /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:997
#10 0x40e7fd in operator() /home/blah/repos/debesys/WebSocket.h:151
#11 0x40e7fd in _M_invoke<> /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/functional:1700
#12 0x40e7fd in operator() /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/functional:1688
#13 0x40e7fd in _M_run /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/thread:115
#14 0x7ffff5399a9f in execute_native_thread_routine ../../../../../gcc-4.9.1/libstdc++-v3/src/c++11/thread.cc:84
previously allocated by thread T66 here:
#0 0x7ffff6d4fa96 in __interceptor_realloc ../../../../gcc-4.9.1/libsanitizer/asan/asan_malloc_linux.cc:93
#1 0x7ffff5854088 in _realloc /home/blah/repos/babelfish/libwebsockets/lib/core/alloc.c:64
#2 0x7ffff58540fb in lws_zalloc /home/blah/repos/babelfish/libwebsockets/lib/core/alloc.c:82
#3 0x7ffff589b688 in lws_client_connect_via_info /home/blah/repos/babelfish/libwebsockets/lib/core/connect.c:66
#4 0x414a21 in InternalConnect /home/blah/repos/debesys/WebSocket.h:1315
#5 0x414a21 in ServiceWebSocketClient /home/blah/repos/debesys/WebSocket.h:1414
#6 0x7ffff5877541 in lws_service_periodic_checks /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:785
#7 0x7ffff5877a42 in lws_service_fd_tsi /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:844
#8 0x7ffff58ba08e in _lws_plat_service_tsi /home/blah/repos/babelfish/libwebsockets/lib/plat/unix/unix-service.c:143
#9 0x7ffff587845d in lws_service_tsi /home/blah/repos/babelfish/libwebsockets/lib/core/service.c:997
#10 0x40e7fd in operator() /home/blah/repos/debesys/WebSocket.h:151
#11 0x40e7fd in _M_invoke<> /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/functional:1700
#12 0x40e7fd in operator() /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/functional:1688
#13 0x40e7fd in _M_run /home/blah/repos/debesys/ext/linux/x86-64/release/opt/gcc-4.9.1/include/c++/4.9.1/thread:115
#14 0x7ffff5399a9f in execute_native_thread_routine ../../../../../gcc-4.9.1/libstdc++-v3/src/c++11/thread.cc:84
lws_client_connect_via_info is invoked:
level=8 msg=lws_client_connect_via_info: client binds to caller tsi 0
...
level=256 msg=lws_client_connect_2: 0x61400008fc40
level=16 msg=_realloc: size 10: strdup
level=8 msg=lws_client_connect_2: 0x61400008fc40: address 127.0.0.1
level=16 msg=lwsi_set_state(0x61400008fc40, 0x10000201)
level=16 msg=__insert_wsi_socket_into_fds: 0x61400008fc40: tsi=0, sock=15, pos-in-fds=1
...
level=256 msg=nonblocking connect retry (errno = 115)
< Context Switch >
lws_service_tsi is invoked and ends up in __lws_free_wsi:
level=16 msg=Session Socket 0x61400008fc40 (fd=15) dead
level=16 msg=0x61400008fc40: Close and handled
level=8 msg=__lws_close_free_wsi: 0x61400008fc40: caller: close_and_handled
< Context Switch >
__lws_change_pollfd is invoked with bad data.

Here are the basic logs, merged:
2018-07-30 20:23:08.011491 | 066 | level=16 msg=__insert_wsi_socket_into_fds: 0x61400008fc40: tsi=0, sock=15, pos-in-fds=1
2018-07-30 20:23:08.011547 | 066 | level=16 msg=_realloc: size 24: ftp
2018-07-30 20:23:08.011682 | 062 | level=16 msg=_lws_change_pollfd: wsi 0x61400008fc40: fd 15 events 1 -> 1
2018-07-30 20:23:08.011755 | 066 | level=16 msg=__lws_set_timeout: 0x61400008fc40: 20 secs
2018-07-30 20:23:08.011788 | 062 | level=16 msg=Session Socket 0x61400008fc40 (fd=15) dead
2018-07-30 20:23:08.011802 | 062 | level=16 msg=0x61400008fc40: Close and handled
2018-07-30 20:23:08.011824 | 066 | level=256 msg=nonblocking connect retry (errno = 115)
2018-07-30 20:23:08.011873 | 062 | level=8 msg=__lws_close_free_wsi: 0x61400008fc40: caller: close_and_handled
I pushed a patch on master that binds the vhost timer service to the tsi that asked for it with the call to lws_timed_callback_vh_protocol(), along with some other necessary locking fixes.
While looking into issue https://github.com/warmcat/libwebsockets/issues/1349 for more details and attempting to reproduce it, I encountered this issue, which might be related. It appears that there is a race condition with SMP and a failed connection startup. We are invoking lws_client_connect_via_info on the lws thread via LWS_CALLBACK_USER, scheduled with lws_timed_callback_vh_protocol from the app thread. It looks like during a failed startup, the wsi thread can come around and clean up while startup is still going.

On LWS thread 1
On LWS thread 2
On LWS thread 1
Here was the ASAN report. Note that the allocation and the reference to the freed pointer are from the same thread (T3); T1 got a time slice, saw the socket as closed, and performed the cleanup.