Thread race between qdr_delivery_anycast_propagate_CT and qdr_connection_process

kgiusti commented 1 year ago

60: WARNING: ThreadSanitizer: data race (pid=5272)
60:   Read of size 1 at 0x7b4c000a0c50 by thread T4:
60:     #0 qdr_connection_process /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_core/connections.c:428 (skrouterd+0x4d439c)
60:     #1 AMQP_conn_wake_handler /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_node.c:356 (skrouterd+0x51c767)
60:     #2 writable_handler.constprop.0 /home/runner/work/skupper-router/skupper-router/skupper-router/src/container.c:387 (skrouterd+0x5218bf)
60:     #3 qd_conn_event_batch_complete /home/runner/work/skupper-router/skupper-router/skupper-router/src/container.c:477 (skrouterd+0x48d4cb)
60:     #4 thread_run /home/runner/work/skupper-router/skupper-router/skupper-router/src/server.c:1159 (skrouterd+0x52c936)
60:     #5 _thread_init /home/runner/work/skupper-router/skupper-router/skupper-router/src/posix/threading.c:206 (skrouterd+0x4a88bd)
60: 
60:   Previous write of size 1 at 0x7b4c000a0c50 by thread T2:
60:     #0 qdr_delivery_anycast_propagate_CT /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_core/delivery.c:851 (skrouterd+0x4e723e)
60:     #1 qdr_update_delivery_CT /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_core/delivery.c:741 (skrouterd+0x4e723e)
60:     #2 router_core_thread /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_core/router_core_thread.c:255 (skrouterd+0x4f7c80)
60:     #3 _thread_init /home/runner/work/skupper-router/skupper-router/skupper-router/src/posix/threading.c:206 (skrouterd+0x4a88bd)
60: 
60:   Location is heap block of size 448 at 0x7b4c000a0ac0 allocated by thread T4:
60:     #0 posix_memalign <null> (libtsan.so.2+0x3f988)
60:     #1 qd_alloc /home/runner/work/skupper-router/skupper-router/skupper-router/src/alloc_pool.c:344 (skrouterd+0x470f9a)
60:     #2 new_qdr_delivery_t /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_core/delivery.c:24 (skrouterd+0x502ad3)
60:     #3 qdr_link_deliver_to /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_core/transfer.c:80 (skrouterd+0x502ad3)
60:     #4 AMQP_rx_handler /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_node.c:904 (skrouterd+0x51db3a)
60:     #5 do_receive /home/runner/work/skupper-router/skupper-router/skupper-router/src/container.c:219 (skrouterd+0x48ea85)
60:     #6 qd_container_handle_event /home/runner/work/skupper-router/skupper-router/skupper-router/src/container.c:730 (skrouterd+0x48ea85)
60:     #7 handle /home/runner/work/skupper-router/skupper-router/skupper-router/src/server.c:1105 (skrouterd+0x52aef3)
60:     #8 thread_run /home/runner/work/skupper-router/skupper-router/skupper-router/src/server.c:1143 (skrouterd+0x52c865)
60:     #9 _thread_init /home/runner/work/skupper-router/skupper-router/skupper-router/src/posix/threading.c:206 (skrouterd+0x4a88bd)
60: 
60:   Thread T4 'wrkr_1' (tid=5293, running) created by main thread at:
60:     #0 pthread_create <null> (libtsan.so.2+0x5f0e6)
60:     #1 sys_thread /home/runner/work/skupper-router/skupper-router/skupper-router/src/posix/threading.c:228 (skrouterd+0x4ac71a)
60:     #2 qd_server_run /home/runner/work/skupper-router/skupper-router/skupper-router/src/server.c:1514 (skrouterd+0x52d189)
60:     #3 main_process /home/runner/work/skupper-router/skupper-router/skupper-router/router/src/main.c:111 (skrouterd+0x42fb00)
60:     #4 main /home/runner/work/skupper-router/skupper-router/skupper-router/router/src/main.c:365 (skrouterd+0x429c8e)
60: 
60:   Thread T2 'core_thread' (tid=5284, running) created by main thread at:
60:     #0 pthread_create <null> (libtsan.so.2+0x5f0e6)
60:     #1 sys_thread /home/runner/work/skupper-router/skupper-router/skupper-router/src/posix/threading.c:228 (skrouterd+0x4ac71a)
60:     #2 qdr_core /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_core/router_core.c:125 (skrouterd+0x4fc0cc)
60:     #3 qd_router_setup_late /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_node.c:2321 (skrouterd+0x52d778)
60:     #4 ffi_call_unix64 <null> (libffi.so.8+0x7a05)
60:     #5 main_process /home/runner/work/skupper-router/skupper-router/skupper-router/router/src/main.c:101 (skrouterd+0x42faa2)
60:     #6 main /home/runner/work/skupper-router/skupper-router/skupper-router/router/src/main.c:365 (skrouterd+0x429c8e)
60: 
60: SUMMARY: ThreadSanitizer: data race /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_core/connections.c:428 in qdr_connection_process

kgiusti commented 1 year ago

@ted-ross @ganeshmurthy

I think this is relatively benign. The race is caused by the core thread setting the settled flag in a downstream delivery at the same moment the I/O thread is checking that delivery settled flag.

I think this is benign because the core will send a delivery update event to the I/O thread which will cause it to re-read the flag. Since "settled" is a latch (never goes from set to unset) this guarantees the downstream I/O thread will (eventually) see the flag's new state.

This code is old as the hills, so I'm at a loss as to explain why TSAN has started to complain about it now. It may be that some recent change removed a (totally unrelated) mutex that cause TSAN to see a flush. Dunno.

I'm thinking we suppress this. What do you all think?

kgiusti commented 1 year ago

Update: I can repro this race locally by running the system_tests_http1_over_tcp.Http1OverTcpEdge2EdgeTest in a loop.

The fickle finger 'o git-bisect points to the culprit as this commit

That commit does re-arrange the codepath that exhibits this race, but I don't see it actually adding a race. I think that moving the code around a probably the reason TSAN has started complaining.

Still OK with suppression.

ganeshmurthy commented 1 year ago

because the core will send a delivery update event to the I/O thread which will cause it to re-read the flag

Yes. Say qdr_delivery_anycast_propagate_CT does indeed update the settled=true right after the I/O thread calls delivery_update_handler with settled=false, the core thread will eventually wake the connection and the I/O thread will again loop thru the updated_deliveries and call the delivery_update_handler with the correct value for settled (true), so this should be fine.

I am ok with this suppression.

kgiusti commented 1 year ago

Arg - sorry I failed to notice the real reason this race is now showing up.

We changed the name of the function.

The old function name is used in the suppression file! We've already suppressed this error awhile back. My bad.

skupperproject / skupper-router

Thread race between qdr_delivery_anycast_propagate_CT and qdr_connection_process #1199