Closed by kgiusti 1 year ago
@ted-ross @ganeshmurthy
I think this is relatively benign. The race is caused by the core thread setting the settled flag in a downstream delivery at the same moment the I/O thread is checking that delivery's settled flag.
I think this is benign because the core will send a delivery update event to the I/O thread which will cause it to re-read the flag. Since "settled" is a latch (never goes from set to unset) this guarantees the downstream I/O thread will (eventually) see the flag's new state.
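To make the argument above concrete, here is a minimal sketch of the "latch plus update event" pattern. It is not the router's code: `demo_delivery_t`, `demo_core_thread`, `demo_io_wait_and_read` and `demo_run` are hypothetical names, and the sketch uses a C11 atomic for the flag so that the example itself is race-free, whereas the report is about a plain flag being read without synchronization.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a delivery; NOT the router's real struct. */
typedef struct {
    atomic_bool     settled;          /* latch: only ever goes false -> true */
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             pending_updates;  /* "delivery update" events, guarded by lock */
} demo_delivery_t;

/* Plays the core thread: latch the flag, then post an update event. */
static void *demo_core_thread(void *arg)
{
    demo_delivery_t *dlv = arg;
    atomic_store(&dlv->settled, true);
    pthread_mutex_lock(&dlv->lock);
    dlv->pending_updates++;
    pthread_cond_signal(&dlv->cond);
    pthread_mutex_unlock(&dlv->lock);
    return NULL;
}

/* Plays the I/O thread: block until an update event arrives, then re-read. */
static bool demo_io_wait_and_read(demo_delivery_t *dlv)
{
    pthread_mutex_lock(&dlv->lock);
    while (dlv->pending_updates == 0)
        pthread_cond_wait(&dlv->cond, &dlv->lock);
    dlv->pending_updates--;
    pthread_mutex_unlock(&dlv->lock);
    return atomic_load(&dlv->settled);
}

/* Returns the value of the flag as seen after the update event. */
static bool demo_run(void)
{
    demo_delivery_t dlv;
    pthread_t core;

    atomic_init(&dlv.settled, false);
    pthread_mutex_init(&dlv.lock, NULL);
    pthread_cond_init(&dlv.cond, NULL);
    dlv.pending_updates = 0;

    int rc = pthread_create(&core, NULL, demo_core_thread, &dlv);
    assert(rc == 0);
    (void) rc;

    /* The racy early check: may observe either false or true. */
    bool early = atomic_load(&dlv.settled);
    (void) early;

    /* The re-read driven by the update event: the latch is already set. */
    bool late = demo_io_wait_and_read(&dlv);

    pthread_join(core, NULL);
    pthread_cond_destroy(&dlv.cond);
    pthread_mutex_destroy(&dlv.lock);
    return late;
}
```

Calling `demo_run()` from a `main()` (built with `-pthread`) should always return `true`: whatever the early check saw, the event posted after the store forces a second read, and a latch cannot go back to unset in between.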
This code is old as the hills, so I'm at a loss to explain why TSAN has started to complain about it now. It may be that some recent change removed a (totally unrelated) mutex that caused TSAN to see a flush. Dunno.
I'm thinking we suppress this. What do you all think?
Update: I can repro this race locally by running the system_tests_http1_over_tcp.Http1OverTcpEdge2EdgeTest in a loop.
The fickle finger o' git-bisect points to this commit as the culprit.
That commit does re-arrange the codepath that exhibits this race, but I don't see it actually adding a race. I think that moving the code around is probably the reason TSAN has started complaining.
Still OK with suppression.
because the core will send a delivery update event to the I/O thread which will cause it to re-read the flag
Yes. Say qdr_delivery_anycast_propagate_CT does indeed update settled=true right after the I/O thread calls delivery_update_handler with settled=false. The core thread will eventually wake the connection, and the I/O thread will again loop through the updated_deliveries and call delivery_update_handler with the correct value for settled (true), so this should be fine.
I am ok with this suppression.
Arg - sorry I failed to notice the real reason this race is now showing up.
We changed the name of the function.
The old function name is used in the suppression file! We already suppressed this error a while back. My bad.
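For anyone hitting the same thing: as far as I understand it, a TSAN `race:` suppression is matched against the names that appear in the reported stack, so renaming a function silently stops an existing entry from matching. A sketch of what the fix looks like, with placeholder names only (`old_function_name` and `new_function_name` are hypothetical, not the actual symbols from this issue):

```text
# Stale entry: no longer matches any frame once the function is renamed
race:old_function_name

# Updated entry using the function's current name
race:new_function_name
```

The file is passed to the test run via `TSAN_OPTIONS="suppressions=<path to file>"`.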