Closed JacobOaks closed 3 months ago
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 98.42%. Comparing base (74d9643) to head (d2c9e53).
On behalf of group code review by @sywhang & @r-hang
A user reported a possible deadlock within the signal receivers (#1219).
This happens as follows:

1. `(*signalReceivers).Stop()` is called, by Shutdowner for instance.
2. `(*signalReceivers).Stop()` acquires the lock.
3. `relayer()` is still running at this point if `(*signalReceivers).Stop()` has not yet sent along the `shutdown` channel.
4. `relayer()` receives a signal on the `signals` channel and calls `Broadcast()`.
5. `Broadcast()` blocks on trying to acquire the lock.
6. `(*signalReceivers).Stop()` blocks on waiting for the `relayer()` to finish by blocking on the `finished` channel.

Luckily, this is not a hard deadlock, as `Stop` will return if the context times out, but we should still fix it.

This PR fixes this deadlock. The fix is based on the observation that the broadcasting logic does not seem to need to share a mutex with the rest of
`signalReceivers`. Specifically, it seems like we can separate protection around the registered `wait` and `done` channels, `last`, and the rest of the fields, since the references to those fields are easily isolated. To avoid overcomplicating `signalReceivers` with multiple locks for different uses, this PR creates a separate `broadcaster` type in charge of keeping track of and broadcasting to `Wait` and `Done` channels. Most of the implementation of `broadcaster` is simply moved over from `signalReceivers`.

Having a separate broadcaster type seems actually quite natural, so I opted for this to fix the deadlock. Absolutely open to feedback or taking other routes if folks have thoughts.
Since broadcasting is protected separately, this deadlock no longer happens: `relayer()` is free to finish its broadcast and then exit.

In addition to running the example provided in the original post to verify, I added a test and ran it before/after this change.
Before:
(the failure appeared roughly 1/3 of the time)
After:
(no failures appeared)