Fix deadlock caused by race while signal receivers are stopping

JacobOaks commented 3 months ago

A user reported a possible deadlock within the signal receivers (#1219).

This happens by:

(*signalReceivers).Stop() is called, by Shutdowner for instance.
(*signalReceivers).Stop() acquires the lock.
Separately, an OS signal is sent to the program.
There is a chance that relayer() is still running at this point if (*signalReceivers).Stop() has not yet sent along the shutdown channel.
The relayer attempts to broadcast the signal received via the signals channel.
Broadcast() blocks on trying to acquire the lock.
(*signalReceivers).Stop() blocks on waiting for the relayer() to finish by blocking on the finished channel.
Deadlock.

Luckily, this is not a hard deadlock, as Stop will return if the context times out, but we should still fix it.

This PR fixes this deadlock. The idea behind how it does it is based on the observation that the broadcasting logic does not necessarily seem to need to share a mutex with the rest of signalReceivers. Specifically, it seems like we can separate protection around the registered wait and done channels, last, and the rest of the fields, since the references to those fields are easily isolated. To avoid overcomplicating signalReceivers with multiple locks for different uses, this PR creates a separate broadcaster type in charge of keeping track of and broadcasting to Wait and Done channels. Most of the implementation of broadcaster is simply moved over from signalReceivers.

Having a separate broadcaster type seems actually quite natural, so I opted for this to fix the deadlock. Absolutely open to feedback or taking other routes if folks have thoughts.

Since broadcasting is protected separately, this deadlock no longer happens since relayer() is free to finish its broadcast and then exit.

In addition to running the example provided in the original post to verify, I added a test and ran it before/after this change.

Before:

$ go test -v -count=10 -run "TestSignal/stop_deadlock" .
=== RUN   TestSignal/stop_deadlock
    signal_test.go:141:
                Error Trace:
/home/user/go/src/github.com/uber-go/fx/signal_test.go:141
                Error:          Received unexpected error:
                                context deadline exceeded
                Test:           TestSignal/stop_deadlock

(the failure appeared roughly 1/3 of the time)

After:

$ go test -v -count=100 -run "TestSignal/stop_deadlock" .
--- PASS: TestSignal (0.00s)
    --- PASS: TestSignal/stop_deadlock (0.00s)

(no failures appeared)

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 98.42%. Comparing base (74d9643) to head (d2c9e53).

Additional details and impacted files

```diff @@ Coverage Diff @@ ## master #1220 +/- ## ======================================= Coverage 98.41% 98.42% ======================================= Files 34 35 +1 Lines 2909 2918 +9 ======================================= + Hits 2863 2872 +9 Misses 38 38 Partials 8 8 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

lverma14 commented 3 months ago

On behalf of group code review by @sywhang & @r-hang

uber-go / fx

Fix deadlock caused by race while signal receivers are stopping #1220

Codecov Report