Open tombergan opened 7 years ago
/cc @robpike @rsc @broady
A held mutex is not unlocked after the panic because the caller did not use defer m.Unlock().
It's even more insidious than that. The program might be using channels for synchronization instead of mutexes, and while it's fairly easy to spot an m.Unlock() that ought to be deferred, that's much less obvious for many channel operations.
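To make the channel case concrete, here is a minimal sketch, assuming a buffered channel used as a one-slot semaphore. The names doWork and safelyDoWork are made up for illustration and are not from Effective Go. The release is an ordinary receive that a panic skips, and nothing about it looks like an Unlock that should have been deferred.

```go
package main

import "fmt"

// doWork acquires a semaphore slot, optionally panics, then releases the slot.
// The release is not deferred, so a panic skips it.
func doWork(sem chan struct{}, fail bool) {
	sem <- struct{}{} // acquire
	if fail {
		panic("boom")
	}
	<-sem // release; never reached when fail is true
}

// safelyDoWork mirrors the safelyDo pattern: it swallows the panic
// and reports whether one occurred.
func safelyDoWork(sem chan struct{}, fail bool) (recovered bool) {
	defer func() {
		if r := recover(); r != nil {
			recovered = true
		}
	}()
	doWork(sem, fail)
	return false
}

func main() {
	sem := make(chan struct{}, 1)
	// The panic is recovered, but the slot stays taken.
	fmt.Println(safelyDoWork(sem, true), len(sem)) // prints "true 1"
}
```

After the recovered panic, the next caller that tries to acquire the slot blocks forever, and the server keeps running as if nothing happened.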
But our http server is already locked into the safelyDo pattern, so there is precedent for the pattern.
Precedent for a bug-prone pattern does not make it any less bug-prone.
It might be bug-prone, but I think it works reasonably well in practice.
Actually, tearing down the process on panic is not absolutely safe either, if you consider ongoing database updates (that are not in a transaction), inconsistent files (defer buffer.Flush), etc.
I can go on to argue that defer buffer.Flush is not safe if the application could panic in another goroutine. This really depends on our perspective.
I will revise the docs (slightly). There's no need to bikeshed now. There will be plenty of bikeshedding on the CL itself I am sure.
Effective Go gives an example of a server that captures and logs panics so that a single panicking goroutine does not take down the entire server. Copying the example below:
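The example did not survive in this copy of the thread. What follows is a reconstruction of the pattern as I recall it from Effective Go, so treat it as a sketch rather than a verbatim quote. The Work field, the body of do, the sync.WaitGroup plumbing, and main are assumptions added only so that the snippet runs on its own.

```go
package main

import (
	"log"
	"sync"
)

// Work is a placeholder for whatever the server processes (hypothetical field).
type Work struct {
	ID int
}

// do is a stand-in for the real work; it panics on odd IDs
// so that the recovery path is exercised.
func do(work *Work) {
	if work.ID%2 == 1 {
		panic("odd work item")
	}
}

// safelyDo recovers from any panic raised by do and logs it,
// so one failing work item does not take down the process.
func safelyDo(work *Work) {
	defer func() {
		if err := recover(); err != nil {
			log.Println("work failed:", err)
		}
	}()
	do(work)
}

// server starts one goroutine per work item. The WaitGroup is an
// addition here so that main can wait for the goroutines to finish.
func server(workChan <-chan *Work, wg *sync.WaitGroup) {
	for work := range workChan {
		wg.Add(1)
		go func(w *Work) {
			defer wg.Done()
			safelyDo(w)
		}(work)
	}
}

func main() {
	var wg sync.WaitGroup
	workChan := make(chan *Work)
	go func() {
		for i := 0; i < 4; i++ {
			workChan <- &Work{ID: i}
		}
		close(workChan)
	}()
	server(workChan, &wg)
	wg.Wait()
}
```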
This design is controversial and I recommend that the example be either removed or updated with caveats. In general, there is no guarantee that shared data structures are in a usable and safe state after a panic occurs. It is arguably better to let panics crash the program. Empirically, the above design has led to real bugs in production systems. Specific examples of how bugs can arise:
- A held mutex is not unlocked after the panic because the caller did not use defer m.Unlock(). This is a specific example of how do(work) may not be panic safe. Note that do(work) is especially vulnerable if it calls library code that is not panic safe.
- The panic is the result of an invariant violation in a shared data structure. Program execution should not continue because the shared data structure is in an invalid state.
- The panic is the result of a data race. Technically, program behavior is undefined from this point (especially if the data race involved an interface or slice).
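The first case is easy to reproduce. This is a minimal sketch, assuming a hypothetical counter type whose incr method locks without defer. It uses sync.Mutex.TryLock (available since Go 1.18) only to observe that the mutex is still held after the recovered panic.

```go
package main

import (
	"fmt"
	"sync"
)

// counter is a hypothetical shared structure guarded by a mutex.
type counter struct {
	mu sync.Mutex
	n  int
}

// incr locks without defer; a panic between Lock and Unlock leaves mu held.
func (c *counter) incr(delta int) {
	c.mu.Lock()
	if delta < 0 {
		panic("negative delta")
	}
	c.n += delta
	c.mu.Unlock()
}

// safelyIncr swallows any panic, the same way safelyDo does.
func safelyIncr(c *counter, delta int) {
	defer func() { recover() }()
	c.incr(delta)
}

func main() {
	c := &counter{}
	safelyIncr(c, -1)
	// TryLock reports whether the mutex could be acquired right now.
	fmt.Println("lock still held:", !c.mu.TryLock()) // prints "lock still held: true"
}
```

Every later call to incr on the same counter now blocks forever, while the process stays up and looks healthy from the outside.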
Given these potential problems, I would argue that safelyDo is actually not safe. That said, the general design is not entirely invalid. It is possible to write a safe version of safelyDo, but to do so, one of the following properties must be guaranteed:
- safelyDo must only recover from explicitly-thrown panics that it knows about. An example of this is shown in the Compile regexp function just below the safelyDo example.
- The goroutines must not use any mutable shared state, i.e., safelyDo(a) touches state completely disjoint from that of safelyDo(b) for all a != b. This implies that safelyDo cannot invoke any libraries that use mutable shared state.
- The implementation of do(work) must be completely panic safe, including all library code used by the implementation.
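The first property might look roughly like the sketch below, modeled loosely on the Compile example in Effective Go. The workError type and the int-based do are hypothetical stand-ins, not existing API. The idea is that safelyDo turns panics of a known type into errors and re-panics on everything else.

```go
package main

import "fmt"

// workError is a hypothetical marker type for panics that do raises on purpose.
type workError struct{ msg string }

func (e workError) Error() string { return e.msg }

// do stands in for the real work (assumption, not part of Effective Go).
func do(input int) {
	switch {
	case input < 0:
		panic(workError{"negative input"}) // deliberate, known panic
	case input == 0:
		var m map[string]int
		m["x"] = 1 // unexpected runtime panic: write to a nil map
	}
}

// safelyDo converts known panics into errors and re-panics on anything else.
func safelyDo(input int) (err error) {
	defer func() {
		r := recover()
		if r == nil {
			return
		}
		we, ok := r.(workError)
		if !ok {
			panic(r) // unknown panic: shared state may be corrupt, do not swallow it
		}
		err = we
	}()
	do(input)
	return nil
}

func main() {
	fmt.Println(safelyDo(1))  // prints "<nil>"
	fmt.Println(safelyDo(-1)) // prints "negative input"
	// safelyDo(0) would re-panic and crash the program, which is the point.
}
```

This only covers the first property. It says nothing about whether the code that raised the known panic left shared state consistent, so the deliberate panics still need to be raised at points where that holds.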