doc: Effective Go's "safelyDo" does not demonstrate safe use of recover

tombergan commented 7 years ago

Effective Go gives an example of a server that captures and logs panics so that a single panicking goroutine does not take down the entire server. Copying the example below:

func server(workChan <-chan *Work) {
    for work := range workChan {
        go safelyDo(work)
    }
}

func safelyDo(work *Work) {
    defer func() {
        if err := recover(); err != nil {
            log.Println("work failed:", err)
        }
    }()
    do(work)
}

This design is controversial and I recommend that the example be either removed or updated with caveats. In general, there is no guarantee that shared data structures are in a usable and safe state after a panic occurs. It is arguably better to let panics crash the program. Empirically, the above design has led to real bugs in production systems. Specific examples of how bugs can arise:

A held mutex is not unlocked after the panic because the caller did not use defer m.Unlock(). This is a specific example of how do(work) may not be panic safe. Note that do(work) is especially vulnerable if it calls library code that is not panic safe.
The panic is the result of an invariant violation in a shared data structure. Program execution should not continue because the shared data structure is in an invalid state.
The panic is the result of a data race. Technically, program behavior is undefined from this point (especially if the data race involved an interface or slice).

Given these potential problems, I would argue that safelyDo is actually not safe. That said, the general design is not entirely invalid. It is possible to write a safe version of safelyDo, but to do so, one of the following properties must be guaranteed:

safelyDo must only recover from explicitly-thrown panics that it knows about. An example of this is shown in the Compile regexp function just below the safelyDo example.
The goroutines must not use any mutable shared state, i.e., safelyDo(a) touches completely disjoint state than safelyDo(b) for all a != b. This implies that safelyDo cannot invoke any libraries that use mutable shared state.
The implementation of do(work) must be completely panic safe, including all library code used by the implementation.

bradfitz commented 7 years ago

/cc @robpike @rsc @broady

bcmills commented 7 years ago

A held mutex is not unlocked after the panic because the caller did not use defer m.Unlock().

It's even more insidious than that. The program might be using channels for synchronization instead of mutexes, and while it's fairly easy to spot an m.Unlock() that ought to be deferred, that's much less obvious for many channel operations.

minux commented 7 years ago

But our http server is already locked into the safelyDo pattern, so there is precedence for the pattern.

bcmills commented 7 years ago

Precedence for a bug-prone pattern does not make it any less bug-prone.

minux commented 7 years ago

It might be bug prone, but I think it works reasonably well in practice.

Actually, tear down the process on panic is not absolutely safe either if you consider ongoing database updates (that's not in a transaction), inconsistent files (defer buffer.Flush), etc.

I can go on to argue that defer buffer.Flush is not safe if the application could panic in another goroutine. This really depends on our perspective.

rsc commented 7 years ago

I will revise the docs (slightly). There's no need to bikeshed now. There will be plenty of bikeshedding on the CL itself I am sure.

golang / go

doc: Effective Go's "safelyDo" does not demonstrate safe use of recover #19070