golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.99k stars 17.67k forks source link

runtime: panic + recover can cancel a call to Goexit #29226

Closed bcmills closed 4 years ago

bcmills commented 5 years ago

The documentation for runtime.Goexit says (emphasis mine):

Goexit terminates the goroutine that calls it. No other goroutine is affected. Goexit runs all deferred calls before terminating the goroutine. Because Goexit is not a panic, any recover calls in those deferred functions will return nil.

However, this is empirically not the case if one of the deferred functions itself triggers a panic: that panic can be recovered, and that recover cancels the termination of the goroutine.

See https://play.golang.org/p/7VrxPDByUNT for a minimal illustration.

Found while investigating #29207.

CC @aclements @randall77

randall77 commented 5 years ago

Panics within panics hurt my brain.

I don't think this would be too hard to fix. We just need another bit in the goroutine to record the "in Goexit" state.

There's a comment in the runtime (from 2014):

        // If defer was started by earlier panic or Goexit (and, since we're back here, that triggered a new panic),
        // take defer off list. The earlier panic or Goexit will not continue running.

So it looks like the current behavior is at least somewhat intentional. @rsc, who wrote that comment.

bcmills commented 5 years ago

34481 proposes to substantially rework the panic implementation. We should see if we can incidentally address this issue as part of that effort.

danscales commented 5 years ago

This example is actually a bit of a special case, because the Goexit call and the recover are in the same stack frame/function:

func maybeGoexit() {
    defer func() {
        fmt.Println(recover())
    }()
    defer panic("cancelled Goexit!")
    runtime.Goexit()
}

In this case, we are running somewhat into the "abort" rule (which is currently only described in the implementation, not in any spec). If the processing of defers by a recursive panic gets to the frame where an outer panic or Goexit happened, then that outer panic/Goexit is aborted, because there is no mechanism to resume it with a recover (since a recover always finishes the defers of the current frame and then resumes execution with the next outer frame). The abort rule is exactly related to the comment by @rsc above. If we wanted to allow the Goexit to continue, I think we would have add a bunch of mechanism (maybe some stack smashing) just to deal with this case. (The abort case is also mentioned in the Requirements section for the just-added open-coded defer proposal: https://github.com/golang/proposal/blob/master/design/34481-opencoded-defers.md#requirements )

But note that the more likely/useful case (where a panic and recover happens completely within a deferred function during a Goexit() works fine. The recover happens, and the Goexit() continues.

https://play.golang.org/p/kcQBNOAUeSa

So, for instance, it would work fine if a panic-recover happened during a Printf in a defer function (which can happen with nil pointers, etc. of the printed args).

So, I'll think about this a bit more, but I would lean toward not fixing this (because it would require a bunch of added mechanism just for this one case).

zigo101 commented 5 years ago

I doubt the current implementation of Goexit is to create a special panic. In @bcmills's example, the panic is recovered, but in @danscales' example, the panic is not recovered.

If this is true, I lean toward the panic created in Goexit should be always unrecoverable.

zigo101 commented 5 years ago

In fact, the key here is, if Goexit creates a panic, in @bcmills's example, the panic created in Goexit is suppressed by the new panic created in defer panic("cancelled Goexit!"), which is consequently recovered. (If the suppression doesn't happen, the the recover will not happen too, for the panic created in Goexit is unrecoverable and will not cause crashing.) So, if the issue needs to be fixed, then we should avoid letting the new panic created in defer panic("cancelled Goexit!") suppress the special panic created in Goexit.

bcmills commented 5 years ago

note that the more likely/useful case (where a panic and recover happens completely within a deferred function during a Goexit() works fine

I agree that that's more likely, but there are realistic scenarios that could trigger this bug. For example, consider https://play.golang.org/p/vSL3LWEGpm2.

bcmills commented 5 years ago

This example is actually a bit of a special case, because the Goexit call and the recover are in the same stack frame/function

Note that the issue can be reproduced with the Goexit, recover, and panic all in separate functions (and presumably frames): https://play.golang.org/p/M93Kxa4aBhX

danscales commented 5 years ago

I agree that that's more likely, but there are realistic scenarios that could trigger this bug. For example, consider https://play.golang.org/p/vSL3LWEGpm2.

Yes, you are demonstrating the exact panic abort semantics that I mentioned. Just to describe it again, an outer panic/Goexit is aborted if the inner panic reaches & processes any defers in the same frame or in a higher (outer) frame from the frame where the panic/Goexit happened. Once the outer panic/Goexit is aborted, it will no longer apply even if there is a recover() that recovers the inner panic.

I think you want to stop these unspecified abort semantics. Instead, you want to change the implementation so that if an inner panic is recovered in a frame at the same level or even further down in the stack from where the output panic/Goexit happened, you want the defer processing of the frames to continue under the auspices of the outer panic/Goexit.

Note that if the recover happens in a frame which is higher in the stack from the output panic/Goexit, we still want the current behavior, which is that the defers in the current frame are finished, and then "normal" execution continues in the next outer frame, which is some function called by a defer caused by the outer panic. So, the outer panic will continue as expected, but it will get to complete the running of the defer function that caused the panic.

Since the behavior of recursive panics/Goexits (including the abort semantics) was never specified in the first place, I think we have leeway on whether/how we want to fix this.

I've tried to get some feedback on improving the description of panic/recover/recursive panics in the Go language spec, but don't think we have consensus yet about whether or what to improve:

https://go-review.googlesource.com/c/go/+/189377

Feel free to add comments....

danscales commented 5 years ago

Instead, you want to change the implementation so that if an inner panic is recovered in a frame at the same level or even further down in the stack from where the output panic/Goexit happened, you want the defer processing of the frames to continue under the auspices of the outer panic/Goexit.

OK, and I think the hard part about doing this is that we presumably want to clean up the stack of the inner panic and somehow make it look like we are back to the outer Goexit/panic processing. That seems really hard to do without a lot of stack gymnastics (possibly stack smashing) and/or extra stubs to allow us to get control back to the runtime after cleaning up the stack. If we don't require that the stack is cleaned up, then getting rid of the abort behavior seems more doable.

bcmills commented 5 years ago

I think you want to stop these unspecified abort semantics. Instead, […] you want the defer processing of the frames to continue under the auspices of the outer panic/Goexit.

Maybe? Consider https://play.golang.org/p/MyBCSSorYFl: the outer function defers a recover and a cleanup function and then panics, and the cleanup function also panics independently

Intuitively, what I would expect is for the first defer to catch the first panic (“panic!”), and for the second panic (“whoops!”) to continue propagating outward, since that panic occurred after the execution of the function that performed the initial defer.

I could also reasonably see dropping the second panic entirely, since the fact that it occurred is likely a side-effect of some invariant violated as a result of the first panic: if the first panic is fixed, perhaps the second one won't occur at all.

Today, the first recover instead receives the second panic, and the details of the first panic are completely lost.

(I would call the second panic the “inner” one, but maybe I'm using “inner” and “outer” a bit differently than you are.)

zigo101 commented 5 years ago

If panics are maintained in a stack and only the top one can be recovered (this example shows the possibility is high), then the panics created by Goexit calls shouldn't be pushed into the panic stack. Instead, each goroutine should maintain a separated field to record the last (or the first) panic created by Goexit calls.

BTW, the following program proves that if a panic created by Goexit is at the top of the panic stack, then all other panics will also not get recovered.

package main

import "fmt"
import "runtime"

func main() {
    go func() {
        defer func() {
            fmt.Println("bb", recover()) // <nil>
        }()
        defer func() {
            defer func() {
                fmt.Println("aa", recover()) // <nil>
            }()
            defer runtime.Goexit()
            panic(2)
        }()
        panic(1)
    }()

    select{}
}
zigo101 commented 5 years ago

@ianlancetaylor It looks gccgo and gc disagree on the last example.

ianlancetaylor commented 5 years ago

@go101 Thanks for noticing that difference. I think we should work out this issue, and then we can make sure that the implementations do the same thing.

zigo101 commented 5 years ago

It looks the difference is, in gccgo, the panic created in Goexit will not shadow the newest panic, whereas gc will. In gccgo, the panic created in Goexit will shadow the panics before the newest panic.

ianlancetaylor commented 5 years ago

Yes, but let's first figure out what is correct, and then implement that.

zigo101 commented 5 years ago

Personally, I think panicking and exiting should be two different properties of gorotines.

gopherbot commented 5 years ago

Change https://golang.org/cl/200081 mentions this issue: runtime: ensure that Goexit cannot be aborted by a recursive panic/recover

danscales commented 5 years ago

I figured out how to make sure that Goexits are not aborted by recovers that happen in defers below them... It didn't require any stack smashing or extra stubs -- just saving the pc/sp when making calls to deferred functions from the Goexit defer processing loop. See the mentioned change for the fix.

We discussed the abort behavior with respect to panic and Goexit, and are leaning toward only changing the Goexit behavior (which is the mentioned changed). As summarized well by Robert at https://go-review.googlesource.com/c/go/+/189377/7/doc/go_spec.html#6169 [paraphrased here]

From a programmer's point of view, if a defer contains a recover() call, I expect it to catch any pending panic, no matter where they are coming from. Otherwise it's impossible to write code that protects against (expected and unexpected) panics leaving a function/API entry point. So that would argue against "only recover the panic initiated in its own function".

And the abort (replace) behavior for panics exactly coalesces panics down so that we make sure that a recover in a function always stops all panics within that function or its callees.

zigo101 commented 5 years ago

There is another related question: should a Goexit call save the a panicking goroutine from crashing? The current gc and gccgo both do this. Personally, I think it is not very reasonable.

package main

import "runtime"
import "time"

func main() {
    go func() {
        defer runtime.Goexit() // [edit]: add the "defer"
        panic(1) // will not crash the program
    }()

    for {time.Sleep(100000)}
}
bcmills commented 5 years ago

@go101, in that snippet the runtime.Goexit() exits the goroutine immediately, so the panic line is not even reached.

bcmills commented 5 years ago

But if you defer the call to runtime.Goexit(), that does suppress the panic (https://play.golang.org/p/JeXz2KhA5f_h), and I agree that it should not.

bcmills commented 5 years ago

A more realistic example illustrating that problem (https://play.golang.org/p/Gx4zGz5L79i):

func TestSkipAfterPanic(t *testing.T) {
    defer func() {
        if !t.Failed() {
            t.Skip("huh?")
        }
    }()

    panic("serious bug that should fail this test!")
}
zigo101 commented 5 years ago

Yes, I missed the defer before Goexit. ;D

It looks the docs says T.Skip() calls Goexit(). It also says if a test has already failed, Skip doesn't change the failed state. So panicking is not viewed as testing failed?

bcmills commented 5 years ago

The test doesn't fail per se until the panic propagates out of the Test function, so an active panic is indeed not a failure at that point.

danscales commented 5 years ago

Feel free to file a separate bug that Goexit can cancel a Panic. I'm not sure that fix is very high priority, since it seems very unusual to put a Goexit (or something that would do a Goexit) in a defer statement. Also, I'm not sure what a reasonable change in semantics would be. Should Goexit() become a no-op if you are in a panicking sequence? I think that would be surprising to most programmers. Or would you do something more dramatic, like aborting the entire defer call that contains the Goexit (and is being run by the panicking sequence)?

bcmills commented 5 years ago

@danscales, probably the Goexit should allow the panic to continue, but switch to an orderly Goexit if all of the active panics are eventually recovered.

danscales commented 5 years ago

Yes, but I think a Goexit becoming a no-op in the panic case may be quite surprising to programmers and hence lead to other weird failures in the code immediately following the Goexit.

bcmills commented 5 years ago

In practice the ~only real user of runtime.Goexit is the testing package, and for that case it seems decidedly less weird for an unrecovered panic to cause a test failure than for t.Skip to cancel an unrecovered panic.

But you're right that I would need to update the double defer sandwich to account for the possibility of a Goexit resuming after a successful recover. Perhaps it will need to become the “triple defer with cheese”. 😃

Another option would be to make runtime.Goexit itself panic if the goroutine is already panicking. That would at least make the t.Skip example do something more reasonable (specifically, fail the test if it tries to Skip during an unrecovered panic).

zigo101 commented 5 years ago

Maybe two panic stacks are needed. One for normal panics, the other for Goexit panics. Or each non-exited function call may associate with at most two unrecovered panics, one is normal panic, the other is Goexit panic. Goexit panics are not recoverable. Either of two panics existing will make the associated function call enter its terminating execution phase,

zigo101 commented 5 years ago

I just modified my explanation on the panic/recover mechanism to include Goexit signals.

One reason I think Goexit signals should not shadow normal panics is that, if we put a defer runtime.Goexit() call at the beginning of a goroutine entry function, then the defrerred call acts as an ultimate recover call recover operation.

zigo101 commented 5 years ago

By describing it in non-optimized code:

type _panic struct {
    // Both of the two can be true.
    // causedByPanic may be changed from true
    // to false, but causedByExit may not.
    causedByPanic  bool
    causedByGoExit  bool

    headReason *_panicReason
    tailReason *_panicReason
}

type _panicReason struct {
    value interface{}
    prev  *_panicReason
}

// When a function call exits, its associating `_panic`
// will be merged with the `_panic` associating with its caller.
...
    if exitedCall.panic.causedByPanic {
        caller.panic.causedByPanic = true
        if exitedCall.panic.tailReason != nil {
            exitedCall.panic.tailReason = caller.panic.headReason
        }
        caller.panic.headReason = exitedCall.panic.headReason
        if caller.panic.tailReason == nil {
            caller.panic.tailReason = caller.panic.headReason
        }
    }
    caller.panic.causedByGoExit = caller.panic.causedByGoExit || exitedCall.panic.causedByGoExit
    if !caller.isExiting && (caller.panic.causedByGoExit || caller.panic.causedByPanic) {
        caller.startExiting()
    }
... 

[edit]: to optimize, the panic reason list can be move the the _goroutine struct as a stack (just as the current implementation), and the two bools can be replaced with bits.

zigo101 commented 5 years ago

@griesemer @ianlancetaylor Is it possible that gc and gccgo can make agreement on the panic and Goexist mechenism in Go 1.14? Specifically,

  1. the result of which implementation of this example is correct?
  2. should Goexit shadow normal panics?

In my opinion, the earlier the points are clarified, the better for the Go community.

ianlancetaylor commented 5 years ago

I don't understand precisely what your questions are, but I agree that these questions need to be resolved in the language spec.

That said, it does not seem urgent to me. These issues have been confused for many years and nobody has cared. If we sort them out for 1.14, great. If not, OK.

zigo101 commented 5 years ago

I wrote a book and one chapter in it explains the panic/recover mechanism. And I plan to make it cover runtime.Goexit calls. So I very care about this. I really don't want to spread wrong explainations into the Go community.

I think the implementation is not urgent to be corrected (if the current implementation is really not perfect), but the mechanism should be settled down as earlier as possible, if there are no special difficulties to prevent it being done.

BTW, @ianlancetaylor , thanks for your explanation in another thread. I will write down the confirmed part in my book. I asked this question is because I need to finish the chapter in my book. If this is impossible in Go 1.14, then I will mention the Goexit related part is still not formalized yet., otherwise, I will wait unitl it is formalized.

danscales commented 5 years ago

@go101 I think we have mostly settled that the behavior of panic and recover will stay as it is right now. My current proposed changes to the Go language spec are here (don't know if you've seen the latest changeset):

https://go-review.googlesource.com/c/go/+/189377

I/we are not trying to specify every possibility for panics/recovers in the spec, but just want to specify better when recovers actually apply (only in a defer directly called by a panicking sequence) and then describe the the typical case of recursive panics and also the behavior of when one panic can replace another panic. See also my comment on Oct 9th above on why the panic replacement behavior makes sense.

Goexit is a library routine, so it will not be specified in the language spec. However, as I mentioned above, we are planning to fix the current bug and never allow a panic/recover to cancel a Goexit. We have to discuss more about further interaction between panic and Goexit (as your example in your Sept 26 comment and examples from Bryan Mills illustrate). The fix for the current bug may make it into Go 1.14, but not definitely. Since we still have to discuss things, I'm fairly sure that no further changes in Goexit/panic/recover interaction will make it into Go 1.14.

zigo101 commented 5 years ago

@danscales Thanks for the clarification.

we are planning to fix the current bug and never allow a panic/recover to cancel a Goexit.

What about the inverse? Will Goexit calls cancel unrecovered panics?

ianlancetaylor commented 5 years ago

I am guessing that you are asking whether defer Goexit() should recover a panic that is currently being thrown. I don't think it should. Goexit is not recover.

zigo101 commented 5 years ago

Yes, that is what I asked. Thanks for the clarification.

zigo101 commented 5 years ago

Should this following program print should be unreachable or not. I tried tip, it prints it.

edit: sorry, the following program is a wrong example. Please see this comment for reason. ;D

package main

import "fmt"
import "runtime"

func f() {
    defer func() {
        recover()
    }()
    panic("will cancel Goexit but should not")
    runtime.Goexit()
}

func main() {
    c := make(chan struct{})
    go func() {
        defer close(c)
        f()
        fmt.Println("should be unreachable")
    }()
    <-c
}
zigo101 commented 5 years ago

And should the following program crash? I tried tip. It doesn't, though I feel it should crash by @ianlancetaylor's clarification above.

package main

import "runtime"

func main() {
    c := make(chan struct{})
    go func() {
        defer close(c)
        // The Goexit signal shadows the
        // "bye" panic, but it should not.
        defer runtime.Goexit()
        panic("bye")
    }()
    <-c
}
bcmills commented 5 years ago

Should this following program print should be unreachable or not.

I agree that it should not, and I can reproduce the same behavior you observe using gotip.

bcmills commented 5 years ago

And should the following program crash?

That one is less clear to me, although I agree that @ianlancetaylor's comment suggests that it should crash.

ianlancetaylor commented 5 years ago

I agree with @bcmills.

danscales commented 5 years ago

Should this following program print should be unreachable or not. I tried tip, it prints it.

package main

import "fmt"
import "runtime"

func f() {
  defer func() {
      recover()
  }()
  panic("will cancel Goexit but should not")
  runtime.Goexit()
}

func main() {
  c := make(chan struct{})
  go func() {
      defer close(c)
      f()
      fmt.Println("should be unreachable")
  }()
  <-c
}

This is exactly performing to spec and is unrelated to the current bug. https://golang.org/ref/spec#Handling_panics and https://go-review.googlesource.com/c/go/+/189377

The panic() call starts a panic sequence. By spec, the function that called panic will never continue running normally (hence you will never reach the Goexit). If a recover occurs in a defer directly called by the panicking sequence, then the panic will be recovered, and (after finishing any remaining defers in that frame), execution of the goroutine will continue in the caller of that frame (which in this case is f).

danscales commented 5 years ago
package main

import "runtime"

func main() {
  c := make(chan struct{})
  go func() {
      defer close(c)
      // The Goexit signal shadows the
      // "bye" panic, but it should not.
      defer runtime.Goexit()
      panic("bye")
  }()
  <-c
}

This is the issue that we mentioned above, which is that a Goexit can cancel a panic. I just filed a separate bug for this (https://github.com/golang/go/issues/35378), and we can discuss there. As I mentioned, any fix so that a Goexit does not cancel a panic will not make it into 1.14.

bcmills commented 5 years ago

Ah, yeah. I somehow read the panic("will cancel Goexit […]") as defer panic([…]) instead. 😞

zigo101 commented 5 years ago

Ah, sorry, I also acted as it is a deferred call. ;D

zigo101 commented 4 years ago

It looks this problem has been fixed in Go SDK 1.14. Should it be closed?

danscales commented 4 years ago

Yes, let's close. This was fixed by my change 7dcd343 . I already opened a separate issue #35378 about the converse situation that a Goexit call can cancel a panic.