golang / go

The Go programming language

https://go.dev

BSD 3-Clause "New" or "Revised" License

context: ease debugging of where a context was canceled? #26356

Closed matthewceravolo closed 2 years ago

matthewceravolo commented 6 years ago

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (`go version`)?

go version go1.10 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

GOARCH="amd64" GOBIN="" GOCACHE="/home/matthew/.cache/go-build" GOEXE="" GOHOSTARCH="amd64" GOHOSTOS="linux" GOOS="linux" GOPATH="/home/matthew/work" GORACE="" GOROOT="/usr/local/go" GOTMPDIR="" GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64" GCCGO="gccgo" CC="gcc" CXX="g++" CGO_ENABLED="1" CGO_CFLAGS="-g -O2" CGO_CPPFLAGS="" CGO_CXXFLAGS="-g -O2" CGO_FFLAGS="-g -O2" CGO_LDFLAGS="-g -O2" PKG_CONFIG="pkg-config" GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build356255000=/tmp/go-build -gno-record-gcc-switches"

What did you do?

used context.WithTimeout() to make requests to google calendar api and outlook calendar api

If possible, provide a recipe for reproducing the error. A complete runnable program is good. A link on play.golang.org is best.

What did you expect to see?

Making requests using contexts with timeouts should cancel when the timeout is reached

What did you see instead?

Contexts with timeouts are instantly failing with "context canceled" even though the timeout is set to time.Minute. The error goes away if I remove the timeout context and use one without any limit. It also seems to be transient to some extent.
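For readers hitting the same symptom: a "context canceled" error means the context was canceled, not that the timeout fired (a fired timeout reports "context deadline exceeded"). The sketch below illustrates one way that can happen, via cancellation of a parent context. It is not a diagnosis of this report, and the helper names classify, canceledByParent and expiredByTimeout are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// classify is a hypothetical helper that reports why ctx ended.
func classify(ctx context.Context) string {
	switch err := ctx.Err(); {
	case err == nil:
		return "still active"
	case errors.Is(err, context.DeadlineExceeded):
		return "deadline exceeded"
	case errors.Is(err, context.Canceled):
		return "canceled"
	default:
		return "unknown: " + err.Error()
	}
}

// canceledByParent derives a one-minute timeout context from a parent
// and then cancels the parent before the minute has elapsed.
func canceledByParent() string {
	parent, cancelParent := context.WithCancel(context.Background())
	child, cancelChild := context.WithTimeout(parent, time.Minute)
	defer cancelChild()

	cancelParent() // for example, the enclosing request finished or was aborted
	<-child.Done() // closed by the parent's cancellation, not by the timer
	return classify(child)
}

// expiredByTimeout lets a very short timeout fire on its own.
func expiredByTimeout() string {
	ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
	defer cancel()
	<-ctx.Done()
	return classify(ctx)
}

func main() {
	fmt.Println(canceledByParent())
	fmt.Println(expiredByTimeout())
}
```

Checking errors.Is against context.Canceled and context.DeadlineExceeded narrows the search: the first points at a cancel call somewhere up the context chain, the second at the timer.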

Sajmani commented 2 years ago

@mitar No, at this stage I'm just prototyping to get a feel for various ideas. My preference is to minimize API changes and improve debuggability in-place, but I understand performance concerns might make this difficult. Let's keep the discussion here for now

andreimatei commented 2 years ago

Capturing the stack trace at the cancel() call is likely to be uninteresting: usually cancel is called either via timer.AfterFunc (in the case of a deadline)

I don't have API specifics in mind, but I'd consider splitting the deadline case from the explicit cancellation case. When I get a DeadlineExceeded error, I'm mostly interested in what the respective deadline/timeout was. When an operation is explicitly canceled, I want to know who canceled it and why.

or via defer, which may make it harder to debug if there are multiple defers in the same function.

Implicitly or explicitly, I'd try to differentiate between the cancel function being called after the respective operation finished (because the API says that one must always call the cancel function), or if the cancel is called affirmatively to cancel an ongoing operation. The former case should be as cheap as possible, because nobody is expected to get a Canceled error because of it (the ctx in question is not expected to be in use by anybody). The implementation of the latter can be more expensive, because it's expected to only happen on unhappy paths.
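The split described above (deadline versus explicit cancellation) can be sketched with the cause-carrying constructors that the context package gained after this comment was written: WithCancelCause and Cause in Go 1.20, WithTimeoutCause in Go 1.21. This is only an illustration. The error values and the helper names timeoutDemo and explicitDemo are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errClientGone is a hypothetical cause that says who canceled and why.
var errClientGone = errors.New("canceled by handler: client disconnected")

// timeoutDemo lets a timeout fire; the cause records what the timeout was.
func timeoutDemo() string {
	d := 10 * time.Millisecond
	ctx, cancel := context.WithTimeoutCause(context.Background(), d,
		fmt.Errorf("fetch timed out after %v", d))
	defer cancel() // the obligatory cleanup call; it carries no cause
	<-ctx.Done()
	return fmt.Sprintf("%v | %v", ctx.Err(), context.Cause(ctx))
}

// explicitDemo cancels affirmatively and records the reason.
func explicitDemo() string {
	ctx, cancel := context.WithCancelCause(context.Background())
	cancel(errClientGone)
	return fmt.Sprintf("%v | %v", ctx.Err(), context.Cause(ctx))
}

func main() {
	fmt.Println(timeoutDemo())
	fmt.Println(explicitDemo())
}
```

In both cases ctx.Err() stays one of the two sentinel errors, while context.Cause carries the extra detail.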

I know you want minimal API changes. I guess I personally would be inclined towards maximal changes :).

andreimatei commented 2 years ago

If you're looking to pass a reason there's nothing stopping you from creating a custom context implementation.

We have such a context implementation, but it's not a nice one because ctx.Err() still returns the vanilla context.Canceled error (as it also does in @Sajmani 's proposed patch). It's also not nice because it forces us to allocate both our own context struct, and a WithCancel() context. The root of the problem for wanna-be Context implementers is the cancellation propagation code in the standard library, that makes it very expensive for stdlib implementations to interact with custom implementations: https://cs.opensource.google/go/go/+/refs/tags/go1.17.6:src/context/context.go;l=264 That code spawns a new goroutine to propagate cancellation from a custom ctx to a stdlib one. This is very expensive. I've been timidly thinking of proposing to the go team to provide the required interfaces for custom implementations to participate in propagation as efficiently as the stdlib one, so that stdlib ctx can co-exist with the cowboys. I guess I'm asking now.

Sajmani commented 2 years ago

@andreimatei Others have suggested similar hooks, and IIRC we even prototyped some. That might be worthwhile, but IMO debugging context cancelation is a common enough problem that I'd like to make it better for everyone by default, if we can do so with acceptable cost.

jtolio commented 2 years ago

I've been timidly thinking of proposing to the go team to provide the required interfaces for custom implementations to participate in propagation as efficiently as the stdlib one, so that stdlib ctx can co-exist with the cowboys. I guess I'm asking now.

Perhaps this deserves its own issue. I'd certainly be for it.

Sajmani commented 2 years ago

@jtolio There's prior discussion on this in #28728

bcmills commented 2 years ago

Capturing the stack trace at the cancel() call is likely to be uninteresting: usually cancel is called either via timer.AfterFunc (in the case of a deadline) or via defer, which may make it harder to debug if there are multiple defers in the same function.

A given WithCancel call site may have many possible cancel call sites, and while it's true that the cancel call is often just a deferred function, in some of the most interesting cases it is decidedly nontrivial.

For example, a server might construct a context.WithCancel for a background operation and store the cancel function in a map, and only invoke that cancel function when either the background operation completes or an RPC comes in to explicitly cancel it. In such a case it seems very important to distinguish between “canceled because the operation completed” and “canceled by RPC”, but the WithCancel call site would not distinguish them at all.
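A minimal sketch of the server scenario above, assuming the cancel function accepts a cause (as context.WithCancelCause does from Go 1.20 on). The registry type, its methods and the two error values are hypothetical names chosen for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Hypothetical causes for the two ways an operation can end.
var (
	errCompleted     = errors.New("operation completed")
	errCanceledByRPC = errors.New("canceled by RPC")
)

// registry is a hypothetical table of running background operations.
type registry struct {
	mu      sync.Mutex
	cancels map[string]context.CancelCauseFunc
}

func newRegistry() *registry {
	return &registry{cancels: make(map[string]context.CancelCauseFunc)}
}

// start registers a new operation and returns its context.
func (r *registry) start(id string) context.Context {
	ctx, cancel := context.WithCancelCause(context.Background())
	r.mu.Lock()
	r.cancels[id] = cancel
	r.mu.Unlock()
	return ctx
}

// finish removes the operation and cancels its context with the given cause.
func (r *registry) finish(id string, cause error) {
	r.mu.Lock()
	cancel, ok := r.cancels[id]
	delete(r.cancels, id)
	r.mu.Unlock()
	if ok {
		cancel(cause)
	}
}

// demo ends one operation normally and one via a simulated cancel RPC.
func demo() (string, string) {
	r := newRegistry()
	a := r.start("a")
	b := r.start("b")
	r.finish("a", errCompleted)     // the background work finished
	r.finish("b", errCanceledByRPC) // a cancel RPC arrived
	return context.Cause(a).Error(), context.Cause(b).Error()
}

func main() {
	fmt.Println(demo())
}
```

Both contexts report context.Canceled from Err(), so only the cause tells the two situations apart. The single WithCancelCause call site in start could not.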

Sajmani commented 2 years ago

Based on the feedback from several of you, I'll prototype an alternative that allows passing a value in via the cancel function.

andreimatei commented 2 years ago

@jtolio There's prior discussion on this in #28728

I had missed the developments around #28728 and their implications. Those changes seem to let us avoid the extra goroutine, which is great! That is, as long as we don't want to override the Done() method. Not being able to override it means that we have to pay some extra allocations because we have to pair a stdlib WithCancel ctx with our implementation, but that's not the end of the world. With this in mind, I've given a renewed push to our attempts to implement richer cancellation, linked below.
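The pairing described above, a custom type that wraps a stdlib WithCancel context and does not override Done(), might look roughly like the sketch below. It is not the CockroachDB implementation. The names reasonCtx, withReason and Reason are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// reasonCtx is a hypothetical custom context. It embeds a stdlib
// WithCancel context, so Done, Err, Deadline and Value are inherited
// rather than overridden.
type reasonCtx struct {
	context.Context

	mu     sync.Mutex
	reason error
}

// withReason pays for two allocations: the reasonCtx itself and the
// stdlib cancel context it wraps.
func withReason(parent context.Context) (*reasonCtx, func(reason error)) {
	inner, cancel := context.WithCancel(parent)
	rc := &reasonCtx{Context: inner}
	return rc, func(reason error) {
		rc.mu.Lock()
		if rc.reason == nil {
			rc.reason = reason // the first reason wins
		}
		rc.mu.Unlock()
		cancel()
	}
}

// Reason returns the recorded reason, or nil if none was recorded.
func (c *reasonCtx) Reason() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.reason
}

// demo derives a stdlib child from the custom context and cancels the parent.
func demo() string {
	rc, cancel := withReason(context.Background())
	child, stop := context.WithCancel(rc)
	defer stop()

	cancel(errors.New("shutting down"))
	<-child.Done()
	return fmt.Sprintf("%v | %v", child.Err(), rc.Reason())
}

func main() {
	fmt.Println(demo())
}
```

As noted earlier in the thread, Err() on such a context still returns the plain context.Canceled; the reason is only reachable through the custom type.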

Based on the feedback from several of you, I'll prototype an alternative that allows passing a value in via the cancel function.

I'm excited to see what comes out of it!

FWIW, here's what we're trying within CockroachDB: a colleague's PR and my commit on top of it.

Sajmani commented 2 years ago

I've updated my CL with a new version that prototypes adding *Cause variants of WithCancel, WithDeadline, and WithTimeout. The purpose of this design is to provide backwards-compatibility for existing code while enabling users to add cancelation "causes" where they need them. I welcome feedback on the API with respect to whether it addresses this issue: https://go-review.googlesource.com/c/go/+/375977/3/src/context/context.go

I need to write more tests, particularly for cases where there are multiple causes in a context chain, and with mixed cancelation orders. I also need to write benchmarks so we can measure slowdowns on critical paths and any new allocations.

andreimatei commented 2 years ago

This is wonderful!

Unless I'm missing something, your Cause(ctx1) implementation can return an error from ctx2's cancel even if ctx1 was not canceled when ctx2 was canceled. In fact, it might return an error even if ctx1.Err() == nil (i.e. ctx1 is not canceled). This is when a ctx on the chain from ctx1 to ctx2 overrode the Done() channel. Right? This seems surprising. Have you considered adding protection against this in the way that parentCancelCtx() does? If you don't want to do that, consider exporting cancelCtxKey so that 3rd party implementors can hijack Value(cancelCtxKey).

Sajmani commented 2 years ago

@andreimatei I need to write some tests with chains of contexts and multiple causes. Hopefully that will help us see whether the returned causes are sensible or surprising. I'm explicitly propagating causes from parents to children in propagateCancel and cancelCtx.cancel—the same places where the context's error gets propagated. Glad to hear this API is directionally correct. If others agree, we'll need to take this up for proposal review.

Sajmani commented 2 years ago

@andreimatei I added some tests: https://go-review.googlesource.com/c/go/+/375977/4/src/context/context_test.go#798 Do these exercise the case you were describing? (I'm not sure what ctx1 and ctx2 mean in your description).

Sajmani commented 2 years ago

I added tests for cause vs. causeless cancels and more orderings. I also renamed things so the test is easier to follow. I think the behavior is pretty clear now: the first cancel cause sticks for all children, even if it's a nil cause.
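The "first cause sticks" rule can be observed with the API as it was eventually released in Go 1.20; the prototype CL discussed here may have differed in details. In the released package, canceling with a nil cause records context.Canceled as the cause. The function names below are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// parentFirst cancels the parent before the child.
func parentFirst() string {
	parent, cancelParent := context.WithCancelCause(context.Background())
	child, cancelChild := context.WithCancelCause(parent)

	cancelParent(errors.New("parent cause"))
	cancelChild(errors.New("child cause")) // too late: child was already canceled via parent
	return context.Cause(child).Error()
}

// childFirst cancels the child before the parent.
func childFirst() string {
	parent, cancelParent := context.WithCancelCause(context.Background())
	child, cancelChild := context.WithCancelCause(parent)

	cancelChild(errors.New("child cause"))
	cancelParent(errors.New("parent cause"))
	return context.Cause(parent).Error() + " | " + context.Cause(child).Error()
}

// nilFirst cancels without a cause and then tries to add one afterwards.
func nilFirst() string {
	ctx, cancel := context.WithCancelCause(context.Background())
	cancel(nil)
	cancel(errors.New("late cause")) // ignored: ctx is already canceled
	return context.Cause(ctx).Error()
}

func main() {
	fmt.Println(parentFirst())
	fmt.Println(childFirst())
	fmt.Println(nilFirst())
}
```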

andreimatei commented 2 years ago

Do these exercise the case you were describing?

I don't think so. The cases I'm thinking of are about a custom implementation of context that overrides the Done() channel - e.g. an implementation that doesn't inherit the cancelation of its parent. In such cases, your Cause implementation can return a non-nil cause for a context whose Err() is nil. This is true for both Cause(<custom ctx impl>), and for Cause(stdlib ctx derived from custom ctx). I think this is surprising. In the particular case of a stdlib ctx, I think we should guarantee that, if Err() is nil, Cause() is also nil. Would you agree?

Here's a small program that demonstrates what I'm talking about.

package main

import (
    "context"
    "fmt"
    "time"
)

type uncanceledCtx struct {
    p context.Context
}

var _ context.Context = uncanceledCtx{}

func (u uncanceledCtx) Deadline() (deadline time.Time, ok bool) {
    return time.Time{}, false
}

func (u uncanceledCtx) Done() <-chan struct{} {
    return nil
}

func (u uncanceledCtx) Err() error {
    return nil
}

func (u uncanceledCtx) Value(key any) any {
    return u.p.Value(key)
}

func main() {
    ctx1, cancel1 := context.WithCancelCause(context.Background())
    uc := uncanceledCtx{p: ctx1}
    ctx2 := context.WithValue(uc, "key", "val")
    cancel1(fmt.Errorf("cancel 1"))

    fmt.Printf("uncanceled: Err: %v, cause: %v\n", uc.Err(), context.Cause(uc)) // uncanceled: Err: <nil>, cause: cancel 1
    fmt.Printf("ctx2: Err: %v, cause: %v\n", ctx2.Err(), context.Cause(ctx2))   // ctx2: Err: <nil>, cause: cancel 1
}

The output of Cause(uc) and Cause(ctx2) is surprising to me.

Sajmani commented 2 years ago

Ah, thanks for the reproduction case. I would state the expectation in the reverse: Cause(ctx) is only valid (defined) if ctx.Err() != nil (which is the same as when ctx.Done() is closed). If ctx.Err() is nil, then Cause(ctx) is undefined. I can add this to the spec in the CL.

We could strengthen the spec for Cause to mirror what we say for Err:

    // If Done is not yet closed, Err returns nil.
    // If Done is closed, Err returns a non-nil error explaining why:
    // Canceled if the context was canceled
    // or DeadlineExceeded if the context's deadline passed.
    // After Err returns a non-nil error, successive calls to Err return the same error.
    Err() error

So we could add (and implement) "If ctx.Err returns nil, then Cause(ctx) returns nil." I'd prefer not to add this strengthening of the spec unless there's a good reason to do so. I expect most uses of Cause to check <-ctx.Done() or ctx.Err() != nil before calling Cause(ctx).
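The usage pattern described here, checking Err before trusting Cause, can be wrapped in a small helper. This is a sketch only. causeIfDone and the detached type are hypothetical names, and detached mirrors the uncanceledCtx from the reproduction program above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// causeIfDone consults Cause only once ctx reports that it is done,
// so a context whose Err() is nil always yields a nil cause.
func causeIfDone(ctx context.Context) error {
	if ctx.Err() == nil {
		return nil
	}
	return context.Cause(ctx)
}

// detached forwards values to its parent but never reports cancellation.
type detached struct{ parent context.Context }

func (d detached) Deadline() (time.Time, bool) { return time.Time{}, false }
func (d detached) Done() <-chan struct{}       { return nil }
func (d detached) Err() error                  { return nil }
func (d detached) Value(key any) any           { return d.parent.Value(key) }

// demo cancels a parent and queries both the parent and a detached wrapper.
func demo() string {
	ctx1, cancel1 := context.WithCancelCause(context.Background())
	uc := detached{parent: ctx1}
	cancel1(errors.New("cancel 1"))
	return fmt.Sprintf("%v | %v", causeIfDone(ctx1), causeIfDone(uc))
}

func main() {
	fmt.Println(demo())
}
```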

andreimatei commented 2 years ago

I would state the expectation in the reverse: Cause(ctx) is only valid (defined) if ctx.Err() != nil (which is the same as when ctx.Done() is closed). If ctx.Err() is nil, then Cause(ctx) is undefined. I can add this to the spec in the CL.

I expect most uses of Cause to check <-ctx.Done() or ctx.Err() != nil before calling Cause(ctx).

In my opinion, we should aim to relegate ctx.Err() to the past. I think we should aim for Cause(ctx) to completely replace it for new programs (because it's much better) (*). So I personally do not agree with the premise that any Cause() call should be preceded by an Err() call (or, rather, I don't think we should design Cause() with that mindset). There may be a preceding Done() call, but not always.

Another thing I wanted to bring up is making sure that custom ctx impls can participate in Cause() determination for themselves and their children. I think with your current patch that's not possible, because the key that Cause() looks for is not exported?

(*) If I had my druthers, we would make Cause() a new method on ctx, even if that means that custom Ctx impls will need to add that method to compile under the new Go version. I'm not very familiar with the Go compatibility promise but, as far as I'm personally concerned, there aren't that many custom Ctx impls out there and the changes to them would be trivial (e.g. they can delegate to Err()). I'd also consider adding a version of the Done() method that returns the error. It'd have to be a new channel per caller; I haven't thought through the performance implications.

bcmills commented 2 years ago

@andreimatei, the Go compatibility policy does not allow new methods to be added to exported interfaces. So there is no plausible way for Cause to replace Err outright.

The best we could perhaps do is retrofit a top-level func Cause(Context) error function into the context package, which would use Cause if defined or Err otherwise.
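The retrofit suggested above could be sketched as an optional-interface check. Everything here is hypothetical: the causer interface, the CauseOf function and the causeCtx type are illustration only, and this is not how the context.Cause function later released in Go 1.20 is implemented.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// causer is a hypothetical optional interface a Context could implement.
type causer interface {
	Cause() error
}

// CauseOf uses the Cause method if the context defines one, or Err otherwise.
func CauseOf(ctx context.Context) error {
	if c, ok := ctx.(causer); ok {
		return c.Cause()
	}
	return ctx.Err()
}

// causeCtx is a hypothetical custom context that opts in to the interface.
type causeCtx struct {
	context.Context
	cause error
}

// Cause reports the stored cause, but only once the context is done.
func (c causeCtx) Cause() error {
	if c.Err() == nil {
		return nil
	}
	return c.cause
}

// demo queries a plain stdlib context and an opted-in custom one.
func demo() string {
	plain, cancelPlain := context.WithCancel(context.Background())
	cancelPlain()

	inner, cancelInner := context.WithCancel(context.Background())
	custom := causeCtx{Context: inner, cause: errors.New("disk full")}
	cancelInner()

	return fmt.Sprintf("%v | %v", CauseOf(plain), CauseOf(custom))
}

func main() {
	fmt.Println(demo())
}
```

One limitation of a plain type assertion is that it only sees the outermost context: a stdlib context derived from causeCtx would no longer satisfy causer.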

Sajmani commented 2 years ago

Regarding compatibility, my goal with this CL is to avoid breaking anything, including existing implementations of the Context interface. Therefore I'm not adding Cause() to the Context interface.

My goal with Cause is not to replace Err, but to allow people to supplement Err with additional information.

I think you're saying that you'd like to check Cause(ctx) != nil instead of ctx.Err() != nil. If so, then indeed we'd need to make sure they are nil / not-nil at the same times. We'd probably also want to make Cause(ctx) == ctx.Err() when the user hasn't provided a cause. But this then makes it difficult to determine whether ctx.Done was closed due to cancelation or deadline exceeded (the user would need to encode this into their cause).

Regarding implementing Cause for custom contexts, users can do this easily by wrapping a context created using WithCancelCause. I'd like to keep things simple until there's a demonstrated need to support anything fancier.

andreimatei commented 2 years ago

We'd probably also want to make Cause(ctx) == ctx.Err() when the user hasn't provided a cause.

That's what I was thinking.

But this then makes it difficult to determine whether ctx.Done was closed due to cancelation or deadline exceeded (the user would need to encode this into their cause).

Right. But isn't that a natural thing to do, since the user has to pass an error to WithDeadlineCause(...,err)? One thing to consider is having the Cause() for a deadline be a wrapper error type (say, CtxDeadlineExceeded), which wraps the user-provided error (such that errors.Is/As still recognize the provided error). And similarly for causes stemming from cancellation.
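A rough sketch of the wrapper idea follows. Here the caller does the wrapping by hand rather than the context package doing it, and the type name deadlineCause and the error errSlowBackend are hypothetical. It relies on WithTimeoutCause, available from Go 1.21.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errSlowBackend is a hypothetical user-provided cause.
var errSlowBackend = errors.New("backend did not answer in time")

// deadlineCause wraps a user-provided error and also matches
// context.DeadlineExceeded under errors.Is.
type deadlineCause struct{ err error }

func (e *deadlineCause) Error() string { return "deadline exceeded: " + e.err.Error() }
func (e *deadlineCause) Unwrap() error { return e.err }
func (e *deadlineCause) Is(target error) bool {
	return target == context.DeadlineExceeded
}

// demo lets a short timeout fire and inspects the resulting cause.
func demo() (isDeadline, isUserErr bool) {
	ctx, cancel := context.WithTimeoutCause(context.Background(), time.Millisecond,
		&deadlineCause{err: errSlowBackend})
	defer cancel()
	<-ctx.Done()

	cause := context.Cause(ctx)
	return errors.Is(cause, context.DeadlineExceeded), errors.Is(cause, errSlowBackend)
}

func main() {
	fmt.Println(demo())
}
```

With such a wrapper a single errors.Is check on the cause can answer both questions: whether the deadline fired, and which user-level reason was attached.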

Regarding implementing Cause for custom contexts, users can do this easily by wrapping a context created using WithCancelCause. I'd like to keep things simple until there's a demonstrated need to support anything fancier.

When people implement custom contexts, I think they primarily do so for efficiency reasons. In such cases, forcing them to wrap a stdlib ctx defeats the purpose. There's also a more philosophical contradiction here, in my opinion. Context is an interface, presumably, because you want people to be able to implement their own. But you seem to also want to tell people that some functionality is only available to the stdlib, so you have to jump through hoops to get it in a custom impl. Aren't these two things at odds with each other? There are similar nuances at play around #28728 and the performance implications of propagating cancellation across context types: in that case it's not strictly about what an implementer can do, but rather the degree to which the stdlib plays nicely with others. Pragmatically speaking, since every library out there uses the stdlib ctx, whether or not it's feasible for others to write custom implementations largely depends on how well the stdlib impl interacts with them. https://github.com/golang/go/commit/0ad368675bae1e3228c9146e092cd00cfb29ac27 addressed the concerns to some degree, but I think not fully. So, as a matter of principle, FWIW, my opinion is that the performance and ergonomics of 3rd party impls should be a concern whenever mucking with the context library.

I should say that I can nitpick and form semi-informed opinions on proposals, but ultimately I'll gladly take most improvements around the cause, compared to nothing.

Sajmani commented 2 years ago

All good points, @andreimatei , thank you. I suggest at this point I write up the current state as a Proposal and document your suggestions as Alternatives to Consider.

rsc commented 2 years ago

Please take a look at #51365 and see if it addresses your use cases. Thanks!
