golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.02k stars 17.54k forks source link

context: ease debugging of where a context was canceled? #26356

Closed matthewceravolo closed 1 year ago

matthewceravolo commented 6 years ago

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version go1.10 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

GOARCH="amd64" GOBIN="" GOCACHE="/home/matthew/.cache/go-build" GOEXE="" GOHOSTARCH="amd64" GOHOSTOS="linux" GOOS="linux" GOPATH="/home/matthew/work" GORACE="" GOROOT="/usr/local/go" GOTMPDIR="" GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64" GCCGO="gccgo" CC="gcc" CXX="g++" CGO_ENABLED="1" CGO_CFLAGS="-g -O2" CGO_CPPFLAGS="" CGO_CXXFLAGS="-g -O2" CGO_FFLAGS="-g -O2" CGO_LDFLAGS="-g -O2" PKG_CONFIG="pkg-config" GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build356255000=/tmp/go-build -gno-record-gcc-switches"

What did you do?

used context.WithTimeout() to make requests to google calendar api and outlook calendar api

If possible, provide a recipe for reproducing the error. A complete runnable program is good. A link on play.golang.org is best.

What did you expect to see?

Making requests using contexts with timeouts should cancel when the timeout is reached

What did you see instead?

Contexts with timeouts are instantly failing with "context canceled" even though the timeout is set to time.Minute. The error goes away if I remove the timeout context and use one without any limit. It also seems to be transient to some extent

ianlancetaylor commented 6 years ago

Sorry, but there isn't enough information here to say anything useful. Please show us your code, or tell us how to recreate the problem. Thanks.

matthewceravolo commented 6 years ago
import (
    "net/http"
    "os"
    "time"

    "golang.org/x/net/context"
    "golang.org/x/oauth2"
    "golang.org/x/oauth2/google"
)

var (
    backgroundContext = context.Background()
    oauthConfig       = &oauth2.Config{
        ClientID:     os.Getenv("GOOGLE_PROJECT_CLIENT_ID"),
        ClientSecret: os.Getenv("GOOGLE_PROJECT_SECRET_KEY"),
        Endpoint:     google.Endpoint,
    }
)

type calendarClient struct {
    *calendar.Service
    requestTimeout time.Duration
}

// NewCalendarClient creates a new authenticated Google Calendar client.
func NewCalendarClient(refreshToken string) (wf.Client, error) {
    client, err := newClient(refreshToken)
    if err != nil {
        return nil, err
    }

    service, err := calendar.New(client)
    if err != nil {
        return nil, err
    }

    return &calendarClient{Service: service, requestTimeout: time.Minute}, nil
}

func newClient(refreshToken string) (*http.Client, error) {
    contextWithTimeout, cancel := context.WithTimeout(backgroundContext, time.Minute)
    defer cancel()
    return oauthConfig.Client(contextWithTimeout, &oauth2.Token{RefreshToken: refreshToken}), nil
}

func (client *calendarClient) GetCalendars() (*calendar.CalendarList, error) {
    contextWithTimeout, cancel := context.WithTimeout(backgroundContext, time.Minute)
    defer cancel()
    return client.CalendarList.List().Context(contextWithTimeout).Do()
}

The issue is that when we call GetCalendars for example, it will instantly return "context canceled". When I use an empty context with no timeout, everything works as expected. This all began failing in the past couple of days since updates were made in the /x/net package

meirf commented 6 years ago

I think the behavior you're seeing is expected. I don't see a Go bug.

NewClient (called by oauthConfig.Client):

The returned client is not valid beyond the lifetime of the context.

Your code has:

func newClient(refreshToken string) (*http.Client, error) {
    contextWithTimeout, cancel := context.WithTimeout(backgroundContext, time.Minute)
    defer cancel()
    return oauthConfig.Client(contextWithTimeout, &oauth2.Token{RefreshToken: refreshToken}), nil
}

The defer cancel() causes the context to be cancelled before the client even gets to calendar.New.

"When I use an empty context with no timeout, everything works as expected. This all began failing in the past couple of days since updates were made in the /x/net package." You are describing to things that might have changed in your code (no longer using empty context, changes in x/net). If my theory above is correct: if you undo the x/net changes (to only change one variable) you'll see that any non-empty context will still have the problem you're seeing.

bcmills commented 6 years ago

I've seen this sort of issue crop up several times now.

I wonder if context.Context should record a bit of caller information (say, the PC of the immediate calling function) in context.WithTimeout and in calls to the CancelFunc returned by context.WithCancel. Then we could add a debugging hook to interrogate why a particular context.Context was cancelled.

Perhaps something along the lines of:

package context

// A DoneReasoner describes the reason a Context is done.
// The Context implementations returned by this package implement DoneReasoner.
type DoneReasoner interface {
    DoneReason() string
}

// DoneReason returns a human-readable description of the reason that ctx.Done() is closed,
// or the empty string if ctx.Done() is not closed.
func DoneReason(ctx context.Context) string {
    select {
    case <-ctx.Done():
    default:
        return ""
    }
    if r, ok := ctx.(DoneReasoner); ok {
            return r.DoneReason()
    }
    return ctx.Err().Error()
}
bcmills commented 6 years ago

(CC @Sajmani @bradfitz for Context debuggability.)

matthewceravolo commented 6 years ago

@meirf I think there was a misunderstanding in how I described the problem. This code has always been using two contexts like I described but only in the past two days since we pulled updates to the vendor package did it start failing. There were not any changes whatsoever related to any of this code for several months besides pulling in the package update.

Only after we pulled in the update and everything started failing, we switched to contexts without timeouts (still multiple) and started seeing things succeed. If your explanation is correct, the defer cancel was not previously working as expected

However, I believe your explanation is incorrect because the error message is returned when GetCalendars is called, not on client retrieval. We always receive a calendarClient back, so calendar.New is being called correctly

ianlancetaylor commented 6 years ago

Do you happen to have a record of which revision of the context package you were using previously, and which one you updated to?

bcmills commented 6 years ago

only in the past two days since we pulled updates to the vendor package did it start failing.

Which vendored packages did you update? (Just x/net, or also x/oauth2?)

matthewceravolo commented 6 years ago

I've narrowed down the precise commit where things began to fail, and there are more vendor updates here than I'd previously thought. It looks like both packages (x/net and x/oauth2) were updated and that the previous version was decently old such that it may be very difficult to narrow anything down, but here are a couple of samples from the revisions:

"checksumSHA1": "dr5+PfIRzXeN+l1VG+s0lea9qz8=", => "checksumSHA1": "GtamqiJoL7PGHsN454AoffBFMa8=", "path": "golang.org/x/net/context", => "path": "golang.org/x/net/context", "revision": "f5079bd7f6f74e23c4d65efa0f4ce14cbd6a3c0f", => "revision": "1e491301e022f8f977054da4c2d852decd59571f", "revisionTime": "2017-07-19T21:11:51Z" => "revisionTime": "2018-05-30T06:29:46Z"

"checksumSHA1": "hyK05cmzm+vPH1OO+F1AkvES3sw=", => "checksumSHA1": "xaHFy3C2djwUXtiURpX3/ruY8IA=", "path": "golang.org/x/oauth2", => "path": "golang.org/x/oauth2", "revision": "314dd2c0bf3ebd592ec0d20847d27e79d0dbe8dd", => "revision": "1e0a3fa8ba9a5c9eb35c271780101fdaf1b205d7", "revisionTime": "2016-12-14T09:25:55Z" => "revisionTime": "2018-06-02T17:55:33Z"

Also, I re-tested the above code by leaving everything the same and completely removing the second context.WithTimeout inside of GetCalendars such that only the client has a timeout configured, and I see the same error

bcmills commented 6 years ago

https://golang.org/cl/45370 is in that range and loosely matches @meirf's diagnosis. It looks like your vendor update uncovered a latent bug within your code.

matthewceravolo commented 6 years ago

I've confirmed that commenting out the defer cancel() inside of newClient does in fact successfully return our calendar data as expected, but it is still not matching what was diagnosed above since when the error occurs it is on GetCalendars, it isn't canceling prior to Calendar.New. I don't feel I have a great understanding of exactly what's happening here other than the pre-mature cancel seems to be killing the client's ability to make any future successful requests

Is the expectation that we exercise the client we retrieve to its fullest extent only to call cancel() at the last moment? And that if it's set on this client we shouldnt even need to bother with client.CalendarList.List().Context(), correct?

Can we rely on GC and just not bother holding onto the cancel func?

fraenkel commented 6 years ago

I think you have it backwards. Get rid of the context in newClient. I assume you want to have a timeout per service call rather than across the Client use.

matthewceravolo commented 6 years ago

@fraenkel I'm fine with whichever, but doesn't putting it on the client prevent me from needing to duplicate this context code across all calls?

bradfitz commented 6 years ago

@matthewceravolo, please don't retitle the bug. We have a convention for how we use titles.

It still reads above like this is user error and you got lucky in the past, so I'd like to keep this bug about thinking about ways to make this debugging easier for other users making similar errors.

Sajmani commented 6 years ago

@bcmills I wouldn't use type assertion to introspect a Context because they are wrapped so often. If you want to attach a Done reason to a Context, I'd use a Context value. This is overhead, though, so I'd make it opt-in in somehow. Perhaps provide a context.WithDebug method that adds a Context value for storing debugging info. The code in WithTimeout and WithCancel can check for this debug value to decide whether to walk the call stack and record any information there. The context debug value needs to be installed at the top of the context stack to be seen by all derived contexts, so I'd enable it in the request handler (say) with a parameter like ?debugcontext=1.

bcmills commented 6 years ago

I wouldn't use type assertion to introspect a Context because they are wrapped so often.

Good point. A Context.Value call with an exported (but opaque) key to retrieve the reason code should be fine.

If you want to attach a Done reason to a Context, I'd use a Context value. This is overhead, though, so I'd make it opt-in in somehow. Perhaps provide a context.WithDebug method that adds a Context value for storing debugging info. The code in WithTimeout and WithCancel can check for this debug value to decide whether to walk the call stack and record any information there.

Ironically, Context.Value is O(N) with the depth of Context nesting, whereas runtime.Caller(1) is O(1): the mere act of checking whether debugging is enabled may be significantly more costly than storing the debug information unconditionally.

Sajmani commented 6 years ago

I would store a *debug in the context to flatten the lookup, leaving it nil when debugging is disabled. Associating that debug value with an unexported context key allows the debug info to propagate through Context implementations outside the context package.

On Tue, Jul 17, 2018 at 3:52 PM Bryan C. Mills notifications@github.com wrote:

I wouldn't use type assertion to introspect a Context because they are wrapped so often.

Good point. A Context.Value call with an exported (but opaque) key to retrieve the reason code should be fine.

If you want to attach a Done reason to a Context, I'd use a Context value. This is overhead, though, so I'd make it opt-in in somehow. Perhaps provide a context.WithDebug method that adds a Context value for storing debugging info. The code in WithTimeout and WithCancel can check for this debug value to decide whether to walk the call stack and record any information there.

Ironically, Context.Value is O(N) with the depth of Context nesting, whereas runtime.Caller(1) is O(1): the mere act of checking whether debugging is enabled may be significantly more costly than storing the debug information unconditionally.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/golang/go/issues/26356#issuecomment-405705926, or mute the thread https://github.com/notifications/unsubscribe-auth/AJSK3TSfBWY21reU6GLskizNeC3fM3mEks5uHkBugaJpZM4VNsJ2 .

Xopherus commented 5 years ago

I wonder if errors might be a good solution for this problem? Rather than have duplicate information propagated in the context, just rely on the error returned when the context is cancelled to get a stack trace.

bcmills commented 5 years ago

@Xopherus, that's a different problem from the one here. The first error upon cancellation arises at a call to ctx.Err(), which is arbitrarily far away from the call to the cancel() function.

This issue centers on identifying the call to the cancel function, not the call to the Err method.

Xopherus commented 5 years ago

@bcmills I think they're actually both the same - at least the implementation would the same for context.WithCancel. The timeout/deadline contexts would have slightly different implementations.

The cancelFunc returned by context.WithCancel is an anonymous function which sends an error to c.cancel which triggers the cancellation (and ultimately gets returned by ctx.Err()). That error is currently a global var, but what if we instead created that error within the cancelFunc using the new experimental errors package. We'd get information about which cancelFunc was called within the error itself. I can imagine with some slight refactoring we may be able to do the same for the other contexts.

bcmills commented 5 years ago

@Xopherus, lots of code today uses == to compare against context.Canceled. We can't realistically change that to instantiate a new error for each call site — it would be a massive breaking change.

Xopherus commented 5 years ago

Forgive my ignorance, but isn't that a big part of the errors experiment that's going on right now? I was under the impression that Go would be moving away from == for error inspection and to errors.As or errors.Is. I 100% agree with you that we should avoid breaking changes, so I'd like to find a way to do this in a graceful manner. Maybe that's already being discussed in other areas of the language?

bcmills commented 5 years ago

errors.Is would allow us to return per-instance errors in a new API, but we cannot assume that existing code (using the existing API) will be updated to use that check. Compatibility requires that we not break programs that rely on today's documented behavior.

neetle commented 4 years ago

IMHO it's a common pain point within the community to arbitrarily receive a context.Canceled and not know what to do with it. I feel like amending what context.WithCancel is the wrong solution though. The amount of code manually checking for cancellation is likely high.

Would a context.WithCanceler(context.Context) (cxt context.Context, canceler func(error)) be a decent compromise, with the argument to canceler propagated to all calls of ctx.Err()? (working title!)

There is likely to be concerns of memory leaks, but having documentation that asks users to use constants and corresponding go vet checks could work well.

bentcoder commented 4 years ago

I have same problem and don't know how to debug this. The "CURRENT" below randomly produces:

&{%!t(string=context canceled)}
*errors.errorString

CURRENT

// ctx is `*http.Request.Context` which come from the handler.
func Insert(ctx context.Context) error {
    ctx, cancel := context.WithTimeout(ctx, 3 * time.Millisecond)
    defer cancel()

    qry := `INSERT INTO users (name) VALUES (?)`

    res, err := r.database.ExecContext(ctx, qry, "hello")
    // ....
}

WORKS FINE

func Insert() error {
    ctx, cancel := context.WithTimeout(context.Background(), 3 * time.Millisecond)
    defer cancel()

    qry := `INSERT INTO users (name) VALUES (?)`

    res, err := r.database.ExecContext(ctx, qry, "hello")
    // ....
}
Sajmani commented 4 years ago

@BentCoder the difference between your CURRENT and WORKS FINE is the use of context.Background() instead of ctx. The context.Canceled error must be coming from ctx, since context.Background() is never canceled. (If the WithTimeout had expired, the error would be context.DeadlineExceeded).

julien-may commented 3 years ago

Are there any updates on how to continue with this topic?

While I like the proposals from @bcmills to add a DoneReasoner or @neetle's one to add a description parameter to the CancelFunc to describe the reason for canceling a context, it only once gets beneficial when the ecosystem uses them in order to improve logging.

Adding a "stacktrace" as a context value on the other hand would already allow us to see where a new context was created and when resolving tha stacktrace context value, we could also see which one got canceled.

I am available to create a PR for this.

nya3jp commented 3 years ago

My project has been suffering from this issue, so I really like to see it resolved.

I think https://github.com/golang/go/issues/35791#issuecomment-649660330 by @Sajmani is a great summary of compatibility requirements. Unfortunately @neetle's proposed context.WithCanceler in https://github.com/golang/go/issues/26356#issuecomment-592342591 breaks API compatibility. @bcmills's proposed DoneReasoner in https://github.com/golang/go/issues/26356#issuecomment-404870718 keeps API compatibility, but IIUC optional interfaces are generally not liked (e.g. https://github.com/golang/go/issues/28728#issuecomment-532834952).

Just my two cents, but I have another idea. We can annotate errors if we could check if a deadline applied to a context was expired. I'm thinking of functions like this (pardon me for naive naming, it's just for explanation):

type CheckFunc func() bool

// WithCheckableDeadline is similar to WithDeadline, but it also returns CheckerFunc,
// which returns true if the current time is after t.
func WithCheckableDeadline(ctx Context, t time.Time) (Context, CheckFunc, CancelFunc)

// WithCheckableTimeout is similiar to WithCheckableDeadline above, but accepts a timeout.
func WithCheckableTimeout(ctx Context, d time.Duration) (Context, CheckFunc, CancelFunc)

We can use these functions to annotate errors on hitting timeouts, e.g.

func f(ctx context.Context) error {
  ctx, check, cancel := context.WithCheckableTimeout(ctx, 10 * time.Second)
  defer cancel()

  if err := someLongOperation(ctx); err != nil {
    if check() {
      return fmt.Errorf("f timeout reached: %w", err)
    }
    return err
  }
  return nil
}

Please note that calling CheckFunc is different from calling Deadline of an associated context and comparing it with the current time, because Deadline may return a deadline earlier than one applied with context.WithDeadline when a parent context has a deadline.

Sajmani commented 3 years ago

I'm going to think on this for a bit.

bullgare commented 3 years ago

Are there any news on it? I see a lot of "context canceled" errors which are not handled as there is no good way to investigate the root cause for those.

mikeschinkel commented 3 years ago

I am running into this trying to track down why a newly generated Buffalo app gives me context cancelled when trying to run in terminal on macOS, but not when running within GoLand where I could more easily debug it.:-(

bullgare commented 3 years ago

Yes, agree. On one of our enterprise systems we have random spikes from time to time. And it's very hard to prove network administrators that it's network related because I'm not sure 100% about the root cause.

jtolio commented 3 years ago

For those of you on this thread looking for an immediate partial solution to finding even where the context.Canceled error itself came from, using an errors.Is-compatible library that wraps errors with stack traces could be a good solution. You would need to make sure all of your code uses errors.Is and errors.As instead of error equality tests, and you would need to return Wrap(ctx.Err()) instead of ctx.Err() directly, but it would otherwise give you stack traces. I personally like http://pkg.go.dev/github.com/zeebo/errs/v2. This admittedly doesn't tell you who called cancel().

bullgare commented 3 years ago

For me, it is potentially returned from the standard library (http). So I am not controlling it.

neetle commented 3 years ago

Half the issue tends to be that we're compensating for poorly-written code here, rather than well-written code. "Just wrap the error" doesn't give an appropriate error when the context is cancelled by another thread entirely (see: spooky-action-at-a-distance)

Sajmani commented 3 years ago

I had less time to work on this than I hoped, sorry. Unassigning myself to open this up for others.

paulmil1 commented 2 years ago

this is the problem i have currently and its beyond annoying, i question myself: "Who the hell canceled my context and why?" and it all happens in 3rd party code you cant control, good luck me at debugging this on a remote environment

paulmil1 commented 2 years ago

imo context's Cancel() should receive argument 'reason string' so its somewhat debuggable, currenetly theres just stupid 'context canceled' constant that does not say anything about the context of cancellation

sfllaw commented 2 years ago

imo context's Cancel() should receive argument 'reason string' so its somewhat debuggable, currenetly theres just stupid 'context canceled' constant that does not say anything about the context of cancellation

46273 suggests this as context.WithCancelReason.

gopherbot commented 2 years ago

Change https://golang.org/cl/375977 mentions this issue: context: prototype capturing the call stack of WithDeadline/Timeout/Cancel

Sajmani commented 2 years ago

I prototyped an API-compatible idea to just capture the call stack of WithCancel/Timeout/Deadline and provide a way to access it as a string. Would this be enough? We'd have to measure the time & space overheads: https://go-review.googlesource.com/c/go/+/375977

mitar commented 2 years ago

I would really like that one could pass optional value to cancel and then one can include whatever you want there (an error, an error with a stack trace, a string). So the same as how you can pass to panic something.

andreimatei commented 2 years ago

I prototyped an API-compatible idea to just capture the call stack of WithCancel/Timeout/Deadline and provide a way to access it as a string. Would this be enough?

It's much better than nothing but, in the context of CockroachDB, it's not ideal. We too would like to pass a string as the cancellation reason. The stack trace will likely be quite ambiguous in many places, as various short-circuits will result in the same defer cancel() to fire. I'm glad this issue is on the radar, though. It's a major pain for us, and a constant source of fights about the technicalities of how to hack around it.

Sajmani commented 2 years ago

@andreimatei @mitar I can prototype a way to pass in the string; my focus here is keeping the existing API backwards-compatible. My interest in using the stack trace is that it improves debuggability for all existing code you don't need to change any of the cancelation sites to start getting the stack traces. This comes at the expense of capturing a stack trace at each call to WithCancel/Timeout/Deadline. If that's mostly happening close to RPCs and I/O operations, this might be fine.

andreimatei commented 2 years ago

My interest in using the stack trace is that it improves debuggability for all existing code you don't need to change any of the cancelation sites to start getting the stack traces. This comes at the expense of capturing a stack trace at each call to WithCancel/Timeout/Deadline. If that's mostly happening close to RPCs and I/O operations, this might be fine.

Oh, I had misunderstood what you change does. I though it captures a stack when cancel() is called, not when WithCancel() is called. The latter I think would be less useful for us than the former. I imagine the former should be doable?

This comes at the expense of capturing a stack trace at each call to WithCancel/Timeout/Deadline.

I also hadn't fully internalized this. A call to runtime.Callers() would likely be too expensive for the CockroachDB code. I've just measured it on a beefy Linux machine, and for a few frames it costs 350ns per call, and the cost seems to go up steadily with the number of frames. I think this would be too much :S. We make plenty of cancelable contexts. Also I think your patch would do one extra allocation for the callers buffer, in addition to the cancelCtx alloc. That's be unfortunate; we're already suffering significantly because contexts basically always escape to the heap - they're our number one source of allocations, as measured by allocation source. Consider allocating the two together; something like:

type cancelCtx struct {
   ...
   callers []uintptr
   callersAlloc [30]uitptr
}

func newCancelCtx(parent Context) cancelCtx {
        ctx = new(cancelCtx)
        ctx.Context = parent
    ctx.callers := ctx.callersAlloc[:]
    runtime.Callers(2, ctx.callers) // skip Callers and newCancelCtx
    return ctx
}
Sajmani commented 2 years ago

Capturing the stack trace at the cancel() call is likely to be uninteresting: usually cancel is called either via timer.AfterFunc (in the case of a deadline) or via defer, which may make it harder to debug if there are multiple defers in the same function. However, if your code often calls cancel directly from functions other than the ones that called WithCancel, I can see the use of capturing the stack trace at the cancel() call.

The same question arises for passing a string: should the string be passed to WithCancel/Deadline/Timeout, or to cancel? In either case, this would be a pretty messy API extension, so my preference would be to avoid making any API changes at all. I understand your performance concerns, though. Ideally we'd make this zero-cost by default.

neetle commented 2 years ago

If you're looking to pass a reason there's nothing stopping you from creating a custom context implementation.

My core want with this would be to deal with poorly-written functions that take a context. Having an opt-in higher up the stack would likely be a decent trade off between performance and ease of use.

Sajmani commented 2 years ago

@neetle Yes, we could provide a way to enable stack-trace capture via a parent context, and propagate bit that down to child context (e.g. in propagateCancel). So this would be off by default, and users could enable it by setting a parent context with ctx = context.EnableStackTraces(ctx) or somesuch. This would turn on stack tracing for all child contexts, providing people with programmatic control over the whole tree (e.g. to debug specific requests), and it avoids having to add "reason" strings everywhere WithCancel or WithDeadline are called (though in theory there shouldn't be a huge number of these, either).

jtolio commented 2 years ago

I like the current proposal (context.EnableStackTraces).

May I add though, that ultimately if a "WithReason" style method is added, can I request that it take an error type instead of a string? The error could be wrapped with go1.13 errors package wrapping so what ctx.Err() returns is still errors.Is(err, context.Canceled), but that way more structured information could also be passed than just a string.

mitar commented 2 years ago

So we are then all arguing that #46273 should be reopen?