Open dastrobu opened 1 month ago
Hi @dastrobu, thanks for submitting this proposal. It seems useful to me. I'm discussing it with the other maintainers to see if we want to make this change.
@dastrobu are you trying to propagate the reason for context cancellation to service method handlers? If yes, can you describe some use cases in which the handlers would react differently to different causes?
If the cause is not intended for method handlers, can you explain how it will be used?
@arjan-bal sure, let me try to detail the described use case a bit.
The Grafana Tempo distributor exposes a gRPC API to consume data.
In the handler, the context is propagated and some goroutines do some data processing in parallel.
When one of the goroutines fails (for whatever reason), it cancels the context, which in turn cancels all other goroutines.
In the handler, the error is reported so that operations teams can react to it.
However, the error message is just saying "context canceled". So what happened?
The proposal would make it transparent if the context cancellation was due to a client or a server error.
To illustrate, I tried to sketch a simplified version of the code:
import (
	"context"
	"fmt"
	"sync"
)

func handle(ctx context.Context, d []string) {
	err := processData(ctx, d)
	if err != nil {
		fmt.Printf("Error processing data: %v\n", err) // Error processing data: context canceled
	}
}

// processData runs doStuff (the per-item work, not shown) for every item in parallel.
func processData(ctx context.Context, d []string) error {
	ctx, cancel := context.WithCancelCause(ctx)
	defer cancel(nil)
	doneChan := make(chan struct{}, 1)
	// One slot per goroutine, so a failing goroutine never blocks on send.
	errChan := make(chan error, len(d))
	var wg sync.WaitGroup
	wg.Add(len(d))
	for _, di := range d {
		go func(di string) {
			defer wg.Done()
			err := doStuff(ctx, di)
			if err != nil {
				cancel(err) // interrupt other goroutines
				errChan <- err
				return
			}
		}(di)
	}
	go func() {
		wg.Wait()
		doneChan <- struct{}{}
	}()
	select {
	case err := <-errChan:
		return err
	case <-doneChan:
		return nil
	case <-ctx.Done():
		// If the parent context was canceled without a cause, this is just context.Canceled.
		return context.Cause(ctx)
	}
}
In the real-world situation described in https://github.com/grafana/tempo/issues/3957, it took us several weeks to find out that context cancellation was actually caused by a misconfigured client.
We did not find the reason until patching the Go SDK to log stack traces of context cancellation.
I would like to avoid this hassle for my team and others in the future, and I think implementing this small feature could contribute a lot.
Use case(s) - what problem will this feature solve?
When a client resets a connection, the http2_server cancels the current context to interrupt all "server internal processing". The relevant code is t.closeStream(s, false, 0, false) and s.cancel().
When there is a big context stack within the server's logic, it can be quite hard to find out why the context was canceled. Was it due to some internal error? A timeout? Or was it due to a client sending the RST frame?
A real-world problem analysis is described in https://github.com/grafana/tempo/issues/3957.
Proposed Solution
I am proposing to replace context.CancelFunc with context.CancelCauseFunc and adjust all context cancellations to report the cause.
For the example described above, it could be something like this:
Alternatives Considered
nil
Additional Context
nil