grpc / grpc-go

The Go language implementation of gRPC. HTTP/2 based RPC
https://grpc.io
Apache License 2.0
20.9k stars 4.34k forks source link

Add cancel cause when client sends RST_STREAM frame #7541

Open dastrobu opened 1 month ago

dastrobu commented 1 month ago

Use case(s) - what problem will this feature solve?

When a client resets a connection, the http2_server cancels the current context to interrupt all "server internal processing". Relevant code is t.closeStream(s, false, 0, false) and s.cancel().

When there is a big context stack within the server's logic, it can be quite hard to find out why the context was canceled. Was it due to some internal error? Timeout? Or is it due to a client sending the RST frame?

A real-world problem analysis is described in https://github.com/grafana/tempo/issues/3957.

Proposed Solution

I am proposing to replace context.CancelFunc with context.CancelCauseFunc and adjust all context cancellations to report the cause.

For the example described above, it could be something like this

func (t *http2Server) handleRSTStream(f *http2.RSTStreamFrame) {
    // If the stream is not deleted from the transport's active streams map, then do a regular close stream.
    if s, ok := t.getStream(f); ok {
        t.closeStream(s, false, 0, false, errors.New("RST_STREAM frame received"))
        return
    }

// ... 

func (t *http2Server) closeStream(s *Stream, rst bool, rstCode http2.ErrCode, eosReceived bool, cause error) {
    // In case stream sending and receiving are invoked in separate
    // goroutines (e.g., bi-directional streaming), cancel needs to be
    // called to interrupt the potential blocking on other goroutines.
    s.cancel(cause)

Alternatives Considered

nil

Additional Context

nil

arjan-bal commented 4 weeks ago

Hi @dastrobu, thanks for submitting this proposal. It seems useful to me, I'm discussing this with other maintainers to see if we want to make this change.

arjan-bal commented 4 weeks ago

@dastrobu are you trying to propagate the reason for context cancellation to service method handlers? If yes, can you describe some use cases in which the handlers would react differently to different causes?

If the cause is not intended for method handlers, can you explain how it will be used?

dastrobu commented 3 weeks ago

@arjan-bal sure, let me try to detail the described use case a bit.

The Grafana Tempo distribute exposes a gRPC API to consume data.
In the handler, the context is propagated and some goroutines do some data processing in parallel.
When one of the goroutines fails (for whatever reason), it cancels the context, which in turn cancels all other goroutines.

In the handler, the error is reported so that operations teams can react to it.
However, the error message is just saying "context canceled". So what happened?

  1. The client sent a RST frame. So the server is behaving correctly. The client may have canceled intentionally or the client is somehow misbehaving.
  2. The server had an internal error and canceled the context. The ops team needs to investigate and fix it.

The proposal would make it transparent if the context cancellation was due to a client or a server error.

To illustrate, I tried to sketch a simplified version of the code:

func handle(ctx context.Context, d []string) {
    err := processData(ctx, d)
    if err != nil {
        fmt.Printf("Error processing data: %v\n", err) // Error processing data: context cancelled
    }
}

func processData(ctx context.Context, d []string) error {
    ctx, cancel := context.WithCancelCause(ctx)
    defer cancel(nil)

    doneChan := make(chan struct{}, 1)
    errChan := make(chan error, 1)

    var wg sync.WaitGroup
    wg.Add(len(d))
    for _, di := range d {
        go func() {
            defer wg.Done()
            err := doStuff(ctx, di)
            if err != nil {
                cancel(err) // interrupt other goroutines
                errChan <- err
                return
            }
        }()
    }

    go func() {
        wg.Wait()
        doneChan <- struct{}{}
    }()

    select {
    case err := <-errChan:
        return err
    case <-doneChan:
        return nil
    case <-ctx.Done():
        return context.Cause(ctx)
    }
}

In the real-world situation described in https://github.com/grafana/tempo/issues/3957, it took us several weeks to find out that context cancellation was actually caused by a misconfigured client.
We did not find the reason until patching the Go SDK to log stack traces of context cancellation.
I would like to avoid this hassle for my team and others in the future, and I think implementing this small feature could contribute a lot.