golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.17k stars 17.57k forks source link

x/net/http2: make Transport return nicer error when Amazon ALB hangs up mid-response? #18639

Open bfallik opened 7 years ago

bfallik commented 7 years ago

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

$ go version go version go1.8rc1 darwin/amd64

What operating system and processor architecture are you using (go env)?

Linux AMD64

What did you do?

If possible, provide a recipe for reproducing the error. A complete runnable program is good. A link on play.golang.org is best.

We have http client code that has started to return errors when the corresponding server uses HTTP2 instead of HTTP.

What did you expect to see?

Identical behavior.

What did you see instead?

http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=NO_ERROR, debug=""

bfallik commented 7 years ago

To add some color on this, what I'm wondering is if go http clients are expected to handle the GOAWAY frames potentially returned from http2 servers or should the transports somehow manage this case similar to a remotely closed connection. Ideally our program would just re-try the request for a NO_ERROR+GOAWAY scenario.

I've asked a related question on SO and golang-nuts.

bradfitz commented 7 years ago

You don't provide enough information in this bug report.

Are you importing golang.org/x/net/http2 at all, or only using Go's net/http package?

This was fixed in December via https://github.com/golang/net/commit/8dab9293431a241c65a8e8d46bbaa0492ec94849

Are you perhaps using an old version of golang.org/x/net/http2 directly?

What HTTP method are these? If they're not idempotent requests, how are you creating the requests?

/cc @tombergan

bfallik commented 7 years ago

@bradfitz We're using the standard net/http package. I miscommunicated the version information. We originally observed this using go 1.7 but since then I was able to reproduce it using go1.8 beta1. Yesterday I upgraded to go 1.8rc1 but I'm not certain if I rebuilt the test case. I can retry the test case with 1.8rc1 if that includes https://github.com/golang/net/commit/8dab9293431a241c65a8e8d46bbaa0492ec94849.

The call stack is roughly:

        request, err := http.NewRequest("GET", endpoint.String(), nil)
        if err != nil {
                return nil, err
        }
        request.Header.Set("Accept", "application/json")
        if len(jwtToken) > 0 {
                request.Header.Add("Authorization", jwtToken)
        }
        response, err := http.DefaultClient.Do(request)
        if err != nil {
                return nil, errors.WithStack(err)
        }
        return response, nil
bfallik commented 7 years ago

I was just able to observe the same error using a test binary built with go1.8 rc1.

bradfitz commented 7 years ago

Can you capture the stderr output with GODEBUG=http2debug=2 set in your environment when it occurs?

How "rough" is that "roughly" call stack? Is it always a GET with no body?

bfallik commented 7 years ago

@bradfitz the NewRequest() line is coped verbatim, the body is always nil.

I can collect the stderr output later today or tomorrow.

bradfitz commented 7 years ago

And where do you see http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=NO_ERROR, debug=""? Is that stderr output, or the error returned by a specific call? Which?

bfallik commented 7 years ago

@bradfitz that is the error returned (and logged by us) after a specific call to http.Client.Do().

bfallik commented 7 years ago

@bradfitz below is stderr with GODEBUG=http2debug=2. I removed some sensitive data, replaced with 'XXX'. Hopefully those changes won't interfere.

2017/01/13 00:30:04 http2: Transport failed to get client conn for XXX.clypd.com:443: http2: no cached connection was available
2017/01/13 00:30:04 http2: Transport creating client conn 0xc420171380 to XXX:443
2017/01/13 00:30:04 http2: Framer 0xc4203ac8f0: wrote SETTINGS len=18, settings: ENABLE_PUSH=0, INITIAL_WINDOW_SIZE=4194304, MAX_HEADER_LIST_SIZE=10485760
2017/01/13 00:30:04 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE len=4 (conn) incr=1073741824
2017/01/13 00:30:04 http2: Transport encoding header ":authority" = "XXX.clypd.com"
2017/01/13 00:30:04 http2: Transport encoding header ":method" = "GET"
2017/01/13 00:30:04 http2: Framer 0xc4203ac8f0: read SETTINGS len=18, settings: MAX_CONCURRENT_STREAMS=128, INITIAL_WINDOW_SIZE=65536, MAX_FRAME_SIZE=16777215
2017/01/13 00:30:04 http2: Transport received SETTINGS len=18, settings: MAX_CONCURRENT_STREAMS=128, INITIAL_WINDOW_SIZE=65536, MAX_FRAME_SIZE=16777215
2017/01/13 00:30:04 http2: Transport encoding header ":path" = "XXX?end_date=01%2F12%2F2022&start_date=01%2F13%2F2017"
2017/01/13 00:30:04 http2: Transport encoding header ":scheme" = "https"
2017/01/13 00:30:04 http2: Transport encoding header "accept" = "application/json"
2017/01/13 00:30:04 http2: Transport encoding header "authorization" = "XXX.XXX.XXX"
2017/01/13 00:30:04 http2: Transport encoding header "accept-encoding" = "gzip"
2017/01/13 00:30:04 http2: Transport encoding header "user-agent" = "Go-http-client/2.0"
2017/01/13 00:30:04 http2: Framer 0xc4203ac8f0: wrote HEADERS flags=END_STREAM|END_HEADERS stream=1 len=265
2017/01/13 00:30:04 http2: Framer 0xc4203ac8f0: wrote SETTINGS flags=ACK len=0
2017/01/13 00:30:04 http2: Framer 0xc4203ac8f0: read WINDOW_UPDATE len=4 (conn) incr=2147418112
2017/01/13 00:30:04 http2: Transport received WINDOW_UPDATE len=4 (conn) incr=2147418112
2017/01/13 00:30:04 http2: Framer 0xc4203ac8f0: read SETTINGS flags=ACK len=0
2017/01/13 00:30:04 http2: Transport received SETTINGS flags=ACK len=0
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read HEADERS flags=END_HEADERS stream=1 len=255
2017/01/13 00:30:11 http2: decoded hpack field header field ":status" = "200"
2017/01/13 00:30:11 http2: decoded hpack field header field "date" = "Fri, 13 Jan 2017 00:30:11 GMT"
2017/01/13 00:30:11 http2: decoded hpack field header field "content-type" = "application/json; charset=utf-8"
2017/01/13 00:30:11 http2: decoded hpack field header field "x-frame-options" = "SAMEORIGIN"
2017/01/13 00:30:11 http2: decoded hpack field header field "x-xss-protection" = "1; mode=block"
2017/01/13 00:30:11 http2: decoded hpack field header field "x-content-type-options" = "nosniff"
2017/01/13 00:30:11 http2: decoded hpack field header field "etag" = "\"19a8a0bf510ccd2fc61c6f74d8bac94b\""
2017/01/13 00:30:11 http2: decoded hpack field header field "cache-control" = "max-age=0, private, must-revalidate"
2017/01/13 00:30:11 http2: decoded hpack field header field "x-request-id" = "530d5473-f35c-4dd8-9f08-0c0e62b8f4ea"
2017/01/13 00:30:11 http2: decoded hpack field header field "x-runtime" = "6.870694"
2017/01/13 00:30:11 http2: decoded hpack field header field "server" = "thin"
2017/01/13 00:30:11 http2: Transport received HEADERS flags=END_HEADERS stream=1 len=255
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=396 data="XXX (140 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=396 data="XXX (140 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE stream=1 len=4 incr=7680
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=1514 data="XXX (1258 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=1514 data="XXX (1258 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE stream=1 len=4 incr=8192
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE stream=1 len=4 incr=16384
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=1514 data="XXX (1258 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=1514 data="XXX (1258 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE stream=1 len=4 incr=12128
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=757 data="XXX (501 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=757 data="XXX (501 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE stream=1 len=4 incr=8949
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA stream=1 len=6264 data="XXX (6008 bytes omitted)
2017/01/13 00:30:11 http2: Transport received DATA stream=1 len=6264 data="XXX (6008 bytes omitted)
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: read DATA flags=END_STREAM stream=1 len=0 data=""
2017/01/13 00:30:11 http2: Transport received DATA flags=END_STREAM stream=1 len=0 data=""
2017/01/13 00:30:11 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE stream=1 len=4 incr=11691
2017/01/13 00:30:12 http2: Transport encoding header ":authority" = "XXX.clypd.com"
2017/01/13 00:30:12 http2: Transport encoding header ":method" = "GET"
2017/01/13 00:30:12 http2: Transport encoding header ":path" = "XXX?end_date=01%2F12%2F2022&start_date=01%2F13%2F2017"
2017/01/13 00:30:12 http2: Transport encoding header ":scheme" = "https"
2017/01/13 00:30:12 http2: Transport encoding header "authorization" = "XXX.XXX.XXX"
2017/01/13 00:30:12 http2: Transport encoding header "accept" = "application/json"
2017/01/13 00:30:12 http2: Transport encoding header "accept-encoding" = "gzip"
2017/01/13 00:30:12 http2: Transport encoding header "user-agent" = "Go-http-client/2.0"
2017/01/13 00:30:12 http2: Framer 0xc4203ac8f0: wrote HEADERS flags=END_STREAM|END_HEADERS stream=3 len=167
2017/01/13 00:30:12 http2: Framer 0xc4203ac8f0: read GOAWAY len=8 LastStreamID=3 ErrCode=NO_ERROR Debug=""
2017/01/13 00:30:12 http2: Transport received GOAWAY len=8 LastStreamID=3 ErrCode=NO_ERROR Debug=""
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read HEADERS flags=END_HEADERS stream=3 len=255
2017/01/13 00:30:18 http2: decoded hpack field header field ":status" = "200"
2017/01/13 00:30:18 http2: decoded hpack field header field "date" = "Fri, 13 Jan 2017 00:30:18 GMT"
2017/01/13 00:30:18 http2: decoded hpack field header field "content-type" = "application/json; charset=utf-8"
2017/01/13 00:30:18 http2: decoded hpack field header field "x-frame-options" = "SAMEORIGIN"
2017/01/13 00:30:18 http2: decoded hpack field header field "x-xss-protection" = "1; mode=block"
2017/01/13 00:30:18 http2: decoded hpack field header field "x-content-type-options" = "nosniff"
2017/01/13 00:30:18 http2: decoded hpack field header field "etag" = "\"19a8a0bf510ccd2fc61c6f74d8bac94b\""
2017/01/13 00:30:18 http2: decoded hpack field header field "cache-control" = "max-age=0, private, must-revalidate"
2017/01/13 00:30:18 http2: decoded hpack field header field "x-request-id" = "b834b8d7-cb49-45e5-a731-22ef0784220a"
2017/01/13 00:30:18 http2: decoded hpack field header field "x-runtime" = "6.018764"
2017/01/13 00:30:18 http2: decoded hpack field header field "server" = "thin"
2017/01/13 00:30:18 http2: Transport received HEADERS flags=END_HEADERS stream=3 len=255
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Transport received DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Transport received DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read DATA stream=3 len=1153 data="XXX (897 bytes omitted)
2017/01/13 00:30:18 http2: Transport received DATA stream=3 len=1153 data="XXX (897 bytes omitted)
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Transport received DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE stream=3 len=4 incr=7680
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE stream=3 len=4 incr=8192
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read DATA stream=3 len=757 data="XXX (501 bytes omitted)
2017/01/13 00:30:18 http2: Transport received DATA stream=3 len=757 data="XXX (501 bytes omitted)
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE stream=3 len=4 incr=9857
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Transport received DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE stream=3 len=4 incr=6527
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Transport received DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Transport received DATA stream=3 len=8192 data="XXX (7936 bytes omitted)
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read DATA stream=3 len=2271 data="XXX (2015 bytes omitted)
2017/01/13 00:30:18 http2: Transport received DATA stream=3 len=2271 data="XXX (2015 bytes omitted)
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: wrote WINDOW_UPDATE stream=3 len=4 incr=21077
2017/01/13 00:30:18 http2: Transport readFrame error on conn 0xc420171380: (*net.OpError) read tcp 10.10.64.169:51702->XXX:443: read: connection reset by peer
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: wrote RST_STREAM stream=3 len=4 ErrCode=CANCEL
bradfitz commented 7 years ago

@tombergan, I have a theory. Can you sanity check me?

Note that the error he's getting from RoundTrip is:

server sent GOAWAY and closed the connection

That comes only from http2/transport.go's GoAwayError.

But GoAwayError is only ever used by (*clientConnReadLoop).cleanup, where it's sent to any opens streams in the two range loops.

My theory is that when we get the initial GOAWAY and decide to retry the request on a new conn, we're forgetting to remove those streams from the originally-chosen ClientConn's activeRes and/or streams maps.

Plausible?

tombergan commented 7 years ago

My theory is that when we get the initial GOAWAY and decide to retry the request on a new conn, we're forgetting to remove those streams from the originally-chosen ClientConn's activeRes and/or streams maps.

You may have described a bug, but I don't think that's happening in this scenario. I see only two requests in the connection: stream=1 and stream=3. This is the second request:

2017/01/13 00:30:12 http2: Framer 0xc4203ac8f0: wrote HEADERS flags=END_STREAM|END_HEADERS stream=3 len=167
2017/01/13 00:30:12 http2: Framer 0xc4203ac8f0: read GOAWAY len=8 LastStreamID=3 ErrCode=NO_ERROR Debug=""
2017/01/13 00:30:12 http2: Transport received GOAWAY len=8 LastStreamID=3 ErrCode=NO_ERROR Debug=""
2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read HEADERS flags=END_HEADERS stream=3 len=255

Note that the GOAWAY frame comes just after sending request HEADERS on stream=3. Note also that the GOAWAY frame has LastStreamID=3. We do not close stream=3 because the server claims they might process it. We optimistically assume that we will get a response. https://github.com/golang/net/blob/69d4b8aa71caaaa75c3dfc11211d1be495abec7c/http2/transport.go#L554

We receive response HEADERS on stream=3 after the GOAWAY (6 seconds after the GOAWAY, in fact). We then receive a sequence of DATA frames. However, we never get END_STREAM. Instead, the server closes the connection. Their server is behaving legally ... I just double-checked the spec, and it's quite clear that the server is not required to process streams with id < GOAWAY.LastStreamId. The server MAY process those streams only partially.

There's not much we can do. We cannot retry the request after the connection closes because we've already received response headers and returned from RoundTrip. I have only two ideas, neither perfect:

  1. The error message can be improved. Perhaps io.UnexpectedEOF instead of "server sent GOAWAY"?

  2. This is a GET, so we could pessimistically retry the RoundTrip immediately after receiving the GOAWAY, rather than optimistically assuming the request will complete. I could see that being suboptimal in some cases, though.

bradfitz commented 7 years ago

He said that RoundTrip returned an error.

If this trace is really about streamid 3 yet we saw this:

2017/01/13 00:30:18 http2: Framer 0xc4203ac8f0: read HEADERS flags=END_HEADERS stream=3 len=255

... then RoundTrip should've returned something (END_HEADERS is set), and it's the Body.Read that would've returned the GOAWAY error.

So I still don't see a clear picture of what happened.

tombergan commented 7 years ago

I am stumped. I looked for suspicious uses of cc.mu on the guess that RoundTrip may be getting deadlocked. Found this suspicious defer cc.mu.Unlock (should that instead happen just after encodeTrailers?), but that shouldn't affect requests that don't have a request body.

@bfallik, can you triple-check that the error is coming from http.Client.Do, not Response.Body.Read? Assuming the error is coming from http.Client.Do, could you insert the following code in clientConnReadLoop.cleanup just before this line:

+buf := make([]byte, 16<<10)
+buf = buf[:runtime.Stack(buf, true)]
+cc.vlogf("stack after run()\n:%s", string(buf))
 err = GoAwayError{

It would also help to add a vlogs in clientConnReadLoop.processHeaders just before this return nil

 // We'd get here if we canceled a request while the
 // server had its response still in flight. So if this
 // was just something we canceled, ignore it.
+cc.vlog("processHeaders could not find stream with id %d", f.StreamID)
 return nil

and this return nil

 // (nil, nil) special case. See handleResponse docs.
+cc.vlog("processHeaders got (nil, nil) from handleResponse")
 return nil

After making the above changes, could you run the program until it fails with that same "server sent goaway" error, then copy the stack dump here? Thanks! It is fine to remove the call frames running your code if you need to keep those private.

bfallik commented 7 years ago

@tombergan Hi. I don't have the rest of the logging you requested but I do believe that the error is propagated from Response.Body.Read() and not http.Client.Do(). I slightly modified my test harness to split apart those operations and the stack trace clearly originates from:

    if err = json.NewDecoder(resp.Body).Decode(&v); err != nil {
        return nil, errors.WithStack(err)
    }

Do you still need that other debugging information or is there different info I can provide?

Also, just to be clear, this is the code I used to reproduce the error:

func checkResponseCode(r *http.Response, err error) (*http.Response, error) {
    if err != nil {
        return nil, err
    }
    if r.StatusCode >= http.StatusBadRequest {
        return nil, errors.WithStack(HTTPError{r})
    }
    return r, nil
}

func HTTPGet(endpoint url.URL, jwtToken string) (*http.Response, error) {
    request, err := http.NewRequest("GET", endpoint.String(), nil)
    if err != nil {
        return nil, err
    }
    request.Header.Set("Accept", "application/json")
    if len(jwtToken) > 0 {
        request.Header.Add("Authorization", jwtToken)
    }

    resp, err := checkResponseCode(http.DefaultClient.Do(request))
    if err != nil {
        return nil, err
    }

    var v interface{}
    if err = json.NewDecoder(resp.Body).Decode(&v); err != nil {
        return nil, errors.WithStack(err)
    }
    return nil, err
}

I think this code is missing a resp.Body.Close() and the original may be missing that as well. Could that be related?

tombergan commented 7 years ago

Well, that changes everything :) We're back to my earlier comment.

I think this code is missing a resp.Body.Close() and the original may be missing that as well. Could that be related?

No, I don't think that's related. I think the sequence of events is described by my linked comment above.

bradfitz commented 7 years ago

So there's no bug here, then.

Except maybe in the http2 server. Which server is this? The linked SO post suggests it's AWS? Is that a new implementation? Do we have any contacts there?

bradfitz commented 7 years ago

@jeffbarr, it seems the AWS ALB's new HTTP/2 support is behaving oddly, not finishing streams after sending a GOAWAY. Who's the right person on the AWS side to look into this? Thanks!

bfallik commented 7 years ago

@bradfitz yes, the server is AWS Application Load Balancer.

bfallik commented 7 years ago

@bradfitz assuming there's no bug client side do you recommend we explicitly catch the GOAWAY+NO_ERROR error and retry the request within our application logic? I'm still learning about http2 and so wasn't sure the expected behavior but now that seems like our only/best workaround until the server can be fixed.

bradfitz commented 7 years ago

@bfallik, you could I suppose. I wouldn't make it specific to that error, though. Even though ALB shouldn't do that, you could just treat it as any other type of network error or interrupted request and retry N times, especially if it's just a GET request.

In this case it's too late for the Go http client to retry since the headers have already been returned. If we did retry, it's possible the headers would be different, so we couldn't stitch the second response's body together with the first response's headers.

Ideally ALB would be fixed, though.

bfallik commented 7 years ago

@bradfitz OK, thanks.

bradfitz commented 7 years ago

I'll keep this bug open, but I'm not sure we can do anything.

Maybe we can make the error message more clear at least.

Hopefully AWS fixes this, if that's indeed the problem.

bfallik commented 7 years ago

@bradfitz OK. I've escalated a ticket with their support and will report back when I hear anything. I've also pointed them at this ticket since it contains a much deeper analysis of the problem from the client side.

bfallik commented 7 years ago

Amazon just let me know they don't believe the issue is on their side either. They've asked me to provide some sample Go code that reproduces the issue. I'll copy that code here too in case it's of value. It's entirely possible we've written something wrong on our side but that would be surprising since we didn't have any problems with HTTP 1.1.

bfallik commented 7 years ago

Much smaller (and complete) reproducible test case:

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "os"
    "time"
)

func HTTPGet(endpoint, jwtToken string) error {
    request, err := http.NewRequest("GET", endpoint, nil)
    if err != nil {
        return err
    }
    request.Header.Set("Accept", "application/json")
    if len(jwtToken) > 0 {
        request.Header.Add("Authorization", jwtToken)
    }

    resp, err := http.DefaultClient.Do(request)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode >= http.StatusBadRequest {
        return fmt.Errorf("HTTP error status %v", resp.StatusCode)
    }

    var v interface{}
    return json.NewDecoder(resp.Body).Decode(&v)
}

func main() {
    log.SetFlags(0)

    endpoint, token := os.Args[1], os.Args[2]

    i := 0
    for {
        log.Printf("round %d", i)
        if err := HTTPGet(endpoint, token); err != nil {
            log.Fatalf("HTTPGet(): %v", err)
        }
        i++
        time.Sleep(time.Second)
    }
    panic("return nil")
}
jbardin commented 7 years ago

@bfallik,

Does this still reproduce the issue if you remember to defer resp.Body.Close()?

bfallik commented 7 years ago

@jbardin yes, sadly. I was hoping your suggestion would expose the bug.

bfallik commented 7 years ago

i updated the program to include the defer.

bradfitz commented 7 years ago

@bfallik, can you email me privately with a URL & token I can use so I don't need to spend the time & learn how to set up ALB?

bfallik commented 7 years ago

@bradfitz sure

bradfitz commented 7 years ago

Same thing:

AWS replies with GOAWAY LastStreamID=13 and then sends the HEADERS and part of the DATA for Stream 13, but then closes the TCP connection before sending any END_STREAM bit on a StreamID 13 frame:

2017/01/19 16:33:46 round 6
2017/01/19 16:33:46 http2: Transport encoding header ":authority" = "..."
2017/01/19 16:33:46 http2: Transport encoding header ":method" = "GET"
2017/01/19 16:33:46 http2: Transport encoding header ":path" = "/..."
2017/01/19 16:33:46 http2: Transport encoding header ":scheme" = "https"
2017/01/19 16:33:46 http2: Transport encoding header "accept" = "application/json"
2017/01/19 16:33:46 http2: Transport encoding header "authorization" = "..."
2017/01/19 16:33:46 http2: Transport encoding header "accept-encoding" = "gzip"
2017/01/19 16:33:46 http2: Transport encoding header "user-agent" = "Go-http-client/2.0"
2017/01/19 16:33:46 http2: Framer 0xc420082270: wrote HEADERS flags=END_STREAM|END_HEADERS stream=13 len=8
2017/01/19 16:33:50 http2: Framer 0xc420082270: read GOAWAY len=8 LastStreamID=13 ErrCode=NO_ERROR Debug=""
2017/01/19 16:33:50 http2: Transport received GOAWAY len=8 LastStreamID=13 ErrCode=NO_ERROR Debug=""
2017/01/19 16:33:52 http2: Framer 0xc420082270: read HEADERS flags=END_HEADERS stream=13 len=256
2017/01/19 16:33:52 http2: decoded hpack field header field ":status" = "200"
2017/01/19 16:33:52 http2: decoded hpack field header field "date" = "Thu, 19 Jan 2017 16:33:52 GMT"
2017/01/19 16:33:52 http2: decoded hpack field header field "content-type" = "application/json; charset=utf-8"
2017/01/19 16:33:52 http2: decoded hpack field header field "x-frame-options" = "SAMEORIGIN"
2017/01/19 16:33:52 http2: decoded hpack field header field "x-xss-protection" = "1; mode=block"
2017/01/19 16:33:52 http2: decoded hpack field header field "x-content-type-options" = "nosniff"
2017/01/19 16:33:52 http2: decoded hpack field header field "etag" = "\"...\""
2017/01/19 16:33:52 http2: decoded hpack field header field "cache-control" = "max-age=0, private, must-revalidate"
2017/01/19 16:33:52 http2: decoded hpack field header field "x-request-id" = "..."
2017/01/19 16:33:52 http2: decoded hpack field header field "x-runtime" = "5.984035"
2017/01/19 16:33:52 http2: decoded hpack field header field "server" = "thin"
2017/01/19 16:33:52 http2: Transport received HEADERS flags=END_HEADERS stream=13 len=256
2017/01/19 16:33:52 http2: Framer 0xc420082270: read DATA stream=13 len=8192 data="..."
2017/01/19 16:33:52 http2: Transport received DATA stream=13 len=8192 data="..."
2017/01/19 16:33:52 http2: Framer 0xc420082270: read DATA stream=13 len=396 data="..."
2017/01/19 16:33:52 http2: Transport received DATA stream=13 len=396 data="..."
2017/01/19 16:33:52 http2: Framer 0xc420082270: wrote WINDOW_UPDATE stream=13 len=4 incr=7680
2017/01/19 16:33:52 http2: Framer 0xc420082270: read DATA stream=13 len=8192 data="..."
2017/01/19 16:33:52 http2: Transport received DATA stream=13 len=8192 data="..."
2017/01/19 16:33:52 http2: Framer 0xc420082270: read DATA stream=13 len=8192 data="..."
2017/01/19 16:33:52 http2: Transport received DATA stream=13 len=8192 data="..."
2017/01/19 16:33:52 http2: Framer 0xc420082270: read DATA stream=13 len=1514 data="..."
2017/01/19 16:33:52 http2: Framer 0xc420082270: wrote WINDOW_UPDATE stream=13 len=4 incr=8192
2017/01/19 16:33:52 http2: Transport received DATA stream=13 len=1514 data="..."
2017/01/19 16:33:52 http2: Framer 0xc420082270: wrote WINDOW_UPDATE stream=13 len=4 incr=9100
2017/01/19 16:33:52 http2: Framer 0xc420082270: read DATA stream=13 len=8192 data="..."
2017/01/19 16:33:52 http2: Transport received DATA stream=13 len=8192 data="..."
2017/01/19 16:33:52 http2: Framer 0xc420082270: read DATA stream=13 len=757 data="..."
2017/01/19 16:33:52 http2: Transport received DATA stream=13 len=757 data="..."
2017/01/19 16:33:52 http2: Framer 0xc420082270: wrote WINDOW_UPDATE stream=13 len=4 incr=7284
2017/01/19 16:33:52 http2: Transport readFrame error on conn 0xc420001860: (*net.OpError) read tcp 10.240.0.12:34533->54.174.137.204:443: read: connection reset by peer
2017/01/19 16:33:52 http2: Framer 0xc420082270: wrote RST_STREAM stream=13 len=4 ErrCode=CANCEL
2017/01/19 16:33:52 HTTPGet(): http2: server sent GOAWAY and closed the connection; LastStreamID=13, ErrCode=NO_ERROR, debug=""
exit status 1

You had mentioned in email that this happens predictably when you register a new instance in ALB.

If they're killing your instance when you do so, they're only giving your old instance 2 seconds to shut down gracefully.

Or maybe that's a configuration knob?

bfallik commented 7 years ago

Checking back in after a long absence but AWS finally responded to the support ticket I raised and the public message board post: https://forums.aws.amazon.com/thread.jspa?messageID=771883#771883.

They still contend that their implementation is spec-compliant and, by implication, Go's is not, though they haven't provided any details or answers to the questions raised in this ticket.

tombergan commented 7 years ago

I believe the Go and AWS implementations are both spec compliant. From my earlier comment:

We receive response HEADERS on stream=3 after the GOAWAY (6 seconds after the GOAWAY, in fact). We then receive a sequence of DATA frames. However, we never get END_STREAM. Instead, the server closes the connection. Their server is behaving legally ... I just double-checked the spec, and it's quite clear that the server is not required to process streams with id < GOAWAY.LastStreamId. The server MAY process those streams only partially.

To quote the spec more explicitly:

Endpoints SHOULD always send a GOAWAY frame before closing a connection so that the remote peer can know whether a stream has been partially processed or not. [note: SHOULD not MUST]

...

The last stream identifier from the last GOAWAY frame received indicates which streams could have been acted upon. [note: "could have been acted upon", not "was completely acted upon"]

I don't know why the server waits 6 seconds to send HEADERS after sending the GOAWAY. There may be something suboptimal happening in AWS, I don't have enough information to say. Popping up a level, it's always possible for requests to fail. It's just a fact of life. This is a GET request, so the request is idempotent. Can't you solve this problem by retrying the request at a higher level (hopefully with a retry limit and/or exponential backoff)? If that doesn't work because the request fails deterministically, then it sounds like there's either a bug in AWS or a bug in your usage of AWS, but in either case, it's not a Go bug.

tombergan commented 7 years ago

Based on JonZ's post on the amazon forum thread, it sounds like your backend server is taking too long to generate the response body, which causes the AWS load balancer to consider the connection idle then shut down the connection with GOAWAY. This sounds like an intentional policy decision of the ALB. So, I think the bug is that your backend server takes too long to generate the response body.

Leaving this open in case @bradfitz wants to repurpose this bug to improve the error message.

The error message can be improved. Perhaps io.UnexpectedEOF instead of "server sent GOAWAY"?

bfallik commented 7 years ago

@tombergan thanks for clarifying. I misunderstood comments https://github.com/golang/go/issues/18639#issuecomment-273829910 and https://github.com/golang/go/issues/18639#issuecomment-272545083, which I thought implied that the ALB was not behaving to spec.

Yes, we can retry the request within our app. One question I had was if this retry should really occur within the http package. My reasoning being that http2 is behaving differently from http and my expectation was that it was a drop-in replacement. But it seems based on yours and @bradfitz 's feedback that this is the right solution either (see https://github.com/golang/go/issues/18639#issuecomment-272545083).

We'll consider the issue from our side now, either by trying to speed up the HTTP responses or by implementing retries. Happy to leave this issue open if the error wording can be improved.

walterking commented 7 years ago

fwiw, we see the same behavior on iOS with ALB. linked to this issue in our support ticket and got the same response

SleepyBrett commented 6 years ago

We are seeing this error from some versions of kubectl (the kubernetes cli client, written in go, of course) since moving our api servers behind an ALB except that we are getting ENHANCE_YOUR_CALM.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0", GitCommit:"d3ada0119e776222f11ec7945e6d860061339aad", GitTreeState:"clean", BuildDate:"2017-06-29T23:15:59Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Unable to connect to the server: http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=ENHANCE_YOUR_CALM, debug=""
bradfitz commented 6 years ago

@SleepyBrett, tune your ALB I guess?

Or do you want a different error message on Go's side?

SleepyBrett commented 6 years ago

Tune it how? This thread seems to come down to a bunch of finger pointing and no movement, aws claims you are broken, you claim they are broken, I don't see a single comment that claims some magical tuning method for the alb. The fact that it works for some of our users and not others (consistently per client, I'm gathering information to see if I can correlate it to kubectl versions and/or the version of go that was used to compile it ) seems to be pointing to some kind of movement inside this library or potentially in how kubectl uses this library.

bradfitz commented 6 years ago

@SleepyBrett, I figured ALB would have some parameters for you to tell it to wait longer before closing client connections.

Looking at the traces above, Amazon is closing its TCP connections before finishing writing the responses.

If you want to send a trace, we can see if it's the same story.

SleepyBrett commented 6 years ago

Hey so we might be onto a new theory that could explain some of the conflicting results I'm see from my user survey.

A few users went they sent me their kubectl -v10 output had ENORMOUS auth tokens, I think the alb is bouncing them with an inaccurate error code because their token is violating the spec. The spec seems to say that each header field can only be up to 16k long.

sj26 commented 6 years ago

We've seen this "goaway" error quite commonly from our golang http/2 clients against our ALB backends and paid it no mind thinking it was just a "do a retry" error. But we noticed more recently that it seems like a deeper problem, especially now go's http client handles the retry cases. It seems to be as @bradfitz suggests, the ALB is correctly sending a GOAWAY, and the go http client is noticing, but then the ALB closes the connection before closing the in-flight http/2 stream by omitting an end_stream flag on the final frame — an "unexpected connection close" error. It also looks like the stream is completed, and the upstream has fully processed a request and response, it's just omitting the flag in the http/2 stream. This breaks the expected contract of a load balancer imho. I've chimed in on the aws forums issue already linked.

@bradfitz perhaps in this case the go http client needs to return an "unexpected connection close before stream end" error or something, instead of blaming the goaway?

bilalyasar commented 6 years ago

we are experiencing same issue with aws elb -->nginx --> golang service under kubernetes. it seems that payload is corrupted somehow.

froodian commented 5 years ago

Is there anything at all that can be done here? Amazon ALB isn't going away, and if it is technically spec-conforming (even if we view its behavior as sub-optimal), it would be nice for golang not to throw errors.

bradfitz commented 5 years ago

@froodian, what do you want me to do?

froodian commented 5 years ago

Is it safe to simply retry the request? Or no because it may have gone through and may not be idempotent? I suspect that the way I'll silence this at my layer is to retry any error where the message contains both "server sent GOAWAY and closed the connection" and "ErrCode=NO_ERROR"

bradfitz commented 5 years ago

@froodian, nope, what's happening is:

If ALB hadn't sent a response header then we could retry the request, but it's pretty weird for us to retry the request when we've already given the response headers to the user code. The only safe thing to do is retry the request and hope for exactly the "same" response headers and only if they "match", then continue acting like the original res.Body (which the Go user code is already reading from) is part of the second retried request.

But things like the server's Date header probably changed, so that at least needs to be ignored. What else?

What if ALB had already returned some of the response body bytes, but not all? Do we need to keep a checksum of bytes read and stitch together the two bodies if the body is same length and second response response's prefix bytes have the same checksum?

That would all be super sketchy.

It's better to just return an error, which we do. If the caller wants to retry, they can retry.

Do you just want a better error message? What text would sound good to you?

froodian commented 5 years ago

I see, yeah, thank you for that condensed write-up @bradfitz... that makes sense, and I agree it's correct from a client perspective not to retry when we've already been given response headers... I agree with @sj26 that a message like "unexpected connection close before stream end" or something long those lines might help indicate the problem a little more clearly.

But at a more root level, I also wonder if we could lean more heavily on AWS as a group to change their behavior - I agree it really seems like they should let that last response write out its whole body to EOF before they close the TCP connection... but I guess they want to have timeouts on their end too so that if the server's app code never EOFs the response body for some reason, they still clean up the TCP connection at some point, hence their current behavior...? I guess https://forums.aws.amazon.com/thread.jspa?messageID=771883#771883 appears to be the most recent public thread with AWS about this? But I also wonder if other conversations have gone on behind the scenes.

stephanwesten commented 5 years ago

I searched in our logs for GOAWAY and found a couple of thousand hits with the following message: Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=17611, ErrCode=NO_ERROR, debug=""

All hits have ErrCode=NO_ERROR afaict.

This seems harmless, can this be an info message instead of error?

bradfitz commented 5 years ago

@Nerothos, that looks unrelated to this bug. (ErrCode=PROTOCOL_ERROR, not ErrCode=NO_ERROR). I'd file a new bug.