golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

x/net/http2: make Transport return nicer error when Amazon ALB hangs up mid-response? #18639

Open bfallik opened 7 years ago

bfallik commented 7 years ago

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

$ go version
go version go1.8rc1 darwin/amd64

What operating system and processor architecture are you using (go env)?

Linux AMD64

What did you do?

If possible, provide a recipe for reproducing the error. A complete runnable program is good. A link on play.golang.org is best.

We have http client code that has started to return errors when the corresponding server uses HTTP2 instead of HTTP.

What did you expect to see?

Identical behavior.

What did you see instead?

http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=NO_ERROR, debug=""
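
For context, a minimal sketch of the kind of client code involved; the endpoint URL here is a placeholder, and any plain HTTPS GET through net/http (which negotiates HTTP/2 automatically over TLS) can surface this error once the server hangs up mid-response:

    package main

    import (
        "io"
        "log"
        "net/http"
    )

    func main() {
        // Plain GET via the default client; HTTP/2 is negotiated automatically over TLS.
        resp, err := http.Get("https://alb.example.com/endpoint") // placeholder URL
        if err != nil {
            // Depending on when the server hangs up, the GOAWAY error can surface
            // here or while reading the response body below.
            log.Fatal(err)
        }
        defer resp.Body.Close()
        if _, err := io.Copy(io.Discard, resp.Body); err != nil {
            log.Fatal(err)
        }
    }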

itcuihao commented 2 years ago

hi, same error.

GODEBUG=http2debug=2 go run m.go 
2022/03/17 18:03:39 http2: Transport failed to get client conn for db.ams.op-mobile.com:443: http2: no cached connection was available
2022/03/17 18:03:40 http2: Transport creating client conn 0xc000001680 to 82.145.213.10:443
2022/03/17 18:03:40 http2: Framer 0xc00031fa40: wrote SETTINGS len=18, settings: ENABLE_PUSH=0, INITIAL_WINDOW_SIZE=4194304, MAX_HEADER_LIST_SIZE=10485760
2022/03/17 18:03:40 http2: Framer 0xc00031fa40: wrote WINDOW_UPDATE len=4 (conn) incr=1073741824
2022/03/17 18:03:40 http2: Transport encoding header ":authority" = "db.ams.op-mobile.com"
2022/03/17 18:03:40 http2: Transport encoding header ":method" = "GET"
2022/03/17 18:03:40 http2: Transport encoding header ":path" = "/user?uids=b9bd8d50dca2e64f136aefadf99fa82111105498,……"
2022/03/17 18:03:40 http2: Transport encoding header ":scheme" = "https"
2022/03/17 18:03:40 http2: Transport encoding header "accept-encoding" = "gzip"
2022/03/17 18:03:40 http2: Transport encoding header "user-agent" = "Go-http-client/2.0"
2022/03/17 18:03:40 http2: Framer 0xc00031fa40: wrote HEADERS flags=END_STREAM|END_HEADERS stream=1 len=4810
2022/03/17 18:03:41 http2: Framer 0xc00031fa40: read SETTINGS len=18, settings: MAX_CONCURRENT_STREAMS=128, INITIAL_WINDOW_SIZE=65536, MAX_FRAME_SIZE=16777215
2022/03/17 18:03:41 http2: Transport received SETTINGS len=18, settings: MAX_CONCURRENT_STREAMS=128, INITIAL_WINDOW_SIZE=65536, MAX_FRAME_SIZE=16777215
2022/03/17 18:03:41 http2: Framer 0xc00031fa40: wrote SETTINGS flags=ACK len=0
2022/03/17 18:03:41 http2: Framer 0xc00031fa40: read WINDOW_UPDATE len=4 (conn) incr=2147418112
2022/03/17 18:03:41 http2: Transport received WINDOW_UPDATE len=4 (conn) incr=2147418112
2022/03/17 18:03:41 http2: Framer 0xc00031fa40: read SETTINGS flags=ACK len=0
2022/03/17 18:03:41 http2: Transport received SETTINGS flags=ACK len=0
2022/03/17 18:03:41 http2: Framer 0xc00031fa40: read GOAWAY len=8 LastStreamID=1 ErrCode=ENHANCE_YOUR_CALM Debug=""
2022/03/17 18:03:41 http2: Transport received GOAWAY len=8 LastStreamID=1 ErrCode=ENHANCE_YOUR_CALM Debug=""
2022/03/17 18:03:41 transport got GOAWAY with error code = ENHANCE_YOUR_CALM
2022/03/17 18:03:41 http2: Transport readFrame error on conn 0xc000001680: (*errors.errorString) EOF
2022/03/17 18:03:41 RoundTrip failure: http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=ENHANCE_YOUR_CALM, debug=""

CameronGo commented 2 years ago

If it helps anyone else out there, we solved this problem by changing our ALB config to use HTTP1 instead of HTTP2. It is obviously a workaround and not a fix, but it is effective for now until Go gets around to changing this behavior.

bradfitz commented 2 years ago

but it is effective for now until Go gets around to changing this behavior.

You misunderstand where the problem lies. See my earlier comment. This isn't something that Go can fix. ALB is hanging up in the middle of responses. This is an Amazon problem.

CameronGo commented 2 years ago

but it is effective for now until Go gets around to changing this behavior.

You misunderstand where the problem lies. See my earlier comment. This isn't something that Go can fix. ALB is hanging up in the middle of responses. This is an Amazon problem.

It has been a while since the long and painful process of troubleshooting this issue with AWS support so my memory may be faulty. I had 2 seemingly conflicting notes on this issue and how it was concluded. Needless to say, the result was a bit unsatisfying and resulted in our just disabling HTTP2 on our load balancers.

One note I had, which seemed to support the view that the issue was caused by the way AWS load balancers handle the HTTP/2 standard, said:

ALBs seem to send GOAWAY — as you linked, https://tools.ietf.org/html/rfc7540#section-6.8 — mid-stream with a last stream id, which is fine, then potentially more header/data frames for that last stream id, which is fine, but never sets the END_STREAM flag — https://tools.ietf.org/html/rfc7540#section-8.1 — on any frame in the stream before closing the connection, which is the problem. This is an error in the ALB implementation of the HTTP2 spec — it is dropping the connection with a stream in an open state — https://tools.ietf.org/html/rfc7540#section-5.1 — which golang is correctly handling as an "unexpected closed connection" error, albeit hidden beneath a "goaway" error.

But then I had some follow-up notes that appeared to imply that the way the Go HTTP/2 client was handling GOAWAY was what was responsible. They included these notes from AWS support:

the RFC does not state that after a GOAWAY Frame is sent, the sender (ALB) 'must' allow all Streams including-and-up-to the stream identifier included in the frame, to send END_STREAM before terminating the connection. These are separate concepts, the connection termination process 'should' allow for this to happen, but what if the GOAWAY Frame includes an error code and a stream identifier of 0 - It moves responsibility onto the client application to respond or retry in the appropriate manner.

Then I have some much less organized notes from internal discussions at the time, with some speculation about whether the issue is that this is being treated as an error when it should not be. I don't recall the context right now. Specifically, elsewhere I noted:

Since the server is sending NO_ERROR, your client should simply try to reconnect, and not treat the message as an error.

Regardless, I think some reality check is in order. AWS is not infallible, but when such a significant portion of the Internet depends on AWS, an issue like this should be brought to some sort of satisfactory conclusion. AWS is either RFC compliant or not, and there seem to be differing opinions on this; however, Go applications should be able to use services from the single largest cloud provider without encountering errors like this. I think it is noteworthy that I haven't seen this issue discussed on forums for any other languages.
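
For what it's worth, here is a minimal sketch of how a caller could at least recognize this condition, assuming golang.org/x/net/http2's exported GoAwayError ends up in the error chain (it may arrive wrapped, e.g. in *url.Error, and on some versions it may not unwrap cleanly, hence the string-match fallback):

    import (
        "errors"
        "strings"

        "golang.org/x/net/http2"
    )

    // isGoAway reports whether err looks like the "server sent GOAWAY" condition,
    // in which case the request may be a candidate for a retry on a new connection.
    func isGoAway(err error) bool {
        var ga http2.GoAwayError
        if errors.As(err, &ga) {
            // NO_ERROR and ENHANCE_YOUR_CALM indicate the server shed the
            // connection rather than rejecting the request itself.
            return ga.ErrCode == http2.ErrCodeNo || ga.ErrCode == http2.ErrCodeEnhanceYourCalm
        }
        // Fallback: match the error text produced by the transport.
        return err != nil && strings.Contains(err.Error(), "server sent GOAWAY")
    }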

jeffbarr commented 2 years ago

Let me know if you need a connection (no pun intended) inside of AWS.


bradfitz commented 2 years ago

@jeffbarr, gladly. I'm bradfitz on Twitter (DMs open) or gmail.com/golang.org. Thanks!

jeffbarr commented 2 years ago

I am working to find a good connection, stay tuned...

radutopala commented 2 years ago

We are also seeing this issue on ELBs in front of ElasticSearch/OpenSearch clusters:

http2: server sent GOAWAY and closed the connection; LastStreamID=19999, ErrCode=NO_ERROR, debug=""

bradfitz commented 2 years ago

@jeffbarr, thanks for the connection! Three of us hopped on a call the other day and were able to repro the issue on demand.

For the record, the tool we used for debugging was https://github.com/bradfitz/h2slam: we pointed it at an ALB, then changed certain ALB parameters on the AWS control plane, and the TCP connection from AWS would fail (within up to 10 seconds), often without even a GOAWAY.

I'll let AWS folk give further updates here.

radutopala commented 2 years ago

https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html

The load balancer sends a response code of 000

With HTTP/2 connections, if the compressed length of any of the headers exceeds 8K bytes or if the number of requests served through one connection exceeds 10,000, the load balancer sends a GOAWAY frame and closes the connection with a TCP FIN.

radutopala commented 2 years ago

Hitting the ElasticSearch ALB with https://github.com/bradfitz/h2slam gives

./h2slam --host xx.es.amazonaws.com --path /_cluster/settings > slam.log
2022/04/04 16:13:31 Get "https://xx.es.amazonaws.com/_cluster/settings": http2: Transport received Server's graceful shutdown GOAWAY
cat slam.log | wc -l
9999

which is actually the expected behavior from the AWS documentation's perspective. So the connection is closed at exactly the 10,000th request on the same connection.
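
For anyone who wants to reproduce this without h2slam, a rough sketch of the same kind of probe in plain net/http (the URL is a placeholder; the expectation, per the AWS documentation above, is that the connection is shed around the 10,000th request):

    import (
        "fmt"
        "io"
        "net/http"
    )

    // probe issues GETs over a single client, and therefore (ideally) a single
    // HTTP/2 connection, until one fails, reporting which request failed.
    func probe(url string) error {
        client := &http.Client{}
        for i := 1; ; i++ {
            resp, err := client.Get(url)
            if err != nil {
                return fmt.Errorf("request %d: %w", i, err)
            }
            io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
            resp.Body.Close()
        }
    }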

ebarlas commented 1 year ago

Hi folks, was there ever a resolution here?

I'm witnessing the same problematic behavior from the ALB. The GOAWAY frame seems to preempt the response data, which is never received by the application.

CameronGo commented 1 year ago

To ensure we didn't see this behavior, we ended up switching all of our ALB target groups to HTTP1; the error no longer occurs with that configuration. In addition, we made this change in our HTTP client, based on the following reference:


    // And also add the same thing to `Request.GetBody`, which allows
    // `net/http` to get a new body in cases like a redirect. This is
    // usually not used, but it doesn't hurt to set it in case it's
    // needed. See:
    //
    //     https://github.com/stripe/stripe-go/issues/710
    //
    req.GetBody = func() (io.ReadCloser, error) {
        reader := strings.NewReader(payload)
        return ioutil.NopCloser(reader), nil
    }
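
For completeness, a self-contained sketch of the same idea (the URL and payload are placeholders): setting Request.GetBody lets net/http obtain a fresh copy of the body if it replays the request, e.g. after a redirect or a retry on a new connection. Note that http.NewRequest already sets GetBody when the body is a *strings.Reader, *bytes.Reader, or *bytes.Buffer, so the explicit assignment mainly matters for other body types.

    import (
        "io"
        "net/http"
        "strings"
    )

    func newReplayableRequest(url, payload string) (*http.Request, error) {
        req, err := http.NewRequest(http.MethodPost, url, strings.NewReader(payload))
        if err != nil {
            return nil, err
        }
        // GetBody returns a fresh reader over the same payload each time
        // net/http needs to resend the request body.
        req.GetBody = func() (io.ReadCloser, error) {
            return io.NopCloser(strings.NewReader(payload)), nil
        }
        return req, nil
    }
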
ra-coder commented 12 months ago

My brief summary of the discussion:

This is an issue with how Go handles the behaviour of the AWS load balancer (or others) over an HTTP/2 connection.

https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html

The load balancer sends a response code of 000: With HTTP/2 connections, if the compressed length of any of the headers exceeds 8K bytes or if the number of requests served through one connection exceeds 10,000, the load balancer sends a GOAWAY frame and closes the connection with a TCP FIN.

So the root cause is either headers that are too heavy or "DDoS"-like load (high load at that moment) on the server/service that your Go app makes HTTP/2 requests to.

I suggest handling this error the same way as a 503 (Service Unavailable) status code in HTTP/1:

The 503 (Service Unavailable) status code indicates that the server is currently unable to handle the request due to a temporary overload or scheduled maintenance, which will likely be alleviated after some delay. The server MAY send a Retry-After header field to suggest an appropriate amount of time for the client to wait before retrying the request.
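
A hedged sketch of that suggestion: a caller-side retry that treats the GOAWAY error roughly like a transient 503, waiting briefly before reissuing the request. The retry count and delay are arbitrary, the string match is a stand-in for more precise error inspection, and this is only safe for idempotent requests whose body can be resent (no body, or GetBody set):

    import (
        "net/http"
        "strings"
        "time"
    )

    // doWithRetry reissues the request a few times when the transport reports
    // that the server sent GOAWAY and closed the connection.
    func doWithRetry(c *http.Client, req *http.Request) (*http.Response, error) {
        var resp *http.Response
        var err error
        for attempt := 0; attempt < 3; attempt++ {
            resp, err = c.Do(req)
            if err == nil || !strings.Contains(err.Error(), "server sent GOAWAY") {
                return resp, err
            }
            time.Sleep(time.Duration(attempt+1) * 500 * time.Millisecond)
        }
        return resp, err
    }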

3052 commented 5 months ago

I am getting this error consistently as well with certain servers. Disabling HTTP/2 seems to fix it:

client := http.Client{
   Transport: &http.Transport{
      // A non-nil but empty TLSNextProto map disables HTTP/2 support in the
      // Transport (imports: net/http and crypto/tls).
      Proxy:        http.ProxyFromEnvironment,
      TLSNextProto: map[string]func(string, *tls.Conn) http.RoundTripper{},
   },
}