golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

x/net/http2: client sends too many pings for gRPC servers since v0.31.0 #70575

Open RemiBou opened 14 hours ago

RemiBou commented 14 hours ago

Go version

1.23.3

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOCACHE='/root/.cache/go-build'
GOENV='/root/.config/go/env'
GOEXE=''
GOEXPERIMENT='boringcrypto'
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/root/.gvm/pkgsets/go1.23.3/global/pkg/mod'
GONOPROXY='none'
GONOSUMDB='...'
GOOS='linux'
GOPATH='/root/.gvm/pkgsets/go1.23.3/global'
GOPRIVATE='...'
GOPROXY='...'
GOROOT='/root/.gvm/gos/go1.23.3'
GOSUMDB='off'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/root/.gvm/gos/go1.23.3/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.23.3'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/root/.config/go/telemetry'
GCCGO='gccgo'
GOARM64='v8.0'
AR='ar'
CC='.../aarch64-linux-musl-cross/bin/aarch64-linux-musl-gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='.../go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1336983329=/tmp/go-build -gno-record-gcc-switches'

What did you do?

We use traefik as a library for our service mesh.

What did you see happen?

Since upgrading x/net to v0.31.0, we see this error on our instance, which is deployed as a sidecar of a gRPC Java server:

2024/11/26 10:32:41 httputil: ReverseProxy read error during body copy: http2: server sent GOAWAY and closed the connection; LastStreamID=561, ErrCode=ENHANCE_YOUR_CALM, debug="too_many_pings"

Downgrading x/net to v0.30.0 fixes the issue.

I think it might be due to the new ping added here:

https://github.com/golang/net/commit/f35fec92ec9213ee211cf45f451a5970386f7978#diff-e9bd9b4a514c2960ad85405e4a827d83c2decaaac70e9e0158a59db2d05807a7R3149 by @neild

You can find the framer logs (GODEBUG=http2debug=2) here: framerlog.log

This message is returned by gRPC Java here:

https://github.com/grpc/grpc-java/blob/a79982c7fded85ac6f53896f045e959b6cbd7c43/netty/src/main/java/io/grpc/netty/NettyServerHandler.java#L983

which runs this function: https://github.com/grpc/grpc-java/blob/master/core/src/main/java/io/grpc/internal/KeepAliveEnforcer.java#L57

This basically refuses a ping if one was sent less than two hours earlier.

What did you expect to see?

The stream to be kept open rather than reset.


GoVeronicaGo commented 14 hours ago

cc: @neild @tombergan

seankhliao commented 9 hours ago

The gRPC algorithm is at https://github.com/grpc/proposal/blob/master/A8-client-side-keepalive.md#server-enforcement

neild commented 8 hours ago

Two hours is certainly...a choice.

The default limit seems to be five minutes if there are outstanding requests, which is still an eternity. There doesn't seem to be any reasonable and safe way to health check an unresponsive Java (possibly any) gRPC connection. (Well, other than to send junk requests, such as OPTIONS *. So far as I can tell, those are fine, despite being more expensive than cheap PING frames.)

Not sure what the correct thing to do here is.

seankhliao commented 8 hours ago

I used to run into similar issues with other reverse proxies as well. In the past I've disabled enforcement on the server side; in Go that's https://pkg.go.dev/google.golang.org/grpc@v1.68.0/keepalive#EnforcementPolicy. I think that may be a reasonable thing to do if gRPC no longer has a direct mapping of the underlying streams?
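For a Go gRPC server, that server-side relaxation might look like this (a minimal sketch; the values are illustrative, not recommendations):

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServer builds a gRPC server whose keepalive enforcement is relaxed,
// so a proxy's health-check PINGs are not punished with
// GOAWAY/ENHANCE_YOUR_CALM ("too_many_pings").
func newServer() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             15 * time.Second, // accept pings as often as every 15s
			PermitWithoutStream: true,             // even when no RPCs are active
		}),
	)
}
```

This only helps when you control the server; it does nothing for clients talking to servers (like the gRPC Java sidecar above) that keep the default policy.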

neild commented 8 hours ago

Some options:

  1. Roll back https://go.dev/cl/617655. I dislike this, because I think that change fixes real problems.
  2. Add a GODEBUG to disable sending PINGs other than when explicitly configured to do so. Inconvenient for users, but something we can do today.
  3. Add a knob controlling whether to send non-ReadIdleTimeout PINGs. (DisablePingFrames?) Nobody likes more knobs, and this would need to go through the proposal process. Maybe something we should do, but not an immediate fix.
  4. Detect the ENHANCE_YOUR_CALM/"too_many_pings" closure and remember not to send PINGs on connections to that authority. Fiddly. One connection still needs to drop before we stop pinging. Can't say I like this.

gRPC Java resets its "too many pings" counter every time it sends headers, data, or trailers. You can send as many pings as you want, as long as you wait for the server to send something after the ping response. Perhaps there's some way we can take advantage of this, but I haven't figured out how yet.
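For context, the client-side health-check pinging that predates CL 617655 is already opt-in via fields on x/net/http2's Transport; option 3 would add a separate knob for the new, non-ReadIdleTimeout PINGs. A sketch of the existing knobs (values illustrative):

```go
package main

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	t := &http2.Transport{
		// If no frame is received for this duration, send a health-check PING.
		ReadIdleTimeout: 30 * time.Second,
		// Close the connection if the PING ack doesn't arrive in time.
		PingTimeout: 15 * time.Second,
	}
	client := &http.Client{Transport: t}
	_ = client
}
```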

RemiBou commented 5 hours ago

@neild from what I understood, this change is trying to work around a TCP issue: when an application isn't gracefully shut down, or there is a network problem, requests just time out.

Maybe it's not the role of the http2 client to overcome this? gRPC Java closed a similar problem: https://github.com/grpc/grpc-java/issues/8380. To me the standard solution is to decrease the TCP retries on the environment (Elasticsearch suggests this).

On a side note, Traefik hasn't bumped x/net yet, and Istio hasn't released the bump (though it is in master). Once they release it, this issue might become more important.