grpc / grpc-go

The Go language implementation of gRPC. HTTP/2 based RPC
https://grpc.io
Apache License 2.0

Intermittent (but frequent) transport is closing errors #1722

Closed cristiangraz closed 6 years ago

cristiangraz commented 6 years ago

What version of gRPC are you using?

Client: 1.8.0 (5a9f7b402fe85096d2e1d0383435ee1876e863d0) Server: 1.8.0 (a4bf341022f076582fc2e9a802ce170a8938f81d)

What version of Go are you using (go version)?

Client: 1.9.2 Server: 1.9.1

What operating system (Linux, Windows, …) and version?

Client: Alpine linux 3.6 Server: Alpine linux 3.6

What did you do?

I'm seeing intermittent but very frequent transport is closing errors. Example setup below:

I added a 5-second poll of the gRPC connection state (conn.GetState()) and noticed the connection stayed at READY for hours, but as soon as I triggered some API calls it would go TRANSIENT_FAILURE -> CONNECTING -> READY. As I kept retrying API calls, sometimes 50+ requests in a row would succeed, and sometimes every other request or every few requests would fail again with the same error (logging transport is closing).
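The polling described above might look something like this sketch (the function name, interval argument, and shutdown check are illustrative, not from the original report; conn is assumed to be an established *grpc.ClientConn):

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// pollState logs the connection's state every interval until ctx is done.
// Illustrative sketch of the 5-second polling described above.
func pollState(ctx context.Context, conn *grpc.ClientConn, interval time.Duration) {
	for {
		// GetState reports e.g. connectivity.Ready, connectivity.Connecting,
		// or connectivity.TransientFailure.
		s := conn.GetState()
		log.Printf("conn state: %v", s)
		if s == connectivity.Shutdown {
			return
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(interval):
		}
	}
}
```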

I set the ENV var GRPC_GO_LOG_SEVERITY_LEVEL=info on the client, which logs this:

timestamp=2017-12-08T20:35:50.769455983Z level=error message="error retrieving foo" error="rpc error: code = Unavailable desc = transport is closing"
INFO: 2017/12/08 20:35:50 pickfirstBalancer: HandleSubConnStateChange: 0xc42028e3d0, TRANSIENT_FAILURE
INFO: 2017/12/08 20:35:50 pickfirstBalancer: HandleSubConnStateChange: 0xc42028e3d0, CONNECTING 
INFO: 2017/12/08 20:35:51 pickfirstBalancer: HandleSubConnStateChange: 0xc42028e3d0, READY

The documentation for the READY -> TRANSIENT_FAILURE transition says:

Any failure encountered while expecting successful communication on established channel.

But I can't determine what failure is causing the transport is closing error.

I'm not sure if it's a configuration issue (maybe the server's keepalive params need to align with the client's) or something else. Nothing in the recent commits jumped out as addressing this, so I'm curious whether anyone has an idea of where to look next to identify the cause.

Client connection:

opts := []grpc.DialOption{
    grpc.WithKeepaliveParams(keepalive.ClientParameters{
        Time: 120 * time.Second,
    }),
}
// creds is a grpc.DialOption using grpc.WithTransportCredentials
conn, err := grpc.Dial(`example.example.svc.cluster.local:80`, append(opts, creds)...)
if err != nil {
    return err
}

Server looks like this:

// In package main
creds, err := MutualTLSServer()
if err != nil {
    m.Logger().WithError(err).Fatal("cannot init grpc server credentials")
}
m.Server = NewServer(creds)

//
// NewServer() code referenced in package main:
//
// DefaultPort is the default port for the Server.
const DefaultPort = 3000

// Server represents a gRPC server.
type Server struct {
    *grpc.Server
    ln net.Listener

    Addr string

    FooServer    *FooServer
}

// NewServer returns a new instance of Server.
func NewServer(opt ...grpc.ServerOption) *Server {
    logger := NewLogger("foo")
    opt = append(opt, grpc.UnaryInterceptor(LoggingInterceptor(logger)))
    return &Server{
        Server: grpc.NewServer(opt...),
        Addr:   fmt.Sprintf(":%d", DefaultPort),

        FooServer: &FooServer{},
    }
}

// Open opens a port for listening.
func (s *Server) Open() error {
    foo.RegisterFooServiceServer(s.Server, s.FooServer)

    ln, err := net.Listen("tcp", s.Addr)
    if err != nil {
        return err
    }
    s.ln = ln

    go s.Serve(s.ln)
    return nil
}

// Close shuts the server down.
func (s *Server) Close() error {
    if s.ln != nil {
        return s.ln.Close()
    }
    return nil
}

We are running on GKE (Kubernetes 1.8.1-gke.1 for master and nodes), using Kubernetes service discovery to connect to service.service.svc.cluster.local:80.

What did you expect to see?

Successful gRPC calls.

What did you see instead?

transport is closing errors. See above.

MakMukhi commented 6 years ago

So the default keepalive enforcement policy on the server is 5 minutes: https://github.com/grpc/grpc-go/blob/cce0e436e5c428f2094edd926779788f1024938d/keepalive/keepalive.go#L62

Also, by default the client doesn't run keepalive unless there are active streams: https://github.com/grpc/grpc-go/blob/cce0e436e5c428f2094edd926779788f1024938d/keepalive/keepalive.go#L38

This explains why the connection is closed by the server soon after the client makes an RPC.

If you set the client to send keepalive pings every 120 seconds, make sure you set the server's enforcement policy accordingly.
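A minimal sketch (not from the thread; the function name is illustrative) of aligning the server's enforcement policy with a client that pings every 120 seconds:

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newKeepaliveServerOptions returns server options whose keepalive
// enforcement policy tolerates a client pinging every 120 seconds.
// Illustrative sketch; names are not from the original report.
func newKeepaliveServerOptions() []grpc.ServerOption {
	return []grpc.ServerOption{
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			// Allow pings as frequent as the client's 120s keepalive Time.
			// The default MinTime is 5 minutes, so a server left at the
			// default closes connections from clients pinging more often.
			MinTime: 120 * time.Second,
			// Set this if the client pings while it has no active streams.
			PermitWithoutStream: true,
		}),
	}
}
```

These options would be passed to grpc.NewServer alongside the credentials, as in the NewServer constructor shown earlier.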

Also, turning on transport-level logs by setting the verbosity level to 2 will help in seeing these messages: https://github.com/grpc/grpc-go/blob/cce0e436e5c428f2094edd926779788f1024938d/grpclog/grpclog.go#L26
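Concretely, that means exporting both grpclog environment variables before starting the client process:

```shell
# Enable info-level gRPC-Go logs plus transport-level detail (verbosity 2)
# in the environment of the client process.
export GRPC_GO_LOG_SEVERITY_LEVEL=info
export GRPC_GO_LOG_VERBOSITY_LEVEL=2
```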

Hope this helps.

cristiangraz commented 6 years ago

@MakMukhi Thank you! Removing the client keepalive params seems to have resolved the issue. I'll make sure those match up with the server if I change those in the future. This was even clearly mentioned in the docs 🤦‍♂️