grpc / grpc-go

The Go language implementation of gRPC. HTTP/2 based RPC
https://grpc.io
Apache License 2.0

Intermittent (but frequent) transport is closing errors #1722

Closed cristiangraz closed 6 years ago

cristiangraz commented 6 years ago

What version of gRPC are you using?

Client: 1.8.0 (5a9f7b402fe85096d2e1d0383435ee1876e863d0) Server: 1.8.0 (a4bf341022f076582fc2e9a802ce170a8938f81d)

What version of Go are you using (go version)?

Client: 1.9.2 Server: 1.9.1

What operating system (Linux, Windows, …) and version?

Client: Alpine linux 3.6 Server: Alpine linux 3.6

What did you do?

I'm seeing intermittent but very frequent transport is closing errors. Example setup below:

I added a 5-second poll of the gRPC connection state (conn.GetState()) and noticed the connection stayed at READY for hours, but as soon as I triggered some API calls it would go TRANSIENT_FAILURE -> CONNECTING -> READY. As I kept retrying API calls, sometimes 50+ requests in a row would succeed, and sometimes every other request or every few requests would fail again with the same error (logging transport is closing).
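The polling described above might look something like this sketch (the function name, interval argument, and shutdown check are illustrative, not from the original report; conn is assumed to be an established *grpc.ClientConn):

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// pollState logs the connection's state every interval until ctx is done.
// Illustrative sketch of the 5-second polling described above.
func pollState(ctx context.Context, conn *grpc.ClientConn, interval time.Duration) {
	for {
		// GetState reports e.g. connectivity.Ready, connectivity.Connecting,
		// or connectivity.TransientFailure.
		s := conn.GetState()
		log.Printf("conn state: %v", s)
		if s == connectivity.Shutdown {
			return
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(interval):
		}
	}
}
```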

I set the ENV var GRPC_GO_LOG_SEVERITY_LEVEL=info on the client, which logs this:

timestamp=2017-12-08T20:35:50.769455983Z level=error message="error retrieving foo" error="rpc error: code = Unavailable desc = transport is closing"
INFO: 2017/12/08 20:35:50 pickfirstBalancer: HandleSubConnStateChange: 0xc42028e3d0, TRANSIENT_FAILURE
INFO: 2017/12/08 20:35:50 pickfirstBalancer: HandleSubConnStateChange: 0xc42028e3d0, CONNECTING 
INFO: 2017/12/08 20:35:51 pickfirstBalancer: HandleSubConnStateChange: 0xc42028e3d0, READY

The documentation for the READY -> TRANSIENT_FAILURE transition says:

Any failure encountered while expecting successful communication on established channel.

But I can't determine what failure is causing the transport is closing error.

I'm not sure if it's a configuration issue (maybe the server's keepalive params need to align with the client's) or something else. Nothing in the recent commits jumped out as addressing this, so I'm curious whether anyone has an idea of where to look next to identify the cause.

Client connection:

opts := []grpc.DialOption{
    grpc.WithKeepaliveParams(keepalive.ClientParameters{
        Time: 120 * time.Second,
    }),
}
// creds is a grpc.DialOption using grpc.WithTransportCredentials
conn, err := grpc.Dial(`example.example.svc.cluster.local:80`, append(opts, creds)...)
if err != nil {
    return err
}

Server looks like this:

// In package main
creds, err := MutualTLSServer()
if err != nil {
    m.Logger().WithError(err).Fatal("cannot init grpc server credentials")
}
m.Server = NewServer(creds)

//
// NewServer() code referenced in package main:
//
// DefaultPort is the default port for the Server.
const DefaultPort = 3000

// Server represents a gRPC server.
type Server struct {
    *grpc.Server
    ln net.Listener

    Addr string

    FooServer    *FooServer
}

// NewServer returns a new instance of Server.
func NewServer(opt ...grpc.ServerOption) *Server {
    logger := NewLogger("foo")
    opt = append(opt, grpc.UnaryInterceptor(LoggingInterceptor(logger)))
    return &Server{
        Server: grpc.NewServer(opt...),
        Addr:   fmt.Sprintf(":%d", DefaultPort),

        FooServer: &FooServer{},
    }
}

// Open opens a port for listening.
func (s *Server) Open() error {
    foo.RegisterFooServiceServer(s.Server, s.FooServer)

    ln, err := net.Listen("tcp", s.Addr)
    if err != nil {
        return err
    }
    s.ln = ln

    go s.Serve(s.ln)
    return nil
}

// Close shuts the server down.
func (s *Server) Close() error {
    if s.ln != nil {
        return s.ln.Close()
    }
    return nil
}

We are running on GKE (Kubernetes 1.8.1-gke.1 for master and nodes), using Kubernetes service discovery to connect to service.service.svc.cluster.local:80.

What did you expect to see?

Successful gRPC calls.

What did you see instead?

transport is closing errors. See above.

MakMukhi commented 6 years ago

So the default keepalive enforcement policy on the server is 5 minutes: https://github.com/grpc/grpc-go/blob/cce0e436e5c428f2094edd926779788f1024938d/keepalive/keepalive.go#L62

Also, by default the client doesn't run keepalive unless there are active streams: https://github.com/grpc/grpc-go/blob/cce0e436e5c428f2094edd926779788f1024938d/keepalive/keepalive.go#L38

This explains why the connection is closed by the server soon after the client makes an RPC.

If you set the client to send keepalive pings every 120 seconds, make sure you set the server's enforcement policy accordingly.
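A minimal sketch (not from the thread; the function name is illustrative) of aligning the server's enforcement policy with a client that pings every 120 seconds:

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newKeepaliveServerOptions returns server options whose keepalive
// enforcement policy tolerates a client pinging every 120 seconds.
// Illustrative sketch; names are not from the original report.
func newKeepaliveServerOptions() []grpc.ServerOption {
	return []grpc.ServerOption{
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			// Allow pings as frequent as the client's 120s keepalive Time.
			// The default MinTime is 5 minutes, so a server left at the
			// default closes connections from clients pinging more often.
			MinTime: 120 * time.Second,
			// Set this if the client pings while it has no active streams.
			PermitWithoutStream: true,
		}),
	}
}
```

These options would be passed to grpc.NewServer alongside the credentials, as in the NewServer constructor shown earlier.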

Also, turning on transport-level logs by setting the verbosity level to 2 will help in seeing these messages: https://github.com/grpc/grpc-go/blob/cce0e436e5c428f2094edd926779788f1024938d/grpclog/grpclog.go#L26
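Concretely, that means exporting both grpclog environment variables before starting the client process:

```shell
# Enable info-level gRPC-Go logs plus transport-level detail (verbosity 2)
# in the environment of the client process.
export GRPC_GO_LOG_SEVERITY_LEVEL=info
export GRPC_GO_LOG_VERBOSITY_LEVEL=2
```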

Hope this helps.

cristiangraz commented 6 years ago

@MakMukhi Thank you! Removing the client keepalive params seems to have resolved the issue. I'll make sure those match up with the server if I change those in the future. This was even clearly mentioned in the docs 🤦‍♂️