grpc / grpc-go

The Go language implementation of gRPC. HTTP/2 based RPC
https://grpc.io
Apache License 2.0

'use of closed network connection', what's the best practice to deal with this? #7388

Closed zejunlitg closed 2 months ago

zejunlitg commented 3 months ago

error:

rpc error: code = Unavailable desc = connection error: desc = "transport: failed to write client preface: write tcp x.x.x.x:52310->x.x.x.x:9166: use of closed network connection"

It's my belief that this is caused by the underlying TCP connection being closed on the client side while the client still tried to write to it. Apparently this is a random issue -- in the environment I have, I cannot reproduce it at all.

My questions are:

  1. What's the best practice for handling this error? I'm aware gRPC has built-in retry support; I've been fiddling with it for an hour and still can't get it to work. This is the example I'm referring to: https://github.com/grpc/grpc-go/blob/master/examples/features/retry/client/main.go -- the only difference is that it uses grpc.NewClient() while I'm using grpc.Dial() and then creating the client from the conn it returns.
  2. With gRPC's built-in retry, can it detect the closed network connection, create a new one, and use that new connection for the retry?
  3. Is there a way to reproduce this issue with gRPC-go? What I tried is closing the underlying TCP connection with gdb's call close(fd) (ref: https://incoherency.co.uk/blog/stories/closing-a-socket.html). When the TCP connection closes, the gRPC call gets stuck for some reason. I was expecting it to notice that the network connection is closed and throw the error, but it does not.

gRPC version: 1.57.2

Thank you very much for the help.

purnesh42H commented 3 months ago

Thanks @zejunlitg for the question. I will take a look and get back to you

purnesh42H commented 3 months ago

It's my belief that this is caused by the fact that the underlying TCP connection is closed on the client side, but client side tried still to write on it.

Could you clarify what you mean by the above? The "client preface" is the string that must be sent by a client on every new connection. This error indicates a failure while trying to write that initial client message (the client preface) to establish the gRPC connection. The specific error "use of closed network connection" suggests that the TCP connection was closed unexpectedly.
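For reference, the client preface is just a fixed byte sequence that must be written first on every new HTTP/2 connection; a minimal sketch printing it (assuming the golang.org/x/net/http2 package, which exports it as a constant):

package main

import (
    "fmt"

    "golang.org/x/net/http2"
)

func main() {
    // The HTTP/2 client preface, which gRPC writes first on every new connection.
    fmt.Printf("%q\n", http2.ClientPreface) // "PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"
}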

zejunlitg commented 3 months ago

@purnesh42H AFAIK, this error happens within golang's net package:

conn, err := net.Dial("tcp", ":8888")
if err != nil {
  log.Println("dial error:", err)
  return
}

// close the connection here
conn.Close()

// then try to write over the connection; this returns the error
// 'write tcp x.x.x.x:PORT_SRC->x.x.x.x:PORT_DST: use of closed network connection'
buf := []byte("ping")
n, err := conn.Write(buf)
log.Println(n, err)

That's why I said the connection is closed on the client side -- I hope this clarifies things.

I agree that this happens unexpectedly -- that's exactly what happened. Can you help me understand more about what I can do when it happens? Do I:

  1. retry calling the same RPC
  2. re-create the gRPC client, then call the same RPC with the new client
  3. instead of retrying manually as in points 1 & 2, use gRPC's built-in retry configuration?

Which one is preferred, and why does it work?

purnesh42H commented 2 months ago

@zejunlitg please refer to the retry documentation for more details, if you haven't already.

Meanwhile, could you provide more details on the following?

  1. The retry example code with your modifications (if any)
  2. What is the reason for the transport failure? Is there anything wrong with the server? See How to turn on logging

zejunlitg commented 2 months ago

@purnesh42H I've read the retry documentation and it does not answer my question; that's why I'm posting here for an answer from the devs. Unless I missed it in the doc, to be very explicit, the question is: does gRPC's retry handle the case where the network connection gets unexpectedly closed? This involves implementation details that the doc does not reveal.

RE 1: I copied the retry policy from the Go example:

var retryPolicy = `{
    "methodConfig": [{
        // config per method or all methods under service
        "name": [{"service": "grpc.examples.echo.Echo"}],
        "waitForReady": true,

        "retryPolicy": {
            "MaxAttempts": 4,
            "InitialBackoff": ".01s",
            "MaxBackoff": ".01s",
            "BackoffMultiplier": 1.0,
            // this value is grpc code
            "RetryableStatusCodes": [ "UNAVAILABLE" ]
        }
    }]
}`

And then the example uses this API, grpc.NewClient():

conn, err := grpc.NewClient(*addr, grpc.WithTransportCredentials(insecure.NewCredentials()), grpc.WithDefaultServiceConfig(retryPolicy))

The only difference in my case is that I'm using this:

grpc.Dial(endPoint, DialOptions()...)

and here are the options we're using; I'm plugging grpc.WithDefaultServiceConfig(retryPolicy) in here:

func DialOptions() []grpc.DialOption {
    bc := backoff.DefaultConfig
    bc.MaxDelay = 5 * time.Second
    return []grpc.DialOption{
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithConnectParams(grpc.ConnectParams{
            Backoff: bc,
        }),
        grpc.WithDefaultCallOptions(CallOptions()...),
        // the retry policy from above is plugged in here
        grpc.WithDefaultServiceConfig(retryPolicy),
    }
}

RE 2: no idea about the reason. From the server log, the RPC call is never received -- we have an interceptor set up that logs when an RPC is received and when it finishes, and normally an incoming call would be logged. When this issue happened, no relevant log was found on the server side. As I mentioned before, this is a rare issue that's difficult to reproduce. Regardless, we still want to know the recommended course of action: we can manually call the same RPC after some sleep, or we can use gRPC's built-in retry mechanism. I'm still not sure whether the former or the latter would work, so any insight you can provide is appreciated.

purnesh42H commented 2 months ago

Thanks for the details. I will get back to you on transport retries. Meanwhile, to answer your other question, one way to reproduce the client preface write failure is to provide a custom dialer that returns a net.Conn with an overridden Write(). See WithContextDialer.
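A minimal sketch of such a dialer (untested; the target address is a placeholder, and returning net.ErrClosed from Write is just one way to simulate the closed socket):

package main

import (
    "context"
    "net"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

// brokenConn wraps a real net.Conn but fails every Write, so the transport's
// attempt to write the client preface errors out much like a closed socket would.
type brokenConn struct {
    net.Conn
}

func (c *brokenConn) Write(b []byte) (int, error) {
    return 0, net.ErrClosed // simulate writing on a closed connection
}

func dialBroken(ctx context.Context, addr string) (net.Conn, error) {
    var d net.Dialer
    conn, err := d.DialContext(ctx, "tcp", addr)
    if err != nil {
        return nil, err
    }
    return &brokenConn{Conn: conn}, nil
}

func main() {
    // "localhost:9166" is a placeholder target; grpc.Dial works the same way on older releases.
    conn, err := grpc.NewClient("localhost:9166",
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithContextDialer(dialBroken),
    )
    if err != nil {
        panic(err)
    }
    defer conn.Close()
    // RPCs issued on conn will now fail when the preface write fails,
    // surfacing as code = Unavailable, which is what the retry policy keys on.
}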

purnesh42H commented 2 months ago

@zejunlitg in the retry example, the client's retry policy has UNAVAILABLE in RetryableStatusCodes, which is the status code for a client preface write failure, so the client will retry. As mentioned above, you can verify this by providing your own custom dialer that returns a net.Conn with a failing Write().

So, to answer your question, retry policies are the recommended way for dealing with transient failures. However, the recommended approach is to fetch the retry configuration (which is part of the service config) from the name resolver rather than defining it on the client side.

Feel free to reopen the issue if you have any more questions.