grpc / grpc-dotnet

gRPC for .NET
Apache License 2.0

"static dns resolver resolve failure" cause grpc client can not recover #2332

Open someview opened 11 months ago

someview commented 11 months ago

I mentioned a similar problem in another issue: the client cannot recover from "Resource temporarily unavailable".

What version of gRPC and what language are you using?

grpc client 2.59

What operating system (Linux, Windows,...) and version?

k8s 1.25, Linux Alpine image 3.18

What runtime / compiler are you using (e.g. .NET Core SDK version dotnet --info)

dotnet 8.0

What did you do?

  1. Create a gRPC client like this:
        internal T CreateContract()
        {
            var handler = new SocketsHttpHandler
            {
                EnableMultipleHttp2Connections = true,
                PooledConnectionIdleTimeout = TimeSpan.FromSeconds(60),
                ConnectTimeout = TimeSpan.FromSeconds(5),
            }; 
            var address = "http://xxx.com";  // a DNS name that resolves outside k8s
            var channel = GrpcChannel.ForAddress(address, new GrpcChannelOptions
            {
                HttpHandler = handler,
                Credentials = ChannelCredentials.Insecure,   
            });
            var contract = channel.CreateGrpcService<T>();
            return contract;
        }
  2. Run this client program in k8s, using Alpine 3.18 as the base image.
  3. Set the k8s CoreDNS deployment replicas to 0 to make DNS temporarily unavailable.
  4. View the program log:
        Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="Error connecting to subchannel.", DebugException="System.Net.Sockets.SocketException: Resource temporarily unavailable")
         ---> System.Net.Sockets.SocketException (11): Resource temporarily unavailable
           at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
           at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
           at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|281_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
           at Grpc.Net.Client.Balancer.Internal.SocketConnectivitySubchannelTransport.TryConnectAsync(ConnectContext context)
           --- End of inner exception stack trace ---
           at Grpc.Net.Client.Balancer.Internal.ConnectionManager.PickAsync(PickContext context, Boolean waitForReady, CancellationToken cancellationToken)
           at Grpc.Net.Client.Balancer.Internal.BalancerHttpHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
           at Grpc.Net.Client.Internal.GrpcCall`2.RunCall(HttpRequestMessage request, Nullable`1 timeout)
           at ProtoBuf.Grpc.Internal.Reshape.UnaryTaskAsyncImpl[TRequest,TResponse](AsyncUnaryCall`1 call, MetadataContext metadata, CancellationToken cancellationToken) in /_/src/protobuf-net.Grpc/Internal/Reshape.cs:line 560
  5. Set the k8s CoreDNS replicas back to 1.
  6. The gRPC client still logs the same error.

How do we know this is a gRPC problem?

  1. We investigated the grpc-dotnet code around socket connection, e.g. Grpc.Net.Client.Balancer.SubChannel.ConnectTransportAsync and SocketConnectivitySubchannelTransport.TryConnectAsync, and wrote a simple test that connects with a plain socket in the same way:

        static async Task TestSocket() {
            while (true) {
                var cancellationTokenSource = new CancellationTokenSource();
                var cancellationToken = cancellationTokenSource.Token;
                cancellationTokenSource.CancelAfter(millisecondsDelay: 3000);
                Exception? firstConnectionError = null;
                Socket socket;
                string host = "xxx.com"; // a url outside k8s
                int port = 80;
                EndPoint endPoint = new DnsEndPoint(host, port);
                socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
                try
                {
                    await socket.ConnectAsync(endPoint, cancellationToken);
                }
                catch (Exception ex)
                {
                    Console.WriteLine($"Exception: {ex.Message}");
                    socket.Dispose();
                    firstConnectionError = ex;
                    if (firstConnectionError is OperationCanceledException oce &&
                        oce.CancellationToken == cancellationToken)
                    {
                        Console.WriteLine("this is timeout error");
                    }
                }
            }
        }

    We found that after CoreDNS recovers, this plain socket connects correctly again. Since the raw socket recovers but the gRPC client does not, we believe the problem is related to grpc-dotnet.

  2. With the same gRPC client program, but using Debian 11 as the base image, the client works normally again once the DNS service recovers. This suggests that the difference in behavior comes from the C library's DNS resolver (musl on Alpine versus glibc on Debian). This is how grpc-dotnet handles the exception:

           try { ... }
           catch (Exception ex) {
                // Socket is recreated every connect attempt. Explicitly dispose failed socket before next attempt.
                socket.Dispose();
    
                SocketConnectivitySubchannelTransportLog.ErrorConnectingSocket(_logger, _subchannel.Id, currentEndPoint, ex);
    
                if (firstConnectionError == null)
                {
                    firstConnectionError = ex;
                }
    
                // Stop trying to connect to addresses on cancellation.
                if (context.CancellationToken.IsCancellationRequested)
                {
                    break;
                }
            }
        }
    
        var result = ConnectResult.Failure;
    
        // Check if cancellation happened because of timeout.
        if (firstConnectionError is OperationCanceledException oce &&
            oce.CancellationToken == context.CancellationToken &&
            !context.IsConnectCanceled)
        {
            firstConnectionError = new TimeoutException("A connection could not be established within the configured ConnectTimeout.", firstConnectionError);
            result = ConnectResult.Timeout;
        }
    
        // All connections failed
        _subchannel.UpdateConnectivityState(
            ConnectivityState.TransientFailure,
            new Status(StatusCode.Unavailable, "Error connecting to subchannel.", firstConnectionError));
        lock (Lock)
        {
            if (!_disposed)
            {
                _socketConnectedTimer.Change(Timeout.InfiniteTimeSpan, Timeout.InfiniteTimeSpan);
            }
        }

    The following check may have a compatibility problem:

        if (firstConnectionError is OperationCanceledException oce &&
            oce.CancellationToken == context.CancellationToken &&
            !context.IsConnectCanceled)
        {
            firstConnectionError = new TimeoutException("A connection could not be established within the configured ConnectTimeout.", firstConnectionError);
            result = ConnectResult.Timeout;
        }

    However, this may be caused by the .NET runtime, so we also looked into how the .NET runtime handles socket DNS resolution and cancellation. There are some related issues, e.g. "https://github.com/dotnet/runtime/issues/81023" and "https://github.com/dotnet/runtime/issues/75889".

  3. With the same gRPC client program on the Alpine 3.18 base image, if we remove the ConnectTimeout setting from the SocketsHttpHandler, the gRPC client recovers once the DNS service recovers.
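The observation in step 3 can be sketched as a handler configuration. This is only an illustration of the workaround reported above, not a confirmed fix: `ChannelFactory`, `CreateHandler`, and `CreateChannel` are hypothetical names, and the only change from the original client is that `ConnectTimeout` is not set.

```csharp
using System;
using System.Net.Http;
using Grpc.Core;
using Grpc.Net.Client;

// Hypothetical helper class; the names are not part of grpc-dotnet.
static class ChannelFactory
{
    // Same settings as the original handler, except that ConnectTimeout
    // is left at its default value (Timeout.InfiniteTimeSpan).
    public static SocketsHttpHandler CreateHandler()
    {
        return new SocketsHttpHandler
        {
            EnableMultipleHttp2Connections = true,
            PooledConnectionIdleTimeout = TimeSpan.FromSeconds(60),
            // ConnectTimeout is intentionally not set here.
        };
    }

    // Builds a channel on top of that handler, as in the original client.
    public static GrpcChannel CreateChannel(string address)
    {
        return GrpcChannel.ForAddress(address, new GrpcChannelOptions
        {
            HttpHandler = CreateHandler(),
            Credentials = ChannelCredentials.Insecure,
        });
    }
}
```

Without `ConnectTimeout`, connection attempts are no longer bounded by the handler. If an upper bound is still needed, a per-call deadline is one option, but whether that is an acceptable substitute depends on the application.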

someview commented 11 months ago

The DnsConnectAsync method in the .NET runtime (https://github.com/dotnet/runtime/blob/main/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.cs#L2762) may conflict with this check:

    if (firstConnectionError is OperationCanceledException oce &&
        oce.CancellationToken == context.CancellationToken &&
        !context.IsConnectCanceled)
    {
        firstConnectionError = new TimeoutException("A connection could not be established within the configured ConnectTimeout.", firstConnectionError);
        result = ConnectResult.Timeout;
    }
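
To make the condition easier to reason about, here is a minimal standalone sketch of the same check. It is not the grpc-dotnet implementation: `ConnectOutcome` and `ConnectErrorClassifier` are hypothetical names, and `SocketError.TryAgain` is used only as a stand-in for the "Resource temporarily unavailable" error in the log.

```csharp
using System;
using System.Net.Sockets;
using System.Threading;

// Hypothetical stand-in for the ConnectResult values used in the quoted code.
enum ConnectOutcome { Failure, Timeout }

static class ConnectErrorClassifier
{
    // Mirrors the quoted condition: a failure counts as a timeout only when
    // the first error is an OperationCanceledException carrying the connect
    // token and the connect was not explicitly canceled.
    public static ConnectOutcome Classify(
        Exception error, CancellationToken connectToken, bool isConnectCanceled)
    {
        if (error is OperationCanceledException oce &&
            oce.CancellationToken == connectToken &&
            !isConnectCanceled)
        {
            return ConnectOutcome.Timeout;
        }
        return ConnectOutcome.Failure;
    }

    public static void Demo()
    {
        using var cts = new CancellationTokenSource();
        cts.Cancel();

        // Cancellation with the connect token is classified as a timeout.
        var canceled = new OperationCanceledException(cts.Token);
        Console.WriteLine(Classify(canceled, cts.Token, false));

        // Any other exception, such as a SocketException from a failed
        // DNS lookup, is classified as a plain failure.
        var dnsError = new SocketException((int)SocketError.TryAgain);
        Console.WriteLine(Classify(dnsError, cts.Token, false));
    }
}
```

The sketch shows that the classification depends entirely on the exception type and token that surface from the connect call. If the runtime reports a DNS failure differently on musl than on glibc when a timeout token is in play, the check takes a different branch. Whether that is what happens here is still unconfirmed.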