Open someview opened 11 months ago
I mentioned a similar problem in another issue: the gRPC client cannot recover from "Resource temporarily unavailable".
What version of gRPC and what language are you using?
grpc client 2.59
What operating system (Linux, Windows,...) and version?
Kubernetes 1.25, Linux, Alpine 3.18 image
What runtime / compiler are you using (e.g. .NET Core SDK version, dotnet --info)?
.NET 8.0
What did you do?
- Create a gRPC client like this:

      internal T CreateContract()
      {
          var handler = new SocketsHttpHandler
          {
              EnableMultipleHttp2Connections = true,
              PooledConnectionIdleTimeout = TimeSpan.FromSeconds(60),
              ConnectTimeout = TimeSpan.FromSeconds(5),
          };
          var address = "http://xxx.com"; // DNS name that resolves outside k8s
          var channel = GrpcChannel.ForAddress(address, new GrpcChannelOptions
          {
              HttpHandler = handler,
              Credentials = ChannelCredentials.Insecure,
          });
          var contract = channel.CreateGrpcService<T>();
          return contract;
      }
- Run this client program in k8s, using Alpine 3.18 as the base image.
- Set the k8s CoreDNS deployment to replicas = 0 to make DNS temporarily unavailable.
- view the program log:
      Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="Error connecting to subchannel.", DebugException="System.Net.Sockets.SocketException: Resource temporarily unavailable")
       ---> System.Net.Sockets.SocketException (11): Resource temporarily unavailable
         at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
         at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
         at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|281_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
         at Grpc.Net.Client.Balancer.Internal.SocketConnectivitySubchannelTransport.TryConnectAsync(ConnectContext context)
         --- End of inner exception stack trace ---
         at Grpc.Net.Client.Balancer.Internal.ConnectionManager.PickAsync(PickContext context, Boolean waitForReady, CancellationToken cancellationToken)
         at Grpc.Net.Client.Balancer.Internal.BalancerHttpHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
         at Grpc.Net.Client.Internal.GrpcCall`2.RunCall(HttpRequestMessage request, Nullable`1 timeout)
         at ProtoBuf.Grpc.Internal.Reshape.UnaryTaskAsyncImpl[TRequest,TResponse](AsyncUnaryCall`1 call, MetadataContext metadata, CancellationToken cancellationToken) in /_/src/protobuf-net.Grpc/Internal/Reshape.cs:line 560
- Set the k8s CoreDNS deployment back to replicas = 1.
- The gRPC client still logs the same error and does not recover.
How do we know this is a gRPC problem?
- We investigated the gRPC code paths around socket connect, e.g. Grpc.Net.Client.Balancer.SubChannel.ConnectTransportAsync and SocketConnectivitySubchannelTransport.TryConnectAsync, and wrote a simple test:
      using System.Net;
      using System.Net.Sockets;

      static async Task TestSocket()
      {
          while (true)
          {
              // Fresh 3-second timeout for every connect attempt.
              using var cancellationTokenSource = new CancellationTokenSource();
              var cancellationToken = cancellationTokenSource.Token;
              cancellationTokenSource.CancelAfter(millisecondsDelay: 3000);

              string host = "xxx.com"; // a host outside k8s
              int port = 80;
              EndPoint endPoint = new DnsEndPoint(host, port);
              // A new socket is created for each attempt and disposed at the end of the iteration.
              using var socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
              try
              {
                  await socket.ConnectAsync(endPoint, cancellationToken);
              }
              catch (Exception ex)
              {
                  Console.WriteLine($"Exception: {ex.Message}");
                  if (ex is OperationCanceledException oce && oce.CancellationToken == cancellationToken)
                  {
                      Console.WriteLine("this is timeout error");
                  }
              }
          }
      }
  We found that after CoreDNS recovers, this plain socket connects correctly again, so we believe the problem is related to grpc-dotnet.
- With the gRPC code path above and Debian 11 as the base image, the gRPC client works normally again once the DNS service recovers. This suggests that the difference in grpc-dotnet behavior comes from the C library's DNS resolver: musl (Alpine) versus glibc (Debian). This is how grpc-dotnet handles the exception:
          try { ... }
          catch (Exception ex)
          {
              // Socket is recreated every connect attempt. Explicitly dispose failed socket before next attempt.
              socket.Dispose();
              SocketConnectivitySubchannelTransportLog.ErrorConnectingSocket(_logger, _subchannel.Id, currentEndPoint, ex);
              if (firstConnectionError == null)
              {
                  firstConnectionError = ex;
              }
              // Stop trying to connect to addresses on cancellation.
              if (context.CancellationToken.IsCancellationRequested)
              {
                  break;
              }
          }
      }

      var result = ConnectResult.Failure;
      // Check if cancellation happened because of timeout.
      if (firstConnectionError is OperationCanceledException oce &&
          oce.CancellationToken == context.CancellationToken &&
          !context.IsConnectCanceled)
      {
          firstConnectionError = new TimeoutException("A connection could not be established within the configured ConnectTimeout.", firstConnectionError);
          result = ConnectResult.Timeout;
      }

      // All connections failed
      _subchannel.UpdateConnectivityState(
          ConnectivityState.TransientFailure,
          new Status(StatusCode.Unavailable, "Error connecting to subchannel.", firstConnectionError));
      lock (Lock)
      {
          if (!_disposed)
          {
              _socketConnectedTimer.Change(Timeout.InfiniteTimeSpan, Timeout.InfiniteTimeSpan);
          }
      }
  This part of the code may have a compatibility problem:

      if (firstConnectionError is OperationCanceledException oce &&
          oce.CancellationToken == context.CancellationToken &&
          !context.IsConnectCanceled)
      {
          firstConnectionError = new TimeoutException("A connection could not be established within the configured ConnectTimeout.", firstConnectionError);
          result = ConnectResult.Timeout;
      }
  But this may be caused by the .NET runtime, so we also investigated how the runtime handles socket DNS failures and cancellation. There are some related issues, e.g. https://github.com/dotnet/runtime/issues/81023 and https://github.com/dotnet/runtime/issues/75889.
- With the gRPC code path above and Alpine 3.18 as the base image, if we remove the ConnectTimeout setting from the SocketsHttpHandler, the gRPC client recovers when the DNS service recovers.
https://github.com/dotnet/runtime/blob/main/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.cs#L2762
The DnsConnectAsync method in the runtime may conflict with this check in grpc-dotnet:
if (firstConnectionError is OperationCanceledException oce &&
oce.CancellationToken == context.CancellationToken &&
!context.IsConnectCanceled)
{
firstConnectionError = new TimeoutException("A connection could not be established within the configured ConnectTimeout.", firstConnectionError);
result = ConnectResult.Timeout;
}
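
The observation above (the client recovers on Alpine once ConnectTimeout is removed) can be sketched as a small handler factory. This is only a minimal sketch of a possible stopgap, not a fix for the underlying behavior: `HandlerFactory` and `Create` are hypothetical names introduced here, and the sketch assumes the only relevant change is leaving `SocketsHttpHandler.ConnectTimeout` at its default (infinite), which means connect attempts are no longer bounded by a timeout.

```csharp
using System;
using System.Net.Http;

// Hypothetical helper for illustration; not part of grpc-dotnet or the original client.
internal static class HandlerFactory
{
    // Builds the handler used by the gRPC channel.
    // When connectTimeout is null, ConnectTimeout is left at its default (infinite),
    // which matches the configuration reported to recover after DNS comes back.
    internal static SocketsHttpHandler Create(TimeSpan? connectTimeout = null)
    {
        var handler = new SocketsHttpHandler
        {
            EnableMultipleHttp2Connections = true,
            PooledConnectionIdleTimeout = TimeSpan.FromSeconds(60),
        };

        if (connectTimeout is TimeSpan timeout)
        {
            // Opt in explicitly, e.g. on glibc-based images where recovery was observed to work.
            handler.ConnectTimeout = timeout;
        }

        return handler;
    }
}
```

The result of `HandlerFactory.Create()` would be passed as `HttpHandler` in `GrpcChannelOptions`, in place of the handler built in `CreateContract` above.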