Open jeffijoe opened 10 months ago
I looked at your .NET client and server sample apps and they both look fine. I also tested them on my local computer (no load balancer involved), and they worked without issue. That must mean there is an interaction between .NET gRPC client + the L7 load balancer that causes the failure.
Is setting up the solution in GCP required? I don't have experience with using it so that would be a bit of a barrier to me quickly testing it failing. I think the issue is in the gRPC client (since the server works with all the other clients) so that's the part I'd want to test.
If you have a GCP endpoint I can point a client on my local machine at, that will speed things up. I can set up gathering of low-level diagnostics and then run the client until there is a failure. If you don't want to share the address publicly then contact me over email at username@microsoft.com (jamesnk)
If that isn't possible then can I get you to get some diagnostics. I can create a PR to your repository with changes to enable detailed logging in the client. You can send attached the results here.
Thank you, James. I've sent you an email with the endpoint.
I reproduced your error and looked at detailed internal logs. The problem is the .NET client is sending the server pings to calculate the RTT, but the server closes the connection because it believes the client is malicious.
Logs: ping-goaway.txt
Important parts:
ProcessIncomingFramesAsync,Frame 15: StreamId=0; Type=GoAway; Flags=None; PayloadLength=22. ReadGoAwayFrame,lastStreamId=1, errorCode=EnhanceYourCalm
What is happening is the server is sending a GOAWAY frame with an EnhanceYourCalm error code to the client right after the client sends a ping. The server is telling the client that it thinks the client is malicious.
Adding this code to the start of the client app disables RTT pings fixes the problem:
AppContext.SetSwitch("System.Net.SocketsHttpHandler.Http2FlowControl.DisableDynamicWindowSizing", true);
A bug fix needs to happen inside .NET to improve how pings are sent, but you can use that setting to work around the problem until a fix is publicly available.
It sounds like this is related to #2358 then.
So the server sends a GOAWAY
, but the client does not surface this information in the exception that is thrown; is that intentional, or is that one of the things that need to be fixed? And regarding the ping, is it because the .NET client sends excessive pings?
EDIT: Also found this, seems related: https://github.com/dotnet/runtime/issues/62216
but the client does not surface this information in the exception that is thrown; is that intentional, or is that one of the things that need to be fixed?
Providing the real reason why a HTTP request failed is a bug and should be fixed. I passed feedback to the team. The ENHANCE_YOUR_CALM status code should appear in the exception message and shouldn't require low-level debugging.
And regarding the ping, is it because the .NET client sends excessive pings?
Yes. It fails after the 5th ping. Some servers are very conservative and see five pings as excessive. The algorithm used to send pings needs to be updated.
The AppContext
switch works perfectly, thank you!
Will this issue remain open until the underlying issues have been fixed? Or how will we be able to track when we can remove the switch?
It sounds like this is related to https://github.com/grpc/grpc-dotnet/issues/2358 then.
Yes it is. Exactly the same.
Will this issue remain open until the underlying issues have been fixed? Or how will we be able to track when we can remove the switch?
The fix will happen in dotnet/runtime repository. I'll put a comment here to a new issue there when I close it.
Regardless of the fact if this will be fixed in .NET or the GCP load balancer, we need to involve the owners of the load balancer in order to make the right decision.
I warmed up an old email thread about a similar issue, but the primary contact in the discussion seems to have left google, so I don't have strong hopes I can reach the people I need over there.
@jeffijoe could you please also raise a ticket against Google? IMO that would be the most efficient way to have this routed to the right people so we can chat.
@antonfirsov we don't have a direct way to contact Google but we've opened a ticket with DoIT who's our support provider and linked them to this issue, they may be able to escalate this to Google.
@antonfirsov our support rep forwarded the ticket to Google, this is what we got back:
I have reached out regarding this request and it is currently not possible to involve the product team as the issue lies on the .NET client side. The .NET team provided a workaround that the customer can use in their code in order to disable the pings while they (the .NET team) work on fixing the issue with the .NET client. Please note that changing the LB to accept this behavior will potentially introduce a vulnerability that can be used by any gRPC client software to generate attacks. In conclusion, kindly proceed with the workaround proposed by the .NET team in order for traffic to not be considered as malicious by the LB.
@jeffijoe thanks for doing the round with google!
The fix is on the way on the .NET side: https://github.com/dotnet/runtime/pull/97881. With change the problem doesn't happen anymore using your repro, however it would be nice to get a confirmation if it fixes your real problem as well. Having a customer validation would raise the chance of the fix being accepted for 6.0-8.0 backport.
Would it be possible for you to try the attached custom Linux build of System.Net.Http.dll
?
You would need to replace it in <DOTNETDIR>/shared/Microsoft.NETCore.App/8.0.1/
.
@antonfirsov if you tested it and it works against our reproduction server, and considering how disabling the feature worked for both our repro and the real problem, then I'm confident the issue has been resolved with your fix. 👍
@antonfirsov I tested your custom Linux build of System.Net.Http.dll
successfully in my scenario.
Against a Google Cloud API.
I copied the DLL into the runtime 8.0.3 lacking a runtime 8.0.1. ;-) Run the test three times to be sure that it won't throw the exception again. It failed each time with the original DLL.
I implemented an unofficial Gemini SDK for .NET and I came across the reported HttpIOException
using long-running stream requests. The method involved GenerateContentStream
is available on Mscc.GenerativeAI repository. And I'm using it to create a transcription of an audio file. The xUnit test case is located here: TranscribeStream_Audio_From_FileAPI
As this is a .NET library, I'm not sure whether the suggested workaround with AppContext.SetSwitch()
is applicable. I'd assume that this needs to be done in any application using my NuGet package instead.
The streaming response comes from the server and using your custom DLL simply stops instead of throwing the HttpIOException
while using the original DLL.
I hope this helps, and that this change will be merged into .NET 8 and hopefully backported to .NET 6.
Cheers, JoKi
$ dotnet --info .NET SDK: Version: 8.0.203 Commit: 5e1ceea679 Workload version: 8.0.200-manifests.4e94be9c
Runtime Environment: OS Name: ubuntu OS Version: 22.04 OS Platform: Linux RID: linux-x64 Base Path: /home/jochen/.dotnet8/sdk/8.0.203/
.NET workloads installed: There are no installed workloads to display.
Host: Version: 8.0.3 Architecture: x64 Commit: 9f4b1f5d66
.NET SDKs installed: 8.0.100-preview.5.23303.2 [/home/jochen/.dotnet8/sdk] 8.0.100-rc.1.23463.5 [/home/jochen/.dotnet8/sdk] 8.0.100-rc.2.23502.2 [/home/jochen/.dotnet8/sdk] 8.0.201 [/home/jochen/.dotnet8/sdk] 8.0.202 [/home/jochen/.dotnet8/sdk] 8.0.203 [/home/jochen/.dotnet8/sdk]
.NET runtimes installed: Microsoft.AspNetCore.App 8.0.0-preview.5.23302.2 [/home/jochen/.dotnet8/shared/Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 8.0.0-rc.1.23421.29 [/home/jochen/.dotnet8/shared/Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 8.0.0-rc.2.23480.2 [/home/jochen/.dotnet8/shared/Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 8.0.2 [/home/jochen/.dotnet8/shared/Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 8.0.3 [/home/jochen/.dotnet8/shared/Microsoft.AspNetCore.App] Microsoft.NETCore.App 8.0.0-preview.5.23280.8 [/home/jochen/.dotnet8/shared/Microsoft.NETCore.App] Microsoft.NETCore.App 8.0.0-rc.1.23419.4 [/home/jochen/.dotnet8/shared/Microsoft.NETCore.App] Microsoft.NETCore.App 8.0.0-rc.2.23479.6 [/home/jochen/.dotnet8/shared/Microsoft.NETCore.App] Microsoft.NETCore.App 8.0.2 [/home/jochen/.dotnet8/shared/Microsoft.NETCore.App] Microsoft.NETCore.App 8.0.3 [/home/jochen/.dotnet8/shared/Microsoft.NETCore.App]
@antonfirsov After upgrading to latest SDK and runtimes 8.0.4 the problems seems to be resolved.
Thanks!
@jochenkirstaetter that is very strange since we changed our mind regarding the backport of https://github.com/dotnet/runtime/pull/97881 because we find the change risky. It might be backported to 8.0 after a short battle test by 9.0 adapters (early 2025).
@antonfirsov that's indeed really strange.
Okay, let's see what happened here. I observed the exception using SDK and Runtime 8.0.3, reproducible.
Given the exception I searched the Internet and found the issues here reading about the problem.
Then, I downloaded the custom System.Net.Http 8.0.1 and placed it (as suggested) in the shared folder of 8.0.3
Any run was without exception although "incomplete" like stopped in the middle of the response content (text)
Then I replaced the custom DLL with the original one, and I got the exception again, as expected.
Next, I upgraded to v8.0.4 and haven't observed any exception since. The incomplete response is still there.
Guess, I'll give my Linux system a complete reboot and check again.
FYI, I'm using JetBrains Rider for development and testing.
Cheers, JoKi
I have the same error
StatusCode=\"Unavailable\", Detail=\"Error reading next message. HttpIOException: The response ended prematurely while waiting for the next frame from the server. (ResponseEnded)\"
In my case i have different applications with client and server. When i want to reboot server with gracefull shutdown i can't send await responseStream.CompleteAsync(); and no way to correct finish communication between server and client
Why we don't have method CompleteAsync for IServerStreamWriter
I reproduced your error and looked at detailed internal logs. The problem is the .NET client is sending the server pings to calculate the RTT, but the server closes the connection because it believes the client is malicious.
Logs: ping-goaway.txt
Important parts:
ProcessIncomingFramesAsync,Frame 15: StreamId=0; Type=GoAway; Flags=None; PayloadLength=22. ReadGoAwayFrame,lastStreamId=1, errorCode=EnhanceYourCalm
What is happening is the server is sending a GOAWAY frame with an EnhanceYourCalm error code to the client right after the client sends a ping. The server is telling the client that it thinks the client is malicious.
Adding this code to the start of the client app disables RTT pings fixes the problem:
AppContext.SetSwitch("System.Net.SocketsHttpHandler.Http2FlowControl.DisableDynamicWindowSizing", true);
A bug fix needs to happen inside .NET to improve how pings are sent, but you can use that setting to work around the problem until a fix is publicly available.
I add this code in ConfigureServices function ,but it not work. my program has this exception about once a week
UPDATE: THIS HAPPENS IF THE SERVER PROCESS IS UNEXPECTEDLY RESTARTED WHEN GRPC COMMUNICATION IS IN PROGRESS. SO IT'S AN ERROR IN MY CODE ONLY
I've reproduced this error in .NET SDK Version 9.0.100-preview.6.24328.19 on a .NET client/server setup, both on the same Windows 11 machine, via named pipe transport. My gRPC related packages:
Client side:
Google.Protobuf (3.28.2) Grpc.Net.ClientFactory (2.66.0) Grpc.Tools (2.66.0)
Server side: Google.Protobuf (3.28.2) Grpc.AspNetCore (2.66.0)
I added the above noted call on my client side, which did not help: AppContext.SetSwitch("System.Net.SocketsHttpHandler.Http2FlowControl.DisableDynamicWindowSizing", true);
The call stack is a little different:
Grpc.Core.RpcException: Status(StatusCode=\\"Unavailable\\", Detail=\\"Error starting gRPC call. HttpRequestException: The response ended prematurely while waiting for the next frame from the server. (ResponseEnded) HttpIOException: The response ended prematurely while waiting for the next frame from the server. (ResponseEnded)\\", DebugException=\\"System.Net.Http.HttpRequestException: The response ended prematurely while waiting for the next frame from the server. (ResponseEnded)\\")\r\n ---> System.Net.Http.HttpRequestException: The response ended prematurely while waiting for the next frame from the server. (ResponseEnded)\r\n ---> System.Net.Http.HttpIOException: The response ended prematurely while waiting for the next frame from the server. (ResponseEnded)\r\n at System.Net.Http.Http2Connection.ThrowRequestAborted(Exception innerException)\r\n at System.Net.Http.Http2Connection.Http2Stream.CheckResponseBodyState()\r\n at System.Net.Http.Http2Connection.Http2Stream.TryEnsureHeaders()\r\n at System.Net.Http.Http2Connection.Http2Stream.ReadResponseHeadersAsync(CancellationToken cancellationToken)\r\n at System.Net.Http.Http2Connection.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)\r\n --- End of inner exception stack trace ---\r\n at System.Net.Http.Http2Connection.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)\r\n at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)\r\n at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)\r\n at Grpc.Net.Client.Balancer.Internal.BalancerHttpHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)\r\n at Grpc.Net.Client.Internal.GrpcCall2.RunCall(HttpRequestMessage request, Nullable
1 timeout)\r\n --- End of inner exception stack trace ---\r\n at Grpc.Net.Client.Internal.HttpClientCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method2 method, String host, CallOptions options, TRequest request)\\r\\n at Grpc.Core.Interceptors.InterceptingCallInvoker.<BlockingUnaryCall>b__3_0[TRequest,TResponse](TRequest req, ClientInterceptorContext
2 ctx)\r\n at Grpc.Core.ClientBase.ClientBaseConfiguration.ClientBaseConfigurationInterceptor.BlockingUnaryCall[TRequest,TResponse](TRequest request, ClientInterceptorContext2 context, BlockingUnaryCallContinuation
2 continuation)\r\n at Grpc.Core.Interceptors.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request
Possibly related were subsequent broken pipe errors:
Grpc.Core.RpcException: Status(StatusCode=\\"Unavailable\\", Detail=\\"Error starting gRPC call. HttpRequestException: An error occurred while sending the request. IOException: The request was aborted. IOException: Pipe is broken.\\", DebugException=\\"System.Net.Http.HttpRequestException: An error occurred while sending the request.\\")\r\n ---> System.Net.Http.HttpRequestException: An error occurred while sending the request.\r\n ---> System.IO.IOException: The request was aborted.\r\n ---> System.IO.IOException: Pipe is broken.\r\n at System.IO.Pipes.PipeStream.CheckWriteOperations()\r\n at System.IO.Pipes.PipeStream.WriteAsync(ReadOnlyMemory1 buffer, CancellationToken cancellationToken)\\r\\n at System.Net.Http.Http2Connection.FlushOutgoingBytesAsync()\\r\\n --- End of inner exception stack trace ---[\\r\\n](file://r/n) at System.Net.Http.Http2Connection.ThrowRequestAborted(Exception innerException)\\r\\n at System.Net.Http.Http2Connection.Http2Stream.CheckResponseBodyState()\\r\\n at System.Net.Http.Http2Connection.Http2Stream.TryEnsureHeaders()\\r\\n at System.Net.Http.Http2Connection.Http2Stream.ReadResponseHeadersAsync(CancellationToken cancellationToken)\\r\\n at System.Net.Http.Http2Connection.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)\\r\\n --- End of inner exception stack trace ---[\\r\\n](file://r/n) at System.Net.Http.Http2Connection.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)\\r\\n at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)\\r\\n at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)\\r\\n at Grpc.Net.Client.Balancer.Internal.BalancerHttpHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)\\r\\n at System.Net.Http.HttpMessageInvoker.<SendAsync>g__SendAsyncWithTelemetry|6_0(HttpMessageHandler handler, HttpRequestMessage request, CancellationToken cancellationToken)\\r\\n at Grpc.Net.Client.Internal.GrpcCall
2.RunCall(HttpRequestMessage request, Nullable1 timeout)\\r\\n --- End of inner exception stack trace ---[\\r\\n](file://r/n) at Grpc.Net.Client.Internal.GrpcCall
2.GetResponseHeadersCoreAsync()\r\n at Grpc.Net.Client.Internal.HttpClientCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method2 method, String host, CallOptions options, TRequest request)\\r\\n at Grpc.Core.Interceptors.InterceptingCallInvoker.<BlockingUnaryCall>b__3_0[TRequest,TResponse](TRequest req, ClientInterceptorContext
2 ctx)\r\n at Grpc.Core.ClientBase.ClientBaseConfiguration.ClientBaseConfigurationInterceptor.BlockingUnaryCall[TRequest,TResponse](TRequest request, ClientInterceptorContext2 context, BlockingUnaryCallContinuation
2 continuation)\r\n at Grpc.Core.Interceptors.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
This seems to be associated with the following System log from the Event Viewer:
The machine-default permission settings do not grant Local Activation permission for the COM Server application with CLSID
{C2F03A33-21F5-47FA-B4BB-156362A2F239}
and APPID
{316CDED5-E4AE-4B15-9113-7055D84DCC97}
to the user
I added the above noted call on my client side, which did not help
AppContext.SetSwitch("System.Net.SocketsHttpHandler.Http2FlowControl.DisableDynamicWindowSizing", true);
Note that this issue tracks a particular flow control misbehavior when grpc-dotnet is used together with GCP L7 Load Balancer. There might be many other reasons for getting a Response ended prematurely
exception.
@pcullen1 unless you can confirm to see the EnhanceYourCalm
error code when logging the outputs of the Private.InternalDiagnostics.System.Net.Http
event source (eg with dotnet-trace
), you are likely experiencing an unrelated error. @yaoze1989 same seems to be true for your case.
@antonfirsov I must apologize because I had some test code on the server side that I forgot about that was intentionally sending errors to the client, which then restarts the server process. Doing that is causing gRPC to emit these errors. So I need to handle that scenario more gracefully by cancelling the operations in progress first. I'm also no longer seeing those Event 10016 DistributedCOM errors, which are logged every time that I can see when the gRPC exceptions are logged.
If I should delete my above post, please all let me know. It was a learning experience, however!
What version of gRPC and what language are you using?
C#,
Grpc.Net.Client
version2.60.0
What operating system (Linux, Windows,...) and version?
macOS 14.2.1, Linux (Ubuntu on Google Cloud Kubernetes Engine)
What runtime / compiler are you using (e.g. .NET Core SDK version
dotnet --info
)What did you do?
We have a long-lived gRPC streaming call that gets canceled after receiving a few messages with the following:
So far, we can reproduce this when using a L7 external load balancer in Google Cloud (Classic Application Load Balancer). When using an L4 external load balancer, it works as expected.
We have built an extensive reproduction, including multiple servers and clients in different languages to isolate the root cause as being the gRPC dotnet client library. The following clients have been tested and work without issue:
The linked repository also includes information about the infrastructure setup.
The following proto is used:
At first, we thought the root cause was the L7 load balancer, however, given other gRPC client libraries work without issue, we figured it might be something specific in the dotnet one.
The issue occurs after having received between 5-8 messages on the stream when sending a message roughly every second. 6 messages is the most common one we've observed.
If we send messages faster (using a
while
loop that calls a HTTP endpoint on the server to trigger messages), we get between 20-25 messages before the connection is broken.We also observed that the connection breaks after receiving a message, never while waiting for one.