googleapis / google-cloud-dotnet

Google Cloud Client Libraries for .NET
https://cloud.google.com/dotnet/docs/reference/
Apache License 2.0
924 stars 362 forks source link

[BigQuery/Storage Write API] Default GrpcAdapter Memory Leak on Server 2022 #13150

Closed indy-singh closed 2 months ago

indy-singh commented 2 months ago

Tangentially Related Context: https://github.com/googleapis/google-cloud-dotnet/issues/10034

Headline

I'll start with the fix/workaround:-

Switching from the default GrpcAdapter to GrpcCoreAdapter.Instance as per this comment from a different issue solved the slow memory leak.

Longer explanation

We are in the process of upgrading all of production servers (we are on-prem) from Server 2019 to Server 2022. We've rolled out 2022 with http3 enabled as per this post.

We started to notice a new slow leak on the newly upgraded 2022 boxes:-

image

You'll notice that the grey line remains flat because that box was still on 2019.

We hooked into AppDomain.CurrentDomain.FirstChanceException and saw that 2022 was throwing a lot of QUIC errors:-

System.Threading.Channels.ChannelClosedException: The channel has been closed.
 ---> System.Net.Quic.QuicException: The connection timed out from inactivity.
   at System.Net.Quic.QuicConnection.HandleEventShutdownInitiatedByTransport(_SHUTDOWN_INITIATED_BY_TRANSPORT_e__Struct& data)
   at System.Net.Quic.QuicConnection.HandleConnectionEvent(QUIC_CONNECTION_EVENT& connectionEvent)
   at System.Net.Quic.QuicConnection.NativeCallback(QUIC_HANDLE* connection, Void* context, QUIC_CONNECTION_EVENT* connectionEvent)
--- End of stack trace from previous location ---

   --- End of inner exception stack trace ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
System.Net.Quic.QuicException: The connection timed out from inactivity.
   at System.Net.Quic.QuicConnection.HandleEventShutdownInitiatedByTransport(_SHUTDOWN_INITIATED_BY_TRANSPORT_e__Struct& data)
   at System.Net.Quic.QuicConnection.HandleConnectionEvent(QUIC_CONNECTION_EVENT& connectionEvent)
   at System.Net.Quic.QuicConnection.NativeCallback(QUIC_HANDLE* connection, Void* context, QUIC_CONNECTION_EVENT* connectionEvent)
--- End of stack trace from previous location ---
   at System.Net.Quic.QuicConnection.AcceptInboundStreamAsync(CancellationToken cancellationToken)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()

We tried turning off (including a reboot) EnableAltSvc but the leak still occurred. Then we tried the same with EnableHttp3 but no joy. This was because naively we thought http3 == quic.

Eventually, I remembered from this comment that there was a suspicion of the default GrpcAdapater was the cause of that issue. So we gave that a shot:-

<PackageReference Include="Grpc.Core" Version="2.46.6" />

And the slow memory leak was fixed (deployed just before the weekend) hence the above graph flatlines at the start of the weekend.

The leak state took a little time to reproduce; eventually I switched to using an EventListener to watch for any QUIC activity because my [broad] theory was basically QUIC was somehow causing the memory leak.

public sealed class MyListener : EventListener
{
  protected override void OnEventSourceCreated(EventSource eventSource)
  {
    if ("Private.InternalDiagnostics.System.Net.Quic".Equals(eventSource.Name))
    {
      EnableEvents(eventSource, EventLevel.LogAlways, EventKeywords.All);
    }
  }

  protected override void OnEventWritten(EventWrittenEventArgs eventData)
  {
    Console.WriteLine($"{DateTime.UtcNow:ss:fff} {eventData.EventName}: " +
      string.Join(' ', eventData.PayloadNames!.Zip(eventData.Payload!).Select(pair => $"{pair.First}={pair.Second}")));

  }
}

I stopped digging once the problem was resolved; my best guess is that somehow those QUIC errors aren't disposing of resources and just slowly leaks. The actual application tier code hasn't changed at all since I opened the previous issue.

Happy to provide more details/logs/diagnostics.

Environment details

jskeet commented 2 months ago

Please could you clarify what this means?

We tried turning off (including a reboot) EnableAltSvc but the leak still occurred. Then we tried the same with EnableHttp3 but no joy. This was because naively we thought http3 == quic.

Is it possible that what you were changing was enabling/disabling QUIC for your web app, rather than for the client? It's relatively fiddly to turn off HTTP/3 at the client level at the moment - we have a workaround you could try that's documented here: https://github.com/googleapis/google-cloud-dotnet/blob/main/issues/Issue12318/Workaround/Program.cs

(Obviously we'd like to make this simpler over time.)

It would also be useful if you could provide a minimal example of what you're doing with the client, ideally as a console app - including whether you're disposing of everything. (It's over a year since the previous issue, so my guess is that your code may have changed.)

indy-singh commented 2 months ago

Please could you clarify what this means?

Yes by turning off HTTP3 I meant reversing the instructions here. So disabling http3 on OS level.

It's relatively fiddly to turn off HTTP/3 at the client level at the moment - we have a workaround you could try that's documented here: https://github.com/googleapis/google-cloud-dotnet/blob/main/issues/Issue12318/Workaround/Program.cs

Ah, that is nice. I did have a play around with Options API but it was not immediately apparent that it was done like that. I was thinking I would have the pass the delegating handler in. I'm happy to try that.

It would also be useful if you could provide a minimal example of what you're doing with the client, ideally as a console app - including whether you're disposing of everything. (It's over a year since the previous issue, so my guess is that your code may have changed.)

The code hasn't changed since the previous issue. But happy to provide a minimal example. This may take a while.

Thanks, Indy

jskeet commented 2 months ago

Yes by turning off HTTP3 I meant reversing the instructions here. So disabling http3 on OS level.

That post is all about acting as an HTTP/3 server. I don't know for sure, but I'd strongly suspect that it has no effect on clients.

Ah, that is nice. I did have a play around with Options API but it was not immediately apparent that it was done like that. I was thinking I would have the pass the delegating handler in. I'm happy to try that.

Great, thank you.

The code hasn't changed since the previous issue. But happy to provide a minimal example. This may take a while.

No hurry - but it's much appreciated :)

indy-singh commented 2 months ago

Hi @jskeet,

I've repurposed this to show a minimal example of what we are doing.

Cheers, Indy

indy-singh commented 2 months ago

That post is all about acting as an HTTP/3 server. I don't know for sure, but I'd strongly suspect that it has no effect on clients

Ah, sorry. I should be clearer about our setup. We have server-to-server communication. Thank you for prompting me to clarifying that point. I had missed that entirely, apologies.

jskeet commented 2 months ago

We have server-to-server communication.

Yes, but the BigQueryStorageClient is still acting as a client, therefore any settings which are about serving HTTP/3 are probably irrelevant to whether the client uses HTTP/3.

indy-singh commented 2 months ago

Yeah, I was just pointing out that I had misunderstood what registry keys were actually doing 👍🏽

jskeet commented 2 months ago

Okay. I've cloned your code, and will try to reproduce the issue shortly. If you've got sample output from your repro which looks like it's leaking, that would be great for the sake of comparisons.

jskeet commented 2 months ago

Could you let me know where the QuicException comes, or post the fuller stack trace? I wonder whether that's coming before you start any using statements. I note you're using quite an old version of Google.Cloud.BigQuery.Storage.V1. If you update to 3.15.0, you can use a using statement for appendRowsStream:

using var appendRowsStream = _bigQueryWriteClientBuilder.AppendRows(CallSettings.FromCancellationToken(cancellationTokenSource.Token));

That way the gRPC call will be disposed even you don't get as far as completing the write stream.

indy-singh commented 2 months ago

If you've got sample output from your repro which looks like it's leaking, that would be great for the sake of comparisons.

I've not had a chance to let it run for long enough to match what is happening in production. But I can try.

It's relatively fiddly to turn off HTTP/3 at the client level at the moment - we have a workaround you could try that's documented here: https://github.com/googleapis/google-cloud-dotnet/blob/main/issues/Issue12318/Workaround/Program.cs

ExactVersionHandler also solves the problem. I'm not sure which solution/workaround is better..?

Could you let me know where the QuicException comes, or post the fuller stack trace?

That was the full stack trace as emitted from FirstChanceException. I'll see what else I can dig out.

If you update to 3.15.0, you can use a using statement for appendRowsStream

Will do :)

Cheers, Indy

jskeet commented 2 months ago

ExactVersionHandler also solves the problem

Excellent! I would personally go that route. While using Grpc.Core is still supported, it's definitely legacy - it'll be years before we are able to stop supporting it completely, but that's the direction of travel. In time I absolutely expect that we'll make it easier to modify the Grpc.Net.Client handler (ideally making it simple to control keepalive, specify the HTTP version etc) to avoid these problems - but for now it sounds like the workaround is the way to go.

indy-singh commented 2 months ago

I'm going to let ExactVersionHandler stew for 24 hours before I'm happy the slow memory leak has been resolved. As it is a slow one, time is the test - so far all good.

Am I correct in thinking that ExactVersionHandler forces to HTTP2 and therefore skips the QUIC path entirely?

jskeet commented 2 months ago

Am I correct in thinking that ExactVersionHandler forces to HTTP2 and therefore skips the QUIC path entirely?

Yes. (I believe the request contains HTTP/2.0 by default, and this basically says "and don't upgrade it".)

indy-singh commented 2 months ago

Great. Well I'm totally out of my depth with Quic, but I'm having a play with https://github.com/microsoft/msquic/blob/main/docs/Diagnostics.md / https://github.com/microsoft/msquic/blob/main/docs/TSG.md#trouble-shooting-a-performance-issue and see if anything sticks out.

indy-singh commented 2 months ago

Hmmmm, I wonder if this is related?

https://github.com/dotnet/runtime/issues/80111

If I hook into

TaskScheduler.UnobservedTaskException += (sender, args) =>
{
    Log.Error(new LogDetails("UnobservedTaskException", exception: args.Exception, objectDump: $"Was observed {args.Observed}"));
};

I definitely get some QUIC stuff:-

System.AggregateException: A Task's exception(s) were not observed either by Waiting on the Task or accessing its Exception property. As a result, the unobserved exception was rethrown by the finalizer thread. (The connection timed out from inactivity.)
 ---> System.Net.Quic.QuicException: The connection timed out from inactivity.
   at System.Net.Quic.ResettableValueTaskSource.TryComplete(Exception exception, Boolean final)
   at System.Net.Quic.QuicStream.HandleEventShutdownComplete(_SHUTDOWN_COMPLETE_e__Struct& data)
   at System.Net.Quic.QuicStream.HandleStreamEvent(QUIC_STREAM_EVENT& streamEvent)
   at System.Net.Quic.QuicStream.NativeCallback(QUIC_HANDLE* connection, Void* context, QUIC_STREAM_EVENT* streamEvent)
--- End of stack trace from previous location ---
   at System.Net.Quic.ResettableValueTaskSource.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at System.Net.Quic.QuicStream.ReadAsync(Memory`1 buffer, CancellationToken cancellationToken)
   at System.Net.Http.Http3Connection.<>c__DisplayClass41_0.<<ProcessServerControlStreamAsync>g__ReadFrameEnvelopeAsync|0>d.MoveNext()
--- End of stack trace from previous location ---
   at System.Net.Http.Http3Connection.ProcessServerControlStreamAsync(QuicStream stream, ArrayBuffer buffer)
   at System.Net.Http.Http3Connection.ProcessServerStreamAsync(QuicStream stream)
   at System.Net.Http.Http3Connection.ProcessServerStreamAsync(QuicStream stream)
   --- End of inner exception stack trace ---
jskeet commented 2 months ago

Hmm... that's deeper than I normally go, to be honest - but I will say we've seen a few issues with QUIC recently. All the more reason to disable it in the handler :)

indy-singh commented 2 months ago

I'm going to let ExactVersionHandler stew for 24 hours before I'm happy the slow memory leak has been resolved. As it is a slow one, time is the test - so far all good.

Showed no signs of a slow leak over night. I'm happy for this issue to be closed if you want?

Cheers, Indy

jskeet commented 2 months ago

Will do - thanks so much for testing things so quickly. We'll keep an eye on the whole QUIC situation... it sounds like we may well want to work out how to disable it by default. (Unfortunately, that's easier said than done, as the "right" HttpHandler to use is a fairly involved decision.)