open-telemetry / opentelemetry-dotnet

The OpenTelemetry .NET Client
https://opentelemetry.io
Apache License 2.0

Metrics stop exporting at seemingly random times every week or so #5729

Open ladenedge opened 4 months ago

ladenedge commented 4 months ago

What is the question?

We're using metrics extensively in our .NET 8 application which is deployed to a dozen or so servers. Once in a while -- say, once a week -- a single server will stop exporting to our OTel Collector. The other servers continue to work fine, and restarting our application fixes the problem.

Thank you for any help!

Additional context

Application is .NET 8 with the following OTel packages:

<PackageReference Include="OpenTelemetry.Contrib.Instrumentation.AWS" Version="1.0.2" />
<PackageReference Include="OpenTelemetry.Exporter.Console" Version="1.8.1" />
<PackageReference Include="OpenTelemetry.Exporter.OpenTelemetryProtocol" Version="1.8.1" />
<PackageReference Include="OpenTelemetry.Extensions.Hosting" Version="1.8.1" />
<PackageReference Include="OpenTelemetry.Instrumentation.EventCounters" Version="1.5.1-alpha.1" />
<PackageReference Include="OpenTelemetry.Instrumentation.GrpcNetClient" Version="1.6.0-beta.2" />
<PackageReference Include="OpenTelemetry.Instrumentation.Http" Version="1.8.1" />
cijothomas commented 4 months ago

Please see if you can collect internal logs (Warning and above): https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry/README.md#self-diagnostics
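
For reference, the self-diagnostics mechanism that README describes is file-based: placing an OTEL_DIAGNOSTICS.json file in the process's current working directory enables an internal log file. A minimal configuration, per that README, looks roughly like this (the LogDirectory value "." is just an example location):

{
  "LogDirectory": ".",
  "FileSize": 32768,
  "LogLevel": "Warning"
}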

Are you missing all metrics from the server, or just a subset of metrics? (There are metric cardinality caps implemented that could explain the behavior, but if every metric stops at the same time, that's unlikely to be the cause.) Are metrics missing in the Console exporter too?

ladenedge commented 4 months ago

Okay, we found a newly failing server and the diagnostic log helped a great deal. It appears we are timing out:

Exporter failed send data to collector to {0} endpoint. Data will not be sent. Exception: {1}{https://myendpoint.com:4317/}{Grpc.Core.RpcException: Status(StatusCode="DeadlineExceeded", Detail="")

Here's the full entry, repeated every minute:

2024-07-04T12:43:18.7570236Z:Exporter failed send data to collector to {0} endpoint. Data will not be sent. Exception: {1}{https://myendpoint.com:4317/}{Grpc.Core.RpcException: Status(StatusCode="DeadlineExceeded", Detail="")
   at Grpc.Net.Client.Internal.HttpClientCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.<BlockingUnaryCall>b__3_0[TRequest,TResponse](TRequest req, ClientInterceptorContext`2 ctx)
   at Grpc.Core.ClientBase.ClientBaseConfiguration.ClientBaseConfigurationInterceptor.BlockingUnaryCall[TRequest,TResponse](TRequest request, ClientInterceptorContext`2 context, BlockingUnaryCallContinuation`2 continuation)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at OpenTelemetry.Proto.Collector.Metrics.V1.MetricsService.MetricsServiceClient.Export(ExportMetricsServiceRequest request, CallOptions options)
   at OpenTelemetry.Exporter.OpenTelemetryProtocol.Implementation.ExportClient.OtlpGrpcMetricsExportClient.SendExportRequest(ExportMetricsServiceRequest request, DateTime deadlineUtc, CancellationToken cancellationToken)}

So now I have two questions:

  1. Is there some way to modify the timeout on the internal OTel gRPC calls?
  2. Why might this be happening after a week or so? Once we do time out, is this likely to get worse due to the queuing of new metrics? (I.e., is there any chance of recovering from this condition?)

Thank you again for your help!

cijothomas commented 4 months ago

Is there some way to modify the timeout on the internal OTel gRPC calls?

For gRPC, there isn't much of an option to customize. There are open issues/PRs for related settings that would allow exposing this. E.g.: https://github.com/open-telemetry/opentelemetry-dotnet/issues/2009

Why might this be happening after a week or so? Once we do time out, is this likely to get worse due to the queuing of new metrics? (I.e., is there any chance of recovering from this condition?)

I don't think any queuing up occurs today. If a batch is lost due to a gRPC timeout, it's not saved for retry later; instead, the next batch is tried. (@vishweshbankwar to keep me honest here.)

vishweshbankwar commented 4 months ago

Is there some way to modify the timeout on the internal OTel gRPC calls?

You can increase the timeout period by setting TimeoutMilliseconds
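
For illustration, a minimal sketch of setting that on the OTLP exporter options (the endpoint is the one from the log above; 30000 is an arbitrary example, the default being 10000 ms; the meter name is hypothetical):

using System;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry.Metrics;

var services = new ServiceCollection();

services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddMeter("MyApp.Metrics") // hypothetical meter name
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri("https://myendpoint.com:4317/");
            options.TimeoutMilliseconds = 30000; // default is 10000 (10 s)
        }));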

If a batch is lost due to a gRPC timeout, it's not saved for retry later; instead, the next batch is tried.

This is correct, even if retries are enabled. https://github.com/open-telemetry/opentelemetry-dotnet/issues/5436

cijothomas commented 4 months ago

Is there some way to modify the timeout on the internal OTel gRPC calls?

You can increase the timeout period by setting TimeoutMilliseconds

Thanks! Even if OTel's export timeout is increased, will it get applied to the timeout used by the GrpcClient itself?

vishweshbankwar commented 4 months ago

Is there some way to modify the timeout on the internal OTel gRPC calls?

You can increase the timeout period by setting TimeoutMilliseconds

Thanks! Even if OTel's export timeout is increased, will it get applied to the timeout used by the GrpcClient itself?

Yes - it is used for setting the deadline of the gRPC call we make here.
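
Conceptually, that means something like the following (an illustration of the idea only, not the exporter's actual source):

using System;
using Grpc.Core;

// The configured TimeoutMilliseconds is turned into an absolute deadline and
// passed along with the gRPC Export call; exceeding it surfaces as
// StatusCode.DeadlineExceeded, as seen in the log above.
int timeoutMilliseconds = 10000; // OtlpExporterOptions default
DateTime deadlineUtc = DateTime.UtcNow.AddMilliseconds(timeoutMilliseconds);
var callOptions = new CallOptions(deadline: deadlineUtc);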

cijothomas commented 4 months ago

Thanks! Looks like https://github.com/open-telemetry/opentelemetry-dotnet/issues/1735 is still open, which states that we don't really enforce the timeouts, but I could be wrong. (Or it's only for traces!)

ladenedge commented 4 months ago

You can increase the timeout period by setting TimeoutMilliseconds

Great info! However, it looks like the default is 10s, so it worries me that we're exceeding that -- especially if each call only includes the latest metrics. I could increase it to 20s or 30s, but I wonder if I'm just doing something wrong. Do you have any suggestions for diagnosing why my sends are exceeding 10s, or just how much data I'm sending? Or is this more likely a network issue between the servers and the collector?

Thanks again, and feel free to close this issue if you feel you've provided all the info you can!

cijothomas commented 4 months ago

especially if each call only includes the latest metrics.

This is only true if using Delta. If using Cumulative, then everything from the start will always be exported... Are you using Delta or Cumulative? Also, are there many Observable instruments with callbacks potentially taking a lot of time?
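
For context, the temporality preference is typically chosen where the OTLP exporter is registered; a rough sketch using the AddOtlpExporter overload that also exposes the metric reader options:

using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry.Metrics;

var services = new ServiceCollection();

services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddOtlpExporter((exporterOptions, readerOptions) =>
        {
            // Delta: each export carries only changes since the last collection.
            // Cumulative: each export carries totals since process start, so
            // payload size grows with the number of active series.
            readerOptions.TemporalityPreference = MetricReaderTemporalityPreference.Delta;
        }));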

ladenedge commented 4 months ago

We are indeed using Delta mode:

reader.TemporalityPreference = MetricReaderTemporalityPreference.Delta;

We have a handful of observables. You're suggesting that the time it takes to observe those metrics must be accounted for in the gRPC deadline? That's interesting. We've tried to make those calls quick, but it's certainly something we could take a closer look at -- that could also explain why our servers never recover from this condition.

Any other ideas are most welcome, and thank you again for all the help!

vishweshbankwar commented 4 months ago

@ladenedge - Just to confirm, you don't have retries enabled, correct?

It's odd that once the server hits DeadlineExceeded, it is not able to recover from it and continues to throw that error until restarted.

ladenedge commented 4 months ago

I assume you're talking about retries via the HttpClient? If so, then no, I'm using the default factory.

ladenedge commented 4 months ago

Also, to follow up on the observables: are observables actually queried during the exporter's network call? Looking over our handful of observable counters, they appear quick (e.g. Dictionary.Count) -- but is it even possible that they contribute to the missed deadline?

cijothomas commented 4 months ago

are observables actually queried during the exporter's network call?

No. (If you were using Prometheus scraping, then the observable callbacks are invoked in response to the scrape request, so they do contribute to the response time of the scrape itself.) In short, observable callbacks are not at play in your case, as you are using a push exporter. (Sorry I confused you with the mention of observables :( )
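
For completeness, a tiny sketch of the kind of observable instrument being discussed (the meter and instrument names here are made up):

using System.Collections.Concurrent;
using System.Diagnostics.Metrics;

var meter = new Meter("MyApp.Metrics");
var sessions = new ConcurrentDictionary<string, int>();

// With a push exporter, this callback runs when the periodic metric reader
// collects, not during the exporter's gRPC call, so a cheap snapshot like a
// dictionary count does not eat into the export deadline.
meter.CreateObservableGauge("myapp.sessions.active", () => sessions.Count);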

g3david405 commented 3 months ago

I'm facing the same issue.

ladenedge commented 3 months ago

For what it's worth, we are increasing our timeout to 30s to see if that makes any difference. (But this change won't be deployed for a week or so, so.. not very helpful yet, heh.)