Open ladenedge opened 4 months ago
Please see if you can get internal logs (Warning and above) https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry/README.md#self-diagnostics
Are you missing all metrics from the server, or just a subset of metrics? (There are metric cardinality caps implemented that could explain the behavior, but if every metric stops at the same time, that is unlikely to be the cause.) Are metrics missing in the Console exporter too?
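For reference, the self-diagnostics feature in the linked README is enabled by placing an OTEL_DIAGNOSTICS.json file in the process's current directory; a minimal sketch (the field values here are just an example, see the README for the exact semantics):

{
    "LogDirectory": ".",
    "FileSize": 32768,
    "LogLevel": "Warning"
}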
Okay, we found a newly failing server and the diagnostic log helped a great deal. It appears we are timing out:
Exporter failed send data to collector to {0} endpoint. Data will not be sent. Exception: {1}{https://myendpoint.com:4317/}{Grpc.Core.RpcException: Status(StatusCode="DeadlineExceeded", Detail="")
Here's the full entry, repeated every minute:
2024-07-04T12:43:18.7570236Z:Exporter failed send data to collector to {0} endpoint. Data will not be sent. Exception: {1}{https://myendpoint.com:4317/}{Grpc.Core.RpcException: Status(StatusCode="DeadlineExceeded", Detail="")
at Grpc.Net.Client.Internal.HttpClientCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
at Grpc.Core.Interceptors.InterceptingCallInvoker.<BlockingUnaryCall>b__3_0[TRequest,TResponse](TRequest req, ClientInterceptorContext`2 ctx)
at Grpc.Core.ClientBase.ClientBaseConfiguration.ClientBaseConfigurationInterceptor.BlockingUnaryCall[TRequest,TResponse](TRequest request, ClientInterceptorContext`2 context, BlockingUnaryCallContinuation`2 continuation)
at Grpc.Core.Interceptors.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
at OpenTelemetry.Proto.Collector.Metrics.V1.MetricsService.MetricsServiceClient.Export(ExportMetricsServiceRequest request, CallOptions options)
at OpenTelemetry.Exporter.OpenTelemetryProtocol.Implementation.ExportClient.OtlpGrpcMetricsExportClient.SendExportRequest(ExportMetricsServiceRequest request, DateTime deadlineUtc, CancellationToken cancellationToken)}
So now I have two questions (and thank you again for your help!):
Is there some way to modify the timeout on the internal OTel gRPC calls?
For gRPC, there isn't much to customize today. There are open issues/PRs for related settings that would allow exposing this, e.g. https://github.com/open-telemetry/opentelemetry-dotnet/issues/2009
Why might this be happening after a week or so? Once we do time out, is this likely to get worse due to the queuing of new metrics? (I.e., is there any chance of recovering from this condition?)
I don't think any queuing occurs today. If a batch is lost due to a gRPC timeout, it's not saved for retry later; instead the next batch is tried. (@vishweshbankwar to keep me honest here.)
Is there some way to modify the timeout on the internal OTel gRPC calls?
You can increase the timeout period by setting TimeoutMilliseconds
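For anyone tuning this, a minimal sketch of raising that setting on the OTLP metrics exporter (the endpoint, meter name, and 30-second value are illustrative; the default timeout is 10000 ms):

using System;
using OpenTelemetry;
using OpenTelemetry.Metrics;

var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyCompany.MyApp")
    .AddOtlpExporter(options =>
    {
        options.Endpoint = new Uri("https://myendpoint.com:4317");
        // Default export timeout is 10 s; raise it for slow links or large batches.
        options.TimeoutMilliseconds = 30000;
    })
    .Build();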
If a batch is lost due to grpc timeout, its not saved for retry later, instead the next batch is tried.
This is correct, even if the retries are enabled. https://github.com/open-telemetry/opentelemetry-dotnet/issues/5436
Is there some way to modify the timeout on the internal OTel gRPC calls?
You can increase the timeout period by setting TimeoutMilliseconds
Thanks! Even if OTel's export timeout is increased, will it get applied to the timeout used by the GrpcClient itself?
Is there some way to modify the timeout on the internal OTel gRPC calls? You can increase the timeout period by setting TimeoutMilliseconds
Thanks! Even if OTel's export timeout is increased, will it get applied to the timeout used by the GrpcClient itself?
Yes, it is used to set the deadline of the gRPC call we make here.
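Roughly, that follows the standard Grpc.Net.Client deadline pattern; a hedged sketch (not the exporter's actual code) of how a per-call timeout becomes the DeadlineExceeded seen in the log above:

using System;
using Grpc.Core;
using Grpc.Net.Client;

var channel = GrpcChannel.ForAddress("https://myendpoint.com:4317");
// var client = new MetricsService.MetricsServiceClient(channel); // generated OTLP client

var timeoutMilliseconds = 10000; // mirrors OtlpExporterOptions.TimeoutMilliseconds
var deadline = DateTime.UtcNow.AddMilliseconds(timeoutMilliseconds);
var callOptions = new CallOptions(deadline: deadline);

// If the collector has not responded by 'deadline', the call throws an
// RpcException with StatusCode.DeadlineExceeded.
// var response = client.Export(request, callOptions);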
Thanks! Looks like https://github.com/open-telemetry/opentelemetry-dotnet/issues/1735 is still open, which states we don't really enforce the timeouts, but I could be wrong. (Or it's only for traces!)
You can increase the timeout period by setting TimeoutMilliseconds
Great info! However, it looks like the default is 10s, so it worries me that we're exceeding that -- especially if each call only includes the latest metrics. I could increase it to 20s or 30s, but I wonder if I'm just doing something wrong. Do you have any suggestions for diagnosing why my sends are exceeding 10s, or just how much data I'm sending? Or is this more likely a network issue between the servers and the collector?
Thanks again, and feel free to close this issue if you feel you've provided all the info you can!
especially if each call only includes the latest metrics.
This is only true if you are using Delta. If you are using Cumulative, then everything from the start will always be exported... Are you using Delta or Cumulative? Also, are there many Observable instruments with callbacks that could be taking a lot of time?
We are indeed using Delta mode:
reader.TemporalityPreference = MetricReaderTemporalityPreference.Delta;
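For context, a sketch of where a line like that typically lives, assuming the AddOtlpExporter overload that also exposes the MetricReaderOptions (meter name and endpoint are illustrative):

using System;
using OpenTelemetry;
using OpenTelemetry.Metrics;

var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyCompany.MyApp")
    .AddOtlpExporter((exporterOptions, reader) =>
    {
        exporterOptions.Endpoint = new Uri("https://myendpoint.com:4317");
        // Delta: each export carries only changes since the previous export,
        // rather than cumulative totals since process start.
        reader.TemporalityPreference = MetricReaderTemporalityPreference.Delta;
    })
    .Build();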
We have a handful of observables. You're suggesting that the time it takes to observe those metrics must be accounted for in the gRPC deadline? That's interesting. We've tried to make those calls quick, but it's certainly something we could take a closer look at -- that could also explain why our servers never recover from this condition.
Any other ideas are most welcome, and thank you again for all the help!
@ladenedge - Just to confirm, you don't have retries enabled, correct?
it's odd that once the server hits DeadlineExceeded, it is not able to recover from that and continues to throw that error until re-started.
I assume you're talking about retries via the HttpClient? If so, then no, I'm using the default factory.
Also, to follow up on the observables: are observables actually queried during the exporter's network call? Looking over our handful of observable counters, they appear quick (e.g. Dictionary.Count) -- but is it even possible that they contribute to the missed deadline?
are observables actually queried during the exporter's network call?
No. (If you were using Prometheus scraping, then the observable callbacks are invoked in response to the scrape request, so they do contribute to the response time of the scrape itself.) In short, observable callbacks are not at play in your case, as you are using a push exporter. (Sorry I confused you with the mention of observables :( )
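To illustrate the kind of callback being discussed, a minimal sketch of an observable instrument backed by a cheap Dictionary.Count-style read (names are illustrative). Consistent with the answer above, the callback runs when the SDK collects metrics, not inside the exporter's gRPC call:

using System.Collections.Concurrent;
using System.Diagnostics.Metrics;

var meter = new Meter("MyCompany.MyApp");
var sessions = new ConcurrentDictionary<string, byte>();

// Invoked at metric collection time (per the periodic reader's interval);
// a quick Count read like this does not eat into the export deadline.
meter.CreateObservableGauge(
    "myapp.active_sessions",
    () => (long)sessions.Count);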
I face the same issue.
For what it's worth, we are increasing our timeout to 30s to see if that makes any difference. (But this change won't be deployed for a week or so, so.. not very helpful yet, heh.)
What is the question?
We're using metrics extensively in our .NET 8 application which is deployed to a dozen or so servers. Once in a while -- say, once a week -- a single server will stop exporting to our OTel Collector. The other servers continue to work fine, and restarting our application fixes the problem.
Thank you for any help!
Additional context
Application is .NET 8 with the following OTel packages: