open-telemetry / opentelemetry-dotnet

The OpenTelemetry .NET Client
https://opentelemetry.io
Apache License 2.0

Missing traces after OTEL backend has been unavailable? #3740

Closed: zornwal closed this issue 1 month ago

zornwal commented 2 years ago

Question/bug?

I am using OTEL .NET to send traces to a Promscale backend directly from my instrumented application (I know it's recommended to use a collector), and am seeing some odd behaviour that I'm not sure is intended. Whenever the OTEL backend (Promscale) has had downtime (e.g. its container being restarted) and the instrumented application cannot send its traces, the application stops sending traces entirely and does not recover until it is restarted.

I would expect the client to drop/lose whatever traces are being sent while the OTEL backend is down, and that the client would resume sending traces once the backend is up again. However, it seems that the client never resumes sending traces. Is this intended? And if so, is there a way to configure it to start sending traces again?

Nugets & Runtime

<PackageReference Include="OpenTelemetry.Exporter.Console" Version="1.4.0-alpha.1" />
<PackageReference Include="OpenTelemetry.Exporter.OpenTelemetryProtocol" Version="1.4.0-alpha.1" />
<PackageReference Include="OpenTelemetry.Extensions.Hosting" Version="1.0.0-rc9.5" />
<PackageReference Include="OpenTelemetry.Instrumentation.AspNetCore" Version="1.0.0-rc9.5" />
<PackageReference Include="OpenTelemetry.Instrumentation.EntityFrameworkCore" Version="1.0.0-beta.3" />
<PackageReference Include="OpenTelemetry.Instrumentation.Http" Version="1.0.0-rc9.6" />
<PackageReference Include="OpenTelemetry.Instrumentation.Runtime" Version="1.0.0" />
<PackageReference Include="OpenTelemetry.Instrumentation.SqlClient" Version="1.0.0-rc9.6" />

Targeting .NET 6.0
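
For context, the tracing pipeline behind those packages is presumably wired up roughly like the sketch below (not the actual code from the issue; the service name is a placeholder and the endpoint is taken from the self-diagnostics log further down):

// Sketch only, assuming OpenTelemetry.Extensions.Hosting 1.0.0-rc9.x in an ASP.NET Core app.
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetryTracing(tracing => tracing
    .SetResourceBuilder(ResourceBuilder.CreateDefault().AddService("my-service")) // placeholder service name
    .AddAspNetCoreInstrumentation()
    .AddHttpClientInstrumentation()
    .AddSqlClientInstrumentation()
    .AddEntityFrameworkCoreInstrumentation()
    .AddConsoleExporter()
    .AddOtlpExporter(o => o.Endpoint = new Uri("https://my.domain.com:9202"))); // OTLP/gRPC endpoint behind the reverse proxy

var app = builder.Build();
app.Run();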

cijothomas commented 2 years ago

I'd expect the exporter to resume when the endpoint is available again... Could you collect https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry/README.md#self-diagnostics ?
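
(For anyone following along: per the linked README, self-diagnostics is enabled by placing a file named OTEL_DIAGNOSTICS.json in the process's current working directory; the values below are the README's example, not anything specific to this setup.)

{
    "LogDirectory": ".",
    "FileSize": 32768,
    "LogLevel": "Warning"
}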

zornwal commented 2 years ago

Enabling self-diagnostics produces the following logs, with the same error repeating on every send attempt. Restarting the instrumented application resumes exporting; restarting the backend service has no effect.

Successfully opened file.
2022-10-10T07:06:28.9717826Z:Exporter failed send data to collector to {0} endpoint. Data will not be sent. Exception: {1}{https://my.domain.com:9202/}{Grpc.Core.RpcException: Status(StatusCode="Unimplemented", Detail="Bad gRPC response. HTTP status code: 404")
   at Grpc.Net.Client.Internal.HttpClientCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.<BlockingUnaryCall>b__3_0[TRequest,TResponse](TRequest req, ClientInterceptorContext`2 ctx)
   at Grpc.Core.ClientBase.ClientBaseConfiguration.ClientBaseConfigurationInterceptor.BlockingUnaryCall[TRequest,TResponse](TRequest request, ClientInterceptorContext`2 context, BlockingUnaryCallContinuation`2 continuation)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at Opentelemetry.Proto.Collector.Trace.V1.TraceService.TraceServiceClient.Export(ExportTraceServiceRequest request, CallOptions options)
   at Opentelemetry.Proto.Collector.Trace.V1.TraceService.TraceServiceClient.Export(ExportTraceServiceRequest request, Metadata headers, Nullable`1 deadline, CancellationToken cancellationToken)
   at OpenTelemetry.Exporter.OpenTelemetryProtocol.Implementation.ExportClient.OtlpGrpcTraceExportClient.SendExportRequest(ExportTraceServiceRequest request, CancellationToken cancellationToken)}
[...]
cijothomas commented 2 years ago

Tagging @alanwest too. Need to investigate this more.

alanwest commented 2 years ago

Hmm, the status code seems fishy.

Status(StatusCode="Unimplemented", Detail="Bad gRPC response. HTTP status code: 404")

I'm unlikely to be able to repro anything related specifically to Promscale, but I can at least try mocking an Unimplemented response to the exporter and see how it behaves... will update you soon.

zornwal commented 2 years ago

I agree that it seems fishy, but I am hesitant to immediately blame our network setup. I've been doing some digging, and I can see that not all of our apps are affected in the same way. The first one I noticed it for is an ASP.NET Core API hosted via IIS, which shows this 404 behaviour. But another app, which is hosted as a Windows service, is not affected and continues sending traces just fine after the backend recovers. The Windows service has the following package versions:

<PackageReference Include="OpenTelemetry.Exporter.Console" Version="1.4.0-alpha.2" />
<PackageReference Include="OpenTelemetry.Exporter.OpenTelemetryProtocol" Version="1.4.0-alpha.2" />
<PackageReference Include="OpenTelemetry.Extensions.Hosting" Version="1.0.0-rc9.6" />
<PackageReference Include="OpenTelemetry.Instrumentation.EntityFrameworkCore" Version="1.0.0-beta.3" />
<PackageReference Include="OpenTelemetry.Instrumentation.Http" Version="1.0.0-rc9.6" />
<PackageReference Include="OpenTelemetry.Instrumentation.SqlClient" Version="1.0.0-rc9.6" />

but is otherwise instrumented in the same way as the ASP.NET Core API. Both apps are running on the same physical (and virtual) machine, so it seems unlikely that this is exclusively a networking issue - though it may be partially network-related.

I have tried updating the affected app's package versions to match the unaffected one, but that did not change the outcome.

I'm trying to get our ops team involved so I can get some detailed logs from the network side of things, to try to narrow down the issue on that end. From the limited logs I do have, I can tell that the reverse proxy correctly logs an error about being unable to resolve the backend IP while the backend is down, and that it stops logging that error once the backend recovers. It may be a caching issue, but I am hard-pressed to explain why the same endpoint should return different status codes for different clients, or why restarting the client resolves the issue.

Will keep you updated with whatever I find out.

alanwest commented 2 years ago

I implemented a simple test to verify that the OTLP exporter will continue to export after receiving an Unimplemented response code. So, it would seem there is something else going on...
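
A hypothetical sketch of that kind of check, for anyone who wants to try it locally - this is not alanwest's actual test, and the port, source name, and empty-host trick are all assumptions. An ASP.NET Core app with no endpoints mapped answers every request with HTTP 404, which Grpc.Net.Client reports as StatusCode.Unimplemented, matching the log above:

// Hypothetical repro sketch (not the actual test from this thread).
using System.Diagnostics;
using Microsoft.AspNetCore.Server.Kestrel.Core;
using OpenTelemetry;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);
builder.WebHost.ConfigureKestrel(k =>
    k.ListenLocalhost(4317, l => l.Protocols = HttpProtocols.Http2)); // plaintext HTTP/2 (h2c)
var app = builder.Build(); // no endpoints mapped, so every request is answered with 404
_ = app.RunAsync();

using var source = new ActivitySource("Repro.Source");
using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddSource("Repro.Source")
    .AddOtlpExporter(o => o.Endpoint = new Uri("http://localhost:4317"))
    .Build();

for (var i = 0; i < 5; i++)
{
    using (source.StartActivity($"span-{i}")) { }
    tracerProvider.ForceFlush(); // each flush should attempt an export and fail with Unimplemented
}

With self-diagnostics enabled, the expectation is that every flush logs a fresh export failure rather than the exporter going silent.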

zornwal commented 1 year ago

Okay, so I've managed to resolve my issue, although I don't really understand why it was an issue in the first place.

Previously, our reverse proxy was configured to simply route TCP/HTTP traffic to the Promscale backend. Explicitly changing this route to use the h2c protocol (HTTP/2 over cleartext TCP) fixed the issue where clients could not reconnect after the backend had been down.

I'm not knowledgeable enough about networking and/or gRPC to know whether this reconnection failure is intended behaviour, but in any case it does not seem related to the OTLP exporter itself. 😄
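
For anyone landing here with the same symptom: the thread never names the reverse proxy, so purely as a hypothetical illustration (the proxy software and upstream hostname are assumptions, not from this issue), an h2c route in nginx would look roughly like the following - the point being that the proxy has to carry HTTP/2 end to end instead of downgrading the hop to HTTP/1.1.

# Hypothetical illustration only; proxy software and upstream name are assumed.
server {
    listen 9202 http2;                            # accept plaintext HTTP/2 (h2c) from the instrumented apps
    location / {
        grpc_pass grpc://promscale-backend:9202;  # forward as gRPC over h2c, keeping the HTTP/2 framing
    }
}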

github-actions[bot] commented 1 month ago

This issue was marked stale due to lack of activity and will be closed in 7 days. Commenting will instruct the bot to automatically remove the label. This bot runs once per day.