open-telemetry / opentelemetry-dotnet

The OpenTelemetry .NET Client
https://opentelemetry.io
Apache License 2.0

TLS Handshake error with OLTPExporter and .Net Framework 4.8 #5216

Open Chicoo opened 8 months ago

Chicoo commented 8 months ago

Bug Report

List of all OpenTelemetry NuGet packages and version that you are using (e.g. OpenTelemetry 1.0.2):

OpenTelemetry 1.7.0
OpenTelemetry.Exporter.Console 1.7.0
OpenTelemetry.Exporter.OpenTelemetryProtocol 1.7.0

Runtime version (e.g. net462, net48, netcoreapp3.1, net6.0 etc. You can find this information from the *.csproj file):

net48

Symptom

When connecting to a secured Elastic APM service (v8.10.4) using the OTLP exporter on .NET Framework 4.8, no data is received and a TLS handshake error is shown in the APM logs.

What is the expected behavior?

The data should be received by the APM service and shown in Kibana

What is the actual behavior?

No data was shown in Kibana and the APM service shows an error

{"log.level":"error","@timestamp":"2024-01-12T11:26:43.098Z","log.logger":"beater.http","log.origin":{"file.name":"http/server.go","file.line":3212},"message":"http: TLS handshake error from xxx.xxx.xxx.xxx:39130: EOF","service.name":"apm-server","ecs.version":"1.6.0"}

Reproduce

Create a self-contained project using the template of your choice, apply the minimum required code to result in the issue you're observing.

Additional Context

The same code works when the target framework is .NET 7.

stevejgordon commented 8 months ago

I can't provide a solution, but I looked at this as I work at Elastic. I attempted to repro based on your code (note your repro doesn't show how you configure the tracer, etc., as that code is from a reference project you've not included). I used Elastic Cloud for my testing. I successfully sent traces to the observability endpoint from .NET 4.8.

@Chicoo: It sounds like you are running a self-hosted APM server. Is that the case? If so, are you using self-signed certificates and any specific TLS configuration?

I've checked with the APM server team, and generally, the EOF error is harmless. It looks like the SDK has some logging around export failures. It's probably worth enabling the self-diagnostics to collect the error logs and see if the .NET exception includes more details to help narrow down the issue.
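For readers following along: the SDK's self-diagnostics are enabled by placing a file named `OTEL_DIAGNOSTICS.json` in the process's current directory. A minimal sketch of that file, using the values from the SDK's troubleshooting documentation (the log directory and level shown here are only a starting point and may need adjusting for your host):

```json
{
    "LogDirectory": ".",
    "FileSize": 32768,
    "LogLevel": "Warning"
}
```

`FileSize` is in KiB. The SDK writes a log file into `LogDirectory` containing internal warnings and errors, including export failures.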

Chicoo commented 7 months ago

I am using an on-prem installation of the Elastic Stack, which runs in a domain without internet connectivity. There is a CA in the domain, so from my point of view the certificates are not self-signed. There are no devices between the OTLP client and the endpoint that filter the traffic (no firewalls, etc.).

I have enabled the self-diagnostics, but that does not tell me much. Maybe it will be of some help to someone who understands the source code.

2024-01-23T15:14:52.4060768Z:Exporter failed send data to collector to {0} endpoint. Data will not be sent. Exception: {1}{https://apm.topfas-apm.topfas.nsf/}{Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="failed to connect to all addresses", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1706022892.276000000","description":"Failed to pick subchannel","file":"..\..\..\src\core\ext\filters\client_channel\client_channel.cc","file_line":3129,"referenced_errors":[{"created":"@1706022892.276000000","description":"failed to connect to all addresses","file":"..\..\..\src\core\lib\transport\error_utils.cc","file_line":163,"grpc_status":14}]}")
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Grpc.Core.Internal.AsyncCall`2.UnaryCall(TRequest msg)
   at Grpc.Core.DefaultCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.<BlockingUnaryCall>b__3_0[TRequest,TResponse](TRequest req, ClientInterceptorContext`2 ctx)
   at Grpc.Core.ClientBase.ClientBaseConfiguration.ClientBaseConfigurationInterceptor.BlockingUnaryCall[TRequest,TResponse](TRequest request, ClientInterceptorContext`2 context, BlockingUnaryCallContinuation`2 continuation)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at OpenTelemetry.Proto.Collector.Metrics.V1.MetricsService.MetricsServiceClient.Export(ExportMetricsServiceRequest request, CallOptions options)
   at OpenTelemetry.Proto.Collector.Metrics.V1.MetricsService.MetricsServiceClient.Export(ExportMetricsServiceRequest request, Metadata headers, Nullable`1 deadline, CancellationToken cancellationToken)
   at OpenTelemetry.Exporter.OpenTelemetryProtocol.Implementation.ExportClient.OtlpGrpcMetricsExportClient.SendExportRequest(ExportMetricsServiceRequest request, CancellationToken cancellationToken)}

stevejgordon commented 7 months ago

Thanks for providing that error, @Chicoo. This exception is coming from the gRPC library and certainly looks like an issue connecting, possibly due to certificate validation. Elasticsearch 8 enables many security features by default, including a self-signed certificate. I'm still leaning towards this being an issue with the client/server establishing a secure connection based on that. In the Elasticsearch .NET client, for example, we provide a mechanism to configure it by providing the certificate fingerprint, which we use to validate the server certificate.

I'm not very familiar with the gRPC libraries in .NET, but there are two environment variables that may yield more details. I don't know whether they work for the MS implementation, but they are potentially worth trying. The SDK code only logs the main exception, so I'm unsure if these will help.
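The comment does not name the two variables. As an assumption, the sketch below uses `GRPC_VERBOSITY` and `GRPC_TRACE`, which the native gRPC core documents and which should apply here because the stack trace above comes from `Grpc.Core`. The class name `GrpcDiagnostics` is hypothetical. Whether values set in-process are picked up depends on them being set before the native library initializes, so setting them in the shell that launches the application is the safer variant.

```csharp
using System;
using Grpc.Core;
using Grpc.Core.Logging;

// Hypothetical helper: call Enable() at the very start of the application,
// before the OTLP exporter (and therefore any gRPC channel) is created.
public static class GrpcDiagnostics
{
    public static void Enable()
    {
        // Assumption: these are the two variables meant in the comment above.
        // Both are documented by the native gRPC core library.
        Environment.SetEnvironmentVariable("GRPC_VERBOSITY", "DEBUG");
        Environment.SetEnvironmentVariable("GRPC_TRACE", "all");

        // Route Grpc.Core's log output to the console so it is visible.
        GrpcEnvironment.SetLogger(new ConsoleLogger());
    }
}
```

`GRPC_TRACE=all` is very noisy. Once the failing stage is known, it can be narrowed to a comma-separated list of individual tracers.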

If this were over HTTP, it's possible to configure the OtlpExporterOptions.HttpClientFactory property, allowing you to provide a custom factory that could include configuring certificate validation options. You could try switching to the HTTP OTLP option here.
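A minimal sketch of that suggestion, assuming a placeholder endpoint (`https://apm.example.internal:8200`) and a hypothetical activity source name (`Demo.Source`). It switches the exporter to OTLP over HTTP/protobuf and supplies an `HttpClient` whose handler logs the certificate validation result. It is a diagnostic aid under those assumptions, not a recommended production configuration.

```csharp
using System;
using System.Net.Http;
using System.Net.Security;
using OpenTelemetry;
using OpenTelemetry.Exporter;
using OpenTelemetry.Trace;

public static class Program
{
    public static void Main()
    {
        using (var tracerProvider = Sdk.CreateTracerProviderBuilder()
            .AddSource("Demo.Source") // hypothetical source name
            .AddOtlpExporter(options =>
            {
                options.Protocol = OtlpExportProtocol.HttpProtobuf;

                // Placeholder URL. When the endpoint is set in code for
                // HTTP/protobuf, the signal path (/v1/traces) is included.
                options.Endpoint = new Uri("https://apm.example.internal:8200/v1/traces");

                options.HttpClientFactory = () =>
                {
                    var handler = new HttpClientHandler
                    {
                        // Log what the client sees, then accept the certificate
                        // only if the default validation found no errors.
                        ServerCertificateCustomValidationCallback = (request, cert, chain, errors) =>
                        {
                            Console.WriteLine("Server certificate: " + cert?.Subject
                                + "; policy errors: " + errors);
                            return errors == SslPolicyErrors.None;
                        }
                    };

                    return new HttpClient(handler)
                    {
                        Timeout = TimeSpan.FromMilliseconds(options.TimeoutMilliseconds)
                    };
                };
            })
            .Build())
        {
            // Application code that produces spans would run here.
        }
    }
}
```

If the callback reports `RemoteCertificateChainErrors`, the machine running the client most likely does not trust the issuing CA. If the callback is never invoked, the failure happens earlier in the handshake, for example during protocol or cipher negotiation.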

From here, as I'm unfamiliar with all of the diagnostic options in this library, I'd suggest creating your own HttpClient in the app and sending a request to the root URL for your Elasticsearch instance. That should tell you if it can connect; if not, it may provide further details you can use.
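One way to run that check is a small console program like the sketch below. The default URL is a placeholder and the class name `ConnectivityProbe` is hypothetical. It uses only `System.Net.Http` and prints the full inner-exception chain, because on .NET Framework the TLS-specific detail is usually nested a few levels down.

```csharp
using System;
using System.Net.Http;

public static class ConnectivityProbe
{
    public static void Main(string[] args)
    {
        // Placeholder default; pass the real root URL as the first argument.
        var url = args.Length > 0 ? args[0] : "https://apm.example.internal:8200/";

        try
        {
            using (var client = new HttpClient())
            using (var response = client.GetAsync(url).GetAwaiter().GetResult())
            {
                // Any HTTP status, even 401 or 404, means the TLS handshake succeeded.
                Console.WriteLine("Connected: HTTP " + (int)response.StatusCode);
            }
        }
        catch (Exception ex)
        {
            // Walk the exception chain to surface the underlying cause.
            for (var e = ex; e != null; e = e.InnerException)
            {
                Console.WriteLine(e.GetType().FullName + ": " + e.Message);
            }
        }
    }
}
```

Building this against both `net48` and a newer target and running it on the same machine would show whether the difference between the two frameworks reproduces outside the OpenTelemetry SDK.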

Otherwise, hopefully, one of the maintainers can provide suggestions to diagnose the specific issue with the gRPC connection.

Chicoo commented 7 months ago

@stevejgordon Thanks for the analysis. I am a bit tied up with other work right now, but I will certainly try your suggestions over the next few weeks.