open-telemetry / opentelemetry-dotnet

The OpenTelemetry .NET Client
https://opentelemetry.io
Apache License 2.0

Unexpected Telemetry Logging Disruption in .NET 6 Projects Using OpenTelemetry Within an IIS Context #4936

Open SSPAGNAMN76 opened 11 months ago

SSPAGNAMN76 commented 11 months ago

Bug Report

Symptom

Describe the bug

After deployment, telemetry logging in our .NET 6 projects using OpenTelemetry only works for a limited time. Logging then stops without any errors or warnings, and a memory dump shows that the OpenTelemetry instances have vanished until the application pool is recycled.

Expected behavior

Continuous and uninterrupted telemetry logging, with the OpenTelemetry instances remaining alive and no application pool recycle required.

Runtime environment:

Additional context

Potential issue with the garbage collector prematurely collecting the MeterProvider instance, interrupting telemetry logging. Deployed .NET 6 applications with slightly different OpenTelemetry initialization code are affected. Notably, within our suite of applications, the two experiencing this issue are the only ones hosted on IIS, a correlation that may warrant further exploration.
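
For illustration, here is a minimal sketch (hypothetical, not our actual initialization code) of how a MeterProvider can become unreachable when nothing holds a reference to it:

using OpenTelemetry;
using OpenTelemetry.Metrics;

public static class MetricsBootstrap
{
    public static void Configure()
    {
        // Anti-pattern: the returned MeterProvider is discarded, so nothing
        // roots it. The GC may collect it at any point, after which metric
        // collection and export silently stop.
        Sdk.CreateMeterProviderBuilder()
            .AddMeter("*")
            .AddOtlpExporter()
            .Build();
    }
}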

Reproduce

Steps to reproduce the behavior:

  1. Deploy a .NET 6 application that uses OpenTelemetry for telemetry logging.
  2. After a short time, observe that telemetry logging stops without any error messages.
  3. Perform a memory dump and note the absence of the OpenTelemetry instances.
  4. Recycle the application pool, note the reappearance of the OpenTelemetry instances, and observe that telemetry logging resumes.

Additional Steps Taken:

private static MeterProvider Provider;

// Configuration logic... I store the MeterProvider in the static variable
Provider = Sdk.CreateMeterProviderBuilder()
    // Additional configuration logic...
    .Build();

I added a GC.KeepAlive call:

// Further configuration logic...
GC.KeepAlive(Provider);

which replaces the previous logic:

services.AddSingleton(meterProvider);

The workaround seemed to work, but in the long run performance started oscillating, so we haven't solved the issue.
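
For reference, a minimal sketch of another direction: root the provider in a static field for the whole process lifetime and dispose it only on host shutdown (illustrative code only; the MetricsLifetime class is hypothetical, and Configure refers to the configuration class shown below):

using MyProject.Metrics;
using OpenTelemetry.Metrics;

public static class MetricsLifetime
{
    // A static field keeps a strong reference to the provider for the whole
    // process lifetime, so the GC cannot collect it while the app is running.
    private static MeterProvider _meterProvider;

    public static void Start(WebApplication app)
    {
        _meterProvider = MetricsConfigurator.Configure(
            app.Configuration, typeof(MetricsLifetime).Assembly);

        // Dispose (and flush) the provider only when the host is shutting down.
        app.Lifetime.ApplicationStopping.Register(() => _meterProvider?.Dispose());
    }
}

In Program.cs this would be called once after building the app, e.g. MetricsLifetime.Start(app).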

Configuration class code

Here is the code of the configuration class before applying the changes described above:

using System.Diagnostics.Metrics;
using System.Reflection;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry;
using OpenTelemetry.Exporter;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;

namespace MyProject.Metrics;

public static class MetricsConfigurator
{
    private static readonly AsyncLocal<MetricStreamConfiguration> CurrentCallConfiguration = new();

    public static MeterProvider Configure(IConfiguration configuration, Assembly mainAssembly)
    {
        var appName = mainAssembly.GetName();
        var resource = ResourceBuilder.CreateDefault()
            .AddService(serviceName: appName.Name, serviceVersion: appName.Version!.ToString())
            .AddAttributes(new KeyValuePair<string, object>[]
            {
                new("server.name", Environment.MachineName),
                new("process.id", Environment.ProcessId)
            });

        var exportInterval = TimeSpan.FromMilliseconds(configuration.GetValue("OpenTelemetry:ExportIntervalMilliseconds", 60000));

        return Sdk.CreateMeterProviderBuilder()
            .AddMeter("*")
            .SetResourceBuilder(resource)
            .AddView(instrument =>
            {
                var config = CurrentCallConfiguration.Value;
                CurrentCallConfiguration.Value = null;
                return config;
            })
            .AddOtlpExporter((eo, mo) =>
            {
                eo.Endpoint = new Uri("http://localhost:2222");
                eo.Protocol = OtlpExportProtocol.Grpc;
                mo.PeriodicExportingMetricReaderOptions = new PeriodicExportingMetricReaderOptions
                {
                    ExportIntervalMilliseconds = (int)exportInterval.TotalMilliseconds
                };
                mo.TemporalityPreference = MetricReaderTemporalityPreference.Delta;
            }).Build();
    }

    public static void AddOpenTelemetry(this IServiceCollection services, IConfiguration configuration)
    {
        var meterProvider = Configure(configuration, Assembly.GetCallingAssembly());
        services.AddSingleton(meterProvider);
    }

    public static Meter UsingConfiguration(this Meter meter, MetricStreamConfiguration configuration)
    {
        CurrentCallConfiguration.Value = configuration;
        return meter;
    }
}
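
For context, here is a hypothetical usage sketch of the UsingConfiguration hook above (the meter name and bucket boundaries are made up for illustration):

using System.Diagnostics.Metrics;
using MyProject.Metrics;
using OpenTelemetry.Metrics;

var meter = new Meter("MyProject.Checkout");

// The MetricStreamConfiguration stored in the AsyncLocal is picked up (and
// cleared) by the AddView callback when the SDK publishes the new instrument.
var checkoutDuration = meter
    .UsingConfiguration(new ExplicitBucketHistogramConfiguration
    {
        Boundaries = new double[] { 10, 50, 100, 250, 500, 1000 }
    })
    .CreateHistogram<double>("checkout.duration", unit: "ms");

checkoutDuration.Record(123.4);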

The class is referenced in Program.cs, which is written using the Minimal Hosting / Minimal APIs approach:

// ......
builder.Services.AddOpenTelemetry(builder.Configuration);
// ......

Log Results:

With logging enabled, the following error comes up, but I'm fairly sure we don't have a network issue:

Clear Indication Towards an IIS Context Issue

After going through the data and the memory dumps, I've dissected the situation; here is a deductive path that leads to the conclusion reached in point 3.

  1. Persistent Memory Instances: After manually invoking the garbage collector during the memory dump, both instances - the metrics-producing application and the collector - are still visible in memory. This suggests the issue is tied to how the instances behave, rather than to an irregularity in memory allocation or deallocation.

  2. Network Error Elimination: The metrics producer and the collector run on the same machine, which effectively rules out network errors as the cause. The error we see on the surface is therefore probably a symptom of an underlying, hidden issue. We also tried replacing localhost:2222 with 127.0.0.1:2222 to force IPv4, without any improvement.

  3. The IIS Context Culprit: In a .NET 6 console application, the metrics are dispatched seamlessly, without discrepancies. Conversely, hosting the application in-process under IIS brings the issue back. The correlation between the problem and the IIS context is clear.

Conclusion

Considering these points, plus the additional information that the problem is observed only on applications hosted on AWS using .NET 6.0 - as opposed to those on Azure using .NET Framework 4.8, which run smoothly - the issue appears to be tied to the IIS context in which the application runs, and potentially to the interaction with the specific environment and framework version.

Your expertise and insights into how applications behave differently when executed in an IIS context, or any anomalies you have seen before, especially in an AWS environment on .NET 6.0, would be immensely valuable. The situation is critical for us, so a swift response would be greatly appreciated.

Observations on peculiarities or settings needed at the environment or application level for this specific hosting scenario would help us move towards a solution.

Thank you in advance for your cooperation; looking forward to the discussion.

cijothomas commented 11 months ago

What is the bug in OpenTelemetry?

As mentioned here (https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/docs/metrics#common-issues-that-lead-to-missing-metrics), the MeterProvider must be kept active for metrics to work. It also links to examples of achieving this with ASP.NET Core style apps as well as normal console apps.
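
For ASP.NET Core apps, the usual way to keep the provider alive is the hosting integration, which ties the MeterProvider's lifetime to the host. A minimal sketch, assuming the OpenTelemetry.Extensions.Hosting package is referenced (the meter name, service name, and endpoint are placeholders):

using OpenTelemetry.Exporter;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;

var builder = WebApplication.CreateBuilder(args);

// The hosting extension registers the MeterProvider in DI; the host keeps it
// alive and disposes it on shutdown, so it cannot be collected prematurely.
builder.Services.AddOpenTelemetry()
    .ConfigureResource(r => r.AddService("MyService"))
    .WithMetrics(metrics => metrics
        .AddMeter("*")
        .AddOtlpExporter(o =>
        {
            o.Endpoint = new Uri("http://localhost:4317");
            o.Protocol = OtlpExportProtocol.Grpc;
        }));

var app = builder.Build();
app.Run();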

SSPAGNAMN76 commented 11 months ago

What is the bug in OpenTelemetry?

As mentioned here (https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/docs/metrics#common-issues-that-lead-to-missing-metrics), the MeterProvider must be kept active for metrics to work. It also links to examples of achieving this with ASP.NET Core style apps as well as normal console apps.

The issue fundamentally is that metrics are collected intermittently. We have environments where everything works properly and environments where the problem occurs. The problem occurs on applications hosted on AWS DEMO, written with .NET 6.0, and running under IIS.

cijothomas commented 11 months ago

What is the bug in OpenTelemetry? As mentioned here (https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/docs/metrics#common-issues-that-lead-to-missing-metrics), the MeterProvider must be kept active for metrics to work. It also links to examples of achieving this with ASP.NET Core style apps as well as normal console apps.

The issue fundamentally is that metrics are collected intermittently. We have environments where everything works properly and environments where the problem occurs. The problem occurs on applications hosted on AWS DEMO, written with .NET 6.0, and running under IIS.

That seems like an environment issue rather than an issue with OpenTelemetry. And it seems you are already fixing your environment to ensure the MeterProvider is not disposed earlier than desired, so I am not seeing anything actionable for OpenTelemetry to fix...

Let me know if I misunderstood the whole issue.

SSPAGNAMN76 commented 11 months ago

Updates

Hello Team and @cijothomas,

I've refactored the configuration class as follows and it seems to be a pretty robust implementation:

using System.Diagnostics.Metrics;
using System.Reflection;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry;
using OpenTelemetry.Exporter;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;

public class SingletonExample
{
    private static readonly object LockObj = new object();
    private static MeterProvider _provider;
    private static IConfiguration Config;
    private static Assembly AssemblyInfo;

    public static readonly AsyncLocal<MetricStreamConfiguration> CurrentConfiguration = new();

    public static MeterProvider Provider
    {
        get
        {
            if (_provider != null) return _provider;
            lock (LockObj)
            {
                _provider ??= CreateConfiguration();
            }
            return _provider;
        }
    }

    public static void Initialize(IConfiguration config, Assembly assembly)
    {
        Config = config ?? throw new ArgumentNullException(nameof(config));
        AssemblyInfo = assembly ?? throw new ArgumentNullException(nameof(assembly));
    }

    private static MeterProvider CreateConfiguration()
    {
        var appDetails = AssemblyInfo.GetName();
        var resource = ResourceBuilder.CreateDefault()
            // Add basic application details
            .AddService(serviceName: appDetails.Name, serviceVersion: appDetails.Version!.ToString())
            // Add system details
            .AddAttributes(new KeyValuePair<string, object>[]
            {
                new("system.name", Environment.MachineName),
                new("process.id", Environment.ProcessId)
            });

        // Export interval in milliseconds
        var exportInterval = TimeSpan.FromMilliseconds(Config.GetValue("Library:ExportIntervalMilliseconds", 60000));

        return Sdk.CreateMeterProviderBuilder()
            .AddMeter("*")
            .SetResourceBuilder(resource)
            .AddView(instrument =>
            {
                var config = CurrentConfiguration.Value;
                CurrentConfiguration.Value = null;
                return config;
            })
            // Configure the OTLP exporter
            .AddOtlpExporter((exportOptions, metricOptions) =>
            {
                exportOptions.Endpoint = new Uri("http://localhost:5110");
                exportOptions.Protocol = OtlpExportProtocol.Grpc;
                metricOptions.PeriodicExportingMetricReaderOptions = new PeriodicExportingMetricReaderOptions
                {
                    ExportIntervalMilliseconds = (int)exportInterval.TotalMilliseconds
                };
                metricOptions.TemporalityPreference = MetricReaderTemporalityPreference.Delta;
            }).Build();
    }
}

public static class ExtensionExample
{
    public static Meter UsingConfiguration(this Meter meter, MetricStreamConfiguration configuration)
    {
        SingletonExample.CurrentConfiguration.Value = configuration;
        return meter;
    }

    public static void AddLibrary(this IServiceCollection services, IConfiguration configuration)
    {
        SingletonExample.Initialize(configuration, Assembly.GetCallingAssembly());
        services.AddSingleton(_ => SingletonExample.Provider);
    }
}

🚀 Upon deployment to the STG environment on AWS, it functioned well for about 3 hours, but then started throwing the following error in bursts:

2023-10-10T07:38:24.9429543Z:Exporter failed send data to collector to {0} endpoint. Data will not be sent. Exception: {1}{http://localhost:5110/}{Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="Error starting gRPC call. HttpRequestException: An error occurred while sending the request. IOException: The request was aborted. IOException: The response ended prematurely while waiting for the next frame from the server.", DebugException="System.Net.Http.HttpRequestException: An error occurred while sending the request.")

Analysis: Even though the collector and the application reside on the same machine (which makes a network error quite unlikely), this error is reported. It leads me to believe that it might be masking another issue underneath.

Additional Query: Could this error originate from the collector side? If so, what specifically should we focus on investigating?

I would appreciate it if someone could delve a bit deeper into this. The IOException ("The response ended prematurely while waiting for the next frame from the server") needs special attention.

Looking forward to hearing thoughts or suggestions on the implementation and error!

Thank you 🙏.

SSPAGNAMN76 commented 11 months ago

🚨 Important Update 🚨

Dear Contributors, @cijothomas

Thank you for your continuous attention and support on this issue. I’d like to share a crucial update regarding our advancements and the current challenges.

Issue: We’ve identified that the test environment hasn’t been logging for over an hour and a specific type of error keeps appearing in the trace.

Actions Taken: We recycled the application pools on the two web servers (which sit behind sticky sessions) one after the other. After recycling the first pool we performed an action to generate a metric, which was not sent to the collector; we then recycled the second pool, performed another operation, and both metrics were sent successfully.

Detailed Analysis: A thorough analysis of the active communications with the collector showed that, before the pool recycles, there were none attributable to our application, while after recycling the pools they are present.

Insights: Metrics are being captured correctly, but there appear to be communication issues with the gRPC service. Interestingly, this only occurs when the application runs on .NET 6.0, in-process on IIS, and hosted on AWS.

Seeking Suggestions: We're reaching out to the community for insights or suggestions that might assist in resolving this issue. Your expertise is invaluable, and any input would be greatly appreciated!

Thank you for your patience and collaboration!

cijothomas commented 11 months ago

Sorry, it's not clear what the ask from this repo is. The error below would explain why telemetry is missing. It could indicate that the collector is not available; it could be a transient error as well. I do not have much expertise in troubleshooting gRPC connection errors.

"2023-10-10T07:38:24.9429543Z:Exporter failed send data to collector to {0} endpoint. Data will not be sent. Exception: {1}{http://localhost:5110/}{Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="Error starting gRPC call. HttpRequestException: An error occurred while sending the request. IOException: The request was aborted. IOException: The response ended prematurely while waiting for the next frame from the server.", DebugException="System.Net.Http.HttpRequestException: An error occurred while sending the request.")"

SSPAGNAMN76 commented 11 months ago

Sorry, it's not clear what the ask from this repo is. The error below would explain why telemetry is missing. It could indicate that the collector is not available; it could be a transient error as well. I do not have much expertise in troubleshooting gRPC connection errors.

"2023-10-10T07:38:24.9429543Z:Exporter failed send data to collector to {0} endpoint. Data will not be sent. Exception: {1}{http://localhost:5110/}{Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="Error starting gRPC call. HttpRequestException: An error occurred while sending the request. IOException: The request was aborted. IOException: The response ended prematurely while waiting for the next frame from the server.", DebugException="System.Net.Http.HttpRequestException: An error occurred while sending the request.")"

@cijothomas Thank you so much for your involvement and interest,

I believe the issue lies within the code, because the collector is gathering data correctly from the other applications (which, however, use .NET Framework 4.8). A networking problem seems extremely unlikely since the collector and the application reside on the same server. I have attempted to substitute 'localhost' with '127.0.0.1' without any improvement.

Your help in shedding light on this perplexing issue would be profoundly appreciated.

Warm regards,

cijothomas commented 11 months ago

I do not have sufficient experience with root-causing gRPC exceptions. One thing to note is that on .NET Framework vs .NET 6.0 the exporter uses a different client library for gRPC, so the issue (if any) could be with the underlying library...

https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry.Exporter.OpenTelemetryProtocol/OpenTelemetry.Exporter.OpenTelemetryProtocol.csproj#L18-L19

matt-hensley commented 11 months ago

There is a proposal in #4860 to remove the older gRPC library.

Is switching protocols to http/protobuf an option in this case?
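
For reference, switching the exporter to http/protobuf would look roughly like the sketch below (illustrative only; with HttpProtobuf the endpoint must include the signal path, and 4318 is the collector's default OTLP/HTTP port):

using OpenTelemetry;
using OpenTelemetry.Exporter;
using OpenTelemetry.Metrics;

var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("*")
    // http/protobuf bypasses the gRPC client stack entirely.
    .AddOtlpExporter((exporterOptions, metricReaderOptions) =>
    {
        exporterOptions.Protocol = OtlpExportProtocol.HttpProtobuf;
        // Full URL including /v1/metrics is required for HttpProtobuf.
        exporterOptions.Endpoint = new Uri("http://localhost:4318/v1/metrics");
        metricReaderOptions.TemporalityPreference = MetricReaderTemporalityPreference.Delta;
    })
    .Build();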

SSPAGNAMN76 commented 11 months ago

Hi @matt-hensley,
I truly appreciate your suggestion regarding migrating to a different protocol for the collector component. It's definitely something to consider for the future. However, at this juncture it's imperative for us to get our current gRPC solution up and running without delay; we can revisit this idea once we have stabilized the present setup. Thanks again for your insight and understanding. Hope you can help me again; below I post an update. Simo

OpenTelemetry Logging Error: Task Cancellation Exception with Grpc.Core

Description:

I've been consistently encountering an error when trying to write logs with OpenTelemetry. The error suggests that there's a task cancellation happening when reaching out via the gRPC channel. Here's the error message:

Failed to write logs to OpenTelemetry due to System.Threading.Tasks.TaskCanceledException: Reached deadline.
   at Grpc.Core.Channel.<WaitForStateChangedAsync>d__16.MoveNext() in /path/obfuscated/Grpc.Core/Channel.cs:line 146
... [rest of the stack trace is similar to provided, but with paths obfuscated]
   at Obfuscated.OpenTelemetryLogProvider.<Flush>d__39.MoveNext() in /path/obfuscated/OpenTelemetryLogProvider.cs:line 250

This error happens at an unspecified frequency and seems to be related to the gRPC channel state change mechanism. This might be causing a delay or inability to log data, which is crucial for our operations.

Any assistance or insights on resolving this issue would be greatly appreciated.

cijothomas commented 11 months ago

OpenTelemetryLogProvider.cs - This does not look like something from this repo...

cijothomas commented 11 months ago

There is a proposal in #4860 to remove the older gRPC library.

Is switching protocols to http/protobuf an option in this case?

From the description, the issue is in .NET 6.0 only, which would be using the new gRPC client!

SSPAGNAMN76 commented 10 months ago

There is a proposal in #4860 to remove the older gRPC library. Is switching protocols to http/protobuf an option in this case?

From the description, the issue is in .NET 6.0 only, which would be using the new gRPC client!

Found the issue but NOT the solution: Expanding x-dynatrace Header Causing Request Rejection by the OpenTelemetry Collector in a .NET 6.0 Web Application

I am grappling with an issue related to the x-dynatrace header in an IIS-hosted web application that talks to a custom OpenTelemetry Collector service over the gRPC protocol. The header keeps expanding, and once its size exceeds a certain threshold (~20 KB) the collector starts rejecting requests, so metrics can no longer be sent.

Environment Details:

Issue In-depth:

  1. After approximately 1 hour and 30 minutes of operation, the x-dynatrace header balloons to around 28 KB.
  2. Recycling the app pool reduces the header size temporarily, but the growth cycle repeats with each subsequent request.
  3. Suspending the Dynatrace agent and recycling the app pool resulted in stable communication between our application and the OT Collector.

Seeking Solutions

  1. Has anyone run into a similar Dynatrace header expansion issue in a comparable environment?
  2. Are there established solutions or workarounds to keep the header from ballooning, or to handle it effectively?
  3. Are there efficient ways to intercept and alter headers in gRPC calls in a .NET 6.0 environment that won't be too resource-intensive? (See the sketch after this list.)
  4. Is there a way to handle this issue programmatically? Specifically, I'm looking for guidance on the most effective way to manage the size of the "x-dynatrace" header dynamically within our codebase.
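
As a starting point for questions 3 and 4, this is the kind of approach we are considering: an HttpClient DelegatingHandler that strips the header from outgoing requests (purely illustrative; whether it can be wired into the OTLP exporter depends on the exporter version and protocol, e.g. via OtlpExporterOptions.HttpClientFactory for http/protobuf):

using System.Net.Http;

// Hypothetical handler that removes the auto-injected "x-dynatrace" header
// before a request leaves the process.
public sealed class StripDynatraceHeaderHandler : DelegatingHandler
{
    public StripDynatraceHeaderHandler(HttpMessageHandler innerHandler)
        : base(innerHandler)
    {
    }

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Drop the header regardless of how many values have accumulated.
        request.Headers.Remove("x-dynatrace");
        return base.SendAsync(request, cancellationToken);
    }
}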

Your insights, suggestions, or shared experiences with similar scenarios would be greatly valuable. Thank you in advance for your expertise and assistance!

cijothomas commented 10 months ago

I'd recommend asking this in Dynatrace agent support.