open-telemetry / opentelemetry-dotnet-contrib

This repository contains a set of components that extend the functionality of the OpenTelemetry .NET SDK. Instrumentation libraries, exporters, and other components can find their home here.
https://opentelemetry.io
Apache License 2.0

Deadlock while instrumenting EventCountersMetrics #1024

Closed: besseb05 closed this issue 1 year ago

besseb05 commented 1 year ago

Issue with OpenTelemetry.Instrumentation.EventCounters

List of all OpenTelemetry NuGet packages and versions that you are using (e.g. OpenTelemetry 1.0.2):

<PackageReference Update="OpenTelemetry" Version="1.4.0-beta.3" />
<PackageReference Update="OpenTelemetry.Api" Version="1.4.0-beta.3" />
<PackageReference Update="OpenTelemetry.Exporter.CredScan" Version="2.0.12" />
<PackageReference Update="OpenTelemetry.Extensions.Hosting" Version="1.0.0-rc9.9" />
<PackageReference Update="OpenTelemetry.Exporter.Zipkin" Version="1.4.0-beta.3" />
<PackageReference Update="OpenTelemetry.Instrumentation.AspNetCore" Version="1.0.0-rc9.9" />
<PackageReference Update="OpenTelemetry.Instrumentation.EventCounters" Version="1.0.0-alpha.2" />
<PackageReference Update="OpenTelemetry.Instrumentation.GrpcNetClient" Version="1.0.0-rc9.9" />
<PackageReference Update="OpenTelemetry.Instrumentation.Http" Version="1.0.0-rc9.9" />
<PackageReference Update="OpenTelemetry.Instrumentation.Process" Version="1.0.0-alpha.2" />
<PackageReference Update="OpenTelemetry.Instrumentation.Runtime" Version="1.1.0-beta.1" />

Runtime version: net7.0

Is this a feature request or a bug?

This is a bug.

What is the expected behavior?

When instrumenting our code, we expect the following code, which configures the MeterProvider, to run when the Service Fabric service settings are created. It should execute and return normally so that the service can start up and begin serving calls.

    #pragma warning disable CA2000 // Dispose objects before losing scope
    EventSource gcEventSource = new("gc");
    #pragma warning restore CA2000 // Dispose objects before losing scope

    // This counter will be removed once the OT team adds a process.gc.percentageTimeinGC metric to the OT process instrumentation library.
    #pragma warning disable CA2000 // Dispose objects before losing scope
    PollingCounter gcPauseDurationPc = new("pauseDuration", gcEventSource, () => GC.GetTotalPauseDuration().TotalMilliseconds);
    #pragma warning restore CA2000 // Dispose objects before losing scope

    MeterProvider meterProvider = Sdk.CreateMeterProviderBuilder()
        .AddMeter(AzureCommonNamespace)
        .AddProcessInstrumentation()
        .AddRuntimeInstrumentation()
        // Add a view so that histograms created from Azure-Common have no buckets, which helps with performance.
        .AddView((instrument) =>
        {
            if (instrument.Meter.Name == AzureCommonNamespace)
            {
                return new ExplicitBucketHistogramConfiguration() { Boundaries = Array.Empty<double>() };
            }

            return null;
        })
        .AddEventCountersInstrumentation(options =>
        {
            options.AddEventSources(gcEventSource.Name);
        })
        .Build();

    // Most of the process.runtime.dotnet.gc.* metrics are only available after the GC has finished at least one collection.
    GC.Collect(1);

What is the actual behavior?

We have come across an instance where, according to a memory dump we took of a stuck process, two threads deadlocked while running this instrumentation code. I'm sharing the stack traces below:

[Screenshot: stack traces of the two deadlocked threads from the memory dump]

We have only observed this behavior once so far, so it's probably a rare race condition. We were following this documentation as guidance. Could it be related to creating the EventSource before we build the MeterProvider?
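For clarity, here is a minimal sketch of the alternative ordering that question refers to: building the MeterProvider first and only then creating the EventSource and PollingCounter. This is purely illustrative and not a confirmed mitigation; it reuses the names from the snippet above and hard-codes the "gc" source name, since the source does not yet exist when the builder runs.

    using System;
    using System.Diagnostics.Tracing;
    using OpenTelemetry;
    using OpenTelemetry.Metrics;

    // Illustration only (not a confirmed fix): build the MeterProvider before the
    // EventSource exists. The EventCounters instrumentation should still pick the
    // source up later through its EventListener.OnEventSourceCreated callback.
    MeterProvider meterProvider = Sdk.CreateMeterProviderBuilder()
        .AddEventCountersInstrumentation(options => options.AddEventSources("gc"))
        .Build();

    EventSource gcEventSource = new("gc");
    PollingCounter gcPauseDurationPc = new(
        "pauseDuration", gcEventSource, () => GC.GetTotalPauseDuration().TotalMilliseconds);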


cijothomas commented 1 year ago

@mic-max Could you investigate this?

mic-max commented 1 year ago

@mic-max Could you investigate this?

Yes Cijo, looking into it now

Update: I do believe this is a threading bug in the EventCounters instrumentation. I'm trying to reproduce it with Coyote to confirm, and to create a failing test that can then be fixed.
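For context, a rough sketch of what a Coyote-based repro attempt could look like, exercising only the public API: one task builds a MeterProvider with the EventCounters instrumentation while another creates an EventSource and a PollingCounter at the same time. This is an assumption-laden illustration, not the actual failing test; it assumes the Microsoft.Coyote.Test package's [Test] attribute and the coyote test CLI.

    using System.Diagnostics.Tracing;
    using System.Threading.Tasks;
    using Microsoft.Coyote.SystematicTesting;
    using OpenTelemetry;
    using OpenTelemetry.Metrics;

    public static class EventCountersDeadlockRepro
    {
        // Run with, e.g.: coyote test ./bin/Debug/net7.0/Repro.dll --iterations 1000
        [Test]
        public static async Task InitializeConcurrently()
        {
            // Task 1: building the MeterProvider constructs the instrumentation's EventListener.
            Task build = Task.Run(() =>
            {
                using MeterProvider provider = Sdk.CreateMeterProviderBuilder()
                    .AddEventCountersInstrumentation(o => o.AddEventSources("gc"))
                    .Build();
            });

            // Task 2: creating an EventSource concurrently triggers
            // OnEventSourceCreated/EnableEvents on that listener.
            Task create = Task.Run(() =>
            {
                using var source = new EventSource("gc");
                using var counter = new PollingCounter("pauseDuration", source, () => 0.0);
            });

            await Task.WhenAll(build, create);
        }
    }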

mic-max commented 1 year ago

When instantiating an EventListener, the callbacks to OnEventSourceCreated and OnEventWritten can happen before the constructor has completed. Take care when you initialize instance members used in those callbacks.

Source: https://learn.microsoft.com/en-us/dotnet/api/system.diagnostics.tracing.eventlistener?view=net-7.0
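To make that warning concrete, here is a small sketch (not the instrumentation's actual code) of the pattern the docs caution against: a field assigned in the derived constructor body can still be null when OnEventSourceCreated fires from the base EventListener constructor, so the callback needs a guard (or the field needs a field initializer, which runs before the base constructor is invoked).

    using System.Collections.Generic;
    using System.Diagnostics.Tracing;

    internal sealed class ExampleListener : EventListener
    {
        // Assigned in the constructor body, which runs AFTER the base EventListener
        // constructor; the base constructor may already have raised OnEventSourceCreated
        // for pre-existing sources, so the callback can observe this field as null.
        private readonly List<string> sourceNames;

        public ExampleListener()
        {
            this.sourceNames = new List<string>();
        }

        protected override void OnEventSourceCreated(EventSource eventSource)
        {
            // Guard against the callback running before the constructor body has executed.
            this.sourceNames?.Add(eventSource.Name);
        }
    }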

Yun-Ting commented 1 year ago

@besseb05: Are we able to pin down whether it was the lock shown in the screenshot below (taken inside EnableEvents) that deadlocked with the lock at https://github.com/open-telemetry/opentelemetry-dotnet-contrib/blob/7fda8d43b3d2c7a3f314df88063b28b4ea2078bb/src/OpenTelemetry.Instrumentation.EventCounters/EventCountersMetrics.cs#LL95-L111C10?

[Screenshot: lock taken inside EventListener.EnableEvents]

Another clarification question I have: we are using EventCountersMetrics as a singleton in this case, right?
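For readers following the lock question above, here is a purely illustrative sketch of the kind of lock-ordering inversion being asked about: one thread holds the EventListener machinery's lock (as in EnableEvents) and then needs the instrumentation's lock, while another thread acquires them in the opposite order. The lock names are placeholders, not the actual fields in EventCountersMetrics.cs.

    using System.Threading.Tasks;

    // Illustration only: two locks acquired in opposite orders by two threads.
    internal static class LockOrderingSketch
    {
        private static readonly object listenerLock = new();        // stands in for the EventListener-side lock
        private static readonly object instrumentationLock = new(); // stands in for the EventCountersMetrics lock

        public static void Run()
        {
            Task t1 = Task.Run(() =>
            {
                lock (listenerLock)            // e.g. held inside EnableEvents
                {
                    lock (instrumentationLock) // then needs the instrumentation's lock
                    {
                    }
                }
            });

            Task t2 = Task.Run(() =>
            {
                lock (instrumentationLock)     // e.g. held by the instrumentation callback
                {
                    lock (listenerLock)        // then calls EnableEvents
                    {
                    }
                }
            });

            // Deadlocks when each thread acquires its outer lock before the other releases it.
            Task.WaitAll(t1, t2);
        }
    }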

besseb05 commented 1 year ago

Hi @Yun-Ting, I am not sure if it is that lock... the memory dump that we had is no longer available and I didn't take note of it when I looked at it before, sorry

Another clarification question I have: we are using EventCountersMetrics as a singleton in this case, right?

I am not sure about this... I shared the initialization we are doing in our service

Yun-Ting commented 1 year ago

@besseb05, no worries. Let me know if you happen to hit the issue again and get the dump. Thanks!

besseb05 commented 1 year ago

Hi @Yun-Ting, we ran into the issue again. I believe that is the lock causing the deadlock, but I am not very experienced at analyzing memory dumps. The call stack in the dump definitely shows that the thread is waiting on a lock in that function.

Yun-Ting commented 1 year ago

I believe that is the lock causing the deadlock, but I am not very experienced at analyzing memory dumps.

Hi @besseb05, would you mind sharing the dump file here so I can take a look?

The call stack in the dump definitely shows that the thread is waiting on a lock in that function.

Do you mean that the deadlock was happening between these two locks? https://github.com/open-telemetry/opentelemetry-dotnet-contrib/issues/1024#issuecomment-1464561184

besseb05 commented 1 year ago

I believe that is the lock causing the deadlock, but I am not very experienced at analyzing memory dumps.

@Yun-Ting I don't think I am able to share the dump, sorry

Do you mean that the deadlock was happening between these two locks? https://github.com/open-telemetry/opentelemetry-dotnet-contrib/issues/1024#issuecomment-1464561184

I think so. Were you able to replicate that deadlock on your side?

besseb05 commented 1 year ago

@Yun-Ting do you have any updates?

besseb05 commented 1 year ago

Hi @Yun-Ting do you have any updates?

Yun-Ting commented 1 year ago

Hi @besseb05, the mitigation has been merged and released: https://github.com/open-telemetry/opentelemetry-dotnet-contrib/releases/tag/Instrumentation.EventCounters-1.5.1-alpha.1.
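For anyone else hitting this, picking up the mitigation should just be a matter of bumping the package reference to the version in the release tag above (shown in the same format as the package list at the top of the issue):

    <PackageReference Update="OpenTelemetry.Instrumentation.EventCounters" Version="1.5.1-alpha.1" />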