open-telemetry / opentelemetry-dotnet-instrumentation

OpenTelemetry .NET Automatic Instrumentation
https://opentelemetry.io
Apache License 2.0
369 stars 93 forks source link

NetFX dependency redirections fails in some cases #3714

Open RassK opened 1 month ago

RassK commented 1 month ago

What

Runtime dependency redirection mechanism is failing in some cases.

Example affected library: System.Diagnostics.DiagnosticSource

Affected version: OpenTelemetry.AutoInstrumentation v1.8.0 (ships with System.Diagnostics.DiagnosticSource v8.0.0)

App 1 (NetFxConsole): .NET FX 4.8 Console app (very simple, creates one manual Activity + makes a HttpClient call) ref: System.Diagnostics.DiagnosticSource v8.0.1 result: works

App 2 (SimpleFrameworkApp): ASP.NET 4.8 (template) ref: System.Diagnostics.DiagnosticSource v8.0.1 result: Could not load file or assembly 'System.Diagnostics.DiagnosticSource, Version=8.0.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51' or one of its dependencies. The system cannot find the file specified.

Native logs attached: otel-dotnet-auto-5888-NetFxConsole-Native.log otel-dotnet-auto-7856-w3wp-Native.log

RassK commented 1 month ago

Let's make sure we don't release this before a proper fix: https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/pull/3710

This is a very good test example at the moment.

RassK commented 1 month ago

I think @pjanotti figured it out that this check here is skipping important modules from dependency analysis. The initial tests show that after I removed System.Diagnostics prefix from being skipped the test application successfully bumped System.Diagnostics.DiagnosticSource to v8.0.1 from v8.0.0.

Not sure if runtime_information_.is_desktop() doesn't correctly identify IIS apps? Does anyone know how important is the initial skiplist for performance at all?

CC: @zacharycmontoya @nrcventura

pjanotti commented 1 month ago

Not sure if runtimeinformation.is_desktop() doesn't correctly identify IIS apps?

It is based on https://learn.microsoft.com/en-us/dotnet/framework/unmanaged-api/profiling/icorprofilerinfo3-getruntimeinformation-method so unless that is broken it should correctly identify as "desktop" meaning .NET Fx.

Does anyone know how important is the initial skiplist for performance at all?

I'm not aware of any measurement of that. Anyway, we need to ensure first that we are correct we then can take care of optimizing if that becomes an issue. This should affect start up time and should be done only once per app domain IIRC.

nrcventura commented 1 month ago

I think that @pjanotti and @RassK are both correct, and have identified the root cause. It looks like the primary purpose of this skip_assembly_prefixes list was to skip the assemblies that do not need to be instrumented. The problem in this issue arrises because we also use that list to skip the assembly binding redirects.

Either way I think we need to decouple the concepts of which assemblies should be skipped for instrumentation, and which assemblies should be skipped for binding redirects. I suspect there are multiple assemblies in the current skip list that we want to apply binding redirects to. I am also wondering if we should ever skip an assembly for these binding redirects, because more and more built-in assemblies (or at least assemblies in the current skip list) will potentially be instrumented or use one of the libraries that we need to apply these binding redirects to. The current skip list check might need to be applied after we perform the binding redirects.

RassK commented 1 month ago

So should we start with a quick fix by removing https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/blob/main/src/OpenTelemetry.AutoInstrumentation.Native/otel_profiler_constants.h#L48 ?

Seems the most crucial thing to do 🤔

Kielek commented 1 month ago

One more thing, it will be great to have a tests that check if we do not ship (in .netFx folder any library listed/prefixed by https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/blob/315758d7088eafab96a927dc374ca8f9c7f1c994/src/OpenTelemetry.AutoInstrumentation.Native/otel_profiler_constants.h)

nrcventura commented 1 month ago

I don't think that we need the quick fix. I think we can just move the logic for updating the references, before the skip assemblies checks.

RassK commented 1 month ago

Why I'm towards a quickfix is that S.D.DS v8.0.1 is already out there and we don't have a supporting release. Splunk also has a failing POC in progress because of this bug. For last, even if we manage to reorder loading with some time, there are no tests to cover this issue... I'm not sure how to pinpoint this issue so we prevent it happening again 🤔 Testing seems to need loads of work.

pjanotti commented 1 month ago

@RassK I think the hope was that the two things could be really separated: is that too big of a change?

For the tests we could potentially read the redirection from a file only in the test (e.g.: have some env var that reads from a file and ignores the map defined at compile time if defined and if not just proceed with the map defined in the header file). This file could point to earlier versions of the redirect but we should only use the ones from the app. That said I think that the tests could be done in separate as a follow-up.

RassK commented 1 month ago

Seems some cache messed my testing up (by somehow compiling the redirection in). Prob updating my local branch cleared a cache and now the quickfix or making sure System.Diagnostics is not skipped does not work. I also could not get over of the question why IIS and console is different - and seems this is the important question.

Location 1: This is fine, if checks work as supposed https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/blob/main/src/OpenTelemetry.AutoInstrumentation.Native/cor_profiler.cpp#L705

Location 2: There is another discovered path which does skipping. Even fixing it similar to the Location 1 does not fix it. https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/blob/main/src/OpenTelemetry.AutoInstrumentation.Native/cor_profiler.cpp#L1216

So back to the question, why IIS and console differ. There is clearly a difference how AppDomain is handled.

Console app loads the main module (In my case NetFxConsole.exe) in ModuleLoadFinished section. Then analyses it's dependencies and finds System.Diagnostics.DiagnosticSource v8.0.1 and performs redirection update.

[2024-10-24T10:06:30.192636400Z] [3796|7816] [debug] ModuleLoadFinished: 140728057938248 NetFxConsole AppDomain 1766768132752 [DefaultDomain]  | IsNGEN = false | IsDynamic = false | IsResource = false
[2024-10-24T10:06:30.192674500Z] [3796|7816] [debug] RedirectAssemblyReferences: AssemblyRef for [mscorlib] version=4.0.0.0
[2024-10-24T10:06:30.192678000Z] [3796|7816] [debug] RedirectAssemblyReferences: AssemblyRef for [System.Diagnostics.DiagnosticSource] version=8.0.0.1
[2024-10-24T10:06:30.192682400Z] [3796|7816] [info] RedirectAssemblyReferences: redirection update for [System.Diagnostics.DiagnosticSource] to_version=8.0.0.1 previous_version_redirection=8.0.0.0
[2024-10-24T10:06:30.192718000Z] [3796|7816] [debug] RedirectAssemblyReferences: AssemblyRef for [System.Net.Http] version=4.2.0.0
[2024-10-24T10:06:30.192723000Z] [3796|7816] [debug] RedirectAssemblyReferences: same version for [System.Net.Http] version=4.2.0.0 previous_redirects=0
[2024-10-24T10:06:30.192724300Z] [3796|7816] [debug] RedirectAssemblyReferences: EnumAssemblyRefs returned S_FALSE assembly_refs_count=0
[2024-10-24T10:06:30.192728900Z] [3796|7816] [debug] ModuleLoadFinished: done NetFxConsole v1.0.0.0

However IIS app never performs ModuleLoadFinished for the main module (In my case SimpleFrameworkApp.dll). Therefore skips entirely the dependency analysis for the dependencies of the SimpleFrameworkApp. I could not find any other AppDomain having System.Diagnostics.DiagnosticSource as a dependency besides our modules (OpenTelemetry.* modules).

I can find direct loading of System.Diagnostics.DiagnosticSource v8.0.1 module but because it's not a dependency of a module then no redirection analysis is performed.

[2024-10-24T10:06:24.023477200Z] [3572|6248] [debug] ModuleLoadFinished: 140728071477120 System.Diagnostics.DiagnosticSource AppDomain 2175410301504 [/LM/W3SVC/5/ROOT-1-133742379818231863]  | IsNGEN = false | IsDynamic = false | IsResource = false
[2024-10-24T10:06:24.023751000Z] [3572|6248] [debug] RedirectAssemblyReferences: AssemblyRef for [mscorlib] version=4.0.0.0
[2024-10-24T10:06:24.023754500Z] [3572|6248] [debug] RedirectAssemblyReferences: AssemblyRef for [System.Memory] version=4.0.1.2
[2024-10-24T10:06:24.023758000Z] [3572|6248] [debug] RedirectAssemblyReferences: same version for [System.Memory] version=4.0.1.2 previous_redirects=0
[2024-10-24T10:06:24.023760500Z] [3572|6248] [debug] RedirectAssemblyReferences: AssemblyRef for [System] version=4.0.0.0
[2024-10-24T10:06:24.023763000Z] [3572|6248] [debug] RedirectAssemblyReferences: AssemblyRef for [System.Core] version=4.0.0.0
[2024-10-24T10:06:24.023765800Z] [3572|6248] [debug] RedirectAssemblyReferences: AssemblyRef for [System.Runtime.CompilerServices.Unsafe] version=6.0.0.0
[2024-10-24T10:06:24.023768700Z] [3572|6248] [debug] RedirectAssemblyReferences: same version for [System.Runtime.CompilerServices.Unsafe] version=6.0.0.0 previous_redirects=0
[2024-10-24T10:06:24.023769900Z] [3572|6248] [debug] RedirectAssemblyReferences: EnumAssemblyRefs returned S_FALSE assembly_refs_count=0
[2024-10-24T10:06:24.023773100Z] [3572|6248] [debug] ModuleLoadFinished: done System.Diagnostics.DiagnosticSource v8.0.0.1

So my next question is: Where is ModuleLoadFinished for SimpleFrameworkApp module ?

RassK commented 3 weeks ago

https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/pull/3751 is going to fix the issue for some time. Will keep this open to investigate a proper solution.