open-telemetry / opentelemetry-dotnet-contrib

This repository contains set of components extending functionality of the OpenTelemetry .NET SDK. Instrumentation libraries, exporters, and other components can find their home here.
https://opentelemetry.io
Apache License 2.0
480 stars 286 forks source link

Add more .NET CLR related metrics to make it similar to .NET Framework #713

Open chandramouleswaran opened 2 years ago

chandramouleswaran commented 2 years ago

Add more .NET CLR related metrics to make it similar to .NET Framework

We are currently on .NET6 and use the Runtime instrumentation package. We would like to start seeing more .NET metrics which are very useful for live site investigations.

Is this a feature request or a bug? Feature Request

What is the expected behavior? We would like the runtime package to start emitting

  1. % Time in GC
  2. GC handles
  3. Bytes in all heaps (this might be a summation or a pre-agg on existing one?)
  4. Logical and Physical threads
  5. GC Finalization survivors
  6. GC Finalization queue length (this does not exist?)
  7. Reserved memory size

What do you expect to see? The corresponding metrics for the above - one of the issues we are dealing with is a memory leak - which we suspect is coming from the native/interop side. We are unable to confirm if that is the case because we also see the number of objects in the finalization queue to be higher and we are not sure if items in the finalization queue is what is causing the memory size to go up.

The size of finalization queue by itself is less but we read online that if items are not cleared from the finalization queue, there might be a memory leak.

What is the actual behavior? This does not exist today. Requesting for these to also be made available and surfaced up.

Additional Context

For verbosity - here are all the counters we used to use previously

    <Counter>\.NET CLR Exceptions(MYPROCESS)\# of Exceps Thrown / sec</Counter>
    <Counter>\.NET CLR Exceptions(MYPROCESS)\# of Exceps Thrown</Counter>
    <Counter>\.NET CLR Jit(MYPROCESS)\# of IL Bytes Jitted</Counter>
    <Counter>\.NET CLR Jit(MYPROCESS)\# of Methods Jitted</Counter>
    <Counter>\.NET CLR Jit(MYPROCESS)\% Time in Jit</Counter>
    <Counter>\.NET CLR Jit(MYPROCESS)\IL Bytes Jitted / sec</Counter>
    <Counter>\.NET CLR Jit(MYPROCESS)\Standard Jit Failures</Counter>
    <Counter>\.NET CLR Jit(MYPROCESS)\Total # of IL Bytes Jitted</Counter>
    <Counter>\.NET CLR Loading(MYPROCESS)\% Time Loading</Counter>
    <Counter>\.NET CLR Loading(MYPROCESS)\Assembly Search Length</Counter>
    <Counter>\.NET CLR Loading(MYPROCESS)\Bytes in Loader Heap</Counter>
    <Counter>\.NET CLR Loading(MYPROCESS)\Current Assemblies</Counter>
    <Counter>\.NET CLR Loading(MYPROCESS)\Current Classes Loaded</Counter>
    <Counter>\.NET CLR Loading(MYPROCESS)\Current appdomains</Counter>
    <Counter>\.NET CLR Loading(MYPROCESS)\Rate of Assemblies</Counter>
    <Counter>\.NET CLR Loading(MYPROCESS)\Rate of Classes Loaded</Counter>
    <Counter>\.NET CLR Loading(MYPROCESS)\Rate of Load Failures</Counter>
    <Counter>\.NET CLR Loading(MYPROCESS)\Rate of appdomains unloaded</Counter>
    <Counter>\.NET CLR Loading(MYPROCESS)\Rate of appdomains</Counter>
    <Counter>\.NET CLR Loading(MYPROCESS)\Total # of Load Failures</Counter>
    <Counter>\.NET CLR Loading(MYPROCESS)\Total Appdomains</Counter>
    <Counter>\.NET CLR Loading(MYPROCESS)\Total Assemblies</Counter>
    <Counter>\.NET CLR Loading(MYPROCESS)\Total Classes Loaded</Counter>
    <Counter>\.NET CLR Loading(MYPROCESS)\Total appdomains unloaded</Counter>
    <Counter>\.NET CLR LocksAndThreads(MYPROCESS)\# of current logical Threads</Counter>
    <Counter>\.NET CLR LocksAndThreads(MYPROCESS)\# of current physical Threads</Counter>
    <Counter>\.NET CLR LocksAndThreads(MYPROCESS)\# of current recognized threads</Counter>
    <Counter>\.NET CLR LocksAndThreads(MYPROCESS)\# of total recognized threads</Counter>
    <Counter>\.NET CLR LocksAndThreads(MYPROCESS)\Contention Rate / sec</Counter>
    <Counter>\.NET CLR LocksAndThreads(MYPROCESS)\Current Queue Length</Counter>
    <Counter>\.NET CLR LocksAndThreads(MYPROCESS)\Queue Length / sec</Counter>
    <Counter>\.NET CLR LocksAndThreads(MYPROCESS)\Total # of Contentions</Counter>
    <Counter>\.NET CLR LocksAndThreads(MYPROCESS)\rate of recognized threads / sec</Counter>
    <Counter>\.NET CLR Memory(MYPROCESS)\# Bytes in all Heaps</Counter>
    <Counter>\.NET CLR Memory(MYPROCESS)\# GC Handles</Counter>
    <Counter>\.NET CLR Memory(MYPROCESS)\# Gen 0 Collections</Counter>
    <Counter>\.NET CLR Memory(MYPROCESS)\# Gen 1 Collections</Counter>
    <Counter>\.NET CLR Memory(MYPROCESS)\# Gen 2 Collections</Counter>
    <Counter>\.NET CLR Memory(MYPROCESS)\# Induced GC</Counter>
    <Counter>\.NET CLR Memory(MYPROCESS)\# Total committed Bytes</Counter>
    <Counter>\.NET CLR Memory(MYPROCESS)\# Total reserved Bytes</Counter>
    <Counter>\.NET CLR Memory(MYPROCESS)\# of Pinned Objects</Counter>
    <Counter>\.NET CLR Memory(MYPROCESS)\# of Sink Blocks in use</Counter>
    <Counter>\.NET CLR Memory(MYPROCESS)\% Time in GC</Counter>
    <Counter>\.NET CLR Memory(MYPROCESS)\Allocated Bytes/sec</Counter>
    <Counter>\.NET CLR Memory(MYPROCESS)\Finalization Survivors</Counter>
    <Counter>\.NET CLR Memory(MYPROCESS)\Gen 0 Promoted Bytes/Sec</Counter>
    <Counter>\.NET CLR Memory(MYPROCESS)\Gen 0 heap size</Counter>
    <Counter>\.NET CLR Memory(MYPROCESS)\Gen 1 Promoted Bytes/Sec</Counter>
    <Counter>\.NET CLR Memory(MYPROCESS)\Gen 1 heap size</Counter>
    <Counter>\.NET CLR Memory(MYPROCESS)\Gen 2 heap size</Counter>
    <Counter>\.NET CLR Memory(MYPROCESS)\Large Object Heap size</Counter>
    <Counter>\.NET CLR Memory(MYPROCESS)\Process ID</Counter>
    <Counter>\.NET CLR Memory(MYPROCESS)\Promoted Finalization-Memory from Gen 0</Counter>
    <Counter>\.NET CLR Memory(MYPROCESS)\Promoted Memory from Gen 0</Counter>
    <Counter>\.NET CLR Memory(MYPROCESS)\Promoted Memory from Gen 1</Counter>
reyang commented 2 years ago

Thanks @chandramouleswaran!

Here is my initial thought:

  1. % Time in GC

This reminded me of https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Instrumentation.Runtime#processruntimedotnetjitcompilation_time, maybe a good solution would be having process.runtime.dotnet.gc.collection_time? @noahfalk

  1. GC handles

This seems important IMHO. I guess we just need to figure out a way to get the value.

  1. Bytes in all heaps (this might be a summation or a pre-agg on existing one?)

This is something we're trying to address in #683, need few more days for the .NET runtime team to get back to us.

  1. Logical and Physical threads

.NET logical threads sound like a good fit for runtime instrumentation to me. Physical threads seem to be a good fit for "process instrumentation", which we plan to address here https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Instrumentation.Process.

A related link to the spec https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/process-metrics.md#process

  1. GC Finalization survivors
  2. GC Finalization queue length (this does not exist?)

These seem to be a very important things IMHO. @noahfalk

  1. Reserved memory size

This sounds like a good fit for "process instrumentation"

noahfalk commented 2 years ago

maybe a good solution would be having process.runtime.dotnet.gc.collection_time? @noahfalk

Yeah I expect we'd want to do something based on the new GC pause duration API: https://github.com/dotnet/runtime/issues/66036

Bytes in all heaps

We should get specific about which bytes are included. Assuming we are talking about the GC heap I propose the single number most useful for most developers would be total committed VM, but currently that isn't the measurement that https://github.com/open-telemetry/opentelemetry-dotnet-contrib/pull/683 is proposing.

In general for GC there are various sets of bytes which could be included (or not):

  1. bytes representing live .NET objects
  2. Fragmentation - Gaps between .NET objects. New objects may be allocated into these gaps over time or objects may be compacted to eliminate the gaps, but that doesn't happen instantly.
  3. Extra committed memory - The GC commits VM from the OS at coarse granularity and then allocates objects at a much smaller granularity. This means there is usually some excess committed memory awaiting new objects to be allocated into it.

Total committed VM includes all three categories, the current proposal for https://github.com/open-telemetry/opentelemetry-dotnet-contrib/pull/683 only includes the first category.

.NET logical threads sound like a good fit for runtime instrumentation to me.

Long ago .NET had the idea that managed threads could be disconnected from OS threads and there was a brief period where .NET supported windows fibers. That support didn't last and for the last 10-15 years there is a direct correlation between a managed thread and an OS thread. The only distinction is that if a given OS thread never executes any managed code then we won't count it as a managed thread, so managed threads are potentially a subset of OS threads. Assuming "logical thread" == "managed thread" then I'm not sure it would make a very useful metric. I think the only reason it existed originally was because when we had fibers it was more useful.

GC Finalization survivors GC Finalization queue length (this does not exist?)

Talking with Maoni (GC owner) she said that it was quite infrequent that we hear from customers where a memory problem originates from excessive finalization queues or finalization survival. That doesn't preclude any particular customer from hitting it, but it does mean we are less likely to prioritize producing APIs (and thus counters) for it. What I would generally suggest for diagnosing a managed memory leak is a workflow like this:

  1. Use a top-level GC size counter to recognize that managed memory is growing excessively
  2. Use a debugger or a memory profiler to capture a snapshot of memory, or ideally multiple snapshots.
  3. Use the memory analysis views within the profiler/debugger to determine which object types are growing between successive snapshots.
  4. Look at the GC root path for a sample of those objects to determine what is keeping them alive
  5. If the root is the finalization queue the tools should report that, in the same way that they would report roots in GC handles, or stack, or statics. Ideally users don't need to test each of these hypotheses one at a time, the profiling/debugging tools will point at the relevant roots whatever they are.

There is an example of this kind of diagnosis here using some command-line tools but the same type of diagnosis can be done with GUI tools too: https://learn.microsoft.com/en-us/dotnet/core/diagnostics/debug-memory-leak

Hope that helps!

noahfalk commented 2 years ago

Bytes in all heaps

I just hunted down the definition of this counter from the docs and it wasn't what I expected. Based on my read it is:

Gen 0 size + Gen 1 size + Gen 2 size + Large Object heap size

The Gen 1, Gen 2, and Large object heap sizes presumably do include fragmentation. The odd part is that Gen 0 size is described as being the budget of potential allocation that could occur in gen 0 before a GC collection would occur. This number could be considerably larger than the current size of the gen 0 heap, and depending on GC implementation details it seems legal that the sum could be larger than the total amount of committed VM. It would mostly depend on whether the GC eagerly commits all memory necessary for the gen 0 budget or it commits that memory on demand. I'd propose if we have a counter that claims to measure "size" it should be defined as a current size rather than a budget for future potential growth.

reyang commented 2 years ago

@noahfalk for .NET runtime specific memory (e.g. gc heap) and threads (e.g. logical threads), do you see a high demand in CLR hosting scenario? (e.g. w3wp.exe and SQL stored-procedures).

noahfalk commented 2 years ago

My understanding is that SQL hosting was a big push circa the mid-2000s on .NET Framework, but I am not aware that .NET Core supports it at all, and pretty sure even .NET Framework dropped the fiber support part. Hosting in w3wp I expect is very common. No pushback at all on gc-heap metrics (though I still think we should try not to confuse people with too many different or non-intuitive gc measurement variations). For logical threads there is nothing actively wrong with it, but it may be low value given the absence of fibers.