Open chandramouleswaran opened 2 years ago
Thanks @chandramouleswaran!
Here is my initial thought:
- % Time in GC
This reminded me of https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Instrumentation.Runtime#processruntimedotnetjitcompilation_time, maybe a good solution would be having process.runtime.dotnet.gc.collection_time? @noahfalk
- GC handles
This seems important IMHO. I guess we just need to figure out a way to get the value.
- Bytes in all heaps (this might be a summation or a pre-agg on existing one?)
This is something we're trying to address in #683; we need a few more days for the .NET runtime team to get back to us.
- Logical and Physical threads
.NET logical threads sound like a good fit for runtime instrumentation to me. Physical threads seem to be a good fit for "process instrumentation", which we plan to address here https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Instrumentation.Process.
A related link to the spec https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/process-metrics.md#process
- GC Finalization survivors
- GC Finalization queue length (this does not exist?)
These seem very important IMHO. @noahfalk
- Reserved memory size
This sounds like a good fit for "process instrumentation"
> maybe a good solution would be having process.runtime.dotnet.gc.collection_time? @noahfalk
Yeah I expect we'd want to do something based on the new GC pause duration API: https://github.com/dotnet/runtime/issues/66036
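To illustrate how a "% Time in GC" style value could be derived from a cumulative pause-duration reading, here is a minimal Python sketch of the arithmetic only. It assumes the runtime exposes a monotonically increasing total GC pause time, as the API proposed in the linked issue would. The function name and sample numbers are hypothetical and not part of any OpenTelemetry or .NET API.

```python
def percent_time_in_gc(prev_pause_s, curr_pause_s, prev_wall_s, curr_wall_s):
    """Share of the sampling interval spent in GC pauses, as a percentage.

    Inputs are two readings of a cumulative pause total and the wall-clock
    times at which they were taken (all in seconds). Hypothetical helper.
    """
    interval = curr_wall_s - prev_wall_s
    if interval <= 0:
        raise ValueError("samples must be taken at increasing times")
    pause_delta = curr_pause_s - prev_pause_s
    return 100.0 * pause_delta / interval


# Made-up samples: cumulative pause grew from 1.5 s to 2.0 s over a 10 s interval.
print(percent_time_in_gc(1.5, 2.0, 100.0, 110.0))  # 5.0
```

An alternative would be to export only the cumulative total as a counter and let the metrics backend compute the rate, which avoids choosing a sampling interval in the instrumentation itself.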
> Bytes in all heaps
We should get specific about which bytes are included. Assuming we are talking about the GC heap I propose the single number most useful for most developers would be total committed VM, but currently that isn't the measurement that https://github.com/open-telemetry/opentelemetry-dotnet-contrib/pull/683 is proposing.
In general for GC there are various sets of bytes which could be included (or not):
Total committed VM includes all three categories, the current proposal for https://github.com/open-telemetry/opentelemetry-dotnet-contrib/pull/683 only includes the first category.
> .NET logical threads sound like a good fit for runtime instrumentation to me.
Long ago .NET had the idea that managed threads could be disconnected from OS threads, and there was a brief period where .NET supported Windows fibers. That support didn't last, and for the last 10-15 years there has been a direct one-to-one correspondence between a managed thread and an OS thread. The only distinction is that if a given OS thread never executes any managed code then we won't count it as a managed thread, so managed threads are potentially a subset of OS threads. Assuming "logical thread" == "managed thread", I'm not sure it would make a very useful metric. I think the only reason it existed originally was that it was more useful when we had fibers.
> - GC Finalization survivors
> - GC Finalization queue length (this does not exist?)
I talked with Maoni (GC owner), and she said we quite infrequently hear from customers whose memory problem originates from excessive finalization queues or finalization survival. That doesn't preclude any particular customer from hitting it, but it does mean we are less likely to prioritize producing APIs (and thus counters) for it. What I would generally suggest for diagnosing a managed memory leak is a workflow like this:
There is an example of this kind of diagnosis here using some command-line tools but the same type of diagnosis can be done with GUI tools too: https://learn.microsoft.com/en-us/dotnet/core/diagnostics/debug-memory-leak
Hope that helps!
> Bytes in all heaps
I just hunted down the definition of this counter from the docs and it wasn't what I expected. Based on my read it is:
Gen 0 size + Gen 1 size + Gen 2 size + Large Object heap size
The Gen 1, Gen 2, and Large object heap sizes presumably do include fragmentation. The odd part is that Gen 0 size is described as being the budget of potential allocation that could occur in gen 0 before a GC collection would occur. This number could be considerably larger than the current size of the gen 0 heap, and depending on GC implementation details it seems legal that the sum could be larger than the total amount of committed VM. It would mostly depend on whether the GC eagerly commits all memory necessary for the gen 0 budget or it commits that memory on demand. I'd propose if we have a counter that claims to measure "size" it should be defined as a current size rather than a budget for future potential growth.
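A small Python sketch of that arithmetic, with made-up numbers, shows how the documented sum can exceed committed memory when gen 0 is reported as a budget. All values and names are hypothetical. Whether this happens in practice depends on whether the GC commits the gen 0 budget eagerly, as noted above.

```python
MB = 1024 * 1024

# Hypothetical readings, in bytes.
gen0_budget = 256 * MB   # allowed gen 0 allocation before the next GC
gen0_current = 40 * MB   # what gen 0 actually holds right now
gen1_size = 16 * MB
gen2_size = 300 * MB
loh_size = 64 * MB
committed = 450 * MB     # total committed GC memory in this made-up case


def bytes_in_all_heaps(gen0, gen1, gen2, loh):
    """Sum used by the counter definition: gen 0 + gen 1 + gen 2 + LOH."""
    return gen0 + gen1 + gen2 + loh


budget_based = bytes_in_all_heaps(gen0_budget, gen1_size, gen2_size, loh_size)
current_based = bytes_in_all_heaps(gen0_current, gen1_size, gen2_size, loh_size)

print(budget_based // MB, current_based // MB, committed // MB)  # 636 420 450
```

With these numbers the budget-based sum (636 MB) is larger than the committed total (450 MB), while the sum based on the current gen 0 size (420 MB) is not.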
@noahfalk for .NET runtime specific memory (e.g. GC heap) and threads (e.g. logical threads), do you see high demand in CLR hosting scenarios (e.g. w3wp.exe and SQL stored procedures)?
My understanding is that SQL hosting was a big push circa the mid-2000s on .NET Framework, but I am not aware that .NET Core supports it at all, and I'm pretty sure even .NET Framework dropped the fiber support. Hosting in w3wp I expect is very common. No pushback at all on GC heap metrics (though I still think we should try not to confuse people with too many different or non-intuitive GC measurement variations). For logical threads there is nothing actively wrong with the metric, but it may be low value given the absence of fibers.
Add more .NET CLR related metrics to make it similar to .NET Framework
We are currently on .NET 6 and use the Runtime instrumentation package. We would like to start seeing more .NET metrics, which are very useful for live site investigations.
Is this a feature request or a bug? Feature Request
What is the expected behavior? We would like the runtime package to start emitting
What do you expect to see? The corresponding metrics for the above. One of the issues we are dealing with is a memory leak, which we suspect is coming from the native/interop side. We are unable to confirm whether that is the case because we also see a higher number of objects in the finalization queue, and we are not sure whether items in the finalization queue are what is causing the memory size to go up.
The finalization queue itself is small, but we read online that if items are not cleared from the finalization queue, there might be a memory leak.
What is the actual behavior? This does not exist today. We are requesting that these also be made available and surfaced.
Additional Context
For verbosity - here are all the counters we used to use previously