open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0

Add `process.cpu.count` metric to semantic conventions for OS process metrics. #651

Open Yun-Ting opened 1 year ago

Yun-Ting commented 1 year ago

What are you trying to achieve?

Add a metric to the semantic conventions for OS process metrics that exposes the number of processors available to the current process.

Proposed instrument name, type, unit and description:

| Name | Instrument Type (*) | Unit | Description | Labels |
|---|---|---|---|---|
| process.cpu.count | UpDownCounter | {processors} | Number of processors (CPUs) available to the current process. | |

Currently, the definition of process.cpu.utilization is "Difference in process.cpu.time since the last measurement, divided by the elapsed time and number of CPUs available to the process", which requires maintaining the state (in this case, the last collection time) of the instrument.

The challenge encountered during implementation in .NET is: https://github.com/open-telemetry/opentelemetry-dotnet-contrib/issues/831

Potential workarounds: https://github.com/open-telemetry/opentelemetry-dotnet-contrib/pull/948

What did you expect to see?

Add the process.cpu.count metric to the semantic conventions and let the backend do the computation. Given the instrument values of process.cpu.time and process.cpu.count, the backend will have sufficient data to calculate the CPU utilization metric. See https://github.com/open-telemetry/opentelemetry-dotnet-contrib/pull/981
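As an illustration of the backend-side computation described above, here is a minimal Python sketch. It assumes the backend has two consecutive data points, each carrying a collection timestamp, the cumulative process.cpu.time, and process.cpu.count. The `Sample` type and `cpu_utilization` function are hypothetical names for this sketch, not part of any OpenTelemetry API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One collected data point (hypothetical shape, not an OTel type)."""
    timestamp_s: float  # collection time, in seconds
    cpu_time_s: float   # cumulative process.cpu.time, in seconds
    cpu_count: int      # process.cpu.count at collection time


def cpu_utilization(prev: Sample, curr: Sample) -> float:
    """Fraction of the available CPU capacity used between two samples."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0 or curr.cpu_count <= 0:
        raise ValueError("samples must be time-ordered with a positive CPU count")
    used = curr.cpu_time_s - prev.cpu_time_s
    # Same shape as the spec definition: delta CPU time / (elapsed time * CPUs).
    return used / (elapsed * curr.cpu_count)


prev = Sample(timestamp_s=100.0, cpu_time_s=12.0, cpu_count=4)
curr = Sample(timestamp_s=110.0, cpu_time_s=32.0, cpu_count=4)
print(cpu_utilization(prev, curr))  # 20 CPU-seconds over 10 s on 4 CPUs -> 0.5
```

Because the division happens where the samples are stored, the instrumentation itself does not need to remember the last collection time.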

Additional context.

Previous discussion related to this topic: https://github.com/open-telemetry/opentelemetry-specification/pull/2392

trask commented 1 year ago

the definition of process.cpu.utilization is "Difference in process.cpu.time since the last measurement, divided by the elapsed time and number of CPUs available to the process", which requires maintaining the state (in this case, the last collection time) of the instrument.

I think this is a really interesting point.

It seems like process.cpu.utilization is defined as a "delta metric" and may behave strangely in the case of multiple metric readers.

Maybe it's good to remove process.cpu.utilization altogether, in favor of process.cpu.time + process.cpu.count?

trask commented 1 year ago

It seems like process.cpu.utilization is defined as a "delta metric" and may behave strangely in the case of multiple metric readers.

@alanwest made a similar point in https://github.com/open-telemetry/opentelemetry-specification/issues/2439#issuecomment-1353938369:

In the event of multiple meter providers, their reporting intervals may be different. So, calculating the difference in process.cpu.time since the last measurement requires the instrumentation to maintain some state per meter provider. I don't think the metric API spec offers the support necessary for this.
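To make the state problem concrete, here is a small Python simulation (a sketch under stated assumptions, not how any SDK works). It assumes one CPU, a process that is busy for its first 10 seconds and idle afterwards, and two hypothetical readers "A" and "B" that collect at different times. `UtilizationTracker` and `cpu_time_at` are illustrative names; the reader key passed to `observe` is exactly the piece of information the quoted comment says the instrumentation may not have.

```python
def cpu_time_at(t_s: float) -> float:
    """Simulated cumulative CPU time: busy on one CPU for 10 s, then idle."""
    return min(t_s, 10.0)


class UtilizationTracker:
    """Delta CPU time / (elapsed * cpu_count), with 'last measurement' kept per key."""

    def __init__(self, cpu_count: int) -> None:
        self.cpu_count = cpu_count
        self._last = {}  # key -> (timestamp_s, cpu_time_s)

    def observe(self, key, now_s):
        cpu_time = cpu_time_at(now_s)
        prev = self._last.get(key)
        self._last[key] = (now_s, cpu_time)
        if prev is None or now_s <= prev[0]:
            return None  # no earlier measurement to diff against
        return (cpu_time - prev[1]) / ((now_s - prev[0]) * self.cpu_count)


# (collection time in seconds, reader); A collects every 10 s, B at 0 s and 25 s.
collections = [(0.0, "A"), (0.0, "B"), (10.0, "A"), (20.0, "A"), (25.0, "B")]

shared = UtilizationTracker(cpu_count=1)      # one state shared by all readers
per_reader = UtilizationTracker(cpu_count=1)  # state kept separately per reader

shared_results = [(t, r, shared.observe("shared", t)) for t, r in collections]
per_reader_results = [(t, r, per_reader.observe(r, t)) for t, r in collections]

print(shared_results[-1])      # reader B's value with shared state
print(per_reader_results[-1])  # reader B's value with its own state
```

With shared state, reader B's collection at 25 s is diffed against reader A's collection at 20 s and reports 0.0, even though the process used 10 CPU-seconds over B's own 25-second interval (0.4 with per-reader state).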

lalitb commented 1 year ago

It seems like process.cpu.utilization is defined as a "delta metric" and may behave strangely in the case of multiple metric readers.

Agree. It would instead be useful if process.cpu.utilization represented the real-time percentage CPU utilization for the given process (as shown in the %CPU column of the top command on POSIX), as an UpDownCounter. Or maybe have a separate metric for that. For C++ applications on Linux, process.cpu.count would mostly be the total number of cores in the machine, and the same value would be recorded each time. It is probably more useful for the .NET runtime or the JVM, if they report more accurately the number of processors/cores allocated to them. Nothing against this PR though :)
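For illustration, the difference between "cores in the machine" and "processors available to the process" can be sketched in Python using only the standard library. This assumes that the scheduler affinity mask is an acceptable definition of "available"; it does not account for container CPU quotas (for example cgroup limits), and `available_cpu_count` is a hypothetical helper name.

```python
import os


def available_cpu_count() -> int:
    """Best-effort number of CPUs the current process may run on."""
    # Linux (and some other Unix platforms) expose the scheduler affinity mask.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fallback: total logical CPUs on the machine (os.cpu_count() may return None).
    return os.cpu_count() or 1


print("machine:", os.cpu_count(), "process:", available_cpu_count())
```

On a machine where the process is not pinned to a subset of cores, both numbers are normally the same, which matches the observation above that the recorded value would rarely change.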

alanwest commented 1 year ago

Maybe it's good to remove process.cpu.utilization altogether, in favor of process.cpu.time + process.cpu.count?

At a minimum, the spec needs to clarify that it is not possible for instrumentation written using OTel's language APIs to generate the process.cpu.utilization metric (at least as it is currently defined).

Regarding removing it altogether, I'm not sure.

The collector's hostmetrics receiver generates the process.cpu.utilization metric.

It is my understanding that the semantic conventions of the spec are not limited to only describing metrics produced using language APIs. If the collector produces this metric, and its semantics match what is here in the spec, then shouldn't we leave it in the spec?

It would instead be useful if process.cpu.utilization represented the real-time percentage CPU utilization for the given process (as shown in the %CPU column of the top command on POSIX), as an UpDownCounter.

I agree. This sounds like the semantics of the process.runtime.jvm.cpu.utilization metric:

Recent CPU utilization for the process. [2]

[2]: These utilizations are not defined as being for the specific interval since last measurement (unlike system.cpu.utilization).

Unless there is a good reason not to, I would prefer to redefine process.cpu.utilization to be the same as Java's process.runtime.jvm.cpu.utilization. If this is not possible, then perhaps all languages could have a process.runtime.*.cpu.utilization metric.

Regarding process.cpu.count, I'm also not against it, but if its primary purpose is to enable computing utilization, then I think it would be ideal to just produce utilization directly.

arminru commented 1 year ago

From the spec triage meeting: This looks like a reasonable request but the right approach still needs to be decided on, which is already being discussed in this thread.

cc @open-telemetry/instr-wg for your input on this

Yun-Ting commented 1 year ago

Hi @open-telemetry/instr-wg, may I kindly have your take on this? Thank you.

dmitryax commented 8 months ago

This should be transferred to https://github.com/open-telemetry/semantic-conventions

joaopgrassi commented 7 months ago

@open-telemetry/semconv-system-approvers can you please take a look at this and see if it needs more info or can be added to the system semconv project? Thanks!