open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0

Refine definition for *.limit metric name suffix -or- update current usage to match definition #438

Open arminru opened 1 year ago

arminru commented 1 year ago

In https://github.com/open-telemetry/semantic-conventions/pull/409#discussion_r1368368453 it was brought up that certain metrics with the .limit suffix are defined as Gauges, whereas most others are defined as UpDownCounters:

.limits that are UpDownCounters:

.limits that are Gauges:


The UpDownCounters are consistent with our current definition for .limit at https://github.com/open-telemetry/semantic-conventions/blob/v1.22.0/docs/general/metrics.md#instrument-naming:

  • limit - an instrument that measures the constant, known total amount of something should be called entity.limit. For example, system.memory.limit for the total amount of memory on a system.

One can sum up the existing memory, disk space, network bandwidth, or power supply within a given system or compositions of them and get a meaningful aggregate representing the "total amount" available.

The Gauge metrics, however, don't represent an available "total amount". One cannot add the maximum permissible temperature (°C) over multiple components, battery charge fraction for stable operation (%) over multiple batteries, or permissible voltage (V) over multiple components. The aggregated sum breaks the definition and expectation for the individual metric observations.

Two CPUs that can sustain 100 °C each, for example, won't sustain 200 °C together (or 40 °C on one and 160 °C on the other). Three SSDs that operate at 3.3 V won't tolerate 9.9 V on the shared power supply. Nor is a maximum charge level of 300% for three (potentially different) batteries a helpful aggregation.
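The aggregation argument can be made concrete with a small sketch. It uses plain Python rather than any OpenTelemetry SDK, and the observations and their keys (`host-a`, `cpu0`, and so on) are made up for illustration:

```python
# Hypothetical observations; values and keys are illustrative only.
memory_limit_bytes = {"host-a": 16 * 2**30, "host-b": 32 * 2**30}
temperature_limit_celsius = {"cpu0": 100.0, "cpu1": 100.0}

# Summing a "total amount" limit yields the capacity of the combined system.
total_memory = sum(memory_limit_bytes.values())

# Summing a threshold-style limit yields a number that no component
# can actually be held to.
summed_temperature = sum(temperature_limit_celsius.values())

# A threshold stays meaningful per component, or under an aggregation
# such as min (the strictest limit in the group).
strictest_temperature = min(temperature_limit_celsius.values())

print(total_memory // 2**30, "GiB of memory across both hosts")
print(summed_temperature, "°C is not a limit for any CPU")
print(strictest_temperature, "°C is the strictest per-CPU limit")
```

The sum is the natural aggregation for the first kind of limit, which is why an UpDownCounter fits there, while the second kind only makes sense as a per-component Gauge.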


I think our options to resolve this are:

  1. Adapt the definition of limit to allow for both use cases or interpretations. We'd need to remove the "total amount" wording and replace it with something else. We should also consider adding a note that both aggregatable and non-aggregatable limits can occur.
  2. Keep the current definition of limit and introduce a new, well-known suffix for the non-aggregatable limits and change the current Gauge metrics to use this suffix instead.
  3. Keep the current definition of limit and change the current Gauge metrics to use some other suffix that's not defined by our naming conventions.

I'm looking for feedback on which direction we should pursue and potential suggestions for the respective naming/wording.

arminru commented 1 year ago

cc @bertysentry since you initially proposed most of the hardware semantic conventions. Would be great to get some context and suggestions from you.

bertysentry commented 1 year ago

That's an excellent question!

TBH, I had not read the exact definition of the .limit suffix when I wrote the hw. semconv. I just inferred that limit was literally a limit for the underlying instrument.

I'm a bit surprised by the actual definition, which is counter-intuitive: a "limit" shouldn't necessarily mean a total amount. My opinion is that this definition is too restrictive.

My recommendation is to update the definition of the .limit suffix as below:

An instrument that measures the limit of another entity instrument should be named entity.limit.

Examples:

  • system.memory.limit
  • hw.temperature.limit

Different types of limits can be specified with the limit_type attribute (max, min, degraded, critical, throttled, etc.).

The type and unit of the .limit metric must be the same as those of the underlying metric. Example: hw.temperature is a Gauge in degrees Celsius, so hw.temperature.limit must also be defined as a Gauge in degrees Celsius.
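As a rough sketch of how the "same type and unit" rule could be checked mechanically: the `MetricDef` class, the `limit_violations` helper, and the `demo.speed` metrics below are hypothetical and not part of any OpenTelemetry tooling, and the unit string `Cel` for hw.temperature is assumed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDef:
    """Hypothetical, minimal description of a metric definition."""
    name: str
    instrument: str  # e.g. "gauge" or "updowncounter"
    unit: str


def limit_violations(metrics):
    """Return a message for every *.limit metric that breaks the proposed rule."""
    by_name = {m.name: m for m in metrics}
    problems = []
    for m in metrics:
        if not m.name.endswith(".limit"):
            continue
        # The underlying metric is the name with the ".limit" suffix removed.
        base = by_name.get(m.name[: -len(".limit")])
        if base is None:
            problems.append(f"{m.name}: no underlying metric found")
        elif (m.instrument, m.unit) != (base.instrument, base.unit):
            problems.append(
                f"{m.name}: expected {base.instrument} in {base.unit}, "
                f"got {m.instrument} in {m.unit}"
            )
    return problems


metrics = [
    MetricDef("hw.temperature", "gauge", "Cel"),
    MetricDef("hw.temperature.limit", "gauge", "Cel"),
    # Made-up pair that deliberately violates the rule.
    MetricDef("demo.speed", "gauge", "m/s"),
    MetricDef("demo.speed.limit", "updowncounter", "m/s"),
]

problems = limit_violations(metrics)
for message in problems:
    print(message)
```

The limit_type attribute is not modeled in this sketch; under the proposal it would be an attribute on the data points of the .limit metric, not part of the metric name.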

eero-t commented 5 months ago

> I'm a bit surprised by the actual definition, which is counter-intuitive: a "limit" shouldn't necessarily mean a total amount. ... Different types of limits can be precised with the limit_type attribute (max, min, degraded, critical, throttled, etc.).

For example, with GPUs and their memory, there are three different limits: the physical amount of memory, how much of that the kernel exposes to user space (after deducting its own overheads), and how much of that the user-space API exposes to applications.

In the OneAPI Level-Zero Sysman API, the first two limits are named total and available.

In the OpenCL API, the first and last are named GLOBAL_MEM_SIZE and MAX_MEM_ALLOC_SIZE.

bertysentry commented 5 months ago

Conclusion to @arminru's question: option 1 (Adapt the definition of limit to allow for both use cases or interpretations. We'd need to remove the "total amount" wording and replace it with something else. We should also consider adding a note that both aggregatable and non-aggregatable limits can occur.)