Open arminru opened 1 year ago
cc @bertysentry since you initially proposed most of the hardware semantic conventions. Would be great to get some context and suggestions from you.
That's an excellent question!
TBH, I had not read the exact definition of the .limit suffix when I wrote the hw. semconv. I just inferred that limit was literally a limit for the underlying instrument.
I'm a bit surprised by the actual definition, which is counter-intuitive: a "limit" shouldn't necessarily mean a total amount. My opinion is that this definition is too restrictive.
My recommendation is to update the definition of the .limit suffix as below:
An instrument that measures the limit of another entity instrument should be named entity.limit.
Examples: system.memory.limit, hw.temperature.limit
Different types of limits can be specified with the limit_type attribute (max, min, degraded, critical, throttled, etc.).
The type and unit of the .limit metric must be the same as the underlying metric. Example: hw.temperature is a Gauge in degrees Celsius, therefore hw.temperature.limit must be defined as a Gauge in degrees Celsius as well.
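To make the proposed wording concrete, here is a minimal sketch in plain Python (no OpenTelemetry SDK involved). The MetricPoint class, the helper names, the hw.id attribute key, the "Cel" unit string, and all values are assumptions made for illustration; only the metric names and the limit_type values come from the proposal above.

```python
from dataclasses import dataclass, field


@dataclass
class MetricPoint:
    """Hypothetical stand-in for one metric data point."""
    name: str
    instrument: str  # e.g. "gauge" or "updowncounter"
    unit: str
    value: float
    attributes: dict = field(default_factory=dict)


def limit_matches_underlying(metric: MetricPoint, limit: MetricPoint) -> bool:
    """Check the proposed rule: same instrument type and unit, name + '.limit'."""
    return (
        limit.name == metric.name + ".limit"
        and limit.instrument == metric.instrument
        and limit.unit == metric.unit
    )


def exceeded_limit_types(metric: MetricPoint, limits: list) -> list:
    """Return the limit_type of every limit the current value has reached."""
    return [
        p.attributes["limit_type"]
        for p in limits
        if limit_matches_underlying(metric, p) and metric.value >= p.value
    ]


# Illustrative values only.
temperature = MetricPoint("hw.temperature", "gauge", "Cel", 90.0, {"hw.id": "cpu0"})
limits = [
    MetricPoint("hw.temperature.limit", "gauge", "Cel", 85.0,
                {"hw.id": "cpu0", "limit_type": "degraded"}),
    MetricPoint("hw.temperature.limit", "gauge", "Cel", 100.0,
                {"hw.id": "cpu0", "limit_type": "critical"}),
]

print(exceeded_limit_types(temperature, limits))
```

Because the limit shares the type and unit of the underlying metric, the comparison needs no conversion, and several limits can coexist on the same metric name, distinguished only by limit_type.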
I'm a bit surprised by the actual definition, which is counter-intuitive: a "limit" shouldn't necessarily mean a total amount. ... Different types of limits can be precised with the limit_type attribute (max, min, degraded, critical, throttled, etc.).
For example, with GPUs and their memory there's the physical amount of memory, how much of that the kernel exposes to user space (after deducting its own overhead), and how much of that the user-space API exposes to applications.
In the OneAPI Level-Zero Sysman API, the first 2 limits are named total and available.
In the OpenCL API, the first and last are named GLOBAL_MEM_SIZE and MAX_MEM_ALLOC_SIZE.
Conclusion to @arminru's question: option 1 (Adapt the definition of limit to allow for both use cases or interpretations. We'd need to remove the "total amount" wording and replace it with something else. We should also consider adding a note that both aggregatable and non-aggregatable limits can occur.)
In https://github.com/open-telemetry/semantic-conventions/pull/409#discussion_r1368368453 it was brought up that certain metrics with the .limit suffix are defined as Gauges, whereas most others are defined as UpDownCounters.
The UpDownCounters are consistent with our current definition for .limit at https://github.com/open-telemetry/semantic-conventions/blob/v1.22.0/docs/general/metrics.md#instrument-naming
One can sum up the existing memory, disk space, network bandwidth, or power supply within a given system or compositions of them and get a meaningful aggregate representing the "total amount" available.
The Gauge metrics, however, don't represent an available "total amount". One cannot add the maximum permissible temperature (°C) over multiple components, battery charge fraction for stable operation (%) over multiple batteries, or permissible voltage (V) over multiple components. The aggregated sum breaks the definition and expectation for the individual metric observations.
Two CPUs that can sustain 100 °C each, for example, won't sustain 200 °C together (or 40 °C on one and 160 °C on the other). Three SSDs that operate at 3.3 V won't tolerate 9.9 V on the shared power supply. Nor is a maximum charge level of 300% for three (potentially different) batteries a helpful aggregation.
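The difference can be shown with a few lines of arithmetic. This is only a sketch: the memory sizes and the host/CPU names are made up for illustration, while the 100 °C limits and the 40 °C / 160 °C readings are the ones from the example above.

```python
GIB = 2**30

# Aggregatable limit: summing per-host memory gives a meaningful total.
memory_limit_bytes = {"host-a": 16 * GIB, "host-b": 32 * GIB}  # illustrative sizes
total_memory_bytes = sum(memory_limit_bytes.values())

# Non-aggregatable limit: each CPU tolerates 100 °C on its own.
temperature_limit_celsius = {"cpu0": 100.0, "cpu1": 100.0}
temperature_celsius = {"cpu0": 40.0, "cpu1": 160.0}

# Summing the limits yields 200 °C, and a check against that sum passes...
summed_limit = sum(temperature_limit_celsius.values())
sum_check_passes = sum(temperature_celsius.values()) <= summed_limit

# ...even though cpu1 is far above its own limit.
over_limit = [
    cpu
    for cpu, temp in temperature_celsius.items()
    if temp > temperature_limit_celsius[cpu]
]

print(total_memory_bytes // GIB, summed_limit, sum_check_passes, over_limit)
```

The summed memory limit describes the real capacity of the two hosts together, whereas the summed temperature limit hides the one component that is actually overheating.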
I think our options to resolve this are:
1. Adapt the definition of limit to allow for both use cases or interpretations. We'd need to remove the "total amount" wording and replace it with something else. We should also consider adding a note that both aggregatable and non-aggregatable limits can occur.
2. Keep the current definition of limit and introduce a new, well-known suffix for the non-aggregatable limits and change the current Gauge metrics to use this suffix instead.
3. Keep the current definition of limit and change the current Gauge metrics to use some other suffix that's not defined by our naming conventions.
I'm looking for feedback on which direction we should pursue and potential suggestions for the respective naming/wording.