Issues with Hardware Metrics semantic conventions

What are you trying to achieve?

Follow recommendations on: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/

What did you expect to see?

Consistent and implementable specification.

Additional context.

Ran across following issues when trying to map GPU metrics to the semantics...

Non-implementable / unspecified items:

Single firmware_version won't work:
- Devices have multiple types of firmware (display, media, scheduling/power management etc)
hw.errors have just hw.error.type attribute, although:
- Errors can be correctable, uncorrectable (data lost), or fatal (functionality lost, at least reset needed to recover)
- Errors can originate from different parts of the SW stack (FW, kernel, userspace driver)
- Errors can originate from different parts of the device HW (display, media, compute, 3d etc)
- => I suggest adding .category attribute, similarly to Level-Zero spec:
- https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-ras-error-cat-t

Inconsistencies:

hw.gpu.power vs. hw.power{hw.type="gpu"} confusion
- If both are valid, why there's no hw.gpu.energy to match hw.energy{hw.type="gpu"}?
Common HW name & id attributes vs. GPU model & serial attributes
- Should all of these be provided despite overlap?
Why vendor attribute is used for GPU devices, but manufacturer for (host) device: https://opentelemetry.io/docs/specs/semconv/resource/device/
Inconsistent attribute examples for GPU metrics missing from spec:
- system.cpu.frequency vs. hw.cpu.speed
- .utilization vs. .*_ratio [1] suffix for things like (frequency) throttling
- whether in addition to base metric, one should provide .utilization / .*_ratio, or .limit value?

[1] Used e.g. in: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/#hwfan---fan-metrics

PS. Summing errors is not very meaningful (rate is more interesting), but maybe additional all category could be provided just for indicating whether there are any errors (within query period) from given HW? It could be useful both when more fine-categories are missing, and in addition to them.

@eero-t Thank you for the feedback, give me a few hours to answer your questions. We will address all issues!

Thank you for the feedback, @eero-t!

I will try to answer your questions and we will discuss parts what will require an update of the conventions to be able to cover your use cases.

Single firmware_version won't work:

Devices have multiple types of firmware (display, media, scheduling/power management etc)

Unfortunately, we cannot include all cases of firmware sub-components in the specifications for metrics semantic conventions. However, I know the group is working on a different type of entity (like a Resource) where one could put these kind of attributes describing the monitored component, without polluting the timeseries with too many attributes.

In the meantime, we recommend using the firmware_version attribute to put all necessary information. It's a free format string, it doesn't need to be exactly x.y.z. You set this attribute to: display 23.1; media 1.0.00; scheduling 2.13A.

hw.errors have just hw.error.type attribute, although:

Errors can be correctable, uncorrectable (data lost), or fatal (functionality lost, at least reset needed to recover)

Errors can originate from different parts of the SW stack (FW, kernel, userspace driver)

Errors can originate from different parts of the device HW (display, media, compute, 3d etc)

=> I suggest adding .category attribute, similarly to Level-Zero spec:

https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-ras-error-cat-t

The hw.error.type is a free format string that you can set to any applicable value.

[!TIP] As this is an UpDownCounter metric, we recommend to make sure that one error is not counted several times (e.g. once with hw.error.type="correctable" and once with hw.error.type="display". Most users will first display the total number of errors as sum(hw.error{hw.type="gpu"}) before breaking down by error type.

hw.gpu.power vs. hw.power{hw.type="gpu"} confusion

This is totally an oversight. The hw.gpu.power metric must be removed from the semantic conventions. Please use hw.power{hw.type="gpu"}.

Why vendor attribute is used for GPU devices, but manufacturer for (host) device: https://opentelemetry.io/docs/specs/semconv/resource/device/

That's a very good point. In hardware metrics, we've preferred vendor over manufacturer, because it's easier for the end-user to identify the vendor. Example: the vendor of a disk may be "Dell-EMC", while the manufacturer is "WD". Both have value, but in case of a failure, the end user is more likely to contact the vendor rather than the real manufacturer.

We're totally open to discussion on this point. Again, this may be better implemented as part of the new "entity" thing (I can't retrieve the link to the proposed spec for this).

system.cpu.frequency vs. hw.cpu.speed

Are you suggesting we add hw.gpu.speed? It's probably a good idea! What speed would we be reporting? (per core, memory, etc.)

WRT to frequency or speed I agree we should use the same terminology everywhere and use the term that the industry commonly uses.

.utilization vs. .*_ratio [1] suffix for things like (frequency) throttling

whether in addition to base metric, one should provide .utilization / .*_ratio, or .limit value? [1] Used e.g. in: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/#hwfan---fan-metrics

In OpenTelemetry, *.utilization metrics are usually used in conjunction with corresponding *.usage and *.limit. In the case of fan speeds, some fans don't indicate their real speed in rpm and their maximum speed. Also in this case hw.fan.speed.utilization was really improper.

PS. Summing errors is not very meaningful (rate is more interesting), but maybe additional all category could be provided just for indicating whether there are any errors (within query period) from given HW? It could be useful both when more fine-categories are missing, and in addition to them.

For proper aggregation and rate calculation, it is important (when possible) to store the total number of errors and then let the timeseries database calculate the rate for you on a period of time chosen at query time.

Thank you again for the thorough review. We will start by fixing the hw.gpu.power which is not supposed to be here. For other suggestions, we will need the input from the rest of the team:

system.cpu.frequency vs. hw.cpu.speed
vendor vs manufacturer
Add hw.gpu.speed?

In the meantime, we recommend using the firmware_version attribute to put all necessary information. It's a free format string, it doesn't need to be exactly x.y.z. You set this attribute to: display 23.1; media 1.0.00; scheduling 2.13A.

In that case its name should be changed either to be firmware_versions, or e.g. just firmware, to indicate that its value is not semver. And description should be updated accordingly ("free-form list of FW types & their versions").

The hw.error.type is a free format string that you can set to any applicable value.

Not really the answer I was expecting... :-/

As this is an UpDownCounter metric, we recommend to make sure that one error is not counted several times (e.g. once with hw.error.type="correctable" and once with hw.error.type="display". Most users will first display the total number of errors as sum(hw.error{hw.type="gpu"}) before breaking down by error type.

Good to know, thanks!

The hw.gpu.power metric must be removed from the semantic conventions. Please use hw.power{hw.type="gpu"}.

OK.

In hardware metrics, we've preferred vendor over manufacturer, because it's easier for the end-user to identify the vendor. Example: the vendor of a disk may be "Dell-EMC", while the manufacturer is "WD". Both have value, but in case of a failure, the end user is more likely to contact the vendor rather than the real manufacturer.

I see, there are cases where it would be better to use both. However, if vendor and manufacturer are same (which is also often the case), it's inconsistent that "vendor" should be used for some things, and "manufacturer" for others.

IMHO spec could specify both attributes for everything, and state that other one can be dropped if it's not relevant.

Btw. Sysman API spec lists brand & vendor instead of vendor & manufacturer: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-device-properties-t

system.cpu.frequency vs. hw.cpu.speed

Are you suggesting we add hw.gpu.speed?

No, that sounds really odd. I think it should be hw.gpu.frequency.

"Speed" could be interpreted e.g. to be FLOPS, whereas I think the base unit for "frequency" is clear (HZ).

It's probably a good idea! What speed would we be reporting? (per core, memory, etc.)

OneAPI Level-Zero Sysman API specification lists following potential metrics for devices like GPUs (and FPGAs etc):

Engine utilization: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-engine-stats-t
- For large number of engine types: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-engine-group-t
Fabric throughput: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-fabric-port-state-t
Fan speed: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-fan-speed-t
Frequency: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-freq-state-t
- For GPU, memory and media: https://spec.oneapi.io/level-zero/latest/sysman/api.html#frequency-enums
Memory usage: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-mem-state-t
- Either system (shared with CPU) or device local, and with large number of different types
- https://spec.oneapi.io/level-zero/latest/sysman/api.html#memory-enums
Memory bandwidth: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-mem-bandwidth-t
Energy/power usage: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-power-energy-counter-t
- With separate limits for peak, burst, and sustained usage (device specific time intervals)
PSU current, temperature etc: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-psu-state-t
Error counters of different types and categories: https://spec.oneapi.io/level-zero/latest/sysman/api.html#ras-enums
Temperature: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zestemperaturegetstate

In OpenTelemetry, *.utilization metrics are usually used in conjunction with corresponding *.usage and *.limit. In the case of fan speeds, some fans don't indicate their real speed in rpm and their maximum speed. Also in this case hw.fan.speed.utilization was really improper.

Sounds reasonable.

(I need to think how it applies in my case as some of the metrics are viewed also directly, not just through OpenTelemetry tooling / queries etc.)

The hw.error.type is a free format string that you can set to any applicable value.

Not really the answer I was expecting... :-/

@bertysentry I mean, different categories of errors require different responses, and from different people:

User-space (driver) errors: automatically blacklist given deployment / image version => until service developer updates it
Kernel & FW errors: Taint device, drain node, update kernel/FW and reboot => cluster admin can do it remotely
HW & some FW errors: Taint/drain node, replace HW, power up again => needs HW admin local access

Semantics for such differences should be explicitly specified, and separate from whether given error is correctable or uncorrectable (i.e. what's the severity of given error category), not included in some "freeform" string.

Otherwise one cannot group the errors meaningfully over different device types (are there e.g. errors that require on-site admin presence).

And please comment also on this:

Common HW name & id attributes vs. GPU model & serial attributes

Should all of these be provided despite overlap?

The hw.error.type is a free format string that you can set to any applicable value. @bertysentry I mean, different categories of errors require different responses, and from different people:

User-space (driver) errors: automatically blacklist given deployment / image version => until service developer updates it

Kernel & FW errors: Taint device, drain node, update kernel/FW and reboot => cluster admin can do it remotely

HW & some FW errors: Taint/drain node, replace HW, power up again => needs HW admin local access

Semantics for such differences should be explicitly specified, and separate from whether given error is correctable or uncorrectable (i.e. what's the severity of given error category), not included in some "freeform" string.

Otherwise one cannot group the errors meaningfully over different device types (are there e.g. errors that require on-site admin presence).

@eero-t Don't worry, we totally understand there are different types of errors that need different responses. That's the reason of the hw.error.type attribute, to cover different types (or categories) or errors. Error types are not just "correctable" or "non-correctable", these are just mere examples of error types for GPUs. The semantic conventions allow you to extend to any type or category of errors as required. This allows the instrumentation to be exhaustive, while the spec remains flexible and future-proof.

Actually, we may change hw.error.type to simply error.type in the future. error.type is the only required attribute for errors, in general, in Otel semantic conventions.

Common HW name & id attributes vs. GPU model & serial attributes

Should all of these be provided despite overlap?

serial_number and id are not necessarily the same. If they are same in your instrumentation, you can drop serial_number, as it's not a required attribute (only "recommended")
Same for name and model: you can drop model if it's always the same as the name attribute in your instrumentation. Simply make sure to always follow the same logic for all devices.

@eero-t Don't worry, we totally understand there are different types of errors that need different responses. That's the reason of the hw.error.type attribute, to cover different types (or categories) or errors. Error types are not just "correctable" or "non-correctable", these are just mere examples of error types for GPUs. The semantic conventions allow you to extend to any type or category of errors as required. This allows the instrumentation to be exhaustive, while the spec remains flexible and future-proof.

@bertysentry My point was that those items are semantically orthogonal and should therefore have separate attributes in the semantics spec. Current spec seems to be confused about the role of the errors .type attribute(s) which I think is also is a good indication that it should be split into multiple ones.

For example:

.severity: correctable, uncorrectable (data loss), fatal (functionality loss)
.type: type / category of the error; timeout, memory-parity-error, hw-programming-error etc
.location / .origin / .cause / .whatever: e.g. http-endpoint, display (part of GPU with some memory), compute (GPU pipeline setting)

Actually, we may change hw.error.type to simply error.type in the future.

Yes, please! Using already existing shorter common ones, makes both the spec, and the metrics output compliant with it, more readable.

error.type is the only required attribute for errors, in general, in Otel semantic conventions.

Hm. Instead of stuffing semantically different types of error attributes as free-from text into single .type attribute (your recommendation), this page recommends to:

Use a domain-specific attribute
Set error.type to capture all errors, regardless of whether they are defined within the domain-specific set or not.

open-telemetry / semantic-conventions

Issues with Hardware Metrics semantic conventions #940