Open eero-t opened 2 months ago
FYI @bertysentry
@eero-t Thank you for the feedback, give me a few hours to answer your questions. We will address all issues!
Thank you for the feedback, @eero-t!
I will try to answer your questions and we will discuss parts what will require an update of the conventions to be able to cover your use cases.
- Single
firmware_version
won't work:
- Devices have multiple types of firmware (display, media, scheduling/power management etc)
Unfortunately, we cannot include all cases of firmware sub-components in the specifications for metrics semantic conventions. However, I know the group is working on a different type of entity (like a Resource) where one could put these kind of attributes describing the monitored component, without polluting the timeseries with too many attributes.
In the meantime, we recommend using the firmware_version
attribute to put all necessary information. It's a free format string, it doesn't need to be exactly x.y.z
. You set this attribute to: display 23.1; media 1.0.00; scheduling 2.13A
.
hw.errors
have justhw.error.type
attribute, although:
Errors can be correctable, uncorrectable (data lost), or fatal (functionality lost, at least reset needed to recover)
Errors can originate from different parts of the SW stack (FW, kernel, userspace driver)
Errors can originate from different parts of the device HW (display, media, compute, 3d etc)
=> I suggest adding
.category
attribute, similarly to Level-Zero spec:https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-ras-error-cat-t
The hw.error.type
is a free format string that you can set to any applicable value.
[!TIP] As this is an UpDownCounter metric, we recommend to make sure that one error is not counted several times (e.g. once with
hw.error.type="correctable"
and once withhw.error.type="display"
. Most users will first display the total number of errors assum(hw.error{hw.type="gpu"})
before breaking down by error type.
hw.gpu.power
vs.hw.power{hw.type="gpu"}
confusion
This is totally an oversight. The hw.gpu.power
metric must be removed from the semantic conventions. Please use hw.power{hw.type="gpu"}
.
- Why
vendor
attribute is used for GPU devices, butmanufacturer
for (host) device: https://opentelemetry.io/docs/specs/semconv/resource/device/
That's a very good point. In hardware metrics, we've preferred vendor
over manufacturer
, because it's easier for the end-user to identify the vendor. Example: the vendor of a disk may be "Dell-EMC", while the manufacturer is "WD". Both have value, but in case of a failure, the end user is more likely to contact the vendor rather than the real manufacturer.
We're totally open to discussion on this point. Again, this may be better implemented as part of the new "entity" thing (I can't retrieve the link to the proposed spec for this).
system.cpu.frequency
vs.hw.cpu.speed
Are you suggesting we add hw.gpu.speed
? It's probably a good idea! What speed would we be reporting? (per core, memory, etc.)
WRT to frequency
or speed
I agree we should use the same terminology everywhere and use the term that the industry commonly uses.
.utilization
vs..*_ratio
[1] suffix for things like (frequency) throttling- whether in addition to base metric, one should provide
.utilization
/.*_ratio
, or.limit
value? [1] Used e.g. in: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/#hwfan---fan-metrics
In OpenTelemetry, *.utilization
metrics are usually used in conjunction with corresponding *.usage
and *.limit
. In the case of fan speeds, some fans don't indicate their real speed in rpm and their maximum speed. Also in this case hw.fan.speed.utilization
was really improper.
PS. Summing errors is not very meaningful (rate is more interesting), but maybe additional
all
category could be provided just for indicating whether there are any errors (within query period) from given HW? It could be useful both when more fine-categories are missing, and in addition to them.
For proper aggregation and rate calculation, it is important (when possible) to store the total number of errors and then let the timeseries database calculate the rate for you on a period of time chosen at query time.
Thank you again for the thorough review. We will start by fixing the hw.gpu.power
which is not supposed to be here. For other suggestions, we will need the input from the rest of the team:
system.cpu.frequency
vs. hw.cpu.speed
vendor
vs manufacturer
hw.gpu.speed
?In the meantime, we recommend using the
firmware_version
attribute to put all necessary information. It's a free format string, it doesn't need to be exactlyx.y.z
. You set this attribute to:display 23.1; media 1.0.00; scheduling 2.13A
.
In that case its name should be changed either to be firmware_versions
, or e.g. just firmware
, to indicate that its value is not semver. And description should be updated accordingly ("free-form list of FW types & their versions").
The
hw.error.type
is a free format string that you can set to any applicable value.
Not really the answer I was expecting... :-/
As this is an UpDownCounter metric, we recommend to make sure that one error is not counted several times (e.g. once with
hw.error.type="correctable"
and once withhw.error.type="display"
. Most users will first display the total number of errors assum(hw.error{hw.type="gpu"})
before breaking down by error type.
Good to know, thanks!
The
hw.gpu.power
metric must be removed from the semantic conventions. Please usehw.power{hw.type="gpu"}
.
OK.
In hardware metrics, we've preferred
vendor
overmanufacturer
, because it's easier for the end-user to identify the vendor. Example: the vendor of a disk may be "Dell-EMC", while the manufacturer is "WD". Both have value, but in case of a failure, the end user is more likely to contact the vendor rather than the real manufacturer.
I see, there are cases where it would be better to use both. However, if vendor and manufacturer are same (which is also often the case), it's inconsistent that "vendor" should be used for some things, and "manufacturer" for others.
IMHO spec could specify both attributes for everything, and state that other one can be dropped if it's not relevant.
Btw. Sysman API spec lists brand
& vendor
instead of vendor
& manufacturer
: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-device-properties-t
system.cpu.frequency
vs.hw.cpu.speed
Are you suggesting we add
hw.gpu.speed
?
No, that sounds really odd. I think it should be hw.gpu.frequency
.
"Speed" could be interpreted e.g. to be FLOPS, whereas I think the base unit for "frequency" is clear (HZ).
It's probably a good idea! What speed would we be reporting? (per core, memory, etc.)
OneAPI Level-Zero Sysman API specification lists following potential metrics for devices like GPUs (and FPGAs etc):
In OpenTelemetry,
*.utilization
metrics are usually used in conjunction with corresponding*.usage
and*.limit
. In the case of fan speeds, some fans don't indicate their real speed in rpm and their maximum speed. Also in this casehw.fan.speed.utilization
was really improper.
Sounds reasonable.
(I need to think how it applies in my case as some of the metrics are viewed also directly, not just through OpenTelemetry tooling / queries etc.)
The
hw.error.type
is a free format string that you can set to any applicable value.Not really the answer I was expecting... :-/
@bertysentry I mean, different categories of errors require different responses, and from different people:
Semantics for such differences should be explicitly specified, and separate from whether given error is correctable or uncorrectable (i.e. what's the severity of given error category), not included in some "freeform" string.
Otherwise one cannot group the errors meaningfully over different device types (are there e.g. errors that require on-site admin presence).
And please comment also on this:
- Common HW
name
&id
attributes vs. GPUmodel
&serial
attributes
- Should all of these be provided despite overlap?
The
hw.error.type
is a free format string that you can set to any applicable value. @bertysentry I mean, different categories of errors require different responses, and from different people:
- User-space (driver) errors: automatically blacklist given deployment / image version => until service developer updates it
- Kernel & FW errors: Taint device, drain node, update kernel/FW and reboot => cluster admin can do it remotely
- HW & some FW errors: Taint/drain node, replace HW, power up again => needs HW admin local access
Semantics for such differences should be explicitly specified, and separate from whether given error is correctable or uncorrectable (i.e. what's the severity of given error category), not included in some "freeform" string.
Otherwise one cannot group the errors meaningfully over different device types (are there e.g. errors that require on-site admin presence).
@eero-t Don't worry, we totally understand there are different types of errors that need different responses. That's the reason of the hw.error.type
attribute, to cover different types (or categories) or errors. Error types are not just "correctable" or "non-correctable", these are just mere examples of error types for GPUs. The semantic conventions allow you to extend to any type or category of errors as required. This allows the instrumentation to be exhaustive, while the spec remains flexible and future-proof.
Actually, we may change hw.error.type
to simply error.type
in the future. error.type
is the only required attribute for errors, in general, in Otel semantic conventions.
- Common HW
name
&id
attributes vs. GPUmodel
&serial
attributes
- Should all of these be provided despite overlap?
serial_number
and id
are not necessarily the same. If they are same in your instrumentation, you can drop serial_number
, as it's not a required attribute (only "recommended")name
and model
: you can drop model
if it's always the same as the name
attribute in your instrumentation. Simply make sure to always follow the same logic for all devices.@eero-t Don't worry, we totally understand there are different types of errors that need different responses. That's the reason of the
hw.error.type
attribute, to cover different types (or categories) or errors. Error types are not just "correctable" or "non-correctable", these are just mere examples of error types for GPUs. The semantic conventions allow you to extend to any type or category of errors as required. This allows the instrumentation to be exhaustive, while the spec remains flexible and future-proof.
@bertysentry My point was that those items are semantically orthogonal and should therefore have separate attributes in the semantics spec. Current spec seems to be confused about the role of the errors .type
attribute(s) which I think is also is a good indication that it should be split into multiple ones.
For example:
.severity
: correctable, uncorrectable (data loss), fatal (functionality loss).type
: type / category of the error; timeout, memory-parity-error, hw-programming-error etc.location
/ .origin
/ .cause
/ .whatever
: e.g. http-endpoint, display (part of GPU with some memory), compute (GPU pipeline setting)Actually, we may change
hw.error.type
to simplyerror.type
in the future.
Yes, please! Using already existing shorter common ones, makes both the spec, and the metrics output compliant with it, more readable.
error.type
is the only required attribute for errors, in general, in Otel semantic conventions.
Hm. Instead of stuffing semantically different types of error attributes as free-from text into single .type
attribute (your recommendation), this page recommends to:
error.type
to capture all errors, regardless of whether they are defined within the domain-specific set or not.?
What are you trying to achieve?
Follow recommendations on: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/
What did you expect to see?
Consistent and implementable specification.
Additional context.
Ran across following issues when trying to map GPU metrics to the semantics...
Non-implementable / unspecified items:
firmware_version
won't work:hw.errors
have justhw.error.type
attribute, although:.category
attribute, similarly to Level-Zero spec:Inconsistencies:
hw.gpu.power
vs.hw.power{hw.type="gpu"}
confusionhw.gpu.energy
to matchhw.energy{hw.type="gpu"}
?name
&id
attributes vs. GPUmodel
&serial
attributesvendor
attribute is used for GPU devices, butmanufacturer
for (host) device: https://opentelemetry.io/docs/specs/semconv/resource/device/system.cpu.frequency
vs.hw.cpu.speed
.utilization
vs..*_ratio
[1] suffix for things like (frequency) throttling.utilization
/.*_ratio
, or.limit
value?[1] Used e.g. in: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/#hwfan---fan-metrics
PS. Summing errors is not very meaningful (rate is more interesting), but maybe additional
all
category could be provided just for indicating whether there are any errors (within query period) from given HW? It could be useful both when more fine-categories are missing, and in addition to them.