open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0
219 stars 141 forks source link

Issues with Hardware Metrics semantic conventions #940

Open eero-t opened 2 months ago

eero-t commented 2 months ago

What are you trying to achieve?

Follow recommendations on: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/

What did you expect to see?

Consistent and implementable specification.

Additional context.

Ran across following issues when trying to map GPU metrics to the semantics...

Non-implementable / unspecified items:

Inconsistencies:

[1] Used e.g. in: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/#hwfan---fan-metrics

PS. Summing errors is not very meaningful (rate is more interesting), but maybe additional all category could be provided just for indicating whether there are any errors (within query period) from given HW? It could be useful both when more fine-categories are missing, and in addition to them.

joaopgrassi commented 2 months ago

FYI @bertysentry

bertysentry commented 2 months ago

@eero-t Thank you for the feedback, give me a few hours to answer your questions. We will address all issues!

bertysentry commented 2 months ago

Thank you for the feedback, @eero-t!

I will try to answer your questions and we will discuss parts what will require an update of the conventions to be able to cover your use cases.

  • Single firmware_version won't work:
    • Devices have multiple types of firmware (display, media, scheduling/power management etc)

Unfortunately, we cannot include all cases of firmware sub-components in the specifications for metrics semantic conventions. However, I know the group is working on a different type of entity (like a Resource) where one could put these kind of attributes describing the monitored component, without polluting the timeseries with too many attributes.

In the meantime, we recommend using the firmware_version attribute to put all necessary information. It's a free format string, it doesn't need to be exactly x.y.z. You set this attribute to: display 23.1; media 1.0.00; scheduling 2.13A.

  • hw.errors have just hw.error.type attribute, although:

    • Errors can be correctable, uncorrectable (data lost), or fatal (functionality lost, at least reset needed to recover)

    • Errors can originate from different parts of the SW stack (FW, kernel, userspace driver)

    • Errors can originate from different parts of the device HW (display, media, compute, 3d etc)

    • => I suggest adding .category attribute, similarly to Level-Zero spec:

    • https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-ras-error-cat-t

The hw.error.type is a free format string that you can set to any applicable value.

[!TIP] As this is an UpDownCounter metric, we recommend to make sure that one error is not counted several times (e.g. once with hw.error.type="correctable" and once with hw.error.type="display". Most users will first display the total number of errors as sum(hw.error{hw.type="gpu"}) before breaking down by error type.

  • hw.gpu.power vs. hw.power{hw.type="gpu"} confusion

This is totally an oversight. The hw.gpu.power metric must be removed from the semantic conventions. Please use hw.power{hw.type="gpu"}.

That's a very good point. In hardware metrics, we've preferred vendor over manufacturer, because it's easier for the end-user to identify the vendor. Example: the vendor of a disk may be "Dell-EMC", while the manufacturer is "WD". Both have value, but in case of a failure, the end user is more likely to contact the vendor rather than the real manufacturer.

We're totally open to discussion on this point. Again, this may be better implemented as part of the new "entity" thing (I can't retrieve the link to the proposed spec for this).

  • system.cpu.frequency vs. hw.cpu.speed

Are you suggesting we add hw.gpu.speed? It's probably a good idea! What speed would we be reporting? (per core, memory, etc.)

WRT to frequency or speed I agree we should use the same terminology everywhere and use the term that the industry commonly uses.

In OpenTelemetry, *.utilization metrics are usually used in conjunction with corresponding *.usage and *.limit. In the case of fan speeds, some fans don't indicate their real speed in rpm and their maximum speed. Also in this case hw.fan.speed.utilization was really improper.

PS. Summing errors is not very meaningful (rate is more interesting), but maybe additional all category could be provided just for indicating whether there are any errors (within query period) from given HW? It could be useful both when more fine-categories are missing, and in addition to them.

For proper aggregation and rate calculation, it is important (when possible) to store the total number of errors and then let the timeseries database calculate the rate for you on a period of time chosen at query time.

Thank you again for the thorough review. We will start by fixing the hw.gpu.power which is not supposed to be here. For other suggestions, we will need the input from the rest of the team:

eero-t commented 2 months ago

In the meantime, we recommend using the firmware_version attribute to put all necessary information. It's a free format string, it doesn't need to be exactly x.y.z. You set this attribute to: display 23.1; media 1.0.00; scheduling 2.13A.

In that case its name should be changed either to be firmware_versions, or e.g. just firmware, to indicate that its value is not semver. And description should be updated accordingly ("free-form list of FW types & their versions").

The hw.error.type is a free format string that you can set to any applicable value.

Not really the answer I was expecting... :-/

As this is an UpDownCounter metric, we recommend to make sure that one error is not counted several times (e.g. once with hw.error.type="correctable" and once with hw.error.type="display". Most users will first display the total number of errors as sum(hw.error{hw.type="gpu"}) before breaking down by error type.

Good to know, thanks!

The hw.gpu.power metric must be removed from the semantic conventions. Please use hw.power{hw.type="gpu"}.

OK.

In hardware metrics, we've preferred vendor over manufacturer, because it's easier for the end-user to identify the vendor. Example: the vendor of a disk may be "Dell-EMC", while the manufacturer is "WD". Both have value, but in case of a failure, the end user is more likely to contact the vendor rather than the real manufacturer.

I see, there are cases where it would be better to use both. However, if vendor and manufacturer are same (which is also often the case), it's inconsistent that "vendor" should be used for some things, and "manufacturer" for others.

IMHO spec could specify both attributes for everything, and state that other one can be dropped if it's not relevant.

Btw. Sysman API spec lists brand & vendor instead of vendor & manufacturer: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-device-properties-t

  • system.cpu.frequency vs. hw.cpu.speed

Are you suggesting we add hw.gpu.speed?

No, that sounds really odd. I think it should be hw.gpu.frequency.

"Speed" could be interpreted e.g. to be FLOPS, whereas I think the base unit for "frequency" is clear (HZ).

It's probably a good idea! What speed would we be reporting? (per core, memory, etc.)

OneAPI Level-Zero Sysman API specification lists following potential metrics for devices like GPUs (and FPGAs etc):

In OpenTelemetry, *.utilization metrics are usually used in conjunction with corresponding *.usage and *.limit. In the case of fan speeds, some fans don't indicate their real speed in rpm and their maximum speed. Also in this case hw.fan.speed.utilization was really improper.

Sounds reasonable.

(I need to think how it applies in my case as some of the metrics are viewed also directly, not just through OpenTelemetry tooling / queries etc.)

eero-t commented 2 months ago

The hw.error.type is a free format string that you can set to any applicable value.

Not really the answer I was expecting... :-/

@bertysentry I mean, different categories of errors require different responses, and from different people:

Semantics for such differences should be explicitly specified, and separate from whether given error is correctable or uncorrectable (i.e. what's the severity of given error category), not included in some "freeform" string.

Otherwise one cannot group the errors meaningfully over different device types (are there e.g. errors that require on-site admin presence).

And please comment also on this:

  • Common HW name & id attributes vs. GPU model & serial attributes
    • Should all of these be provided despite overlap?
bertysentry commented 2 months ago

The hw.error.type is a free format string that you can set to any applicable value. @bertysentry I mean, different categories of errors require different responses, and from different people:

  • User-space (driver) errors: automatically blacklist given deployment / image version => until service developer updates it
  • Kernel & FW errors: Taint device, drain node, update kernel/FW and reboot => cluster admin can do it remotely
  • HW & some FW errors: Taint/drain node, replace HW, power up again => needs HW admin local access

Semantics for such differences should be explicitly specified, and separate from whether given error is correctable or uncorrectable (i.e. what's the severity of given error category), not included in some "freeform" string.

Otherwise one cannot group the errors meaningfully over different device types (are there e.g. errors that require on-site admin presence).

@eero-t Don't worry, we totally understand there are different types of errors that need different responses. That's the reason of the hw.error.type attribute, to cover different types (or categories) or errors. Error types are not just "correctable" or "non-correctable", these are just mere examples of error types for GPUs. The semantic conventions allow you to extend to any type or category of errors as required. This allows the instrumentation to be exhaustive, while the spec remains flexible and future-proof.

Actually, we may change hw.error.type to simply error.type in the future. error.type is the only required attribute for errors, in general, in Otel semantic conventions.

  • Common HW name & id attributes vs. GPU model & serial attributes
    • Should all of these be provided despite overlap?
eero-t commented 2 months ago

@eero-t Don't worry, we totally understand there are different types of errors that need different responses. That's the reason of the hw.error.type attribute, to cover different types (or categories) or errors. Error types are not just "correctable" or "non-correctable", these are just mere examples of error types for GPUs. The semantic conventions allow you to extend to any type or category of errors as required. This allows the instrumentation to be exhaustive, while the spec remains flexible and future-proof.

@bertysentry My point was that those items are semantically orthogonal and should therefore have separate attributes in the semantics spec. Current spec seems to be confused about the role of the errors .type attribute(s) which I think is also is a good indication that it should be split into multiple ones.

For example:

Actually, we may change hw.error.type to simply error.type in the future.

Yes, please! Using already existing shorter common ones, makes both the spec, and the metrics output compliant with it, more readable.

error.type is the only required attribute for errors, in general, in Otel semantic conventions.

Hm. Instead of stuffing semantically different types of error attributes as free-from text into single .type attribute (your recommendation), this page recommends to:

?