open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0

hw.host.power/energy versus hw.power/energy metrics #1055

Open sebastien-rosset opened 1 year ago

sebastien-rosset commented 1 year ago

What are you trying to achieve?

Improve the spec to provide guidelines for:

  1. hw.host.power versus hw.power metric.
  2. hw.host.energy versus hw.energy metric.
  3. hardware components that can report both IN/OUT power/energy utilization.

What did you expect to see?

Better guidelines for hw.host.power and hw.power metrics.

Additional context.

Use hw.power instead of hw.gpu.power

I see hw.gpu.power has specifically been defined for GPUs. Shouldn't GPU power be standardized to use hw.power?

Device with In/Out power

Some devices transfer some of the energy they receive, i.e., they have IN and OUT power. What metrics and attributes should be used to report power/energy utilization? In particular, how does one report output power?

A possible solution is to add metrics to report both input and output power:

| Metric Name | Description |
| --- | --- |
| hw.power | The power drawn by the component (but not necessarily fully consumed by the component) |
| hw.power_out | The output power delivered by the component |
| hw.energy | The energy drawn by the component (but not necessarily fully consumed by the component) |
| hw.energy_out | The output energy delivered by the component (energy delivered externally) |

hw.power could potentially be renamed to hw.power_in.

For example:

  1. A network device that supports power over Ethernet. The device may consume 500W and some of that power is transferred over Ethernet to connected devices, which themselves may report their own power utilization. In this case, the switch is the host resource and reports power usage. An appliance connected to the switch may be a separate host resource that also reports power usage.
  2. A PSU draws 148W in and its output power is 122W. The PSU provides power to an attached component.
    1. The PSU reports hw.power = 148W and hw.power_out = 122W.
    2. The attached component reports hw.power = 121W.
    3. SUM(hw.power) - SUM(hw.power_out) across all components indicates how much power is drawn without double counting. Here that is (148 + 121) - 122 = 147W.
    4. A limitation of this approach is that it would not be able to account for loss over the power medium (e.g. a wireless charge incurs significant power loss as heat).
  3. A smart PDU can report input/output power, energy, voltage, and current. The PDU itself consumes very little energy; most of the power is transferred to the connected devices. Suppose the PDU has 10 connected devices, each consuming 500 Watts. The PDU itself may consume 20 Watts, so overall the PDU "consumes" 5,020 Watts. How should the PDU report its power? If the PDU reports hw.host.power as 5,020 Watts, and each connected device reports hw.host.power as 500 Watts, then the power is double counted in aggregate.
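The arithmetic in examples 2 and 3 can be checked with a short sketch. Note that hw.power_out is only the metric proposed in this issue, the `Reading` type and `net_power_drawn` helper are hypothetical names, and the PDU output of 5,000 W is an assumption derived from ten devices at 500 W each.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """One component's power readings in watts (hypothetical structure)."""
    component: str
    power_in: float         # proposed hw.power (possibly renamed hw.power_in)
    power_out: float = 0.0  # proposed hw.power_out; 0 for pure consumers


def net_power_drawn(readings):
    """Return SUM(hw.power) - SUM(hw.power_out) across all components."""
    total_in = sum(r.power_in for r in readings)
    total_out = sum(r.power_out for r in readings)
    return total_in - total_out


# Example 2: the PSU draws 148 W and delivers 122 W; the component reports 121 W.
psu_case = [
    Reading("psu", power_in=148.0, power_out=122.0),
    Reading("attached_component", power_in=121.0),
]

# Example 3: the PDU draws 5,020 W and (assumed) passes 5,000 W to ten devices.
pdu_case = [Reading("pdu", power_in=5020.0, power_out=5000.0)]
pdu_case += [Reading(f"device_{i}", power_in=500.0) for i in range(10)]

print(net_power_drawn(psu_case))          # 147.0
print(net_power_drawn(pdu_case))          # 5020.0
print(sum(r.power_in for r in pdu_case))  # 10020.0, the double-counted total
```

The PSU case yields 147 W rather than the 148 W actually drawn, because the 1 W lost between the PSU output and the attached component is not reported by any component. That is the limitation described in item 2.4.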

Smart meters

Smart meters can report the power utilization. For example, what metrics should a house smart meter report?

Reporting both hw.host.power and hw.power metrics

Suppose a physical system has:

  1. Multiple Power Supply Units (PSUs).
  2. Sub-components that each consume power, e.g., memory, disks, CPUs, GPUs.
  3. Power usage of each of the sub-components can be measured and needs to be reported.

For example, a physical server has power supply units (PSUs), CPUs, DIMMs, disks, GPUs, PCI components, etc. Each of these consumes energy and typically has sensors that can report power utilization. The power supply units can report the total energy consumed by the host, and each sub-component can have an instrument that reports the power utilization of that component.

Questions/Issues:

  1. Should the physical system calculate the sum of power usage across all PSUs and report that sum as the hw.host.power metric for the whole system?
  2. Should the hw.power metric be used to report power usage for each of the sub-components?
  3. Since the PSUs are themselves hardware components, how should they report power utilization? Should they use the hw.power metric?
  4. I would expect hw.host.power to be greater than or equal to the sum of hw.power across sub-components. But if a PSU also reports hw.power, that may double count power usage for the sub-components, unless we can somehow distinguish between PSU input and output power.
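To make question 4 concrete, here is a minimal sketch. It assumes hw.host.power is derived as the sum of PSU input power (question 1) and that each sub-component reports its own hw.power (question 2). All component names and wattages are hypothetical and do not come from a real system.

```python
# Hypothetical readings in watts; none of these values come from a real system.
psu_input_power = {"psu_1": 210.0, "psu_2": 205.0}
component_power = {
    "cpu_0": 120.0,
    "cpu_1": 115.0,
    "dimms": 40.0,
    "disks": 35.0,
    "gpu_0": 80.0,
}

# Question 1: derive hw.host.power as the sum of the PSUs' input power.
hw_host_power = sum(psu_input_power.values())

# Question 2: each sub-component reports its own hw.power.
sum_components = sum(component_power.values())

# Question 4: host power should be at least the sum over sub-components.
# The remainder covers PSU conversion loss and unmonitored parts.
unattributed = hw_host_power - sum_components
assert unattributed >= 0

# If the PSUs also reported their input as plain hw.power, a naive sum over
# all hw.power series would count the sub-component power twice.
naive_sum = sum_components + sum(psu_input_power.values())

print(hw_host_power, sum_components, unattributed, naive_sum)
# 415.0 390.0 25.0 805.0
```

In this sketch the naive sum (805 W) is almost twice the host power (415 W), which is the double counting the question describes.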

bertysentry commented 1 year ago

@sebastien-rosset I didn't see this issue, sorry!

You're totally right, there's no need for hw.gpu.power, this one should be removed from the specification.

WRT in/out, I agree too, and we were considering adding a direction attribute instead of a separate metric: hw.power{direction="in"} for anything that consumes energy, and hw.power{direction="out"} for anything that outputs energy, like a battery, a smart plug or a UPS.

Actually, we could have that at the host level too, for UPS systems: hw.host.power{direction="out"}.

Another option was to simply say that power that is "produced" should be represented as a negative value for hw.power. This way, one could aggregate all hw.power metrics to get the overall power consumed in their data center/room, and not deal with calculations like sum(hw.power{direction="in"}) - sum(hw.power{direction="out"}). It's probably simpler in this case, but a little counter-intuitive to see negative values for a power metric... 🤷‍♂️
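The two options can be compared side by side in a small sketch. The data points, device names, and helper functions below are hypothetical; the direction attribute is the one proposed in this comment, not an existing part of the spec.

```python
# Hypothetical data points: (metric name, attributes, value in watts).
with_direction = [
    ("hw.power", {"device": "ups_1", "direction": "in"}, 1000.0),
    ("hw.power", {"device": "ups_1", "direction": "out"}, 950.0),
    ("hw.power", {"device": "server_1", "direction": "in"}, 500.0),
    ("hw.power", {"device": "server_2", "direction": "in"}, 450.0),
]


def total_with_direction(points):
    """Option 1: sum(hw.power{direction="in"}) - sum(hw.power{direction="out"})."""
    power_in = sum(v for _, attrs, v in points if attrs["direction"] == "in")
    power_out = sum(v for _, attrs, v in points if attrs["direction"] == "out")
    return power_in - power_out


def to_signed(points):
    """Option 2: the same values, with output power encoded as negative."""
    return [v if attrs["direction"] == "in" else -v for _, attrs, v in points]


def total_signed(values):
    """Option 2: a plain sum over every hw.power value."""
    return sum(values)


print(total_with_direction(with_direction))     # 1000.0
print(total_signed(to_signed(with_direction)))  # 1000.0
```

Both options give the same total. The difference is where the complexity lives: in the query for the direction attribute, or in the interpretation of negative values for the signed variant.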

@bogdandrutu This issue must be moved to the semantic conventions repository. Thank you!