Support external power source: BMC/IPMI/HMC

rootfs commented 1 year ago

Current Kepler Architecture

[x] HMC
[x] BMC

Out of band external power source support

jichenjc commented 1 year ago

Do you mean adding a new model so that we can obtain power data through those devices and feed it into the current Kepler model? So previously we only considered the machine itself, but now we also need to consider the energy of related devices?

marceloamaral commented 1 year ago

The power consumption of a platform (i.e. server) can be reported by the BMC/IPMI/HMC.

In our current implementation of Kepler, we collect the platform power consumption from the motherboard sensor (HMC), which is available in most modern servers. This sensor provides data on the power consumption of components directly attached to the motherboard, such as the CPU and memory. However, it may not include the power consumption of components like disks and GPUs.

Access to the motherboard sensor is possible via the ACPI interface within the machine or through IPMI, which reads the motherboard sensor via the BMC. We currently use the ACPI interface in Kepler, but in cases where the ACPI interface is disabled and IPMI is enabled, IPMI could be used instead.

It's important to note that the power consumption data obtained through IPMI may differ depending on whether the source is the BMC or an out-of-band management system that consolidates the power consumption of different components, including the platform, disk, and GPU.

jichenjc commented 1 year ago

OK, makes sense to me, appreciate the detailed info~

jiere commented 1 year ago

Please see the joint message from IPMI promoters here. Even though IPMI v2.0 is a 10+ year-old spec, there are various open-source projects for exporting IPMI metrics. Shall we directly support BMC-Redfish integration for OOB power monitoring? Another question is about the metrics usage: since BMC data is some kind of runtime transient power, not an aggregate, how could it be used in Kepler?

marceloamaral commented 1 year ago

IPMI or Redfish

I am ok with any direction

BMC data is some kind of runtime transient power, not an aggregate, how could it be used in Kepler

We do extrapolation (current power * elapsed time) and aggregate it in Kepler.
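
The extrapolation described above can be sketched as follows. This is a minimal illustration, not Kepler's actual code: the `EnergyAccumulator` class, its field names, and the sample readings are hypothetical, and it assumes each power reading holds constant over the interval since the previous reading.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EnergyAccumulator:
    """Turns instantaneous power readings (watts) into cumulative energy (joules)."""

    total_joules: float = 0.0
    last_timestamp_s: Optional[float] = None

    def add_sample(self, timestamp_s: float, power_watts: float) -> float:
        if self.last_timestamp_s is None:
            # First reading: nothing to extrapolate over yet.
            self.last_timestamp_s = timestamp_s
        elif timestamp_s > self.last_timestamp_s:
            # energy = current power * elapsed time since the previous reading
            elapsed_s = timestamp_s - self.last_timestamp_s
            self.total_joules += power_watts * elapsed_s
            self.last_timestamp_s = timestamp_s
        # Out-of-order or duplicate timestamps add no energy.
        return self.total_joules


acc = EnergyAccumulator()
# Hypothetical BMC readings taken every 3 seconds: (timestamp in s, watts).
for t, watts in [(0.0, 74.0), (3.0, 74.0), (6.0, 80.0)]:
    total = acc.add_sample(t, watts)
print(total)
```

The longer the BMC reporting interval, the coarser this approximation becomes, since any change in power between two readings is invisible to the accumulator.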

rootfs commented 1 year ago

Let's focus on Redfish first.

eklee15 commented 1 year ago

Some questions:

1) Are the users ok with giving BMC access to Kepler (out-of-band)?

2) Are we only assuming the BM Kepler use case?

rootfs commented 1 year ago

Some questions,

Are the users ok with giving BMC access to Kepler? (out-of-band)?

The out-of-band architecture looks more secure than giving BMC access to each node.

Are we only assuming the BM Kepler use case?

If we consider external power source as anything that powers the machine (BMC for BM or hypervisor level power source for VM), then this architecture could work for both BM and VM

rootfs commented 1 year ago

A prior study on BMC power calibration can be found here

eklee15 commented 1 year ago

1) I'm a bit confused about the BMC access. AFAIK, out-of-band access would require giving Kepler access to the BMC. Would you please clarify which architecture you are referring to? Perhaps we can deep dive into this during the community meeting.

2) Yes, if we use out-of-band measurements, we can have both node and VM/BM power measurements through Kepler, but there is a chance of double-counting the idle power. The VM-BM mapping should be carefully tracked to avoid that.

rootfs commented 1 year ago

05/09 meeting:

Whether a sidecar can access the external source, read the stats, and calculate the power.
Out-of-band has a sync issue (ns vs ms vs s, depending on the BMC config or access delays).
Daemonset-level BMC access has security and overhead issues.
Current HMC implementation: HMC provides an endpoint for access, so it can provide info to Kepler, and the energy is calculated inside Kepler.

Potential implementations:

Use a Redfish BMC exporter (BMC models vary in terms of access overhead).
Investigate direct exporter access overhead and Prometheus access scalability; access Redfish first.
Correlation between application usage and BMC metrics. Need to know what to report (accelerators/network/storage). Baseline method: based on ground truth (i.e. HW specs, but needs visibility of HW components including fan/board/CPU/GPU/DRAM).

tiwatsuka commented 1 year ago

Do you have any update on this issue?

I've just compared the power values from Kepler and Redfish. Even though the difference between them is not large, I think it would be better to close the gap if we can. Is anyone working on it?

Brief report

Environment
- Node
  - HPE ProLiant ML30 Gen10 Plus
    - iLO Restful API for HPE iLO 5 is enabled
- kepler
  - v0.5
  - minimum composition (no model server & no estimator)
Load
- incrementally add load to CPU
  - by stress-ng -c <n> (n=1..4)
Value of power
- Redfish
  - value of PowerConsumedWatts
- Kepler
  - sum of all power values ("PKG", "DRAM" and "OTHER")
Findings
- there was a time gap of around 15 seconds between Redfish and Kepler
  - Redfish was usually delayed
- power values differed by around 3 watts on average
  - the value from Kepler was higher than the one from Redfish when CPU load was low
    - the opposite was true when CPU load was high
  - the "OTHER" part of the Kepler value did not change with the load, though I guess it actually should have
Screenshot of graph
- please ignore the yellow line

marceloamaral commented 1 year ago

Thanks @tiwatsuka, very interesting work.

Which color is Kepler and Redfish? Blue and green, respectively?

Is this the Kepler node power or sum of all containers? Can you share your prometheus query? Since the Prometheus query takes the average of a time window we can expect some variations.

The OTHER part is the total power from the motherboard sensor (using ACPI API) less the RAPL power. Given that you're running a CPU-intensive application, the "OTHER" part of the power consumption should ideally be minimal and relatively constant. A disk or network-intensive workload might potentially impact the "OTHER" power consumption if the power drawn by the disk and network components is being accounted for by the motherboard sensor. However, I haven't personally tested this scenario.
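
As a rough illustration of the subtraction described above (not Kepler's implementation), the sketch below assumes that the RAPL power is the sum of the "PKG" and "DRAM" values and clamps the result at zero when the two sources disagree because of timing. The function name and the numbers are hypothetical.

```python
def other_power_watts(platform_watts: float, pkg_watts: float, dram_watts: float) -> float:
    """Estimate OTHER as platform (motherboard sensor) power minus RAPL power."""
    # Assumption: RAPL power is the package plus DRAM domains.
    rapl_watts = pkg_watts + dram_watts
    # Assumption: clamp at zero, since unsynchronized readings can briefly
    # make RAPL appear larger than the platform reading.
    return max(platform_watts - rapl_watts, 0.0)


# Hypothetical readings: 74 W at the platform sensor, 30 W PKG, 4 W DRAM.
print(other_power_watts(74.0, 30.0, 4.0))
```

With this definition, a CPU-only load moves the platform and RAPL readings together, which is why OTHER is expected to stay roughly constant.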

rootfs commented 1 year ago

Thank you @tiwatsuka! This is a very cool study. Kepler (blue) appears to match Redfish (green) most of the time, but there is some lag at major transitions. This is likely due to the difference in report intervals between BMC and RAPL. On my setup (Dell), the report interval is 1 min.

# redfishtool -r xxxx -u xxxx -p xxxx raw GET /redfish/v1/Chassis/System.Embedded.1/Power/PowerControl
{
    "@odata.context": "/redfish/v1/$metadata#Power.Power",
    "@odata.id": "/redfish/v1/Chassis/System.Embedded.1/Power#/PowerControl/0",
    "@odata.type": "#Power.v1_6_1.PowerControl",
    "MemberId": "0",
    "Name": "System Power Control",
    "PowerAllocatedWatts": 1536,
    "PowerAvailableWatts": 0,
    "PowerCapacityWatts": 1536,
    "PowerConsumedWatts": 389,
    "PowerLimit": {
        "CorrectionInMs": 0,
        "LimitException": "HardPowerOff",
        "LimitInWatts": 485
    },
    "PowerMetrics": {
        "AverageConsumedWatts": 389,
        "IntervalInMin": 1,
        "MaxConsumedWatts": 415,
        "MinConsumedWatts": 386
    },
    "PowerRequestedWatts": 1097,
    "RelatedItem": [
        {
            "@odata.id": "/redfish/v1/Chassis/System.Embedded.1"
        },
        {
            "@odata.id": "/redfish/v1/Systems/System.Embedded.1"
        }
    ],
    "RelatedItem@odata.count": 2
}

I have explored different ways to support Redfish, including the OpenAPI approach and gofish, but both appear to be overkill for our use case. I am going to support just the Power API in Kepler.

tiwatsuka commented 1 year ago

@marceloamaral The blue line is Kepler and the green is Redfish.

Here is the query. I simply copied from the dashboard of Kepler.

sum(irate(kepler_container_package_joules_total{container_namespace=~"$namespace"}[1m])) +
sum(irate(kepler_container_dram_joules_total{container_namespace=~"$namespace"}[1m])) +
sum(irate(kepler_container_other_host_components_joules_total{container_namespace=~"$namespace"}[1m]))
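
For readers unfamiliar with how this query turns joule counters into watts: `irate` takes the last two samples of a counter inside the window and divides the increase by the time between them. The sketch below is a simplified stand-in for that behavior, not Prometheus code, and the sample values are made up.

```python
from typing import List, Optional, Tuple


def irate(samples: List[Tuple[float, float]]) -> Optional[float]:
    """Per-second rate from the last two (timestamp_s, counter_value) samples."""
    if len(samples) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    if t_last <= t_prev:
        return None
    # A counter that went down is treated as having reset to zero.
    increase = v_last if v_last < v_prev else v_last - v_prev
    return increase / (t_last - t_prev)


# Hypothetical joule counter scraped every 3 seconds.
package_joules = [(0.0, 1000.0), (3.0, 1300.0), (6.0, 1525.0)]
print(irate(package_joules))  # joules per second, i.e. watts
```

Because only the last two samples matter, the result reflects the most recent scrape interval rather than an average over the whole `[1m]` window.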

AFAIK, the power from the BMC is AC power consumption and the one from RAPL is DC power consumption. When the DC power required by the CPU increases, the loss from AC-DC conversion also increases. If that is true and Kepler accounts for it, the loss should be included in the "OTHER" part, I guess.

@rootfs The interval is 20 minutes in my setup. I think this affects only the Average, Max and Min consumed watts. "PowerConsumedWatts" can be different from "AverageConsumedWatts".

    "PowerControl": [
        {
            "@odata.id": "/redfish/v1/Chassis/1/Power#PowerControl/0",
            "MemberId": "0",
            "PowerCapacityWatts": 500,
            "PowerConsumedWatts": 74,
            "PowerMetrics": {
                "AverageConsumedWatts": 39,
                "IntervalInMin": 20,
                "MaxConsumedWatts": 81,
                "MinConsumedWatts": 37
            }
        }
    ],
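
To make the distinction between the two fields concrete, here is a small sketch that pulls both out of a Redfish Power payload shaped like the fragment above. It only parses JSON that has already been fetched; the `PowerReading` type and `parse_power_controls` helper are hypothetical names, and this is not how Kepler itself reads Redfish.

```python
import json
from typing import List, NamedTuple, Optional


class PowerReading(NamedTuple):
    consumed_watts: float            # latest reading (PowerConsumedWatts)
    average_watts: Optional[float]   # mean over the interval (AverageConsumedWatts)
    interval_min: Optional[int]      # averaging window (IntervalInMin)


def parse_power_controls(power_resource: dict) -> List[PowerReading]:
    """Extract one reading per PowerControl member, skipping members without power data."""
    readings = []
    for control in power_resource.get("PowerControl", []):
        consumed = control.get("PowerConsumedWatts")
        if consumed is None:
            continue
        metrics = control.get("PowerMetrics", {})
        readings.append(
            PowerReading(
                float(consumed),
                metrics.get("AverageConsumedWatts"),
                metrics.get("IntervalInMin"),
            )
        )
    return readings


# Payload trimmed from the fragment above.
payload = json.loads("""
{
  "PowerControl": [
    {
      "MemberId": "0",
      "PowerCapacityWatts": 500,
      "PowerConsumedWatts": 74,
      "PowerMetrics": {
        "AverageConsumedWatts": 39,
        "IntervalInMin": 20,
        "MaxConsumedWatts": 81,
        "MinConsumedWatts": 37
      }
    }
  ]
}
""")
readings = parse_power_controls(payload)
print(readings)
```

Here the latest reading (74 W) is well above the 20-minute average (39 W), which is the difference the comment above points out.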

In my observation, the power from the BMC usually lags by several seconds (even when I use ipmi-tool). However, I haven't verified this on much hardware, nor have I found a specification about it. The lag might lead to wrong estimates when the load on a node changes frequently.

rootfs commented 1 year ago

@tiwatsuka thanks for the info. We are working on the BMC support. It is still early, but would you help review and test it in your environment? I don't have any HPE servers yet.

https://github.com/sustainable-computing-io/kepler/pull/734

rootfs commented 1 year ago

BMC support is finished.

sustainable-computing-io / kepler

Support external power source: BMC/IPMI/HMC #644