sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption based on these stats, and exports the estimates as Prometheus metrics.
Apache License 2.0

No energy usage metrics for isolated CPU cores. #1165

Open rszmigiel opened 8 months ago

rszmigiel commented 8 months ago

What happened?

I'm running a PoC with OpenShift 4.13 and Kepler 0.9.2, installed with the Kepler (Community) Operator. One of the use cases is to visualise the energy consumption of DPDK-enabled containers. These containers use isolated CPU cores on a SingleNodeOpenShift installation.

I isolated the CPUs with the following PerformanceProfile:

kind: PerformanceProfile
metadata:
  name: openshift-node-performance-profile
spec:
  additionalKernelArgs:
  - "rcupdate.rcu_normal_after_boot=0"
  - "efi=runtime"
  - "module_blacklist=irdma"
  cpu:
    isolated: "4-31,36-63,68-95,100-127"
    reserved: "0-3,32-35,64-67,96-99"
  hugepages:
    defaultHugepagesSize: 1G
    pages:
      - count: 64
        size: 1G
        node: 0
      - count: 64
        size: 1G
        node: 1
  machineConfigPoolSelector: ""
  nodeSelector: ''
  numa:
    topologyPolicy: single-numa-node
  realTimeKernel:
    enabled: false
  workloadHints:
    realTime: false
    highPowerConsumption: false
    perPodPowerManagement: false

I also have workload partitioning configured.

When I run a Pod that's configured to use isolated CPU cores, for instance:

        resources:
          requests:
            memory: "24Gi"
            cpu: "50"
            hugepages-1Gi: 24Gi
          limits:
            memory: "24Gi"
            cpu: "50"
            hugepages-1Gi: 24Gi

and then run a sample workload to put some load on these cores, for instance:

stress-ng --cpu 50 --io 2 --vm 50 --vm-bytes 1G --timeout 10m --metrics-brief

I can observe that the assigned CPU cores show high usage in the top output: image

but Kepler's power usage diagrams don't reflect that; they're very flat: image

However, if I run the same Pod on shared (non-isolated) CPU cores by removing the whole resources.requests and resources.limits sections, the Kepler graphs look much more reasonable: image

even though the workload is running on only a small portion of the non-isolated CPU cores: image

Therefore I conclude that Kepler does not show proper power usage when isolated CPU cores are being used.

What did you expect to happen?

I'd like to see energy usage for isolated and non-isolated CPU cores. This is very important for all high throughput, low latency workloads.

How can we reproduce it (as minimally and precisely as possible)?

Get OpenShift 4.13 with the Kepler Community Operator installed, and configure the node with isolated CPU cores and workload isolation. Run two pods, one using isolated CPU cores and one using shared CPU cores. Observe that energy usage metrics are collected only for the shared (non-isolated) CPU cores.

Anything else we need to know?

No response

Kepler image tag

Kubernetes version

```console
$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.4", GitCommit:"0c63f9da2694c080257111616c60005f32a5bf47", GitTreeState:"clean", BuildDate:"2023-10-20T23:17:10Z", GoVersion:"go1.20.10 X:strictfipsruntime", Compiler:"gc", Platform:"linux/arm64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.9+636f2be", GitCommit:"e782f8ba0e57d260867ea108b671c94844780ef2", GitTreeState:"clean", BuildDate:"2023-10-20T19:28:29Z", GoVersion:"go1.19.13 X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
$ oc version
Client Version: 4.14.1
Kustomize Version: v5.0.1
Server Version: 4.13.21
Kubernetes Version: v1.26.9+636f2be
```

Cloud provider or bare metal

Baremetal SingleNodeOpenShift

OS version

```console
# cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION="413.92.202310210500-0"
VERSION_ID="4.13"
VARIANT="CoreOS"
VARIANT_ID=coreos
PLATFORM_ID="platform:el9"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 413.92.202310210500-0 (Plow)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:9::coreos"
HOME_URL=""
DOCUMENTATION_URL=""
BUG_REPORT_URL=""
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.13"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.13"
OPENSHIFT_VERSION="4.13"
RHEL_VERSION="9.2"
OSTREE_VERSION="413.92.202310210500-0"
# uname -a
Linux XYZ1 5.14.0-284.36.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Oct 5 08:11:31 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
```

Install tools

Kepler deployment config

For on kubernetes:

```console
$ oc get cm kepler-exporter-cm -n openshift-kepler-operator -o yaml
apiVersion: v1
data:
  BIND_ADDRESS:
  CGROUP_METRICS: '*'
  CPU_ARCH_OVERRIDE: ""
  ENABLE_EBPF_CGROUPID: "true"
  ENABLE_GPU: "true"
  ENABLE_PROCESS_METRICS: "false"
  ENABLE_QAT: "false"
  EXPOSE_CGROUP_METRICS: "true"
  EXPOSE_HW_COUNTER_METRICS: "true"
  EXPOSE_IRQ_COUNTER_METRICS: "true"
  EXPOSE_KUBELET_METRICS: "true"
  KEPLER_LOG_LEVEL: "1"
  KEPLER_NAMESPACE: openshift-kepler-operator
  METRIC_PATH: /metrics
  MODEL_CONFIG: CONTAINER_COMPONENTS_ESTIMATOR=false
  REDFISH_PROBE_INTERVAL_IN_SECONDS: "60"
  REDFISH_SKIP_SSL_VERIFY: "true"
kind: ConfigMap
metadata:
  creationTimestamp: "2024-01-05T00:51:36Z"
  labels: exporter kepler-operator kepler kepler
  name: kepler-exporter-cm
  namespace: openshift-kepler-operator
  ownerReferences:
  - apiVersion:
    blockOwnerDeletion: true
    controller: true
    kind: Kepler
    name: kepler
    uid: 1d726dc5-4e3a-4e00-ad82-72c62728b414
  resourceVersion: "18871948"
  uid: 760f221f-3230-41d5-9bcd-3b028132bc9b
$ oc -n openshift-operators describe deployment kepler-operator-controller
Name:                   kepler-operator-controller
Namespace:              openshift-operators
CreationTimestamp:      Fri, 05 Jan 2024 01:50:23 +0100
Labels:                 olm.deployment-spec-hash=7755955f67
                        olm.owner=kepler-operator.v0.9.2
                        olm.owner.kind=ClusterServiceVersion
                        olm.owner.namespace=openshift-operators
Annotations:            1
Selector:               ,,
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:
  Annotations:      alm-examples: [ { "apiVersion": "", "kind": "Kepler", "metadata": { "labels": { "": "kepler", "": "kepler", "": "kepler-operator" }, "name": "kepler" }, "spec": { "exporter": { "deployment": { "port": 9103 } } } } ]
                    capabilities: Basic Install
                    categories: Monitoring
                    containerImage:
                    createdAt: 2023-11-01T12:15:43Z
                    description: Deploys and Manages Kepler on Kubernetes
                    manager
                    olm.operatorGroup: global-operators
                    olm.operatorNamespace: openshift-operators
                    olm.targetNamespaces:
                    {"properties":[{"type":"olm.gvk","value":{"group":"","kind":"Kepler","version":"v1alpha1"}},{"type":...
                    operator-sdk-v1.27.0
                    repository:
  Service Account:  kepler-operator-controller-manager
  Containers:
   manager:
    Image:
    Port:       8080/TCP
    Host Port:  0/TCP
    Command:
      /manager
    Args:
      --openshift
      --leader-elect
      --kepler.image=$(RELATED_IMAGE_KEPLER)
      --kepler.image.libbpf=$(RELATED_IMAGE_KEPLER_LIBBPF)
      --zap-log-level=5
    Limits:
      cpu:     500m
      memory:  128Mi
    Requests:
      cpu:     10m
      memory:  64Mi
    Liveness:   http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:  http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
      RELATED_IMAGE_KEPLER:
      RELATED_IMAGE_KEPLER_LIBBPF:
      OPERATOR_CONDITION_NAME:  kepler-operator.v0.9.2
    Mounts:
  Volumes:
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:
NewReplicaSet:   kepler-operator-controller-5d5767d64f (1/1 replicas created)
Events:
```

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

vimalk78 commented 8 months ago

AFAIK CPU isolation removes a set of CPUs from the kernel's scheduling algorithm. Kepler attaches a probe to the kernel's sched_switch tracepoint to calculate how much CPU time and how many CPU cycles a process uses, and attributes power usage based on those values. So if a process is running on a CPU that is outside the scheduler, the probe may not run for that CPU. Kepler then may not know the process's CPU time/cycles, cannot assign any power usage to it, and may not generate metrics for it.
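
To illustrate the reasoning above, here is a minimal sketch of ratio-based attribution, assuming node energy is split in proportion to each process's observed CPU time. The function name and the numbers are hypothetical, and this is not Kepler's actual model; it only shows why a process the probe never observes ends up with no energy assigned.

```python
def attribute_energy(node_energy_joules, cpu_time_ms_by_pid):
    """Split node energy across processes in proportion to observed CPU time."""
    total = sum(cpu_time_ms_by_pid.values())
    if total == 0:
        # Nothing was observed, so nothing can be attributed.
        return {pid: 0.0 for pid in cpu_time_ms_by_pid}
    return {
        pid: node_energy_joules * cpu_time / total
        for pid, cpu_time in cpu_time_ms_by_pid.items()
    }


# Hypothetical sample: pid 200 is busy on an isolated core, but the probe
# recorded no CPU time for it, so it is attributed 0 J.
observed = {100: 750, 200: 0, 300: 250}
shares = attribute_energy(100.0, observed)
print(shares)
```

If this picture is right, the flat graphs would be a symptom of missing CPU time samples rather than of the power source itself.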

Cc: @rootfs @marceloamaral

rszmigiel commented 8 months ago

In that case, could we obtain power usage metrics in some alternative way, even if they're not as detailed as with eBPF? For instance, to work around the issue described here I used the output of the ipmitool sdr command. It provides summarised power usage across all CPUs and memory installed in the system. Still, better this than nothing ;-)
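
As a rough sketch of that workaround: `ipmitool sdr` prints one sensor per line as `name | reading | status`, so the Watts readings can be pulled out with a few lines of Python. The sensor names in `SAMPLE` are made up for illustration (real names differ per vendor), and `parse_watts` is a hypothetical helper, not part of Kepler or ipmitool.

```python
import re


def parse_watts(sdr_output):
    """Return {sensor_name: watts} for every sensor whose reading is in Watts."""
    readings = {}
    for line in sdr_output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 2:
            continue
        match = re.fullmatch(r"(\d+(?:\.\d+)?) Watts", fields[1])
        if match:
            readings[fields[0]] = float(match.group(1))
    return readings


# Hypothetical sample output; sensor names vary between servers.
SAMPLE = """\
CPU1 Power       | 120 Watts         | ok
CPU2 Power       | 115 Watts         | ok
Inlet Temp       | 23 degrees C      | ok
MEM Power        | no reading        | ns
"""

readings = parse_watts(SAMPLE)
print(readings)
```

On a real host the same function could be fed the output of `subprocess.run(["ipmitool", "sdr"], capture_output=True, text=True).stdout`. This gives node-level numbers only, with no per-pod breakdown.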

rootfs commented 8 months ago

@rszmigiel would you please use the kepler 0.7.2 container image?

rootfs commented 8 months ago

cc @vprashar2929 @sthaha

rszmigiel commented 8 months ago

I've used kepler-operator-bundle:0.10.0 and it works!


Thank you!

rootfs commented 8 months ago

great news! thanks for the update @rszmigiel

vimalk78 commented 8 months ago

@rootfs I am really curious to know why it worked with libbpf but not with bcc. That's the only difference between the two Kepler versions; the approach to calculating the CPU cycles is the same in both.

iconeb commented 2 weeks ago

It seems it's still happening with the latest available version (left side of the graph), compared to 0.7.2 (reinstalled, on the right side).


sthaha commented 2 weeks ago

Reopening issue to continue investigation.

vimalk78 commented 2 weeks ago

I tried to reproduce this scenario. On a machine with 20 cores, I isolated 2 cores and ran stress-ng on these isolated cores. Kepler is able to get energy usage for these processes.

Screencast from 2024-08-29 19-11-31.webm

Screenshot from 2024-08-29 19-14-25

vimalk78 commented 2 weeks ago

Since the cores are isolated, any task started without CPU pinning will not be scheduled on the isolated cores. In this case, CPUs 2 and 12 will not be loaded.
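
For reference, a minimal sketch of what explicit pinning looks like from inside a process, using Python's Linux-only `os.sched_setaffinity`. To stay runnable on any Linux box it pins to a CPU from the currently allowed set; on the machine described above one would pass the isolated CPUs (for example `{2, 12}`) instead, which is an assumption about that setup.

```python
import os

# CPUs the current process is allowed to run on (Linux-only API).
allowed = os.sched_getaffinity(0)

# Pin this process to a single CPU from the allowed set. On a host with
# isolated cores, the isolated CPU ids would be passed here instead
# (assumption: e.g. {2, 12} on the test machine above).
target = {min(allowed)}
os.sched_setaffinity(0, target)

pinned = os.sched_getaffinity(0)
print(f"allowed before: {sorted(allowed)}, pinned to: {sorted(pinned)}")
```

This is what `taskset -c` does for a command started from the shell.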

Screencast from 2024-08-29 19-23-31.webm

vimalk78 commented 1 week ago

Cc: @iconeb PTAL

iconeb commented 1 week ago

I confirm we have a performance profile with reserved and isolated cpus

# oc get performanceprofile upf-performance-profile -o json | jq -r .spec.cpu
{
  "isolated": "2-31,34-63,66-95,98-127",
  "reserved": "0-1,32-33,64-65,96-97"
}

They are correctly applied when the worker node boots

# oc debug node/ -- cat /proc/cmdline
[...] intel_iommu=on iommu=pt systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller=1 skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 nohz=on rcu_nocbs=2-31,34-63,66-95,98-127 tuned.non_isolcpus=00000003,00000003,00000003,00000003 systemd.cpu_affinity=0,1,32,64,33,65,96,97 intel_iommu=on iommu=pt isolcpus=managed_irq,2-31,34-63,66-95,98-127 nohz_full=2-31,34-63,66-95,98-127 nosoftlockup nmi_watchdog=0 mce=off rcutree.kthread_prio=11 default_hugepagesz=1G hugepagesz=1G hugepages=200 idle=poll rcu_nocb_poll tsc=perfect selinux=0 enforcing=0 noswap clock=pit audit=0 processor.max_cstate=1 intel_idle.max_cstate=0 rcupdate.rcu_normal_after_boot=0 softlockup_panic=0 console=ttyS0,115200n8 pcie_aspm=off pci=noaer firmware_class.path=/var/lib/firmware intel_pstate=disable

Pod is running with requests and limits

$ oc get pod upf1 -o json | jq .spec.containers[0].resources
{
  "limits": {
    "cpu": "18",
    "hugepages-1Gi": "40Gi",
    "memory": "30Gi",
    "": "3"
  },
  "requests": {
    "cpu": "18",
    "hugepages-1Gi": "40Gi",
    "memory": "30Gi",
    "": "3"
  }
}

And on the worker node taskset affinity is assigned as expected

taskset -pc 656722      
pid 656722's current affinity list: 3-7,22-25,67-71,86-89
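
A quick way to cross-check these values is to expand the CPU lists and compare them as sets. The sketch below uses the `isolcpus` value from the kernel command line and the affinity list reported by taskset; `parse_cpu_list` is a hypothetical helper written for this check.

```python
def parse_cpu_list(spec):
    """Expand a kernel-style CPU list such as "0-3,8" into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part or not part[0].isdigit():
            continue  # skip flags such as "managed_irq"
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


isolated = parse_cpu_list("managed_irq,2-31,34-63,66-95,98-127")
affinity = parse_cpu_list("3-7,22-25,67-71,86-89")

# Number of pinned CPUs, and whether all of them are isolated.
print(len(affinity), affinity <= isolated)
```

The affinity list expands to 18 CPUs, matching the pod's `cpu: "18"` request, and all of them fall inside the isolated set.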

The strange thing is that the previous graph was created by running the same pod(s) in the same environment, changing only Kepler's version in between.

I will try another round of tests to provide (if possible) further evidence.

rootfs commented 1 day ago

I have tested Kepler on RHEL that started with isolated CPUs. The isolated CPUs were assigned to a VM. Kepler can capture the VM and report metrics. We have added this configuration in our CI.