sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption from these stats, and exports the results as Prometheus metrics.
https://sustainable-computing.io
Apache License 2.0

Incorrect measurements on ARM (Ampere Altra Max) #1074

Open simonarys opened 11 months ago

simonarys commented 11 months ago

What happened?

We are interested in running Kepler on an ARM Ampere Altra Max machine (bare metal). We managed to build the Kepler image successfully from the Dockerfiles available in the build/ folder on the main GitHub branch (hash: 88c82f384f10ba4deb39675b2c88450bc28ee7b8). We then ran the image on a Kubernetes cluster, on both an x86 machine and the ARM machine. On the ARM machine, however, we observed an anomaly in the Grafana dashboard: it reports unexpectedly low energy consumption metrics, the "system" namespace shows unrealistically high power consumption (more than 1 million W), and the DRAM energy measurements are always 0. See the pictures below.

Picture of very low energy consumption metrics with DRAM at 0 (namespace kepler)

Picture of really high energy consumption with DRAM at 0 (namespace system)
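For reference, the per-namespace power panels in the dashboard reduce to a Prometheus query of roughly this shape (a sketch, assuming Kepler's default metric and label names, `kepler_container_joules_total` and `container_namespace`):

```promql
# Approximate per-namespace power in watts: rate of the energy counter (joules/s)
sum by (container_namespace) (rate(kepler_container_joules_total[1m]))
```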

We would appreciate any insights or guidance on potential ARM-specific optimizations or configurations that might be necessary to ensure accurate energy consumption measurements.

To aid in troubleshooting, we attached logs and configuration details. Please let us know if further information is needed.

What did you expect to happen?

We expected results similar to those obtained when running Kepler on an x86 Intel machine, since we followed the same steps on both architectures to build and deploy Kepler. On the x86 Intel machine we obtained plausible results, not far from our PDU's power outlet metrics.

How can we reproduce it (as minimally and precisely as possible)?

We had to change a few lines in the Dockerfiles to target the ARM architecture instead of x86, because only Dockerfile.bcc.base has an ARM version available in the GitHub repo.

We built the following images using the Dockerfiles from the /build folder, in this order:

1. bcc.base
2. bcc.builder
3. kernel-source-images
4. bcc.kepler
5. manifest

For bcc.base, we built the Dockerfile with the arm64 extension that is already in the GitHub repository.

For bcc.builder, we replaced the FROM to use the bcc.base image we had just built, and replaced amd64 with arm64 on line 10:

```dockerfile
RUN curl -LO https://go.dev/dl/go1.18.10.linux-arm64.tar.gz; mkdir -p /usr/local; tar -C /usr/local -xvzf go1.18.10.linux-arm64.tar.gz; rm -f go1.18.10.linux-arm64.tar.gz
```

For kernel-source-images, we replaced the whole file with the following and did not use the build-kernel-source-images.sh script:

```dockerfile
FROM registry.access.redhat.com/ubi8/ubi

ARG ARCH=aarch64

RUN yum install -y http://mirror.centos.org/centos/8-stream/BaseOS/aarch64/os/Packages/centos-gpg-keys-8-6.el8.noarch.rpm && \
    yum install -y http://mirror.centos.org/centos/8-stream/BaseOS/aarch64/os/Packages/centos-stream-repos-8-6.el8.noarch.rpm

RUN yum install -y kernel-devel
```

For bcc.kepler, we changed the FROM instructions on lines 1 and 25 to use our previously built images (builder, then base) and moved the file to the root of the repository before building it with docker.

For the manifest, we first built it using:

```console
make build-manifest OPTS="CI_DEPLOY PROMETHEUS_DEPLOY"
```

Then, we replaced the image source at line 152 in _output/generated_manifest/deployment.yaml with our Kepler image built in the previous step and uploaded to DockerHub.

Lastly, we deployed the manifest to our empty Kubernetes (Kind) cluster.
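Putting the steps above together, the whole flow looks roughly like this (a sketch only: the image tags, the `<user>` DockerHub repository, and the exact Dockerfile names are placeholders, and each Dockerfile still needs the edits described above):

```shell
# 1-4) build the images in dependency order
docker build -t bcc-base:arm64      -f build/Dockerfile.bcc.base.arm64 .
docker build -t bcc-builder:arm64   -f build/Dockerfile.bcc.builder .
docker build -t kernel-source:arm64 -f build/Dockerfile.kernel-source-images .
docker build -t <user>/kepler:latest-bcc-arm64 -f Dockerfile.bcc.kepler .  # file moved to repo root
docker push  <user>/kepler:latest-bcc-arm64

# 5) generate the manifests, point line 152 at the pushed image, and deploy to Kind
make build-manifest OPTS="CI_DEPLOY PROMETHEUS_DEPLOY"
sed -i '152s|image: .*|image: docker.io/<user>/kepler:latest-bcc-arm64|' \
  _output/generated_manifest/deployment.yaml
kind create cluster
kubectl apply -f _output/generated_manifest/deployment.yaml
```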

Anything else we need to know?

We are using a Kind cluster.

Kepler pod logs:

```
I1116 09:52:00.944197 4588 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I1116 09:52:01.039707 4588 qat.go:35] Failed to init qat-telemtry err: could not get qat status exit status 127
I1116 09:52:01.055616 4588 exporter.go:157] Kepler running on version: bb2b1bb-dirty
I1116 09:52:01.055722 4588 config.go:274] using gCgroup ID in the BPF program: true
I1116 09:52:01.055743 4588 config.go:276] kernel version: 5.15
I1116 09:52:01.055779 4588 exporter.go:169] LibbpfBuilt: false, BccBuilt: true
I1116 09:52:01.055917 4588 config.go:207] kernel source dir is set to /usr/share/kepler/kernel_sources
I1116 09:52:01.055961 4588 exporter.go:188] EnabledBPFBatchDelete: true
I1116 09:52:01.055985 4588 rapl_msr_util.go:129] failed to open path /dev/cpu/0/msr: no such file or directory
I1116 09:52:01.056179 4588 apm_xgene_sysfs.go:57] Found power input file: /sys/class/hwmon/hwmon3/power1_input
I1116 09:52:01.056277 4588 power.go:66] use Ampere Xgene sysfs to obtain power
I1116 09:52:01.056308 4588 redfish.go:173] failed to initialize node credential: no supported node credential implementation
I1116 09:52:01.064097 4588 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I1116 09:52:01.172250 4588 exporter.go:203] Initializing the GPU collector
I1116 09:52:07.175452 4588 watcher.go:66] Using in cluster k8s config
I1116 09:52:07.276265 4588 watcher.go:134] k8s APIserver watcher was started
cannot attach kprobe, probe entry may not exist
I1116 09:52:08.550216 4588 bcc_attacher.go:94] attaching kprobe to finish_task_switch failed, trying finish_task_switch.isra.0 instead
W1116 09:52:08.567229 4588 bcc_attacher.go:113] failed to load kprobe__set_page_dirty: Module: unable to find kprobe__set_page_dirty
ioctl(PERF_EVENT_IOC_SET_BPF): Bad file descriptor
ioctl(PERF_EVENT_IOC_SET_BPF): Bad file descriptor
W1116 09:52:08.758847 4588 bcc_attacher.go:119] failed to attach kprobe/set_page_dirty or mark_buffer_dirty: failed to attach BPF kprobe: bad file descriptor. Kepler will not collect page cache write events. This will affect the DRAM power model estimation on VMs.
W1116 09:52:08.758962 4588 bcc_attacher.go:125] failed to load kprobe__mark_page_accessed: Module: unable to find kprobe__mark_page_accessed
ioctl(PERF_EVENT_IOC_SET_BPF): Bad file descriptor
W1116 09:52:08.858818 4588 bcc_attacher.go:129] failed to attach kprobe/mark_page_accessed: failed to attach BPF kprobe: bad file descriptor. Kepler will not collect page cache read events. This will affect the DRAM power model estimation on VMs.
perf_event_open: No such file or directory
W1116 09:52:08.919712 4588 bcc_attacher.go:142] could not attach perf event cpu_ref_cycles_hc_reader: failed to open bpf perf event: no such file or directory. Are you using a VM?
I1116 09:52:08.946937 4588 bcc_attacher.go:150] Successfully load eBPF module from using bcc
I1116 09:52:08.946964 4588 bcc_attacher.go:208] Successfully load eBPF module from bcc with option: [-DMAP_SIZE=10240 -DNUM_CPUS=128 -DSAMPLE_RATE=0 -DSET_GROUP_ID]
I1116 09:52:08.947046 4588 container_energy.go:114] Using the Ratio/DynPower Power Model to estimate Container Platform Power
I1116 09:52:08.947058 4588 container_energy.go:115] Container feature names: [bpf_cpu_time_us]
I1116 09:52:08.947078 4588 container_energy.go:124] Using the Ratio/DynPower Power Model to estimate Container Component Power
I1116 09:52:08.947089 4588 container_energy.go:125] Container feature names: [bpf_cpu_time_us bpf_cpu_time_us bpf_cpu_time_us gpu_sm_util]
I1116 09:52:08.947111 4588 process_power.go:113] Using the Ratio/DynPower Power Model to estimate Process Platform Power
I1116 09:52:08.947121 4588 process_power.go:114] Container feature names: [bpf_cpu_time_us]
I1116 09:52:08.947136 4588 process_power.go:123] Using the Ratio/DynPower Power Model to estimate Process Component Power
I1116 09:52:08.947147 4588 process_power.go:124] Container feature names: [bpf_cpu_time_us bpf_cpu_time_us bpf_cpu_time_us gpu_sm_util]
I1116 09:52:08.947426 4588 node_platform_energy.go:53] Using the LinearRegressor/AbsPower Power Model to estimate Node Platform Power
I1116 09:52:08.947695 4588 apm_xgene_sysfs.go:57] Found power input file: /sys/class/hwmon/hwmon3/power1_input
I1116 09:52:08.947932 4588 apm_xgene_sysfs.go:57] Found power input file: /sys/class/hwmon/hwmon3/power1_input
I1116 09:52:08.948172 4588 apm_xgene_sysfs.go:57] Found power input file: /sys/class/hwmon/hwmon3/power1_input
I1116 09:52:08.948325 4588 exporter.go:267] Started Kepler in 7.89294259s
I1116 09:52:16.433631 4588 apm_xgene_sysfs.go:57] Found power input file: /sys/class/hwmon/hwmon3/power1_input
I1116 09:52:16.434094 4588 apm_xgene_sysfs.go:57] Found power input file: /sys/class/hwmon/hwmon3/power1_input
I1116 09:52:23.974378 4588 apm_xgene_sysfs.go:57] Found power input file: /sys/class/hwmon/hwmon3/power1_input
```

Kepler image tag

latest-bcc, built on ARM by ourselves

Kubernetes version

```console
$ kubectl version
Client Version: v1.28.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
```

Cloud provider or bare metal

**Bare metal**: Ampere Altra Max

```console
Architecture:           aarch64
CPU op-mode(s):         32-bit, 64-bit
Byte Order:             Little Endian
CPU(s):                 128
On-line CPU(s) list:    0-127
Vendor ID:              ARM
Model name:             Neoverse-N1
Model:                  1
Thread(s) per core:     1
Core(s) per socket:     128
Socket(s):              1
Stepping:               r3p1
Frequency boost:        disabled
CPU max MHz:            3000.0000
CPU min MHz:            1000.0000
BogoMIPS:               50.00
Flags:                  fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
Caches (sum of all):
  L1d:                  8 MiB (128 instances)
  L1i:                  8 MiB (128 instances)
  L2:                   128 MiB (128 instances)
```

OS version

```console
# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

$ uname -a
Linux calcul9 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:23:16 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
```

Install tools

Kepler deployment config

Container runtime (CRI) and version (if applicable)

containerd://1.7.1

Related plugins (CNI, CSI, ...) and versions (if applicable)

jiere commented 11 months ago

We hit a similar issue on a new Intel platform; when we switched to the libbpf-based Kepler image, the issue was gone. Please give it a try: the latest Kepler image is now built with libbpf by default.
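If you deployed from the generated manifests, the image can be swapped in place; a sketch, assuming the default namespace and daemonset names from the manifest and an upstream libbpf tag:

```shell
kubectl -n kepler set image daemonset/kepler-exporter \
  kepler-exporter=quay.io/sustainable_computing_io/kepler:latest-libbpf
```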

rootfs commented 11 months ago

@simonarys please check if the libbpf image fixes this issue. For DRAM power, the current hwmon used by kepler doesn't support DRAM power reporting (https://docs.kernel.org/hwmon/xgene-hwmon.html). We need to support a much newer hwmon (https://docs.kernel.org/hwmon/smpro-hwmon.html) to get DRAM power. But I don't have an Ampere setup right now.
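In the meantime, for cross-checking against the PDU: the xgene hwmon file Kepler picked up (per the logs above) reports node power through power1_input, which the standard hwmon sysfs ABI defines in microwatts, so a manual reading can be converted by hand:

```shell
# Convert a hwmon power*_input reading (microwatts per the hwmon sysfs ABI) to watts
uw_to_watts() { echo $(( $1 / 1000000 )); }

uw_to_watts 125000000    # sample raw value; prints 125
# Live reading, using the path from the logs above:
# uw_to_watts "$(cat /sys/class/hwmon/hwmon3/power1_input)"
```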

rootfs commented 11 months ago

@simonarys btw, if you build the libbpf image for arm64: the latest Kepler build and base images from @vimalk78 are based on ubi9 and support multiarch, which will make producing the ARM image much easier.
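With multiarch base images, a cross build can be sketched with buildx (the tag, `<user>` repository, and Dockerfile path below are placeholders):

```shell
docker buildx build --platform linux/arm64 \
  -t <user>/kepler:libbpf-arm64 -f build/Dockerfile --push .
```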

simonarys commented 11 months ago

@rootfs Thank you for your response. Unfortunately, we weren't able to build Kepler using the base image from @vimalk78, either on x86 or on ARM.

We built Dockerfile.base successfully on x86. For ARM, because cpuid is not available there, we simply had to replace the line:

```dockerfile
RUN yum install -y cpuid
```

with this line, found in your Dockerfile.bcc.base.arm64:

```dockerfile
RUN yum install -y python3 python3-pip && yum clean all -y && \
    pip3 install --no-cache-dir archspec
```
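For context, archspec is the portable substitute here: unlike the x86-only cpuid tool, it can detect the CPU microarchitecture on ARM. Assuming pip and network access are available, it can be checked directly:

```shell
pip3 install --no-cache-dir archspec
archspec cpu    # prints the detected microarchitecture name, e.g. neoverse_n1 on this machine
```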

Next, we built Dockerfile.libbpf.builder, which installs make, git, gcc, rpm-build, systemd, and Go.

Finally, we tried to build the Dockerfile in the build/ folder. However, it crashes during this command:

```dockerfile
RUN make build SOURCE_GIT_TAG=$SOURCE_GIT_TAG BIN_TIMESTAMP=$BIN_TIMESTAMP
```

with the following error message:

```console
[Makefile:191: _build_local] Error 2
```

We also tried building it from your image quay.io/sustainable_computing_io/kepler_builder:ubi-9-libbpf-1.2.0, but we got the exact same error. Do note that Go wasn't installed in this image, so we had to install it.

We also found out that it builds successfully when using one of your images, quay.io/sustainable_computing_io/kepler_builder:ubi-9-libbpf-1.2.0-go1.18. Consequently, do you know what steps we should take to go from the base image to this builder image, so that we can build Kepler locally from scratch?

vimalk78 commented 11 months ago

> We also tried building it from your image: quay.io/sustainable_computing_io/kepler_builder:ubi-9-libbpf-1.2.0 but we got the exact same error. Do note that Go wasn't installed on this image and we had to install it.

```console
$ podman run -it --rm quay.io/sustainable_computing_io/kepler_builder:ubi-9-libbpf-1.2.0 sh
sh-5.1# go version
go version go1.20.10 linux/amd64
```

I can see Go in the builder image.

vimalk78 commented 11 months ago

I have been able to build an aarch64 image for Kepler, but without CPUID, though I have not tested it.

simonarys commented 11 months ago

Indeed, you're right. Go is installed, and the error is the following:

```console
go: cannot find GOROOT directory: /usr/local/go
```

Thus, re-installing Go into the /usr/local/go folder fixed the error for us; sorry for the confusion.

Since Go is already installed, we instead had to change the GOROOT path from /usr/local/go to /lib/golang on line 11:

```dockerfile
ENV GOPATH=/opt/app-root GO111MODULE=off GOROOT=/lib/golang
```

The path is now found when building the Dockerfile. However, we are now facing a new issue:

```console
41.67 github.com/sustainable-computing-io/kepler/pkg/manager
42.22 command-line-arguments
44.56 # command-line-arguments
44.56 /lib/golang/pkg/tool/linux_amd64/link: running clang failed: exit status 1
44.56 clang-16: error: no such file or directory: '/usr/lib/x86_64-linux-gnu/libbpf.a'
44.56 clang-16: error: no such file or directory: '/usr/lib/x86_64-linux-gnu/libbpf.a'
44.56 clang-16: error: no such file or directory: '/usr/lib/x86_64-linux-gnu/libbpf.a'
44.56 clang-16: error: no such file or directory: '/usr/lib/x86_64-linux-gnu/libbpf.a'
44.56 clang-16: error: no such file or directory: '/usr/lib/x86_64-linux-gnu/libbpf.a'
44.56 clang-16: error: no such file or directory: '/usr/lib/x86_64-linux-gnu/libbpf.a'
44.56
44.77 make: *** [Makefile:191: _build_local] Error 1
```
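As a quick diagnostic sketch (generic commands, run inside the builder container), it can help to confirm which architecture the Go toolchain is targeting and where a static libbpf.a actually exists:

```shell
go env GOOS GOARCH GOROOT             # which platform the toolchain links for
find / -name 'libbpf.a' 2>/dev/null   # where the static libbpf archive actually lives
```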

vimalk78 commented 11 months ago

GOROOT is already defined in the image:

```console
sh-5.1# go env | grep ROOT
GOROOT="/usr/lib/golang"
```

stale[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

SamYuan1990 commented 8 months ago

@vimalk78, has this issue been fixed?