sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption based on these stats, and exports them as Prometheus metrics.
https://sustainable-computing.io
Apache License 2.0

Find EC2 instance(s) to deploy Kepler on EKS #707

Closed nikimanoledaki closed 1 year ago

nikimanoledaki commented 1 year ago

Is your feature request related to a problem? Please describe.

Currently there is no documented way to deploy Kepler on Amazon EKS.

This can be challenging given that there are specific requirements for Kepler to work with eBPF and access RAPL.

Describe the solution you'd like

Find and list EC2 instances that:

Then, it would be great to open a PR to the website to document these findings.

Additional context

Most likely the instances will have to be bare metal instances to have direct access to hardware. The tradeoff is that these are usually more expensive, but we will work with this to unblock EKS support.

This blog post documents how to access RAPL with the following baremetal instances: c5, m5, r5, m5zn, z1d, i3, c5n

Cross-check that these instance types are supported by EKS / https://docs.aws.amazon.com/eks/latest/userguide/choosing-instance-type.html
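As a hedged starting point for building that list, the AWS CLI can enumerate the bare-metal instance types available in a region (the bare-metal filter and JMESPath query below are assumptions to verify against the EKS doc above):

# list instance types that run directly on bare metal (no hypervisor)
aws ec2 describe-instance-types \
  --filters Name=bare-metal,Values=true \
  --query 'InstanceTypes[].InstanceType' \
  --output text | tr '\t' '\n' | sort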

nikimanoledaki commented 1 year ago

@AntonioDiTuri mentioned he would be interested in working on this :)

AntonioDiTuri commented 1 year ago

Yes, correct! I would be happy to experiment in this direction. I hope to share more soon.

SamYuan1990 commented 1 year ago

I am not sure, but in the future could we have some EC2 instances added to our CI, so that it becomes a checkpoint before we cut a release tag? Also, a question about EKS and EC2 instances: if we have an EKS cluster, will we have access to the EC2 instances (the host VMs the k8s cluster runs on)?

AntonioDiTuri commented 1 year ago

I am not sure if this answers all your questions, but with EKS you do have access to the EC2 instances: more in the docs

SamYuan1990 commented 1 year ago

I am not sure if this answers all your questions, but with EKS you do have access to the EC2 instances: more in the docs

+@rootfs I am not sure, but in the future could we have some EC2 instances with a GitHub runner agent installed and the remaining setup done, so that we can integrate the Kepler CI with AWS EC2/EKS as a step toward extending Kepler's platform support to other cloud service providers?

@AntonioDiTuri, please go ahead with the current work; I am just wondering and brainstorming.

jichenjc commented 1 year ago

Hence we can integrate kepler CI with AWS EC2/EKS

I doubt whether we can, since the current CI also seems to run on VMs, but we haven't done much checking on that (yet).

This can be challenging given that there are specific requirements for Kepler to work with eBPF and access RAPL.

I think we do have other ways like MSR etc other than RAPL ..

nikimanoledaki commented 1 year ago

I think we do have other ways like MSR etc other than RAPL ..

Interesting, thanks @jichenjc. I am not familiar with MSR. Is Kepler integrated with MSR currently?

jichenjc commented 1 year ago

https://github.com/sustainable-computing-io/kepler/blob/main/pkg/power/components/source/rapl_msr.go is the MSR code, but I hadn't noticed that it is actually also part of RAPL; maybe I need to do more homework :(

nikimanoledaki commented 1 year ago

Thank you for the link! It's homework for me too. 😊 📚

marceloamaral commented 1 year ago

A model-specific register (MSR) is any of various control registers in the x86 instruction set used for debugging, program execution tracing, computer performance monitoring, and toggling certain CPU features. [Wikipedia]

Intel RAPL estimates/saves the CPU/DRAM power using MSRs and performance counters. How it does that is fairly opaque, but several works have evaluated its accuracy against external meters. RAPL also exposes the readings through sysfs (/sys/class/powercap). So we can read RAPL power metrics either from the MSRs or from sysfs.
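For a quick manual check on a node, both interfaces can be probed from a shell. A minimal sketch, assuming an Intel CPU with the intel_rapl and msr modules loaded and msr-tools installed (0x611 is assumed to be Intel's MSR_PKG_ENERGY_STATUS register):

# RAPL via sysfs (powercap): cumulative package-0 energy in microjoules
cat /sys/class/powercap/intel-rapl:0/name
cat /sys/class/powercap/intel-rapl:0/energy_uj

# RAPL via MSR: raw package energy status register on CPU 0
sudo modprobe msr
sudo rdmsr -p 0 0x611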

AWS EC2 instances are VMs and may or may not have the MSRs exposed. https://medium.com/teads-engineering/estimating-aws-ec2-instances-power-consumption-c9745e347959

Does anyone have access to an AWS EC2 instance to test whether we can read RAPL metrics?

The other approach to reading power metrics on bare-metal nodes is to read the power reported by the motherboard sensor. Kepler reads it via the ACPI interface. We should verify this as well, but I think VMs do not have it exposed.
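One hedged way to check whether a node exposes an ACPI power meter at all is to load the acpi_power_meter driver and look for the hwmon reading it creates (a sketch; the exact attribute location varies by kernel):

# load the ACPI power meter driver and look for its hwmon entry
sudo modprobe acpi_power_meter
cat /sys/class/hwmon/hwmon*/name 2>/dev/null
cat /sys/class/hwmon/hwmon*/device/power1_average 2>/dev/null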

AntonioDiTuri commented 1 year ago

I can get access to EC2 instances through our company account. I am trying to get approval from the managers and find the best way to run the tests so that other people from the company can help. I will keep you posted as soon as I manage to run some tests.

AntonioDiTuri commented 1 year ago

I have done my first test, but I would like some feedback. I thought it would be easier to install a lightweight k8s distribution like k3d and then try to install Kepler on top of it.

I did this (installed the latest k3d, Docker, and kubectl) on a t2.medium instance, but I get a strange volume mount error:

MountVolume.SetUp failed for volume "usr-src" : hostPath type check failed: /usr/src is not a directory

I checked the file system and the directory is there, so I am not sure what the problem is.

To reproduce:

# install docker, prerequisite for k3d
sudo yum install docker
# avoid using sudo
sudo usermod -a -G docker ec2-user
id ec2-user
# Reload a Linux user's group assignments to docker w/o logout
newgrp docker
# start service
sudo systemctl start docker.service

# install k3d
curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash

# create cluster with k3d
k3d cluster create mycluster

# install helm 
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh

# install kubectl 
curl -LO https://dl.k8s.io/release/v1.27.3/bin/linux/amd64/kubectl
chmod +x kubectl
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

# install kepler via its helm chart
helm repo add kepler https://sustainable-computing-io.github.io/kepler-helm-chart
sudo helm install kepler kepler/kepler --namespace kepler --create-namespace

Do you think I am mixing too many layers? Would it be better to try EKS directly and stay on the AWS-managed side? What do you think?

rootfs commented 1 year ago

@AntonioDiTuri /usr/src usually exists on a Linux host and holds the kernel sources. In your setup this directory was not created for you. Can you create it manually with mkdir -p /usr/src?
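If that alone doesn't help: with k3d, the "host" filesystem the kubelet checks is the k3d node container, not the EC2 instance itself, so the directory has to exist inside that container. A hedged sketch, assuming k3d's default node naming for a cluster called mycluster:

# create /usr/src inside the k3d server node container
docker exec k3d-mycluster-server-0 mkdir -p /usr/src
# or recreate the cluster with the host directory mounted into the node
k3d cluster create mycluster --volume /usr/src:/usr/src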

AntonioDiTuri commented 1 year ago

Just to give an update: I reached out to the Slack community and we decided to first test, on an EC2 instance, the local configuration that is already documented in this README, just to try an already-tested scenario. @husky-parul is helping me with this. Last time we had some problems with a ClusterRole not having the right permissions to scrape the metrics; as soon as we have news we will post it. Most probably we are going to create a specific ticket in the operator repo.

rossf7 commented 1 year ago

Hi, I've been discussing this with @nikimanoledaki and @AntonioDiTuri as I've had similar issues with scaphandre.

I used eksctl to create a single node EKS cluster with a c5.metal bare metal instance. I chose it because it uses Intel CPUs and the spot price was low :)

eksctl create cluster --name rapl-test --region us-east-2 --nodes 1 --instance-types c5.metal --spot --ssh-access --node-ami-family Ubuntu2004

I installed kepler using the helm chart

helm install kepler kepler/kepler --namespace kepler --create-namespace

Initially /sys/class/powercap was empty and the estimation model was used. So I loaded the kernel module as per the scaphandre troubleshooting docs.

sudo apt-get install linux-modules-extra-$(uname -r)
sudo modprobe intel_rapl_common

RAPL then seemed to work and the logs no longer mention the estimation model.

k -n kepler logs ds/kepler
I0801 10:56:31.655025       1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0801 10:56:31.701484       1 exporter.go:155] Kepler running on version: a7a6cb1
I0801 10:56:31.701503       1 config.go:258] using gCgroup ID in the BPF program: true
I0801 10:56:31.702340       1 config.go:260] kernel version: 5.15
I0801 10:56:31.702368       1 exporter.go:179] EnabledBPFBatchDelete: true
I0801 10:56:31.703117       1 power.go:53] use sysfs to obtain power
I0801 10:56:31.703148       1 redfish.go:169] failed to get redfish credential file path
I0801 10:56:31.703158       1 power.go:55] use acpi to obtain power
I0801 10:56:31.709196       1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I0801 10:56:31.754046       1 exporter.go:198] Initializing the GPU collector
I0801 10:56:37.759666       1 watcher.go:66] Using in cluster k8s config
I0801 10:56:37.861112       1 bpf_perf.go:123] LibbpfBuilt: false, BccBuilt: true
I0801 10:56:38.506749       1 bcc_attacher.go:186] Successfully load eBPF module from bcc with option: [-DMAP_SIZE=10240 -DNUM_CPUS=96]
I0801 10:56:38.641252       1 exporter.go:251] Started Kepler in 6.939781332s

However, the container metrics all have the same value, which seems wrong. Is that a known issue?

curl -s http://localhost:9102/metrics | grep kepler_container_package_joules_total
# HELP kepler_container_package_joules_total Aggregated RAPL value in package (socket) in joules
# TYPE kepler_container_package_joules_total counter
kepler_container_package_joules_total{command="",container_id="003ce073b958b7073bfcc176cfbc9af29addaa155d5ac5d17c9004e18ed1917e",container_name="kube-state-metrics",container_namespace="default",mode="dynamic",pod_name="prometheus-kube-state-metrics-5fb6fbbf78-rh4rw"} 0
kepler_container_package_joules_total{command="",container_id="003ce073b958b7073bfcc176cfbc9af29addaa155d5ac5d17c9004e18ed1917e",container_name="kube-state-metrics",container_namespace="default",mode="idle",pod_name="prometheus-kube-state-metrics-5fb6fbbf78-rh4rw"} 4668.661
kepler_container_package_joules_total{command="",container_id="0a0593389e90ed921eb133c98a2eae7d2ad44671af1bc9f60f2617e9283af289",container_name="coredns",container_namespace="kube-system",mode="dynamic",pod_name="coredns-8fd4db68f-sbz6q"} 0
kepler_container_package_joules_total{command="",container_id="0a0593389e90ed921eb133c98a2eae7d2ad44671af1bc9f60f2617e9283af289",container_name="coredns",container_namespace="kube-system",mode="idle",pod_name="coredns-8fd4db68f-sbz6q"} 4668.661

Could it be related to this log line? "Could not find any ACPI power meter path. Is it a VM?"

I'm not familiar with ACPI so I'm not sure how to enable it.

rootfs commented 1 year ago

@rossf7 can you run

kubectl exec -ti -n kepler daemonset/kepler-exporter -- bash -c "curl localhost:9102/metrics|grep kepler_container_joules|sort -k2 -g"

rossf7 commented 1 year ago

@rootfs Yes sure, thanks

kubectl exec -ti -n kepler daemonset/kepler -- bash -c "curl -s localhost:9102/metrics|grep kepler_container_joules|sort -k2 -g"
# HELP kepler_container_joules_total Aggregated RAPL Package + Uncore + DRAM + GPU + other host components (platform - package - dram) in joules
# TYPE kepler_container_joules_total counter
kepler_container_joules_total{command="",container_id="0afc45bae3ac2ea9e1e57a9e20c85f9c05057d73ab55b951c07e1af26abb2055",container_name="aws-vpc-cni-init",container_namespace="kube-system",mode="dynamic",pod_name="aws-node-qrj7v"} 0
kepler_container_joules_total{command="",container_id="305e2a60dde44b15d3f3fc0c2cdd5a91f844fcfcf872ebd6a2ae2d1b5da3d459",container_name="aws-node",container_namespace="kube-system",mode="dynamic",pod_name="aws-node-qrj7v"} 0
kepler_container_joules_total{command="",container_id="33dc774ae114271f70a237723d6335d4e1b90dce9a37967664c98265af10dc57",container_name="coredns",container_namespace="kube-system",mode="dynamic",pod_name="coredns-8fd4db68f-6r666"} 0
kepler_container_joules_total{command="",container_id="60f6e118017aba1ac13c3eb8a19f2a8f0b880434ff1a19b833e5b046789a49c0",container_name="kepler-exporter",container_namespace="kepler",mode="dynamic",pod_name="kepler-bb5n2"} 0
kepler_container_joules_total{command="",container_id="81238f12414be885fb782c55495000516595345e7deacff1813654caa7e4b346",container_name="coredns",container_namespace="kube-system",mode="dynamic",pod_name="coredns-8fd4db68f-s6gph"} 0
kepler_container_joules_total{command="",container_id="fc9be886eaec22034b0ecaa85374b1ca39fd7f3516d9e2cd07da1111f3232f2e",container_name="kube-proxy",container_namespace="kube-system",mode="dynamic",pod_name="kube-proxy-svh8t"} 0
kepler_container_joules_total{command="",container_id="system_processes",container_name="system_processes",container_namespace="system",mode="dynamic",pod_name="system_processes"} 0
kepler_container_joules_total{command="",container_id="0afc45bae3ac2ea9e1e57a9e20c85f9c05057d73ab55b951c07e1af26abb2055",container_name="aws-vpc-cni-init",container_namespace="kube-system",mode="idle",pod_name="aws-node-qrj7v"} 60.213
kepler_container_joules_total{command="",container_id="305e2a60dde44b15d3f3fc0c2cdd5a91f844fcfcf872ebd6a2ae2d1b5da3d459",container_name="aws-node",container_namespace="kube-system",mode="idle",pod_name="aws-node-qrj7v"} 60.213
kepler_container_joules_total{command="",container_id="33dc774ae114271f70a237723d6335d4e1b90dce9a37967664c98265af10dc57",container_name="coredns",container_namespace="kube-system",mode="idle",pod_name="coredns-8fd4db68f-6r666"} 60.213
kepler_container_joules_total{command="",container_id="60f6e118017aba1ac13c3eb8a19f2a8f0b880434ff1a19b833e5b046789a49c0",container_name="kepler-exporter",container_namespace="kepler",mode="idle",pod_name="kepler-bb5n2"} 60.213
kepler_container_joules_total{command="",container_id="81238f12414be885fb782c55495000516595345e7deacff1813654caa7e4b346",container_name="coredns",container_namespace="kube-system",mode="idle",pod_name="coredns-8fd4db68f-s6gph"} 60.213
kepler_container_joules_total{command="",container_id="fc9be886eaec22034b0ecaa85374b1ca39fd7f3516d9e2cd07da1111f3232f2e",container_name="kube-proxy",container_namespace="kube-system",mode="idle",pod_name="kube-proxy-svh8t"} 60.213
kepler_container_joules_total{command="",container_id="system_processes",container_name="system_processes",container_namespace="system",mode="idle",pod_name="system_processes"} 60.213

rootfs commented 1 year ago

Thank you for sharing, @rossf7. The dynamic power is all zero; that could be due to either eBPF or cgroup stats. Can you get the following two container metrics and one node metric?

container metrics

kubectl exec -ti -n kepler daemonset/kepler -- bash -c "curl -s localhost:9102/metrics| grep kepler_container_cgroupfs |sort -k 2 -g"

and

kubectl exec -ti -n kepler daemonset/kepler -- bash -c "curl -s localhost:9102/metrics| grep kepler_container_cpu_instructions_total |sort -k 2 -g"

node metrics

kubectl exec -ti -n kepler daemonset/kepler -- bash -c "curl -s localhost:9102/metrics| grep kepler_node |grep rapl |sort -k 2 -g"

rossf7 commented 1 year ago

@rootfs I put the metrics and the logs in this pastebin: https://pastebin.com/mDU726Fu. The 2nd container metrics command didn't return any results.

nikimanoledaki commented 1 year ago

To add another datapoint - this was the result with an EKS cluster created through eksctl with the following config:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: hack-green-metrics-2
  region: us-east-2
  version: "1.25"

managedNodeGroups:
  - name: test-group
    instanceType: m5a.2xlarge
    amiFamily: "AmazonLinux2"
    desiredCapacity: 1

The instance type m5a.2xlarge is a baremetal machine. However, I wasn't able to validate RAPL access.

Installed the headers:

sudo yum install kernel-devel-`uname -r` -y

# kernel header files could be found in the following dir: 
ls -l /usr/src/kernels/`uname -r`

Kepler logs:

➜  ~ k logs kepler-gpt7f -n kepler -f
I0807 15:40:04.303799       1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
E0807 15:40:04.306277       1 utils.go:117] getCPUArch failure: no CPU power model found for architecture  AMD Zen, 14nm
I0807 15:40:04.316376       1 exporter.go:155] Kepler running on version: a7a6cb1
I0807 15:40:04.316800       1 config.go:258] using gCgroup ID in the BPF program: true
I0807 15:40:04.316886       1 config.go:260] kernel version: 5.1
I0807 15:40:04.316931       1 exporter.go:179] EnabledBPFBatchDelete: true
I0807 15:40:04.317147       1 rapl_msr_util.go:136] input/output error
I0807 15:40:04.317192       1 power.go:64] Not able to obtain power, use estimate method
I0807 15:40:04.317208       1 redfish.go:169] failed to get redfish credential file path
I0807 15:40:04.317214       1 power.go:55] use acpi to obtain power
I0807 15:40:04.317616       1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I0807 15:40:04.337075       1 exporter.go:198] Initializing the GPU collector
I0807 15:40:10.342287       1 watcher.go:66] Using in cluster k8s config
I0807 15:40:10.443009       1 bpf_perf.go:123] LibbpfBuilt: false, BccBuilt: true
I0807 15:40:11.521876       1 bcc_attacher.go:186] Successfully load eBPF module from bcc with option: [-DMAP_SIZE=10240 -DNUM_CPUS=8]
I0807 15:40:11.530807       1 exporter.go:251] Started Kepler in 7.214587669s

Kepler container metrics:

k exec -ti -n kepler daemonset/kepler -- bash -c "curl localhost:9102/metrics" | grep 'kepler_container_package_joules_total'
# HELP kepler_container_package_joules_total Aggregated RAPL value in package (socket) in joules
# TYPE kepler_container_package_joules_total counter
kepler_container_package_joules_total{command="",container_id="2a77d2f02be94287bb9102a942b725205ab0a9889a2a61877dcf895de54dd2db",container_name="aws-vpc-cni-init",container_namespace="kube-system",mode="dynamic",pod_name="aws-node-rj8j8"} 7904.472
kepler_container_package_joules_total{command="",container_id="2a77d2f02be94287bb9102a942b725205ab0a9889a2a61877dcf895de54dd2db",container_name="aws-vpc-cni-init",container_namespace="kube-system",mode="idle",pod_name="aws-node-rj8j8"} 0
kepler_container_package_joules_total{command="",container_id="system_processes",container_name="system_processes",container_namespace="system",mode="dynamic",pod_name="system_processes"} 7944.603
kepler_container_package_joules_total{command="",container_id="system_processes",container_name="system_processes",container_namespace="system",mode="idle",pod_name="system_processes"} 0
kepler_container_package_joules_total{command="aws-k8s-ag",container_id="49067fa7f092ea3c54a529d427e420f94c4f5e5374d03d027d171e809a261ddc",container_name="aws-node",container_namespace="kube-system",mode="dynamic",pod_name="aws-node-rj8j8"} 7911
kepler_container_package_joules_total{command="aws-k8s-ag",container_id="49067fa7f092ea3c54a529d427e420f94c4f5e5374d03d027d171e809a261ddc",container_name="aws-node",container_namespace="kube-system",mode="idle",pod_name="aws-node-rj8j8"} 0
kepler_container_package_joules_total{command="coredns",container_id="40edb9f8daf7eb281b77dd2a75860f9287e6a00cb9aaf4cc2c79c266d57ac008",container_name="coredns",container_namespace="kube-system",mode="dynamic",pod_name="coredns-8fd4db68f-xb9c5"} 7905.24
kepler_container_package_joules_total{command="coredns",container_id="40edb9f8daf7eb281b77dd2a75860f9287e6a00cb9aaf4cc2c79c266d57ac008",container_name="coredns",container_namespace="kube-system",mode="idle",pod_name="coredns-8fd4db68f-xb9c5"} 0
kepler_container_package_joules_total{command="coredns",container_id="90af3664a4a515b25154cd8e0ae029aa4231902a3f26ab963606d605f726d437",container_name="coredns",container_namespace="kube-system",mode="dynamic",pod_name="coredns-8fd4db68f-vchf7"} 7905.24
kepler_container_package_joules_total{command="coredns",container_id="90af3664a4a515b25154cd8e0ae029aa4231902a3f26ab963606d605f726d437",container_name="coredns",container_namespace="kube-system",mode="idle",pod_name="coredns-8fd4db68f-vchf7"} 0
kepler_container_package_joules_total{command="kepler",container_id="4fb403d409a1da40fd765ba12a0c1acb144e2848fd283ada5fe642ba6b71a758",container_name="kepler-exporter",container_namespace="kepler",mode="dynamic",pod_name="kepler-gpt7f"} 7922.171
kepler_container_package_joules_total{command="kepler",container_id="4fb403d409a1da40fd765ba12a0c1acb144e2848fd283ada5fe642ba6b71a758",container_name="kepler-exporter",container_namespace="kepler",mode="idle",pod_name="kepler-gpt7f"} 0
kepler_container_package_joules_total{command="kube-proxy",container_id="ea3399423eee315d3970bd1861535d912368d2be97ee8434878efefc5f40aac8",container_name="kube-proxy",container_namespace="kube-system",mode="dynamic",pod_name="kube-proxy-xtnbg"} 7904.832
kepler_container_package_joules_total{command="kube-proxy",container_id="ea3399423eee315d3970bd1861535d912368d2be97ee8434878efefc5f40aac8",container_name="kube-proxy",container_namespace="kube-system",mode="idle",pod_name="kube-proxy-xtnbg"} 0

The container energy data here shows values for mode="dynamic" but zeros for mode="idle" - what is the difference between the two? Thanks!

rossf7 commented 1 year ago

the dynamic power is all zero; that could be due to either eBPF or cgroup stats

@rootfs looks like a problem with the cgroup stats, as they don't contain the container ID :( Is it possible to get the container ID with eBPF?

cat /proc/1529/cgroup
0::/system.slice/snap.kubelet-eks.daemon.service

I would happily switch to Amazon Linux, as it doesn't have the cgroup problem.

However, I'm blocked on enabling RAPL because I can't load the kernel module with modprobe intel_rapl_common.

On Ubuntu I'd install it with apt-get install linux-modules-extra-$(uname -r). Does anyone know how to do this on Amazon Linux?
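In the meantime, a generic check for whether the module already ships with the running kernel (a minimal sketch; if nothing is found, a matching kernel modules package would be needed from the Amazon Linux repos):

# does the running kernel already ship intel_rapl_common?
find /lib/modules/$(uname -r) -name 'intel_rapl*'
modinfo intel_rapl_common
# if present, load it and check that the powercap interface appears
sudo modprobe intel_rapl_common
ls /sys/class/powercap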

@nikimanoledaki I think m5a.2xlarge is a VM so the estimation method is being used.

With M5ad instances, local NVMe-based SSDs are physically connected to the host server and provide block-level storage that is coupled to the lifetime of the M5a instance

There are instance types that attach to physical storage but they are still VMs AFAIK.

rossf7 commented 1 year ago

@rootfs Apologies, I checked the kubelet by mistake. These are the cgroup stats for the Kepler pod, which do contain the container ID.

cat /proc/26575/cgroup
13:rdma:/kubepods/besteffort/podfc83e60c-74c4-4f9d-b93d-c6c4acb4b1a7/c6a96f4e7acba42d7b62cda92c9067e872bceea78120f25e51c208e947e92f75
12:memory:/kubepods/besteffort/podfc83e60c-74c4-4f9d-b93d-c6c4acb4b1a7/c6a96f4e7acba42d7b62cda92c9067e872bceea78120f25e51c208e947e92f75
11:cpuset:/kubepods/besteffort/podfc83e60c-74c4-4f9d-b93d-c6c4acb4b1a7/c6a96f4e7acba42d7b62cda92c9067e872bceea78120f25e51c208e947e92f75
10:pids:/kubepods/besteffort/podfc83e60c-74c4-4f9d-b93d-c6c4acb4b1a7/c6a96f4e7acba42d7b62cda92c9067e872bceea78120f25e51c208e947e92f75
9:perf_event:/kubepods/besteffort/podfc83e60c-74c4-4f9d-b93d-c6c4acb4b1a7/c6a96f4e7acba42d7b62cda92c9067e872bceea78120f25e51c208e947e92f75
8:net_cls,net_prio:/kubepods/besteffort/podfc83e60c-74c4-4f9d-b93d-c6c4acb4b1a7/c6a96f4e7acba42d7b62cda92c9067e872bceea78120f25e51c208e947e92f75
7:cpu,cpuacct:/kubepods/besteffort/podfc83e60c-74c4-4f9d-b93d-c6c4acb4b1a7/c6a96f4e7acba42d7b62cda92c9067e872bceea78120f25e51c208e947e92f75
6:misc:/kubepods/besteffort/podfc83e60c-74c4-4f9d-b93d-c6c4acb4b1a7/c6a96f4e7acba42d7b62cda92c9067e872bceea78120f25e51c208e947e92f75
5:hugetlb:/kubepods/besteffort/podfc83e60c-74c4-4f9d-b93d-c6c4acb4b1a7/c6a96f4e7acba42d7b62cda92c9067e872bceea78120f25e51c208e947e92f75
4:blkio:/kubepods/besteffort/podfc83e60c-74c4-4f9d-b93d-c6c4acb4b1a7/c6a96f4e7acba42d7b62cda92c9067e872bceea78120f25e51c208e947e92f75
3:freezer:/kubepods/besteffort/podfc83e60c-74c4-4f9d-b93d-c6c4acb4b1a7/c6a96f4e7acba42d7b62cda92c9067e872bceea78120f25e51c208e947e92f75
2:devices:/kubepods/besteffort/podfc83e60c-74c4-4f9d-b93d-c6c4acb4b1a7/c6a96f4e7acba42d7b62cda92c9067e872bceea78120f25e51c208e947e92f75
1:name=systemd:/kubepods/besteffort/podfc83e60c-74c4-4f9d-b93d-c6c4acb4b1a7/c6a96f4e7acba42d7b62cda92c9067e872bceea78120f25e51c208e947e92f75
0::/kubepods/besteffort/podfc83e60c-74c4-4f9d-b93d-c6c4acb4b1a7/c6a96f4e7acba42d7b62cda92c9067e872bceea78120f25e51c208e947e92f75

rootfs commented 1 year ago

@rossf7 can you try the 0.5.4 manifests and use this image quay.io/sustainable_computing_io/kepler:latest-libbpf?

rossf7 commented 1 year ago

Thanks @rootfs, there are now values for dynamic power as well as idle. Although they are the same for each container?

kubectl exec -ti -n kepler daemonset/kepler-exporter -- bash -c "curl -s localhost:9102/metrics|grep kepler_container_joules|sort -k2 -g"
# HELP kepler_container_joules_total Aggregated RAPL Package + Uncore + DRAM + GPU + other host components (platform - package - dram) in joules
# TYPE kepler_container_joules_total counter
kepler_container_joules_total{command="",container_id="03a6617381d107c75c596b3d0b9b5848aeed0e60049520815f66af1ba182987d",container_name="aws-node",container_namespace="kube-system",mode="dynamic",pod_name="aws-node-fqcwm"} 3.186
kepler_container_joules_total{command="",container_id="0fe4bb488c40e7b517becaa15fae8957abde5145bfa7858eeedc8f30ed1ade8c",container_name="kube-proxy",container_namespace="kube-system",mode="dynamic",pod_name="kube-proxy-g5r2r"} 3.186
kepler_container_joules_total{command="",container_id="25c4f3ec5c9dd315dc8961935b973a3384078975e150e16104b668dfd827bc63",container_name="aws-vpc-cni-init",container_namespace="kube-system",mode="dynamic",pod_name="aws-node-fqcwm"} 3.186
kepler_container_joules_total{command="",container_id="3dae2afbdcc736670043df897b0589e23f48596f0f2e685f11b595b277b15748",container_name="coredns",container_namespace="kube-system",mode="dynamic",pod_name="coredns-8fd4db68f-k5xf8"} 3.186
kepler_container_joules_total{command="",container_id="46da5f7f01cfe69543e2031839438ece6dac8de87712ce30bad394fd03e244ea",container_name="kepler-exporter",container_namespace="kepler",mode="dynamic",pod_name="kepler-exporter-ww755"} 3.186
kepler_container_joules_total{command="",container_id="bc8ad9dd707908105c270cb2c09f95d8344dadc53480770987c44e62b8592c0b",container_name="coredns",container_namespace="kube-system",mode="dynamic",pod_name="coredns-8fd4db68f-s9kc4"} 3.186
kepler_container_joules_total{command="",container_id="system_processes",container_name="system_processes",container_namespace="system",mode="dynamic",pod_name="system_processes"} 10.512
kepler_container_joules_total{command="",container_id="03a6617381d107c75c596b3d0b9b5848aeed0e60049520815f66af1ba182987d",container_name="aws-node",container_namespace="kube-system",mode="idle",pod_name="aws-node-fqcwm"} 79.275
kepler_container_joules_total{command="",container_id="0fe4bb488c40e7b517becaa15fae8957abde5145bfa7858eeedc8f30ed1ade8c",container_name="kube-proxy",container_namespace="kube-system",mode="idle",pod_name="kube-proxy-g5r2r"} 79.275
kepler_container_joules_total{command="",container_id="25c4f3ec5c9dd315dc8961935b973a3384078975e150e16104b668dfd827bc63",container_name="aws-vpc-cni-init",container_namespace="kube-system",mode="idle",pod_name="aws-node-fqcwm"} 79.275
kepler_container_joules_total{command="",container_id="3dae2afbdcc736670043df897b0589e23f48596f0f2e685f11b595b277b15748",container_name="coredns",container_namespace="kube-system",mode="idle",pod_name="coredns-8fd4db68f-k5xf8"} 79.275
kepler_container_joules_total{command="",container_id="46da5f7f01cfe69543e2031839438ece6dac8de87712ce30bad394fd03e244ea",container_name="kepler-exporter",container_namespace="kepler",mode="idle",pod_name="kepler-exporter-ww755"} 79.275
kepler_container_joules_total{command="",container_id="bc8ad9dd707908105c270cb2c09f95d8344dadc53480770987c44e62b8592c0b",container_name="coredns",container_namespace="kube-system",mode="idle",pod_name="coredns-8fd4db68f-s9kc4"} 79.275
kepler_container_joules_total{command="",container_id="system_processes",container_name="system_processes",container_namespace="system",mode="idle",pod_name="system_processes"} 196.575

Logs https://pastebin.com/VCMGVvJT

rootfs commented 1 year ago

@marceloamaral can you take a look? thanks

rootfs commented 1 year ago

@rossf7 the EKS deployment issue looks fixed. Can you create a separate issue for this finding? @marceloamaral can you take a look? Thanks.

Thanks @rootfs, there are now values for dynamic power as well as idle. Although they are the same for each container?

AntonioDiTuri commented 1 year ago

I am not sure this issue is fully closed, right? The objective was to find a list of EC2 instances that would work with Kepler. We have only tried c5.metal, and no trials have been done on a plain VM, right?

AntonioDiTuri commented 1 year ago

I tried to reproduce the issue following what @rossf7 did here.

Basically, I tried it with a c5.metal instance on which I loaded the RAPL kernel module with these commands:

sudo apt-get install linux-modules-extra-$(uname -r)
sudo modprobe intel_rapl_common

I then installed Kepler via the Helm chart using app release 0.5.4 and the image quay.io/sustainable_computing_io/kepler:latest-libbpf, as suggested by @rootfs here.

I then ran the same command to check whether everything was working:

k -n kepler logs ds/kepler

I0901 09:28:19.685841       1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0901 09:28:19.703312       1 qat.go:35] Failed to init qat-telemtry err: could not get qat status exit status 127
I0901 09:28:19.711392       1 exporter.go:158] Kepler running on version: a258799
I0901 09:28:19.711413       1 config.go:270] using gCgroup ID in the BPF program: true
I0901 09:28:19.711448       1 config.go:272] kernel version: 5.15
I0901 09:28:19.711595       1 config.go:297] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I0901 09:28:19.711603       1 exporter.go:170] LibbpfBuilt: true, BccBuilt: false
I0901 09:28:19.711614       1 exporter.go:189] EnabledBPFBatchDelete: true
I0901 09:28:19.711658       1 power.go:54] use sysfs to obtain power
I0901 09:28:19.711669       1 redfish.go:169] failed to get redfish credential file path
I0901 09:28:19.711675       1 power.go:56] use acpi to obtain power
I0901 09:28:19.715088       1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I0901 09:28:19.743999       1 exporter.go:204] Initializing the GPU collector
I0901 09:28:25.749634       1 watcher.go:66] Using in cluster k8s config
libbpf: loading /var/lib/kepler/bpfassets/amd64_kepler.bpf.o
libbpf: elf: section(3) tracepoint/sched/sched_switch, size 2344, link 0, flags 6, type=1
libbpf: sec 'tracepoint/sched/sched_switch': found program 'kepler_trace' at insn offset 0 (0 bytes), code size 293 insns (2344 bytes)
libbpf: elf: section(4) .reltracepoint/sched/sched_switch, size 352, link 26, flags 40, type=9
libbpf: elf: section(5) tracepoint/irq/softirq_entry, size 144, link 0, flags 6, type=1
libbpf: sec 'tracepoint/irq/softirq_entry': found program 'kepler_irq_trace' at insn offset 0 (0 bytes), code size 18 insns (144 bytes)
libbpf: elf: section(6) .reltracepoint/irq/softirq_entry, size 16, link 26, flags 40, type=9
libbpf: elf: section(7) .maps, size 352, link 0, flags 3, type=1
libbpf: elf: section(8) license, size 4, link 0, flags 3, type=1
libbpf: license of /var/lib/kepler/bpfassets/amd64_kepler.bpf.o is GPL
libbpf: elf: section(17) .BTF, size 5816, link 0, flags 0, type=1
libbpf: elf: section(19) .BTF.ext, size 2040, link 0, flags 0, type=1
libbpf: elf: section(26) .symtab, size 984, link 1, flags 0, type=2
libbpf: looking for externs among 41 symbols...
libbpf: collected 0 externs total
libbpf: map 'processes': at sec_idx 7, offset 0.
libbpf: map 'processes': found type = 1.
libbpf: map 'processes': found key [6], sz = 4.
libbpf: map 'processes': found value [10], sz = 88.
libbpf: map 'processes': found max_entries = 32768.
libbpf: map 'pid_time': at sec_idx 7, offset 32.
libbpf: map 'pid_time': found type = 1.
libbpf: map 'pid_time': found key [6], sz = 4.
libbpf: map 'pid_time': found value [12], sz = 8.
libbpf: map 'pid_time': found max_entries = 32768.
libbpf: map 'cpu_cycles_hc_reader': at sec_idx 7, offset 64.
libbpf: map 'cpu_cycles_hc_reader': found type = 4.
libbpf: map 'cpu_cycles_hc_reader': found key [2], sz = 4.
libbpf: map 'cpu_cycles_hc_reader': found value [6], sz = 4.
libbpf: map 'cpu_cycles_hc_reader': found max_entries = 128.
libbpf: map 'cpu_cycles': at sec_idx 7, offset 96.
libbpf: map 'cpu_cycles': found type = 2.
libbpf: map 'cpu_cycles': found key [6], sz = 4.
libbpf: map 'cpu_cycles': found value [12], sz = 8.
libbpf: map 'cpu_cycles': found max_entries = 128.
libbpf: map 'cpu_ref_cycles_hc_reader': at sec_idx 7, offset 128.
libbpf: map 'cpu_ref_cycles_hc_reader': found type = 4.
libbpf: map 'cpu_ref_cycles_hc_reader': found key [2], sz = 4.
libbpf: map 'cpu_ref_cycles_hc_reader': found value [6], sz = 4.
libbpf: map 'cpu_ref_cycles_hc_reader': found max_entries = 128.
libbpf: map 'cpu_ref_cycles': at sec_idx 7, offset 160.
libbpf: map 'cpu_ref_cycles': found type = 2.
libbpf: map 'cpu_ref_cycles': found key [6], sz = 4.
libbpf: map 'cpu_ref_cycles': found value [12], sz = 8.
libbpf: map 'cpu_ref_cycles': found max_entries = 128.
libbpf: map 'cpu_instructions_hc_reader': at sec_idx 7, offset 192.
libbpf: map 'cpu_instructions_hc_reader': found type = 4.
libbpf: map 'cpu_instructions_hc_reader': found key [2], sz = 4.
libbpf: map 'cpu_instructions_hc_reader': found value [6], sz = 4.
libbpf: map 'cpu_instructions_hc_reader': found max_entries = 128.
libbpf: map 'cpu_instructions': at sec_idx 7, offset 224.
libbpf: map 'cpu_instructions': found type = 2.
libbpf: map 'cpu_instructions': found key [6], sz = 4.
libbpf: map 'cpu_instructions': found value [12], sz = 8.
libbpf: map 'cpu_instructions': found max_entries = 128.
libbpf: map 'cache_miss_hc_reader': at sec_idx 7, offset 256.
libbpf: map 'cache_miss_hc_reader': found type = 4.
libbpf: map 'cache_miss_hc_reader': found key [2], sz = 4.
libbpf: map 'cache_miss_hc_reader': found value [6], sz = 4.
libbpf: map 'cache_miss_hc_reader': found max_entries = 128.
libbpf: map 'cache_miss': at sec_idx 7, offset 288.
libbpf: map 'cache_miss': found type = 2.
libbpf: map 'cache_miss': found key [6], sz = 4.
libbpf: map 'cache_miss': found value [12], sz = 8.
libbpf: map 'cache_miss': found max_entries = 128.
libbpf: map 'cpu_freq_array': at sec_idx 7, offset 320.
libbpf: map 'cpu_freq_array': found type = 2.
libbpf: map 'cpu_freq_array': found key [6], sz = 4.
libbpf: map 'cpu_freq_array': found value [6], sz = 4.
libbpf: map 'cpu_freq_array': found max_entries = 128.
libbpf: sec '.reltracepoint/sched/sched_switch': collecting relocation for section(3) 'tracepoint/sched/sched_switch'
libbpf: sec '.reltracepoint/sched/sched_switch': relo #0: insn #17 against 'cpu_cycles_hc_reader'
libbpf: prog 'kepler_trace': found map 2 (cpu_cycles_hc_reader, sec 7, off 64) for insn #17
libbpf: sec '.reltracepoint/sched/sched_switch': relo #1: insn #36 against 'cpu_cycles'
libbpf: prog 'kepler_trace': found map 3 (cpu_cycles, sec 7, off 96) for insn #36
libbpf: sec '.reltracepoint/sched/sched_switch': relo #2: insn #50 against 'cpu_cycles'
libbpf: prog 'kepler_trace': found map 3 (cpu_cycles, sec 7, off 96) for insn #50
libbpf: sec '.reltracepoint/sched/sched_switch': relo #3: insn #55 against 'cpu_ref_cycles_hc_reader'
libbpf: prog 'kepler_trace': found map 4 (cpu_ref_cycles_hc_reader, sec 7, off 128) for insn #55
libbpf: sec '.reltracepoint/sched/sched_switch': relo #4: insn #68 against 'cpu_ref_cycles'
libbpf: prog 'kepler_trace': found map 5 (cpu_ref_cycles, sec 7, off 160) for insn #68
libbpf: sec '.reltracepoint/sched/sched_switch': relo #5: insn #82 against 'cpu_ref_cycles'
libbpf: prog 'kepler_trace': found map 5 (cpu_ref_cycles, sec 7, off 160) for insn #82
libbpf: sec '.reltracepoint/sched/sched_switch': relo #6: insn #87 against 'cpu_instructions_hc_reader'
libbpf: prog 'kepler_trace': found map 6 (cpu_instructions_hc_reader, sec 7, off 192) for insn #87
libbpf: sec '.reltracepoint/sched/sched_switch': relo #7: insn #104 against 'cpu_instructions'
libbpf: prog 'kepler_trace': found map 7 (cpu_instructions, sec 7, off 224) for insn #104
libbpf: sec '.reltracepoint/sched/sched_switch': relo #8: insn #117 against 'cpu_instructions'
libbpf: prog 'kepler_trace': found map 7 (cpu_instructions, sec 7, off 224) for insn #117
libbpf: sec '.reltracepoint/sched/sched_switch': relo #9: insn #122 against 'cache_miss_hc_reader'
libbpf: prog 'kepler_trace': found map 8 (cache_miss_hc_reader, sec 7, off 256) for insn #122
libbpf: sec '.reltracepoint/sched/sched_switch': relo #10: insn #134 against 'cache_miss'
libbpf: prog 'kepler_trace': found map 9 (cache_miss, sec 7, off 288) for insn #134
libbpf: sec '.reltracepoint/sched/sched_switch': relo #11: insn #148 against 'cache_miss'
libbpf: prog 'kepler_trace': found map 9 (cache_miss, sec 7, off 288) for insn #148
libbpf: sec '.reltracepoint/sched/sched_switch': relo #12: insn #156 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 7, off 320) for insn #156
libbpf: sec '.reltracepoint/sched/sched_switch': relo #13: insn #170 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 7, off 320) for insn #170
libbpf: sec '.reltracepoint/sched/sched_switch': relo #14: insn #182 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 7, off 320) for insn #182
libbpf: sec '.reltracepoint/sched/sched_switch': relo #15: insn #206 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 7, off 320) for insn #206
libbpf: sec '.reltracepoint/sched/sched_switch': relo #16: insn #215 against 'pid_time'
libbpf: prog 'kepler_trace': found map 1 (pid_time, sec 7, off 32) for insn #215
libbpf: sec '.reltracepoint/sched/sched_switch': relo #17: insn #223 against 'pid_time'
libbpf: prog 'kepler_trace': found map 1 (pid_time, sec 7, off 32) for insn #223
libbpf: sec '.reltracepoint/sched/sched_switch': relo #18: insn #235 against 'pid_time'
libbpf: prog 'kepler_trace': found map 1 (pid_time, sec 7, off 32) for insn #235
libbpf: sec '.reltracepoint/sched/sched_switch': relo #19: insn #241 against 'processes'
libbpf: prog 'kepler_trace': found map 0 (processes, sec 7, off 0) for insn #241
libbpf: sec '.reltracepoint/sched/sched_switch': relo #20: insn #261 against 'processes'
libbpf: prog 'kepler_trace': found map 0 (processes, sec 7, off 0) for insn #261
libbpf: sec '.reltracepoint/sched/sched_switch': relo #21: insn #287 against 'processes'
libbpf: prog 'kepler_trace': found map 0 (processes, sec 7, off 0) for insn #287
libbpf: sec '.reltracepoint/irq/softirq_entry': collecting relocation for section(5) 'tracepoint/irq/softirq_entry'
libbpf: sec '.reltracepoint/irq/softirq_entry': relo #0: insn #5 against 'processes'
libbpf: prog 'kepler_irq_trace': found map 0 (processes, sec 7, off 0) for insn #5
libbpf: map 'processes': created successfully, fd=10
libbpf: map 'pid_time': created successfully, fd=11
libbpf: map 'cpu_cycles_hc_reader': created successfully, fd=12
libbpf: map 'cpu_cycles': created successfully, fd=13
libbpf: map 'cpu_ref_cycles_hc_reader': created successfully, fd=14
libbpf: map 'cpu_ref_cycles': created successfully, fd=15
libbpf: map 'cpu_instructions_hc_reader': created successfully, fd=16
libbpf: map 'cpu_instructions': created successfully, fd=17
libbpf: map 'cache_miss_hc_reader': created successfully, fd=18
libbpf: map 'cache_miss': created successfully, fd=19
libbpf: map 'cpu_freq_array': created successfully, fd=20
I0901 09:28:25.941646       1 libbpf_attacher.go:157] Successfully load eBPF module from libbpf object
I0901 09:28:25.955182       1 container_energy.go:109] Using the Ratio/DynPower Power Model to estimate Container Platform Power
I0901 09:28:25.955220       1 container_energy.go:118] Using the Ratio/DynPower Power Model to estimate Container Component Power
I0901 09:28:25.955240       1 process_power.go:108] Using the Ratio/DynPower Power Model to estimate Process Platform Power
I0901 09:28:25.955260       1 process_power.go:117] Using the Ratio/DynPower Power Model to estimate Process Component Power
I0901 09:28:25.955595       1 node_platform_energy.go:53] Using the LinearRegressor/AbsPower Power Model to estimate Node Platform Power
I0901 09:28:25.955858       1 exporter.go:276] Started Kepler in 6.244478035s

Is kepler using the estimation model?

This also does not seem right:

k exec -ti -n kepler daemonset/kepler -- bash -c "curl localhost:9102/metrics" | grep 'kepler_container_package_joules_total'
# HELP kepler_container_package_joules_total Aggregated RAPL value in package (socket) in joules
# TYPE kepler_container_package_joules_total counter
kepler_container_package_joules_total{command="",container_id="2641085811f2e467354555e4450b3b23a591cd2b42c56c62ff95a8720235e242",container_name="kepler-exporter",container_namespace="kepler",mode="dynamic",pod_name="kepler-99wsn"} 0
kepler_container_package_joules_total{command="",container_id="2641085811f2e467354555e4450b3b23a591cd2b42c56c62ff95a8720235e242",container_name="kepler-exporter",container_namespace="kepler",mode="idle",pod_name="kepler-99wsn"} 0
kepler_container_package_joules_total{command="",container_id="491136917149570c8e880c5eaa0583f454759e0c38010fa6c5361a96b93d676c",container_name="coredns",container_namespace="kube-system",mode="dynamic",pod_name="coredns-cbbbbb9cb-22dd4"} 0
kepler_container_package_joules_total{command="",container_id="491136917149570c8e880c5eaa0583f454759e0c38010fa6c5361a96b93d676c",container_name="coredns",container_namespace="kube-system",mode="idle",pod_name="coredns-cbbbbb9cb-22dd4"} 0
kepler_container_package_joules_total{command="",container_id="4ed8d09abda59733fe460dfb9c53de5c815d653cd9b89c2a75a73dcdaf257835",container_name="kube-proxy",container_namespace="kube-system",mode="dynamic",pod_name="kube-proxy-trjh8"} 0
kepler_container_package_joules_total{command="",container_id="4ed8d09abda59733fe460dfb9c53de5c815d653cd9b89c2a75a73dcdaf257835",container_name="kube-proxy",container_namespace="kube-system",mode="idle",pod_name="kube-proxy-trjh8"} 0
kepler_container_package_joules_total{command="",container_id="4fa3525f7c80f97f068d4980795cb28e0b8a52b09e1393f5bda19d191a1b5fba",container_name="coredns",container_namespace="kube-system",mode="dynamic",pod_name="coredns-cbbbbb9cb-dvnmt"} 0
kepler_container_package_joules_total{command="",container_id="4fa3525f7c80f97f068d4980795cb28e0b8a52b09e1393f5bda19d191a1b5fba",container_name="coredns",container_namespace="kube-system",mode="idle",pod_name="coredns-cbbbbb9cb-dvnmt"} 0
kepler_container_package_joules_total{command="",container_id="6ff21a22f38baf413aa49b1dbb69c423cd9f194db29f8bf397d0f2af133684da",container_name="aws-node",container_namespace="kube-system",mode="dynamic",pod_name="aws-node-lxhsn"} 0
kepler_container_package_joules_total{command="",container_id="6ff21a22f38baf413aa49b1dbb69c423cd9f194db29f8bf397d0f2af133684da",container_name="aws-node",container_namespace="kube-system",mode="idle",pod_name="aws-node-lxhsn"} 0
kepler_container_package_joules_total{command="",container_id="d98b90b7199e5affa57dcbd1d106cad57705b1d23e8c2c66b2882425c03d0949",container_name="aws-vpc-cni-init",container_namespace="kube-system",mode="dynamic",pod_name="aws-node-lxhsn"} 0
kepler_container_package_joules_total{command="",container_id="d98b90b7199e5affa57dcbd1d106cad57705b1d23e8c2c66b2882425c03d0949",container_name="aws-vpc-cni-init",container_namespace="kube-system",mode="idle",pod_name="aws-node-lxhsn"} 0
kepler_container_package_joules_total{command="",container_id="system_processes",container_name="system_processes",container_namespace="system",mode="dynamic",pod_name="system_processes"} 3020.487
kepler_container_package_joules_total{command="",container_id="system_processes",container_name="system_processes",container_namespace="system",mode="idle",pod_name="system_processes"} 37644.807

All the containers report zero, but this time there are some values for dynamic power (for system_processes).

Do I need to create a separate ticket for this?

rootfs commented 1 year ago

@AntonioDiTuri please collect all the Kepler metrics and open a separate issue with them, thanks!

k exec -ti -n kepler daemonset/kepler -- bash -c "curl localhost:9102/metrics" | grep 'kepler_'

AntonioDiTuri commented 1 year ago

I think we can consider opening another ticket for other EC2 instances once the issue with the first one (c5.metal) is solved. What do you think?