sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption based on these stats, and exports them as Prometheus metrics.
https://sustainable-computing.io
Apache License 2.0

Failed to open path to msr and to attach perf events on VMs #554

Closed · andersonandrei closed 1 year ago

andersonandrei commented 1 year ago

**Describe the bug**
I'm working with Azure AKS (VMs) and I'm getting errors related to msr and perf events. Also, even when I do see a few metric values, they are mostly 0.

**To Reproduce**
Steps to reproduce the behavior:

  1. Deployed it manually, using one of the available manifests, with slight modifications:
    apiVersion: v1
    kind: Namespace
    metadata:
      labels:
        sustainable-computing.io/app: kepler
      name: kepler
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      labels:
        sustainable-computing.io/app: kepler
      name: kepler-sa
      namespace: monitoring
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      labels:
        app.kubernetes.io/component: prometheus
        app.kubernetes.io/instance: k8s
        app.kubernetes.io/name: prometheus
        sustainable-computing.io/app: kepler
      name: prometheus-k8s
      namespace: monitoring
    rules:
    - apiGroups:
      - ""
      resources:
      - services
      - endpoints
      - pods
      verbs:
      - get
      - list
      - watch
    - apiGroups:
      - extensions
      resources:
      - ingresses
      verbs:
      - get
      - list
      - watch
    - apiGroups:
      - networking.k8s.io
      resources:
      - ingresses
      verbs:
      - get
      - list
      - watch
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      labels:
        sustainable-computing.io/app: kepler
      name: kepler-clusterrole
    rules:
    - apiGroups:
      - ""
      resources:
      - nodes/metrics
      - nodes/proxy
      - nodes/stats
      verbs:
      - get
      - watch
      - list
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      labels:
        app.kubernetes.io/component: prometheus
        app.kubernetes.io/instance: k8s
        app.kubernetes.io/name: prometheus
        sustainable-computing.io/app: kepler
      name: prometheus-k8s
      namespace: monitoring
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: prometheus-k8s
    subjects:
    - kind: ServiceAccount
      name: prometheus-k8s
      namespace: monitoring
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      labels:
        sustainable-computing.io/app: kepler
      name: kepler-clusterrole-binding
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: kepler-clusterrole
    subjects:
    - kind: ServiceAccount
      name: kepler-sa
      namespace: monitoring
    ---
    apiVersion: v1
    data:
      BIND_ADDRESS: 0.0.0.0:9102
      CGROUP_METRICS: '*'
      CPU_ARCH_OVERRIDE: ""
      ENABLE_EBPF_CGROUPID: "true"
      ENABLE_GPU: "true"
      KEPLER_LOG_LEVEL: "1"
      KEPLER_namespace: monitoring
      METRIC_PATH: /metrics
      MODEL_CONFIG: |
        CONTAINER_COMPONENTS_ESTIMATOR=false
        CONTAINER_COMPONENTS_INIT_URL=https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentModelWeight/CgroupOnly/ScikitMixed/ScikitMixed.json
    kind: ConfigMap
    metadata:
      labels:
        sustainable-computing.io/app: kepler
      name: kepler-cfm
      namespace: monitoring
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: kepler-exporter-service
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: kepler-exporter
        sustainable-computing.io/app: kepler
      name: kepler-exporter
      namespace: monitoring
    spec:
      ports:
      - name: http
        port: 9102
        targetPort: http
      selector:
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: kepler-exporter
        sustainable-computing.io/app: kepler
    ---
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      labels:
        app: kepler-exporter-service
        sustainable-computing.io/app: kepler
      name: kepler-exporter
      namespace: monitoring
    spec:
      selector:
        matchLabels:
          app: kepler-exporter-service
          app.kubernetes.io/component: exporter
          app.kubernetes.io/name: kepler-exporter
          sustainable-computing.io/app: kepler
      template:
        metadata:
          labels:
            app: kepler-exporter-service
            app.kubernetes.io/component: exporter
            app.kubernetes.io/name: kepler-exporter
            sustainable-computing.io/app: kepler
        spec:
          containers:
          - args:
            - /usr/bin/kepler -v=1
            command:
            - /bin/sh
            - -c
            env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            image: quay.io/sustainable_computing_io/kepler:latest
            imagePullPolicy: Always
            livenessProbe:
              failureThreshold: 5
              httpGet:
                path: /healthz
                port: 9102
                scheme: HTTP
              initialDelaySeconds: 10
              periodSeconds: 60
              successThreshold: 1
              timeoutSeconds: 10
            name: kepler-exporter
            ports:
            - containerPort: 9102
              name: http
            resources:
              requests:
                cpu: 100m
                memory: 400Mi
            securityContext:
              privileged: true
            volumeMounts:
            - mountPath: /lib/modules
              name: lib-modules
            - mountPath: /sys
              name: tracing
            - mountPath: /proc
              name: proc
            - mountPath: /etc/config
              name: cfm
              readOnly: true
          dnsPolicy: ClusterFirstWithHostNet
          serviceAccountName: kepler-sa
          tolerations:
          - effect: NoSchedule
            key: node-role.kubernetes.io/master
          volumes:
          - hostPath:
              path: /lib/modules
              type: Directory
            name: lib-modules
          - hostPath:
              path: /sys
              type: Directory
            name: tracing
          - hostPath:
              path: /proc
              type: Directory
            name: proc
          - configMap:
              name: kepler-cfm
            name: cfm
    ---
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      labels:
        app: prometheus-operator
        release: prometheus
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: kepler-exporter
        sustainable-computing.io/app: kepler
      name: kepler-exporter
      namespace: monitoring
    spec:
      endpoints:
      - interval: 1s
        port: http
      namespaceSelector:
        matchNames:
        - ryaxns-monitoring
      selector:
        matchLabels:
          app: kepler-exporter-service
  2. See error

I0301 14:42:12.695339       1 gpu_nvml.go:45] could not init nvml: <nil>
Failed to init nvml: could not init nvml: <nil>, using dummy source to obtain gpu power
I0301 14:42:12.708031       1 exporter.go:150] Kepler running on version: 0d3e6ce
I0301 14:42:12.708117       1 config.go:153] using gCgroup ID in the BPF program: true
I0301 14:42:12.708205       1 config.go:154] kernel version: 5.4
I0301 14:42:12.708271       1 config.go:172] EnabledGPU: true
I0301 14:42:12.708435       1 rapl_msr_util.go:143] failed to open path /dev/cpu/1/msr: no such file or directory
I0301 14:42:12.708526       1 power.go:64] Not able to obtain power, use estimate method
I0301 14:42:12.711548       1 slice_handler.go:179] Not able to find any valid .scope file in /sys/fs/cgroup/cpu/system.slice, this likely cause all cgroup metrics to be 0
I0301 14:42:13.016250       1 exporter.go:168] Initializing the GPU collector
perf_event_open: No such file or directory
I0301 14:42:15.542780       1 bcc_attacher.go:98] failed to attach perf event cpu_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542866       1 bcc_attacher.go:98] failed to attach perf event cpu_ref_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542922       1 bcc_attacher.go:98] failed to attach perf event cpu_instr_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542990       1 bcc_attacher.go:98] failed to attach perf event cache_miss_hc_reader: failed to open bpf perf event: no such file or directory
I0301 14:42:15.543017       1 bcc_attacher.go:132] Successfully load eBPF module with option: [-DNUM_CPUS=2]
I0301 14:42:15.601337       1 exporter.go:210] Started Kepler in 2.89332793s



3. Possible workaround for the msr problem:

I was able to "solve" the msr path problem by deploying a simple pod, opening a terminal inside it, and installing msr manually with `apt-get update -y`, `apt-get install msr-tools`, and `modprobe msr` (see the sketch after this list). However, I'm not sure this is the correct way to do it, and the other errors persist.

4. More details:
Even with the errors above, I get a few metric values, but the majority are still 0.
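
For reference, the workaround above amounts to loading the msr kernel module from a privileged container that shares the host kernel; a minimal sketch (the package manager depends on the pod image):

    # Inside a privileged pod on the affected node
    apt-get update -y
    apt-get install -y msr-tools   # userspace rdmsr/wrmsr utilities
    modprobe msr                   # loads the msr module; creates /dev/cpu/*/msr
    ls /dev/cpu/0/msr              # verify the device node now exists

Note that even with the device nodes present, reads of the RAPL MSRs may still fail on a VM, since the hypervisor typically does not expose them.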

**Expected behavior**
To not see those errors and to get energy metrics.

**Azure AKS (VMs):**
 - OS: Ubuntu 20.04.5
 - Kernel: 5.14.0-1057-oem

**Additional context**
Also, Kepler does not export metrics for new pods. I can only see metrics for pods that already existed on the platform before Kepler's deployment; I need to re-deploy Kepler to see the metrics of such new pods.
jichenjc commented 1 year ago

Also, Kepler does not export metrics for new pods. I can only see metrics for pods that already existed on the platform before Kepler's deployment; I need to re-deploy Kepler to see the metrics of such new pods.

I reported this before, but I think your version is 0d3e6ce, which is a really new version, so can you help report this with more detail in another issue? @andersonandrei

jichenjc commented 1 year ago

From those logs I am wondering whether 1) you didn't enable cgroup v2, or 2) you didn't enable eBPF. It seems you are using Azure VMs, and I don't know whether those can be enabled there.

I0301 14:42:12.708526       1 power.go:64] Not able to obtain power, use estimate method
I0301 14:42:12.711548       1 slice_handler.go:179] Not able to find any valid .scope file in /sys/fs/cgroup/cpu/system.slice, this likely cause all cgroup metrics to be 0
I0301 14:42:13.016250       1 exporter.go:168] Initializing the GPU collector
perf_event_open: No such file or directory
I0301 14:42:15.542780       1 bcc_attacher.go:98] failed to attach perf event cpu_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542866       1 bcc_attacher.go:98] failed to attach perf event cpu_ref_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542922       1 bcc_attacher.go:98] failed to attach perf event cpu_instr_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542990       1 bcc_attacher.go:98] failed to attach perf event cache_miss_hc_reader: failed to open bpf perf event: no such file or directory
rootfs commented 1 year ago

@andersonandrei the msr error is benign: if msr cannot be accessed (typical on VMs), the power calculation model is switched to a linear-regression-based method.

What metrics can you see? Can you post them here? In addition, can you change the verbosity to 5 (like below) and share the log?

containers:
      - args:
        - /usr/bin/kepler -v=5
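
For anyone reproducing this, one way to apply that change without editing the manifest by hand is a JSON patch against the DaemonSet; a sketch, assuming the DaemonSet name and namespace from the manifest earlier in this issue:

    # Replace the single args entry so Kepler starts with -v=5, then wait for the rollout
    kubectl -n monitoring patch daemonset kepler-exporter --type=json \
      -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/args/0", "value": "/usr/bin/kepler -v=5"}]'
    kubectl -n monitoring rollout status daemonset/kepler-exporter
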
andersonandrei commented 1 year ago

From those logs I am wondering whether 1) you didn't enable cgroup v2, or 2) you didn't enable eBPF. It seems you are using Azure VMs, and I don't know whether those can be enabled there.

I0301 14:42:12.708526       1 power.go:64] Not able to obtain power, use estimate method
I0301 14:42:12.711548       1 slice_handler.go:179] Not able to find any valid .scope file in /sys/fs/cgroup/cpu/system.slice, this likely cause all cgroup metrics to be 0
I0301 14:42:13.016250       1 exporter.go:168] Initializing the GPU collector
perf_event_open: No such file or directory
I0301 14:42:15.542780       1 bcc_attacher.go:98] failed to attach perf event cpu_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542866       1 bcc_attacher.go:98] failed to attach perf event cpu_ref_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542922       1 bcc_attacher.go:98] failed to attach perf event cpu_instr_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542990       1 bcc_attacher.go:98] failed to attach perf event cache_miss_hc_reader: failed to open bpf perf event: no such file or directory

I can see that cgroup2 is enabled:

root@server-vmss000001:/# grep cgroup /proc/filesystems
nodev   cgroup
nodev   cgroup2

root@server-vmss000001:/# uname -a
Linux server-vmss000001 5.4.0-1091-azure #96~18.04.1-Ubuntu SMP Tue Aug 30 19:15:32 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

But I'm not sure how to check if eBPF is enabled. Can you help me, please?

Thanks a lot!
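
(For reference, a few generic checks for eBPF support on a node; bpftool may not be installed in the node image, so this is a sketch rather than a definitive test:

    # BPF-related kernel build options (config path varies by distro)
    grep -E 'CONFIG_BPF=|CONFIG_BPF_SYSCALL=' /boot/config-$(uname -r)
    # The BPF filesystem is usually mounted when eBPF is usable
    mount | grep bpf
    # If bpftool is available, probe what the running kernel supports
    bpftool feature probe kernel | head

A kernel built with CONFIG_BPF_SYSCALL=y is the baseline; the perf_event_open failures above are a separate matter, since hardware perf counters are usually not exposed to VMs.)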

andersonandrei commented 1 year ago

Also, Kepler does not export metrics for new pods. I can only see metrics for pods that already existed on the platform before Kepler's deployment; I need to re-deploy Kepler to see the metrics of such new pods.

I reported this before, but I think your version is 0d3e6ce, which is a really new version, so can you help report this with more detail in another issue? @andersonandrei

Hello @jichenjc, thanks for your answer.

I can try to provide more details, but I'm not sure whether I should open a new issue for that or comment on this one. What do you suggest?

Thanks!

rootfs commented 1 year ago

let's keep this issue open and share comments here

andersonandrei commented 1 year ago

@andersonandrei the msr error is benign: if msr cannot be accessed (typical on VMs), the power calculation model is switched to a linear-regression-based method.

What metrics can you see? Can you post them here? In addition, can you change the verbosity to 5 (like below) and share the log?

containers:
      - args:
        - /usr/bin/kepler -v=5

Hello @rootfs, thanks for your answer.

I just updated the verbosity:

I0302 15:26:45.355370       1 gpu_nvml.go:45] could not init nvml: <nil>
Failed to init nvml: could not init nvml: <nil>, using dummy source to obtain gpu power
I0302 15:26:45.356652       1 exporter.go:150] Kepler running on version: 71ef9dc
I0302 15:26:45.356831       1 config.go:153] using gCgroup ID in the BPF program: true
I0302 15:26:45.356983       1 config.go:154] kernel version: 5.4
I0302 15:26:45.357110       1 config.go:172] EnabledGPU: true
I0302 15:26:45.357253       1 slice_handler.go:145] InitSliceHandler: &{map[] /sys/fs/cgroup/cpu/system.slice /sys/fs/cgroup/memory/system.slice /sys/fs/cgroup/blkio/system.slice}
I0302 15:26:45.357503       1 rapl_msr_util.go:143] failed to open path /dev/cpu/1/msr: no such file or directory
I0302 15:26:45.357662       1 power.go:64] Not able to obtain power, use estimate method
I0302 15:26:45.357796       1 bcc_attacher.go:144] hardeware counter metrics config true
I0302 15:26:45.357868       1 bcc_attacher.go:162] irq counter metrics config true
I0302 15:26:45.360564       1 slice_handler.go:179] Not able to find any valid .scope file in /sys/fs/cgroup/cpu/system.slice, this likely cause all cgroup metrics to be 0
I0302 15:26:45.543243       1 utils.go:58] Available ebpf metrics: [cpu_time irq_net_tx irq_net_rx irq_block]
I0302 15:26:45.543521       1 utils.go:59] Available counter metrics: [cpu_cycles cpu_ref_cycles cpu_instr cache_miss]
I0302 15:26:45.543534       1 utils.go:60] Available cgroup metrics from cgroup: [cgroupfs_kernel_memory_usage_bytes cgroupfs_tcp_memory_usage_bytes cgroupfs_cpu_usage_us cgroupfs_system_cpu_usage_us cgroupfs_user_cpu_usage_us cgroupfs_memory_usage_bytes]
I0302 15:26:45.543560       1 utils.go:61] Available cgroup metrics from kubelet: [container_cpu_usage_seconds_total container_memory_working_set_bytes]
I0302 15:26:45.543581       1 utils.go:62] Available I/O metrics: [bytes_read bytes_writes]
I0302 15:26:45.543674       1 model.go:86] Model Config NODE_TOTAL: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0302 15:26:45.543696       1 lr.go:171] LR Model (AbsModelWeight): no config
I0302 15:26:45.543701       1 model.go:78] Model AbsModelWeight initiated (false)
I0302 15:26:45.543710       1 model.go:86] Model Config NODE_COMPONENTS: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0302 15:26:45.544816       1 lr.go:164] LR Model (AbsComponentModelWeight): loadWeightFromURLorLocal(/var/lib/kepler/data/KerasCompWeightFullPipeline.json): map[core:{{17.49994468688965 map[cpu_architecture:map[Alder Lake:{0.5408945679664612} Broadwell:{17.9639892578125} Cascade Lake:{-0.49166440963745117} Coffee Lake:{0.5166589617729187} Haswell:{-0.5789095163345337} Ivy Bridge:{-0.024028241634368896} Sandy Bridge:{0.5239214301109314} Sky Lake:{0.4193417429924011}]] map[cpu_cycles:{6.85713664e+09 7.560771917192364e+18 -0.11352460086345673} cpu_instr:{3.374244864e+09 8.408530291701842e+17 -0.414739191532135} cpu_time:{192019.5 5.2761312e+08 -0.06457684189081192}]}} dram:{{17.49994468688965 map[cpu_architecture:map[Alder Lake:{-0.11559933423995972} Broadwell:{16.972564697265625} Cascade Lake:{0.5505847334861755} Coffee Lake:{-0.4564790725708008} Haswell:{-0.13912856578826904} Ivy Bridge:{-0.018331050872802734} Sandy Bridge:{-0.6695247888565063} Sky Lake:{0.29698115587234497}]] map[cache_miss:{9.329199e+06 2.2145245642752e+13 -0.4680119752883911} container_memory_working_set_bytes:{253952 1.93474854912e+11 0.6805805563926697}]}}]
I0302 15:26:45.544906       1 model.go:78] Model AbsComponentModelWeight initiated (true)
I0302 15:26:45.544919       1 model.go:86] Model Config CONTAINER_TOTAL: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0302 15:26:45.544927       1 lr.go:171] LR Model (DynModelWeight): no config
I0302 15:26:45.544931       1 model.go:78] Model DynModelWeight initiated (false)
I0302 15:26:45.544954       1 model.go:86] Model Config CONTAINER_COMPONENTS: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentModelWeight/CgroupOnly/ScikitMixed/ScikitMixed.json}
I0302 15:26:45.571424       1 lr.go:164] LR Model (DynComponentModelWeight): loadWeightFromURLorLocal(https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentModelWeight/CgroupOnly/ScikitMixed/ScikitMixed.json): map[dram:{{0.8339537395297559 map[] map[cgroupfs_cpu_usage_us:{6.557768917682226e+06 1.0111459601975505e+14 0.30000764700549093} cgroupfs_memory_usage_bytes:{1.1568395713957231e+07 1.4831755031743084e+14 0.010739137690300415} cgroupfs_system_cpu_usage_us:{320904.6193377788 3.07640381448747e+10 0.2612100015047149} cgroupfs_user_cpu_usage_us:{6.236864298459416e+06 1.0251834339511184e+14 0}]}} pkg:{{24.603798628318298 map[] map[cgroupfs_cpu_usage_us:{6.557768917682226e+06 1.0111459601975505e+14 0} cgroupfs_memory_usage_bytes:{1.1568395713957231e+07 1.4831755031743084e+14 0} cgroupfs_system_cpu_usage_us:{320904.6193377788 3.07640381448747e+10 0} cgroupfs_user_cpu_usage_us:{6.236864298459416e+06 1.0251834339511184e+14 24.50009569917875}]}}]
I0302 15:26:45.571713       1 model.go:78] Model DynComponentModelWeight initiated (true)
I0302 15:26:45.571842       1 model.go:86] Model Config PROCESS_TOTAL: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0302 15:26:45.572019       1 lr.go:171] LR Model (DynModelWeight): no config
I0302 15:26:45.572119       1 model.go:78] Model DynModelWeight initiated (false)
I0302 15:26:45.572256       1 model.go:86] Model Config PROCESS_COMPONENTS: {UseEstimatorSidecar:false SelectedModel: SelectFilter: InitModelURL:}
I0302 15:26:45.577089       1 lr.go:164] LR Model (DynComponentModelWeight): loadWeightFromURLorLocal(https://raw.githubusercontent.com/sustainable-computing-io/kepler-model-server/main/tests/test_models/DynComponentModelWeight/CounterOnly/ScikitMixed/ScikitMixed.json): map[dram:{{0.8318076441807906 map[] map[cache_miss:{1.9615158772131525e+07 8.320351534182539e+13 0.22602678738192125} cpu_cycles:{1.6455473194342838e+10 5.6028839338593524e+20 0.146880994775066} cpu_instr:{2.3490652312518856e+10 3.0816587041591017e+21 0}]}} pkg:{{24.388564716241596 map[] map[cache_miss:{1.9615158772131525e+07 8.320351534182539e+13 0} cpu_cycles:{1.6455473194342838e+10 5.6028839338593524e+20 15.858373957810427} cpu_instr:{2.3490652312518856e+10 3.0816587041591017e+21 8.25749138735891}]}}]
I0302 15:26:45.577314       1 model.go:78] Model DynComponentModelWeight initiated (true)
I0302 15:26:45.577409       1 exporter.go:168] Initializing the GPU collector
I0302 15:26:45.580707       1 acpi.go:75] Using the ACPI power meter path: /sys/class/hwmon/hwmon2/device/
perf_event_open: No such file or directory
I0302 15:26:47.492899       1 bcc_attacher.go:98] failed to attach perf event cpu_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0302 15:26:47.493069       1 bcc_attacher.go:98] failed to attach perf event cpu_ref_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0302 15:26:47.493216       1 bcc_attacher.go:98] failed to attach perf event cpu_instr_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0302 15:26:47.493336       1 bcc_attacher.go:98] failed to attach perf event cache_miss_hc_reader: failed to open bpf perf event: no such file or directory
I0302 15:26:47.493365       1 bcc_attacher.go:132] Successfully load eBPF module with option: [-DNUM_CPUS=2]
I0302 15:26:47.531205       1 node_energy_collector.go:60] Node components power model collection is supported
I0302 15:26:47.532142       1 exporter.go:210] Started Kepler in 2.175512593s
I0302 15:26:51.778338       1 container_cgroup_collector.go:28] overall cgroup stats &{map[:[{/sys/fs/cgroup/cpu/system.slice} {/sys/fs/cgroup/memory/system.slice} {/sys/fs/cgroup/blkio/system.slice}]] /sys/fs/cgroup/cpu/system.slice /sys/fs/cgroup/memory/system.slice /sys/fs/cgroup/blkio/system.slice}
I0302 15:26:52.094148       1 node_energy_collector.go:60] Node components power model collection is supported
I0302 15:26:52.094461       1 metric_collector.go:137] energy from pod/container (0 active processes): name: owdev-kafkaprovider-69977b75cc-hf27p/kafkaprovider namespace: monitoring 
rootfs commented 1 year ago

@andersonandrei the logs look like Kepler is running. Can you get the Kepler container metrics, e.g., through this command?

kubectl exec -ti -n kepler daemonset/kepler-exporter -- bash  -c "curl localhost:9102/metrics|grep kepler_container_core|sort -k 2 -g"
andersonandrei commented 1 year ago
# HELP kepler_container_core_joules_total Aggregated RAPL value in core in joules
# TYPE kepler_container_core_joules_total counter
kepler_container_core_joules_total{command="",container_name="alarmprovider",container_namespace="monitoring",mode="idle",pod_name="owdev-alarmprovider-7b6dbf84d9-5x5ls"} 0
kepler_container_core_joules_total{command="",container_name="alertmanager",container_namespace="monitoring",mode="idle",pod_name="alertmanager-prometheus-kube-prometheus-alertmanager-0"} 0
kepler_container_core_joules_total{command="",container_name="apigateway",container_namespace="monitoring",mode="idle",pod_name="owdev-apigateway-6d8b89b6c-zk8rs"} 0
kepler_container_core_joules_total{command="",container_name="autoscaler",container_namespace="kube-system",mode="idle",pod_name="coredns-autoscaler-5589fb5654-hc72l"} 0
kepler_container_core_joules_total{command="",container_name="azure-ip-masq-agent",container_namespace="kube-system",mode="idle",pod_name="azure-ip-masq-agent-5l5gm"} 0
kepler_container_core_joules_total{command="",container_name="azuredisk",container_namespace="kube-system",mode="idle",pod_name="csi-azuredisk-node-k25rk"} 0
kepler_container_core_joules_total{command="",container_name="azurefile",container_namespace="kube-system",mode="idle",pod_name="csi-azurefile-node-7k9qt"} 0
kepler_container_core_joules_total{command="",container_name="cloud-node-manager",container_namespace="kube-system",mode="idle",pod_name="cloud-node-manager-m9t7h"} 0
kepler_container_core_joules_total{command="",container_name="config-reloader",container_namespace="monitoring",mode="idle",pod_name="alertmanager-prometheus-kube-prometheus-alertmanager-0"} 0
kepler_container_core_joules_total{command="",container_name="config-reloader",container_namespace="monitoring",mode="idle",pod_name="prometheus-prometheus-kube-prometheus-prometheus-0"} 0
kepler_container_core_joules_total{command="",container_name="coredns",container_namespace="kube-system",mode="idle",pod_name="coredns-b4854dd98-shw8j"} 0
kepler_container_core_joules_total{command="",container_name="coredns",container_namespace="kube-system",mode="idle",pod_name="coredns-b4854dd98-wxvlr"} 0
kepler_container_core_joules_total{command="",container_name="couchdb",container_namespace="monitoring",mode="idle",pod_name="owdev-couchdb-7cf946b654-vkmk8"} 0
kepler_container_core_joules_total{command="",container_name="gen-certs",container_namespace="monitoring",mode="idle",pod_name="owdev-gen-certs-s6cp4"} 0
kepler_container_core_joules_total{command="",container_name="grafana",container_namespace="monitoring",mode="idle",pod_name="prometheus-grafana-656d56dd94-rpnz4"} 0
kepler_container_core_joules_total{command="",container_name="grafana-sc-dashboard",container_namespace="monitoring",mode="idle",pod_name="prometheus-grafana-656d56dd94-rpnz4"} 0
kepler_container_core_joules_total{command="",container_name="grafana-sc-datasources",container_namespace="monitoring",mode="idle",pod_name="prometheus-grafana-656d56dd94-rpnz4"} 0
kepler_container_core_joules_total{command="",container_name="init-config-reloader",container_namespace="monitoring",mode="idle",pod_name="prometheus-prometheus-kube-prometheus-prometheus-0"} 0
kepler_container_core_joules_total{command="",container_name="init-couchdb",container_namespace="monitoring",mode="idle",pod_name="owdev-init-couchdb-9vrb7"} 0
kepler_container_core_joules_total{command="",container_name="init-node",container_namespace="server",mode="idle",pod_name="server-registry-cert-setup-sx25m"} 0
kepler_container_core_joules_total{command="",container_name="init-node",container_namespace="monitoring",mode="idle",pod_name="debug-wk948"} 0
kepler_container_core_joules_total{command="",container_name="install-packages",container_namespace="monitoring",mode="idle",pod_name="owdev-install-packages-6fdwf"} 0
kepler_container_core_joules_total{command="",container_name="install-packages",container_namespace="monitoring",mode="idle",pod_name="owdev-install-packages-gt7gl"} 0
kepler_container_core_joules_total{command="",container_name="install-packages",container_namespace="monitoring",mode="idle",pod_name="owdev-install-packages-rjksd"} 0
kepler_container_core_joules_total{command="",container_name="install-packages",container_namespace="monitoring",mode="idle",pod_name="owdev-install-packages-vr42v"} 0
kepler_container_core_joules_total{command="",container_name="invoker",container_namespace="monitoring",mode="idle",pod_name="owdev-invoker-0"} 0
kepler_container_core_joules_total{command="",container_name="kafka",container_namespace="monitoring",mode="idle",pod_name="owdev-kafka-0"} 0
kepler_container_core_joules_total{command="",container_name="kafkaprovider",container_namespace="monitoring",mode="idle",pod_name="owdev-kafkaprovider-69977b75cc-hf27p"} 0
kepler_container_core_joules_total{command="",container_name="kepler-exporter",container_namespace="monitoring",mode="idle",pod_name="kepler-exporter-8x47p"} 0
kepler_container_core_joules_total{command="",container_name="konnectivity-agent",container_namespace="kube-system",mode="idle",pod_name="konnectivity-agent-6fcc478f7d-z57d2"} 0
kepler_container_core_joules_total{command="",container_name="kube-prometheus-stack",container_namespace="monitoring",mode="idle",pod_name="prometheus-kube-prometheus-operator-5fd846f56-fvjcg"} 0
kepler_container_core_joules_total{command="",container_name="kube-proxy",container_namespace="kube-system",mode="idle",pod_name="kube-proxy-t99c7"} 0
kepler_container_core_joules_total{command="",container_name="kube-proxy-bootstrap",container_namespace="kube-system",mode="idle",pod_name="kube-proxy-t99c7"} 0
kepler_container_core_joules_total{command="",container_name="kube-state-metrics",container_namespace="monitoring",mode="idle",pod_name="prometheus-kube-state-metrics-84b79bbdcf-59vj8"} 0
kepler_container_core_joules_total{command="",container_name="liveness-probe",container_namespace="kube-system",mode="idle",pod_name="csi-azuredisk-node-k25rk"} 0
kepler_container_core_joules_total{command="",container_name="liveness-probe",container_namespace="kube-system",mode="idle",pod_name="csi-azurefile-node-7k9qt"} 0
kepler_container_core_joules_total{command="",container_name="metrics-server",container_namespace="kube-system",mode="idle",pod_name="metrics-server-f77b4cd8-46qs7"} 0
kepler_container_core_joules_total{command="",container_name="metrics-server",container_namespace="kube-system",mode="idle",pod_name="metrics-server-f77b4cd8-54gt6"} 0
kepler_container_core_joules_total{command="",container_name="nginx",container_namespace="monitoring",mode="idle",pod_name="owdev-nginx-857fb7dc66-jtngj"} 0
kepler_container_core_joules_total{command="",container_name="node-driver-registrar",container_namespace="kube-system",mode="idle",pod_name="csi-azuredisk-node-k25rk"} 0
kepler_container_core_joules_total{command="",container_name="node-driver-registrar",container_namespace="kube-system",mode="idle",pod_name="csi-azurefile-node-7k9qt"} 0
kepler_container_core_joules_total{command="",container_name="prometheus",container_namespace="monitoring",mode="idle",pod_name="prometheus-prometheus-kube-prometheus-prometheus-0"} 0
kepler_container_core_joules_total{command="",container_name="promtail",container_namespace="monitoring",mode="idle",pod_name="loki-promtail-wlcxl"} 0
kepler_container_core_joules_total{command="",container_name="rabbitmq",container_namespace="server",mode="idle",pod_name="server-broker-0"} 0
kepler_container_core_joules_total{command="",container_name="redis",container_namespace="monitoring",mode="idle",pod_name="owdev-redis-bc89c877-tblns"} 0
kepler_container_core_joules_total{command="",container_name="redis-init",container_namespace="monitoring",mode="idle",pod_name="owdev-redis-bc89c877-tblns"} 0
kepler_container_core_joules_total{command="",container_name="server-authorization",container_namespace="server",mode="idle",pod_name="server-authorization-fbcbcdbb7-6kqnh"} 0
kepler_container_core_joules_total{command="",container_name="server-authorization-database-migration",container_namespace="server",mode="idle",pod_name="server-authorization-fbcbcdbb7-6kqnh"} 0
kepler_container_core_joules_total{command="",container_name="server-filestore",container_namespace="server",mode="idle",pod_name="server-filestore-6686ffc6-b4sft"} 0
kepler_container_core_joules_total{command="",container_name="server-front",container_namespace="server",mode="idle",pod_name="server-front-559f9b597c-h68bz"} 0
kepler_container_core_joules_total{command="",container_name="server-registry",container_namespace="server",mode="idle",pod_name="server-registry-fb6bbcd75-lcrjd"} 0
kepler_container_core_joules_total{command="",container_name="scaphandre",container_namespace="default",mode="idle",pod_name="scaphandre-slbq8"} 0
kepler_container_core_joules_total{command="",container_name="system_processes",container_namespace="system",mode="idle",pod_name="system_processes"} 0
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="idle",pod_name="wskowdev-invoker-00-1-prewarm-nodejs10"} 0
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="idle",pod_name="wskowdev-invoker-00-14-guest-matmul"} 0
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="idle",pod_name="wskowdev-invoker-00-15-guest-matmul"} 0
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="idle",pod_name="wskowdev-invoker-00-2-prewarm-nodejs10"} 0
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="idle",pod_name="wskowdev2-invoker-00-1-prewarm-nodejs10"} 0
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="idle",pod_name="wskowdev2-invoker-00-2-prewarm-nodejs10"} 0
kepler_container_core_joules_total{command="",container_name="wait",container_namespace="server",mode="idle",pod_name="server-registry-cert-setup-sx25m"} 0
kepler_container_core_joules_total{command="",container_name="wait-for-controller",container_namespace="monitoring",mode="idle",pod_name="owdev-alarmprovider-7b6dbf84d9-5x5ls"} 0
kepler_container_core_joules_total{command="",container_name="wait-for-controller",container_namespace="monitoring",mode="idle",pod_name="owdev-invoker-0"} 0
kepler_container_core_joules_total{command="",container_name="wait-for-controller",container_namespace="monitoring",mode="idle",pod_name="owdev-kafkaprovider-69977b75cc-hf27p"} 0
kepler_container_core_joules_total{command="",container_name="wait-for-controller",container_namespace="monitoring",mode="idle",pod_name="owdev-nginx-857fb7dc66-jtngj"} 0
kepler_container_core_joules_total{command="",container_name="wait-for-couchdb",container_namespace="monitoring",mode="idle",pod_name="owdev-controller-0"} 0
kepler_container_core_joules_total{command="",container_name="wait-for-healthy-invoker",container_namespace="monitoring",mode="idle",pod_name="owdev-install-packages-6fdwf"} 0
kepler_container_core_joules_total{command="",container_name="wait-for-healthy-invoker",container_namespace="monitoring",mode="idle",pod_name="owdev-install-packages-gt7gl"} 0
kepler_container_core_joules_total{command="",container_name="wait-for-healthy-invoker",container_namespace="monitoring",mode="idle",pod_name="owdev-install-packages-rjksd"} 0
kepler_container_core_joules_total{command="",container_name="wait-for-healthy-invoker",container_namespace="monitoring",mode="idle",pod_name="owdev-install-packages-vr42v"} 0
kepler_container_core_joules_total{command="",container_name="wait-for-kafka",container_namespace="monitoring",mode="idle",pod_name="owdev-controller-0"} 0
kepler_container_core_joules_total{command="",container_name="wait-for-zookeeper",container_namespace="monitoring",mode="idle",pod_name="owdev-kafka-0"} 0
kepler_container_core_joules_total{command="",container_name="wskadmin",container_namespace="monitoring",mode="idle",pod_name="owdev-wskadmin"} 0
kepler_container_core_joules_total{command="",container_name="alarmprovider",container_namespace="monitoring",mode="dynamic",pod_name="owdev-alarmprovider-7b6dbf84d9-5x5ls"} 19737.4
kepler_container_core_joules_total{command="",container_name="alertmanager",container_namespace="monitoring",mode="dynamic",pod_name="alertmanager-prometheus-kube-prometheus-alertmanager-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="apigateway",container_namespace="monitoring",mode="dynamic",pod_name="owdev-apigateway-6d8b89b6c-zk8rs"} 19737.4
kepler_container_core_joules_total{command="",container_name="autoscaler",container_namespace="kube-system",mode="dynamic",pod_name="coredns-autoscaler-5589fb5654-hc72l"} 19737.4
kepler_container_core_joules_total{command="",container_name="azure-ip-masq-agent",container_namespace="kube-system",mode="dynamic",pod_name="azure-ip-masq-agent-5l5gm"} 19737.4
kepler_container_core_joules_total{command="",container_name="azuredisk",container_namespace="kube-system",mode="dynamic",pod_name="csi-azuredisk-node-k25rk"} 19737.4
kepler_container_core_joules_total{command="",container_name="azurefile",container_namespace="kube-system",mode="dynamic",pod_name="csi-azurefile-node-7k9qt"} 19737.4
kepler_container_core_joules_total{command="",container_name="cloud-node-manager",container_namespace="kube-system",mode="dynamic",pod_name="cloud-node-manager-m9t7h"} 19737.4
kepler_container_core_joules_total{command="",container_name="config-reloader",container_namespace="monitoring",mode="dynamic",pod_name="alertmanager-prometheus-kube-prometheus-alertmanager-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="config-reloader",container_namespace="monitoring",mode="dynamic",pod_name="prometheus-prometheus-kube-prometheus-prometheus-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="coredns",container_namespace="kube-system",mode="dynamic",pod_name="coredns-b4854dd98-shw8j"} 19737.4
kepler_container_core_joules_total{command="",container_name="coredns",container_namespace="kube-system",mode="dynamic",pod_name="coredns-b4854dd98-wxvlr"} 19737.4
kepler_container_core_joules_total{command="",container_name="couchdb",container_namespace="monitoring",mode="dynamic",pod_name="owdev-couchdb-7cf946b654-vkmk8"} 19737.4
kepler_container_core_joules_total{command="",container_name="gen-certs",container_namespace="monitoring",mode="dynamic",pod_name="owdev-gen-certs-s6cp4"} 19737.4
kepler_container_core_joules_total{command="",container_name="grafana",container_namespace="monitoring",mode="dynamic",pod_name="prometheus-grafana-656d56dd94-rpnz4"} 19737.4
kepler_container_core_joules_total{command="",container_name="grafana-sc-dashboard",container_namespace="monitoring",mode="dynamic",pod_name="prometheus-grafana-656d56dd94-rpnz4"} 19737.4
kepler_container_core_joules_total{command="",container_name="grafana-sc-datasources",container_namespace="monitoring",mode="dynamic",pod_name="prometheus-grafana-656d56dd94-rpnz4"} 19737.4
kepler_container_core_joules_total{command="",container_name="init-config-reloader",container_namespace="monitoring",mode="dynamic",pod_name="prometheus-prometheus-kube-prometheus-prometheus-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="init-couchdb",container_namespace="monitoring",mode="dynamic",pod_name="owdev-init-couchdb-9vrb7"} 19737.4
kepler_container_core_joules_total{command="",container_name="init-node",container_namespace="server",mode="dynamic",pod_name="server-registry-cert-setup-sx25m"} 19737.4
kepler_container_core_joules_total{command="",container_name="init-node",container_namespace="monitoring",mode="dynamic",pod_name="debug-wk948"} 19737.4
kepler_container_core_joules_total{command="",container_name="install-packages",container_namespace="monitoring",mode="dynamic",pod_name="owdev-install-packages-6fdwf"} 19737.4
kepler_container_core_joules_total{command="",container_name="install-packages",container_namespace="monitoring",mode="dynamic",pod_name="owdev-install-packages-gt7gl"} 19737.4
kepler_container_core_joules_total{command="",container_name="install-packages",container_namespace="monitoring",mode="dynamic",pod_name="owdev-install-packages-rjksd"} 19737.4
kepler_container_core_joules_total{command="",container_name="install-packages",container_namespace="monitoring",mode="dynamic",pod_name="owdev-install-packages-vr42v"} 19737.4
kepler_container_core_joules_total{command="",container_name="invoker",container_namespace="monitoring",mode="dynamic",pod_name="owdev-invoker-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="kafka",container_namespace="monitoring",mode="dynamic",pod_name="owdev-kafka-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="kafkaprovider",container_namespace="monitoring",mode="dynamic",pod_name="owdev-kafkaprovider-69977b75cc-hf27p"} 19737.4
kepler_container_core_joules_total{command="",container_name="kepler-exporter",container_namespace="monitoring",mode="dynamic",pod_name="kepler-exporter-8x47p"} 19737.4
kepler_container_core_joules_total{command="",container_name="konnectivity-agent",container_namespace="kube-system",mode="dynamic",pod_name="konnectivity-agent-6fcc478f7d-z57d2"} 19737.4
kepler_container_core_joules_total{command="",container_name="kube-prometheus-stack",container_namespace="monitoring",mode="dynamic",pod_name="prometheus-kube-prometheus-operator-5fd846f56-fvjcg"} 19737.4
kepler_container_core_joules_total{command="",container_name="kube-proxy",container_namespace="kube-system",mode="dynamic",pod_name="kube-proxy-t99c7"} 19737.4
kepler_container_core_joules_total{command="",container_name="kube-proxy-bootstrap",container_namespace="kube-system",mode="dynamic",pod_name="kube-proxy-t99c7"} 19737.4
kepler_container_core_joules_total{command="",container_name="kube-state-metrics",container_namespace="monitoring",mode="dynamic",pod_name="prometheus-kube-state-metrics-84b79bbdcf-59vj8"} 19737.4
kepler_container_core_joules_total{command="",container_name="liveness-probe",container_namespace="kube-system",mode="dynamic",pod_name="csi-azuredisk-node-k25rk"} 19737.4
kepler_container_core_joules_total{command="",container_name="liveness-probe",container_namespace="kube-system",mode="dynamic",pod_name="csi-azurefile-node-7k9qt"} 19737.4
kepler_container_core_joules_total{command="",container_name="metrics-server",container_namespace="kube-system",mode="dynamic",pod_name="metrics-server-f77b4cd8-46qs7"} 19737.4
kepler_container_core_joules_total{command="",container_name="metrics-server",container_namespace="kube-system",mode="dynamic",pod_name="metrics-server-f77b4cd8-54gt6"} 19737.4
kepler_container_core_joules_total{command="",container_name="nginx",container_namespace="monitoring",mode="dynamic",pod_name="owdev-nginx-857fb7dc66-jtngj"} 19737.4
kepler_container_core_joules_total{command="",container_name="node-driver-registrar",container_namespace="kube-system",mode="dynamic",pod_name="csi-azuredisk-node-k25rk"} 19737.4
kepler_container_core_joules_total{command="",container_name="node-driver-registrar",container_namespace="kube-system",mode="dynamic",pod_name="csi-azurefile-node-7k9qt"} 19737.4
kepler_container_core_joules_total{command="",container_name="prometheus",container_namespace="monitoring",mode="dynamic",pod_name="prometheus-prometheus-kube-prometheus-prometheus-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="promtail",container_namespace="monitoring",mode="dynamic",pod_name="loki-promtail-wlcxl"} 19737.4
kepler_container_core_joules_total{command="",container_name="rabbitmq",container_namespace="server",mode="dynamic",pod_name="server-broker-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="redis",container_namespace="monitoring",mode="dynamic",pod_name="owdev-redis-bc89c877-tblns"} 19737.4
kepler_container_core_joules_total{command="",container_name="redis-init",container_namespace="monitoring",mode="dynamic",pod_name="owdev-redis-bc89c877-tblns"} 19737.4
kepler_container_core_joules_total{command="",container_name="server-authorization",container_namespace="server",mode="dynamic",pod_name="server-authorization-fbcbcdbb7-6kqnh"} 19737.4
kepler_container_core_joules_total{command="",container_name="server-authorization-database-migration",container_namespace="server",mode="dynamic",pod_name="server-authorization-fbcbcdbb7-6kqnh"} 19737.4
kepler_container_core_joules_total{command="",container_name="server-filestore",container_namespace="server",mode="dynamic",pod_name="server-filestore-6686ffc6-b4sft"} 19737.4
kepler_container_core_joules_total{command="",container_name="server-front",container_namespace="server",mode="dynamic",pod_name="server-front-559f9b597c-h68bz"} 19737.4
kepler_container_core_joules_total{command="",container_name="server-registry",container_namespace="server",mode="dynamic",pod_name="server-registry-fb6bbcd75-lcrjd"} 19737.4
kepler_container_core_joules_total{command="",container_name="system_processes",container_namespace="system",mode="dynamic",pod_name="system_processes"} 19737.4
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="dynamic",pod_name="wskowdev-invoker-00-1-prewarm-nodejs10"} 19737.4
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="dynamic",pod_name="wskowdev-invoker-00-14-guest-matmul"} 19737.4
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="dynamic",pod_name="wskowdev-invoker-00-15-guest-matmul"} 19737.4
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="dynamic",pod_name="wskowdev-invoker-00-2-prewarm-nodejs10"} 19737.4
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="dynamic",pod_name="wskowdev2-invoker-00-1-prewarm-nodejs10"} 19737.4
kepler_container_core_joules_total{command="",container_name="user-action",container_namespace="monitoring",mode="dynamic",pod_name="wskowdev2-invoker-00-2-prewarm-nodejs10"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait",container_namespace="server",mode="dynamic",pod_name="server-registry-cert-setup-sx25m"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait-for-controller",container_namespace="monitoring",mode="dynamic",pod_name="owdev-alarmprovider-7b6dbf84d9-5x5ls"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait-for-controller",container_namespace="monitoring",mode="dynamic",pod_name="owdev-invoker-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait-for-controller",container_namespace="monitoring",mode="dynamic",pod_name="owdev-kafkaprovider-69977b75cc-hf27p"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait-for-controller",container_namespace="monitoring",mode="dynamic",pod_name="owdev-nginx-857fb7dc66-jtngj"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait-for-couchdb",container_namespace="monitoring",mode="dynamic",pod_name="owdev-controller-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait-for-healthy-invoker",container_namespace="monitoring",mode="dynamic",pod_name="owdev-install-packages-6fdwf"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait-for-healthy-invoker",container_namespace="monitoring",mode="dynamic",pod_name="owdev-install-packages-gt7gl"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait-for-healthy-invoker",container_namespace="monitoring",mode="dynamic",pod_name="owdev-install-packages-rjksd"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait-for-healthy-invoker",container_namespace="monitoring",mode="dynamic",pod_name="owdev-install-packages-vr42v"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait-for-kafka",container_namespace="monitoring",mode="dynamic",pod_name="owdev-controller-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="wait-for-zookeeper",container_namespace="monitoring",mode="dynamic",pod_name="owdev-kafka-0"} 19737.4
kepler_container_core_joules_total{command="",container_name="wskadmin",container_namespace="monitoring",mode="dynamic",pod_name="owdev-wskadmin"} 19737.4
kepler_container_core_joules_total{command="",container_name="scaphandre",container_namespace="default",mode="dynamic",pod_name="scaphandre-slbq8"} 23204.033
rootfs commented 1 year ago

@andersonandrei that's a good sign. The Kepler metrics are created. Can you see them in Prometheus or Grafana?

andersonandrei commented 1 year ago

@rootfs, yes, I can see those metrics in Prometheus and Grafana. But shouldn't I worry about the errors with eBPF? They make me wonder whether the estimations could be affected.

I0301 14:42:12.708526       1 power.go:64] Not able to obtain power, use estimate method
I0301 14:42:12.711548       1 slice_handler.go:179] Not able to find any valid .scope file in /sys/fs/cgroup/cpu/system.slice, this likely cause all cgroup metrics to be 0
I0301 14:42:13.016250       1 exporter.go:168] Initializing the GPU collector
perf_event_open: No such file or directory
I0301 14:42:15.542780       1 bcc_attacher.go:98] failed to attach perf event cpu_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542866       1 bcc_attacher.go:98] failed to attach perf event cpu_ref_cycles_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542922       1 bcc_attacher.go:98] failed to attach perf event cpu_instr_hc_reader: failed to open bpf perf event: no such file or directory
perf_event_open: No such file or directory
I0301 14:42:15.542990       1 bcc_attacher.go:98] failed to attach perf event cache_miss_hc_reader: failed to open bpf perf event: no such file or directory

One of the lines above, for example, says: Not able to find any valid .scope file in /sys/fs/cgroup/cpu/system.slice, this likely cause all cgroup metrics to be 0

rootfs commented 1 year ago

@andersonandrei the errors are benign. Kepler uses different models based on the availability of hardware counters, RAPL, etc. In bare-metal environments where these counters and RAPL are accessible, Kepler uses a ratio-based model. On VMs, where there are no counters or RAPL, Kepler uses regression-based models.
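
As a rough way to see that distinction on a given node, one can check whether RAPL is exposed at all; on most cloud VMs both paths below are simply absent:

    # powercap/RAPL sysfs interface (present on supported bare-metal CPUs)
    ls /sys/class/powercap/intel-rapl* 2>/dev/null || echo "no RAPL via powercap"
    # MSR device nodes, the path Kepler's rapl_msr_util.go tries to open
    ls /dev/cpu/*/msr 2>/dev/null || echo "no MSR device nodes"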

andersonandrei commented 1 year ago

@rootfs, thanks for the details.

About the problem of not seeing the metrics of pods deployed after Kepler's deployment, do you have any idea, please? I'm running a serverless platform on top of AKS, so for each function executed on the platform, a new pod is created. However, Kepler does not export metrics for those pods. For instance, consider the following actions: 1) deploy Kepler, 2) create a new pod, and 3) delete Kepler and deploy it again. If I do 1) and 2), Kepler does not export that pod's metrics, so I do 1), 2), and 3), and still the problem persists for new pods created after 3).
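
One way to check, between steps 2) and 3), whether Kepler has picked up a new pod at all is to list the pod_name labels it currently exports; a sketch reusing the exec pattern from earlier in this thread (namespace as in the manifest above):

    # List the distinct pod_name labels present at the metrics endpoint
    kubectl exec -ti -n monitoring daemonset/kepler-exporter -- \
      bash -c "curl -s localhost:9102/metrics | grep -o 'pod_name=\"[^\"]*\"' | sort -u"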

rootfs commented 1 year ago

Do you see the pod metrics when directly querying the Kepler metrics endpoint (i.e., using curl in the Kepler pod)?

Kepler does delete inactive pods from time to time, though. In that case, the long-range storage of Prometheus should be the final source for Kepler metrics.
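
As a sketch of checking the Prometheus side, assuming Prometheus is reachable at $PROM (a placeholder URL) and using the counter shown earlier in this thread:

    # Joules attributed per container over the last 5 minutes, via the Prometheus HTTP API
    curl -sG "$PROM/api/v1/query" \
      --data-urlencode 'query=sum by (pod_name, container_name) (increase(kepler_container_core_joules_total[5m]))'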

andersonandrei commented 1 year ago

I'm using the Kepler interface through Prometheus, but I also try the queries against the Kepler metrics endpoint with curl from time to time.

In both cases, I can only see new pods' metrics after steps 1), 2), and 3).

rootfs commented 1 year ago

can you share your pod yaml?

andersonandrei commented 1 year ago

One pod example, wskowdev-invoker-00-43-guest-linpack, is:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2023-03-06T13:48:14Z"
  labels:
    invoker: invoker0
    name: wskowdev-invoker-00-43-guest-linpack
    openwhisk/action: linpack
    openwhisk/namespace: guest
    release: owdev
    user-action-pod: "true"
  name: wskowdev-invoker-00-43-guest-linpack
  namespace: monitoring
  resourceVersion: "44115468"
  uid: 63c7e294-c6e2-49fe-a4db-9d403fdce033
spec:
  containers:
  - env:
    - name: __OW_API_HOST
      value: https://ourserver.io:31001
    - name: __OW_ALLOW_CONCURRENT
      value: "false"
    image: andersonandrei/python3action:linpack
    imagePullPolicy: IfNotPresent
    name: user-action
    ports:
    - containerPort: 8080
      name: action
      protocol: TCP
    resources:
      limits:
        memory: 256Mi
      requests:
        memory: 256Mi
    securityContext:
      capabilities:
        drop:
        - NET_RAW
        - NET_ADMIN
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-cvxfp
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: aks-intra-99364876-vmss000000
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: kube-api-access-cvxfp
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-03-06T13:48:14Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-03-06T13:48:15Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-03-06T13:48:15Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-03-06T13:48:14Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://fdf5ec08f8821455a548ba5bb6a025a7aaca26fb8806d493bdca060e45557218
    image: docker.io/andersonandrei/python3action:linpack
    imageID: docker.io/andersonandrei/python3action@sha256:c1292175aa3129f1fa8cec1e39017c8a00a3244cbf3900bb79b4a794a27bbe7e
    lastState: {}
    name: user-action
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2023-03-06T13:48:15Z"
  hostIP: 10.224.0.4
  phase: Running
  podIP: 10.244.0.100
  podIPs:
  - ip: 10.244.0.100
  qosClass: Burstable
  startTime: "2023-03-06T13:48:14Z"

And the Kepler pod is:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2023-03-03T14:34:00Z"
  generateName: kepler-exporter-
  labels:
    app: kepler-exporter-service
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kepler-exporter
    controller-revision-hash: 579564b6b8
    pod-template-generation: "1"
    sustainable-computing.io/app: kepler
  name: kepler-exporter-jq9nb
  namespace: monitoring
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: kepler-exporter
    uid: 4624edc0-57bf-4824-b0f5-20a1a6584a1f
  resourceVersion: "43091573"
  uid: 635939c6-3b0e-4902-b339-67907ff2f88d
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - aks-intra-99364876-vmss000001
  containers:
  - args:
    - /usr/bin/kepler -v=1
    command:
    - /bin/sh
    - -c
    env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    image: quay.io/sustainable_computing_io/kepler:release-0.4
    imagePullPolicy: Always
    livenessProbe:
      failureThreshold: 5
      httpGet:
        path: /healthz
        port: 9102
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 60
      successThreshold: 1
      timeoutSeconds: 10
    name: kepler-exporter
    ports:
    - containerPort: 9102
      name: http
      protocol: TCP
    resources:
      requests:
        cpu: 100m
        memory: 400Mi
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /lib/modules
      name: lib-modules
    - mountPath: /sys
      name: tracing
    - mountPath: /proc
      name: proc
    - mountPath: /etc/config
      name: cfm
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-kks8t
      readOnly: true
  dnsPolicy: ClusterFirstWithHostNet
  enableServiceLinks: true
  nodeName: aks-intra-99364876-vmss000001
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: kepler-sa
  serviceAccountName: kepler-sa
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  volumes:
  - hostPath:
      path: /lib/modules
      type: Directory
    name: lib-modules
  - hostPath:
      path: /sys
      type: Directory
    name: tracing
  - hostPath:
      path: /proc
      type: Directory
    name: proc
  - configMap:
      defaultMode: 420
      name: kepler-cfm
    name: cfm
  - name: kube-api-access-kks8t
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-03-03T14:34:00Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-03-03T14:34:02Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-03-03T14:34:02Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-03-03T14:34:00Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://fa495a86a17d5ca721517533fbbb5659d718f942d61ffbbd383a91befbf3a6de
    image: quay.io/sustainable_computing_io/kepler:release-0.4
    imageID: quay.io/sustainable_computing_io/kepler@sha256:67c34e1ade5f17cc444aa134f7d95b424077af6bc7c05d2ff82d536a3e0a6174
    lastState: {}
    name: kepler-exporter
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2023-03-03T14:34:02Z"
  hostIP: 10.224.0.5
  phase: Running
  podIP: 10.244.1.89
  podIPs:
  - ip: 10.244.1.89
  qosClass: Burstable
  startTime: "2023-03-03T14:34:00Z"

Thanks!

andersonandrei commented 1 year ago

Hello,

Do you have any updates on this issue, please? Or should I open a new one to discuss the last messages above?

In addition to the message above, I also tried different versions of Kepler by changing the tag in image: quay.io/sustainable_computing_io/kepler:release-0.4, but that did not work either. I tried latest, release-0.4, and v0.3.

Thanks!

rootfs commented 1 year ago

@andersonandrei yes, please open a new issue to track this. Thanks