sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption from these stats, and exports the results as Prometheus metrics.
https://sustainable-computing.io
Apache License 2.0

Kepler triggers eviction manager to reclaim ephemeral-storage #1799

Closed · JonJon-Esto closed this 1 month ago

JonJon-Esto commented 1 month ago

What happened?

At some point during 24 hours of running Kepler, imagefs.available dropped below the 15% eviction threshold in a matter of 8-9 minutes.

```console
---------------------------------------------------------------------
1/1 | team
---------------------------------------------------------------------
-- Current State ----------------------------------------------------
---------------------------------------------------------------------
1. memory.available:   4902.88Mi
2. nodefs.available:   8.00%  [16.80/193.64Gi]
3. nodefs.inodesFree:  98.00% [25428939/25804800]
4. imagefs.available:  8.00%  [16.80/193.64Gi]
5. imagefs.inodesFree: 98.00% [25428939/25804800]
6. pid.available:      62168
---------------------------------------------------------------------
-- Eviction Thresholds ----------------------------------------------
---------------------------------------------------------------------
1. memory.available:   100Mi
2. nodefs.available:   10%
3. nodefs.inodesFree:  5%
4. imagefs.available:  15%
5. imagefs.inodesFree: 5%
6. pid.available:      null
---------------------------------------------------------------------
```
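For context, the thresholds above match the kubelet's default hard-eviction settings (imagefs.available<15% among them), so no custom eviction config appears to be in play. A sketch for confirming that on the node, assuming the standard kubeadm config path:

```console
# kubeadm writes the kubelet config to /var/lib/kubelet/config.yaml; if no
# evictionHard block is set there, the kubelet falls back to its defaults.
$ sudo grep -A8 -i evictionhard /var/lib/kubelet/config.yaml
```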

Ephemeral storage was consumed very quickly, with little log information explaining why. Compare the eviction signals from 05:03:07 today:

```console
==================================================================================
05:03:07 | INFO | Eviction signals for all nodes
==================================================================================
....
....
/var/lib utilization:
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       194G   30G  164G  16% /
---------------------------------------------------------------------
```

to 05:14:02:

```console
==================================================================================
05:14:02 | INFO | Eviction signals for all nodes
==================================================================================
....
....
/var/lib utilization:
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       194G  178G   16G  92% /
---------------------------------------------------------------------
```

Basically, disk utilization on /dev/vda1 went from 16% to 92% in roughly 9 minutes.

This started after installing Kepler on top of my Kubernetes setup.
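Not part of the original report, but a minimal sketch for catching the consumer in the act during such a fast ramp, assuming everything lives on the single / filesystem shown above:

```console
# Run as root: show overall usage, then the ten largest subtrees under /var
# (container images, logs, and ephemeral volumes usually live there);
# -x keeps du on this one filesystem.
$ watch -n 30 'df -h /; du -xh --max-depth=2 /var 2>/dev/null | sort -rh | head'
```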

Setup description:

```console
$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                   READY   STATUS    RESTARTS         AGE
kepler        kepler-exporter-lkzdj                  1/1     Running   0                30h
kube-system   coredns-7c65d6cfc9-8bwcp               1/1     Running   0                6d19h
kube-system   coredns-7c65d6cfc9-clbhr               1/1     Running   0                6d19h
kube-system   etcd-team                              1/1     Running   0                6d19h
kube-system   kube-apiserver-team                    1/1     Running   0                6d19h
kube-system   kube-controller-manager-team           1/1     Running   13 (3h53m ago)   6d19h
kube-system   kube-proxy-vkthn                       1/1     Running   0                6d19h
kube-system   kube-scheduler-team                    1/1     Running   13 (3h53m ago)   6d19h
monitoring    alertmanager-main-0                    2/2     Running   0                40m
monitoring    alertmanager-main-1                    2/2     Running   0                40m
monitoring    alertmanager-main-2                    2/2     Running   0                40m
monitoring    blackbox-exporter-86745676c9-pdx9j     3/3     Running   0                40m
monitoring    grafana-785fb96d65-mcq5h               1/1     Running   0                40m
monitoring    kube-state-metrics-6f46974967-l8685    3/3     Running   0                40m
monitoring    node-exporter-wlx6h                    2/2     Running   0                40m
monitoring    prometheus-adapter-784f566c54-k4r5w    1/1     Running   0                40m
monitoring    prometheus-adapter-784f566c54-xgmqk    1/1     Running   0                40m
monitoring    prometheus-k8s-0                       2/2     Running   0                40m
monitoring    prometheus-k8s-1                       2/2     Running   0                40m
monitoring    prometheus-operator-57b579d5b9-t968b   2/2     Running   0                40m
```

OS version:

```console
$ sudo cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
```

Versions:

```console
$ kubectl version
Client Version: v1.31.1
Kustomize Version: v5.4.2
Server Version: v1.31.0
```

```console
$ dpkg -l | grep kube
ii  kubeadm         1.31.1-1.1  amd64  Command-line utility for administering a Kubernetes cluster
ii  kubectl         1.31.1-1.1  amd64  Command-line utility for interacting with a Kubernetes cluster
ii  kubelet         1.31.1-1.1  amd64  Node agent for Kubernetes clusters
ii  kubernetes-cni  1.5.1-1.1   amd64  Binaries required to provision kubernetes container networking
$ dpkg -l | grep containerd
ii  containerd     1.7.12-0ubuntu2~22.04.1  amd64  daemon to control runC
rc  containerd.io  1.7.22-1                 amd64  An open and reliable container runtime
```

```console
$ cat CHANGELOG.md
in kepler 0.7 release
```

```console
$ go version
go version go1.23.1 linux/amd64
```

What did you expect to happen?

Before installing Kepler, my Kubernetes system was very stable, with no pod restarts.

After installing Kepler, at some point disk utilization climbs very quickly, the eviction manager is triggered, and the kube-controller-manager and kube-scheduler pods get restarted.

How can we reproduce it (as minimally and precisely as possible)?

Do a default Kubernetes install, then install Kepler monitoring. Observe for 24 hours and watch for the eviction manager being triggered to reclaim the ephemeral-storage resource.
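To see the trigger as it happens during this reproduction, one option (a sketch, assuming the kubelet runs under systemd, as on this Ubuntu node) is:

```console
# Follow the kubelet's eviction-manager activity live:
$ journalctl -u kubelet -f | grep -i eviction
# List pods that were actually evicted:
$ kubectl get events -A --field-selector reason=Evicted
```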

Anything else we need to know?

No response

Kepler image tag

```console
$ cat CHANGELOG.md
in kepler 0.7 release
- switch to libbpf as default ebpf provider
- base image update decouple GPU driver from kepler image itself
- use kprobe instead of tracepoint for ebpf to obtain context switch information
- add task clock event to ebpf and use it to calculate cpu usage for each process. The event is also exported to prometheus
- add initial NVIDIA DCGM support, this help monitor power consumption by NVIDIA GPU especially MIG instances.
- add new curvefit regressors to predict component power consumption
- add workload pipeline to build container base images on demand
- add ARM64 container image and RPM build support
```

Kubernetes version

```console
$ kubectl version
Client Version: v1.31.1
Kustomize Version: v5.4.2
Server Version: v1.31.0
```

Cloud provider or bare metal

Virtual servers in the cloud, managed in our DevOps space.

OS version

```console
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ uname -a
Linux team 5.15.0-122-generic #132-Ubuntu SMP Thu Aug 29 13:45:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
```

Install tools

**Kubernetes installation:** https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/#kubectl-install-0

**Prometheus operator installation:** https://sustainable-computing.io/installation/kepler/#deploy-the-prometheus-operator

**Kepler installation:**

```console
git clone https://github.com/sustainable-computing-io/kepler.git
cd ./kepler
make build-manifest OPTS="PROMETHEUS_DEPLOY"
kubectl apply -f _output/generated-manifest/deployment.yaml
```

**Ensure that all monitoring pods are scheduled and in a running state:**

```console
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
```

**Forward ports to finalize monitoring:**

```console
kubectl port-forward --address localhost -n kepler service/kepler-exporter 9102:9102 &
kubectl port-forward --address localhost -n monitoring service/prometheus-k8s 9090:9090 &
kubectl port-forward --address localhost -n monitoring service/grafana 3000:3000 &
```

Kepler deployment config

On Kubernetes:

```console
$ KEPLER_NAMESPACE=kepler

# provide kepler configmap
$ kubectl get configmap kepler-cfm -n ${KEPLER_NAMESPACE}
NAME         DATA   AGE
kepler-cfm   16     5d23h

# provide kepler deployment description
$ kubectl describe deployment kepler-exporter -n ${KEPLER_NAMESPACE}
Error from server (NotFound): deployments.apps "kepler-exporter" not found
```

Standalone:

```console
$ kubectl describe pods kepler-exporter-lkzdj -n kepler
Name:             kepler-exporter-lkzdj
Namespace:        kepler
Priority:         0
Service Account:  kepler-sa
Node:             team/10.64.22.3
Start Time:       Wed, 25 Sep 2024 02:28:57 +0200
Labels:           app.kubernetes.io/component=exporter
                  app.kubernetes.io/name=kepler-exporter
                  controller-revision-hash=64d94db7d5
                  pod-template-generation=1
                  sustainable-computing.io/app=kepler
Annotations:      <none>
Status:           Running
IP:               10.88.0.46
IPs:
  IP:  10.88.0.46
  IP:  2001:4860:4860::2e
Controlled By:  DaemonSet/kepler-exporter
Containers:
  kepler-exporter:
    Container ID:  containerd://1a15254d40feee6a490c08679b3f0c732166d25bc14104f93c4c8a64aea970f1
    Image:         quay.io/sustainable_computing_io/kepler:latest
    Image ID:      quay.io/sustainable_computing_io/kepler@sha256:0f6f20c7123afc984e481c2197845ab89750782990a86c0898fcd69743b998a8
    Port:          9102/TCP
    Host Port:     0/TCP
    Command:
      /bin/sh
      -c
    Args:
      /usr/bin/kepler -v=1 -redfish-cred-file-path=/etc/redfish/redfish.csv
    State:          Running
      Started:      Wed, 25 Sep 2024 02:29:07 +0200
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     100m
      memory:  400Mi
    Liveness:  http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
    Environment:
      NODE_IP:     (v1:status.hostIP)
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /etc/kepler/kepler.config from cfm (ro)
      /etc/redfish from redfish (ro)
      /lib/modules from lib-modules (ro)
      /proc from proc (rw)
      /sys from tracing (ro)
      /var/run from var-run (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9pd4m (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       True
  ContainersReady             True
  PodScheduled                True
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  Directory
  tracing:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:  Directory
  var-run:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run
    HostPathType:  Directory
  cfm:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kepler-cfm
    Optional:  false
  redfish:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  redfish-4kh9d7bc7m
    Optional:    false
  kube-api-access-9pd4m:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                 node-role.kubernetes.io/master:NoSchedule op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:          <none>
```

Container runtime (CRI) and version (if applicable)

```console
ii  cri-tools  1.31.1-1.1  amd64  Command-line utility for interacting with a container runtime
```

Related plugins (CNI, CSI, ...) and versions (if applicable)

```console
ii  kubernetes-cni  1.5.1-1.1  amd64  Binaries required to provision kubernetes container networking
```

JonJon-Esto commented 1 month ago

I have done an isolation test: I removed the Kepler app while retaining the Prometheus monitoring, and observed that the eviction issue continues to happen.

```console
Sep 28 05:37:59 team kubelet[2865036]: I0928 05:37:59.659033 2865036 eviction_manager.go:369] "Eviction manager: attempting to reclaim" resourceName="ephemeral-storage"
```

The setup now looks like:

```console
Every 2.0s: kubectl get pods --all-namespaces            team: Sat Sep 28 08:57:39 2024

NAMESPACE     NAME                                   READY   STATUS                   RESTARTS      AGE
kube-system   coredns-7c65d6cfc9-jw2ck               1/1     Running                  0             22h
kube-system   coredns-7c65d6cfc9-krzq2               1/1     Running                  0             22h
kube-system   etcd-team                              1/1     Running                  8             22h
kube-system   kube-apiserver-team                    1/1     Running                  7 (8h ago)    22h
kube-system   kube-controller-manager-team           1/1     Running                  24 (8h ago)   22h
kube-system   kube-proxy-cwtfx                       1/1     Running                  0             22h
kube-system   kube-scheduler-team                    1/1     Running                  24 (8h ago)   22h
monitoring    alertmanager-main-0                    2/2     Running                  0             18h
monitoring    alertmanager-main-1                    2/2     Running                  0             18h
monitoring    alertmanager-main-2                    2/2     Running                  0             18h
monitoring    blackbox-exporter-86745676c9-25xrm     3/3     Running                  0             18h
monitoring    grafana-785fb96d65-8qh8s               0/1     ContainerStatusUnknown   1             18h
monitoring    grafana-785fb96d65-dcbcg               1/1     Running                  0             3h17m
monitoring    kube-state-metrics-6f46974967-c7fkk    3/3     Running                  0             18h
monitoring    node-exporter-z8f8n                    2/2     Running                  0             18h
monitoring    prometheus-adapter-784f566c54-9688f    1/1     Running                  0             18h
monitoring    prometheus-adapter-784f566c54-x4h52    1/1     Running                  0             18h
monitoring    prometheus-k8s-0                       2/2     Running                  0             3h18m
monitoring    prometheus-k8s-1                       2/2     Running                  0             3h17m
monitoring    prometheus-operator-57b579d5b9-gwmk9   2/2     Running                  0             18h
```

So it seems that Prometheus, rather than Kepler, is causing the eviction issue.
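One way to test that suspicion would be to measure how much space the Prometheus TSDB occupies inside the pod. A sketch, assuming the prometheus-operator default data path of /prometheus:

```console
# On-disk size of the TSDB inside the Prometheus server pod
$ kubectl exec -n monitoring prometheus-k8s-0 -c prometheus -- du -sh /prometheus
```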

Can anyone help dig further into why this is happening? Thanks!

JonJon-Esto commented 1 month ago

Hi,

Please close this ticket. I have isolated Kepler and Prometheus by removing both and leaving the Kubernetes base system on its own; the eviction manager was still triggered. Something else is taking up a huge amount of disk space somewhere in the system.

Thank you!
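For anyone landing here with the same symptom, a generic sketch (not from this thread) for locating the actual disk consumer; crictl should be available since cri-tools is installed on this setup:

```console
# Image filesystem usage as seen by the container runtime:
$ sudo crictl imagefsinfo
# Size of pod logs and the containerd state directory (usual Ubuntu defaults;
# paths may differ on other setups):
$ sudo du -sh /var/log/pods /var/lib/containerd 2>/dev/null
```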

sthaha commented 1 month ago

Closing the issue based on https://github.com/sustainable-computing-io/kepler/issues/1799#issuecomment-2382156762