I ran an isolation test by removing the Kepler app while retaining the Prometheus monitoring stack, and observed that the eviction issue continues to happen:
Sep 28 05:37:59 team kubelet[2865036]: I0928 05:37:59.659033 2865036 eviction_manager.go:369] "Eviction manager: attempting to reclaim" resourceName="ephemeral-storage"
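For reference, this is roughly how the eviction activity can be confirmed (a sketch, assuming journalctl access on the node and that evicted pods surface as cluster events):

# kubelet-side view of eviction manager activity on the node
sudo journalctl -u kubelet --since "1 hour ago" --no-pager | grep -i "eviction manager"

# cluster-side view: which pods were actually evicted
kubectl get events --all-namespaces --field-selector reason=Evicted --sort-by=.lastTimestamp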
The setup now looks like:
Every 2.0s: kubectl get pods --all-namespaces                          team: Sat Sep 28 08:57:39 2024

NAMESPACE     NAME                                   READY   STATUS                   RESTARTS      AGE
kube-system   coredns-7c65d6cfc9-jw2ck               1/1     Running                  0             22h
kube-system   coredns-7c65d6cfc9-krzq2               1/1     Running                  0             22h
kube-system   etcd-team                              1/1     Running                  8             22h
kube-system   kube-apiserver-team                    1/1     Running                  7 (8h ago)    22h
kube-system   kube-controller-manager-team           1/1     Running                  24 (8h ago)   22h
kube-system   kube-proxy-cwtfx                       1/1     Running                  0             22h
kube-system   kube-scheduler-team                    1/1     Running                  24 (8h ago)   22h
monitoring    alertmanager-main-0                    2/2     Running                  0             18h
monitoring    alertmanager-main-1                    2/2     Running                  0             18h
monitoring    alertmanager-main-2                    2/2     Running                  0             18h
monitoring    blackbox-exporter-86745676c9-25xrm     3/3     Running                  0             18h
monitoring    grafana-785fb96d65-8qh8s               0/1     ContainerStatusUnknown   1             18h
monitoring    grafana-785fb96d65-dcbcg               1/1     Running                  0             3h17m
monitoring    kube-state-metrics-6f46974967-c7fkk    3/3     Running                  0             18h
monitoring    node-exporter-z8f8n                    2/2     Running                  0             18h
monitoring    prometheus-adapter-784f566c54-9688f    1/1     Running                  0             18h
monitoring    prometheus-adapter-784f566c54-x4h52    1/1     Running                  0             18h
monitoring    prometheus-k8s-0                       2/2     Running                  0             3h18m
monitoring    prometheus-k8s-1                       2/2     Running                  0             3h17m
monitoring    prometheus-operator-57b579d5b9-gwmk9   2/2     Running                  0             18h
So it seems that Prometheus is causing the eviction issue.
Can anyone help dig further into why this is happening? Thanks!
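One way to dig further is to find out which directory is actually growing; a sketch, assuming a default kubeadm/containerd layout and that crictl is configured for containerd:

# largest consumers under the usual suspects: containerd images/snapshots,
# kubelet pod volumes and ephemeral storage, etcd, container logs
sudo du -xh --max-depth=2 /var/lib/containerd /var/lib/kubelet /var/lib/etcd /var/log/pods 2>/dev/null | sort -h | tail -n 25

# what the container runtime itself reports for the image filesystem
sudo crictl imagefsinfo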
Hi,
Please close this ticket. I managed to isolate Kepler and Prometheus by removing them both and leaving the Kubernetes base system on its own, and the eviction manager was still triggered. Something else is consuming a huge amount of disk space somewhere in the system.
Thank you!
Closing the issue based on https://github.com/sustainable-computing-io/kepler/issues/1799#issuecomment-2382156762
What happened?
At some point during the 24 hours Kepler was running, imagefs.available dropped below the 15% threshold within a span of 8-9 minutes.
---------------------------------------------------------------------
1/1 | team
---------------------------------------------------------------------
-- Current State ----------------------------------------------------
---------------------------------------------------------------------
1. memory.available:    4902.88Mi
2. nodefs.available:    8.00% [16.80/193.64Gi]
3. nodefs.inodesFree:   98.00% [25428939/25804800]
4. imagefs.available:   8.00% [16.80/193.64Gi]
5. imagefs.inodesFree:  98.00% [25428939/25804800]
6. pid.available:       62168
---------------------------------------------------------------------
-- Eviction Thresholds ----------------------------------------------
---------------------------------------------------------------------
1. memory.available:    100Mi
2. nodefs.available:    10%
3. nodefs.inodesFree:   5%
4. imagefs.available:   15%
5. imagefs.inodesFree:  5%
6. pid.available:       null
---------------------------------------------------------------------
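For context, those thresholds match the kubelet's built-in hard-eviction defaults (imagefs.available<15%, nodefs.available<10%, nodefs.inodesFree<5%, memory.available<100Mi), so they apply even when nothing is configured explicitly. A quick way to check whether the node overrides them (paths assume a kubeadm install, node name "team" as above):

# kubeadm writes the kubelet config here; no evictionHard section means the
# built-in defaults above are in effect
sudo grep -A8 evictionHard /var/lib/kubelet/config.yaml

# the kubelet's effective runtime configuration, via the API server proxy
kubectl get --raw "/api/v1/nodes/team/proxy/configz"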
The ephemeral storage seems to have filled up very fast, without enough logging to explain why. We can see the change from 05:03:07 today:
==================================================================================
05:03:07 | INFO | Eviction signals for all nodes
==================================================================================
....
....
/var/lib utilization:
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       194G   30G  164G  16% /
---------------------------------------------------------------------
to 05:14:02:
==================================================================================
05:14:02 | INFO | Eviction signals for all nodes
==================================================================================
....
....
/var/lib utilization:
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       194G  178G   16G  92% /
---------------------------------------------------------------------
Basically, disk utilization went up from 16% to 92% in just ~9 minutes.
This happened after installing Kepler on top of my Kubernetes setup.
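A jump that fast is easier to attribute if df and the largest directories are snapshotted continuously while it happens; a minimal watcher sketch (the log path is arbitrary, and it assumes the root filesystem backs both nodefs and imagefs, as in the df output above):

# append a filesystem + directory snapshot to a log every 60 seconds
while true; do
  {
    date
    df -h /
    sudo du -xh --max-depth=1 /var/lib /var/log 2>/dev/null | sort -h | tail -n 10
    echo "----"
  } >> /tmp/disk-usage-watch.log
  sleep 60
done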
Setup description:

09:01:09-estooald@team:-:~:kubectl get pods --all-namespaces
NAMESPACE     NAME                                   READY   STATUS    RESTARTS         AGE
kepler        kepler-exporter-lkzdj                  1/1     Running   0                30h
kube-system   coredns-7c65d6cfc9-8bwcp               1/1     Running   0                6d19h
kube-system   coredns-7c65d6cfc9-clbhr               1/1     Running   0                6d19h
kube-system   etcd-team                              1/1     Running   0                6d19h
kube-system   kube-apiserver-team                    1/1     Running   0                6d19h
kube-system   kube-controller-manager-team           1/1     Running   13 (3h53m ago)   6d19h
kube-system   kube-proxy-vkthn                       1/1     Running   0                6d19h
kube-system   kube-scheduler-team                    1/1     Running   13 (3h53m ago)   6d19h
monitoring    alertmanager-main-0                    2/2     Running   0                40m
monitoring    alertmanager-main-1                    2/2     Running   0                40m
monitoring    alertmanager-main-2                    2/2     Running   0                40m
monitoring    blackbox-exporter-86745676c9-pdx9j     3/3     Running   0                40m
monitoring    grafana-785fb96d65-mcq5h               1/1     Running   0                40m
monitoring    kube-state-metrics-6f46974967-l8685    3/3     Running   0                40m
monitoring    node-exporter-wlx6h                    2/2     Running   0                40m
monitoring    prometheus-adapter-784f566c54-k4r5w    1/1     Running   0                40m
monitoring    prometheus-adapter-784f566c54-xgmqk    1/1     Running   0                40m
monitoring    prometheus-k8s-0                       2/2     Running   0                40m
monitoring    prometheus-k8s-1                       2/2     Running   0                40m
monitoring    prometheus-operator-57b579d5b9-t968b   2/2     Running   0                40m
OS version:

09:01:58-estooald@team:-:~:sudo cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
Versions:

09:01:05-estooald@team:-:~:kubectl version
Client Version: v1.31.1
Kustomize Version: v5.4.2
Server Version: v1.31.0

09:02:54-estooald@team:-:~:dpkg -l | grep kube
ii  kubeadm          1.31.1-1.1   amd64   Command-line utility for administering a Kubernetes cluster
ii  kubectl          1.31.1-1.1   amd64   Command-line utility for interacting with a Kubernetes cluster
ii  kubelet          1.31.1-1.1   amd64   Node agent for Kubernetes clusters
ii  kubernetes-cni   1.5.1-1.1    amd64   Binaries required to provision kubernetes container networking

09:03:57-estooald@team:-:~:dpkg -l | grep containerd
ii  containerd      1.7.12-0ubuntu2~22.04.1   amd64   daemon to control runC
rc  containerd.io   1.7.22-1                  amd64   An open and reliable container runtime
Kepler version (from CHANGELOG.md in the ~/kepler checkout): 0.7 release
09:06:05-estooald@team:-:~/kepler:go version
go version go1.23.1 linux/amd64
What did you expect to happen?
My Kubernetes system was very stable and had no pod restarts.
After installing Kepler, at some point the disk utilization climbs very fast, the eviction manager gets triggered, and the Kubernetes controller-manager and scheduler get restarted.
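To see which pod (if any) is writing heavily to ephemeral storage before an eviction kicks in, the kubelet stats summary can be queried through the API server; a sketch assuming jq is available and using the node name "team" from the setup above:

# per-pod ephemeral-storage usage in bytes, largest first
kubectl get --raw "/api/v1/nodes/team/proxy/stats/summary" \
  | jq -r '.pods[] | [(.["ephemeral-storage"].usedBytes // 0), .podRef.namespace + "/" + .podRef.name] | @tsv' \
  | sort -nr | head -n 15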
How can we reproduce it (as minimally and precisely as possible)?
Do a default install of a Kubernetes system, then install Kepler monitoring. Observe for 24 hours and look for the eviction manager being triggered to reclaim the ephemeral-storage resource.
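While observing, the node's DiskPressure condition and the kubelet log can be watched together; a sketch (node name "team" as above):

# DiskPressure should flip to True once the nodefs/imagefs thresholds are crossed
kubectl get node team -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}'

# follow the kubelet log for eviction manager activity while reproducing
sudo journalctl -fu kubelet | grep -i eviction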
Anything else we need to know?
No response