koordinator-sh / koordinator

A QoS-based scheduling system brings optimal layout and status to workloads such as microservices, web services, big data jobs, AI jobs, etc.
https://koordinator.sh
Apache License 2.0

[BUG] Koordlet has a memory leak issue, leading to the pod being killed due to OOM. #1981

Closed b43646 closed 6 months ago

b43646 commented 7 months ago

What happened:

After running for a while, the koordlet pod was killed due to OOM. Even without any new containers being scheduled, the memory consumption of the koordlet pod kept increasing, which suggests a possible memory leak in koordlet.
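As an aside, one quick way to confirm from the API that a restart was an OOM kill (besides the kernel log shown later) is to check the container's last termination state; the pod name below is the koordlet instance that later restarted on k2 in this report:

# Print the restart count and the reason for the koordlet container's last termination.
kubectl -n koordinator-system get pod koordlet-nccv8 \
  -o jsonpath='{.status.containerStatuses[0].restartCount} {.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
# An OOM-killed container reports "OOMKilled" as the terminated reason.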

What you expected to happen:

koordlet should run stably, without a continuous increase in memory consumption.

How to reproduce it (as minimally and precisely as possible):

Prerequisites:
- Koordinator has been deployed.

1. Deploy the workload.
[root@k1 test]# cat pod-group.yaml
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: gang-example
  namespace: default
spec:
  scheduleTimeoutSeconds: 100
  minMember: 2

# Generate test yaml
[root@k1 test]# cat setup.sh
#!/bin/bash

# Create the output directory for the generated manifests
mkdir -p ./demo

for i in {1..200}
do
    new_name="test-$i"
template='apiVersion: v1
kind: Pod
metadata:
  name: pod-example1
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: gang-example
spec:
  schedulerName: koord-scheduler
  containers:
  - command:
    - sleep
    - 365d
    image: busybox
    imagePullPolicy: IfNotPresent
    name: curlimage
    resources:
      limits:
        cpu: 40m
        memory: 40Mi
      requests:
        cpu: 40m
        memory: 40Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  restartPolicy: Always'

new_template=$(echo "$template" | sed "/^ *name: /s/:.*/: $new_name/")

echo "$new_template"  >> ./demo/$new_name.yaml
done

[root@k1 test]# ./setup.sh
[root@k1 test]# kubectl apply -f ./pod-group.yaml
[root@k1 test]# kubectl apply -f ./demo/

https://koordinator.sh/zh-Hans/docs/

2. Monitor the running status of koordlet.
[root@k1 ~]# kc -n koordinator-system get pods
NAME                                READY   STATUS    RESTARTS   AGE
koord-descheduler-dc5dc679c-f95r9   1/1     Running   0          3h53m
koord-descheduler-dc5dc679c-lh27b   1/1     Running   0          3h53m
koord-manager-db6f4bdb9-94rxz       1/1     Running   0          3h53m
koord-manager-db6f4bdb9-fk6lz       1/1     Running   0          3h53m
koord-scheduler-7db78c8867-b87br    1/1     Running   0          3h53m
koord-scheduler-7db78c8867-bb8sr    1/1     Running   0          3h53m
koordlet-g4gcx                      1/1     Running   0          3h53m
koordlet-mn746                      1/1     Running   0          3h53m
koordlet-nccv8                      1/1     Running   0          3h53m

Prometheus monitoring data:

(screenshot: Prometheus graph of koordlet memory usage)
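
(The chart plots the working-set memory of the koordlet containers from cAdvisor metrics; a query along these lines reproduces it, assuming Prometheus scrapes the kubelet/cAdvisor endpoints, with the server address below being only a placeholder.)

# Working-set memory of the koordlet containers; <prometheus-host> is a placeholder.
curl -s 'http://<prometheus-host>:9090/api/v1/query' \
  --data-urlencode 'query=container_memory_working_set_bytes{namespace="koordinator-system",container="koordlet"}'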

From the graph, it can be seen that from 10:00 to 16:00 the memory consumption of the koordlet kept increasing until it was eventually killed due to OOM. The relevant log information is as follows.

[root@dev ~]# kc -n koordinator-system get pods -o wide
NAME                                READY   STATUS    RESTARTS      AGE     IP             NODE   NOMINATED NODE   READINESS GATES
koord-descheduler-dc5dc679c-f95r9   1/1     Running   0             7h22m   10.234.200.6   k1     <none>           <none>
koord-descheduler-dc5dc679c-lh27b   1/1     Running   0             7h22m   10.234.28.6    k3     <none>           <none>
koord-manager-db6f4bdb9-94rxz       1/1     Running   0             7h22m   10.234.24.5    k2     <none>           <none>
koord-manager-db6f4bdb9-fk6lz       1/1     Running   0             7h22m   10.234.28.5    k3     <none>           <none>
koord-scheduler-7db78c8867-b87br    1/1     Running   0             7h22m   10.234.28.4    k3     <none>           <none>
koord-scheduler-7db78c8867-bb8sr    1/1     Running   0             7h22m   10.234.24.4    k2     <none>           <none>
koordlet-g4gcx                      1/1     Running   0             7h22m   10.0.0.62      k1     <none>           <none>
koordlet-mn746                      1/1     Running   0             7h22m   10.0.0.106     k3     <none>           <none>
koordlet-nccv8                      1/1     Running   1 (49m ago)   7h22m   10.0.0.100     k2     <none>           <none>

[root@k2 ~]# cp /var/log/messages ./
[root@k2 ~]# grep "15:53:29" ./messages
Apr  1 15:53:29 k2 kernel: Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod055b768b_b04a_4702_9cc0_f93d211ca1ad.slice/cri-containerd-8363bd95dd74556d6cac0c0aa3cc00b8dd06f98e14f5a64b3e8727baa4474148.scope: cache:0KB rss:36KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:36KB inactive_file:0KB active_file:0KB unevictable:0KB
Apr  1 15:53:29 k2 kernel: Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod055b768b_b04a_4702_9cc0_f93d211ca1ad.slice/cri-containerd-1d66d33329f9cc8a36725192721acee4ef6b67dfe779bd0074bc6ce96453f425.scope: cache:64204KB rss:197904KB rss_huge:0KB mapped_file:3196KB swap:0KB inactive_anon:19824KB active_anon:242272KB inactive_file:0KB active_file:0KB unevictable:0KB
Apr  1 15:53:29 k2 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Apr  1 15:53:29 k2 kernel: [28475] 65535 28475      245        1       3        0          -998 pause
Apr  1 15:53:29 k2 kernel: [28877]     0 28877   536907    55782     180        0           999 koordlet
Apr  1 15:53:29 k2 kernel: Memory cgroup out of memory: Kill process 1964 (koordlet) score 1819 or sacrifice child
Apr  1 15:53:29 k2 kernel: Killed process 28877 (koordlet), UID 0, total-vm:2147628kB, anon-rss:196880kB, file-rss:23052kB, shmem-rss:3196kB
Apr  1 15:53:29 k2 containerd: time="2024-04-01T15:53:29.850019056Z" level=info msg="shim disconnected" id=1d66d33329f9cc8a36725192721acee4ef6b67dfe779bd0074bc6ce96453f425
Apr  1 15:53:29 k2 containerd: time="2024-04-01T15:53:29.850099748Z" level=warning msg="cleaning up after shim disconnected" id=1d66d33329f9cc8a36725192721acee4ef6b67dfe779bd0074bc6ce96453f425 namespace=k8s.io
Apr  1 15:53:29 k2 containerd: time="2024-04-01T15:53:29.850111360Z" level=info msg="cleaning up dead shim"
Apr  1 15:53:29 k2 containerd: time="2024-04-01T15:53:29.858460911Z" level=warning msg="cleanup warnings time=\"2024-04-01T15:53:29Z\" level=info msg=\"starting signal loop\" namespace=k8s.io pid=3172 runtime=io.containerd.runc.v2\n"
Apr  1 15:53:29 k2 kubelet: E0401 15:53:29.925682   20334 summary_sys_containers.go:48] "Failed to get system container stats" err="failed to get cgroup stats for \"/kube.slice/containerd.service\": failed to get container info for \"/kube.slice/containerd.service\": unknown container \"/kube.slice/containerd.service\"" containerName="/kube.slice/containerd.service"

Anything else we need to know?:

Environment:

[root@k1 test]# kubectl version

Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.2", GitCommit:"fc04e732bb3e7198d2fa44efa5457c7c6f8c0f5b", GitTreeState:"clean", BuildDate:"2023-02-22T13:39:03Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.4", GitCommit:"f89670c3aa4059d6999cb42e23ccb4f0b9a03979", GitTreeState:"clean", BuildDate:"2023-04-12T12:05:35Z", GoVersion:"go1.19.8", Compiler:"gc", Platform:"linux/amd64"}
helm repo add koordinator-sh https://koordinator-sh.github.io/charts/
helm repo update
helm install koordinator koordinator-sh/koordinator --version 1.4.1
saintube commented 7 months ago

It seems like an under-provisioning problem rather than a memory leak. The koordlet collects and stores a series of pod-level metrics, so its memory usage correlates linearly with the number of pods on the node. Since the koordlet starts with an empty TSDB, its RSS rises as pod metrics are collected.
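A rough way to eyeball that correlation on a node (assuming metrics-server is installed so that kubectl top works; k2 is just the node from this report):

# Memory usage of each koordlet instance
kubectl top pod -n koordinator-system -l koord-app=koordlet
# Number of pods currently running on node k2
kubectl get pods --all-namespaces --field-selector spec.nodeName=k2 --no-headers | wc -l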

b43646 commented 7 months ago

Hi Jason,

The following configuration is for reference; it was all generated by the Helm installation.

[root@k1 test]# kubectl -n koordinator-system get ds -o yaml
apiVersion: v1
items:
- apiVersion: apps/v1
  kind: DaemonSet
  metadata:
    annotations:
      deprecated.daemonset.template.generation: "1"
      meta.helm.sh/release-name: koordinator
      meta.helm.sh/release-namespace: default
    creationTimestamp: "2024-04-02T06:56:51Z"
    generation: 1
    labels:
      app.kubernetes.io/managed-by: Helm
      koord-app: koordlet
    name: koordlet
    namespace: koordinator-system
    resourceVersion: "379111"
    uid: def34102-53df-4a8e-91df-b62de603dab2
  spec:
    minReadySeconds: 10
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        koord-app: koordlet
    template:
      metadata:
        creationTimestamp: null
        labels:
          koord-app: koordlet
          runtimeproxy.koordinator.sh/skip-hookserver: "true"
      spec:
        containers:
        - args:
          - -cgroup-root-dir=/host-cgroup/
          - -feature-gates=BECPUEvict=true,BEMemoryEvict=true,CgroupReconcile=true,Accelerators=true
          - -runtime-hooks-host-endpoint=/var/run/koordlet/koordlet.sock
          - --logtostderr=true
          - --v=4
          command:
          - /koordlet
          env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: spec.nodeName
          image: registry.cn-beijing.aliyuncs.com/koordinator-sh/koordlet:v1.4.1
          imagePullPolicy: Always
          name: koordlet
          resources:
            limits:
              cpu: 500m
              memory: 256Mi
            requests:
              cpu: "0"
              memory: "0"
          securityContext:
            allowPrivilegeEscalation: true
            capabilities:
              add:
              - SYS_ADMIN
            privileged: true
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /etc/localtime
            name: host-time
            readOnly: true
          - mountPath: /host-cgroup/
            name: host-cgroup-root
          - mountPath: /host-sys-fs/
            mountPropagation: Bidirectional
            name: host-sys-fs
          - mountPath: /host-var-run/
            name: host-var-run
            readOnly: true
          - mountPath: /host-run/
            name: host-run
            readOnly: true
          - mountPath: /host-var-run-koordlet/
            mountPropagation: Bidirectional
            name: host-var-run-koordlet
          - mountPath: /prediction-checkpoints
            mountPropagation: Bidirectional
            name: host-koordlet-checkpoint-dir
          - mountPath: /host-sys/
            name: host-sys
            readOnly: true
          - mountPath: /etc/kubernetes/
            name: host-kubernetes
            readOnly: true
          - mountPath: /host-etc-hookserver/
            mountPropagation: Bidirectional
            name: host-etc-hookserver
          - mountPath: /var/lib/kubelet
            name: host-kubelet-rootdir
            readOnly: true
          - mountPath: /dev
            mountPropagation: HostToContainer
            name: host-dev
          - mountPath: /metric-data/
            name: metric-db-path
        dnsPolicy: ClusterFirst
        hostNetwork: true
        hostPID: true
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        serviceAccount: koordlet
        serviceAccountName: koordlet
        terminationGracePeriodSeconds: 10
        tolerations:
        - operator: Exists
        volumes:
        - hostPath:
            path: /etc/localtime
            type: ""
          name: host-time
        - hostPath:
            path: /sys/fs/cgroup/
            type: ""
          name: host-cgroup-root
        - hostPath:
            path: /sys/fs/
            type: ""
          name: host-sys-fs
        - hostPath:
            path: /var/run/
            type: ""
          name: host-var-run
        - hostPath:
            path: /run/
            type: ""
          name: host-run
        - hostPath:
            path: /var/run/koordlet
            type: DirectoryOrCreate
          name: host-var-run-koordlet
        - hostPath:
            path: /var/run/koordlet/prediction-checkpoints
            type: DirectoryOrCreate
          name: host-koordlet-checkpoint-dir
        - hostPath:
            path: /sys/
            type: ""
          name: host-sys
        - hostPath:
            path: /etc/kubernetes/
            type: ""
          name: host-kubernetes
        - hostPath:
            path: /etc/runtime/hookserver.d/
            type: ""
          name: host-etc-hookserver
        - hostPath:
            path: /var/lib/kubelet/
            type: ""
          name: host-kubelet-rootdir
        - hostPath:
            path: /dev
            type: ""
          name: host-dev
        - emptyDir:
            medium: Memory
            sizeLimit: 150Mi
          name: metric-db-path
    updateStrategy:
      rollingUpdate:
        maxSurge: 0
        maxUnavailable: 20%
      type: RollingUpdate
  status:
    currentNumberScheduled: 3
    desiredNumberScheduled: 3
    numberAvailable: 1
    numberMisscheduled: 0
    numberReady: 3
    numberUnavailable: 2
    observedGeneration: 1
    updatedNumberScheduled: 3
kind: List
metadata:
  resourceVersion: ""
b43646 commented 7 months ago

@saintube Even without adding new pods, the memory consumption of koordlet keeps increasing. As a result, koordlet is eventually killed due to OOM.

b43646 commented 7 months ago

Based on today's discussion in the group, this afternoon I started validating koordlet with a 512Mi memory limit and deployed 200 pods. The verification environment is the same as described in the issue above.

From the Prometheus monitoring data, it can be observed that the memory usage of koordlet keeps increasing; raising the resource limit only delays the OOM event. The relevant monitoring information is as follows:
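
For reference, a minimal sketch of one way to apply the 512Mi limit used in this round (the same change can be made through the Helm values instead; the patch assumes the koordlet container is the first container in the DaemonSet template, which matches the manifest below):

kubectl -n koordinator-system patch daemonset koordlet --type='json' \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"512Mi"}]'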

(screenshot: Prometheus graph of koordlet memory usage with the 512Mi limit)
[root@k1 test]# date
Tue Apr  2 08:43:33 GMT 2024

[root@k1 test]# kc -n koordinator-system get pods -o wide
NAME                                READY   STATUS    RESTARTS   AGE    IP              NODE   NOMINATED NODE   READINESS GATES
koord-descheduler-dc5dc679c-fs6cm   1/1     Running   0          106m   10.234.201.20   k1     <none>           <none>
koord-descheduler-dc5dc679c-whw6q   1/1     Running   0          106m   10.234.29.178   k3     <none>           <none>
koord-manager-db6f4bdb9-mlsvs       1/1     Running   0          106m   10.234.24.91    k2     <none>           <none>
koord-manager-db6f4bdb9-vtt46       1/1     Running   0          106m   10.234.29.167   k3     <none>           <none>
koord-scheduler-7db78c8867-tg8q6    1/1     Running   0          106m   10.234.29.188   k3     <none>           <none>
koord-scheduler-7db78c8867-xdmk5    1/1     Running   0          106m   10.234.24.101   k2     <none>           <none>
koordlet-hbdmp                      1/1     Running   0          99m    10.0.0.62       k1     <none>           <none>
koordlet-mjlrx                      1/1     Running   0          100m   10.0.0.106      k3     <none>           <none>
koordlet-x9ddf                      1/1     Running   0          99m    10.0.0.100      k2     <none>           <none>

[root@k1 test]# date
Tue Apr  2 13:44:01 GMT 2024

[root@k1 test]# kc -n koordinator-system get pods -o wide
NAME                                READY   STATUS    RESTARTS       AGE     IP              NODE   NOMINATED NODE   READINESS GATES
koord-descheduler-dc5dc679c-fs6cm   1/1     Running   0              6h47m   10.234.201.20   k1     <none>           <none>
koord-descheduler-dc5dc679c-whw6q   1/1     Running   0              6h47m   10.234.29.178   k3     <none>           <none>
koord-manager-db6f4bdb9-mlsvs       1/1     Running   0              6h47m   10.234.24.91    k2     <none>           <none>
koord-manager-db6f4bdb9-vtt46       1/1     Running   0              6h47m   10.234.29.167   k3     <none>           <none>
koord-scheduler-7db78c8867-tg8q6    1/1     Running   0              6h47m   10.234.29.188   k3     <none>           <none>
koord-scheduler-7db78c8867-xdmk5    1/1     Running   0              6h47m   10.234.24.101   k2     <none>           <none>
koordlet-hbdmp                      1/1     Running   1 (6m4s ago)   6h39m   10.0.0.62       k1     <none>           <none>
koordlet-mjlrx                      1/1     Running   0              6h40m   10.0.0.106      k3     <none>           <none>
koordlet-x9ddf                      1/1     Running   0              6h40m   10.0.0.100      k2     <none>           <none>

[root@k1 test]# kc -n koordinator-system get ds -o yaml
apiVersion: v1
items:
- apiVersion: apps/v1
  kind: DaemonSet
  metadata:
    annotations:
      deprecated.daemonset.template.generation: "2"
      meta.helm.sh/release-name: koordinator
      meta.helm.sh/release-namespace: default
    creationTimestamp: "2024-04-02T06:56:51Z"
    generation: 2
    labels:
      app.kubernetes.io/managed-by: Helm
      koord-app: koordlet
    name: koordlet
    namespace: koordinator-system
    resourceVersion: "477218"
    uid: def34102-53df-4a8e-91df-b62de603dab2
  spec:
    minReadySeconds: 10
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        koord-app: koordlet
    template:
      metadata:
        creationTimestamp: null
        labels:
          koord-app: koordlet
          runtimeproxy.koordinator.sh/skip-hookserver: "true"
      spec:
        containers:
        - args:
          - -cgroup-root-dir=/host-cgroup/
          - -feature-gates=BECPUEvict=true,BEMemoryEvict=true,CgroupReconcile=true,Accelerators=true
          - -runtime-hooks-host-endpoint=/var/run/koordlet/koordlet.sock
          - --logtostderr=true
          - --v=4
          command:
          - /koordlet
          env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: spec.nodeName
          image: registry.cn-beijing.aliyuncs.com/koordinator-sh/koordlet:v1.4.1
          imagePullPolicy: Always
          name: koordlet
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
            requests:
              cpu: "0"
              memory: "0"
          securityContext:
            allowPrivilegeEscalation: true
            capabilities:
              add:
              - SYS_ADMIN
            privileged: true
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /etc/localtime
            name: host-time
            readOnly: true
          - mountPath: /host-cgroup/
            name: host-cgroup-root
          - mountPath: /host-sys-fs/
            mountPropagation: Bidirectional
            name: host-sys-fs
          - mountPath: /host-var-run/
            name: host-var-run
            readOnly: true
          - mountPath: /host-run/
            name: host-run
            readOnly: true
          - mountPath: /host-var-run-koordlet/
            mountPropagation: Bidirectional
            name: host-var-run-koordlet
          - mountPath: /prediction-checkpoints
            mountPropagation: Bidirectional
            name: host-koordlet-checkpoint-dir
          - mountPath: /host-sys/
            name: host-sys
            readOnly: true
          - mountPath: /etc/kubernetes/
            name: host-kubernetes
            readOnly: true
          - mountPath: /host-etc-hookserver/
            mountPropagation: Bidirectional
            name: host-etc-hookserver
          - mountPath: /var/lib/kubelet
            name: host-kubelet-rootdir
            readOnly: true
          - mountPath: /dev
            mountPropagation: HostToContainer
            name: host-dev
          - mountPath: /metric-data/
            name: metric-db-path
        dnsPolicy: ClusterFirst
        hostNetwork: true
        hostPID: true
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        serviceAccount: koordlet
        serviceAccountName: koordlet
        terminationGracePeriodSeconds: 10
        tolerations:
        - operator: Exists
        volumes:
        - hostPath:
            path: /etc/localtime
            type: ""
          name: host-time
        - hostPath:
            path: /sys/fs/cgroup/
            type: ""
          name: host-cgroup-root
        - hostPath:
            path: /sys/fs/
            type: ""
          name: host-sys-fs
        - hostPath:
            path: /var/run/
            type: ""
          name: host-var-run
        - hostPath:
            path: /run/
            type: ""
          name: host-run
        - hostPath:
            path: /var/run/koordlet
            type: DirectoryOrCreate
          name: host-var-run-koordlet
        - hostPath:
            path: /var/run/koordlet/prediction-checkpoints
            type: DirectoryOrCreate
          name: host-koordlet-checkpoint-dir
        - hostPath:
            path: /sys/
            type: ""
          name: host-sys
        - hostPath:
            path: /etc/kubernetes/
            type: ""
          name: host-kubernetes
        - hostPath:
            path: /etc/runtime/hookserver.d/
            type: ""
          name: host-etc-hookserver
        - hostPath:
            path: /var/lib/kubelet/
            type: ""
          name: host-kubelet-rootdir
        - hostPath:
            path: /dev
            type: ""
          name: host-dev
        - emptyDir:
            medium: Memory
            sizeLimit: 150Mi
          name: metric-db-path
    updateStrategy:
      rollingUpdate:
        maxSurge: 0
        maxUnavailable: 20%
      type: RollingUpdate
  status:
    currentNumberScheduled: 3
    desiredNumberScheduled: 3
    numberAvailable: 3
    numberMisscheduled: 0
    numberReady: 3
    observedGeneration: 2
    updatedNumberScheduled: 3
kind: List
metadata:
  resourceVersion: ""
saintube commented 7 months ago

@b43646 OK, we are investigating the described case. BTW, did you find any suspicious logs of the koordlet pod?

b43646 commented 7 months ago

k2.log

@saintube Based on the description in the first post, I selected the log information from 15:50 to 15:54 for reference.

saintube commented 7 months ago

@b43646 These appear to be the systemd logs of the kubelet and the container runtime, not the koordlet's. You can get the koordlet logs with kubectl logs -n koordinator-system $KOORDLET_POD_NAME | less.

b43646 commented 7 months ago

@saintube Following your guidance, I found that the logs for my pod only contain today's data. I also observed that the /metric-data directory is already full. I will rerun the test to capture the logs of the OOMed pod. Please also try to reproduce the issue in your environment.

koordlet-hbdmp-512M.log

kubectl -n koordinator-system logs koordlet-hbdmp >> koordlet-hbdmp-512M.log
[root@k1 ~]# kubectl -n koordinator-system exec -it koordlet-hbdmp -- /bin/bash
root@k1:/# df -hT
Filesystem     Type      Size  Used Avail Use% Mounted on
overlay        overlay    39G   21G   18G  54% /
tmpfs          tmpfs     7.8G     0  7.8G   0% /sys/fs/cgroup
tmpfs          tmpfs     150M  150M     0 100% /metric-data
devtmpfs       devtmpfs  7.8G     0  7.8G   0% /dev
shm            tmpfs      64M     0   64M   0% /dev/shm
/dev/sda3      xfs        39G   21G   18G  54% /etc/localtime
tmpfs          tmpfs     7.8G  267M  7.5G   4% /host-run
saintube commented 7 months ago

@b43646 Thanks for your information. We're trying to reproduce the case. The key point is distinguishing the regular metrics-storage overhead from a memory leak. The latter is a bug, while the former is an under-provisioning problem where the OOM can easily be avoided by increasing the pod memory limit or enlarging the collect/store intervals. We also plan to add a table illustrating the relationship between the koordlet's memory cost and the number of pod metrics on each node, to help users configure the koordlet's resource requirements.

b43646 commented 7 months ago

@saintube

Thanks for your prompt response. One issue worth noting: although the koordlet's default configuration caps TSDB storage at a maximum of 100MB, in actual operation the /metric-data directory fills up with metric data to 150MB, exceeding the default value.

func NewDefaultConfig() *Config {
    return &Config{
        MetricGCIntervalSeconds: 300,
        MetricExpireSeconds:     1800,

        TSDBPath:              "/metric-data/",
        TSDBRetentionDuration: 12 * time.Hour,
        TSDBEnablePromMetrics: true,
        TSDBStripeSize:        tsdb.DefaultStripeSize,
        TSDBMaxBytes:          100 * 1024 * 1024, // 100 MB

        TSDBWALSegmentSize:            1 * 1024 * 1024,  // 1 MB
        TSDBMaxBlockChunkSegmentSize:  5 * 1024 * 1024,  // 5 MB
        TSDBMinBlockDuration:          30 * time.Minute, // 30 minutes
        TSDBMaxBlockDuration:          30 * time.Minute, // 30 minutes
        TSDBHeadChunksWriteBufferSize: 1024 * 1024,      // 1 MB
    }
}
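To see what is actually occupying the 150Mi emptyDir, the TSDB directory can be listed from inside the pod; a rough check, assuming the image ships the usual coreutils (the df output above suggests it does), with the pod name taken from the earlier exec and to be replaced as needed:

# List the TSDB directory layout (blocks, WAL, head chunks) inside the koordlet pod.
kubectl -n koordinator-system exec koordlet-hbdmp -- ls -lh /metric-data/
kubectl -n koordinator-system exec koordlet-hbdmp -- sh -c 'du -sh /metric-data/*'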
b43646 commented 7 months ago

@saintube

koordlet-g5bhr.log

oom.log

koordlet-ds-yaml.txt

The OOM issue can still be reproduced; it occurred at around 09:39. Here are the related details:

[root@k1 ~]# kc -n koordinator-system get pods -o wide
NAME                                READY   STATUS    RESTARTS      AGE     IP              NODE   NOMINATED NODE   READINESS GATES
koord-descheduler-dc5dc679c-fs6cm   1/1     Running   0             2d3h    10.234.201.20   k1     <none>           <none>
koord-descheduler-dc5dc679c-whw6q   1/1     Running   0             2d3h    10.234.29.178   k3     <none>           <none>
koord-manager-db6f4bdb9-mlsvs       1/1     Running   0             2d3h    10.234.24.91    k2     <none>           <none>
koord-manager-db6f4bdb9-vtt46       1/1     Running   0             2d3h    10.234.29.167   k3     <none>           <none>
koord-scheduler-7db78c8867-tg8q6    1/1     Running   0             2d3h    10.234.29.188   k3     <none>           <none>
koord-scheduler-7db78c8867-xdmk5    1/1     Running   0             2d3h    10.234.24.101   k2     <none>           <none>
koordlet-g5bhr                      1/1     Running   1 (30m ago)   6h35m   10.0.0.106      k3     <none>           <none>
koordlet-q9rw9                      1/1     Running   0             6h35m   10.0.0.100      k2     <none>           <none>
koordlet-s8fgg                      1/1     Running   0             6h35m   10.0.0.62       k1     <none>           <none>

[root@k3 ~]# cat /var/log/messages | grep "out of memory"

Apr  4 09:39:34 k3 kernel: Memory cgroup out of memory: Kill process 16498 (koordlet) score 1723 or sacrifice child
(screenshot: Prometheus graph of koordlet memory usage around the OOM)
b43646 commented 7 months ago

Hello @saintube, have you been able to reproduce this issue in your environment? Is there any additional verification that I need to provide?

saintube commented 7 months ago

@b43646 Thanks for your information. We have reproduced the case and are investigating the problem. This issue will be fixed before v1.5.

saintube commented 7 months ago

Hi @b43646, we've found a memory leak caused by an unclosed TSDB querier. Please take a look at #1995 and try the latest koordlet to check whether your issue is resolved.

After fixing the above issue, we've tested the recommended memory limits of the koordlet DaemonSet for different pod counts per node:

Appendix

b43646 commented 6 months ago

@saintube Great job, thanks for your help. I will verify it as soon as possible.

b43646 commented 6 months ago

@saintube After 7 days of testing and validation with the default 256Mi memory limit, koordlet did not encounter any OOM exceptions. The validation passed.

[root@demo ~]# kubectl get nodes
NAME           STATUS   ROLES   AGE     VERSION
10.0.132.100   Ready    node    7d12h   v1.29.1
10.0.132.152   Ready    node    7d12h   v1.29.1
10.0.132.228   Ready    node    7d12h   v1.29.1

[root@demo ~]# kubectl -n koordinator-system get pods | grep koordlet
koordlet-289qf                      1/1     Running   1 (7d12h ago)   7d12h
koordlet-6mwch                      1/1     Running   1 (7d12h ago)   7d12h
koordlet-kfhqj                      1/1     Running   1 (7d12h ago)   7d12h

[root@demo ~]# kubectl describe podgroup gang-example
Name:         gang-example
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  scheduling.sigs.k8s.io/v1alpha1
Kind:         PodGroup
Metadata:
  Creation Timestamp:  2024-04-28T13:48:03Z
  Generation:          208
  Resource Version:    1596886
  UID:                 529228b0-9d94-4fa1-9494-d4c47548d176
Spec:
  Min Member:                100
  Schedule Timeout Seconds:  100
Status:
  Phase:                Running
  Running:              200
  Schedule Start Time:  2024-04-28T13:48:19Z
  Scheduled:            102
Events:                 <none>
(screenshot: Prometheus monitoring data from the validation run)
hormes commented 6 months ago

/close

koordinator-bot[bot] commented 6 months ago

@hormes: Closing this issue.

In response to [this](https://github.com/koordinator-sh/koordinator/issues/1981#issuecomment-2097195406):

> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.