prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

systemd collector and high memory usage #1108

Closed · mmiller1 closed this 6 years ago

mmiller1 commented 6 years ago

Host operating system: output of uname -a

Linux bizhpa0s10k8s 4.15.0-32-generic #35~16.04.1-Ubuntu SMP Fri Aug 10 21:54:34 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 0.16.0 (branch: HEAD, revision: d42bd70f4363dced6b77d8fc311ea57b63387e4f)
  build user: root@a67a9bc13a69
  build date: 20180515-15:52:42
  go version: go1.9.6

node_exporter command line flags

    - "--path.procfs=/host/proc"
    - "--path.sysfs=/host/sys"
    - "--no-collector.hwmon"
    - "--collector.systemd"
    - "--collector.systemd.unit-whitelist=(keepalived)\\.service"
    - "--no-collector.wifi"
    - "--collector.filesystem.ignored-fs-types=\"^fuse.lxcfs|tmpfs$|^/rootfs/(var/lib/docker/)|(run/docker/netns/).*|^/host/sys/kernel.*\""

I've observed node_exporter running in a Kubernetes pod consuming a considerable amount of memory. I've raised the memory limit to 1Gi and still observed OOM kills. I captured some info with pprof, and it led me to believe the problem lies in the systemd collector; disabling it did resolve the memory usage issue, but that is obviously not ideal.

heap.svg.gz
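
The profile above was captured with pprof against the running exporter. A minimal sketch of how to reproduce that capture, assuming the /debug/pprof endpoints are served on the metrics port (9100 here) and that Go plus graphviz are installed locally:

    # Fetch a heap profile and render it as SVG
    go tool pprof -svg http://localhost:9100/debug/pprof/heap > heap.svg
    # Compress it for attaching, as above
    gzip heap.svg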

Here is the YAML that makes up the k8s DaemonSet:

---
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
      name: node-exporter
    spec:
      hostNetwork: true
      hostPID: true
      containers:
      - image: quay.io/prometheus/node-exporter:v0.16.0
        args:
        - "--path.procfs=/host/proc"
        - "--path.sysfs=/host/sys"
        - "--no-collector.hwmon"
        - "--collector.systemd"
        - "--collector.systemd.unit-whitelist=(keepalived)\\.service"
        - "--no-collector.wifi"
        - "--collector.filesystem.ignored-fs-types=\"^fuse.lxcfs|tmpfs$|^/rootfs/(var/lib/docker/)|(run/docker/netns/).*|^/host/sys/kernel.*\""
        name: node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          name: scrape
        resources:
          requests:
            memory: 200Mi
            cpu: 200m
          limits:
            memory: 1Gi
            cpu: 500m
        securityContext:
          privileged: true
        volumeMounts:
        - name: proc
          readOnly: true
          mountPath: /host/proc
        - name: sys
          readOnly: true
          mountPath: /host/sys
        - name: dbus
          readOnly: true
          mountPath: /var/run/dbus/
      tolerations:
        - effect: NoSchedule
          operator: Exists
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys
      - name: dbus
        hostPath: 
          path: /var/run/dbus/
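
With that manifest deployed, a rough way to watch the exporter's own memory from outside the pod, assuming the hostPort 9100 defined above, is to read its Go process metrics:

    # Resident memory of the node_exporter process itself, in bytes
    curl -s http://localhost:9100/metrics | grep '^process_resident_memory_bytes'
    # In-use Go heap as reported by the runtime
    curl -s http://localhost:9100/metrics | grep '^go_memstats_heap_inuse_bytes'
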
gusnakada commented 6 years ago

I'm having the same issue. I've not been able to get it working with the systemd collector on CoreOS.

mmiller1 commented 6 years ago

I believe I found the cause of this; see these two issues reporting other problems related to systemd: https://github.com/kubernetes/kops/issues/5916 https://github.com/kubernetes/kubernetes/issues/64137

It does seem like node_exporter has a memory leak (or similar) when systemd becomes unresponsive, but the two issues linked above contain workarounds that resolved the problem for me. I'm going to close this.
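
For anyone else hitting this: a rough check for whether systemd on the node has become unresponsive, which this seems to hinge on (a hedged sketch, run on the host itself, assuming systemctl is available):

    # If this hangs or takes many seconds, systemd itself is the bottleneck
    time systemctl list-units --all --no-pager | wc -l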