traas-stack / holoinsight

HoloInsight is a cloud-native observability platform with a special focus on real-time log analysis and AI integration.
Apache License 2.0

cadvisor init failure in kubernetes #195

Open dragonTour opened 1 year ago

dragonTour commented 1 year ago

Describe this problem

The cadvisor pods fail to start and end up in CrashLoopBackOff.

[root@xxx ~]# kubectl get po -n holoinsight-example
NAME                           READY   STATUS             RESTARTS      AGE
cadvisor-kpxbl                 0/1     CrashLoopBackOff   3 (35s ago)   90s
cadvisor-zwc4q                 0/1     CrashLoopBackOff   3 (16s ago)   90s
ceresdb-0                      1/1     Running            0             91s
clusteragent-0                 1/1     Running            0             91s
daemonagent-7xk4d              1/1     Running            0             90s
daemonagent-8n5gg              1/1     Running            0             90s
holoinsight-server-example-0   0/1     Running            0             91s
mongo-0                        1/1     Running            0             91s
mysql-0                        0/1     Running            0             90s

Output of kubectl describe for pod cadvisor-kpxbl:

[root@host-10-19-37-88 ~]# kubectl describe po cadvisor-kpxbl -n holoinsight-example
...
Containers:
  cadvisor:
    Container ID:  docker://7a3b2aab591d147b4dbf9e804e7b1837817696e50cd540ce1f63aff1ca27dac1
    Image:         gcr.io/cadvisor/cadvisor:v0.44.0
    Image ID:      docker-pullable://gcr.io/cadvisor/cadvisor@sha256:ef1e224267584fc9cb8d189867f178598443c122d9068686f9c3898c735b711f
    Port:          8080/TCP
    Host Port:     0/TCP
    Args:
      --allow_dynamic_housekeeping=false
      --housekeeping_interval=5s
      --max_housekeeping_interval=5s
      --storage_duration=2m
      --enable_metrics=cpu,memory,network,tcp,disk,diskIO,cpuLoad
      --enable_load_reader=true
      --store_container_labels=false
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:76: mounting "/var/lib/kubelet/pods/ab235440-bbca-45b0-94db-eb859ffdf763/volumes/kubernetes.io~projected/kube-api-access-hkwwk" to rootfs at "/var/run/secrets/kubernetes.io/serviceaccount" caused: mkdir /data/docker/overlay2/69dee914b194de362188cb07318446b62fa3559fc5cb03a54c1169e0cf4bda4c/merged/run/secrets: read-only file system: unknown

...
Events:
  Type     Reason     Age                     From               Message
  ----     ------     ----                    ----               -------
  Normal   Scheduled  5m8s                    default-scheduler  Successfully assigned holoinsight-example/cadvisor-kpxbl to host-10-19-37-88
  Warning  Failed     4m55s                   kubelet            Error: failed to start container "cadvisor": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:76: mounting "/var/lib/kubelet/pods/ab235440-bbca-45b0-94db-eb859ffdf763/volumes/kubernetes.io~projected/kube-api-access-hkwwk" to rootfs at "/var/run/secrets/kubernetes.io/serviceaccount" caused: mkdir /data/docker/overlay2/ce6a7eebbaedec94eb7fdaf3a4f1427526613fb4cf7be485908299219d32ac4c/merged/run/secrets: read-only file system: unknown
  Warning  Failed     4m52s                   kubelet            Error: failed to start container "cadvisor": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:76: mounting "/var/lib/kubelet/pods/ab235440-bbca-45b0-94db-eb859ffdf763/volumes/kubernetes.io~projected/kube-api-access-hkwwk" to rootfs at "/var/run/secrets/kubernetes.io/serviceaccount" caused: mkdir /data/docker/overlay2/944ffd264146660a5f9ede7638c677a1c47b98061838741cfb29bd3241c5babf/merged/run/secrets: read-only file system: unknown
  Warning  Failed     4m37s                   kubelet            Error: failed to start container "cadvisor": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:76: mounting "/var/lib/kubelet/pods/ab235440-bbca-45b0-94db-eb859ffdf763/volumes/kubernetes.io~projected/kube-api-access-hkwwk" to rootfs at "/var/run/secrets/kubernetes.io/serviceaccount" caused: mkdir /data/docker/overlay2/f441c927545adff71fcbdc8c5056ebeaa2b441a112c321e8205b7bc2c5eadb0d/merged/run/secrets: read-only file system: unknown
  Warning  Failed     4m6s                    kubelet            Error: failed to start container "cadvisor": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:76: mounting "/var/lib/kubelet/pods/ab235440-bbca-45b0-94db-eb859ffdf763/volumes/kubernetes.io~projected/kube-api-access-hkwwk" to rootfs at "/var/run/secrets/kubernetes.io/serviceaccount" caused: mkdir /data/docker/overlay2/bde54086f8560ea1fcbf9488fe25c83d8567d5bd0d9645ca5e59dd6c4940ffea/merged/run/secrets: read-only file system: unknown
  Normal   Pulled     3m21s (x5 over 5m1s)    kubelet            Container image "gcr.io/cadvisor/cadvisor:v0.44.0" already present on machine
  Normal   Created    3m20s (x5 over 5m)      kubelet            Created container cadvisor
  Warning  Failed     3m19s                   kubelet            Error: failed to start container "cadvisor": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:76: mounting "/var/lib/kubelet/pods/ab235440-bbca-45b0-94db-eb859ffdf763/volumes/kubernetes.io~projected/kube-api-access-hkwwk" to rootfs at "/var/run/secrets/kubernetes.io/serviceaccount" caused: mkdir /data/docker/overlay2/8ba07a89c4755c857fbc1e11e48169dde0ac9c6a8aca0c4384c792d91e961f0a/merged/run/secrets: read-only file system: unknown
  Warning  BackOff    2m51s (x10 over 4m51s)  kubelet            Back-off restarting failed container
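
The messages above are long, but only two paths in them matter: the mount target inside the container and the directory the runtime failed to create. As a minimal sketch, they can be extracted with a short script. The function name and the regular expressions are illustrative assumptions based on the message format shown here, not part of any existing tool.

```python
import re


def parse_mount_error(message):
    """Return (mount_target, failing_mkdir_path) from an OCI mount error.

    Returns None if the message does not look like the errors shown above.
    """
    # Path inside the container where the volume was supposed to be mounted.
    target = re.search(r'to rootfs at "([^"]+)"', message)
    # Directory the runtime tried to create on a read-only file system.
    mkdir = re.search(r"mkdir (\S+): read-only file system", message)
    if not target or not mkdir:
        return None
    return target.group(1), mkdir.group(1)
```

For the errors in this report, the mount target is the service-account directory and the failing mkdir path ends in `run/secrets`.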

Steps to reproduce

Kubernetes version: 1.23, Docker version: 20.10.6, Linux kernel: 4.18.0-1.el7.elrepo.x86_64

Expected behavior

No response

Additional Information

No response

xzchaoo commented 1 year ago

Is your k8s cluster a real cluster, or is it minikube?

xzchaoo commented 1 year ago

I can't find an environment exactly like yours to reproduce the problem in the short term. You could try modifying cadvisor.yaml (e.g. commenting out some of the configuration) and redeploying it.

dragonTour commented 1 year ago

I used kubeadm to bootstrap the cluster.

dragonTour commented 1 year ago

I changed Docker's data directory so that it is no longer under /var/. Could that be the cause?

[root@]# more /etc/docker/daemon.json 
{
    "data-root": "/data/docker",
    "exec-opts": [
        "native.cgroupdriver=systemd"
    ]
}
xzchaoo commented 1 year ago

Was the original value of data-root '/var/lib/docker'? If so, you may need to change cadvisor.yaml from:

      volumes:
...
      - name: docker
        hostPath:
          path: /var/lib/docker
     ...

to

      volumes:
...
      - name: docker
        hostPath:
          path: /data/docker
     ...
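
To find the right hostPath on a given node, the data-root can be read from the Docker daemon configuration. This is a minimal sketch: the helper name is hypothetical, and it assumes the configuration lives in a JSON file such as the /etc/docker/daemon.json shown earlier, with Docker's Linux default of /var/lib/docker applying when the key is absent.

```python
import json

# Docker's default data directory on Linux when "data-root" is not set.
DEFAULT_DATA_ROOT = "/var/lib/docker"


def docker_data_root(daemon_json_text):
    """Return the data-root configured in the given daemon.json contents."""
    # An empty file is treated the same as an empty configuration.
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    return config.get("data-root", DEFAULT_DATA_ROOT)


if __name__ == "__main__":
    try:
        with open("/etc/docker/daemon.json") as f:
            print(docker_data_root(f.read()))
    except FileNotFoundError:
        # No daemon.json means the default applies.
        print(DEFAULT_DATA_ROOT)
```

The value it prints is the path to use for the docker volume's hostPath.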
dragonTour commented 1 year ago

After changing readOnly to false, it runs successfully:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cadvisor
  namespace: holoinsight-example
spec:
  selector:
    matchLabels:
      app: cadvisor
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: cadvisor
        hi_common_version: '3'
    spec:
      restartPolicy: Always
      containers:
      - name: cadvisor
        image: gcr.io/cadvisor/cadvisor:v0.44.0
        args:
        - --allow_dynamic_housekeeping=false
        - --housekeeping_interval=5s
        - --max_housekeeping_interval=5s
        - --storage_duration=2m
        - --enable_metrics=cpu,memory,network,tcp,disk,diskIO,cpuLoad
        - --enable_load_reader=true
        - --store_container_labels=false
        volumeMounts:
        - name: rootfs
          mountPath: /rootfs
          readOnly: false
        - name: var-run
          mountPath: /var/run
          readOnly: false
        - name: sys
          mountPath: /sys
          readOnly: true
        - name: docker
          mountPath: /var/lib/docker
          readOnly: false
        - name: disk
          mountPath: /dev/disk
          readOnly: true
        ports:
        - name: http
          containerPort: 8080
          protocol: TCP

        resources:
          requests:
            cpu: "0"
            memory: "0"
          limits:
            cpu: "0.25"
            memory: "256Mi"
      volumes:
      - name: rootfs
        hostPath:
          path: /
      - name: var-run
        hostPath:
          path: /var/run
      - name: sys
        hostPath:
          path: /sys
      - name: docker
        hostPath:
          path: /data/docker
      - name: disk
        hostPath:
          path: /dev/disk
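
The error says the runtime could not create `run/secrets` because the file system there is read-only, and the manifest mounts the host's /var/run at /var/run in the container. One plausible reading, not confirmed in this thread, is that a read-only mount at that path stops the service-account volume from being mounted beneath it. The sketch below flags such mounts. The function name is hypothetical, and it takes the volumeMounts list as already-parsed Python dicts rather than reading YAML.

```python
import posixpath

# Where Kubernetes mounts the service-account token inside the container.
SERVICE_ACCOUNT_PATH = "/var/run/secrets/kubernetes.io/serviceaccount"


def read_only_mounts_covering(volume_mounts, target=SERVICE_ACCOUNT_PATH):
    """Return names of read-only mounts whose mountPath contains `target`."""
    hits = []
    for mount in volume_mounts:
        path = posixpath.normpath(mount["mountPath"])
        # A mount covers the target if it is the target or one of its parents.
        covers = target == path or target.startswith(path.rstrip("/") + "/")
        if mount.get("readOnly", False) and covers:
            hits.append(mount["name"])
    return hits


if __name__ == "__main__":
    mounts = [
        {"name": "var-run", "mountPath": "/var/run", "readOnly": True},
        {"name": "sys", "mountPath": "/sys", "readOnly": True},
    ]
    print(read_only_mounts_covering(mounts))
```

An empty result for the manifest above would be consistent with it starting successfully, though it does not by itself prove the root cause.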
xzchaoo commented 1 year ago

The volumeMounts config in the cadvisor YAML is copied from the official cadvisor repository without any changes, and our internal deployments (on an Aliyun k8s cluster) all succeed with this config. I think something specific to your k8s cluster is causing the deployment to fail.

If you would like to explore the root cause of this issue and contribute a fix, that would be very welcome.

gigi-at-zymergen commented 1 month ago

dragonTour is not alone. I'm seeing the same issue on EKS 1.24, which uses the containerd runtime.

cadvisor:
    Container ID:   containerd://80ad9ce8b85e077f50dd9c1bfd1e248801afa3126f94793b91bbdb5ea33acf29
    Image:          gcr.io/cadvisor/cadvisor:v0.49.1
    Image ID:       gcr.io/cadvisor/cadvisor@sha256:3cde6faf0791ebf7b41d6f8ae7145466fed712ea6f252c935294d2608b1af388
    Port:           8080/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/var/lib/kubelet/pods/882dfec1-613f-4a83-8705-424230f18271/volumes/kubernetes.io~projected/kube-api-access-phx22" to rootfs at "/var/run/secrets/kubernetes.io/serviceaccount": mkdir /run/containerd/io.containerd.runtime.v2.task/k8s.io/80ad9ce8b85e077f50dd9c1bfd1e248801afa3126f94793b91bbdb5ea33acf29/rootfs/run/secrets: read-only file system: unknown