squat / generic-device-plugin

A Kubernetes device plugin to schedule generic Linux devices
Apache License 2.0

Device manager does not provide devices to the application container after reboot #63

Open ratermir opened 4 months ago

ratermir commented 4 months ago

I am not sure whether this is a real issue or I am doing something wrong ... but

I have prepared my single node microk8s cluster for Home Assistant, installed this device plugin and propagated /dev/ttyUSB0 and /dev/zigbee2 (symlink to the first one) to the "zigbee2mqtt" pod.

After the first installation everything worked well, but after a reboot the "zigbee2mqtt" pod (with /dev/ttyUSB0 and /dev/zigbee2 imported) didn't start. The pod was stuck in the "UnexpectedAdmissionError" state; another pod was created that stayed in "Pending", fell off, a new one was created ... etc.

zmq-5984c5f8cd-fxxhl    0/1     Pending                    0               3m14s
zmq-5984c5f8cd-z7dvd    0/1     UnexpectedAdmissionError   0               9m24s

In the pod description the following error is written (there is no log since the pod didn't start):

Events:
  Type     Reason                    Age   From     Message
  ----     ------                    ----  ----     -------
  Warning  UnexpectedAdmissionError  48s   kubelet  Allocate failed due to no healthy devices present; cannot allocate unhealthy devices squat.ai/serial, which is unexpected

The situation repeats after each reboot.

When I kill all pods manually (the device manager one and also the application pods that don't work), new pods are started and everything works.
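
For reference, the manual workaround looks roughly like this (the pod names are examples taken from kubectl get pods; the application namespace is a placeholder for the one I use):

# delete the device manager pod; the DaemonSet re-creates it immediately
kubectl -n kube-system delete pod device-plugin-zigbee-4vdbz
# delete the stuck application pods so the deployment schedules fresh ones
kubectl -n <app-namespace> delete pod zmq-5984c5f8cd-z7dvd zmq-5984c5f8cd-fxxhl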

Here is the log of the device manager container after the first boot (the situation when it doesn't work, i.e. doesn't mount the devices into the application container):

ha-test@zmh-lip:/home/k8s/_system/kube-system$ kcn logs device-plugin-zigbee-4vdbz
{"caller":"main.go:218","msg":"Starting the generic-device-plugin for \"squat.ai/zigbee\".","ts":"2024-03-07T13:23:42.934752026Z"}
{"caller":"main.go:218","msg":"Starting the generic-device-plugin for \"squat.ai/serial\".","ts":"2024-03-07T13:23:41.735530601Z"}
{"caller":"plugin.go:114","level":"info","msg":"listening on Unix socket","resource":"squat.ai/serial","socket":"/var/lib/kubelet/device-plugins/gdp-c3F1YXQuYWkvc2VyaWFs-1709817821.sock","ts":"2024-03-07T13:23:45.835273691Z"}
{"caller":"plugin.go:114","level":"info","msg":"listening on Unix socket","resource":"squat.ai/zigbee","socket":"/var/lib/kubelet/device-plugins/gdp-c3F1YXQuYWkvemlnYmVl-1709817821.sock","ts":"2024-03-07T13:23:43.93416597Z"}
{"caller":"plugin.go:122","level":"info","msg":"starting gRPC server","resource":"squat.ai/zigbee","ts":"2024-03-07T13:23:57.034351851Z"}
{"caller":"plugin.go:176","level":"info","msg":"waiting for the gRPC server to be ready","resource":"squat.ai/zigbee","ts":"2024-03-07T13:23:57.034340518Z"}
{"caller":"plugin.go:122","level":"info","msg":"starting gRPC server","resource":"squat.ai/serial","ts":"2024-03-07T13:23:58.334925943Z"}
{"caller":"plugin.go:176","level":"info","msg":"waiting for the gRPC server to be ready","resource":"squat.ai/serial","ts":"2024-03-07T13:23:58.434129443Z"}
{"caller":"plugin.go:188","level":"info","msg":"the gRPC server is ready","resource":"squat.ai/serial","ts":"2024-03-07T13:24:00.434686183Z"}
{"caller":"plugin.go:188","level":"info","msg":"the gRPC server is ready","resource":"squat.ai/zigbee","ts":"2024-03-07T13:24:01.334968608Z"}
{"caller":"plugin.go:226","level":"info","msg":"registering plugin with kubelet","resource":"squat.ai/zigbee","ts":"2024-03-07T13:24:01.335104052Z"}
{"caller":"plugin.go:226","level":"info","msg":"registering plugin with kubelet","resource":"squat.ai/serial","ts":"2024-03-07T13:24:01.237744071Z"}
ha-test@zmh-lip:/home/k8s/_system/kube-system$

This leads to the state described above (a non-working application container).

Here is the same log from the container after the first one was killed (and re-created by k8s):

ha-test@zmh-lip:/home/k8s/_system/kube-system$ kcn logs device-plugin-zigbee-tkgsv
{"caller":"main.go:218","msg":"Starting the generic-device-plugin for \"squat.ai/zigbee\".","ts":"2024-03-07T13:26:41.838459105Z"}
{"caller":"plugin.go:114","level":"info","msg":"listening on Unix socket","resource":"squat.ai/zigbee","socket":"/var/lib/kubelet/device-plugins/gdp-c3F1YXQuYWkvemlnYmVl-1709818001.sock","ts":"2024-03-07T13:26:41.839476438Z"}
{"caller":"plugin.go:122","level":"info","msg":"starting gRPC server","resource":"squat.ai/zigbee","ts":"2024-03-07T13:26:41.840105846Z"}
{"caller":"plugin.go:176","level":"info","msg":"waiting for the gRPC server to be ready","resource":"squat.ai/zigbee","ts":"2024-03-07T13:26:41.840379364Z"}
{"caller":"main.go:218","msg":"Starting the generic-device-plugin for \"squat.ai/serial\".","ts":"2024-03-07T13:26:41.934053346Z"}
{"caller":"plugin.go:114","level":"info","msg":"listening on Unix socket","resource":"squat.ai/serial","socket":"/var/lib/kubelet/device-plugins/gdp-c3F1YXQuYWkvc2VyaWFs-1709818001.sock","ts":"2024-03-07T13:26:41.934398123Z"}
{"caller":"plugin.go:188","level":"info","msg":"the gRPC server is ready","resource":"squat.ai/zigbee","ts":"2024-03-07T13:26:41.936471975Z"}
{"caller":"plugin.go:226","level":"info","msg":"registering plugin with kubelet","resource":"squat.ai/zigbee","ts":"2024-03-07T13:26:41.936560327Z"}
{"caller":"plugin.go:122","level":"info","msg":"starting gRPC server","resource":"squat.ai/serial","ts":"2024-03-07T13:26:42.034449846Z"}
{"caller":"plugin.go:176","level":"info","msg":"waiting for the gRPC server to be ready","resource":"squat.ai/serial","ts":"2024-03-07T13:26:42.034767179Z"}
{"caller":"plugin.go:188","level":"info","msg":"the gRPC server is ready","resource":"squat.ai/serial","ts":"2024-03-07T13:26:42.03771179Z"}
{"caller":"plugin.go:226","level":"info","msg":"registering plugin with kubelet","resource":"squat.ai/serial","ts":"2024-03-07T13:26:42.037867012Z"}
{"caller":"generic.go:232","level":"info","msg":"starting listwatch","resource":"squat.ai/zigbee","ts":"2024-03-07T13:26:42.53712666Z"}
{"caller":"generic.go:232","level":"info","msg":"starting listwatch","resource":"squat.ai/serial","ts":"2024-03-07T13:26:42.634323382Z"}
ha-test@zmh-lip:/home/k8s/_system/kube-system$

My environment is a Raspberry Pi 4/8GB (Arm64), DietPi OS (a variant of Debian), USB drive. The system doesn't show any other issues. I am not too experienced with k8s devices, so I am not sure what can cause this strange behaviour.

squat commented 4 months ago

Hi, can you please share:

  1. the output of kubectl describe node <your-node>
  2. the pod/deployment manifest
  3. configuration for generic-device-plugin
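
For example, something like this should capture it all (the node, namespace, and object names below are placeholders; adjust them to your setup):

kubectl describe node <your-node>
kubectl -n <app-namespace> get deployment <zigbee2mqtt-deployment> -o yaml
kubectl -n kube-system get daemonset <device-plugin-daemonset> -o yaml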

I have never seen the UnexpectedAdmissionError with generic-device-plugin.

So you reboot your node and then the machine gets stuck in a bad state? Are you able to make it work by killing just the zigbee2mqtt pods or do you have to kill the plugin pod?

ratermir commented 4 months ago

Yes, after a reboot it is always in the "bad" state. Killing just the zigbee2mqtt pod doesn't help; I need to kill both (zigbee2mqtt and also the device manager).

Also note that I played with delayed container starts (using init containers) for both zigbee2mqtt and the device manager in various ways (zigbee2mqtt was delayed by at most 1 minute), but the results were always the same.

device manager config:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: device-plugin-zigbee
  namespace: kube-system
  labels:
    app.kubernetes.io/name: device-plugin-zigbee
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: device-plugin-zigbee
  template:
    metadata:
      labels:
        app.kubernetes.io/name: device-plugin-zigbee
    spec:
      priorityClassName: system-node-critical
      tolerations:
      - operator: "Exists"
        effect: "NoExecute"
      - operator: "Exists"
        effect: "NoSchedule"
      containers:
      - image: squat/generic-device-plugin
        args:
        - --device
        - |
          name: zigbee
          groups:
            - paths:
                - path: /dev/zigbee*
        - --device
        - |
          name: serial
          groups:
            - paths:
                - path: /dev/ttyUSB*
            - paths:
                - path: /dev/ttyACM*
            - paths:
                - path: /dev/tty.usb*
            - paths:
                - path: /dev/cu.*
            - paths:
                - path: /dev/cuaU*
            - paths:
                - path: /dev/rfcomm*
        name: device-plugin-zigbee
        resources:
          requests:
            cpu: 50m
            memory: 10Mi
          limits:
            cpu: 50m
            memory: 20Mi
        ports:
        - containerPort: 8080
          name: http
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: dev
          mountPath: /dev
      initContainers:
        - name: wait
          image: busybox:1.35.0-uclibc
          command: ['sh', '-c', 'echo "Wait for serial device" && sleep 5']
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: dev
        hostPath:
          path: /dev
  updateStrategy:
    type: RollingUpdate

Describe node:

Name:               zmh-lip
Roles:              <none>
Labels:             beta.kubernetes.io/arch=arm64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=arm64
                    kubernetes.io/hostname=zmh-lip
                    kubernetes.io/os=linux
                    microk8s.io/cluster=true
                    node.kubernetes.io/microk8s-controlplane=microk8s-controlplane
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 04 Mar 2024 09:17:12 +0000
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  zmh-lip
  AcquireTime:     <unset>
  RenewTime:       Thu, 07 Mar 2024 15:19:16 +0000
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Thu, 07 Mar 2024 15:17:43 +0000   Mon, 04 Mar 2024 09:17:12 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Thu, 07 Mar 2024 15:17:43 +0000   Mon, 04 Mar 2024 09:17:12 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Thu, 07 Mar 2024 15:17:43 +0000   Mon, 04 Mar 2024 09:17:12 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Thu, 07 Mar 2024 15:17:43 +0000   Thu, 07 Mar 2024 14:52:11 +0000   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  192.168.181.23
  Hostname:    zmh-lip
Capacity:
  cpu:                4
  ephemeral-storage:  18888700Ki
  memory:             8050912Ki
  pods:               110
  squat.ai/audio:     0
  squat.ai/capture:   0
  squat.ai/fuse:      0
  squat.ai/serial:    0
  squat.ai/video:     0
  squat.ai/zigbee:    0
Allocatable:
  cpu:                4
  ephemeral-storage:  17840124Ki
  memory:             7948512Ki
  pods:               110
  squat.ai/audio:     0
  squat.ai/capture:   0
  squat.ai/fuse:      0
  squat.ai/serial:    0
  squat.ai/video:     0
  squat.ai/zigbee:    0
System Info:
  Machine ID:                 2ac1bb8637c04c49be0973177b80132d
  System UUID:                2ac1bb8637c04c49be0973177b80132d
  Boot ID:                    1883a437-3795-4ad1-8a5a-9a27c9735ae8
  Kernel Version:             6.1.21-v8+
  OS Image:                   Debian GNU/Linux 12 (bookworm)
  Operating System:           linux
  Architecture:               arm64
  Container Runtime Version:  containerd://1.6.28
  Kubelet Version:            v1.29.2
  Kube-Proxy Version:         v1.29.2
Non-terminated Pods:          (11 in total)
  Namespace                   Name                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                       ------------  ----------  ---------------  -------------  ---
  cert-manager                cert-manager-7cf97bbd47-6qg52              0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d22h
  cert-manager                cert-manager-cainjector-99677759d-5ttn7    0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d22h
  cert-manager                cert-manager-webhook-8486cb8479-8mzq6      0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d22h
  hass                        mqtt-5dd56b975d-mvg2n                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         43h
  hass                        tsdb-22vkn                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         44h
  ingress                     nginx-ingress-microk8s-controller-qrgdq    0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d23h
  kube-system                 coredns-864597b5fd-8xrx2                   100m (2%)     0 (0%)      70Mi (0%)        170Mi (2%)     3d4h
  kube-system                 device-plugin-zigbee-tkgsv                 50m (1%)      50m (1%)    10Mi (0%)        20Mi (0%)      112m
  kube-system                 hostpath-provisioner-756cd956bc-kw6zb      0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d4h
  metallb-system              controller-5f7bb57799-7c824                0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
  metallb-system              speaker-7jzfn                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d5h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                150m (3%)  50m (1%)
  memory             80Mi (1%)  190Mi (2%)
  ephemeral-storage  0 (0%)     0 (0%)
  squat.ai/audio     0          0
  squat.ai/capture   0          0
  squat.ai/fuse      0          0
  squat.ai/serial    0          0
  squat.ai/video     0          0
  squat.ai/zigbee    0          0
Events:              <none>

Deployment for the pod

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zmq
  namespace: {{namespace}}
  labels:
    app.kubernetes.io/name: zmq
spec:
  revisionHistoryLimit: 3
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app.kubernetes.io/name: zmq
  template:
    metadata:
      labels:
        app.kubernetes.io/name: zmq
    spec:
      containers:
        - name: zmq
          image: "docker.io/koenkk/zigbee2mqtt"
          imagePullPolicy: IfNotPresent
          resources:
            limits:
              squat.ai/zigbee: 1
              squat.ai/serial: 1
          ports:
            - name: zmq
              containerPort: {{zmq_port}}
              protocol: TCP
              hostPort: {{zmq_port}}
          volumeMounts:
            - name: data
              mountPath: /app/data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: zmq-data
squat commented 4 months ago

It's good that you're using Recreate for your deployment strategy to ensure that pods don't get stuck waiting for the device to become available.

One thing I notice is that you added an init container to the device plugin to wait for the serial device. This is an anti-pattern: the device plugin checks for new devices as they appear on your OS every 5 seconds.

I also notice that your node shows 0 serial and 0 zigbee devices. Why is that?
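
A quick way to watch what the plugin is currently advertising to the kubelet is to grep the extended resources out of the node status, for example (node name taken from your describe output above):

kubectl describe node zmh-lip | grep squat.ai

If the counts stay at 0 for more than a few seconds after the devices show up in /dev, the plugin is not seeing them.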

ratermir commented 4 months ago

Additional observation:

When I set the number of replicas to 0 for the zigbee2mqtt pod (to avoid it starting automatically) and started it manually a while after the reboot, a pod in the "UnexpectedAdmissionError" state didn't appear. The started pod ended in the "Pending" state and stayed like that until I killed the device manager pod created during boot.

After I killed the device manager pod and a new one was created, the zigbee2mqtt pod started normally.
Also, the serial and zigbee device counts were 1 (as expected) after this.

Here is part of the describe node output after killing the pod created during system boot:

Capacity:
  cpu:                4
  ephemeral-storage:  18888700Ki
  memory:             8050912Ki
  pods:               110
  squat.ai/audio:     0
  squat.ai/capture:   0
  squat.ai/fuse:      0
  squat.ai/serial:    1
  squat.ai/video:     0
  squat.ai/zigbee:    1
Allocatable:
  cpu:                4
  ephemeral-storage:  17840124Ki
  memory:             7948512Ki
  pods:               110
  squat.ai/audio:     0
  squat.ai/capture:   0
  squat.ai/fuse:      0
  squat.ai/serial:    1
  squat.ai/video:     0
  squat.ai/zigbee:    1
System Info:
ratermir commented 4 months ago

I also noticed that your node shows 0 serial and 0 zigbee devices. Why is that?

This is probably why the dependent pod is not starting. But why this is happening ... I don't know. I'd note that both devices ("/dev/ttyUSB0" and "/dev/zigbee2") exist after boot and are working (I have an alternative - a "podman" version of zigbee2mqtt, which I used before; it is now disabled).
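
For the record, this is how I check them on the host right after boot (plain shell on the node):

ls -l /dev/ttyUSB0 /dev/zigbee2   # /dev/zigbee2 is a symlink pointing to /dev/ttyUSB0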