Open ratermir opened 4 months ago
Hi, can you please share:
kubectl describe node <your-node>
I have never seen the UnexpectedAdmissionError with generic-device-plugin.
So you reboot your node and then the machine gets stuck in a bad state? Are you able to make it work by killing just the zigbee2mqtt pods or do you have to kill the plugin pod?
Yes, after reboot it is always in the "bad" state. Killing just the zigbee2mqtt pod doesn't help; I need to kill both (zigbee2mqtt and also the device manager).
Also note that I played with delayed container starts (using an init container) for both zigbee2mqtt and the device manager in various ways (zigbee2mqtt was delayed by at most 1 minute), but the results were always the same.
device manager config:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: device-plugin-zigbee
  namespace: kube-system
  labels:
    app.kubernetes.io/name: device-plugin-zigbee
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: device-plugin-zigbee
  template:
    metadata:
      labels:
        app.kubernetes.io/name: device-plugin-zigbee
    spec:
      priorityClassName: system-node-critical
      tolerations:
      - operator: "Exists"
        effect: "NoExecute"
      - operator: "Exists"
        effect: "NoSchedule"
      containers:
      - image: squat/generic-device-plugin
        args:
        - --device
        - |
          name: zigbee
          groups:
            - paths:
                - path: /dev/zigbee*
        - --device
        - |
          name: serial
          groups:
            - paths:
                - path: /dev/ttyUSB*
            - paths:
                - path: /dev/ttyACM*
            - paths:
                - path: /dev/tty.usb*
            - paths:
                - path: /dev/cu.*
            - paths:
                - path: /dev/cuaU*
            - paths:
                - path: /dev/rfcomm*
        name: device-plugin-zigbee
        resources:
          requests:
            cpu: 50m
            memory: 10Mi
          limits:
            cpu: 50m
            memory: 20Mi
        ports:
        - containerPort: 8080
          name: http
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: dev
          mountPath: /dev
      initContainers:
      - name: wait
        image: busybox:1.35.0-uclibc
        command: ['sh', '-c', 'echo "Wait for serial device" && sleep 5']
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: dev
        hostPath:
          path: /dev
  updateStrategy:
    type: RollingUpdate
Describe node:
Name: zmh-lip
Roles: <none>
Labels: beta.kubernetes.io/arch=arm64
beta.kubernetes.io/os=linux
kubernetes.io/arch=arm64
kubernetes.io/hostname=zmh-lip
kubernetes.io/os=linux
microk8s.io/cluster=true
node.kubernetes.io/microk8s-controlplane=microk8s-controlplane
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Mon, 04 Mar 2024 09:17:12 +0000
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: zmh-lip
AcquireTime: <unset>
RenewTime: Thu, 07 Mar 2024 15:19:16 +0000
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Thu, 07 Mar 2024 15:17:43 +0000 Mon, 04 Mar 2024 09:17:12 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 07 Mar 2024 15:17:43 +0000 Mon, 04 Mar 2024 09:17:12 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 07 Mar 2024 15:17:43 +0000 Mon, 04 Mar 2024 09:17:12 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Thu, 07 Mar 2024 15:17:43 +0000 Thu, 07 Mar 2024 14:52:11 +0000 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 192.168.181.23
Hostname: zmh-lip
Capacity:
cpu: 4
ephemeral-storage: 18888700Ki
memory: 8050912Ki
pods: 110
squat.ai/audio: 0
squat.ai/capture: 0
squat.ai/fuse: 0
squat.ai/serial: 0
squat.ai/video: 0
squat.ai/zigbee: 0
Allocatable:
cpu: 4
ephemeral-storage: 17840124Ki
memory: 7948512Ki
pods: 110
squat.ai/audio: 0
squat.ai/capture: 0
squat.ai/fuse: 0
squat.ai/serial: 0
squat.ai/video: 0
squat.ai/zigbee: 0
System Info:
Machine ID: 2ac1bb8637c04c49be0973177b80132d
System UUID: 2ac1bb8637c04c49be0973177b80132d
Boot ID: 1883a437-3795-4ad1-8a5a-9a27c9735ae8
Kernel Version: 6.1.21-v8+
OS Image: Debian GNU/Linux 12 (bookworm)
Operating System: linux
Architecture: arm64
Container Runtime Version: containerd://1.6.28
Kubelet Version: v1.29.2
Kube-Proxy Version: v1.29.2
Non-terminated Pods: (11 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
cert-manager cert-manager-7cf97bbd47-6qg52 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d22h
cert-manager cert-manager-cainjector-99677759d-5ttn7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d22h
cert-manager cert-manager-webhook-8486cb8479-8mzq6 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d22h
hass mqtt-5dd56b975d-mvg2n 0 (0%) 0 (0%) 0 (0%) 0 (0%) 43h
hass tsdb-22vkn 0 (0%) 0 (0%) 0 (0%) 0 (0%) 44h
ingress nginx-ingress-microk8s-controller-qrgdq 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d23h
kube-system coredns-864597b5fd-8xrx2 100m (2%) 0 (0%) 70Mi (0%) 170Mi (2%) 3d4h
kube-system device-plugin-zigbee-tkgsv 50m (1%) 50m (1%) 10Mi (0%) 20Mi (0%) 112m
kube-system hostpath-provisioner-756cd956bc-kw6zb 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d4h
metallb-system controller-5f7bb57799-7c824 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d5h
metallb-system speaker-7jzfn 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d5h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 150m (3%) 50m (1%)
memory 80Mi (1%) 190Mi (2%)
ephemeral-storage 0 (0%) 0 (0%)
squat.ai/audio 0 0
squat.ai/capture 0 0
squat.ai/fuse 0 0
squat.ai/serial 0 0
squat.ai/video 0 0
squat.ai/zigbee 0 0
Events: <none>
Deployment for the pod:
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zmq
  namespace: {{namespace}}
  labels:
    app.kubernetes.io/name: zmq
spec:
  revisionHistoryLimit: 3
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app.kubernetes.io/name: zmq
  template:
    metadata:
      labels:
        app.kubernetes.io/name: zmq
    spec:
      containers:
      - name: zmq
        image: "docker.io/koenkk/zigbee2mqtt"
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            squat.ai/zigbee: 1
            squat.ai/serial: 1
        ports:
        - name: zmq
          containerPort: {{zmq_port}}
          protocol: TCP
          hostPort: {{zmq_port}}
        volumeMounts:
        - name: data
          mountPath: /app/data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: zmq-data
It's good that you're using Recreate for your deployment strategy to ensure that pods don't get stuck waiting for the device to become available.
One thing I notice is that you added an init container to the device plugin to wait for the serial device. This is an anti-pattern: the device plugin checks for new devices as they appear on your OS every 5 seconds.
I also notice that your node shows 0 serial and 0 zigbee devices. Why is that?
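One way to read the advertised counts without scanning the whole describe output is sketched below. The helper name `resource_count` is hypothetical; it assumes `kubectl` is on the PATH (on microk8s this may be `microk8s kubectl` or an alias) and only parses the `name: value` lines that `kubectl describe node` prints.

```sh
# Hypothetical helper: print the first value reported for a resource
# (the Capacity entry) in `kubectl describe node` output.
resource_count() {
    node=$1
    resource=$2
    kubectl describe node "$node" |
        awk -v key="$resource:" '$1 == key { print $2; exit }'
}

# Example (node name taken from this thread):
#   resource_count zmh-lip squat.ai/zigbee
#   resource_count zmh-lip squat.ai/serial
```

A value of 0 after boot that changes to 1 only once the plugin pod has been recreated would match the behaviour reported in this thread.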
Additional observation:
When I set the number of replicas to 0 for the zigbee2mqtt pod (to avoid it starting automatically) and started it manually a while after reboot, the pod state "UnexpectedAdmissionError" didn't appear. The started pod ended up in the state "Pending" and stayed there until I killed the device manager pod created during boot.
After I killed the device manager pod and a new one was created, the zigbee2mqtt pod started normally.
Also, the serial and zigbee device counts were as expected after this.
Here is part of the describe node output after killing the pod created during system boot:
Capacity:
cpu: 4
ephemeral-storage: 18888700Ki
memory: 8050912Ki
pods: 110
squat.ai/audio: 0
squat.ai/capture: 0
squat.ai/fuse: 0
squat.ai/serial: 1
squat.ai/video: 0
squat.ai/zigbee: 1
Allocatable:
cpu: 4
ephemeral-storage: 17840124Ki
memory: 7948512Ki
pods: 110
squat.ai/audio: 0
squat.ai/capture: 0
squat.ai/fuse: 0
squat.ai/serial: 1
squat.ai/video: 0
squat.ai/zigbee: 1
System Info:
I also noticed that your node shows 0 serial and 0 zigbee devices. Why is that?
This is probably why the dependent pod is not starting, but I don't know why it is happening. Note that both devices ("/dev/ttyUSB0" and "/dev/zigbee2") exist after boot and work (I have an alternative, a "podman" version of zigbee2mqtt, which I used before; it is now disabled).
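To see when the device nodes actually appear on the host after a reboot, something like the following could be run directly on the node. It is only a sketch, not the plugin's own code: the helper names `count_matches` and `watch_devices` are made up here, and the 5-second default merely mirrors the scan interval mentioned earlier in the thread.

```sh
# Hypothetical helper: count how many paths match a glob pattern.
count_matches() {
    pattern=$1
    count=0
    # $pattern is left unquoted on purpose so the shell expands the glob.
    for path in $pattern; do
        # An unmatched glob stays literal, so only count paths that
        # exist or are symlinks.
        if [ -e "$path" ] || [ -L "$path" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Print match counts for two of the patterns from the plugin config,
# once per interval, for a fixed number of rounds.
watch_devices() {
    rounds=${1:-12}
    interval=${2:-5}
    i=0
    while [ "$i" -lt "$rounds" ]; do
        echo "$(date +%T) zigbee=$(count_matches '/dev/zigbee*') serial=$(count_matches '/dev/ttyUSB*')"
        i=$((i + 1))
        sleep "$interval"
    done
}

# Example: watch_devices 12 5   (about one minute of output)
```

If the counts printed here are non-zero while the node still reports 0 for squat.ai/zigbee and squat.ai/serial, the devices are visible on the host and the problem is more likely on the plugin or kubelet side.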
I am not sure whether this is a real issue or I am doing something wrong, but:
I have prepared my single-node microk8s cluster for Home Assistant, installed this device plugin, and passed /dev/ttyUSB0 and /dev/zigbee2 (a symlink to the first one) to the "zigbee2mqtt" pod.
After the first installation everything worked well, but after a reboot the "zigbee2mqtt" pod (with /dev/ttyUSB0 and /dev/zigbee2 imported) didn't start. The pod stayed in the state "UnexpectedAdmissionError", another pod was created in the state "Pending", it failed, a new one was created, and so on.
In the pod description the following error is written (there is no log since the pod didn't start):
The situation repeats after each reboot.
When I kill all pods manually (the device manager one and also the application pods that don't work), new pods are started and everything works.
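The manual workaround could be scripted roughly as follows. This is a sketch under assumptions: the function name `restart_zigbee_stack` is hypothetical, the label selectors and the DaemonSet name are taken from the manifests posted in this thread, and the application namespace has to be passed in because it is templated in the Deployment.

```sh
# Hypothetical helper: recreate the device plugin pod, then the
# application pod that requests the devices.
restart_zigbee_stack() {
    app_namespace=$1

    # Recreate the device plugin pod first; in this thread the node
    # reported the devices again only after this pod was replaced.
    kubectl -n kube-system delete pod -l app.kubernetes.io/name=device-plugin-zigbee
    # Best-effort wait for the DaemonSet's replacement pod to be ready.
    kubectl -n kube-system rollout status daemonset/device-plugin-zigbee
    # Then recreate the application pod.
    kubectl -n "$app_namespace" delete pod -l app.kubernetes.io/name=zmq
}

# Example: restart_zigbee_stack <your-namespace>
```

This only automates the workaround; it does not explain why the plugin pod started during boot reports zero devices.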
Here is the log of the device manager container after the first boot (the situation in which it doesn't work, i.e. doesn't mount devices into the application container):
This leads to the state described above (non-working application container).
Here is the same log from container after the first one was killed (and re-created by k8s):
My environment is a Raspberry Pi 4/8GB (arm64), DietPi OS (a variant of Debian), USB drive. The system doesn't show any other issues. I am not too experienced with k8s devices, so I am not sure what can cause this strange behaviour.