Closed detscorn closed 1 year ago
Hi @detscorn, does the whole MacchiatoBin node crash, or just the generic-device-plugin container? Also, what exact tag are you using?
Hey @squat. No, just the DaemonSet container. I'm sorry, I don't know what you mean by tag. Are you talking about the image tag?
Yes, exactly :) what image tag / version of the plugin are you running?
image: squat/generic-device-plugin
It would be helpful to find the exact version of the plugin you are running. This helps rule out recent PRs to the project that might have introduced a breaking change.
Could you please try pinning the image to: squat/generic-device-plugin:bd0d5d18081e0b56b00271688f2ded15e6a1b3c3
This is right before we introduced USB device support and I'd like to rule that out as the issue.
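To illustrate why an untagged image makes the running version hard to identify: when an image reference has no tag, Kubernetes assumes `latest`, which can point at different builds over time. The sketch below is a simplified parser written for this thread (the helper names `image_tag` and `is_pinned` are hypothetical, and it is not the full image reference grammar).

```python
def image_tag(image: str) -> str:
    """Return the tag of a container image reference, or "latest" if none is given."""
    # Drop a digest suffix such as "@sha256:..." if present.
    name = image.split("@", 1)[0]
    # Only the last path segment can carry a tag; a colon earlier in the
    # reference belongs to a registry port such as "localhost:5000".
    last_segment = name.rsplit("/", 1)[-1]
    if ":" in last_segment:
        return last_segment.rsplit(":", 1)[1]
    return "latest"


def is_pinned(image: str) -> bool:
    """Treat an image as pinned when it has a digest or a tag other than "latest"."""
    return "@" in image or image_tag(image) != "latest"
```

With this sketch, `is_pinned("squat/generic-device-plugin")` is false, while the same reference with a commit-hash tag appended counts as pinned.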
Also, could you please share the output of kubectl describe pod -n kube-system <name of generic-device-plugin pod on MacchiatoBin>?
This can potentially help surface some extra information. Otherwise, we'll need to look through the Kubelet logs in your journald to see if there's any other helpful information.
So far so good. It's now been up for 6 min, but it has run longer than that before crashing with the other image.
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: default
Node: mach-1/192.168.253.204
Start Time: Mon, 27 Mar 2023 09:50:00 -0600
Labels: app.kubernetes.io/name=generic-device-plugin
controller-revision-hash=545d6d95b6
pod-template-generation=1
Annotations: <none>
Status: Running
IP: 10.42.3.27
IPs:
IP: 10.42.3.27
Controlled By: DaemonSet/generic-device-plugin
Containers:
generic-device-plugin:
Container ID: containerd://589b9bf44f8bb9a77fa8669d68f8d1f76456424c087f53fe28d84b45d020f91b
Image: squat/generic-device-plugin:bd0d5d18081e0b56b00271688f2ded15e6a1b3c3
Image ID: docker.io/squat/generic-device-plugin@sha256:479881f1f337562d8b81d77101327cc591973f5db3b4f3c6978fe7f9c2fda6c2
Port: 8080/TCP
Host Port: 0/TCP
Args:
--device
{"name": "adsb", "groups": [{"usb": [{"vendor": "0bda", "product": "2838"}]} ]}
State: Running
Started: Mon, 27 Mar 2023 09:50:47 -0600
Ready: True
Restart Count: 0
Limits:
cpu: 50m
memory: 10Mi
Requests:
cpu: 50m
memory: 10Mi
Environment: <none>
Mounts:
/dev from dev (rw)
/var/lib/kubelet/device-plugins from device-plugin (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r5l6n (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/device-plugins
HostPathType:
dev:
Type: HostPath (bare host directory volume)
Path: /dev
HostPathType:
kube-api-access-r5l6n:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: :NoExecute op=Exists
:NoSchedule op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type    Reason     Age    From               Message
----    ------     ----   ----               -------
Normal  Scheduled  3m43s  default-scheduler  Successfully assigned kube-system/generic-device-plugin-7n2bj to mach-1
Normal  Pulling    3m42s  kubelet            Pulling image "squat/generic-device-plugin:bd0d5d18081e0b56b00271688f2ded15e6a1b3c3"
Normal  Pulled     2m58s  kubelet            Successfully pulled image "squat/generic-device-plugin:bd0d5d18081e0b56b00271688f2ded15e6a1b3c3" in 44.423511929s (44.423543331s including waiting)
Normal  Created    2m58s  kubelet            Created container generic-device-plugin
Normal  Started    2m56s  kubelet            Started container generic-device-plugin
Ah ok, so you're using the recently added USB feature of generic-device-plugin. Indeed, if the USB code isn't exercised and the plugin doesn't crash, then there must be something buggy in the USB feature.
cc @duckfullstop maybe you have some idea of what might be going on / how to debug?
@detscorn, can you please bump back up to a recent, explicit tag of the plugin, e.g. ghcr.io/squat/generic-device-plugin:944bcffabd132cfcbdf3caa39ba3b6a979a0861d, and share the output of that kubectl describe... command again once the plugin has crashed?
I might not be able to help too much in the short term due to current personal health shenanigans, but in the interim, could you provide the output of ls -Rla /sys/bus/usb/devices?
Sorry to hear that @duckfullstop! Take care of yourself above all else
Sorry, I was in meetings. The plugin hasn't crashed with the image you had me load. I'll load the newer image and give you the output of describe in a moment.
mach-1:~$ ls -Rla /sys/bus/usb/devices
/sys/bus/usb/devices:
total 0
drwxr-xr-x 2 root root 0 Mar 27 17:06 .
drwxr-xr-x 4 root root 0 Jan 1 1970 ..
lrwxrwxrwx 1 root root 0 Feb 14 2019 1-0:1.0 -> ../../../devices/platform/cp0/cp0:config-space@f2000000/f2500000.usb3/usb1/1-0:1.0
lrwxrwxrwx 1 root root 0 Feb 14 2019 1-1 -> ../../../devices/platform/cp0/cp0:config-space@f2000000/f2500000.usb3/usb1/1-1
lrwxrwxrwx 1 root root 0 Feb 14 2019 1-1:1.0 -> ../../../devices/platform/cp0/cp0:config-space@f2000000/f2500000.usb3/usb1/1-1/1-1:1.0
lrwxrwxrwx 1 root root 0 Feb 14 2019 1-1:1.1 -> ../../../devices/platform/cp0/cp0:config-space@f2000000/f2500000.usb3/usb1/1-1/1-1:1.1
lrwxrwxrwx 1 root root 0 Feb 14 2019 2-0:1.0 -> ../../../devices/platform/cp0/cp0:config-space@f2000000/f2500000.usb3/usb2/2-0:1.0
lrwxrwxrwx 1 root root 0 Feb 14 2019 3-0:1.0 -> ../../../devices/platform/cp0/cp0:config-space@f2000000/f2510000.usb3/usb3/3-0:1.0
lrwxrwxrwx 1 root root 0 Feb 14 2019 4-0:1.0 -> ../../../devices/platform/cp0/cp0:config-space@f2000000/f2510000.usb3/usb4/4-0:1.0
lrwxrwxrwx 1 root root 0 Feb 14 2019 5-0:1.0 -> ../../../devices/platform/cp1/cp1:config-space@f4000000/f4500000.usb3/usb5/5-0:1.0
lrwxrwxrwx 1 root root 0 Feb 14 2019 6-0:1.0 -> ../../../devices/platform/cp1/cp1:config-space@f4000000/f4500000.usb3/usb6/6-0:1.0
lrwxrwxrwx 1 root root 0 Feb 14 2019 usb1 -> ../../../devices/platform/cp0/cp0:config-space@f2000000/f2500000.usb3/usb1
lrwxrwxrwx 1 root root 0 Feb 14 2019 usb2 -> ../../../devices/platform/cp0/cp0:config-space@f2000000/f2500000.usb3/usb2
lrwxrwxrwx 1 root root 0 Feb 14 2019 usb3 -> ../../../devices/platform/cp0/cp0:config-space@f2000000/f2510000.usb3/usb3
lrwxrwxrwx 1 root root 0 Feb 14 2019 usb4 -> ../../../devices/platform/cp0/cp0:config-space@f2000000/f2510000.usb3/usb4
lrwxrwxrwx 1 root root 0 Feb 14 2019 usb5 -> ../../../devices/platform/cp1/cp1:config-space@f4000000/f4500000.usb3/usb5
lrwxrwxrwx 1 root root 0 Feb 14 2019 usb6 -> ../../../devices/platform/cp1/cp1:config-space@f4000000/f4500000.usb3/usb6
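For anyone following along, here is a rough Python sketch of how one could check which of these sysfs entries match the vendor/product pair from the --device argument (0bda:2838). It relies on the standard `idVendor` and `idProduct` files that Linux exposes for each USB device in sysfs. The function name `find_usb_devices` is made up for this example, and this is not the plugin's actual discovery code.

```python
from pathlib import Path


def find_usb_devices(vendor: str, product: str, root: str = "/sys/bus/usb/devices"):
    """Return the names of sysfs USB entries whose idVendor/idProduct match."""
    root_path = Path(root)
    if not root_path.is_dir():
        # No USB sysfs tree here (for example, not running on Linux).
        return []
    matches = []
    for entry in sorted(root_path.iterdir()):
        vendor_file = entry / "idVendor"
        product_file = entry / "idProduct"
        # Interface entries such as "1-1:1.0" carry no idVendor/idProduct files.
        if not (vendor_file.is_file() and product_file.is_file()):
            continue
        found_vendor = vendor_file.read_text().strip().lower()
        found_product = product_file.read_text().strip().lower()
        if found_vendor == vendor.lower() and found_product == product.lower():
            matches.append(entry.name)
    return matches
```

Calling `find_usb_devices("0bda", "2838")` on the node would list the matching device entries, if any are attached.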
Thanks. Yes, that image has no USB device support, so naturally it's not a solution, but it helps us identify the source of the issue. BTW, do the non-MacchiatoBin nodes also have adsb devices connected to them?
I have used the adsb devices on the pi4 successfully, but currently each cluster has only 1 adsb device.
Here is the "describe pod" output:
genericDevicePlugin]$ kubectl --kubeconfig ../k3s-edge.yaml describe pod generic-device-plugin-8p5vd -n kube-system
Name: generic-device-plugin-8p5vd
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: default
Node: mach-1/192.168.253.204
Start Time: Mon, 27 Mar 2023 11:05:51 -0600
Labels: app.kubernetes.io/name=generic-device-plugin
controller-revision-hash=6d49b8849c
pod-template-generation=1
Annotations: <none>
Status: Running
IP: 10.42.3.28
IPs:
IP: 10.42.3.28
Controlled By: DaemonSet/generic-device-plugin
Containers:
generic-device-plugin:
Container ID: containerd://902f7211b5489fac5a5ef86f7e74d48b25c7e4992feb7e01bfbeeed838ed0f23
Image: squat/generic-device-plugin:944bcffabd132cfcbdf3caa39ba3b6a979a0861d
Image ID: docker.io/squat/generic-device-plugin@sha256:0b622cbffac78598d46b7c5fa4f186235f33901182ac5c46cff86452bad1e06e
Port: 8080/TCP
Host Port: 0/TCP
Args:
--device
{"name": "adsb", "groups": [{"usb": [{"vendor": "0bda", "product": "2838"}]} ]}
State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Mon, 27 Mar 2023 11:16:45 -0600
Finished: Mon, 27 Mar 2023 11:17:41 -0600
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Mon, 27 Mar 2023 11:15:05 -0600
Finished: Mon, 27 Mar 2023 11:16:01 -0600
Ready: False
Restart Count: 4
Limits:
cpu: 50m
memory: 10Mi
Requests:
cpu: 50m
memory: 10Mi
Environment: <none>
Mounts:
/dev from dev (rw)
/var/lib/kubelet/device-plugins from device-plugin (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zp49n (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/device-plugins
HostPathType:
dev:
Type: HostPath (bare host directory volume)
Path: /dev
HostPathType:
kube-api-access-zp49n:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: :NoExecute op=Exists
:NoSchedule op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type     Reason     Age                  From               Message
----     ------     ----                 ----               -------
Normal   Scheduled  11m                  default-scheduler  Successfully assigned kube-system/generic-device-plugin-8p5vd to mach-1
Normal   Pulling    11m                  kubelet            Pulling image "squat/generic-device-plugin:944bcffabd132cfcbdf3caa39ba3b6a979a0861d"
Normal   Pulled     11m                  kubelet            Successfully pulled image "squat/generic-device-plugin:944bcffabd132cfcbdf3caa39ba3b6a979a0861d" in 44.044855976s (44.044889698s including waiting)
Normal   Created    67s (x5 over 11m)    kubelet            Created container generic-device-plugin
Normal   Pulled     67s (x4 over 5m43s)  kubelet            Container image "squat/generic-device-plugin:944bcffabd132cfcbdf3caa39ba3b6a979a0861d" already present on machine
Normal   Started    65s (x5 over 11m)    kubelet            Started container generic-device-plugin
Warning  BackOff    9s (x7 over 4m47s)   kubelet            Back-off restarting failed container
Amazing! Thanks for that :) the answer to the issue was in that output
State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Mon, 27 Mar 2023 11:16:45 -0600
Finished: Mon, 27 Mar 2023 11:17:41 -0600
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Mon, 27 Mar 2023 11:15:05 -0600
Finished: Mon, 27 Mar 2023 11:16:01 -0600
The plugin is getting OOM killed! Can you try doubling the memory allocation and seeing if it still gets killed after some time? I wonder if we have a memory leak or if it indeed was simply too little for that node. It would be great to see some Prometheus metrics of the memory use of that pod on your cluster to confirm whether there is a leak.
Good catch! I totally missed that! Thanks for the help! I'll double the memory and see if that helps.
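As a side note on reading that output: exit code 137 is 128 + 9, i.e. the process was killed by signal 9 (SIGKILL), which is what the kernel OOM killer sends. The same check can be scripted against the JSON from `kubectl get pod <name> -n kube-system -o json`. The snippet below is only a sketch: the helper names are invented here, and it assumes a Linux host for the signal lookup.

```python
import signal


def describe_exit_code(exit_code: int) -> str:
    """Explain a container exit code; values above 128 mean "killed by signal N"."""
    if exit_code > 128:
        signum = exit_code - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # Signal number has no name on this platform.
            return f"killed by signal {signum}"
    return f"exited with status {exit_code}"


def oom_killed_containers(pod: dict) -> list:
    """Return names of containers whose current or last state is OOMKilled."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                names.append(status["name"])
                break  # count each container once
    return names
```

Loading the pod JSON with `json.load` and passing it to `oom_killed_containers` would flag the generic-device-plugin container in the state shown above.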
For what it's worth I'm not seeing memory leaking issues on my install - did check my Prometheus just to be sure, but I'm hovering at around 8MiB of 10MiB limits.
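To put those numbers in perspective, 8MiB of a 10MiB limit leaves only about 20% headroom, so a small difference on another platform could plausibly tip the container over. Here is a small sketch of that arithmetic. `parse_memory` and `headroom` are hypothetical helpers, and the parser handles only integer values with the Ki/Mi/Gi suffixes, not the full Kubernetes quantity syntax.

```python
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as "10Mi" to bytes (integer Ki/Mi/Gi values only)."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    # No recognized suffix: treat the value as plain bytes.
    return int(quantity)


def headroom(usage: str, limit: str) -> float:
    """Return the fraction of the limit that is still unused."""
    return 1 - parse_memory(usage) / parse_memory(limit)
```

For example, `parse_memory("10Mi")` is 10485760 bytes, and doubling the limit to 20Mi with the same 8Mi usage raises the headroom from roughly 0.2 to 0.6.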
If you're still getting slaughtered by the OOM reaper, check to make sure you're not under memory pressure from other stuff - I've had this problem before and it ended up being caused by having swap on, but your mileage will vary!
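One quick way to check whether swap is on for a node is to look at the `SwapTotal` line of `/proc/meminfo`, which is reported in kB and is 0 when no swap is configured. A minimal sketch follows (function names are made up for this example, and it assumes a Linux node):

```python
def swap_total_kib(meminfo_text: str) -> int:
    """Extract the SwapTotal value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # Line format: "SwapTotal:       2097148 kB"
            return int(line.split()[1])
    return 0


def swap_enabled(path: str = "/proc/meminfo") -> bool:
    """Report whether the host has any swap configured."""
    with open(path) as f:
        return swap_total_kib(f.read()) > 0
```

Running `swap_enabled()` directly on the node (not inside a container with a restricted view of /proc) gives a quick yes/no answer.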
It hasn't crashed since I increased the memory requirements! I appreciate everyone's help and incredibly fast response time!!
Love it :)) in that case I propose we close the issue and reopen if the issue persists <3
Hello! I'm using this in 2 different k3s clusters, and I have MacchiatoBin SBCs (along with Rockpro64s and Pi-4s) in both. The MacchiatoBins keep crashing every few minutes (CrashLoopBackOff). While they are running they do seem to work, as they will schedule workloads on the right node. The logs on the pod do not seem to indicate what is wrong; they just look like start-up messages.
Logs
Architecture
Sorry for the lack of information; there just doesn't seem to be much. Let me know if you need anything else.