squat / generic-device-plugin

A Kubernetes device plugin to schedule generic Linux devices
Apache License 2.0

Keeps crashing on a MacchiatoBin SBC #11

Closed · detscorn closed 1 year ago

detscorn commented 1 year ago

Hello! I'm using this in two different k3s clusters, and I have a MacchiatoBin SBC (along with RockPro64s and Pi 4s) in both. The MacchiatoBins keep crashing every few minutes (CrashLoopBackOff). While they are running they do seem to work, as they will schedule workloads on the right node. The logs on the pod do not seem to indicate what is wrong; they just look like start-up messages.

Logs

{"caller":"main.go:227","msg":"Starting the generic-device-plugin for \"squat.ai/adsb\".","ts":"2023-03-27T14:40:03.571928222Z"}
{"caller":"plugin.go:116","level":"info","msg":"listening on Unix socket","resource":"squat.ai/adsb","socket":"/var/lib/kubelet/device-plugins/gdp-c3F1YXQuYWkvYWRzYg==-1679928003.sock","ts":"2023-03-27T14:40:03.572079912Z"}
{"caller":"plugin.go:123","level":"info","msg":"starting gRPC server","resource":"squat.ai/adsb","ts":"2023-03-27T14:40:03.572447257Z"}
{"caller":"plugin.go:138","level":"info","msg":"waiting for the gRPC server to be ready","resource":"squat.ai/adsb","ts":"2023-03-27T14:40:03.572464418Z"}
{"caller":"plugin.go:150","level":"info","msg":"the gRPC server is ready","resource":"squat.ai/adsb","ts":"2023-03-27T14:40:03.573457284Z"}
{"caller":"plugin.go:188","level":"info","msg":"registering plugin with kubelet","resource":"squat.ai/adsb","ts":"2023-03-27T14:40:03.573553451Z"}
{"caller":"generic.go:215","level":"info","msg":"starting listwatch","resource":"squat.ai/adsb","ts":"2023-03-27T14:40:03.771650039Z"}

Architecture

mach-1:~$ cat /etc/*release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

mach-1:~$ uname -a
Linux mach-1 5.1.0-trunk-arm64 #1 SMP Debian 5.1.10-1~exp1sr2 (2019-06-18) aarch64 GNU/Linux

Sorry for very little information, there just doesn't seem to be much. Let me know if you need anything else.

squat commented 1 year ago

Hi @detscorn, does the whole MacchiatoBin node crash, or just the generic-device-plugin container? Also, what exact tag are you using?

detscorn commented 1 year ago

Hey @squat. No, just the DaemonSet container. I'm sorry, I don't know what you mean by tag. Are you talking about the image tag?

squat commented 1 year ago

Yes, exactly :) what image tag / version of the plugin are you running?

detscorn commented 1 year ago

image: squat/generic-device-plugin

squat commented 1 year ago

It would be helpful to find the exact version of the plugin you are running. This helps rule out recent PRs to the project that might have introduced a breaking change.

Could you please try pinning the image to: squat/generic-device-plugin:bd0d5d18081e0b56b00271688f2ded15e6a1b3c3

This is right before we introduced USB device support and I'd like to rule that out as the issue.
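For reference, pinning the tag means setting the container image in the DaemonSet manifest. This is a minimal sketch; only the image line matters, and the surrounding fields mirror the standard DaemonSet spec rather than the exact manifest in use here:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: generic-device-plugin
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: generic-device-plugin
          # Pin to the commit right before USB support was introduced,
          # instead of an unpinned "squat/generic-device-plugin":
          image: squat/generic-device-plugin:bd0d5d18081e0b56b00271688f2ded15e6a1b3c3
```

Pinning to a digest-like commit tag guarantees every node runs exactly the same build, which is what makes the bisection meaningful.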

squat commented 1 year ago

Also, could you please share the output of kubectl describe pod -n kube-system <name of generic-device-plugin pod on MacchiatoBin>? This can potentially surface some extra information. Otherwise, we'll need to look through the kubelet logs in your journald to see if there's any other helpful information.

detscorn commented 1 year ago

So far so good. It's now been up for 6 min, but it's run longer than that before crashing with the other image.

Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      default
Node:                 mach-1/192.168.253.204
Start Time:           Mon, 27 Mar 2023 09:50:00 -0600
Labels:               app.kubernetes.io/name=generic-device-plugin
                      controller-revision-hash=545d6d95b6
                      pod-template-generation=1
Annotations:          <none>
Status:               Running
IP:                   10.42.3.27
IPs:
  IP:           10.42.3.27
Controlled By:  DaemonSet/generic-device-plugin
Containers:
  generic-device-plugin:
    Container ID:  containerd://589b9bf44f8bb9a77fa8669d68f8d1f76456424c087f53fe28d84b45d020f91b
    Image:         squat/generic-device-plugin:bd0d5d18081e0b56b00271688f2ded15e6a1b3c3
    Image ID:      docker.io/squat/generic-device-plugin@sha256:479881f1f337562d8b81d77101327cc591973f5db3b4f3c6978fe7f9c2fda6c2
    Port:          8080/TCP
    Host Port:     0/TCP
    Args:
      --device
      {"name": "adsb", "groups": [{"usb": [{"vendor": "0bda", "product": "2838"}]} ]}
    State:          Running
      Started:      Mon, 27 Mar 2023 09:50:47 -0600
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     50m
      memory:  10Mi
    Requests:
      cpu:        50m
      memory:     10Mi
    Environment:  <none>
    Mounts:
      /dev from dev (rw)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r5l6n (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  dev:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:  
  kube-api-access-r5l6n:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 :NoExecute op=Exists
                             :NoSchedule op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  3m43s  default-scheduler  Successfully assigned kube-system/generic-device-plugin-7n2bj to mach-1
  Normal  Pulling    3m42s  kubelet            Pulling image "squat/generic-device-plugin:bd0d5d18081e0b56b00271688f2ded15e6a1b3c3"
  Normal  Pulled     2m58s  kubelet            Successfully pulled image "squat/generic-device-plugin:bd0d5d18081e0b56b00271688f2ded15e6a1b3c3" in 44.423511929s (44.423543331s including waiting)
  Normal  Created    2m58s  kubelet            Created container generic-device-plugin
  Normal  Started    2m56s  kubelet            Started container generic-device-plugin
squat commented 1 year ago

Ah ok, so you're using the recently added USB feature of generic-device-plugin. Indeed, if the USB code isn't exercised and the plugin doesn't crash, then there must be something buggy in the USB feature.

cc @duckfullstop maybe you have some idea of what might be going on / how to debug?

@detscorn, can you please bump back up to a recent, explicit tag of the plugin and share the output of that kubectl describe... command again once the plugin has already crashed? e.g. ghcr.io/squat/generic-device-plugin:944bcffabd132cfcbdf3caa39ba3b6a979a0861d

duckfullstop commented 1 year ago

I might not be able to help too much in the short term due to current personal health shenanigans, but in the interim, could you provide the output of ls -Rla /sys/bus/usb/devices?

squat commented 1 year ago

Sorry to hear that @duckfullstop! Take care of yourself above all else

detscorn commented 1 year ago

Sorry, I was in meetings. The plugin hasn't crashed with the image you had me load. I'll load the newer image and give you the output of describe in a moment.

detscorn commented 1 year ago
mach-1:~$ ls -Rla /sys/bus/usb/devices
/sys/bus/usb/devices:
total 0
drwxr-xr-x 2 root root 0 Mar 27 17:06 .
drwxr-xr-x 4 root root 0 Jan  1  1970 ..
lrwxrwxrwx 1 root root 0 Feb 14  2019 1-0:1.0 -> ../../../devices/platform/cp0/cp0:config-space@f2000000/f2500000.usb3/usb1/1-0:1.0
lrwxrwxrwx 1 root root 0 Feb 14  2019 1-1 -> ../../../devices/platform/cp0/cp0:config-space@f2000000/f2500000.usb3/usb1/1-1
lrwxrwxrwx 1 root root 0 Feb 14  2019 1-1:1.0 -> ../../../devices/platform/cp0/cp0:config-space@f2000000/f2500000.usb3/usb1/1-1/1-1:1.0
lrwxrwxrwx 1 root root 0 Feb 14  2019 1-1:1.1 -> ../../../devices/platform/cp0/cp0:config-space@f2000000/f2500000.usb3/usb1/1-1/1-1:1.1
lrwxrwxrwx 1 root root 0 Feb 14  2019 2-0:1.0 -> ../../../devices/platform/cp0/cp0:config-space@f2000000/f2500000.usb3/usb2/2-0:1.0
lrwxrwxrwx 1 root root 0 Feb 14  2019 3-0:1.0 -> ../../../devices/platform/cp0/cp0:config-space@f2000000/f2510000.usb3/usb3/3-0:1.0
lrwxrwxrwx 1 root root 0 Feb 14  2019 4-0:1.0 -> ../../../devices/platform/cp0/cp0:config-space@f2000000/f2510000.usb3/usb4/4-0:1.0
lrwxrwxrwx 1 root root 0 Feb 14  2019 5-0:1.0 -> ../../../devices/platform/cp1/cp1:config-space@f4000000/f4500000.usb3/usb5/5-0:1.0
lrwxrwxrwx 1 root root 0 Feb 14  2019 6-0:1.0 -> ../../../devices/platform/cp1/cp1:config-space@f4000000/f4500000.usb3/usb6/6-0:1.0
lrwxrwxrwx 1 root root 0 Feb 14  2019 usb1 -> ../../../devices/platform/cp0/cp0:config-space@f2000000/f2500000.usb3/usb1
lrwxrwxrwx 1 root root 0 Feb 14  2019 usb2 -> ../../../devices/platform/cp0/cp0:config-space@f2000000/f2500000.usb3/usb2
lrwxrwxrwx 1 root root 0 Feb 14  2019 usb3 -> ../../../devices/platform/cp0/cp0:config-space@f2000000/f2510000.usb3/usb3
lrwxrwxrwx 1 root root 0 Feb 14  2019 usb4 -> ../../../devices/platform/cp0/cp0:config-space@f2000000/f2510000.usb3/usb4
lrwxrwxrwx 1 root root 0 Feb 14  2019 usb5 -> ../../../devices/platform/cp1/cp1:config-space@f4000000/f4500000.usb3/usb5
lrwxrwxrwx 1 root root 0 Feb 14  2019 usb6 -> ../../../devices/platform/cp1/cp1:config-space@f4000000/f4500000.usb3/usb6
squat commented 1 year ago

Thanks. Yes, that image has no USB device support, so naturally it's not a solution, but it helps us identify the source of the issue. BTW, do the non-MacchiatoBin nodes also have adsb devices connected to them?

detscorn commented 1 year ago

I have used the adsb devices on the pi4 successfully, but currently each cluster has only 1 adsb device.

Here is the "describe pod" output:

genericDevicePlugin]$ kubectl --kubeconfig ../k3s-edge.yaml describe pod generic-device-plugin-8p5vd -n kube-system
Name:                 generic-device-plugin-8p5vd
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      default
Node:                 mach-1/192.168.253.204
Start Time:           Mon, 27 Mar 2023 11:05:51 -0600
Labels:               app.kubernetes.io/name=generic-device-plugin
                      controller-revision-hash=6d49b8849c
                      pod-template-generation=1
Annotations:          <none>
Status:               Running
IP:                   10.42.3.28
IPs:
  IP:           10.42.3.28
Controlled By:  DaemonSet/generic-device-plugin
Containers:
  generic-device-plugin:
    Container ID:  containerd://902f7211b5489fac5a5ef86f7e74d48b25c7e4992feb7e01bfbeeed838ed0f23
    Image:         squat/generic-device-plugin:944bcffabd132cfcbdf3caa39ba3b6a979a0861d
    Image ID:      docker.io/squat/generic-device-plugin@sha256:0b622cbffac78598d46b7c5fa4f186235f33901182ac5c46cff86452bad1e06e
    Port:          8080/TCP
    Host Port:     0/TCP
    Args:
      --device
      {"name": "adsb", "groups": [{"usb": [{"vendor": "0bda", "product": "2838"}]} ]}
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 27 Mar 2023 11:16:45 -0600
      Finished:     Mon, 27 Mar 2023 11:17:41 -0600
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 27 Mar 2023 11:15:05 -0600
      Finished:     Mon, 27 Mar 2023 11:16:01 -0600
    Ready:          False
    Restart Count:  4
    Limits:
      cpu:     50m
      memory:  10Mi
    Requests:
      cpu:        50m
      memory:     10Mi
    Environment:  <none>
    Mounts:
      /dev from dev (rw)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zp49n (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  dev:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:  
  kube-api-access-zp49n:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 :NoExecute op=Exists
                             :NoSchedule op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  11m                  default-scheduler  Successfully assigned kube-system/generic-device-plugin-8p5vd to mach-1
  Normal   Pulling    11m                  kubelet            Pulling image "squat/generic-device-plugin:944bcffabd132cfcbdf3caa39ba3b6a979a0861d"
  Normal   Pulled     11m                  kubelet            Successfully pulled image "squat/generic-device-plugin:944bcffabd132cfcbdf3caa39ba3b6a979a0861d" in 44.044855976s (44.044889698s including waiting)
  Normal   Created    67s (x5 over 11m)    kubelet            Created container generic-device-plugin
  Normal   Pulled     67s (x4 over 5m43s)  kubelet            Container image "squat/generic-device-plugin:944bcffabd132cfcbdf3caa39ba3b6a979a0861d" already present on machine
  Normal   Started    65s (x5 over 11m)    kubelet            Started container generic-device-plugin
  Warning  BackOff    9s (x7 over 4m47s)   kubelet            Back-off restarting failed container
squat commented 1 year ago

Amazing! Thanks for that :) The answer to the issue was in that output:

    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 27 Mar 2023 11:16:45 -0600
      Finished:     Mon, 27 Mar 2023 11:17:41 -0600
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 27 Mar 2023 11:15:05 -0600
      Finished:     Mon, 27 Mar 2023 11:16:01 -0600

The plugin is getting OOM killed! Exit code 137 is 128 + 9, i.e. the container was killed with SIGKILL, in this case by the kernel's OOM killer. Can you try doubling the memory allocation and seeing if it still gets killed after some time? I wonder if we have a memory leak or if it indeed was simply too little memory for that node. It would be great to see some Prometheus metrics of the memory use of that pod on your cluster to confirm whether there is a leak.
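For anyone landing here later, doubling the allocation would look like this in the DaemonSet's container spec. The 20Mi value is simply double the limit shown in the describe output above, not a tested recommendation:

```yaml
resources:
  requests:
    cpu: 50m
    memory: 20Mi
  limits:
    cpu: 50m
    memory: 20Mi
```

Keeping requests equal to limits preserves the pod's Guaranteed QoS class, which matters for a node-critical DaemonSet like this one.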

detscorn commented 1 year ago

Good catch! I totally missed that! Thanks for the help! I'll double the memory and see if that helps.

duckfullstop commented 1 year ago

For what it's worth, I'm not seeing memory leak issues on my install. I did check my Prometheus just to be sure, but I'm hovering at around 8MiB of the 10MiB limit.

If you're still getting slaughtered by the OOM reaper, check to make sure you're not under memory pressure from other stuff. I've had this problem before and it ended up being caused by having swap enabled, but your mileage may vary!
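A quick way to check the swap situation on the node itself. These are standard procps/util-linux commands, nothing specific to this plugin:

```shell
# Show overall memory and swap usage on the node.
free -h

# List active swap devices; no output means swap is off.
swapon --show

# Kubernetes has traditionally expected swap to be disabled on nodes;
# to turn it off until the next reboot:
# sudo swapoff -a
```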

detscorn commented 1 year ago

It hasn't crashed since I increased the memory limits! I appreciate everyone's help and the incredibly fast response time!!

squat commented 1 year ago

Love it :)) In that case I propose we close the issue and reopen if it persists <3