squat / generic-device-plugin

A Kubernetes device plugin to schedule generic Linux devices
Apache License 2.0

UnexpectedAdmissionError occurs after node restart #33

Closed artem-zinnatullin closed 1 year ago

artem-zinnatullin commented 1 year ago

Hi!

I use generic-device-plugin to expose a Zigbee USB stick to Zigbee2Mqtt:

  - --device
  - '{"name": "zigbee", "groups": [{"paths": [{"path": "/dev/ttyACM0"}]}]}'

It works, and I'm able to match the node by requesting squat.ai/zigbee: 1 in a Deployment with replicas: 1.

However, if the node (it runs as both K8S controller and K8S worker) restarts, I start seeing many instances of that pod in the UnexpectedAdmissionError state, despite it being run as a Deployment with replicas: 1.

kubectl describe pod gives this:

Reason:           UnexpectedAdmissionError
Message:          Pod was rejected: Allocate failed due to no healthy devices present; cannot allocate unhealthy devices squat.ai/zigbee, which is unexpected

My understanding is that when K8S tries to run my Deployment, generic-device-plugin hasn't started yet. When it does start, there is some race condition: K8S tries to spin up many pods with access to the same device, only one pod succeeds, and the others fail with UnexpectedAdmissionError.

I wonder if there is any solution to this?

squat commented 1 year ago

@artem-zinnatullin thanks for reporting this. Can you please share any relevant logs from the kubelet on the affected node at the time of the errors?

One thing that occurs to me is that the issue might arise because this project is not strictly observing the initialization order required by Kubernetes. Strictly speaking, the gRPC server for the plugin must be started before the plugin registers itself with the Kubelet; however, this plugin does both concurrently [0]. This could be causing the issue and would not be difficult to correct.

[0] https://github.com/squat/generic-device-plugin/blob/main/deviceplugin/plugin.go#L115-L160

squat commented 1 year ago

Hi @artem-zinnatullin, can you please update to the latest version of the plugin now that #35 is merged? I wonder if this might have any effect on the UnexpectedAdmissionError message you are seeing. If not, then we will need to see logs from your Kubelet and scheduler before we can proceed any further.

FWIW, I see similar messages on clusters running the NVIDIA device plugin. I wonder to what extent this is common to device plugins.

squat commented 1 year ago

Oh, one last question: what version of k8s are you running?

squat commented 1 year ago

Closing for now. Please re-open if you think there is a problem with the device plugin or you need more help!