squat / generic-device-plugin

A Kubernetes device plugin to schedule generic Linux devices
Apache License 2.0

Device Synchronization Lag in Pod Readiness #59

Closed ndgoodman closed 9 months ago

ndgoodman commented 10 months ago

Hello Squat,

We're experiencing a problem where the generic device plugin pod becomes ready before the devices on our host are actually ready. This leads to complications for another pod relying on this device plugin, as it can't start due to unmet requirements. Currently, we're resolving this by restarting the generic device plugin pod, which appears to solve the problem.

Is there a method to configure the generic device plugin so that it only reaches a ready state after the host devices are fully prepared?

squat commented 10 months ago

Hi @ndgoodman, just to make sure I understand: the device plugin is starting and registering itself with the Kubelet before your devices are discoverable on the filesystem, and thus Kubernetes doesn't know about your devices?

This is concerning; a key feature is that the device plugin should always update the Kubelet when new devices are available or disappear.

There isn't a mechanism for delaying registration with the Kubelet today. There's a good reason for this: clusters are heterogeneous and there's no reason why all nodes should require all devices to be present for that node to be considered ready. More importantly, I'm surprised that new devices aren't being discovered by the device plugin as they come and go. That is definitely something to investigate.
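To make "discovered as they come and go" concrete, here is a minimal sketch of the general idea: rescan a path pattern periodically and compare each scan with the previous one. This is not the plugin's actual code. The names `scan` and `diffDevices`, the `/dev/video*` pattern, and the one-second interval are all assumptions made for illustration, and a real plugin would keep looping and report changes to the Kubelet instead of printing them.

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
	"time"
)

// scan returns the set of paths currently matching pattern.
func scan(pattern string) (map[string]bool, error) {
	matches, err := filepath.Glob(pattern)
	if err != nil {
		return nil, err // only returned for a malformed pattern
	}
	set := make(map[string]bool, len(matches))
	for _, m := range matches {
		set[m] = true
	}
	return set, nil
}

// diffDevices reports which paths appeared and which disappeared
// between two scans. Both results are sorted for stable output.
func diffDevices(prev, cur map[string]bool) (added, removed []string) {
	for p := range cur {
		if !prev[p] {
			added = append(added, p)
		}
	}
	for p := range prev {
		if !cur[p] {
			removed = append(removed, p)
		}
	}
	sort.Strings(added)
	sort.Strings(removed)
	return added, removed
}

func main() {
	const pattern = "/dev/video*" // hypothetical device pattern
	prev := map[string]bool{}
	// Three rounds keep the sketch finite; a long-running plugin would not stop.
	for i := 0; i < 3; i++ {
		cur, err := scan(pattern)
		if err != nil {
			fmt.Println("scan failed:", err)
			return
		}
		added, removed := diffDevices(prev, cur)
		fmt.Printf("added=%v removed=%v\n", added, removed)
		prev = cur
		time.Sleep(time.Second)
	}
}
```

If a device shows up on the host but never appears in a later scan like this, the pattern or the container's view of the host filesystem is the first thing to check.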

A dirty hack to achieve the delay you're looking for could be to add an init container to the DaemonSet that only exits once the device in question is discovered.

squat commented 9 months ago

Discovering devices on-the-fly as they appear on the node should be existing functionality. If this isn't working, please re-open the issue so we can debug further. Closing this for now!