This is kind of outside the scope of typical usage of a device plugin.
Pods cannot release allocated devices while they are running, even if they do not use them; the scheduler has granted them exclusive access to the resource for their lifetime.
However, one way to implement similar logic would be to use lifecycle hooks or a sidecar container that does not release the device but instead creates another file that matches the device's glob pattern once your main container has finished starting up. This way your container keeps exclusive access to its file while other containers can still start up. The hook/sidecar would also need to delete the originally allocated device so that the number of unallocated dummy devices stays at the desired number once the pod is deleted.
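To make this concrete, here is a rough sketch of the hook variant. Everything in it is hypothetical: it assumes the plugin is configured to advertise files matching `/run/startup-slots/slot-*` as a `squat.ai/startup-slot` resource, that the pod can reach that host directory via a hostPath mount, and that the application exposes a local health endpoint the hook can poll.

```yaml
# Hypothetical sketch, not a supported pattern: the resource name,
# paths, image, and health endpoint are assumptions for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: java-app
spec:
  containers:
  - name: app
    image: registry.example.com/java-app:latest  # placeholder image
    resources:
      limits:
        squat.ai/startup-slot: 1  # assumed resource advertised by the plugin
    volumeMounts:
    - name: slots
      mountPath: /slots
    lifecycle:
      postStart:
        exec:
          command:
          - /bin/sh
          - -c
          # Wait until the app reports healthy, then create a fresh,
          # timestamp-named slot file so the scheduler can admit the
          # next Java pod. Deleting the file that was allocated to
          # *this* pod is glossed over here: the container has no easy
          # way to know which file the kubelet picked for it.
          - |
            until wget -q -O /dev/null http://localhost:8080/q/health; do
              sleep 5
            done
            mktemp "/slots/slot-$(date +%s)-XXXXXX"
  volumes:
  - name: slots
    hostPath:
      path: /run/startup-slots  # assumed host directory watched by the plugin
      type: DirectoryOrCreate
```

A sidecar that watches the pod's readiness via the API server would be more robust than a blocking postStart hook, at the cost of extra RBAC; either way, the edge cases below still apply.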
This is a non-standard workaround for the scheduler, for sure, and might have funny edge cases: if, for example, the lifecycle hook fails to create a new device but does delete its own device, the number of devices eventually goes to 0. A safer/more predictable option might be to implement such logic in an extension to the Kube scheduler.
Lucas, it could be tricky in case of a node crash/reboot. Think of 20 pods running while the count for dummy devices is 10: you will have 30 devices after the reboot, so cleanup is required before the first pod should be scheduled.
Yes, exactly, this is the kind of edge case I wanted to describe: you cannot guarantee that a step runs, and thus the number of devices cannot be guaranteed to remain stable. As I mentioned, I don't think that the generic device plugin is the right tool for this job.
It sounds to me like you're essentially trying to create a kind of semaphore to synchronize access to a shared resource. If you are set on using the generic-device-plugin, there are alternative architectures you could employ.
For example, instead of having the Java container be responsible for deleting the dummy device files, you could deploy a daemonset that runs on all nodes and enforces that there are always at most X files in the directory. Files are all named with the timestamp of when they were created, and the daemonset deletes files in lexicographic order, ensuring that the files that remain can actually be allocated to new pods rather than belonging to old pods. Note: dummy device files are deleted after the Java container has bootstrapped but while it is still running. Furthermore, the device files are all created in a tmpfs on the host, so that when the node reboots, all files are guaranteed to be gone and no Java containers will start until the daemonset pod starts and creates new dummy device files.
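A rough sketch of what that enforcement daemonset could look like, assuming the slot files live in a host tmpfs at `/run/startup-slots` (so a reboot wipes them) and a pool size of 10; the image, paths, and names are made up for illustration:

```yaml
# Illustrative only; image, paths, and pool size are assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: startup-slot-manager
spec:
  selector:
    matchLabels:
      app: startup-slot-manager
  template:
    metadata:
      labels:
        app: startup-slot-manager
    spec:
      containers:
      - name: manager
        image: busybox:1.36
        command:
        - /bin/sh
        - -c
        - |
          MAX_SLOTS=10
          while true; do
            count=$(ls /slots | wc -l)
            # Too many files: trim the lexicographically smallest names,
            # i.e. the oldest timestamps, which belong to pods that are
            # already running and no longer need their slot.
            while [ "$count" -gt "$MAX_SLOTS" ]; do
              rm -f "/slots/$(ls /slots | sort | head -n 1)"
              count=$((count - 1))
            done
            # Too few files (e.g. right after a reboot wiped the tmpfs):
            # top the pool back up with fresh, timestamp-named slots.
            while [ "$count" -lt "$MAX_SLOTS" ]; do
              mktemp "/slots/slot-$(date +%s)-XXXXXX" > /dev/null
              count=$((count + 1))
            done
            sleep 5
          done
        volumeMounts:
        - name: slots
          mountPath: /slots
      volumes:
      - name: slots
        hostPath:
          path: /run/startup-slots  # /run is a tmpfs on most distros
          type: DirectoryOrCreate
```

There is still an inherent race between the kubelet allocating a file and the daemonset deleting it, which is part of why this stays firmly in workaround territory.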
To reiterate, "releasing a device" is easy: just delete the device file and create a new one that matches the device glob. The Kubernetes scheduler will let the pod holding the old device continue running and will also allocate the new device to a new, waiting pod. The challenging bit is orchestrating the deletion and creation of the dummy device files with the intended robustness.
Again, this is just one potential architecture that uses the generic-device-plugin. It has trade-offs and edge cases, as this is a non-standard application of the plugin and is not intended to be supported.
Closing for now since this is out of scope for the plugin. Maybe we can turn this into a GitHub discussion if we need to brainstorm more architectures. If we come up with a compelling solution we can add a doc for this.
Hi Lucas,
the generic-device-plugin can be perfectly used to limit the number of running pods on a specific node: just create a dummy device or a group with a certain count, add a device requirement to the pod, done. Very nice, very useful!

Now my question: how can I release a device while the pod is running, let's say after the pod is ready?

Why: Java, Jakarta EE, JEE, and Spring applications need a huge amount of CPU during startup but often not during runtime. So when a node with many Java applications restarts, they will not come up. When you do not define a CPU request, the Java pods do not come up, because no pod gets enough resources and they hit the startup timeout. When you define the CPU request high, you waste a lot of CPU resources once all pods are running. So it could be useful to somehow serialize the startup of the Java applications.

How: I know I can watch the pods and notice that a pod becomes ready (readiness probe), but how would I release the dummy device at that point, so that the next Java application can start up? Other pods would not be affected, because they do not have a dummy device request.

Thanks for your feedback! KR Robert
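For readers landing here, a minimal sketch of the "dummy device" setup the question describes, assuming the plugin's default `squat.ai` domain; the device name, glob, and image are illustrative, and the `--device` flag takes a device spec like the examples in the plugin's README:

```yaml
# Snippet of the generic-device-plugin daemonset's container args
# (illustrative names/paths): files matching the glob are advertised
# as the extended resource squat.ai/startup-slot on each node.
args:
- --device
- '{"name": "startup-slot", "groups": [{"paths": [{"path": "/run/startup-slots/slot-*"}]}]}'
---
# Each Java pod then requests one slot, so at most as many pods as
# there are slot files can be scheduled onto the node at once.
apiVersion: v1
kind: Pod
metadata:
  name: java-app
spec:
  containers:
  - name: app
    image: registry.example.com/java-app:latest  # placeholder
    resources:
      limits:
        squat.ai/startup-slot: 1
```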