nokia / CPU-Pooler

A Device Plugin for Kubernetes, which exposes the CPU cores as consumable Devices to the Kubernetes scheduler.
BSD 3-Clause "New" or "Revised" License
92 stars 22 forks

Allocatable CPU resources are lost when kubelet restarts #58

Closed antzjm closed 3 years ago

antzjm commented 3 years ago

Describe the bug Once kubelet.service is restarted, the exclusive and shared CPU resources are lost from the node. The problem can be resolved by restarting the cpu-device-plugin pod.

To Reproduce Steps to reproduce the behavior:

  1. systemctl restart kubelet.service
  2. kubectl describe node and observe the exclusive and shared CPU resources.


Levovar commented 3 years ago

Yes, it works like that, but it is by design.

In the Device Plugin - Kubelet interaction, the Kubelet is the server, not the client, so it is the DP which needs to seek out the Kubelet; the Kubelet does not know about the DPs. Also, the Kubelet does not store the registered devices in a persistent store, only in memory. So when you restart the Kubelet, these devices are lost, and Pooler needs to register itself again with the new Kubelet process. When you restart Pooler, it does exactly that.

Hope this answers your question.

antzjm commented 3 years ago

@Levovar Thanks for your answer. I also want to know if there is any workaround to avoid this phenomenon. This design, with CPU-Pooler depending on the kubelet's in-memory state, may not be stable in a production environment.

antzjm commented 3 years ago

@Levovar The resources are not lost 100% of the time when kubelet restarts. Sometimes I restart kubelet and the exclusive and shared CPU resources of the nodes are not lost.

Levovar commented 3 years ago

Hmm, interesting. I will ask our internal production team if they have seen this, and whether they have a workaround. I guess one could add a liveness probe that checks whether the resources managed by Pooler show up in the Allocatable field of the Node it is running on, and exits with an error if not.
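A probe of that shape could be approximated with a small shell check. A minimal sketch, assuming the value has already been read from the node's Allocatable field (the resource name `nokia.k8s.io/exclusive_caas` in the comment is an illustrative assumption, not a confirmed CPU-Pooler name):

```shell
#!/bin/sh
# Hypothetical liveness check: succeed only while the pooled resource is
# still advertised on the node. The caller is expected to read the value
# first, e.g. (assumed resource name):
#   kubectl get node "$NODE_NAME" \
#     -o jsonpath='{.status.allocatable.nokia\.k8s\.io/exclusive_caas}'
allocatable_ok() {
  # $1: the allocatable count read from the node ("" when the resource
  # disappeared after a kubelet restart).
  # A non-zero exit makes the kubelet restart the pod, which re-triggers
  # device-plugin registration.
  [ -n "$1" ] && [ "$1" != "0" ]
}
```

Exiting with an error here is enough: the container restart re-runs registration against the new kubelet process.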

But other than that, registration is a one-time event in other DP reference implementations too, as far as I can tell.

antzjm commented 3 years ago

I think this should be an optimization point for CPU-Pooler's next development. There is also an SR-IOV NIC device plugin in my infra, but it does not lose resources when kubelet restarts.

Levovar commented 3 years ago

I think it might be because of this: https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/blob/dd65f3f65dacaecb01176e139d4373d909a217ef/pkg/resources/server.go#L289

It effectively does the same as what I proposed above to be added in the form of a liveness probe: it checks for signs of a kubelet restart, and if one happened, it restarts itself to re-trigger registration. I think an independent, probe-based restart is a cleaner and more native solution, so we will add that to the Deployment specification of the DP.
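The kind of restart detection the linked SR-IOV code performs can be approximated with a file-timestamp comparison: if the kubelet's registration socket is newer than the plugin's own socket, the kubelet came up after the plugin registered, so the in-memory device registry is gone. A sketch, assuming the standard kubelet socket directory and a hypothetical plugin socket name:

```shell
#!/bin/sh
# Hypothetical kubelet-restart detector based on socket creation times.
# /var/lib/kubelet/device-plugins/kubelet.sock is the kubelet's standard
# registration endpoint; the plugin socket name is an assumption.
kubelet_restarted() {
  kubelet_sock="$1"  # e.g. /var/lib/kubelet/device-plugins/kubelet.sock
  plugin_sock="$2"   # e.g. /var/lib/kubelet/device-plugins/cpudp.sock (assumed)
  # True when kubelet.sock was (re)created after our own socket, i.e. the
  # kubelet restarted since we registered and we should exit to re-register.
  [ "$kubelet_sock" -nt "$plugin_sock" ]
}
```

The real SR-IOV implementation watches the socket from inside the plugin process; the timestamp form above is just the same signal expressed as something an exec probe could run.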

antzjm commented 3 years ago

I have looked up the DP document. Can I add a liveness probe by configuring the cpu-device-plugin DaemonSet YAML, or does it require modifying the source code?

antzjm commented 3 years ago

Here is the result I tested (screenshot attached). Not every kubelet restart loses the resources, so I guess the liveness probe does not address the root cause.

antzjm commented 3 years ago

@Levovar Can you share an example of the liveness probe?

Levovar commented 3 years ago

https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-command
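Such an exec probe can be wired into the plugin's pod spec without any code change. A sketch of the fragment, where the script path and timing values are illustrative assumptions, not something CPU-Pooler ships:

```yaml
# Hypothetical livenessProbe for the cpu-device-plugin container; the
# check script, its path, and the timings are placeholders.
livenessProbe:
  exec:
    command: ["/bin/sh", "/opt/bin/check-allocatable.sh"]
  initialDelaySeconds: 30
  periodSeconds: 60
  failureThreshold: 3
```

When the command exits non-zero failureThreshold times in a row, the kubelet restarts the container, which re-registers the devices.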

antzjm commented 3 years ago

@Levovar Thanks for your support. Hope CPU-Pooler will solve this issue in the next release :)