ahmetb opened 4 months ago
Hello, I'm trying to understand why the NodeFeature resource is owned by DaemonSet Pods.
The reason is garbage collection: keeping the cluster clean of stale NodeFeature objects after uninstallation of NFD (and, e.g., preventing another instance of NFD from picking them up).
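For context, a minimal sketch of what that ownership looks like on a NodeFeature object (names and the UID below are made up; the exact fields NFD sets may differ by version):

```yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeature
metadata:
  name: worker-node-1                 # NodeFeature objects are named after the node
  namespace: node-feature-discovery
  ownerReferences:
    # If the owner listed here is deleted, the Kubernetes garbage
    # collector deletes this NodeFeature object as a dependent.
    - apiVersion: v1
      kind: Pod
      name: nfd-worker-abc12                       # made-up pod name
      uid: 5f6a7b8c-0000-0000-0000-000000000000    # made-up UID
spec:
  labels:
    feature.node.kubernetes.io/cpu-cpuid.AVX512F: "true"
```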
We've been using 0.4.0 and noticed that 0.16.x brings a ton of architectural changes. These changes seem to lead to removals under some scenarios like #1802, removal through the Kubernetes garbage collector being another.
There were issues in v0.16.0 but those should now be mitigated in the latest v0.16.3.
In our use case, removal of node labels at any time is unacceptable, since we heavily rely on `nodeSelector`s for scheduling (or preventing scheduling), program controllers to avoid touching nodes with certain labels, etc. So things like the point-in-time absence of a DaemonSet pod causing removal of a NodeFeature are a problem for us.
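As an illustration, a minimal pod spec relying on an NFD-published label (the label key is one NFD commonly sets; the pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: avx512-workload            # hypothetical workload
spec:
  # If this NFD label is garbage-collected from every node, the pod
  # becomes unschedulable; if removed from a subset, scheduling shifts.
  nodeSelector:
    feature.node.kubernetes.io/cpu-cpuid.AVX512F: "true"
  containers:
    - name: app
      image: registry.example.com/app:latest   # placeholder image
```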
I can see your point. This is now mitigated so that the NodeFeature objects are owned by both the Pod and the DaemonSet. NodeFeatures are only GC'd if you uninstall the DaemonSet.
What we could do is to add something like a `core.noOwnerRefs` configuration setting to nfd-worker to prevent automatic GC for good. WDYT?
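If that were added, the worker configuration might look roughly like this (a sketch only: `noOwnerRefs` is the proposal above and does not exist in any released version, while `sleepInterval` is an existing option shown for context):

```yaml
# nfd-worker.conf (sketch; the noOwnerRefs key is only proposed here)
core:
  noOwnerRefs: true   # proposed: skip setting ownerReferences on NodeFeature,
                      # so the GC never deletes it when the owner goes away
  sleepInterval: 60s  # existing option, shown for context
```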
Is there a reason why NodeFeature is owned by the nfd-worker `Pod`, say, instead of the `Node`?
A namespaced object (NodeFeature) cannot be "owned" (as by ownerReference) by a cluster-scoped object (Node).
Sorry I forgot to reply.
> A namespaced object (NodeFeature) cannot be "owned" (as by ownerReference) by a cluster-scoped object (Node).

Are you sure? I think it's the other way around. :)
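For reference, the Kubernetes garbage collection documentation disallows cross-namespace owner references and restricts cluster-scoped dependents to cluster-scoped owners, but a namespaced dependent may name a cluster-scoped owner. A sketch of what that would look like (hypothetical names and a made-up UID):

```yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeature
metadata:
  name: worker-node-1
  namespace: node-feature-discovery
  ownerReferences:
    # A namespaced object may list a cluster-scoped owner such as a Node;
    # the reverse (cluster-scoped dependent, namespaced owner) is invalid.
    - apiVersion: v1
      kind: Node
      name: worker-node-1
      uid: 9e8d7c6b-0000-0000-0000-000000000000   # made-up UID
```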
> What we could do is to add something like a `core.noOwnerRefs` configuration setting to nfd-worker to prevent automatic GC for good. WDYT?

IMO this should be the default. Many serious companies rely on the persistence of node labels for scheduling-time decisions, as there is often no other way to select a particular group of nodes (pools) in a heterogeneous cluster.
> The reason is garbage collection, keeping the cluster clean of stale NodeFeature objects after uninstallation of NFD

If I'm uninstalling NFD, I delete the CRD, which deletes the CRs of that type. I don't follow why this is a path the project needs to handle.
It seems newer versions of NFD now parent the NodeFeature resource to the DaemonSet. I think this is still a bad idea. We might remove the DaemonSet accidentally, or something like a `kubectl replace` (e.g. with `--force`) might change the parent object's UID and trigger all node labels to be deleted (which is a no-go for us).
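To sketch that failure mode (the UID below is made up): a delete-and-recreate such as `kubectl replace --force` gives the DaemonSet a new UID, so the recorded ownerReference no longer matches any live object and the garbage collector removes the dependents:

```yaml
# ownerReference recorded on an existing NodeFeature (made-up UID):
ownerReferences:
  - apiVersion: apps/v1
    kind: DaemonSet
    name: nfd-worker
    uid: 1b2c3d4e-0000-0000-0000-000000000000   # UID of the old DaemonSet
# After delete-and-recreate, the live nfd-worker DaemonSet has a different
# UID, so this reference is dangling; the garbage collector then deletes
# the NodeFeature, and the node labels derived from it are dropped.
```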
If the worry here is the cleanup, uninstalling the chart would delete the CRD which would delete the CRs anyway.
> Is there a reason why NodeFeature is owned by the nfd-worker `Pod`, say, instead of the `Node`?

cc: @lxlxok