Open · ahmetb opened this issue 4 months ago
I think this happens when upgrading from NFD v0.13 or earlier. The specific reason is the NodeFeature CRD API that was introduced in v0.14 (earlier versions of NFD relied on gRPC communication between nfd-master and nfd-worker instances).
What happens (I think) is that nfd-master starts quickly (and syncs the NodeFeature cache) as there are no NodeFeature objects in the cluster. Because of that, it starts removing the labels from nodes (as it sees no features published for them by the nfd-worker instances). At the same time the nfd-worker instances start up and begin pushing NodeFeature objects to the apiserver -> there are a ton of them and the cache throws timeout errors. Eventually things "stabilize" and the node labels come back.
The thing to "fix" this would be upgrading the nfd-workers first, so that the NodeFeature objects are created before nfd-master kicks in (a rough sketch of that ordering is below). Maybe we need a migration guide for upgrading from pre-v0.14?
@ahmetb PTAL #1807
Hi @marquiz,
Let me provide more context about the situation:
The logs indicate that the nfd-master pod failed to list the NodeFeature objects due to a 1-minute request timeout, although the NodeFeature objects do exist in the cluster. The large size of these objects likely contributed to the timeout.
I0719 15:02:17.382982 1 trace.go:236] Trace[1302901114]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232 (19-Jul-2024 15:01:16.878) (total time: 60504ms):
Trace[1302901114]: ---"Objects listed" error:stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 3; INTERNAL_ERROR; received from peer 60504ms (15:02:17.382)
Trace[1302901114]: [1m0.504227078s] [1m0.504227078s] END
In this situation, the nfd-master removes all the labels from nodes and does not add them back.
The nfd-master invokes the node label updates here. The labels were removed because an empty NodeFeature object was created here and passed to that function.
It seems the current logic assumes that a NodeFeature missing from the sharedInformer cache is equivalent to the NodeFeature being absent from the cluster. This causes all the labels to be removed and never added back.
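To illustrate the pattern I mean, here is a toy sketch (not the actual nfd-master code; all names are made up) of how "not in the informer cache" ends up being treated as "no features for this node":

```go
// Toy illustration only, not the real nfd-master code. It shows why treating a
// cache miss as "the NodeFeature was deleted" strips labels during startup.
package main

import "fmt"

// featureCache stands in for the shared informer's local store: node name -> labels.
type featureCache map[string]map[string]string

// desiredLabels mirrors the problematic assumption: a cache miss yields an
// empty label set, which the caller then enforces by pruning everything else.
func desiredLabels(c featureCache, node string) map[string]string {
	labels, ok := c[node]
	if !ok {
		// BUG: the NodeFeature object may exist in the cluster but simply not be
		// listed yet (slow/timed-out initial LIST). Returning an empty set here
		// makes the updater remove every NFD-managed label from the node.
		return map[string]string{}
	}
	return labels
}

func main() {
	emptyCache := featureCache{} // informer cache before the initial LIST completes
	fmt.Println(desiredLabels(emptyCache, "kind-worker")) // map[] -> labels get stripped
}
```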
@lxlxok I'm still not convinced that this is the issue. The nfd-master blocks until all the caches are synced (here). It will never get to the node update part if/when the informer cache sync keeps timing out.
The timeout is another issue, though. That needs to be investigated/solved. Looking at the log that @ahmetb linked, nfd-master tried syncing caches for 1.5 hours without succeeding.
@marquiz any ideas why the logs show "timeout" for informer sync many minutes after the logs say "node updated"?
I'm trying to understand whether the logs show any indication of the cache sync ever succeeding, because the program goes into updating nodes shortly after booting.
> @lxlxok I'm still not convinced that this is the issue. The nfd-master blocks until all the caches are synced (here). It will never get to the node update part if/when the informer cache sync keeps timing out.
Hi @marquiz, it doesn't block nfd-master from running the node update part. The WaitForCacheSync function returns the res map immediately here after it loops over all the informers.
This also explains why the updaterPool starts immediately after m.startNfdApiController here, at 04:29:46. The NodeFeature cache was only populated at 04:30:01.
We tested node-feature-discovery 0.16.0 in kind. Please see the verbose log below for more detail.
I0725 04:29:46.988600 1 main.go:66] "-crd-controller is deprecated, will be removed in a future release along with the deprecated gRPC API"
I0725 04:29:46.989165 1 nfd-master.go:283] "Node Feature Discovery Master" version="v0.16.0" nodeName="kind-control-plane" namespace="node-feature-discovery"
I0725 04:29:46.989241 1 nfd-master.go:1381] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-master.conf"
I0725 04:29:46.989639 1 nfd-master.go:1429] "configuration successfully updated" configuration=<
AutoDefaultNs: true
DenyLabelNs: {}
EnableTaints: false
ExtraLabelNs: {}
Klog: {}
LabelWhiteList: null
LeaderElection:
LeaseDuration:
Duration: 15000000000
RenewDeadline:
Duration: 10000000000
RetryPeriod:
Duration: 2000000000
NfdApiParallelism: 10
NoPublish: false
ResourceLabels: {}
ResyncPeriod:
Duration: 3600000000000
>
I0725 04:29:46.989910 1 nfd-master.go:1513] "starting the nfd api controller"
I0725 04:29:46.989993 1 nfd-api-controller.go:70] "initializing new NFD API controller" options=<
DisableNodeFeature: false
DisableNodeFeatureGroup: false
ResyncPeriod: 3600000000000
>
I0725 04:29:46.990041 1 updater-pool.go:142] "starting the NFD master updater pool" parallelism=10
I0725 04:29:46.990191 1 reflector.go:296] Starting reflector *v1alpha1.NodeFeatureRule (1h0m0s) from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.990211 1 reflector.go:332] Listing and watching *v1alpha1.NodeFeatureRule from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.990191 1 reflector.go:296] Starting reflector *v1alpha1.NodeFeature (1h0m0s) from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.990226 1 reflector.go:332] Listing and watching *v1alpha1.NodeFeature from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.990383 1 fswatcher.go:99] "added fsnotify watch" path="/etc/kubernetes/node-feature-discovery/nfd-master.conf"
I0725 04:29:46.990395 1 fswatcher.go:99] "added fsnotify watch" path="/etc/kubernetes/node-feature-discovery"
I0725 04:29:46.990402 1 fswatcher.go:99] "added fsnotify watch" path="/etc/kubernetes"
I0725 04:29:46.990407 1 fswatcher.go:99] "added fsnotify watch" path="/etc"
I0725 04:29:46.990411 1 fswatcher.go:99] "added fsnotify watch" path="/"
I0725 04:29:46.990422 1 reflector.go:296] Starting reflector *v1alpha1.NodeFeatureGroup (1h0m0s) from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.990430 1 reflector.go:332] Listing and watching *v1alpha1.NodeFeatureGroup from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.994203 1 reflector.go:359] Caches populated for *v1alpha1.NodeFeatureRule from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.994203 1 reflector.go:359] Caches populated for *v1alpha1.NodeFeatureGroup from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:29:46.994508 1 metrics.go:44] "metrics server starting" port=":8081"
I0725 04:29:46.994571 1 component.go:36] [core][Server #1]Server created
I0725 04:29:46.994589 1 nfd-master.go:417] "gRPC health server serving" port=8082
I0725 04:29:46.994633 1 component.go:36] [core][Server #1 ListenSocket #2]ListenSocket created
I0725 04:29:47.995621 1 nfd-master.go:792] "will process all nodes in the cluster"
I0725 04:29:47.998358 1 nfd-master.go:1209] "processed NodeFeatureRule objects" nodeName="kind-control-plane" objectCount=0 duration="83ns"
I0725 04:29:47.998378 1 nfd-master.go:1285] "no updates to node" nodeName="kind-control-plane"
I0725 04:29:47.998799 1 nfd-master.go:1209] "processed NodeFeatureRule objects" nodeName="kind-worker3" objectCount=0 duration="84ns"
I0725 04:29:47.998811 1 nfd-master.go:1285] "no updates to node" nodeName="kind-worker3"
I0725 04:29:47.998883 1 nfd-master.go:1209] "processed NodeFeatureRule objects" nodeName="kind-worker2" objectCount=0 duration="83ns"
I0725 04:29:47.998889 1 nfd-master.go:1209] "processed NodeFeatureRule objects" nodeName="kind-worker" objectCount=0 duration="83ns"
I0725 04:29:47.998895 1 nfd-master.go:1285] "no updates to node" nodeName="kind-worker2"
I0725 04:29:47.998896 1 nfd-master.go:1285] "no updates to node" nodeName="kind-worker"
I0725 04:29:56.762856 1 component.go:36] [transport][server-transport 0x400046b980] Closing: EOF
I0725 04:29:56.762900 1 component.go:36] [transport][server-transport 0x400046b980] loopyWriter exiting with error: transport closed by client
I0725 04:29:56.763042 1 component.go:36] [transport][server-transport 0x4000924600] Closing: EOF
I0725 04:29:56.763072 1 component.go:36] [transport][server-transport 0x4000924600] loopyWriter exiting with error: transport closed by client
I0725 04:30:01.288935 1 trace.go:236] Trace[1887133826]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232 (25-Jul-2024 04:29:46.990) (total time: 14298ms):
Trace[1887133826]: ---"Objects listed" error:<nil> 14296ms (04:30:01.286)
Trace[1887133826]: ---"Resource version extracted" 0ms (04:30:01.286)
Trace[1887133826]: ---"Objects extracted" 0ms (04:30:01.286)
Trace[1887133826]: ---"SyncWith done" 1ms (04:30:01.288)
Trace[1887133826]: ---"Resource version updated" 0ms (04:30:01.288)
Trace[1887133826]: [14.298920589s] [14.298920589s] END
I0725 04:30:01.288971 1 reflector.go:359] Caches populated for *v1alpha1.NodeFeature from pkg/mod/k8s.io/client-go@v0.30.0/tools/cache/reflector.go:232
I0725 04:30:01.289381 1 nfd-api-controller.go:80] "NodeFeature added" nodefeature="node-feature-discovery/kind-worker"
I0725 04:30:01.289398 1 nfd-api-controller.go:80] "NodeFeature added" nodefeature="node-feature-discovery/kind-worker2"
I0725 04:30:01.289401 1 nfd-api-controller.go:80] "NodeFeature added" nodefeature="node-feature-discovery/kind-worker3"
> It doesn't block nfd-master from running the node update part. The WaitForCacheSync function returns the res map immediately here after it loops over all the informers.
That DOES block until the caches have synced. However, in v0.16.0 we don't call WaitForCacheSync()
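For reference, a minimal generic client-go sketch (not the NFD source) of what the blocking behaviour looks like when WaitForCacheSync is actually called:

```go
// Generic client-go sketch, not NFD code: factory.WaitForCacheSync blocks
// until every started informer reports HasSynced, or until stopCh is closed.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, time.Hour)
	factory.Core().V1().Nodes().Informer() // register an informer so the factory has something to sync

	stopCh := make(chan struct{})
	factory.Start(stopCh)

	// Does not return until the initial LIST of every registered informer has
	// completed (caches populated) or stopCh is closed.
	for typ, ok := range factory.WaitForCacheSync(stopCh) {
		fmt.Printf("cache synced for %v: %v\n", typ, ok)
	}

	fmt.Println("only after this point is it safe to start updating nodes")
}
```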
> We tested node-feature-discovery 0.16.0 in kind. Please see the verbose log below for more detail.
You won't see this kind of log with the latest version (v0.16.3). However, there are other issues you are likely to face. I submitted #1810 and #1811 to mitigate those.
> That DOES block until the caches have synced. However, in v0.16.0 we don't call WaitForCacheSync()
@marquiz thanks for the responses. Would the lack of WaitForCacheSync() combined with something like https://github.com/kubernetes-sigs/node-feature-discovery/blob/v0.16.0/pkg/nfd-master/nfd-master.go#L821-L824 cause node labels to be dropped? (This last snippet is still there in the latest version.)
I see v0.16.3 actually does have a WaitForCacheSync(), maybe that would've made the difference.
By the way, nfd-gc also does not check the return value here: https://github.com/kubernetes-sigs/node-feature-discovery/blob/bd8d74d6f2523b5cc866801f4e6bdb1d113ffd5d/pkg/nfd-gc/nfd-gc.go#L211
> Would the lack of WaitForCacheSync() combined with something like https://github.com/kubernetes-sigs/node-feature-discovery/blob/v0.16.0/pkg/nfd-master/nfd-master.go#L821-L824 cause node labels to be dropped? (This last snippet is still there in the latest version.)
Yes, exactly that.
> I see v0.16.3 actually does have a WaitForCacheSync(), maybe that would've made the difference.
It should've. But I suspect that in a BIG cluster with v0.16.3 you'll hit the "timeout problem" (that #1811 addresses) and WaitForCacheSync would block indefinitely.
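To make "block indefinitely" concrete, here is a hedged sketch (generic client-go usage, not the fix in #1811; the helper name and deadline are made up) of bounding the wait so a sync that never completes becomes a startup error instead of a hang:

```go
// Sketch only: bound the informer cache sync instead of blocking forever.
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/tools/cache"
)

// waitForSyncWithDeadline is a hypothetical helper: WaitForCacheSync returns
// false as soon as the stop channel closes, so a context deadline turns an
// indefinite block into an error the caller can act on.
func waitForSyncWithDeadline(hasSynced cache.InformerSynced, deadline time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), deadline)
	defer cancel()

	if !cache.WaitForCacheSync(ctx.Done(), hasSynced) {
		return fmt.Errorf("timed out waiting for caches to sync after %v", deadline)
	}
	return nil
}

func main() {
	// With a HasSynced that never turns true, this returns an error after ~2s
	// instead of hanging the process (and proceeding with an empty cache).
	neverSynced := cache.InformerSynced(func() bool { return false })
	fmt.Println(waitForSyncWithDeadline(neverSynced, 2*time.Second))
}
```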
> By the way, nfd-gc also does not check the return value here:
Thanks for pointing that out. It's not critical there, but I think it's good to be consistent and start from a known/synced state. I'll submit a PR to fix that one.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
What happened:
While migrating from 0.10 to 0.16.0:
- kubectl get nodefeatures -n node-feature-discovery was unresponsive at the time (likely because the cluster size is 4000 nodes and the NodeFeature CR objects are 130kb each by default)

What you expected to happen:
How to reproduce it (as minimally and precisely as possible): Run the NFD chart with default values on a 4000-node cluster.
Anything else we need to know?:
There were extensive informer sync errors in nfd-master logs (seemingly timing out after 60s). This is likely because the LIST NodeFeatures is a very expensive call (each object is very large + a lot of Nodes in the cluster).

Attaching logs: nfd-master.log
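(Rough arithmetic for scale, my own estimate from the numbers above: 4000 nodes × ~130 KB per NodeFeature ≈ 520 MB that the apiserver has to serialize for a single LIST, which lines up with requests timing out around the 60s mark.)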
My suspicion is that the nfd-master somehow does not wait for the informer cache to sync (as the first informer sync error occurs exactly 60s after the process starts), and it treats the lack of a response as an "empty set of labels" and clears the labels. (But I'm not familiar with the inner workings of the codebase; it's just a theory.)

💡 We don't see the issue on much smaller clusters.
💡 We have not yet tried v0.16.2 (the release notes mention it fixes a node removal issue, but it's not clear what the root cause was there).
Environment:
- Kubernetes version (use kubectl version): v1.23.17
- OS (e.g: cat /etc/os-release): Not applicable
- Kernel (e.g. uname -a): Not applicable