kubernetes-sigs / node-feature-discovery

Node feature discovery for Kubernetes
Apache License 2.0

nfd-topology-updater: can't run normally when there are empty hugepages in some NUMA nodes #1286

Closed · freelizhun closed this 1 year ago

freelizhun commented 1 year ago

What happened: nfd-topology-updater can't run normally when there are empty hugepages in some NUMA nodes, e.g.:

[root@master1 node-feature-discovery]# kubectl -n node-feature-discovery get pods -o wide
NAME                               READY   STATUS             RESTARTS       AGE     IP             NODE      NOMINATED NODE   READINESS GATES
nfd-topology-updater-mwhx2         1/1     Running            0              7m52s   10.119.1.203   node1     <none>           <none>
nfd-topology-updater-rv6kp         0/1     CrashLoopBackOff   6 (117s ago)   7m52s   10.119.0.75    master1   <none>           <none>

[root@master1 node-feature-discovery]# kubectl -n node-feature-discovery logs nfd-topology-updater-rv6kp 
I0731 08:11:40.262277       1 nfd-topology-updater.go:127] "Node Feature Discovery Topology Updater" version="v0.14.0-devel-161-ge0f10a81-dirty" nodeName="master1"
I0731 08:11:40.262395       1 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/host-var/lib/kubelet/pod-resources/kubelet.sock" URL="unix:///host-var/lib/kubelet/pod-resources/kubelet.sock"
I0731 08:11:40.262455       1 component.go:36] [core][Channel #1] Channel created
I0731 08:11:40.262473       1 component.go:36] [core][Channel #1] original dial target is: "/host-var/lib/kubelet/pod-resources/kubelet.sock"
I0731 08:11:40.262506       1 component.go:36] [core][Channel #1] parsed dial target is: {Scheme: Authority: Endpoint:host-var/lib/kubelet/pod-resources/kubelet.sock URL:{Scheme: Opaque: User: Host: Path:/host-var/lib/kubelet/pod-resources/kubelet.sock RawPath: OmitHost:false ForceQuery:false RawQuery: Fragment: RawFragment:}}
I0731 08:11:40.262519       1 component.go:36] [core][Channel #1] fallback to scheme "passthrough"
I0731 08:11:40.262539       1 component.go:36] [core][Channel #1] parsed dial target is: {Scheme:passthrough Authority: Endpoint:/host-var/lib/kubelet/pod-resources/kubelet.sock URL:{Scheme:passthrough Opaque: User: Host: Path://host-var/lib/kubelet/pod-resources/kubelet.sock RawPath: OmitHost:false ForceQuery:false RawQuery: Fragment: RawFragment:}}
I0731 08:11:40.262560       1 component.go:36] [core][Channel #1] Channel authority set to "/host-var/lib/kubelet/pod-resources/kubelet.sock"
I0731 08:11:40.262725       1 component.go:36] [core][Channel #1] Resolver state updated: {
  "Addresses": [
    {
      "Addr": "/host-var/lib/kubelet/pod-resources/kubelet.sock",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Type": 0,
      "Metadata": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
} (resolver returned new addresses)
I0731 08:11:40.262790       1 component.go:36] [core][Channel #1] Channel switches to new LB policy "pick_first"
I0731 08:11:40.262825       1 component.go:36] [core][Channel #1 SubChannel #2] Subchannel created
2023/07/31 08:11:40 Connected to '"/host-var/lib/kubelet/pod-resources/kubelet.sock"'!
I0731 08:11:40.262904       1 component.go:36] [core][Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING
I0731 08:11:40.262935       1 component.go:36] [core][Channel #1 SubChannel #2] Subchannel picks a new address "/host-var/lib/kubelet/pod-resources/kubelet.sock" to connect
I0731 08:11:40.263074       1 component.go:36] [core][Channel #1] Channel Connectivity change to CONNECTING
I0731 08:11:40.263126       1 nfd-topology-updater.go:294] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-topology-updater.conf" config=&{ExcludeList:map[]}
I0731 08:11:40.263148       1 podresourcesscanner.go:53] "watching all namespaces"
I0731 08:11:40.263366       1 component.go:36] [core][Channel #1 SubChannel #2] Subchannel Connectivity change to READY
I0731 08:11:40.263392       1 component.go:36] [core][Channel #1] Channel Connectivity change to READY
E0731 08:11:40.493194       1 main.go:71] "error while running" err="failed to obtain node resource information: open /host-sys/bus/node/devices/node1/hugepages: no such file or directory"
[root@master1 node-feature-discovery]# 
[root@master1 node-feature-discovery]# ls /sys/bus/node/devices/node1
compact  cpu10  cpu11  cpu12  cpu13  cpu14  cpu15  cpu8  cpu9  cpulist  cpumap  distance  meminfo  numastat  power  subsystem  uevent  vmstat
[root@master1 node-feature-discovery]# numactl -H
available: 16 nodes (0-15)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32272 MB
node 0 free: 24754 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32730 MB
node 2 free: 27810 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus: 32 33 34 35 36 37 38 39
node 4 size: 32730 MB
node 4 free: 28156 MB
node 5 cpus: 40 41 42 43 44 45 46 47
node 5 size: 0 MB
node 5 free: 0 MB
node 6 cpus: 48 49 50 51 52 53 54 55
node 6 size: 32730 MB
node 6 free: 30288 MB
node 7 cpus: 56 57 58 59 60 61 62 63
node 7 size: 0 MB
node 7 free: 0 MB
node 8 cpus: 64 65 66 67 68 69 70 71
node 8 size: 32666 MB
node 8 free: 24379 MB
node 9 cpus: 72 73 74 75 76 77 78 79
node 9 size: 0 MB
node 9 free: 0 MB
node 10 cpus: 80 81 82 83 84 85 86 87
node 10 size: 32730 MB
node 10 free: 26705 MB
node 11 cpus: 88 89 90 91 92 93 94 95
node 11 size: 0 MB
node 11 free: 0 MB
node 12 cpus: 96 97 98 99 100 101 102 103
node 12 size: 32707 MB
node 12 free: 27130 MB
node 13 cpus: 104 105 106 107 108 109 110 111
node 13 size: 0 MB
node 13 free: 0 MB
node 14 cpus: 112 113 114 115 116 117 118 119
node 14 size: 31665 MB
node 14 free: 29324 MB
node 15 cpus: 120 121 122 123 124 125 126 127
node 15 size: 0 MB
node 15 free: 0 MB
node distances:
node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
  0:  10  20  40  30  20  30  50  40  100  100  100  100  100  100  100  100 
  1:  20  10  30  40  50  20  40  50  100  100  100  100  100  100  100  100 
  2:  40  30  10  20  40  50  20  30  100  100  100  100  100  100  100  100 
  3:  30  40  20  10  30  20  40  50  100  100  100  100  100  100  100  100 
  4:  20  50  40  30  10  50  30  20  100  100  100  100  100  100  100  100 
  5:  30  20  50  20  50  10  50  40  100  100  100  100  100  100  100  100 
  6:  50  40  20  40  30  50  10  30  100  100  100  100  100  100  100  100 
  7:  40  50  30  50  20  40  30  10  100  100  100  100  100  100  100  100 
  8:  100  100  100  100  100  100  100  100  10  20  40  30  20  30  50  40 
  9:  100  100  100  100  100  100  100  100  20  10  30  40  50  20  40  50 
 10:  100  100  100  100  100  100  100  100  40  30  10  20  40  50  20  30 
 11:  100  100  100  100  100  100  100  100  30  40  20  10  30  20  40  50 
 12:  100  100  100  100  100  100  100  100  20  50  40  30  10  50  30  20 
 13:  100  100  100  100  100  100  100  100  30  20  50  20  50  10  50  40 
 14:  100  100  100  100  100  100  100  100  50  40  20  40  30  50  10  30 
 15:  100  100  100  100  100  100  100  100  40  50  30  50  20  40  30  10 
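Note that every odd-numbered node above reports size 0 MB: those are memoryless NUMA nodes, and for such nodes the kernel does not create a hugepages subdirectory under /sys/bus/node/devices/nodeN, which matches the ls output for node1 above. nfd-topology-updater apparently opens that directory unconditionally, so the scan aborts on the first memoryless node it meets.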

What you expected to happen: nfd-topology-updater pods run normally even when some NUMA nodes have empty hugepages.
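
Below is a minimal Go sketch of the kind of guard that achieves this, assuming a hypothetical readNodeHugepages helper (this is not the actual nfd-topology-updater code): a missing hugepages directory is treated as a node with zero hugepages instead of a fatal error.

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// readNodeHugepages is a hypothetical helper that returns the per-size
// hugepage counts for one NUMA node. A missing hugepages directory
// (memoryless node) is treated as "no hugepages" rather than an error.
func readNodeHugepages(sysfsRoot string, nodeID int) (map[string]string, error) {
	dir := filepath.Join(sysfsRoot, "bus", "node", "devices",
		fmt.Sprintf("node%d", nodeID), "hugepages")

	entries, err := os.ReadDir(dir)
	if os.IsNotExist(err) {
		// Memoryless NUMA nodes get no hugepages directory in sysfs;
		// report zero hugepages instead of failing the whole scan.
		return map[string]string{}, nil
	}
	if err != nil {
		return nil, err
	}

	counts := make(map[string]string)
	for _, e := range entries {
		// Each entry is a directory such as "hugepages-2048kB" that
		// contains an nr_hugepages file with the allocated count.
		data, err := os.ReadFile(filepath.Join(dir, e.Name(), "nr_hugepages"))
		if err != nil {
			return nil, err
		}
		counts[e.Name()] = strings.TrimSpace(string(data))
	}
	return counts, nil
}

func main() {
	// On the machine above, node0 has memory while node1 is memoryless.
	for _, node := range []int{0, 1} {
		hp, err := readNodeHugepages("/host-sys", node)
		if err != nil {
			fmt.Fprintf(os.Stderr, "node%d: %v\n", node, err)
			continue
		}
		fmt.Printf("node%d hugepages: %v\n", node, hp)
	}
}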

How to reproduce it (as minimally and precisely as possible):
$ git clone https://github.com/kubernetes-sigs/node-feature-discovery.git
$ cd node-feature-discovery
$ kubectl apply -k deployment/overlays/topologyupdater

Environment:

freelizhun commented 1 year ago

/assign

marquiz commented 1 year ago

Thanks @freelizhun for reporting this (and for the fix, too).