Closed freelizhun closed 1 year ago
What happened: nfd-topology-updater cannot run normally when some NUMA nodes have no hugepages, e.g.:
[root@master1 node-feature-discovery]# kubectl -n node-feature-discovery get pods -o wide
NAME                         READY   STATUS             RESTARTS       AGE     IP             NODE      NOMINATED NODE   READINESS GATES
nfd-topology-updater-mwhx2   1/1     Running            0              7m52s   10.119.1.203   node1     <none>           <none>
nfd-topology-updater-rv6kp   0/1     CrashLoopBackOff   6 (117s ago)   7m52s   10.119.0.75    master1   <none>           <none>
[root@master1 node-feature-discovery]# kubectl -n node-feature-discovery logs nfd-topology-updater-rv6kp
I0731 08:11:40.262277       1 nfd-topology-updater.go:127] "Node Feature Discovery Topology Updater" version="v0.14.0-devel-161-ge0f10a81-dirty" nodeName="master1"
I0731 08:11:40.262395       1 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/host-var/lib/kubelet/pod-resources/kubelet.sock" URL="unix:///host-var/lib/kubelet/pod-resources/kubelet.sock"
I0731 08:11:40.262455       1 component.go:36] [core][Channel #1] Channel created
I0731 08:11:40.262473       1 component.go:36] [core][Channel #1] original dial target is: "/host-var/lib/kubelet/pod-resources/kubelet.sock"
I0731 08:11:40.262506       1 component.go:36] [core][Channel #1] parsed dial target is: {Scheme: Authority: Endpoint:host-var/lib/kubelet/pod-resources/kubelet.sock URL:{Scheme: Opaque: User: Host: Path:/host-var/lib/kubelet/pod-resources/kubelet.sock RawPath: OmitHost:false ForceQuery:false RawQuery: Fragment: RawFragment:}}
I0731 08:11:40.262519       1 component.go:36] [core][Channel #1] fallback to scheme "passthrough"
I0731 08:11:40.262539       1 component.go:36] [core][Channel #1] parsed dial target is: {Scheme:passthrough Authority: Endpoint:/host-var/lib/kubelet/pod-resources/kubelet.sock URL:{Scheme:passthrough Opaque: User: Host: Path://host-var/lib/kubelet/pod-resources/kubelet.sock RawPath: OmitHost:false ForceQuery:false RawQuery: Fragment: RawFragment:}}
I0731 08:11:40.262560       1 component.go:36] [core][Channel #1] Channel authority set to "/host-var/lib/kubelet/pod-resources/kubelet.sock"
I0731 08:11:40.262725       1 component.go:36] [core][Channel #1] Resolver state updated: { "Addresses": [ { "Addr": "/host-var/lib/kubelet/pod-resources/kubelet.sock", "ServerName": "", "Attributes": null, "BalancerAttributes": null, "Type": 0, "Metadata": null } ], "ServiceConfig": null, "Attributes": null } (resolver returned new addresses)
I0731 08:11:40.262790       1 component.go:36] [core][Channel #1] Channel switches to new LB policy "pick_first"
I0731 08:11:40.262825       1 component.go:36] [core][Channel #1 SubChannel #2] Subchannel created
2023/07/31 08:11:40 Connected to '"/host-var/lib/kubelet/pod-resources/kubelet.sock"'!
I0731 08:11:40.262904       1 component.go:36] [core][Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING
I0731 08:11:40.262935       1 component.go:36] [core][Channel #1 SubChannel #2] Subchannel picks a new address "/host-var/lib/kubelet/pod-resources/kubelet.sock" to connect
I0731 08:11:40.263074       1 component.go:36] [core][Channel #1] Channel Connectivity change to CONNECTING
I0731 08:11:40.263126       1 nfd-topology-updater.go:294] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-topology-updater.conf" config=&{ExcludeList:map[]}
I0731 08:11:40.263148       1 podresourcesscanner.go:53] "watching all namespaces"
I0731 08:11:40.263366       1 component.go:36] [core][Channel #1 SubChannel #2] Subchannel Connectivity change to READY
I0731 08:11:40.263392       1 component.go:36] [core][Channel #1] Channel Connectivity change to READY
E0731 08:11:40.493194       1 main.go:71] "error while running" err="failed to obtain node resource information: open /host-sys/bus/node/devices/node1/hugepages: no such file or directory"
[root@master1 node-feature-discovery]#
[root@master1 node-feature-discovery]# ls /sys/bus/node/devices/node1
compact  cpu10  cpu11  cpu12  cpu13  cpu14  cpu15  cpu8  cpu9  cpulist  cpumap  distance  meminfo  numastat  power  subsystem  uevent  vmstat
[root@master1 node-feature-discovery]# numactl -H
available: 16 nodes (0-15)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32272 MB
node 0 free: 24754 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32730 MB
node 2 free: 27810 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus: 32 33 34 35 36 37 38 39
node 4 size: 32730 MB
node 4 free: 28156 MB
node 5 cpus: 40 41 42 43 44 45 46 47
node 5 size: 0 MB
node 5 free: 0 MB
node 6 cpus: 48 49 50 51 52 53 54 55
node 6 size: 32730 MB
node 6 free: 30288 MB
node 7 cpus: 56 57 58 59 60 61 62 63
node 7 size: 0 MB
node 7 free: 0 MB
node 8 cpus: 64 65 66 67 68 69 70 71
node 8 size: 32666 MB
node 8 free: 24379 MB
node 9 cpus: 72 73 74 75 76 77 78 79
node 9 size: 0 MB
node 9 free: 0 MB
node 10 cpus: 80 81 82 83 84 85 86 87
node 10 size: 32730 MB
node 10 free: 26705 MB
node 11 cpus: 88 89 90 91 92 93 94 95
node 11 size: 0 MB
node 11 free: 0 MB
node 12 cpus: 96 97 98 99 100 101 102 103
node 12 size: 32707 MB
node 12 free: 27130 MB
node 13 cpus: 104 105 106 107 108 109 110 111
node 13 size: 0 MB
node 13 free: 0 MB
node 14 cpus: 112 113 114 115 116 117 118 119
node 14 size: 31665 MB
node 14 free: 29324 MB
node 15 cpus: 120 121 122 123 124 125 126 127
node 15 size: 0 MB
node 15 free: 0 MB
node distances:
node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
  0:  10  20  40  30  20  30  50  40 100 100 100 100 100 100 100 100
  1:  20  10  30  40  50  20  40  50 100 100 100 100 100 100 100 100
  2:  40  30  10  20  40  50  20  30 100 100 100 100 100 100 100 100
  3:  30  40  20  10  30  20  40  50 100 100 100 100 100 100 100 100
  4:  20  50  40  30  10  50  30  20 100 100 100 100 100 100 100 100
  5:  30  20  50  20  50  10  50  40 100 100 100 100 100 100 100 100
  6:  50  40  20  40  30  50  10  30 100 100 100 100 100 100 100 100
  7:  40  50  30  50  20  40  30  10 100 100 100 100 100 100 100 100
  8: 100 100 100 100 100 100 100 100  10  20  40  30  20  30  50  40
  9: 100 100 100 100 100 100 100 100  20  10  30  40  50  20  40  50
 10: 100 100 100 100 100 100 100 100  40  30  10  20  40  50  20  30
 11: 100 100 100 100 100 100 100 100  30  40  20  10  30  20  40  50
 12: 100 100 100 100 100 100 100 100  20  50  40  30  10  50  30  20
 13: 100 100 100 100 100 100 100 100  30  20  50  20  50  10  50  40
 14: 100 100 100 100 100 100 100 100  50  40  20  40  30  50  10  30
 15: 100 100 100 100 100 100 100 100  40  50  30  50  20  40  30  10
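The error comes from reading /host-sys/bus/node/devices/node<N>/hugepages for every NUMA node: memoryless nodes (node1, node3, ... in the numactl output above) expose no hugepages directory at all, so the open fails and the updater exits. Below is a minimal Go sketch of the kind of tolerance that avoids the crash; it is not the actual NFD code, and the hugepagesForNode helper and its sysRoot parameter are illustrative assumptions.

// Minimal sketch (not the NFD implementation): read per-NUMA-node hugepage
// counts from sysfs and treat a missing "hugepages" directory -- which is
// what memoryless NUMA nodes expose -- as zero hugepages instead of a
// fatal error.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// hugepagesForNode returns a map of hugepage size directory name (e.g.
// "hugepages-2048kB") to the number of pages on the given NUMA node.
// sysRoot would be "/sys" on the host or "/host-sys" inside the pod.
func hugepagesForNode(sysRoot string, nodeID int) (map[string]int64, error) {
	dir := filepath.Join(sysRoot, "bus", "node", "devices",
		fmt.Sprintf("node%d", nodeID), "hugepages")

	entries, err := os.ReadDir(dir)
	if err != nil {
		if os.IsNotExist(err) {
			// Memoryless NUMA nodes have no hugepages directory at all;
			// report zero hugepages rather than failing the whole scan.
			return map[string]int64{}, nil
		}
		return nil, err
	}

	pages := make(map[string]int64, len(entries))
	for _, e := range entries {
		raw, err := os.ReadFile(filepath.Join(dir, e.Name(), "nr_hugepages"))
		if err != nil {
			return nil, err
		}
		n, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
		if err != nil {
			return nil, err
		}
		pages[e.Name()] = n
	}
	return pages, nil
}

func main() {
	for _, node := range []int{0, 1} {
		pages, err := hugepagesForNode("/sys", node)
		if err != nil {
			fmt.Fprintf(os.Stderr, "node%d: %v\n", node, err)
			continue
		}
		fmt.Printf("node%d: %v\n", node, pages)
	}
}

Treating the missing directory as zero hugepages matches how the kernel represents memoryless nodes and lets the scan continue for the remaining nodes.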
What you expected to happen: nfd-topology-updater pods run normally even when some NUMA nodes have no hugepages.
How to reproduce it (as minimally and precisely as possible):
$ git clone https://github.com/kubernetes-sigs/node-feature-discovery.git
$ cd node-feature-discovery
$ kubectl apply -k deployment/overlays/topologyupdater
Environment:
/assign
Thanks @freelizhun for reporting this (and for the fix, too).