k8scoder192 opened this issue 1 month ago
kubernetes-nmstate runs some probes before committing the changes, to ensure they do not break the cluster; that's where the time goes.
@qinqon did you look at the full logs? It's NOT due to the probes. The probes come back extremely quickly. Something else is wrong.
Edit: I also noticed this is only happening on one worker node. If I look at the logs of the other nmstate-handler pods, they aren't constantly reporting "enactment updated at the node: true". Something is triggering this pod to constantly perform state changes like the ones below for all the interfaces.
journalctl -S today -u NetworkManager.service -f (output on the node where I see the problem):
May 31 18:12:37 dal4-qz4-sr7-rk047-s04 NetworkManager[2025380]: <info> [1717179157.9313] device (usb-int-v87): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
May 31 18:12:37 dal4-qz4-sr7-rk047-s04 NetworkManager[2025380]: <info> [1717179157.9321] device (usb-int-v87): Activation: starting connection 'usb-int-v87' (74ec521d-7a67-4ade-82ae-19f6892e645c)
May 31 18:12:37 dal4-qz4-sr7-rk047-s04 NetworkManager[2025380]: <info> [1717179157.9326] device (usb-int-v87): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
May 31 18:12:37 dal4-qz4-sr7-rk047-s04 NetworkManager[2025380]: <info> [1717179157.9342] device (usb-int-v87): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
May 31 18:12:37 dal4-qz4-sr7-rk047-s04 NetworkManager[2025380]: <info> [1717179157.9371] device (usb-int-v87): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
May 31 18:12:38 dal4-qz4-sr7-rk047-s04 NetworkManager[2025380]: <info> [1717179158.0251] device (usb-int-br87): attached bridge port usb-int-v87
May 31 18:12:38 dal4-qz4-sr7-rk047-s04 NetworkManager[2025380]: <info> [1717179158.0251] device (usb-int-v87): Activation: connection 'usb-int-v87' enslaved, continuing activation
May 31 18:12:38 dal4-qz4-sr7-rk047-s04 NetworkManager[2025380]: <info> [1717179158.0256] device (usb-int-v87): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
May 31 18:12:38 dal4-qz4-sr7-rk047-s04 NetworkManager[2025380]: <info> [1717179158.0281] device (usb-int-v87): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
May 31 18:12:38 dal4-qz4-sr7-rk047-s04 NetworkManager[2025380]: <info> [1717179158.0283] device (usb-int-v87): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
May 31 18:12:38 dal4-qz4-sr7-rk047-s04 NetworkManager[2025380]: <info> [1717179158.0287] device (usb-int-v87): Activation: successful, device activated.
May 31 18:12:42 dal4-qz4-sr7-rk047-s04 NetworkManager[2025380]: <info> [1717179162.9356] checkpoint[0x55bcd4c8d450]: destroy /org/freedesktop/NetworkManager/Checkpoint/292
May 31 18:12:42 dal4-qz4-sr7-rk047-s04 NetworkManager[2025380]: <info> [1717179162.9390] audit: op="checkpoint-destroy" arg="/org/freedesktop/NetworkManager/Checkpoint/292" pid=2292492 uid=0 result="success"
May 31 18:12:46 dal4-qz4-sr7-rk047-s04 NetworkManager[2025380]: <info> [1717179166.5941] audit: op="checkpoint-create" arg="/org/freedesktop/NetworkManager/Checkpoint/293" pid=2292893 uid=0 result="success"
May 31 18:12:46 dal4-qz4-sr7-rk047-s04 NetworkManager[2025380]: <info> [1717179166.5968] audit: op="checkpoint-adjust-rollback-timeout" arg="/org/freedesktop/NetworkManager/Checkpoint/293" pid=2292893 uid=0 result="success"
May 31 18:12:47 dal4-qz4-sr7-rk047-s04 NetworkManager[2025380]: <info> [1717179167.3864] audit: op="connection-update" uuid="c504a0fa-730b-4182-8b98-f404ab44c196" name="usb-int-br67" args="connection.timestamp" pid=2292893 uid=0 result="success"
May 31 18:12:47 dal4-qz4-sr7-rk047-s04 NetworkManager[2025380]: <info> [1717179167.5521] audit: op="device-reapply" interface="usb-int-br67" ifindex=177998 pid=2292893 uid=0 result="success"
May 31 18:12:47 dal4-qz4-sr7-rk047-s04 NetworkManager[2025380]: <info> [1717179167.5563] audit: op="checkpoint-adjust-rollback-timeout" arg="/org/freedesktop/NetworkManager/Checkpoint/293" pid=2292893 uid=0 result="success"
When I look at the NNCPs for that node, they constantly go from "Available SuccessfullyConfigured" to a blank status and back to "Available SuccessfullyConfigured". When that happens I see the above journalctl output for that particular interface.
I really need to figure out why kubernetes-nmstate is behaving this way. This node's nmstate-handler pod restarted this week. The other nodes' pods have been up for 200+ days. I'm afraid to bounce the other nodes' pods because of this odd behavior.
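To see the flapping as it happens, the per-node enactment objects can be watched alongside the policy. A minimal sketch (the `nncp`/`nnce` short names come from the kubernetes-nmstate CRDs; the enactment name below combines the node name from the logs above with a hypothetical policy name):

```shell
# Watch the policy and the per-node enactments flip between
# SuccessfullyConfigured and a blank/progressing status in real time.
kubectl get nncp -w
kubectl get nnce -w

# Dump the conditions on the enactment for the affected node
# (node name from the journalctl output above; policy name "usb-int-v87" is an assumption).
kubectl get nnce dal4-qz4-sr7-rk047-s04.usb-int-v87 -o yaml
```

Comparing the enactment conditions' timestamps against the journalctl checkpoint entries should show whether the handler itself is re-triggering the reconfiguration.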
What happened: Applying an NNCP takes over 60 seconds. If I exec into the node's nmstate-handler pod and use nmstatectl to create an interface (vlan or bridge), it takes seconds.
What you expected to happen: The NNCP to apply quickly.
How to reproduce it (as minimally and precisely as possible): Apply a YAML such as this.
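The original YAML is not included above; a hedged reconstruction of what such a policy might look like, based on the interface names in the logs (the base interface and node selector are assumptions):

```yaml
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: usb-int-v87            # name taken from the interface seen in the logs
spec:
  nodeSelector:
    kubernetes.io/hostname: dal4-qz4-sr7-rk047-s04
  desiredState:
    interfaces:
      - name: usb-int-v87
        type: vlan
        state: up
        vlan:
          base-iface: eno1     # hypothetical: the actual base interface is not shown
          id: 87
      - name: usb-int-br87     # bridge that the logs show enslaving usb-int-v87
        type: linux-bridge
        state: up
        bridge:
          port:
            - name: usb-int-v87
```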
Anything else we need to know?: nmstate version --> kubernetes-nmstate-handler:v0.80.0
Environment:
- NodeNetworkState on affected nodes (use kubectl get nodenetworkstate <node_name> -o yaml):
- Problematic NodeNetworkConfigurationPolicy: Any, but see the above YAML; here is the actual output from the NNCP.
- kubernetes-nmstate image (use kubectl get pods --all-namespaces -l app=kubernetes-nmstate -o jsonpath='{.items[0].spec.containers[0].image}'):
- NetworkManager version (use nmcli --version):
- Kubernetes version (use kubectl version):
- OS (e.g. from /etc/os-release):
- Kernel:
kubectl logs nmstate-handler-9zkz6 -n nmstate
I killed the pod and a new one started up properly. I tried the apply for another vlan interface and it still took > 60s to apply.
I attached the full logs of nmstate-handler-9zkz6: logs--svc-uc-v108-apply.txt
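To narrow down where the 60+ seconds go, the bracketed epoch timestamps in the journalctl audit lines can be diffed directly. A minimal sketch, using two sample lines copied from the output above (the nm.log file name is an assumption):

```shell
# Two audit lines from the journalctl output above.
cat > nm.log <<'EOF'
May 31 18:12:46 dal4-qz4-sr7-rk047-s04 NetworkManager[2025380]: <info> [1717179166.5941] audit: op="checkpoint-create" arg="/org/freedesktop/NetworkManager/Checkpoint/293" pid=2292893 uid=0 result="success"
May 31 18:12:47 dal4-qz4-sr7-rk047-s04 NetworkManager[2025380]: <info> [1717179167.5521] audit: op="device-reapply" interface="usb-int-br67" ifindex=177998 pid=2292893 uid=0 result="success"
EOF

# Pull the bracketed epoch timestamp from each audit op.
create=$(grep 'op="checkpoint-create"' nm.log | sed 's/.*\[\([0-9.]*\)\].*/\1/')
reapply=$(grep 'op="device-reapply"' nm.log | sed 's/.*\[\([0-9.]*\)\].*/\1/')

# The difference shows how much of the apply time is the reapply itself;
# if it is well under 60s, the delay is elsewhere in the handler.
delta=$(awk -v a="$create" -v b="$reapply" 'BEGIN { printf "%.3f", b - a }')
echo "device-reapply ${delta}s after checkpoint-create"
```

In this sample the reapply lands well under a second after the checkpoint is created, which would point the investigation away from NetworkManager and toward the handler's reconciliation loop.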