Open ivansharamok opened 7 months ago
Hi @ivansharamok, could you share the vpp-manager logs:
kubectl logs -n calico-vpp-dataplane calico-vpp-node-XYZ -c vpp
Also, is there a specific reason for using v3.26 instead of the latest v3.27? If possible, could you switch to v3.27?
Are the nodes using NetworkManager or systemd.networkd? Could you please share the appropriate logs (NM or systemd.networkd) when this issue happens?
I tried v3.27.0, but the calicovpp/install-whereabouts image wasn't published to Docker Hub, which prompted me to switch to v3.26.0. I see that it was published a few days ago. I'll give it a try and update this ticket.
Installed Calico VPP v3.27.0. Hit the same issue. Below is the info collected from the cluster using Calico VPP v3.27.0.
Looks like Ubuntu 22.04 uses systemd-networkd by default.
# checking if NetworkManager is used
azureuser@master:~$ systemctl status NetworkManager
Unit NetworkManager.service could not be found.
azureuser@master:~$ systemctl status network-manager
Unit network-manager.service could not be found.
# checking if systemd-networkd is used
azureuser@master:~$ systemctl status /etc/network/interfaces
Unit etc-network-interfaces.mount could not be found.
azureuser@master:~$ systemctl status systemd-networkd
● systemd-networkd.service - Network Configuration
Loaded: loaded (/lib/systemd/system/systemd-networkd.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2024-04-04 17:10:06 UTC; 15min ago
TriggeredBy: ● systemd-networkd.socket
Docs: man:systemd-networkd.service(8)
Main PID: 7977 (systemd-network)
Status: "Processing requests..."
Tasks: 1 (limit: 19179)
Memory: 1.3M
CPU: 136ms
CGroup: /system.slice/systemd-networkd.service
└─7977 /lib/systemd/systemd-networkd
Apr 04 17:10:06 master systemd[1]: Starting Network Configuration...
Apr 04 17:10:06 master systemd-networkd[7977]: eth0: Link UP
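The manual checks above can be collapsed into one helper. This is only a sketch: it probes the same four unit names that the default hook (logged further down in this thread) tries, but it uses `systemctl is-active` rather than the `systemctl status` calls shown above, and the function name is made up for illustration.

```shell
#!/bin/sh
# Print the first network backend that systemd reports as active.
# The unit names mirror the ones the default Calico VPP hook probes.
# detect_network_backend is a hypothetical helper name.
detect_network_backend() {
    for unit in systemd-networkd NetworkManager networking network; do
        # "is-active --quiet" exits 0 only when the unit is currently active
        if systemctl is-active --quiet "$unit" 2>/dev/null; then
            echo "$unit"
            return 0
        fi
    done
    # None of the known units is active
    echo "unknown"
    return 1
}
```

Calling `detect_network_backend` on the host prints a single unit name, or `unknown` if none of the four units is active.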
Here's the log for systemd-networkd (journalctl -u systemd-networkd).
Apr 04 16:32:09 master systemd[1]: Starting Network Configuration...
Apr 04 16:32:09 master systemd-networkd[539]: lo: Link UP
Apr 04 16:32:09 master systemd-networkd[539]: lo: Gained carrier
Apr 04 16:32:09 master systemd-networkd[539]: Enumeration completed
Apr 04 16:32:09 master systemd[1]: Started Network Configuration.
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Link UP
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Gained carrier
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Link DOWN
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Lost carrier
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Link UP
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Gained carrier
Apr 04 16:32:09 master systemd-networkd[539]: eth0: DHCPv4 address 172.10.1.5/24 via 172.10.1.1
Apr 04 16:32:11 master systemd-networkd[539]: eth0: Gained IPv6LL
Apr 04 17:10:06 master systemd-networkd[539]: eth0: Link DOWN
Apr 04 17:10:06 master systemd-networkd[539]: eth0: Lost carrier
Apr 04 17:10:06 master systemd-networkd[539]: eth0: DHCP lease lost
Apr 04 17:10:06 master systemd-networkd[539]: eth0: DHCPv6 lease lost
Apr 04 17:10:06 master systemd-networkd[539]: eth0: Link UP
Apr 04 17:10:06 master systemd-networkd[539]: eth0: Gained carrier
Apr 04 17:10:06 master systemd[1]: Stopping Network Configuration...
Apr 04 17:10:06 master systemd[1]: systemd-networkd.service: Deactivated successfully.
Apr 04 17:10:06 master systemd[1]: Stopped Network Configuration.
Apr 04 17:10:06 master systemd[1]: Starting Network Configuration...
Apr 04 17:10:06 master systemd-networkd[7977]: eth0: Link UP
Apr 04 17:10:06 master systemd-networkd[7977]: eth0: Gained carrier
Apr 04 17:10:06 master systemd-networkd[7977]: lo: Link UP
Apr 04 17:10:06 master systemd-networkd[7977]: lo: Gained carrier
Apr 04 17:10:06 master systemd-networkd[7977]: Enumeration completed
Apr 04 17:10:06 master systemd[1]: Started Network Configuration.
Apr 04 17:10:07 master systemd-networkd[7977]: eth0: Gained IPv6LL
Apr 04 17:29:43 master systemd[1]: Stopping Network Configuration...
Apr 04 17:29:43 master systemd[1]: systemd-networkd.service: Deactivated successfully.
Apr 04 17:29:43 master systemd[1]: Stopped Network Configuration.
Apr 04 17:29:43 master systemd[1]: Starting Network Configuration...
Apr 04 17:29:43 master systemd-networkd[17212]: eth0: Link UP
Apr 04 17:29:43 master systemd-networkd[17212]: eth0: Gained carrier
Apr 04 17:29:43 master systemd-networkd[17212]: lo: Link UP
Apr 04 17:29:43 master systemd-networkd[17212]: lo: Gained carrier
Apr 04 17:29:43 master systemd-networkd[17212]: eth0: Gained IPv6LL
Apr 04 17:29:43 master systemd-networkd[17212]: Enumeration completed
Apr 04 17:29:43 master systemd[1]: Started Network Configuration.
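To spot the interesting entries in a longer journal, a small filter along these lines might help. It is a sketch that matches only the message strings visible in the log above; the function name is hypothetical.

```shell
#!/bin/sh
# Read systemd-networkd journal lines on stdin and keep only the entries
# that point at an uplink reset: link down, carrier loss, DHCP lease loss.
# lease_events is a hypothetical helper name.
lease_events() {
    grep -E 'Link DOWN|Lost carrier|DHCP(v6)? lease lost'
}
```

Typical use would be `journalctl -u systemd-networkd --no-pager | lease_events`.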
Apr 04 17:10:06 corresponds to when I installed Calico VPP in my cluster.
Apr 04 17:29:43 corresponds to the sudo systemctl restart systemd-networkd command, which I ran to see if restarting the networking service would help fix the problem. It didn't.
Logs for one of the calico-vpp-node pods:
time="2024-04-04T17:10:03Z" level=info msg="Version info\nImage tag : ab81a775fbdeba932888690c68ddf7e9f4bd8d2b\nVPP-dataplane version : ab81a77 Release v3.27.0\nVPP Version : 24.02-rc0~8-g9db45f6ae\nBinapi-generator version : v0.8.0\nVPP Base commit : 06efd532e gerrit:34726/3 interface: add buffer stats api\n------------------ Cherry picked commits --------------------\ncapo: Calico Policies plugin\nacl: acl-plugin custom policies\ncnat: [WIP] no k8s maglev from pods\npbl: Port based balancer\ngerrit:40078/3 vnet: allow format deleted swifidx\ngerrit:40090/3 cnat: undo fib_entry_contribute_forwarding\ngerrit:39507/13 cnat: add flow hash config to cnat translation\ngerrit:34726/3 interface: add buffer stats api\n-------------------------------------------------------------\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_SWAP_DRIVER="
time="2024-04-04T17:10:03Z" level=info msg="Config:SERVICE_PREFIX=[10.96.0.0/12]"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_GRACEFUL_SHUTDOWN_TIMEOUT=10s"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_INTERFACES={\n \"defaultPodIfSpec\": {\n \"rx\": 1,\n \"tx\": 1,\n \"rxqsz\": 0,\n \"txqsz\": 0,\n \"isl3\": true,\n \"rxMode\": 0\n },\n \"maxPodIfSpec\": {\n \"rx\": 10,\n \"tx\": 10,\n \"rxqsz\": 1024,\n \"txqsz\": 1024,\n \"isl3\": null,\n \"rxMode\": 0\n },\n \"vppHostTapSpec\": {\n \"rx\": 1,\n \"tx\": 1,\n \"rxqsz\": 1024,\n \"txqsz\": 1024,\n \"isl3\": false,\n \"rxMode\": 0\n },\n \"uplinkInterfaces\": [\n {\n \"rx\": 0,\n \"tx\": 0,\n \"rxqsz\": 0,\n \"txqsz\": 0,\n \"isl3\": null,\n \"rxMode\": 0,\n \"isMain\": false,\n \"physicalNetworkName\": \"\",\n \"interfaceName\": \"eth0\",\n \"vppDriver\": \"af_packet\",\n \"newDriver\": \"\",\n \"annotations\": null,\n \"mtu\": 0\n }\n ]\n}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_FEATURE_GATES={}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_IPSEC={\n \"nbAsyncCryptoThreads\": 0,\n \"extraAddresses\": 0\n}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_INITIAL_CONFIG={\n \"vppStartupSleepSeconds\": 1,\n \"corePattern\": \"/var/lib/vpp/vppcore.%e.%p\",\n \"extraAddrCount\": 0,\n \"ifConfigSavePath\": \"\",\n \"defaultGWs\": \"\",\n \"redirectToHostRules\": null\n}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_CONFIG_TEMPLATE=unix {\n nodaemon\n full-coredump\n cli-listen /var/run/vpp/cli.sock\n pidfile /run/vpp/vpp.pid\n exec /etc/vpp/startup.exec\n}\napi-trace { on }\ncpu {\n workers 0\n}\nsocksvr {\n socket-name /var/run/vpp/vpp-api.sock\n}\nplugins {\n plugin default { enable }\n plugin dpdk_plugin.so { disable }\n plugin calico_plugin.so { enable }\n plugin ping_plugin.so { disable }\n plugin dispatch_trace_plugin.so { enable }\n}\nbuffers {\n buffers-per-numa 131072\n}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_HOOK_BEFORE_IF_READ=#!/bin/sh\n\nHOOK=\"$0\"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; fixing dns...\"\n sed -i \"s/\\[main\\]/\\[main\\]\\ndns=none/\" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nundo_dns_fix () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; undoing dns fix...\"\n sed -i \"0,/dns=none/{/dns=none/d;}\" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nrestart_network () {\n if systemctl status systemd-networkd > /dev/null 2>&1; then\n echo \"default_hook: system is using systemd-networkd; restarting...\"\n systemctl restart systemd-networkd\n elif systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; restarting...\"\n systemctl restart NetworkManager\n elif systemctl status networking > /dev/null 2>&1; then\n echo \"default_hook: system is using networking service; restarting...\"\n systemctl restart networking\n elif systemctl status network > /dev/null 2>&1; then\n echo \"default_hook: system is using network service; restarting...\"\n systemctl restart network\n else\n echo \"default_hook: Networking backend not detected, network configuration may fail\"\n fi\n}\n\nif which systemctl > /dev/null; then\n echo \"default_hook: using systemctl...\"\nelse\n echo \"default_hook: Init system not supported, network configuration may fail\"\n exit 1\nfi\n\nif [ \"$HOOK\" = \"BEFORE_VPP_RUN\" ]; then\n fix_dns\nelif [ \"$HOOK\" = \"VPP_RUNNING\" ]; then\n restart_network\nelif [ \"$HOOK\" = \"VPP_DONE_OK\" ]; then\n undo_dns_fix\n restart_network\nelif [ \"$HOOK\" = \"VPP_ERRORED\" ]; then\n undo_dns_fix\n 
restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:NODENAME=master"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_BGP_LOG_LEVEL=INFO"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_HOOK_BEFORE_VPP_RUN=#!/bin/sh\n\nHOOK=\"$0\"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; fixing dns...\"\n sed -i \"s/\\[main\\]/\\[main\\]\\ndns=none/\" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nundo_dns_fix () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; undoing dns fix...\"\n sed -i \"0,/dns=none/{/dns=none/d;}\" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nrestart_network () {\n if systemctl status systemd-networkd > /dev/null 2>&1; then\n echo \"default_hook: system is using systemd-networkd; restarting...\"\n systemctl restart systemd-networkd\n elif systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; restarting...\"\n systemctl restart NetworkManager\n elif systemctl status networking > /dev/null 2>&1; then\n echo \"default_hook: system is using networking service; restarting...\"\n systemctl restart networking\n elif systemctl status network > /dev/null 2>&1; then\n echo \"default_hook: system is using network service; restarting...\"\n systemctl restart network\n else\n echo \"default_hook: Networking backend not detected, network configuration may fail\"\n fi\n}\n\nif which systemctl > /dev/null; then\n echo \"default_hook: using systemctl...\"\nelse\n echo \"default_hook: Init system not supported, network configuration may fail\"\n exit 1\nfi\n\nif [ \"$HOOK\" = \"BEFORE_VPP_RUN\" ]; then\n fix_dns\nelif [ \"$HOOK\" = \"VPP_RUNNING\" ]; then\n restart_network\nelif [ \"$HOOK\" = \"VPP_DONE_OK\" ]; then\n undo_dns_fix\n restart_network\nelif [ \"$HOOK\" = \"VPP_ERRORED\" ]; then\n undo_dns_fix\n 
restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_HOOK_VPP_RUNNING=#!/bin/sh\n\nHOOK=\"$0\"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; fixing dns...\"\n sed -i \"s/\\[main\\]/\\[main\\]\\ndns=none/\" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nundo_dns_fix () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; undoing dns fix...\"\n sed -i \"0,/dns=none/{/dns=none/d;}\" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nrestart_network () {\n if systemctl status systemd-networkd > /dev/null 2>&1; then\n echo \"default_hook: system is using systemd-networkd; restarting...\"\n systemctl restart systemd-networkd\n elif systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; restarting...\"\n systemctl restart NetworkManager\n elif systemctl status networking > /dev/null 2>&1; then\n echo \"default_hook: system is using networking service; restarting...\"\n systemctl restart networking\n elif systemctl status network > /dev/null 2>&1; then\n echo \"default_hook: system is using network service; restarting...\"\n systemctl restart network\n else\n echo \"default_hook: Networking backend not detected, network configuration may fail\"\n fi\n}\n\nif which systemctl > /dev/null; then\n echo \"default_hook: using systemctl...\"\nelse\n echo \"default_hook: Init system not supported, network configuration may fail\"\n exit 1\nfi\n\nif [ \"$HOOK\" = \"BEFORE_VPP_RUN\" ]; then\n fix_dns\nelif [ \"$HOOK\" = \"VPP_RUNNING\" ]; then\n restart_network\nelif [ \"$HOOK\" = \"VPP_DONE_OK\" ]; then\n undo_dns_fix\n restart_network\nelif [ \"$HOOK\" = \"VPP_ERRORED\" ]; then\n undo_dns_fix\n 
restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_SRV6={\n \"localsidPool\": \"\",\n \"policyPool\": \"\"\n}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_LOG_FORMAT="
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_INTERFACE="
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_INIT_SCRIPT_TEMPLATE="
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_CONFIG_EXEC_TEMPLATE="
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_HOOK_VPP_DONE_OK=#!/bin/sh\n\nHOOK=\"$0\"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; fixing dns...\"\n sed -i \"s/\\[main\\]/\\[main\\]\\ndns=none/\" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nundo_dns_fix () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; undoing dns fix...\"\n sed -i \"0,/dns=none/{/dns=none/d;}\" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nrestart_network () {\n if systemctl status systemd-networkd > /dev/null 2>&1; then\n echo \"default_hook: system is using systemd-networkd; restarting...\"\n systemctl restart systemd-networkd\n elif systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; restarting...\"\n systemctl restart NetworkManager\n elif systemctl status networking > /dev/null 2>&1; then\n echo \"default_hook: system is using networking service; restarting...\"\n systemctl restart networking\n elif systemctl status network > /dev/null 2>&1; then\n echo \"default_hook: system is using network service; restarting...\"\n systemctl restart network\n else\n echo \"default_hook: Networking backend not detected, network configuration may fail\"\n fi\n}\n\nif which systemctl > /dev/null; then\n echo \"default_hook: using systemctl...\"\nelse\n echo \"default_hook: Init system not supported, network configuration may fail\"\n exit 1\nfi\n\nif [ \"$HOOK\" = \"BEFORE_VPP_RUN\" ]; then\n fix_dns\nelif [ \"$HOOK\" = \"VPP_RUNNING\" ]; then\n restart_network\nelif [ \"$HOOK\" = \"VPP_DONE_OK\" ]; then\n undo_dns_fix\n restart_network\nelif [ \"$HOOK\" = \"VPP_ERRORED\" ]; then\n undo_dns_fix\n 
restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_LOG_LEVEL=info"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_DEBUG={}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_HOOK_VPP_ERRORED=#!/bin/sh\n\nHOOK=\"$0\"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; fixing dns...\"\n sed -i \"s/\\[main\\]/\\[main\\]\\ndns=none/\" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nundo_dns_fix () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; undoing dns fix...\"\n sed -i \"0,/dns=none/{/dns=none/d;}\" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nrestart_network () {\n if systemctl status systemd-networkd > /dev/null 2>&1; then\n echo \"default_hook: system is using systemd-networkd; restarting...\"\n systemctl restart systemd-networkd\n elif systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; restarting...\"\n systemctl restart NetworkManager\n elif systemctl status networking > /dev/null 2>&1; then\n echo \"default_hook: system is using networking service; restarting...\"\n systemctl restart networking\n elif systemctl status network > /dev/null 2>&1; then\n echo \"default_hook: system is using network service; restarting...\"\n systemctl restart network\n else\n echo \"default_hook: Networking backend not detected, network configuration may fail\"\n fi\n}\n\nif which systemctl > /dev/null; then\n echo \"default_hook: using systemctl...\"\nelse\n echo \"default_hook: Init system not supported, network configuration may fail\"\n exit 1\nfi\n\nif [ \"$HOOK\" = \"BEFORE_VPP_RUN\" ]; then\n fix_dns\nelif [ \"$HOOK\" = \"VPP_RUNNING\" ]; then\n restart_network\nelif [ \"$HOOK\" = \"VPP_DONE_OK\" ]; then\n undo_dns_fix\n restart_network\nelif [ \"$HOOK\" = \"VPP_ERRORED\" ]; then\n undo_dns_fix\n 
restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_IPSEC_IKEV2_PSK="
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_NATIVE_DRIVER="
default_hook: using systemctl...
time="2024-04-04T17:10:03Z" level=info msg="No pci device for interface eth0"
time="2024-04-04T17:10:03Z" level=info msg="-- Environment --"
time="2024-04-04T17:10:03Z" level=info msg="Hugepages 0"
time="2024-04-04T17:10:03Z" level=info msg="KernelVersion 5.15.0-1042"
time="2024-04-04T17:10:03Z" level=info msg="Drivers map[uio_pci_generic:false vfio-pci:true]"
time="2024-04-04T17:10:03Z" level=info msg="initial iommu status N"
time="2024-04-04T17:10:03Z" level=info msg="-- Interface Spec --"
time="2024-04-04T17:10:03Z" level=info msg="Interface Name: eth0"
time="2024-04-04T17:10:03Z" level=info msg="Native Driver: af_packet"
time="2024-04-04T17:10:03Z" level=info msg="New Drive Name: "
time="2024-04-04T17:10:03Z" level=info msg="PHY target #Queues rx:0 tx:0"
time="2024-04-04T17:10:03Z" level=info msg="Tap MTU: 0"
time="2024-04-04T17:10:03Z" level=info msg="-- Interface config --"
time="2024-04-04T17:10:03Z" level=info msg="Node IP4: 172.10.1.5/24"
time="2024-04-04T17:10:03Z" level=info msg="Node IP6: "
time="2024-04-04T17:10:03Z" level=info msg="PciId: "
time="2024-04-04T17:10:03Z" level=info msg="Driver: "
time="2024-04-04T17:10:03Z" level=info msg="Linux IF was up ? true"
time="2024-04-04T17:10:03Z" level=info msg="Promisc was on ? false"
time="2024-04-04T17:10:03Z" level=info msg="DoSwapDriver: false"
time="2024-04-04T17:10:03Z" level=info msg="Mac: 00:22:48:c0:5e:e6"
time="2024-04-04T17:10:03Z" level=info msg="Addresses: [172.10.1.5/24 eth0,fe80::222:48ff:fec0:5ee6/64]"
time="2024-04-04T17:10:03Z" level=info msg="Routes: [{Ifindex: 2 Dst: fe80::/64 Src: <nil> Gw: <nil> Flags: [] Table: 254 Realm: 0}, {Ifindex: 2 Dst: 172.10.1.1/32 Src: 172.10.1.5 Gw: <nil> Flags: [] Table: 254 Realm: 0}, {Ifindex: 2 Dst: 172.10.1.0/24 Src: 172.10.1.5 Gw: <nil> Flags: [] Table: 254 Realm: 0}, {Ifindex: 2 Dst: 168.63.129.16/32 Src: 172.10.1.5 Gw: 172.10.1.1 Flags: [] Table: 254 Realm: 0}, {Ifindex: 2 Dst: 169.254.169.254/32 Src: 172.10.1.5 Gw: 172.10.1.1 Flags: [] Table: 254 Realm: 0}, <Dst: nil (default), Ifindex: 2, Gw: 172.10.1.1, Src: 172.10.1.5, >]"
time="2024-04-04T17:10:03Z" level=info msg="PHY original #Queues rx:64 tx:64"
time="2024-04-04T17:10:03Z" level=info msg="MTU 1500"
time="2024-04-04T17:10:03Z" level=info msg="isTunTap false"
time="2024-04-04T17:10:03Z" level=info msg="isVeth false"
time="2024-04-04T17:10:03Z" level=info msg="Running with uplink af_packet"
default_hook: using systemctl...
time="2024-04-04T17:10:03Z" level=info msg="VPP started [PID 7918]"
vpp[7918]: clib_sysfs_prealloc_hugepages:236: pre-allocating 149 additional 2048K hugepages on numa node 0
vpp[7918]: buffer: numa[0] falling back to non-hugepage backed buffer pool (vlib_physmem_shared_map_create: pmalloc_map_pages: Unable to lock pages: Cannot allocate memory)
time="2024-04-04T17:10:04Z" level=info msg="Waiting for VPP... [0/10]"
vpp[7918]: perfmon: skipping source 'intel-uncore' - intel_uncore_init: no uncore units found
vpp[7918]: tls_init_ca_chain:1086: Could not initialize TLS CA certificates
vpp[7918]: tls_openssl_init:1209: failed to initialize TLS CA chain
vpp[7918]: vat-plug/load: vat_plugin_register: idpf plugin not loaded...
vpp[7918]: vat-plug/load: vat_plugin_register: oddbuf plugin not loaded...
time="2024-04-04T17:10:06Z" level=info msg="Created AF_PACKET interface 1"
time="2024-04-04T17:10:06Z" level=info msg="tagging interface [1] with: main-eth0"
time="2024-04-04T17:10:06Z" level=info msg="Adding address 172.10.1.5/24 eth0 to uplink interface"
time="2024-04-04T17:10:06Z" level=info msg="Not adding address fe80::222:48ff:fec0:5ee6/64 to uplink interface (vpp requires /128 link-local)"
time="2024-04-04T17:10:06Z" level=info msg="Creating Linux side interface"
time="2024-04-04T17:10:06Z" level=info msg="Adding address 172.10.1.5/24 eth0 to tap interface"
time="2024-04-04T17:10:06Z" level=info msg="Not adding address fe80::222:48ff:fec0:5ee6/64 to data interface (vpp requires /128 link-local)"
time="2024-04-04T17:10:06Z" level=info msg="Adding ND proxy for address fe80::222:48ff:fec0:5ee6"
time="2024-04-04T17:10:06Z" level=info msg="Adding address 172.10.1.5/24 eth0 to tap interface"
time="2024-04-04T17:10:06Z" level=info msg="Adding address fe80::222:48ff:fec0:5ee6/64 to tap interface"
time="2024-04-04T17:10:06Z" level=warning msg="add addr fe80::222:48ff:fec0:5ee6/64 via vpp EEXIST, file exists"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: fe80::/64 Src: <nil> Gw: <nil> Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="add route via vpp : {Ifindex: 3 Dst: fe80::/64 Src: <nil> Gw: <nil> Flags: [] Table: 254 Realm: 0} already exists"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: 172.10.1.1/32 Src: 172.10.1.5 Gw: <nil> Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: 172.10.1.0/24 Src: 172.10.1.5 Gw: <nil> Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: 168.63.129.16/32 Src: 172.10.1.5 Gw: 172.10.1.1 Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: 169.254.169.254/32 Src: 172.10.1.5 Gw: 172.10.1.1 Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: <nil> Src: 172.10.1.5 Gw: 172.10.1.1 Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="Using 172.10.1.254 as next hop for cluster IPv4 routes"
time="2024-04-04T17:10:06Z" level=info msg="Setting BGP nodeIP 172.10.1.5/24"
time="2024-04-04T17:10:06Z" level=info msg="Updating node, version = 1741, metaversion = 1741"
default_hook: using systemctl...
default_hook: system is using systemd-networkd; restarting...
time="2024-04-04T17:10:06Z" level=info msg="Received signal child exited, vpp index 1"
time="2024-04-04T17:10:06Z" level=info msg="Ignoring SIGCHLD for pid 0"
time="2024-04-04T17:10:06Z" level=info msg="Done with signal child exited"
I just tried switching from Ubuntu 22.04 to CentOS 8, and I didn't run into the DNS resolution issue on the host when using CentOS hosts. I noticed that CentOS uses NetworkManager by default. At this point, I'm not sure what the exact root cause of the issue is, but it might be related to networking being managed by systemd-networkd, or perhaps to some other default network management configuration bundled with Ubuntu.
On CentOS hosts, the /etc/resolv.conf file doesn't get edited when the calico-vpp-node pods come up.
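To confirm on a given host whether /etc/resolv.conf is actually edited, one option is to snapshot the file before installing Calico VPP and diff it once the pods are running. A minimal sketch, with hypothetical function names and a caller-chosen snapshot path:

```shell
#!/bin/sh
# Save a copy of a resolver file so later changes can be detected.
# -L copies the target when the file is a symlink.
# Usage: snapshot_resolv <file> <snapshot>
snapshot_resolv() {
    cp -L -- "$1" "$2"
}

# Compare the file against its snapshot.
# Exits 0 and prints nothing if they match; otherwise prints a unified
# diff (snapshot first, current file second) and exits non-zero.
# Usage: check_resolv_unchanged <file> <snapshot>
check_resolv_unchanged() {
    diff -u -- "$2" "$1"
}
```

For example, run `snapshot_resolv /etc/resolv.conf /tmp/resolv.conf.before` before the install, then `check_resolv_unchanged /etc/resolv.conf /tmp/resolv.conf.before` after calico-vpp-node reaches Running. The /tmp path is just a placeholder.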
Thanks for the details, and sorry about the missing whereabouts image. Tagging it somehow got missed during the release :)
What happens is that when calico-vpp-node starts, it takes over the uplink interface and replaces it with a tap. systemd-networkd does not like this disappearing act and triggers a reset, which involves expiry of the DHCP lease (as can be seen in the logs) and in some cases also wipes out the DNS config.
We have faced this issue in the past, and usually a restart of systemd-networkd has done the trick. Somehow the restart doesn't seem to be effective in your case. This will require some further digging. But for a quick fix, I can think of the following:
1. NetworkManager has a config option, dns=none, which tells it not to meddle with the DNS config at all, so the DNS config remains intact when calico-vpp-node starts running. So, if switching to NM is OK with you, you could try it.
2. After the Azure instances are up and running, modify netplan to make the network config static instead of DHCP, and then start the kubeadm steps to install the cluster.
3. Try the systemd-networkd option Unmanaged=true for the uplink interface. It seems similar to NM's dns=none, but I'm not really sure. Refer to this link: https://github.com/systemd/systemd/issues/28626
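For the netplan and Unmanaged suggestions above, the config fragments might look roughly like what the two helpers below print. This is an untested sketch, not a verified fix: the function names are hypothetical, the DNS server must be supplied by the caller, and where the files should live (and how they interact with the cloud-init generated netplan config on Azure) is left open.

```shell
#!/bin/sh
# Print a netplan config that gives one interface a static IPv4 setup.
# Usage: print_static_netplan <iface> <addr/prefix> <gateway> <dns-server>
print_static_netplan() {
    cat <<EOF
network:
  version: 2
  ethernets:
    $1:
      dhcp4: false
      addresses: [$2]
      routes:
        - to: default
          via: $3
      nameservers:
        addresses: [$4]
EOF
}

# Print a systemd-networkd .network unit that marks one interface as
# unmanaged, using the Unmanaged= setting from the [Link] section.
# Usage: print_unmanaged_network <iface>
print_unmanaged_network() {
    cat <<EOF
[Match]
Name=$1

[Link]
Unmanaged=true
EOF
}
```

With the values from the logs in this thread, the first call would be `print_static_netplan eth0 172.10.1.5/24 172.10.1.1 <dns-server>`, where the DNS server is whatever the host's resolver config listed before the install. Both helpers only print to stdout, so the output can be reviewed before writing it anywhere.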
Environment
Linux master 5.15.0-1042-azure #49-Ubuntu SMP Tue Jul 11 17:28:46 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Issue description
The calico-vpp-node pods somehow break DNS resolution on the hosts once those pods are fully initialized and running. The /etc/resolv.conf file on the hosts gets edited when the calico-vpp-node pod is running. DNS resolution from within the calico-vpp-node pods works fine. It is the host's DNS resolution that is affected, which prevents the Calico VPP components from all being configured correctly, as some pods get stuck in the ImagePullBackOff state.
To Reproduce
Steps to reproduce the behavior:
Standard_D4s_v3 size instances
interfaceName: eth0 instead of the default eth1 as shown below:
installation-default.yaml was edited as follows:
Expected behavior
Installation of Calico VPP should not disrupt the host's DNS resolution.
Additional context
calico-vpp-node pods getting initialized, the DNS resolution on the host works as expected. However, once the calico-vpp-dataplane/calico-vpp-node pods get to the Running state, the DNS resolution stops working on the host and the /etc/resolv.conf file gets modified.
/etc/resolv.conf on the host before Calico VPP is installed
/etc/resolv.conf on the host after the calico-vpp-node pod reaches the Running state
/etc/resolv.conf inside the calico-vpp-node pods
curl google.com from within the calico-vpp-node pod, but the same query fails on the host with the message curl: (6) Could not resolve host: google.com
calico-vpp-node is up or right after you manually kill the pod and before it's back up
calico-vpp-node pod is up
calico-vpp-node pods get up and running, is to manually kill the calico-vpp-node pods and force restart the pods that are failing to pull the images. Since it takes the calico-vpp-node pods a few moments to get to the Running state, the other cycled workload pods usually get a chance to start pulling the image before DNS resolution is broken again.
/etc/resolv.conf file on the host and make it look like the one I fetch from within the calico-vpp-node pods. DNS resolution then works until the calico-vpp-node pod gets restarted, as the restart of that pod seems to overwrite the /etc/resolv.conf file once again.
Would like to understand what breaks the DNS resolution on the hosts when the Calico VPP dataplane gets installed on the cluster.
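The pod-cycling workaround described above could be scripted along these lines. It is a sketch of the manual steps only, with hypothetical helper names; it assumes the default `kubectl get pods -A --no-headers` column order (NAMESPACE, NAME, READY, STATUS, ...).

```shell
#!/bin/sh
# Read `kubectl get pods -A --no-headers` output on stdin and print
# "<namespace> <pod>" for each pod whose STATUS is ImagePullBackOff.
stuck_pods() {
    awk '$4 == "ImagePullBackOff" { print $1, $2 }'
}

# Read "<namespace> <pod>" pairs on stdin and delete each pod.
# Pods owned by a controller (Deployment, DaemonSet, ...) get recreated.
cycle_pods() {
    while read -r ns pod; do
        kubectl delete pod -n "$ns" "$pod"
    done
}
```

After killing the calico-vpp-node pods, running `kubectl get pods -A --no-headers | stuck_pods | cycle_pods` restarts the pods that failed to pull their images, which gives them a short window to pull before DNS resolution breaks again.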