projectcalico / vpp-dataplane

VPP dataplane implementation for Calico
Apache License 2.0

DNS resolution on Azure Compute hosts running Ubuntu OS stops working once calico-vpp-node pods get up and running #688

Open ivansharamok opened 7 months ago

ivansharamok commented 7 months ago

Environment

Issue description The calico-vpp-node pods break DNS resolution on the hosts once they are fully initialized and running. The /etc/resolv.conf file on the hosts gets edited while the calico-vpp-node pod is running. DNS resolution from within the calico-vpp-node pods works fine; only the host's DNS resolution is affected. As a result, not all Calico VPP components get configured correctly, because some pods get stuck in the ImagePullBackOff state.

To Reproduce Steps to reproduce the behavior:

  CALICOVPP_INTERFACES: |-
    {
      "maxPodIfSpec": {
        "rx": 10, "tx": 10, "rxqsz": 1024, "txqsz": 1024
      },
      "defaultPodIfSpec": {
        "rx": 1, "tx":1, "isl3": true
      },
      "vppHostTapSpec": {
        "rx": 1, "tx":1, "rxqsz": 1024, "txqsz": 1024, "isl3": false
      },
      "uplinkInterfaces": [
        {
          "interfaceName": "eth0",
          "vppDriver": "af_packet"
        }
      ]
    }
kind: Installation
metadata:
  name: default
spec:
  # Configures Calico networking.
  calicoNetwork:
    linuxDataplane: VPP
    ipPools:
    - cidr: 192.168.0.0/16
      encapsulation: VXLAN

Expected behavior Installation of Calico VPP should not disrupt host's DNS resolution.

Additional context

kubectl apply --server-side --force-conflicts -f tigera-operator.yaml
kubectl apply -f installation-default.yaml
kubectl apply -f calico-vpp-nohuge.yaml
nameserver 127.0.0.53
options edns0 trust-ad
search abkhse5g3e5ebd4v3jenyazk4h.xx.internal.cloudapp.net
nameserver 127.0.0.53
options edns0 trust-ad
search .
search abkhse5g3e5ebd4v3jenyazk4h.xx.internal.cloudapp.net
nameserver 168.63.129.16
default via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
168.63.129.16 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
169.254.169.254 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
172.10.1.0/24 dev eth0 proto kernel scope link src 172.10.1.4 metric 100
172.10.1.1 dev eth0 proto dhcp scope link src 172.10.1.4 metric 100
default via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
10.96.0.0/12 via 172.10.1.254 dev eth0 proto static mtu 1440
168.63.129.16 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
169.254.169.254 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
172.10.1.0/24 dev eth0 proto kernel scope link src 172.10.1.4
172.10.1.0/24 dev eth0 proto kernel scope link src 172.10.1.4 metric 100
172.10.1.1 dev eth0 proto dhcp scope link src 172.10.1.4 metric 100
192.168.0.0/16 via 172.10.1.254 dev eth0 proto static mtu 1440

I would like to understand what breaks DNS resolution on the hosts when the Calico VPP dataplane is installed on the cluster.
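
One way to pin down the change is to snapshot the resolver configuration before and after the pods start and compare the entries. The sketch below is only illustrative: the helper names `list_nameservers`, `list_search_domains` and `resolv_changed` are hypothetical (they are not part of Calico VPP), and it assumes the usual resolv.conf layout of one `nameserver` or `search` directive per line.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Calico VPP) for spotting resolv.conf changes.

# Print the nameserver addresses found in a resolv.conf-style file, one per line.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "$1"
}

# Print the search domains found in a resolv.conf-style file, one per line.
list_search_domains() {
    awk '$1 == "search" { for (i = 2; i <= NF; i++) print $i }' "$1"
}

# Return 0 (true) if the two snapshots differ in nameservers or search domains.
resolv_changed() {
    before=$1
    after=$2
    [ "$(list_nameservers "$before")" != "$(list_nameservers "$after")" ] ||
        [ "$(list_search_domains "$before")" != "$(list_search_domains "$after")" ]
}
```

For example, copy /etc/resolv.conf to a scratch file before applying calico-vpp-nohuge.yaml, copy it again once the calico-vpp-node pod is Running, and call `resolv_changed` on the two copies.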

onong commented 7 months ago

Hi @ivansharamok, could you share the vpp-manager logs:

kubectl logs -n calico-vpp-dataplane calico-vpp-node-XYZ -c vpp

Also, any specific reason for using v3.26 instead of the latest v3.27? If possible, could you switch to v3.27?

onong commented 7 months ago

Are the nodes using NetworkManager or systemd-networkd? Could you please share the appropriate logs (NM or systemd-networkd) when this issue happens?
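
A quick way to answer the backend question on a node could look like the following. This is a sketch under assumptions: it presumes a systemd host, the function names `detect_network_backend` and `backend_log_command` are hypothetical, and it reports only the first active service in the list even if more than one is running.

```shell
#!/bin/sh
# Hypothetical helper: report which network service is active on this host.
# Assumes a systemd host; prints "unknown" if none of the candidates is active.
detect_network_backend() {
    for svc in systemd-networkd NetworkManager networking network; do
        if systemctl is-active --quiet "$svc" 2>/dev/null; then
            echo "$svc"
            return 0
        fi
    done
    echo "unknown"
    return 1
}

# Print the command that would fetch that service's logs from the current boot.
backend_log_command() {
    echo "journalctl -b -u $1 --no-pager"
}
```

Running `backend_log_command "$(detect_network_backend)"` prints a journalctl invocation that can be pasted on the node to collect the requested logs.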

ivansharamok commented 7 months ago

Also, any specific reasons for using v3.26 instead of the latest v3.27? if possible, could you switch to v3.27?

I tried v3.27.0, but the calicovpp/install-whereabouts image wasn't published to Docker Hub, which prompted me to switch to v3.26.0. I see that it was published a few days ago. I'll give it a try and update this ticket.

ivansharamok commented 7 months ago

Installed Calico VPP v3.27.0. Hit the same issue. Below is the info collected from the cluster using Calico VPP v3.27.0.

Looks like Ubuntu 22.04 by default uses systemd-networkd.

# checking if NetworkManager is used
azureuser@master:~$ systemctl status NetworkManager
Unit NetworkManager.service could not be found.

azureuser@master:~$ systemctl status network-manager
Unit network-manager.service could not be found.

# checking if systemd-networkd is used
azureuser@master:~$ systemctl status /etc/network/interfaces
Unit etc-network-interfaces.mount could not be found.
azureuser@master:~$ systemctl status systemd-networkd
● systemd-networkd.service - Network Configuration
     Loaded: loaded (/lib/systemd/system/systemd-networkd.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2024-04-04 17:10:06 UTC; 15min ago
TriggeredBy: ● systemd-networkd.socket
       Docs: man:systemd-networkd.service(8)
   Main PID: 7977 (systemd-network)
     Status: "Processing requests..."
      Tasks: 1 (limit: 19179)
     Memory: 1.3M
        CPU: 136ms
     CGroup: /system.slice/systemd-networkd.service
             └─7977 /lib/systemd/systemd-networkd

Apr 04 17:10:06 master systemd[1]: Starting Network Configuration...
Apr 04 17:10:06 master systemd-networkd[7977]: eth0: Link UP

Here's the log for systemd-networkd (journalctl -u systemd-networkd).

Apr 04 16:32:09 master systemd[1]: Starting Network Configuration...
Apr 04 16:32:09 master systemd-networkd[539]: lo: Link UP
Apr 04 16:32:09 master systemd-networkd[539]: lo: Gained carrier
Apr 04 16:32:09 master systemd-networkd[539]: Enumeration completed
Apr 04 16:32:09 master systemd[1]: Started Network Configuration.
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Link UP
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Gained carrier
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Link DOWN
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Lost carrier
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Link UP
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Gained carrier
Apr 04 16:32:09 master systemd-networkd[539]: eth0: DHCPv4 address 172.10.1.5/24 via 172.10.1.1
Apr 04 16:32:11 master systemd-networkd[539]: eth0: Gained IPv6LL
Apr 04 17:10:06 master systemd-networkd[539]: eth0: Link DOWN
Apr 04 17:10:06 master systemd-networkd[539]: eth0: Lost carrier
Apr 04 17:10:06 master systemd-networkd[539]: eth0: DHCP lease lost
Apr 04 17:10:06 master systemd-networkd[539]: eth0: DHCPv6 lease lost
Apr 04 17:10:06 master systemd-networkd[539]: eth0: Link UP
Apr 04 17:10:06 master systemd-networkd[539]: eth0: Gained carrier
Apr 04 17:10:06 master systemd[1]: Stopping Network Configuration...
Apr 04 17:10:06 master systemd[1]: systemd-networkd.service: Deactivated successfully.
Apr 04 17:10:06 master systemd[1]: Stopped Network Configuration.
Apr 04 17:10:06 master systemd[1]: Starting Network Configuration...
Apr 04 17:10:06 master systemd-networkd[7977]: eth0: Link UP
Apr 04 17:10:06 master systemd-networkd[7977]: eth0: Gained carrier
Apr 04 17:10:06 master systemd-networkd[7977]: lo: Link UP
Apr 04 17:10:06 master systemd-networkd[7977]: lo: Gained carrier
Apr 04 17:10:06 master systemd-networkd[7977]: Enumeration completed
Apr 04 17:10:06 master systemd[1]: Started Network Configuration.
Apr 04 17:10:07 master systemd-networkd[7977]: eth0: Gained IPv6LL
Apr 04 17:29:43 master systemd[1]: Stopping Network Configuration...
Apr 04 17:29:43 master systemd[1]: systemd-networkd.service: Deactivated successfully.
Apr 04 17:29:43 master systemd[1]: Stopped Network Configuration.
Apr 04 17:29:43 master systemd[1]: Starting Network Configuration...
Apr 04 17:29:43 master systemd-networkd[17212]: eth0: Link UP
Apr 04 17:29:43 master systemd-networkd[17212]: eth0: Gained carrier
Apr 04 17:29:43 master systemd-networkd[17212]: lo: Link UP
Apr 04 17:29:43 master systemd-networkd[17212]: lo: Gained carrier
Apr 04 17:29:43 master systemd-networkd[17212]: eth0: Gained IPv6LL
Apr 04 17:29:43 master systemd-networkd[17212]: Enumeration completed
Apr 04 17:29:43 master systemd[1]: Started Network Configuration.

Logs for one of the calico-vpp-node pods

time="2024-04-04T17:10:03Z" level=info msg="Version info\nImage tag                   : ab81a775fbdeba932888690c68ddf7e9f4bd8d2b\nVPP-dataplane version       : ab81a77 Release v3.27.0\nVPP Version                 : 24.02-rc0~8-g9db45f6ae\nBinapi-generator version    : v0.8.0\nVPP Base commit             : 06efd532e gerrit:34726/3 interface: add buffer stats api\n------------------ Cherry picked commits --------------------\ncapo: Calico Policies plugin\nacl: acl-plugin custom policies\ncnat: [WIP] no k8s maglev from pods\npbl: Port based balancer\ngerrit:40078/3 vnet: allow format deleted swifidx\ngerrit:40090/3 cnat: undo fib_entry_contribute_forwarding\ngerrit:39507/13 cnat: add flow hash config to cnat translation\ngerrit:34726/3 interface: add buffer stats api\n-------------------------------------------------------------\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_SWAP_DRIVER="
time="2024-04-04T17:10:03Z" level=info msg="Config:SERVICE_PREFIX=[10.96.0.0/12]"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_GRACEFUL_SHUTDOWN_TIMEOUT=10s"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_INTERFACES={\n  \"defaultPodIfSpec\": {\n    \"rx\": 1,\n    \"tx\": 1,\n    \"rxqsz\": 0,\n    \"txqsz\": 0,\n    \"isl3\": true,\n    \"rxMode\": 0\n  },\n  \"maxPodIfSpec\": {\n    \"rx\": 10,\n    \"tx\": 10,\n    \"rxqsz\": 1024,\n    \"txqsz\": 1024,\n    \"isl3\": null,\n    \"rxMode\": 0\n  },\n  \"vppHostTapSpec\": {\n    \"rx\": 1,\n    \"tx\": 1,\n    \"rxqsz\": 1024,\n    \"txqsz\": 1024,\n    \"isl3\": false,\n    \"rxMode\": 0\n  },\n  \"uplinkInterfaces\": [\n    {\n      \"rx\": 0,\n      \"tx\": 0,\n      \"rxqsz\": 0,\n      \"txqsz\": 0,\n      \"isl3\": null,\n      \"rxMode\": 0,\n      \"isMain\": false,\n      \"physicalNetworkName\": \"\",\n      \"interfaceName\": \"eth0\",\n      \"vppDriver\": \"af_packet\",\n      \"newDriver\": \"\",\n      \"annotations\": null,\n      \"mtu\": 0\n    }\n  ]\n}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_FEATURE_GATES={}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_IPSEC={\n  \"nbAsyncCryptoThreads\": 0,\n  \"extraAddresses\": 0\n}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_INITIAL_CONFIG={\n  \"vppStartupSleepSeconds\": 1,\n  \"corePattern\": \"/var/lib/vpp/vppcore.%e.%p\",\n  \"extraAddrCount\": 0,\n  \"ifConfigSavePath\": \"\",\n  \"defaultGWs\": \"\",\n  \"redirectToHostRules\": null\n}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_CONFIG_TEMPLATE=unix {\n  nodaemon\n  full-coredump\n  cli-listen /var/run/vpp/cli.sock\n  pidfile /run/vpp/vpp.pid\n  exec /etc/vpp/startup.exec\n}\napi-trace { on }\ncpu {\n    workers 0\n}\nsocksvr {\n    socket-name /var/run/vpp/vpp-api.sock\n}\nplugins {\n    plugin default { enable }\n    plugin dpdk_plugin.so { disable }\n    plugin calico_plugin.so { enable }\n    plugin ping_plugin.so { disable }\n    plugin dispatch_trace_plugin.so { enable }\n}\nbuffers {\n  buffers-per-numa 131072\n}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_HOOK_BEFORE_IF_READ=#!/bin/sh\n\nHOOK=\"$0\"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n    if systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; fixing dns...\"\n        sed -i \"s/\\[main\\]/\\[main\\]\\ndns=none/\" /etc/NetworkManager/NetworkManager.conf\n        systemctl daemon-reload\n        systemctl restart NetworkManager\n    fi\n}\n\nundo_dns_fix () {\n    if systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; undoing dns fix...\"\n        sed -i \"0,/dns=none/{/dns=none/d;}\" /etc/NetworkManager/NetworkManager.conf\n        systemctl daemon-reload\n        systemctl restart NetworkManager\n    fi\n}\n\nrestart_network () {\n    if systemctl status systemd-networkd > /dev/null 2>&1; then\n        echo \"default_hook: system is using systemd-networkd; restarting...\"\n        systemctl restart systemd-networkd\n    elif systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; restarting...\"\n        systemctl restart NetworkManager\n    elif systemctl status networking > /dev/null 2>&1; then\n        echo \"default_hook: system is using networking service; restarting...\"\n        systemctl restart networking\n    elif systemctl status network > /dev/null 2>&1; then\n        echo \"default_hook: system is using network service; restarting...\"\n        systemctl restart network\n    else\n        echo \"default_hook: Networking backend not detected, network configuration may fail\"\n    fi\n}\n\nif which systemctl > /dev/null; then\n    echo \"default_hook: using systemctl...\"\nelse\n    echo \"default_hook: Init system not supported, network configuration may fail\"\n    exit 1\nfi\n\nif [ \"$HOOK\" = \"BEFORE_VPP_RUN\" ]; then\n    fix_dns\nelif [ \"$HOOK\" = \"VPP_RUNNING\" ]; then\n    
restart_network\nelif [ \"$HOOK\" = \"VPP_DONE_OK\" ]; then\n    undo_dns_fix\n    restart_network\nelif [ \"$HOOK\" = \"VPP_ERRORED\" ]; then\n    undo_dns_fix\n    restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:NODENAME=master"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_BGP_LOG_LEVEL=INFO"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_HOOK_BEFORE_VPP_RUN=#!/bin/sh\n\nHOOK=\"$0\"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n    if systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; fixing dns...\"\n        sed -i \"s/\\[main\\]/\\[main\\]\\ndns=none/\" /etc/NetworkManager/NetworkManager.conf\n        systemctl daemon-reload\n        systemctl restart NetworkManager\n    fi\n}\n\nundo_dns_fix () {\n    if systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; undoing dns fix...\"\n        sed -i \"0,/dns=none/{/dns=none/d;}\" /etc/NetworkManager/NetworkManager.conf\n        systemctl daemon-reload\n        systemctl restart NetworkManager\n    fi\n}\n\nrestart_network () {\n    if systemctl status systemd-networkd > /dev/null 2>&1; then\n        echo \"default_hook: system is using systemd-networkd; restarting...\"\n        systemctl restart systemd-networkd\n    elif systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; restarting...\"\n        systemctl restart NetworkManager\n    elif systemctl status networking > /dev/null 2>&1; then\n        echo \"default_hook: system is using networking service; restarting...\"\n        systemctl restart networking\n    elif systemctl status network > /dev/null 2>&1; then\n        echo \"default_hook: system is using network service; restarting...\"\n        systemctl restart network\n    else\n        echo \"default_hook: Networking backend not detected, network configuration may fail\"\n    fi\n}\n\nif which systemctl > /dev/null; then\n    echo \"default_hook: using systemctl...\"\nelse\n    echo \"default_hook: Init system not supported, network configuration may fail\"\n    exit 1\nfi\n\nif [ \"$HOOK\" = \"BEFORE_VPP_RUN\" ]; then\n    fix_dns\nelif [ \"$HOOK\" = \"VPP_RUNNING\" ]; then\n    
restart_network\nelif [ \"$HOOK\" = \"VPP_DONE_OK\" ]; then\n    undo_dns_fix\n    restart_network\nelif [ \"$HOOK\" = \"VPP_ERRORED\" ]; then\n    undo_dns_fix\n    restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_HOOK_VPP_RUNNING=#!/bin/sh\n\nHOOK=\"$0\"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n    if systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; fixing dns...\"\n        sed -i \"s/\\[main\\]/\\[main\\]\\ndns=none/\" /etc/NetworkManager/NetworkManager.conf\n        systemctl daemon-reload\n        systemctl restart NetworkManager\n    fi\n}\n\nundo_dns_fix () {\n    if systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; undoing dns fix...\"\n        sed -i \"0,/dns=none/{/dns=none/d;}\" /etc/NetworkManager/NetworkManager.conf\n        systemctl daemon-reload\n        systemctl restart NetworkManager\n    fi\n}\n\nrestart_network () {\n    if systemctl status systemd-networkd > /dev/null 2>&1; then\n        echo \"default_hook: system is using systemd-networkd; restarting...\"\n        systemctl restart systemd-networkd\n    elif systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; restarting...\"\n        systemctl restart NetworkManager\n    elif systemctl status networking > /dev/null 2>&1; then\n        echo \"default_hook: system is using networking service; restarting...\"\n        systemctl restart networking\n    elif systemctl status network > /dev/null 2>&1; then\n        echo \"default_hook: system is using network service; restarting...\"\n        systemctl restart network\n    else\n        echo \"default_hook: Networking backend not detected, network configuration may fail\"\n    fi\n}\n\nif which systemctl > /dev/null; then\n    echo \"default_hook: using systemctl...\"\nelse\n    echo \"default_hook: Init system not supported, network configuration may fail\"\n    exit 1\nfi\n\nif [ \"$HOOK\" = \"BEFORE_VPP_RUN\" ]; then\n    fix_dns\nelif [ \"$HOOK\" = \"VPP_RUNNING\" ]; then\n    
restart_network\nelif [ \"$HOOK\" = \"VPP_DONE_OK\" ]; then\n    undo_dns_fix\n    restart_network\nelif [ \"$HOOK\" = \"VPP_ERRORED\" ]; then\n    undo_dns_fix\n    restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_SRV6={\n  \"localsidPool\": \"\",\n  \"policyPool\": \"\"\n}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_LOG_FORMAT="
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_INTERFACE="
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_INIT_SCRIPT_TEMPLATE="
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_CONFIG_EXEC_TEMPLATE="
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_HOOK_VPP_DONE_OK=#!/bin/sh\n\nHOOK=\"$0\"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n    if systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; fixing dns...\"\n        sed -i \"s/\\[main\\]/\\[main\\]\\ndns=none/\" /etc/NetworkManager/NetworkManager.conf\n        systemctl daemon-reload\n        systemctl restart NetworkManager\n    fi\n}\n\nundo_dns_fix () {\n    if systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; undoing dns fix...\"\n        sed -i \"0,/dns=none/{/dns=none/d;}\" /etc/NetworkManager/NetworkManager.conf\n        systemctl daemon-reload\n        systemctl restart NetworkManager\n    fi\n}\n\nrestart_network () {\n    if systemctl status systemd-networkd > /dev/null 2>&1; then\n        echo \"default_hook: system is using systemd-networkd; restarting...\"\n        systemctl restart systemd-networkd\n    elif systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; restarting...\"\n        systemctl restart NetworkManager\n    elif systemctl status networking > /dev/null 2>&1; then\n        echo \"default_hook: system is using networking service; restarting...\"\n        systemctl restart networking\n    elif systemctl status network > /dev/null 2>&1; then\n        echo \"default_hook: system is using network service; restarting...\"\n        systemctl restart network\n    else\n        echo \"default_hook: Networking backend not detected, network configuration may fail\"\n    fi\n}\n\nif which systemctl > /dev/null; then\n    echo \"default_hook: using systemctl...\"\nelse\n    echo \"default_hook: Init system not supported, network configuration may fail\"\n    exit 1\nfi\n\nif [ \"$HOOK\" = \"BEFORE_VPP_RUN\" ]; then\n    fix_dns\nelif [ \"$HOOK\" = \"VPP_RUNNING\" ]; then\n    
restart_network\nelif [ \"$HOOK\" = \"VPP_DONE_OK\" ]; then\n    undo_dns_fix\n    restart_network\nelif [ \"$HOOK\" = \"VPP_ERRORED\" ]; then\n    undo_dns_fix\n    restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_LOG_LEVEL=info"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_DEBUG={}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_HOOK_VPP_ERRORED=#!/bin/sh\n\nHOOK=\"$0\"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n    if systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; fixing dns...\"\n        sed -i \"s/\\[main\\]/\\[main\\]\\ndns=none/\" /etc/NetworkManager/NetworkManager.conf\n        systemctl daemon-reload\n        systemctl restart NetworkManager\n    fi\n}\n\nundo_dns_fix () {\n    if systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; undoing dns fix...\"\n        sed -i \"0,/dns=none/{/dns=none/d;}\" /etc/NetworkManager/NetworkManager.conf\n        systemctl daemon-reload\n        systemctl restart NetworkManager\n    fi\n}\n\nrestart_network () {\n    if systemctl status systemd-networkd > /dev/null 2>&1; then\n        echo \"default_hook: system is using systemd-networkd; restarting...\"\n        systemctl restart systemd-networkd\n    elif systemctl status NetworkManager > /dev/null 2>&1; then\n        echo \"default_hook: system is using NetworkManager; restarting...\"\n        systemctl restart NetworkManager\n    elif systemctl status networking > /dev/null 2>&1; then\n        echo \"default_hook: system is using networking service; restarting...\"\n        systemctl restart networking\n    elif systemctl status network > /dev/null 2>&1; then\n        echo \"default_hook: system is using network service; restarting...\"\n        systemctl restart network\n    else\n        echo \"default_hook: Networking backend not detected, network configuration may fail\"\n    fi\n}\n\nif which systemctl > /dev/null; then\n    echo \"default_hook: using systemctl...\"\nelse\n    echo \"default_hook: Init system not supported, network configuration may fail\"\n    exit 1\nfi\n\nif [ \"$HOOK\" = \"BEFORE_VPP_RUN\" ]; then\n    fix_dns\nelif [ \"$HOOK\" = \"VPP_RUNNING\" ]; then\n    
restart_network\nelif [ \"$HOOK\" = \"VPP_DONE_OK\" ]; then\n    undo_dns_fix\n    restart_network\nelif [ \"$HOOK\" = \"VPP_ERRORED\" ]; then\n    undo_dns_fix\n    restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_IPSEC_IKEV2_PSK="
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_NATIVE_DRIVER="
default_hook: using systemctl...
time="2024-04-04T17:10:03Z" level=info msg="No pci device for interface eth0"
time="2024-04-04T17:10:03Z" level=info msg="-- Environment --"
time="2024-04-04T17:10:03Z" level=info msg="Hugepages            0"
time="2024-04-04T17:10:03Z" level=info msg="KernelVersion        5.15.0-1042"
time="2024-04-04T17:10:03Z" level=info msg="Drivers              map[uio_pci_generic:false vfio-pci:true]"
time="2024-04-04T17:10:03Z" level=info msg="initial iommu status N"
time="2024-04-04T17:10:03Z" level=info msg="-- Interface Spec --"
time="2024-04-04T17:10:03Z" level=info msg="Interface Name:      eth0"
time="2024-04-04T17:10:03Z" level=info msg="Native Driver:       af_packet"
time="2024-04-04T17:10:03Z" level=info msg="New Drive Name:      "
time="2024-04-04T17:10:03Z" level=info msg="PHY target #Queues   rx:0 tx:0"
time="2024-04-04T17:10:03Z" level=info msg="Tap MTU:             0"
time="2024-04-04T17:10:03Z" level=info msg="-- Interface config --"
time="2024-04-04T17:10:03Z" level=info msg="Node IP4:            172.10.1.5/24"
time="2024-04-04T17:10:03Z" level=info msg="Node IP6:            "
time="2024-04-04T17:10:03Z" level=info msg="PciId:               "
time="2024-04-04T17:10:03Z" level=info msg="Driver:              "
time="2024-04-04T17:10:03Z" level=info msg="Linux IF was up ?    true"
time="2024-04-04T17:10:03Z" level=info msg="Promisc was on ?     false"
time="2024-04-04T17:10:03Z" level=info msg="DoSwapDriver:        false"
time="2024-04-04T17:10:03Z" level=info msg="Mac:                 00:22:48:c0:5e:e6"
time="2024-04-04T17:10:03Z" level=info msg="Addresses:           [172.10.1.5/24 eth0,fe80::222:48ff:fec0:5ee6/64]"
time="2024-04-04T17:10:03Z" level=info msg="Routes:              [{Ifindex: 2 Dst: fe80::/64 Src: <nil> Gw: <nil> Flags: [] Table: 254 Realm: 0}, {Ifindex: 2 Dst: 172.10.1.1/32 Src: 172.10.1.5 Gw: <nil> Flags: [] Table: 254 Realm: 0}, {Ifindex: 2 Dst: 172.10.1.0/24 Src: 172.10.1.5 Gw: <nil> Flags: [] Table: 254 Realm: 0}, {Ifindex: 2 Dst: 168.63.129.16/32 Src: 172.10.1.5 Gw: 172.10.1.1 Flags: [] Table: 254 Realm: 0}, {Ifindex: 2 Dst: 169.254.169.254/32 Src: 172.10.1.5 Gw: 172.10.1.1 Flags: [] Table: 254 Realm: 0}, <Dst: nil (default), Ifindex: 2, Gw: 172.10.1.1, Src: 172.10.1.5, >]"
time="2024-04-04T17:10:03Z" level=info msg="PHY original #Queues rx:64 tx:64"
time="2024-04-04T17:10:03Z" level=info msg="MTU                  1500"
time="2024-04-04T17:10:03Z" level=info msg="isTunTap             false"
time="2024-04-04T17:10:03Z" level=info msg="isVeth               false"
time="2024-04-04T17:10:03Z" level=info msg="Running with uplink af_packet"
default_hook: using systemctl...
time="2024-04-04T17:10:03Z" level=info msg="VPP started [PID 7918]"
vpp[7918]: clib_sysfs_prealloc_hugepages:236: pre-allocating 149 additional 2048K hugepages on numa node 0
vpp[7918]: buffer: numa[0] falling back to non-hugepage backed buffer pool (vlib_physmem_shared_map_create: pmalloc_map_pages: Unable to lock pages: Cannot allocate memory)
time="2024-04-04T17:10:04Z" level=info msg="Waiting for VPP... [0/10]"
vpp[7918]: perfmon: skipping source 'intel-uncore' - intel_uncore_init: no uncore units found
vpp[7918]: tls_init_ca_chain:1086: Could not initialize TLS CA certificates
vpp[7918]: tls_openssl_init:1209: failed to initialize TLS CA chain
vpp[7918]: vat-plug/load: vat_plugin_register: idpf plugin not loaded...
vpp[7918]: vat-plug/load: vat_plugin_register: oddbuf plugin not loaded...
time="2024-04-04T17:10:06Z" level=info msg="Created AF_PACKET interface 1"
time="2024-04-04T17:10:06Z" level=info msg="tagging interface [1] with: main-eth0"
time="2024-04-04T17:10:06Z" level=info msg="Adding address 172.10.1.5/24 eth0 to uplink interface"
time="2024-04-04T17:10:06Z" level=info msg="Not adding address fe80::222:48ff:fec0:5ee6/64 to uplink interface (vpp requires /128 link-local)"
time="2024-04-04T17:10:06Z" level=info msg="Creating Linux side interface"
time="2024-04-04T17:10:06Z" level=info msg="Adding address 172.10.1.5/24 eth0 to tap interface"
time="2024-04-04T17:10:06Z" level=info msg="Not adding address fe80::222:48ff:fec0:5ee6/64 to data interface (vpp requires /128 link-local)"
time="2024-04-04T17:10:06Z" level=info msg="Adding ND proxy for address fe80::222:48ff:fec0:5ee6"
time="2024-04-04T17:10:06Z" level=info msg="Adding address 172.10.1.5/24 eth0 to tap interface"
time="2024-04-04T17:10:06Z" level=info msg="Adding address fe80::222:48ff:fec0:5ee6/64 to tap interface"
time="2024-04-04T17:10:06Z" level=warning msg="add addr fe80::222:48ff:fec0:5ee6/64 via vpp EEXIST, file exists"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: fe80::/64 Src: <nil> Gw: <nil> Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="add route via vpp : {Ifindex: 3 Dst: fe80::/64 Src: <nil> Gw: <nil> Flags: [] Table: 254 Realm: 0} already exists"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: 172.10.1.1/32 Src: 172.10.1.5 Gw: <nil> Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: 172.10.1.0/24 Src: 172.10.1.5 Gw: <nil> Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: 168.63.129.16/32 Src: 172.10.1.5 Gw: 172.10.1.1 Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: 169.254.169.254/32 Src: 172.10.1.5 Gw: 172.10.1.1 Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: <nil> Src: 172.10.1.5 Gw: 172.10.1.1 Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="Using 172.10.1.254 as next hop for cluster IPv4 routes"
time="2024-04-04T17:10:06Z" level=info msg="Setting BGP nodeIP 172.10.1.5/24"
time="2024-04-04T17:10:06Z" level=info msg="Updating node, version = 1741, metaversion = 1741"
default_hook: using systemctl...
default_hook: system is using systemd-networkd; restarting...
time="2024-04-04T17:10:06Z" level=info msg="Received signal child exited, vpp index 1"
time="2024-04-04T17:10:06Z" level=info msg="Ignoring SIGCHLD for pid 0"
time="2024-04-04T17:10:06Z" level=info msg="Done with signal child exited"

ivansharamok commented 7 months ago

I just tried switching from Ubuntu 22.04 to CentOS 8 and didn't run into the DNS resolution issue on the hosts. I noticed that CentOS uses NetworkManager by default. At this point, I'm not sure what the exact root cause is, but it might be related to networking being managed by systemd-networkd, or perhaps to some other default network management configuration bundled with Ubuntu. On CentOS hosts the /etc/resolv.conf file doesn't get edited when the calico-vpp-node pods are up and running.

onong commented 7 months ago

Thanks for the details and sorry about the missing whereabouts image - tagging it somehow got missed out during the release :)

What happens is that when calico-vpp-node starts, it takes over the uplink interface and replaces it with a tap. systemd-networkd does not like this disappearing act and triggers a reset, which involves expiry of the DHCP lease (as can be seen in the logs) and in some cases also wipes out the DNS config.
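
The reset described here leaves a recognizable trace in the systemd-networkd journal. As a rough sketch, the filter below pulls out link-down, carrier-loss and lease-loss events for one interface. The function names `uplink_reset_events` and `count_lease_losses` are hypothetical, and the patterns are taken from the log lines posted earlier in this thread, so other systemd versions may word them differently.

```shell
#!/bin/sh
# Hypothetical filter: read journal lines on stdin and keep only the events
# that indicate the uplink was reset (link down, carrier lost, lease lost).
# $1 is the interface name, e.g. eth0.
uplink_reset_events() {
    grep -E "systemd-networkd\[[0-9]+\]: $1: (Link DOWN|Lost carrier|DHCP(v6)? lease lost)"
}

# Count lease-loss events (IPv4 and IPv6) for the interface.
# Note: grep -c still prints 0 when nothing matches, but exits non-zero.
count_lease_losses() {
    grep -cE "systemd-networkd\[[0-9]+\]: $1: DHCP(v6)? lease lost"
}
```

On a node this could be fed from the journal, for example `journalctl -u systemd-networkd --no-pager | uplink_reset_events eth0`.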

We have faced this issue in the past and usually a restart of systemd-networkd has done the trick. Somehow the restart trick doesn't seem to be effective in your case. This will require some further digging. But for a quick fix I can think of the following:

1. NetworkManager has a config option, dns=none, which tells it not to touch the DNS config at all, so the DNS config remains intact when calico-vpp-node starts running. If switching to NM is OK with you, you could try it.

2. After the Azure instances are up and running, modify netplan to make the network config static instead of DHCP, and then start the kubeadm steps to install the cluster.

3. Try the systemd-networkd option Unmanaged=true for the uplink interface. It seems similar to NM's dns=none, but I'm not really sure. Refer to this link: https://github.com/systemd/systemd/issues/28626
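
As a rough illustration of the netplan and Unmanaged options, the sketch below writes either a static netplan file or a systemd-networkd `.network` file that marks the uplink as unmanaged. Everything here is an assumption to adapt: the function names and file names are hypothetical, the two files are alternatives rather than a pair, and neither has been tested against Calico VPP. Note that `Unmanaged=yes` makes systemd-networkd skip the matching link entirely, so it would no longer run DHCP on that link either.

```shell
#!/bin/sh
# Hypothetical helpers; file names and target directories are assumptions.

# Static option: write a netplan config with a fixed address for the uplink.
# Args: output file, interface, address/prefix, gateway, DNS server.
write_static_netplan() {
    cat > "$1" <<EOF
network:
  version: 2
  ethernets:
    $2:
      dhcp4: false
      addresses: [$3]
      routes:
        - to: default
          via: $4
      nameservers:
        addresses: [$5]
EOF
}

# Unmanaged option: write a .network file that tells systemd-networkd to
# leave the uplink alone. Args: output file, interface.
write_unmanaged_network() {
    cat > "$1" <<EOF
[Match]
Name=$2

[Link]
Unmanaged=yes
EOF
}
```

With the values that appear in this thread, the first helper might be called as `write_static_netplan /etc/netplan/99-static-uplink.yaml eth0 172.10.1.5/24 172.10.1.1 168.63.129.16` followed by `netplan apply`, and the second as `write_unmanaged_network /etc/systemd/network/05-vpp-uplink.network eth0` followed by `systemctl restart systemd-networkd`. The `05-` prefix is chosen because systemd-networkd applies the first matching `.network` file in lexical order, and the netplan-generated files are typically named `10-netplan-*`.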