projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

vxlan.calico is missing on some nodes of the k8s cluster #7757

Closed RichardSufliarsky closed 1 year ago

RichardSufliarsky commented 1 year ago

Expected Behavior

All the nodes of the k8s cluster should have vxlan.calico interface.

Current Behavior

We have a 20-node bare-metal Kubernetes cluster running Calico v3.25.0, and some of the nodes are missing the vxlan.calico interface. The calico-node pod logs contain these errors, repeated every second:

2023-06-07 00:21:53.517 [INFO][1541825] felix/route_table.go 623: Interface missing, will retry if it appears. ifaceName="vxlan.calico" ifaceRegex="^vxlan.calico$" ipVersion=0x4 tableIndex=0
2023-06-07 00:21:53.619 [INFO][1541825] felix/route_table.go 1205: Failed to access interface because it doesn't exist. error=Link not found ifaceName="vxlan.calico" ifaceRegex="^vxlan.calico$" ipVersion=0x4 tableIndex=0
2023-06-07 00:21:53.619 [INFO][1541825] felix/route_table.go 1273: Failed to get interface; it's down/gone. error=Link not found ifaceName="vxlan.calico" ifaceRegex="^vxlan.calico$" ipVersion=0x4 tableIndex=0
2023-06-07 00:21:53.620 [ERROR][1541825] felix/route_table.go 1040: Failed to get link attributes error=interface not present ifaceRegex="^vxlan.calico$" ipVersion=0x4 tableIndex=0

Since all the nodes are on the same subnet (172.16.2.0/24), can I try setting vxlanEnabled: false in the FelixConfiguration to get rid of these messages?
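
Concretely, something like this is what I have in mind (just a sketch, assuming the default FelixConfiguration resource shown further below):

kubectl patch felixconfiguration default --type=merge -p '{"spec":{"vxlanEnabled":false}}'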

Context

IPv4 BGP status
+--------------+-------------------+-------+------------+-------------+
| PEER ADDRESS |     PEER TYPE     | STATE |   SINCE    |    INFO     |
+--------------+-------------------+-------+------------+-------------+
| 172.16.2.30  | node-to-node mesh | up    | 2023-05-19 | Established |
| 172.16.2.60  | node-to-node mesh | up    | 2023-05-19 | Established |
| 172.16.2.80  | node-to-node mesh | up    | 2023-05-19 | Established |
| 172.16.2.90  | node-to-node mesh | up    | 2023-05-19 | Established |
| 172.16.2.91  | node-to-node mesh | up    | 2023-05-19 | Established |
| 172.16.2.92  | node-to-node mesh | up    | 2023-05-19 | Established |
| 172.16.2.93  | node-to-node mesh | up    | 2023-05-19 | Established |
| 172.16.2.12  | node-to-node mesh | up    | 2023-05-19 | Established |
| 172.16.2.13  | node-to-node mesh | up    | 23:32:59   | Established |
| 172.16.2.10  | node-to-node mesh | up    | 2023-05-19 | Established |
| 172.16.2.21  | node-to-node mesh | up    | 22:05:51   | Established |
| 172.16.2.22  | node-to-node mesh | up    | 2023-05-19 | Established |
| 172.16.2.23  | node-to-node mesh | up    | 2023-05-24 | Established |
| 172.16.2.24  | node-to-node mesh | up    | 2023-05-19 | Established |
| 172.16.2.25  | node-to-node mesh | up    | 2023-05-19 | Established |
| 172.16.2.26  | node-to-node mesh | up    | 2023-05-19 | Established |
| 172.16.2.27  | node-to-node mesh | up    | 2023-05-19 | Established |
| 172.16.2.28  | node-to-node mesh | up    | 2023-05-19 | Established |
| 172.16.2.50  | node-to-node mesh | up    | 2023-05-19 | Established |
+--------------+-------------------+-------+------------+-------------+

IPv6 BGP status
No IPv6 peers found.

We have MTU=9000 on the interfaces used for the Kubernetes network. Shouldn't vxlan.calico also use something larger than 1450?

for s in k8s1 k8s2 k8s3 nas001 redis002 api001 gpu001 gpu002 gpu003 gpu004 gpu005 gpu006 node001 node002 node003 node004 node005 node006 node007 node008; do echo "Server ${s}: $(ssh ${s} 'ip a|grep vxlan')"; done
Server k8s1:
Server k8s2: 25: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default inet 172.18.142.64/32 scope global vxlan.calico
Server k8s3:
Server nas001: 39: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default inet 172.18.83.64/32 scope global vxlan.calico
Server redis002:
Server api001: 20: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default inet 172.18.243.64/32 scope global vxlan.calico
Server gpu001:
Server gpu002: 19: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default inet 172.18.170.128/32 scope global vxlan.calico
Server gpu003:
Server gpu004: 31: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default inet 172.18.168.128/32 scope global vxlan.calico
Server gpu005: 39: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default inet 172.18.110.192/32 scope global vxlan.calico
Server gpu006: 24: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default inet 172.18.87.64/32 scope global vxlan.calico
Server node001:
Server node002:
Server node003:
Server node004:
Server node005: 63: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default inet 172.18.200.0/32 scope global vxlan.calico
Server node006: 210: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default inet 172.18.167.64/32 scope global vxlan.calico
Server node007:
Server node008:


## Your Environment
* Calico version: v3.25.0
* Orchestrator version: kubernetes 1.25.10 (control plane) and 1.24.2 (other nodes)
* Operating System and version: RHEL 8 and 9
mazdakn commented 1 year ago

@RichardSufliarsky if the underlying infrastructure provides connectivity, which seems to be the case for you, you can disable any sort of tunnelling such as VXLAN. Regarding the MTU, having a lower value does not hurt; the only downside is that you won't get the higher bandwidth that a 9000-byte MTU would provide. You can increase it via the Felix configuration (vxlanMTU and vxlanMTUV6).
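
For example, a minimal sketch of such a change (assuming the default FelixConfiguration resource and the commonly cited 50-byte IPv4 / 70-byte IPv6 VXLAN overhead on a 9000-byte underlay; adjust the numbers to your environment):

kubectl patch felixconfiguration default --type=merge -p '{"spec":{"vxlanMTU":8950,"vxlanMTUV6":8930}}'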

mazdakn commented 1 year ago

It would be great to understand why some of the nodes do not have the interface. Is there anything different on those nodes? Can you also share the Felix configuration and available IP pools?

RichardSufliarsky commented 1 year ago

@mazdakn all the nodes were installed the same way. They run different OS versions, but that does not seem to be the differentiator. The network interfaces used for Kubernetes also have different names on different machines, but on node001-node008 they are the same: those nodes are the exact same hardware, installed with the exact same steps, and yet two of them have the vxlan.calico interface.

kubectl get nodes -owide
NAME                       STATUS     ROLES           AGE    VERSION    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                               KERNEL-VERSION                 CONTAINER-RUNTIME
api001.lab.company1.io     Ready      <none>          286d   v1.24.2    172.16.2.30   <none>        Red Hat Enterprise Linux 8.7 (Ootpa)   4.18.0-425.13.1.el8_7.x86_64   cri-o://1.24.4
gpu001.lab.company1.io     Ready      <none>          308d   v1.24.2    172.16.2.60   <none>        Red Hat Enterprise Linux 8.7 (Ootpa)   4.18.0-425.13.1.el8_7.x86_64   cri-o://1.24.4
gpu002.lab.company1.io     Ready      <none>          348d   v1.24.2    172.16.2.80   <none>        Red Hat Enterprise Linux 8.7 (Ootpa)   4.18.0-425.19.2.el8_7.x86_64   cri-o://1.24.5
gpu003.lab.company1.io     Ready      <none>          320d   v1.24.2    172.16.2.90   <none>        Red Hat Enterprise Linux 8.7 (Ootpa)   4.18.0-425.19.2.el8_7.x86_64   cri-o://1.24.5
gpu004.lab.company1.io     Ready      <none>          148d   v1.24.2    172.16.2.91   <none>        Red Hat Enterprise Linux 8.7 (Ootpa)   4.18.0-425.13.1.el8_7.x86_64   cri-o://1.24.4
gpu005.lab.company1.io     Ready      <none>          171d   v1.24.2    172.16.2.92   <none>        Red Hat Enterprise Linux 8.7 (Ootpa)   4.18.0-425.3.1.el8.x86_64      cri-o://1.24.3
gpu006.lab.company1.io     Ready      <none>          99d    v1.24.2    172.16.2.93   <none>        Red Hat Enterprise Linux 9.1 (Plow)    5.14.0-162.18.1.el9_1.x86_64   cri-o://1.24.4
k8s1.lab.company1.io       Ready      control-plane   412d   v1.25.10   172.16.2.11   <none>        Red Hat Enterprise Linux 9.2 (Plow)    5.14.0-284.11.1.el9_2.x86_64   cri-o://1.25.3
k8s2.lab.company1.io       Ready      control-plane   412d   v1.25.10   172.16.2.12   <none>        Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-425.13.1.el8_7.x86_64   cri-o://1.25.3
k8s3.lab.company1.io       Ready      control-plane   412d   v1.25.10   172.16.2.13   <none>        Red Hat Enterprise Linux 8.8 (Ootpa)   4.18.0-477.10.1.el8_8.x86_64   cri-o://1.25.3
nas001.lab.company1.io     Ready      <none>          333d   v1.24.2    172.16.2.10   <none>        Red Hat Enterprise Linux 8.7 (Ootpa)   4.18.0-425.13.1.el8_7.x86_64   cri-o://1.24.4
node001.lab.company1.io    Ready      <none>          89d    v1.24.2    172.16.2.21   <none>        Red Hat Enterprise Linux 9.1 (Plow)    5.14.0-162.18.1.el9_1.x86_64   cri-o://1.24.5
node002.lab.company1.io    Ready      <none>          89d    v1.24.2    172.16.2.22   <none>        Red Hat Enterprise Linux 9.1 (Plow)    5.14.0-162.18.1.el9_1.x86_64   cri-o://1.24.5
node003.lab.company1.io    Ready      <none>          89d    v1.24.2    172.16.2.23   <none>        Red Hat Enterprise Linux 9.1 (Plow)    5.14.0-162.18.1.el9_1.x86_64   cri-o://1.24.5
node004.lab.company1.io    Ready      <none>          89d    v1.24.2    172.16.2.24   <none>        Red Hat Enterprise Linux 9.1 (Plow)    5.14.0-162.18.1.el9_1.x86_64   cri-o://1.24.5
node005.lab.company1.io    Ready      <none>          89d    v1.24.2    172.16.2.25   <none>        Red Hat Enterprise Linux 9.1 (Plow)    5.14.0-162.18.1.el9_1.x86_64   cri-o://1.24.4
node006.lab.company1.io    Ready      <none>          89d    v1.24.2    172.16.2.26   <none>        Red Hat Enterprise Linux 9.1 (Plow)    5.14.0-162.18.1.el9_1.x86_64   cri-o://1.24.5
node007.lab.company1.io    Ready      <none>          89d    v1.24.2    172.16.2.27   <none>        Red Hat Enterprise Linux 9.1 (Plow)    5.14.0-162.22.2.el9_1.x86_64   cri-o://1.24.5
node008.lab.company1.io    Ready      <none>          89d    v1.24.2    172.16.2.28   <none>        Red Hat Enterprise Linux 9.1 (Plow)    5.14.0-162.18.1.el9_1.x86_64   cri-o://1.24.5
redis002.lab.company1.io   Ready      <none>          295d   v1.24.2    172.16.2.50   <none>        Red Hat Enterprise Linux 8.5 (Ootpa)   4.18.0-348.12.2.el8_5.x86_64   cri-o://1.24.2
kubectl get felixconfiguration default -oyaml
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  creationTimestamp: "2022-04-27T18:26:27Z"
  name: default
  resourceVersion: "18992585"
  uid: b3db7f7e-4e5a-4845-92f7-9cd0bf21d325
spec:
  bpfLogLevel: ""
  floatingIPs: Disabled
  healthPort: 9099
  logSeverityScreen: Info
  reportingInterval: 0s
  vxlanEnabled: true

For the IPPool, I think I changed vxlanMode to Never, but I am not sure.

kubectl get ippool default-ipv4-ippool -oyaml
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  creationTimestamp: "2022-04-27T18:26:27Z"
  name: default-ipv4-ippool
  resourceVersion: "276576540"
  uid: 490ff4c8-2e67-4d68-9195-c600684d7111
spec:
  allowedUses:
  - Workload
  - Tunnel
  blockSize: 26
  cidr: 172.18.0.0/16
  ipipMode: Never
  natOutgoing: true
  nodeSelector: all()
  vxlanMode: Never
kubectl get installation default -oyaml
apiVersion: operator.tigera.io/v1
kind: Installation                
metadata:
  creationTimestamp: "2022-04-27T18:26:23Z"
  finalizers:       
  - tigera.io/operator-cleanup
  generation: 5            
  name: default                                                         
  resourceVersion: "406630617"               
  uid: 037439ea-6989-48a6-87ff-662543500474
spec:               
  calicoNetwork:         
    bgp: Enabled         
    hostPorts: Enabled     
    ipPools:       
    - blockSize: 26
      cidr: 172.18.0.0/16                     
      disableBGPExport: false     
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled     
      nodeSelector: all()
    linuxDataplane: Iptables
    mtu: 9000                                 
    multiInterfaceMode: None      
    nodeAddressAutodetectionV4:
      kubernetes: NodeInternalIP
  cni:             
    ipam:         
      type: Calico                            
    type: Calico                  
  controlPlaneReplicas: 2
  flexVolumePath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
  kubeletVolumePluginPath: /var/lib/kubelet
  nodeUpdateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate                                                                                                                                                                                                                                                                                                                                                                                                                
  nonPrivileged: Disabled
  variant: Calico
status:
  computed:
    calicoNetwork:
      bgp: Enabled
      hostPorts: Enabled
      ipPools:
      - blockSize: 26
        cidr: 172.18.0.0/16
        disableBGPExport: false
        encapsulation: VXLANCrossSubnet
        natOutgoing: Enabled
        nodeSelector: all()
      linuxDataplane: Iptables
      mtu: 9000
      multiInterfaceMode: None
      nodeAddressAutodetectionV4:
        kubernetes: NodeInternalIP
    cni:
      ipam:
        type: Calico
      type: Calico
    controlPlaneReplicas: 2
    flexVolumePath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
    kubeletVolumePluginPath: /var/lib/kubelet
    nodeUpdateStrategy:
      rollingUpdate:
        maxUnavailable: 1
      type: RollingUpdate
    nonPrivileged: Disabled
    variant: Calico
  conditions:
  - lastTransitionTime: "2023-06-13T20:35:24Z"
    message: All Objects Available
    observedGeneration: 5
    reason: AllObjectsAvailable
    status: "False"
    type: Progressing
  - lastTransitionTime: "2023-06-13T20:35:24Z"
    message: All Objects Available
    observedGeneration: 5
    reason: AllObjectsAvailable
    status: "False"
    type: Degraded
  - lastTransitionTime: "2023-06-13T20:35:24Z"
    message: All objects available
    observedGeneration: 5
    reason: AllObjectsAvailable
    status: "True"
    type: Ready
  mtu: 9000
  variant: Calico
coutinhop commented 1 year ago

@RichardSufliarsky could you delete the vxlanEnabled: true line from your FelixConfiguration? Calico supports autodetecting the encapsulation from the IP pools (since v3.23, I believe), and any value set in FelixConfiguration overrides that. (I'm guessing you upgraded from a prior version that used to require those settings?) Basically, what's happening now is that Felix thinks VXLAN should be enabled, but there are no IP pools with VXLAN. Removing the field from the FelixConfiguration will make Felix stop expecting a vxlan.calico interface, which should fix the issue...
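
For example, a sketch of removing that field (assuming kubectl can modify the FelixConfiguration directly, as with the get above):

kubectl edit felixconfiguration default   # and delete the "vxlanEnabled: true" line
# or, equivalently, drop the field with a JSON patch:
kubectl patch felixconfiguration default --type=json -p '[{"op":"remove","path":"/spec/vxlanEnabled"}]'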

RichardSufliarsky commented 1 year ago

@coutinhop thanks for the response. I changed the CNI plugin together with a Kubernetes version upgrade last week, so I can't try that anymore. Closing the issue.