projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

ip_auto_detection "kubernetes-internal-ip" is not using the actual Internal-Ip of the node #7341

Closed amshankaran closed 11 months ago

amshankaran commented 1 year ago

With IP autodetection set to "kubernetes-internal-ip", Calico is not using the node's (host's) InternalIP; instead it picks another IP of the host, and this causes the BGP mesh to fail.

Expected Behavior

The kubernetes-internal-ip method should use the node's InternalIP to form the Calico mesh and as the bind IP (BindMode is NodeIP).
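
For reference, the value this method is expected to select is the Kubernetes-reported InternalIP, which can be read directly from the Node object. A minimal check (a sketch, using the node name cp-tp888 from the output below):

# Show the InternalIP that "kubernetes-internal-ip" is expected to pick
kubectl get node cp-tp888 -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}'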

Current Behavior

Below is the output of describing the node. The projectcalico.org/IPv4Address annotation set by Calico differs from the node's InternalIP, i.e. the InternalIP is not being used.

cp-tp888:~> kubectl describe node cp-tp888
Name:               cp-tp888
Roles:              control-plane,master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    node-pool=cp
                    node-role.kubernetes.io/control-plane=
                    node-role.kubernetes.io/master=
                    node.kubernetes.io/exclude-from-external-load-balancers=
Annotations:        cluster.x-k8s.io/owner-kind: KubeadmControlPlane
                    cluster.x-k8s.io/owner-name: cp
                    kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/containerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    **projectcalico.org/IPv4Address: 10.14.100.21/24
                    projectcalico.org/IPv4IPIPTunnelAddr: 10.10.89.64**
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 13 Feb 2023 07:50:19 +0000
Taints:             <none>
Unschedulable:      false
Addresses:
  **InternalIP:  10.0.0.1**
  Hostname:    cp-tp888

Possible Solution

Steps to Reproduce (for bugs)

IP addresses of the Node

==================
cp-tp888:~> ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2090 qdisc pfifo_fast state UP group default qlen 1000
    link/ether fa:16:3e:63:52:06 brd ff:ff:ff:ff:ff:ff
    altname enp0s3
    altname ens3
    inet 10.0.0.1/24 brd 10.0.0.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet 10.0.0.2/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet 10.0.0.3/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe63:5206/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2090 qdisc pfifo_fast state UP group default qlen 1000
    link/ether fa:16:3e:7c:ea:26 brd ff:ff:ff:ff:ff:ff
    altname enp0s4
    altname ens4
    inet 20.0.0.3/24 brd 20.0.0.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe7c:ea26/64 scope link
       valid_lft forever preferred_lft forever
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2090 qdisc pfifo_fast state UP group default qlen 1000
    link/ether fa:16:3e:cb:b1:4d brd ff:ff:ff:ff:ff:ff
    altname enp0s5
    altname ens5
    inet 10.14.100.21/24 brd 10.14.100.255 scope global eth2
       valid_lft forever preferred_lft forever
    inet 10.14.100.45/32 scope global eth2
       valid_lft forever preferred_lft forever
    inet6 fd00:eccd:14:a0a::e/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fecb:b14d/64 scope link
       valid_lft forever preferred_lft forever
5: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default

kubectl describe output of the calico-node Pod

cp-tp888:~> kubectl describe  pod calico-node-lj6f5 -n kube-system
Name:                 calico-node-lj6f5
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      calico-node
Node:                 cp-tp888/10.0.0.1
Start Time:           Mon, 13 Feb 2023 07:51:39 +0000
Labels:               controller-revision-hash=7964bbc76f
                      k8s-app=calico-node
                      pod-template-generation=1
Annotations:          ccd/addon: calico
                      prometheus.io/port: 9091
                      prometheus.io/scrape: true
**Status:               Running
IP:                   10.14.100.21
IPs:
  IP:           10.14.100.21**
Controlled By:  DaemonSet/calico-node
Init Containers:
  upgrade-ipam:
    Container ID:  containerd://5f3541e33e3cfd569204ebf24b5c164f70f13ce16f2b785b5a4f0d3f3c57ba86
    Image:         registry.eccd.local:5000/cni:v3.24.5-1-d793ee12
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/calico-ipam
      -upgrade
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 13 Feb 2023 07:51:42 +0000
      Finished:     Mon, 13 Feb 2023 07:51:42 +0000
    Ready:          True
    Restart Count:  0
    Environment Variables from:
      kubernetes-services-endpoint  ConfigMap  Optional: true
    Environment:
      KUBERNETES_NODE_NAME:        (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:  <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
    Mounts:
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/lib/cni/networks from host-local-net-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-l2rqt (ro)
  install-cni:
    Container ID:  containerd://e4ac80af2fe0e7aa094951bfe834dad5b4d711ce889781965399c3c9ee512da9
    Image:         registry.eccd.local:5000/cni:v3.24.5-1-d793ee12
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/install
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 13 Feb 2023 07:51:43 +0000
      Finished:     Mon, 13 Feb 2023 07:51:43 +0000
    Ready:          True
    Restart Count:  0
    Environment Variables from:
      kubernetes-services-endpoint  ConfigMap  Optional: true
    Environment:
      CNI_CONF_NAME:         10-calico.conflist
      CNI_NETWORK_CONFIG:    <set to the key 'cni_network_config' of config map 'calico-config'>  Optional: false
      KUBERNETES_NODE_NAME:   (v1:spec.nodeName)
      CNI_MTU:               <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      SLEEP:                 false
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
  flexvol-driver:
    Container ID:   containerd://e1f2c24c523fb8a15a80c2cb1b4bff35818b346b2193c72c69f6882db7014acd
    Image:          registry.eccd.local:5000/pod2daemon-flexvol:v3.24.5-1-d793ee12
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 13 Feb 2023 07:51:45 +0000
      Finished:     Mon, 13 Feb 2023 07:51:45 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /host/driver from flexvol-driver-host (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-l2rqt (ro)
Containers:
  calico-node:
    Container ID:   containerd://3919dde34ad36895489e99e90fafa8819d66a47d11b86198c6c23ce4c7c7127a
    Image:          registry.eccd.local:5000/**node:v3.24.5**-2-6207f924
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Mon, 13 Feb 2023 07:51:54 +0000
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:      250m
    Liveness:   exec [/bin/calico-node -felix-live -bird-live] delay=10s timeout=1s period=10s #success=1 #failure=6
    Readiness:  exec [/bin/calico-node -felix-ready -bird-ready] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      CALICO_MANAGE_CNI:                  true
      DATASTORE_TYPE:                     kubernetes
      WAIT_FOR_DATASTORE:                 true
      NODENAME:                            (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:          <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
      CLUSTER_TYPE:                       k8s,bgp
      FELIX_IPINIPMTU:                    <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      CALICO_DISABLE_FILE_LOGGING:        true
      FELIX_DEFAULTENDPOINTTOHOSTACTION:  ACCEPT
      FELIX_LOGSEVERITYSCREEN:            Info
      FELIX_HEALTHENABLED:                true
      FELIX_PROMETHEUSMETRICSENABLED:     true
      FELIX_IPV6SUPPORT:                  false
      IP:
      CALICO_IPV4POOL_IPIP:               Always
      **IP_AUTODETECTION_METHOD:            kubernetes-internal-ip**
      CALICO_IPV4POOL_CIDR:               10.10.0.0/16
      FELIX_DEVICEROUTESOURCEADDRESS:      (v1:status.hostIP)

Context

The Calico BGP mesh is not forming.
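
One way to observe the failing mesh is to check the BGP peer state on the affected node (a sketch; assumes calicoctl is installed on the host and can be run as root):

# Run on the node itself; lists BGP peers and their session state
sudo calicoctl node status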

Your Environment

caseydavenport commented 1 year ago

Hm, it does appear that Calico is selecting a different address.

Could you share the startup log output from the calico/node pod on that Node?
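
For example, one way to pull the relevant startup lines (a sketch; the pod name is the one from the describe output above, and the grep pattern is only a guess at the relevant messages):

# Autodetection-related startup lines from the calico-node container on that node
kubectl logs -n kube-system calico-node-lj6f5 -c calico-node | grep -i autodetect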

sridhartigera commented 1 year ago

@amshankaran Any update on this? Could you share the log output?

amshankaran commented 1 year ago

Hi @sridhartigera, I printed the Node object and can see that the Node object itself carries the VIP address as the node address instead of the InternalIP; please find the log below. The node describe, however, shows the proper address. I'm redeploying the stack and will upload the complete log soon.

2023-02-16 15:54:10.445 [INFO][9] startup/startup.go 485: Initialize BGP data
2023-02-16 15:54:10.445 [INFO][9] startup/autodetection_methods.go 171: PRINTING Node &Node{ObjectMeta:{capocluster-cp-4v875 d413ac0c-b0bb-46ef-999c-8c2f72c7ae2e 609 0 2023-02-16 15:52:24 +0000 UTC map[beta.kubernetes.io/arch:amd64 beta.kubernetes.io/os:linux ... kubernetes.io/hostname:capocluster-cp-4v875 kubernetes.io/os:linux node-pool:cp node-role.kubernetes.io/control-plane: node.kubernetes.io/exclude-from-external-load-balancers: ...] map[kubeadm.alpha.kubernetes.io/cri-socket:unix:///run/containerd/containerd.sock node.alpha.kubernetes.io/ttl:0 volumes.kubernetes.io/controller-managed-attach-detach:true] [] [] [...]},Spec:NodeSpec{PodCIDR:10.10.0.0/24,Unschedulable:false,Taints:[]Taint{Taint{Key:node.cloudprovider.kubernetes.io/uninitialized,Value:true,Effect:NoSchedule,TimeAdded:,},},ConfigSource:nil,PodCIDRs:[10.10.0.0/24],},Status:NodeStatus{Capacity:...,Allocatable:...,Conditions:...,Addresses:[]NodeAddress{NodeAddress{**Type:InternalIP,Address:10.0.100.6**,},NodeAddress{Type:Hostname,Address:capocluster-cp-4v875,},},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:10250,},},NodeInfo:NodeSystemInfo{MachineID:2f5892fced7b4284a65d1b0deb10c77e,SystemUUID:2f5892fc-ed7b-4284-a65d-1b0deb10c77e,BootID:20079fa3-468e-4ce4-9c5c-5f527aa7bf61,KernelVersion:5.14.21-150400.24.41-default,OSImage:SUSE Linux Enterprise Server 15 SP4,ContainerRuntimeVersion:containerd://1.6.12,KubeletVersion:v1.26.1,KubeProxyVersion:v1.26.1,OperatingSystem:linux,Architecture:amd64,},Images:[...],VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},}
2023-02-16 15:54:10.446 [INFO][9] startup/autodetection_methods.go 232: Including CIDR information from host interface. CIDR="10.0.100.6/24"
2023-02-16 15:54:10.446 [INFO][9] startup/startup.go 561: Node IPv4 changed, will check for conflicts
2023-02-16 15:54:10.448 [INFO][9] startup/startup.go 701: No AS number configured on node resource, using global value
2023-02-16 15:54:10.456 [INFO][9] startup/startup.go 750: found v6= in the kubeadm config map
2023-02-16 15:54:10.458 [INFO][9] startup/startup.go 682: CALICO_IPV4POOL_NAT_OUTGOING is true (defaulted) through environment variable
2023-02-16 15:54:10.458 [INFO][9] startup/startup.go 682: CALICO_IPV4POOL_DISABLE_BGP_EXPORT is false (defaulted) through environment variable
2023-02-16 15:54:10.458 [INFO][9] startup/startup.go 908: Ensure default IPv4 pool is created. IPIP mode: Always, VXLAN mode: Never, DisableBGPExport: false
2023-02-16 15:54:10.472 [INFO][9] startup/startup.go 918: Created default IPv4 pool (10.10.0.0/16) with NAT outgoing true. IPIP mode: Always, VXLAN mode: Never, DisableBGPExport: false
amshankaran commented 1 year ago

@caseydavenport please find the calico-node pod log attached: calico-node.log

tomastigera commented 1 year ago

@amshankaran yeah, your log shows that Calico is picking up the IP provided by the Node object as the InternalIP: Addresses:[]NodeAddress{NodeAddress{Type:InternalIP,Address:10.0.100.6,},NodeAddress{Type:Hostname,Address:capocluster-cp-jlfxj,},}

amshankaran commented 1 year ago

@tomastigera Whereas the node describe shows the Kubernetes node's InternalIP as 10.0.0.1 (expected), which is different from the InternalIP in the Node object that Calico logged (10.0.100.6). How is this possible?

Addresses:
  InternalIP:  10.0.0.1
  Hostname:    cp-tp888
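
A direct way to compare the two values is to look at what the API server currently reports against what calico-node recorded on the node (a sketch, reusing the node name from the earlier output):

# Addresses the Kubernetes API server currently reports for the node
kubectl get node cp-tp888 -o jsonpath='{.status.addresses}'

# Address calico-node recorded in its annotation on the same node
kubectl get node cp-tp888 -o yaml | grep projectcalico.org/IPv4Address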

tomastigera commented 1 year ago

@amshankaran Sorry for not responding earlier. Any development on this? What entity has 10.0.100.6? Are the logs from the same node? It does not seem like it is picking a random address from any device on that node :thinking:

amshankaran commented 1 year ago

@tomastigera IP 10.0.100.6 is on the same node, on a different interface (the eth2 address). (It was 10.14.100.21 in my first message; I redeployed the stack and got 10.0.100.6 for eth2. Basically it is taking the eth2 address.) The Kubernetes InternalIP is 10.0.0.1, and that is the address I expect calico-node to use.

But the logs above confirm that Calico receives the Node's InternalIP as Type:InternalIP,Address:10.0.100.6, while the node describe gives Addresses: InternalIP: 10.0.0.1.

coutinhop commented 1 year ago

@amshankaran could you please change logSeverityScreen in the default FelixConfiguration to Debug and post the Calico logs again? (See https://docs.tigera.io/calico/latest/reference/resources/felixconfig for reference.)
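
For example, either of these should work (a sketch; which one applies depends on how Calico is installed):

# With calicoctl
calicoctl patch felixconfiguration default -p '{"spec":{"logSeverityScreen":"Debug"}}'

# Or directly against the FelixConfiguration CRD with kubectl
kubectl patch felixconfiguration default --type merge -p '{"spec":{"logSeverityScreen":"Debug"}}'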

Also, could you post updated kubectl describe of both the k8s node and calico-node corresponding to those logs?

It does seem like Calico finds 10.0.100.6 as the InternalIP. Can we make sure the Node's InternalIP wasn't 10.0.100.6 at some point in time and then changed? Just trying to better understand how this issue came about...

On a more practical note, does your setup use a "predictable" interface for the node's IP? If so, IP_AUTODETECTION_METHOD=interface=eth0 or something similar might be a viable workaround...
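
For example, with the manifest-based install shown above (calico-node running as a DaemonSet in kube-system), one way to apply that workaround might be the following sketch; verify the interface name on your nodes first, and edit the DaemonSet directly if your kubectl does not accept an '=' inside the value:

# Point autodetection at a specific interface instead of the Kubernetes InternalIP
kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=interface=eth0

# The DaemonSet rolls out new pods; confirm they pick up the intended address afterwards
kubectl rollout status daemonset/calico-node -n kube-system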