projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/

The two nodes cannot ping each other #9138

Closed: ja-mitiankai closed this issue 2 months ago

ja-mitiankai commented 2 months ago

Expected Behavior

I expect the pods in my k8s cluster to be able to communicate with each other across nodes.

Current Behavior

My previous k8s cluster was running normally. I wanted to re-divide the network segments, so I reset the old cluster and ran kubeadm init again, but then I ran into a problem: Calico appears to be healthy, yet the nodes cannot communicate with each other. kubectl describe pod -n calico-system calico-node-7cwl2 shows a warning at first, but after a while it disappears and the pod reports as normal.

warning:

Warning  Unhealthy  29m  kubelet  Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused

normal:

  kube-api-access-wdsf5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 :NoSchedule op=Exists
                             :NoExecute op=Exists
                             CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:                      <none>

The following is part of the calico-node log:


2024-08-15 07:08:17.264 [INFO][85] felix/hostip_mgr.go 84: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"calie3ba9b96e53", Addrs:set.Typed[string]{"fe80::ecee:eeff:feee:eeee":set.v{}}}
2024-08-15 07:08:31.401 [INFO][85] felix/summary.go 100: Summarising 29 dataplane reconciliation loops over 1m1.8s: avg=6ms longest=17ms (resync-filter-v4,resync-mangle-v4,update-filter-v4)
bird: KIF: Received address message for unknown interface 128090
2024-08-15 07:08:58.378 [INFO][85] felix/int_dataplane.go 1289: Linux interface state changed. ifIndex=128090 ifaceName="calie3ba9b96e53" state="down"
2024-08-15 07:08:58.378 [INFO][85] felix/int_dataplane.go 1893: Received interface update msg=&intdataplane.ifaceStateUpdate{Name:"calie3ba9b96e53", State:"down", Index:128090}
2024-08-15 07:08:58.378 [INFO][85] felix/int_dataplane.go 1325: Linux interface addrs changed. addrs=set.Set{} ifaceName="calie3ba9b96e53"
2024-08-15 07:08:58.379 [INFO][85] felix/int_dataplane.go 1913: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"calie3ba9b96e53", Addrs:set.Typed[string]{}}
2024-08-15 07:08:58.379 [INFO][85] felix/iface_monitor.go 238: Netlink address update for known interface.  addr="fe80::ecee:eeff:feee:eeee" exists=false ifIndex=128090
2024-08-15 07:08:58.379 [INFO][85] felix/hostip_mgr.go 84: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"calie3ba9b96e53", Addrs:set.Typed[string]{}}
2024-08-15 07:08:58.379 [INFO][85] felix/endpoint_mgr.go 431: Workload interface state changed; marking for status update. ifaceName="calie3ba9b96e53"
2024-08-15 07:08:58.379 [INFO][85] felix/endpoint_mgr.go 490: Re-evaluated workload endpoint status adminUp=true failed=false known=true operUp=false status="down" workloadEndpointID=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"default/debug", EndpointId:"eth0"}
2024-08-15 07:08:58.379 [INFO][85] felix/status_combiner.go 58: Storing endpoint status update ipVersion=0x4 status="down" workload=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"default/debug", EndpointId:"eth0"}
2024-08-15 07:08:58.379 [INFO][85] felix/status_combiner.go 78: Endpoint down for at least one IP version id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"default/debug", EndpointId:"eth0"} ipVersion=0x4 status="down"
2024-08-15 07:08:58.379 [INFO][85] felix/status_combiner.go 98: Reporting combined status. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"default/debug", EndpointId:"eth0"} status="down"
2024-08-15 07:08:58.381 [INFO][85] felix/int_dataplane.go 1289: Linux interface state changed. ifIndex=128090 ifaceName="calie3ba9b96e53" state=""
2024-08-15 07:08:58.381 [INFO][85] felix/int_dataplane.go 1325: Linux interface addrs changed. addrs=<nil> ifaceName="calie3ba9b96e53"
2024-08-15 07:08:58.381 [INFO][85] felix/int_dataplane.go 1893: Received interface update msg=&intdataplane.ifaceStateUpdate{Name:"calie3ba9b96e53", State:"", Index:128090}
2024-08-15 07:08:58.381 [INFO][85] felix/int_dataplane.go 1913: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"calie3ba9b96e53", Addrs:set.Set[string](nil)}
2024-08-15 07:08:58.381 [INFO][85] felix/hostip_mgr.go 84: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"calie3ba9b96e53", Addrs:set.Set[string](nil)}
2024-08-15 07:08:58.381 [INFO][85] felix/endpoint_mgr.go 431: Workload interface state changed; marking for status update. ifaceName="calie3ba9b96e53"
2024-08-15 07:08:58.381 [INFO][85] felix/endpoint_mgr.go 490: Re-evaluated workload endpoint status adminUp=true failed=false known=true operUp=false status="down" workloadEndpointID=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"default/debug", EndpointId:"eth0"}
2024-08-15 07:08:58.381 [INFO][85] felix/status_combiner.go 58: Storing endpoint status update ipVersion=0x4 status="down" workload=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"default/debug", EndpointId:"eth0"}
2024-08-15 07:08:58.381 [INFO][85] felix/status_combiner.go 78: Endpoint down for at least one IP version id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"default/debug", EndpointId:"eth0"} ipVersion=0x4 status="down"
2024-08-15 07:08:58.381 [INFO][85] felix/status_combiner.go 98: Reporting combined status. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"default/debug", EndpointId:"eth0"} status="down"
2024-08-15 07:08:58.489 [INFO][85] felix/calc_graph.go 465: Local endpoint deleted id=WorkloadEndpoint(node=izuf640eb7mh8kgvyg4b4jz, orchestrator=k8s, workload=default/debug, name=eth0)
2024-08-15 07:08:58.489 [INFO][85] felix/int_dataplane.go 1836: Received *proto.WorkloadEndpointRemove update from calculation graph msg=id:<orchestrator_id:"k8s" workload_id:"default/debug" endpoint_id:"eth0" > 
2024-08-15 07:08:58.489 [INFO][85] felix/int_dataplane.go 1836: Received *proto.ActiveProfileRemove update from calculation graph msg=id:<name:"kns.default" > 
2024-08-15 07:08:58.489 [INFO][85] felix/table.go 539: Queuing deletion of chain. chainName="cali-pri-kns.default" ipVersion=0x4 table="filter"
2024-08-15 07:08:58.489 [INFO][85] felix/table.go 539: Queuing deletion of chain. chainName="cali-pro-kns.default" ipVersion=0x4 table="filter"
2024-08-15 07:08:58.489 [INFO][85] felix/table.go 539: Queuing deletion of chain. chainName="cali-pro-kns.default" ipVersion=0x4 table="mangle"
2024-08-15 07:08:58.489 [INFO][85] felix/int_dataplane.go 1836: Received *proto.ActiveProfileRemove update from calculation graph msg=id:<name:"ksa.default.default" > 
2024-08-15 07:08:58.489 [INFO][85] felix/table.go 539: Queuing deletion of chain. chainName="cali-pri-ksa.default.default" ipVersion=0x4 table="filter"
2024-08-15 07:08:58.490 [INFO][85] felix/table.go 539: Queuing deletion of chain. chainName="cali-pro-ksa.default.default" ipVersion=0x4 table="filter"
2024-08-15 07:08:58.490 [INFO][85] felix/table.go 539: Queuing deletion of chain. chainName="cali-pro-ksa.default.default" ipVersion=0x4 table="mangle"
2024-08-15 07:08:58.490 [INFO][85] felix/int_dataplane.go 1836: Received *proto.RouteRemove update from calculation graph msg=dst:"192.168.125.140/32" 
2024-08-15 07:08:58.490 [INFO][85] felix/endpoint_mgr.go 703: Workload removed, deleting its chains. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"default/debug", EndpointId:"eth0"}
2024-08-15 07:08:58.490 [INFO][85] felix/table.go 539: Queuing deletion of chain. chainName="cali-tw-calie3ba9b96e53" ipVersion=0x4 table="filter"
2024-08-15 07:08:58.490 [INFO][85] felix/table.go 593: Chain no longer referenced, marking it for removal chainName="cali-pri-kns.default"
2024-08-15 07:08:58.490 [INFO][85] felix/table.go 593: Chain no longer referenced, marking it for removal chainName="cali-pri-ksa.default.default"
2024-08-15 07:08:58.490 [INFO][85] felix/table.go 539: Queuing deletion of chain. chainName="cali-fw-calie3ba9b96e53" ipVersion=0x4 table="filter"
2024-08-15 07:08:58.490 [INFO][85] felix/table.go 593: Chain no longer referenced, marking it for removal chainName="cali-pro-kns.default"
2024-08-15 07:08:58.490 [INFO][85] felix/table.go 593: Chain no longer referenced, marking it for removal chainName="cali-pro-ksa.default.default"
2024-08-15 07:08:58.490 [INFO][85] felix/endpoint_mgr.go 558: Workload removed, deleting old state. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"default/debug", EndpointId:"eth0"}
2024-08-15 07:08:58.491 [INFO][85] felix/table.go 508: Queueing update of chain. chainName="cali-from-wl-dispatch" ipVersion=0x4 table="filter"
2024-08-15 07:08:58.491 [INFO][85] felix/table.go 593: Chain no longer referenced, marking it for removal chainName="cali-from-wl-dispatch-e"
2024-08-15 07:08:58.491 [INFO][85] felix/table.go 508: Queueing update of chain. chainName="cali-to-wl-dispatch" ipVersion=0x4 table="filter"
2024-08-15 07:08:58.491 [INFO][85] felix/table.go 593: Chain no longer referenced, marking it for removal chainName="cali-to-wl-dispatch-e"
2024-08-15 07:08:58.491 [INFO][85] felix/table.go 539: Queuing deletion of chain. chainName="cali-from-wl-dispatch-e" ipVersion=0x4 table="filter"
2024-08-15 07:08:58.491 [INFO][85] felix/table.go 593: Chain no longer referenced, marking it for removal chainName="cali-fw-calie3ba9b96e53"
2024-08-15 07:08:58.491 [INFO][85] felix/table.go 539: Queuing deletion of chain. chainName="cali-to-wl-dispatch-e" ipVersion=0x4 table="filter"
2024-08-15 07:08:58.491 [INFO][85] felix/table.go 593: Chain no longer referenced, marking it for removal chainName="cali-tw-calie3ba9b96e53"
2024-08-15 07:08:58.491 [INFO][85] felix/endpoint_mgr.go 490: Re-evaluated workload endpoint status adminUp=false failed=false known=false operUp=false status="" workloadEndpointID=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"default/debug", EndpointId:"eth0"}
2024-08-15 07:08:58.491 [INFO][85] felix/status_combiner.go 58: Storing endpoint status update ipVersion=0x4 status="" workload=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"default/debug", EndpointId:"eth0"}
2024-08-15 07:08:58.491 [INFO][85] felix/route_table.go 1185: Failed to access interface because it doesn't exist. error=Link not found ifaceName="calie3ba9b96e53" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=254
2024-08-15 07:08:58.491 [INFO][85] felix/route_table.go 1253: Failed to get interface; it's down/gone. error=Link not found ifaceName="calie3ba9b96e53" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=254
2024-08-15 07:08:58.492 [INFO][85] felix/route_table.go 589: Interface missing, will retry if it appears. ifaceName="calie3ba9b96e53" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=254
2024-08-15 07:08:58.492 [INFO][85] felix/conntrack.go 90: Removing conntrack flows ip=192.168.125.140
2024-08-15 07:08:58.507 [INFO][85] felix/status_combiner.go 86: Reporting endpoint removed. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"default/debug", EndpointId:"eth0"}
2024-08-15 07:09:34.028 [INFO][85] felix/summary.go 100: Summarising 19 dataplane reconciliation loops over 1m2.6s: avg=6ms longest=25ms ()
2024-08-15 07:10:37.652 [INFO][85] felix/summary.go 100: Summarising 12 dataplane reconciliation loops over 1m3.6s: avg=5ms longest=25ms ()
2024-08-15 07:11:40.381 [INFO][85] felix/summary.go 100: Summarising 9 dataplane reconciliation loops over 1m2.7s: avg=4ms longest=7ms (resync-filter-v4)
2024-08-15 07:12:42.652 [INFO][85] felix/summary.go 100: Summarising 11 dataplane reconciliation loops over 1m2.3s: avg=5ms longest=23ms ()
2024-08-15 07:13:46.856 [INFO][85] felix/summary.go 100: Summarising 9 dataplane reconciliation loops over 1m4.2s: avg=5ms longest=23ms ()
2024-08-15 07:14:49.555 [INFO][85] felix/summary.go 100: Summarising 9 dataplane reconciliation loops over 1m2.7s: avg=3ms longest=7ms (resync-mangle-v4,resync-nat-v4)
2024-08-15 07:15:53.574 [INFO][85] felix/summary.go 100: Summarising 10 dataplane reconciliation loops over 1m4s: avg=5ms longest=24ms ()
2024-08-15 07:16:55.997 [INFO][85] felix/summary.go 100: Summarising 9 dataplane reconciliation loops over 1m2.4s: avg=5ms longest=23ms ()
2024-08-15 07:18:00.032 [INFO][85] felix/summary.go 100: Summarising 9 dataplane reconciliation loops over 1m4s: avg=3ms longest=9ms (resync-mangle-v4,resync-nat-v4)
2024-08-15 07:19:00.518 [INFO][85] felix/summary.go 100: Summarising 10 dataplane reconciliation loops over 1m0.5s: avg=6ms longest=25ms ()
2024-08-15 07:20:05.005 [INFO][85] felix/summary.go 100: Summarising 11 dataplane reconciliation loops over 1m4.5s: avg=5ms longest=26ms ()
2024-08-15 07:21:06.364 [INFO][85] felix/summary.go 100: Summarising 9 dataplane reconciliation loops over 1m1.4s: avg=7ms longest=24ms ()
2024-08-15 07:22:11.083 [INFO][85] felix/summary.go 100: Summarising 9 dataplane reconciliation loops over 1m4.7s: avg=3ms longest=7ms (resync-filter-v4)
2024-08-15 07:23:15.050 [INFO][85] felix/summary.go 100: Summarising 10 dataplane reconciliation loops over 1m4s: avg=5ms longest=24ms ()
2024-08-15 07:24:18.783 [INFO][85] felix/summary.go 100: Summarising 9 dataplane reconciliation loops over 1m3.7s: avg=4ms longest=10ms (resync-mangle-v4,resync-nat-v4)
2024-08-15 07:25:20.846 [INFO][85] felix/summary.go 100: Summarising 9 dataplane reconciliation loops over 1m2.1s: avg=5ms longest=23ms ()
2024-08-15 07:26:24.458 [INFO][85] felix/summary.go 100: Summarising 11 dataplane reconciliation loops over 1m3.6s: avg=5ms longest=23ms ()
2024-08-15 07:27:27.327 [INFO][85] felix/summary.go 100: Summarising 9 dataplane reconciliation loops over 1m2.9s: avg=5ms longest=24ms ()
2024-08-15 07:28:28.004 [INFO][85] felix/summary.go 100: Summarising 8 dataplane reconciliation loops over 1m0.7s: avg=3ms longest=8ms (resync-filter-v4)
2024-08-15 07:29:33.163 [INFO][85] felix/summary.go 100: Summarising 11 dataplane reconciliation loops over 1m5.2s: avg=5ms longest=24ms ()
2024-08-15 07:30:36.941 [INFO][85] felix/summary.go 100: Summarising 9 dataplane reconciliation loops over 1m3.8s: avg=6ms longest=23ms ()
2024-08-15 07:31:41.070 [INFO][85] felix/summary.go 100: Summarising 10 dataplane reconciliation loops over 1m4.1s: avg=3ms longest=8ms (resync-mangle-v4,resync-nat-v4)
2024-08-15 07:32:44.669 [INFO][85] felix/summary.go 100: Summarising 9 dataplane reconciliation loops over 1m3.6s: avg=5ms longest=24ms ()
2024-08-15 07:33:46.507 [INFO][85] felix/summary.go 100: Summarising 10 dataplane reconciliation loops over 1m1.8s: avg=6ms longest=26ms ()
2024-08-15 07:34:49.507 [INFO][85] felix/summary.go 100: Summarising 9 dataplane reconciliation loops over 1m3s: avg=3ms longest=8ms (resync-mangle-v4,resync-nat-v4)
2024-08-15 07:35:51.803 [INFO][85] felix/summary.go 100: Summarising 9 dataplane reconciliation loops over 1m2.3s: avg=5ms longest=23ms ()
2024-08-15 07:36:54.878 [INFO][85] felix/summary.go 100: Summarising 12 dataplane reconciliation loops over 1m3.1s: avg=6ms longest=24ms ()
2024-08-15 07:37:58.721 [INFO][85] felix/summary.go 100: Summarising 7 dataplane reconciliation loops over 1m3.8s: avg=2ms longest=3ms (resync-routes-v4,resync-routes-v4,resync-routes-v4,resync-routes-v4,resync-routes-v4,resync-rules-v4,resync-wg)
2024-08-15 07:39:02.954 [INFO][85] felix/summary.go 100: Summarising 11 dataplane reconciliation loops over 1m4.2s: avg=5ms longest=23ms ()
2024-08-15 07:40:06.437 [INFO][85] felix/summary.go 100: Summarising 10 dataplane reconciliation loops over 1m3.5s: avg=5ms longest=23ms ()

Below is the log of calico-apiserver:

I0815 09:05:32.239608 1 resource.go:207] Watch called with key: projectcalico.org/bgppeers on resource BGPPeer
time="2024-08-15T09:05:32Z" level=info msg="Main client watcher loop"
time="2024-08-15T09:05:56Z" level=info msg="Process backend watcher done event in main client"
time="2024-08-15T09:05:56Z" level=info msg="Terminating main client watcher loop"
I0815 09:05:56.436277 1 resource.go:207] Watch called with key: projectcalico.org/profiles on resource Profile
time="2024-08-15T09:05:56Z" level=info msg="Main client watcher loop"
time="2024-08-15T09:06:36Z" level=info msg="Process backend watcher done event in main client"
time="2024-08-15T09:06:36Z" level=info msg="Terminating main client watcher loop"
I0815 09:06:36.235471 1 resource.go:207] Watch called with key: projectcalico.org/kubecontrollersconfigurations on resource KubeControllersConfiguration
time="2024-08-15T09:06:36Z" level=info msg="Main client watcher loop"
time="2024-08-15T09:07:00Z" level=info msg="Process backend watcher done event in main client"
time="2024-08-15T09:07:00Z" level=info msg="Terminating main client watcher loop"
I0815 09:07:00.400414 1 resource.go:207] Watch called with key: projectcalico.org/ipamconfigurations on resource IPAMConfiguration
time="2024-08-15T09:07:00Z" level=info msg="Main client watcher loop"
time="2024-08-15T09:07:06Z" level=info msg="Process backend watcher done event in main client"
time="2024-08-15T09:07:06Z" level=info msg="Terminating main client watcher loop"
I0815 09:07:06.186406 1 resource.go:207] Watch called with key: projectcalico.org/caliconodestatuses on resource CalicoNodeStatus
time="2024-08-15T09:07:06Z" level=info msg="Main client watcher loop"
time="2024-08-15T09:07:29Z" level=info msg="Process backend watcher done event in main client"
time="2024-08-15T09:07:29Z" level=info msg="Terminating main client watcher loop"
I0815 09:07:29.437958 1 resource.go:207] Watch called with key: projectcalico.org/felixconfigurations on resource FelixConfiguration
time="2024-08-15T09:07:29Z" level=info msg="Main client watcher loop"
time="2024-08-15T09:07:30Z" level=info msg="Process backend watcher done event in main client"
time="2024-08-15T09:07:30Z" level=info msg="Terminating main client watcher loop"
I0815 09:07:30.213593 1 resource.go:207] Watch called with key: projectcalico.org/globalnetworksets on resource GlobalNetworkSet
time="2024-08-15T09:07:30Z" level=info msg="Main client watcher loop"
time="2024-08-15T09:07:49Z" level=info msg="Process backend watcher done event in main client"
time="2024-08-15T09:07:49Z" level=info msg="Terminating main client watcher loop"
I0815 09:07:49.407763 1 resource.go:207] Watch called with key: projectcalico.org/bgpconfigurations on resource BGPConfiguration
time="2024-08-15T09:07:49Z" level=info msg="Main client watcher loop"
time="2024-08-15T09:09:17Z" level=info msg="Process backend watcher done event in main client"
time="2024-08-15T09:09:17Z" level=info msg="Terminating main client watcher loop"
I0815 09:09:17.413098 1 resource.go:207] Watch called with key: projectcalico.org/clusterinformations on resource ClusterInformation
time="2024-08-15T09:09:17Z" level=info msg="Main client watcher loop"
time="2024-08-15T09:09:38Z" level=info msg="Process backend watcher done event in main client"
time="2024-08-15T09:09:38Z" level=info msg="Terminating main client watcher loop"
I0815 09:09:38.764253 1 resource.go:207] Watch called with key: projectcalico.org/bgpfilters on resource BGPFilter
time="2024-08-15T09:09:38Z" level=info msg="Main client watcher loop"
time="2024-08-15T09:10:45Z" level=info msg="Process backend watcher done event in main client"
time="2024-08-15T09:10:45Z" level=info msg="Terminating main client watcher loop"
I0815 09:10:45.535100 1 resource.go:207] Watch called with key: projectcalico.org/ipreservations on resource IPReservation
time="2024-08-15T09:10:45Z" level=info msg="Main client watcher loop"
time="2024-08-15T09:11:02Z" level=info msg="Process backend watcher done event in main client"
time="2024-08-15T09:11:02Z" level=info msg="Terminating main client watcher loop"
I0815 09:11:02.244076 1 resource.go:207] Watch called with key: projectcalico.org/bgppeers on resource BGPPeer
time="2024-08-15T09:11:02Z" level=info msg="Main client watcher loop"
time="2024-08-15T09:11:41Z" level=info msg="Process backend watcher done event in main client"
time="2024-08-15T09:11:41Z" level=info msg="Terminating main client watcher loop"
I0815 09:11:41.788187 1 resource.go:207] Watch called with key: projectcalico.org/ippools on resource IPPool
time="2024-08-15T09:11:41Z" level=info msg="Main client watcher loop"
time="2024-08-15T09:11:42Z" level=info msg="Process backend watcher done event in main client"
time="2024-08-15T09:11:42Z" level=info msg="Terminating main client watcher loop"
I0815 09:11:42.768146 1 resource.go:207] Watch called with key: projectcalico.org/blockaffinities on resource BlockAffinity
time="2024-08-15T09:11:42Z" level=info msg="Main client watcher loop"
time="2024-08-15T09:11:58Z" level=info msg="Process backend watcher done event in main client"
time="2024-08-15T09:11:58Z" level=info msg="Terminating main client watcher loop"
I0815 09:11:58.859334 1 resource.go:207] Watch called with key: projectcalico.org/hostendpoints on resource HostEndpoint
time="2024-08-15T09:11:58Z" level=info msg="Main client watcher loop"
time="2024-08-15T09:12:25Z" level=info msg="Process backend watcher done event in main client"
time="2024-08-15T09:12:25Z" level=info msg="Terminating main client watcher loop"
I0815 09:12:25.408250 1 resource.go:207] Watch called with key: projectcalico.org/globalnetworkpolicies on resource GlobalNetworkPolicy
time="2024-08-15T09:12:25Z" level=info msg="Main client watcher loop"
time="2024-08-15T09:12:41Z" level=info msg="Process backend watcher done event in main client"
time="2024-08-15T09:12:41Z" level=info msg="Terminating main client watcher loop"
I0815 09:12:41.442755 1 resource.go:207] Watch called with key: projectcalico.org/felixconfigurations on resource FelixConfiguration
time="2024-08-15T09:12:41Z" level=info msg="Main client watcher loop"
time="2024-08-15T09:13:01Z" level=info msg="Process backend watcher done event in main client"
time="2024-08-15T09:13:01Z" level=info msg="Terminating main client watcher loop"
I0815 09:13:01.412968 1 resource.go:207] Watch called with key: projectcalico.org/bgpconfigurations on resource BGPConfiguration
time="2024-08-15T09:13:01Z" level=info msg="Main client watcher loop"
time="2024-08-15T09:13:04Z" level=info msg="Process backend watcher done event in main client"
time="2024-08-15T09:13:04Z" level=info msg="Terminating main client watcher loop"
I0815 09:13:04.441301 1 resource.go:207] Watch called with key: projectcalico.org/profiles on resource Profile
time="2024-08-15T09:13:04Z" level=info msg="Main client watcher loop"
time="2024-08-15T09:13:10Z" level=info msg="Process backend watcher done event in main client"
time="2024-08-15T09:13:10Z" level=info msg="Terminating main client watcher loop"
I0815 09:13:10.240013 1 resource.go:207] Watch called with key: projectcalico.org/kubecontrollersconfigurations on resource KubeControllersConfiguration
time="2024-08-15T09:13:10Z" level=info msg="Main client watcher loop"
time="2024-08-15T09:13:16Z" level=info msg="Process backend watcher done event in main client"
time="2024-08-15T09:13:16Z" level=info msg="Terminating main client watcher loop"
I0815 09:13:16.654980 1 resource.go:207] Watch called with key: projectcalico.org/networkpolicies on resource NetworkPolicy
time="2024-08-15T09:13:16Z" level=info msg="Main client watcher loop"
time="2024-08-15T09:14:56Z" level=info msg="Process backend watcher done event in main client"
time="2024-08-15T09:14:56Z" level=info msg="Terminating main client watcher loop"
I0815 09:14:56.191850 1 resource.go:207] Watch called with key: projectcalico.org/caliconodestatuses on resource CalicoNodeStatus
time="2024-08-15T09:14:56Z" level=info msg="Main client watcher loop"
time="2024-08-15T09:14:56Z" level=info msg="Process backend watcher done event in main client"
time="2024-08-15T09:14:56Z" level=info msg="Terminating main client watcher loop"
I0815 09:14:56.768722 1 resource.go:207] Watch called with key: projectcalico.org/bgpfilters on resource BGPFilter
time="2024-08-15T09:14:56Z" level=info msg="Main client watcher loop"
time="2024-08-15T09:15:02Z" level=info msg="Process backend watcher done event in main client"
time="2024-08-15T09:15:02Z" level=info msg="Terminating main client watcher loop"
I0815 09:15:02.217377 1 resource.go:207] Watch called with key: projectcalico.org/globalnetworksets on resource GlobalNetworkSet
time="2024-08-15T09:15:02Z" level=info msg="Main client watcher loop"

Output of calicoctl node status on the master:

Calico process is running.

IPv4 BGP status
+---------------+-------------------+-------+----------+-------------+
| PEER ADDRESS  |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+---------------+-------------------+-------+----------+-------------+
| 172.17.114.22 | node-to-node mesh | up    | 08:53:11 | Established |
+---------------+-------------------+-------+----------+-------------+

IPv6 BGP status
No IPv6 peers found.

Output of calicoctl node status on the worker:

Calico process is running.

IPv4 BGP status
+---------------+-------------------+-------+----------+-------------+
| PEER ADDRESS  |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+---------------+-------------------+-------+----------+-------------+
| 172.17.114.24 | node-to-node mesh | up    | 08:53:11 | Established |
+---------------+-------------------+-------+----------+-------------+

IPv6 BGP status
No IPv6 peers found.

Possible Solution

I have no idea

Steps to Reproduce (for bugs)

Just deploy a normal cluster:

  1. Both servers use the cri-dockerd container runtime
  2. Initialize the cluster using kubeadm init --config=my-config.yaml
  3. Use kubectl create -f tigera-operator.yaml
  4. Use kubectl create -f custom-resources.yaml (the relevant part of this file is sketched below, after these steps)
  5. Watch the pod status until they are all started
  6. Use kubectl describe pod -n calico-system calico-node-7cwl2
  7. Use kubectl run -i --tty --rm debug --image=busybox --restart=Never -- ping [calico-apiserver-ip]; the output looks like this:
    If you don't see a command prompt, try pressing enter.
    ^C
    --- 192.168.125.133 ping statistics ---
    72 packets transmitted, 0 packets received, 100% packet loss
    pod "debug" deleted
    pod default/debug terminated (Error)
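
For reference, the encapsulation setting discussed later in this issue lives in the ipPools section of the Installation resource inside custom-resources.yaml. A typical snippet looks roughly like this (the CIDR below is the upstream example value, not necessarily the one used in my cluster):

apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    ipPools:
    - blockSize: 26
      cidr: 192.168.0.0/16           # example value from the upstream manifest
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()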

Your Environment

caseydavenport commented 2 months ago

@ja-mitiankai nothing is standing out to me as especially wrong here - logs look normal. What is the main symptom that you're trying to diagnose?

Warning Unhealthy 29m kubelet Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused

This looks like it happened 29m ago but hasn't happened since, and it is a potentially normal transient event during bootstrapping, so long as it doesn't persist. Considering your BGP status is "Established", it looks like everything is working correctly.

72 packets transmitted, 0 packets received, 100% packet loss

I would expect a pod to be able to ping this IP, so that's potentially suspicious. Are other pods having trouble communicating?
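
A quick way to check would be to pick pods on different nodes and ping between them, for example (the target IP below is a placeholder for a pod IP on the other node):

kubectl get pods -A -o wide          # note which node each pod is scheduled on and its IP
kubectl run pingtest -it --rm --image=busybox --restart=Never -- ping <pod-ip-on-the-other-node>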

ja-mitiankai commented 2 months ago

@caseydavenport I can't communicate across nodes now. Everything looks fine, but it isn't actually working. I want to try capturing packets with tcpdump; I'm sorry, I don't know much about tcpdump. node1 IP: 172.17.114.24 (master), node2 IP: 172.17.114.22 (worker). The pod I ran the ping test from is on node1 (name: debug, IP: 192.192.125.143). The command is: kubectl run -i --tty --rm debug --image=busybox --restart=Never -- ping 192.192.134.72. The target IP 192.192.134.72 belongs to csi-node-driver-9zbv7, which is running on node2.

route at node1:

192.192.125.128 0.0.0.0         255.255.255.192 U     0      0        0 *
192.192.125.137 0.0.0.0         255.255.255.255 UH    0      0        0 cali5331bf8f64b
192.192.125.138 0.0.0.0         255.255.255.255 UH    0      0        0 cali2f01c522b60
192.192.125.139 0.0.0.0         255.255.255.255 UH    0      0        0 calie5460d1dd77
192.192.125.140 0.0.0.0         255.255.255.255 UH    0      0        0 cali726ebf83671
192.192.125.143 0.0.0.0         255.255.255.255 UH    0      0        0 calie3ba9b96e53
192.192.134.64  172.17.114.22   255.255.255.192 UG    0      0        0 eth0

ip route at node1:

blackhole 192.192.125.128/26 proto 80 
192.192.125.137 dev cali5331bf8f64b scope link 
192.192.125.138 dev cali2f01c522b60 scope link 
192.192.125.139 dev calie5460d1dd77 scope link 
192.192.125.140 dev cali726ebf83671 scope link 
192.192.125.143 dev calie3ba9b96e53 scope link 
192.192.134.64/26 via 172.17.114.22 dev eth0 proto 80 onlink

route at node2:

192.192.125.128 172.17.114.24   255.255.255.192 UG    0      0        0 eth0
192.192.134.64  0.0.0.0         255.255.255.192 U     0      0        0 *
192.192.134.71  0.0.0.0         255.255.255.255 UH    0      0        0 califdd333fac30
192.192.134.72  0.0.0.0         255.255.255.255 UH    0      0        0 calie923cf83e08
192.192.134.73  0.0.0.0         255.255.255.255 UH    0      0        0 cali73b51134a71

ip route at node2:

192.192.125.128/26 via 172.17.114.24 dev eth0 proto 80 onlink 
blackhole 192.192.134.64/26 proto 80 
192.192.134.71 dev califdd333fac30 scope link 
192.192.134.72 dev calie923cf83e08 scope link 
192.192.134.73 dev cali73b51134a71 scope link 
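
To double-check which path the kernel actually picks for these destinations, ip route get can be used, for example:

ip route get 192.192.134.72     # on node1: should resolve to "via 172.17.114.22 dev eth0"
ip route get 192.192.125.143    # on node2: should resolve to "via 172.17.114.24 dev eth0"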

The following is what I captured on node1 using tcpdump -i calie3ba9b96e53 -nn (calie3ba9b96e53 is the interface for 192.192.125.143):

tcpdump -i calie3ba9b96e53 -nn
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on calie3ba9b96e53, link-type EN10MB (Ethernet), capture size 262144 bytes
12:19:12.173454 IP 192.192.125.143 > 192.192.134.72: ICMP echo request, id 1, seq 4462, length 64
12:19:13.173573 IP 192.192.125.143 > 192.192.134.72: ICMP echo request, id 1, seq 4463, length 64
12:19:14.173701 IP 192.192.125.143 > 192.192.134.72: ICMP echo request, id 1, seq 4464, length 64
12:19:15.173793 IP 192.192.125.143 > 192.192.134.72: ICMP echo request, id 1, seq 4465, length 64
12:19:16.173884 IP 192.192.125.143 > 192.192.134.72: ICMP echo request, id 1, seq 4466, length 64
12:19:17.173975 IP 192.192.125.143 > 192.192.134.72: ICMP echo request, id 1, seq 4467, length 64
12:19:18.174075 IP 192.192.125.143 > 192.192.134.72: ICMP echo request, id 1, seq 4468, length 64
12:19:19.174168 IP 192.192.125.143 > 192.192.134.72: ICMP echo request, id 1, seq 4469, length 64
12:19:20.174274 IP 192.192.125.143 > 192.192.134.72: ICMP echo request, id 1, seq 4470, length 64
12:19:21.174391 IP 192.192.125.143 > 192.192.134.72: ICMP echo request, id 1, seq 4471, length 64
12:19:22.174484 IP 192.192.125.143 > 192.192.134.72: ICMP echo request, id 1, seq 4472, length 64
12:19:22.420710 ARP, Request who-has 169.254.1.1 tell 192.192.125.143, length 28
12:19:22.420740 ARP, Reply 169.254.1.1 is-at ee:ee:ee:ee:ee:ee, length 28
12:19:23.174589 IP 192.192.125.143 > 192.192.134.72: ICMP echo request, id 1, seq 4473, length 64
12:19:24.174711 IP 192.192.125.143 > 192.192.134.72: ICMP echo request, id 1, seq 4474, length 64

On node1 I also run tcpdump -i eth0 icmp -nn | grep 192.192.125.143 (eth0 is the interface carrying the route 192.192.134.64/26 via 172.17.114.22):

tcpdump -i eth0 icmp -nn | grep 192.192.125.143
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
12:19:57.178169 IP 192.192.125.143 > 192.192.134.72: ICMP echo request, id 1, seq 4507, length 64
12:19:58.178264 IP 192.192.125.143 > 192.192.134.72: ICMP echo request, id 1, seq 4508, length 64
12:19:59.178383 IP 192.192.125.143 > 192.192.134.72: ICMP echo request, id 1, seq 4509, length 64
12:20:00.178462 IP 192.192.125.143 > 192.192.134.72: ICMP echo request, id 1, seq 4510, length 64
12:20:01.178553 IP 192.192.125.143 > 192.192.134.72: ICMP echo request, id 1, seq 4511, length 64
12:20:02.178654 IP 192.192.125.143 > 192.192.134.72: ICMP echo request, id 1, seq 4512, length 64
12:20:03.178772 IP 192.192.125.143 > 192.192.134.72: ICMP echo request, id 1, seq 4513, length 64
12:20:04.178855 IP 192.192.125.143 > 192.192.134.72: ICMP echo request, id 1, seq 4514, length 64
12:20:05.178947 IP 192.192.125.143 > 192.192.134.72: ICMP echo request, id 1, seq 4515, length 64
12:20:06.179033 IP 192.192.125.143 > 192.192.134.72: ICMP echo request, id 1, seq 4516, length 64

I don't know where to go next using tcpdump.

The following are the operations on node2: I run tcpdump -i eth0 icmp -nn (eth0 is the interface carrying the route 192.192.125.128/26 via 172.17.114.24), and I also run tcpdump -i calie923cf83e08 icmp -nn (calie923cf83e08 is the interface for 192.192.134.72). Neither of them produces any output. It seems the packets get lost somewhere in the middle, but I don't know where else to run tcpdump.
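
Maybe the next step is to capture on node2's eth0 without the icmp filter, matching on the pod and node addresses, to see whether anything from node1 arrives at all, for example:

tcpdump -i eth0 -nn host 192.192.125.143       # any traffic still carrying the source pod IP
tcpdump -i eth0 -nn src host 172.17.114.24     # any traffic at all arriving from node1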

caseydavenport commented 2 months ago

Yep, sounds like the traffic is leaving the source node properly and then being dropped by your network fabric somewhere in the middle before reaching node2.

Where is this environment running?

Also just to confirm, it looks like you don't have any encapsulation enabled on your Calico IP pools - is that right?
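
One way to confirm what is configured (assuming calicoctl is installed, or using the operator's Installation resource via kubectl):

calicoctl get ippool -o wide                # the IPIPMODE / VXLANMODE columns show the encapsulation
kubectl get installation default -o yaml    # see spec.calicoNetwork.ipPools[].encapsulation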

ja-mitiankai commented 2 months ago

I changed the encapsulation in the custom-resources.yaml file from VXLANCrossSubnet to VXLAN and my problem was solved. Thank you very much for your help. @caseydavenport
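
For anyone hitting the same thing: on a cluster that is already running, the equivalent change can presumably be made by editing the operator's Installation resource instead of re-applying the file, e.g.:

kubectl edit installation default
# under spec.calicoNetwork.ipPools, change
#   encapsulation: VXLANCrossSubnet
# to
#   encapsulation: VXLAN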

caseydavenport commented 2 months ago

Cool, glad that helped. For some additional information:

That likely means your nodes are in the same subnet, so with VXLANCrossSubnet the pod-to-pod traffic was sent unencapsulated, and something in the network fabric (e.g., reverse path filtering or AWS src/dst checks) was dropping it because the packets carried pod IPs as source / destination. Switching to VXLAN wraps that traffic in node-to-node packets, which the fabric accepts.

If you're on Amazon, that would be something like this: https://docs.tigera.io/calico/latest/reference/public-cloud/aws#routing-traffic-within-a-single-vpc-subnet
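
If the nodes (or any Linux routers in between) are something you control, the reverse path filtering side can be inspected with the usual sysctls, for example:

sysctl net.ipv4.conf.all.rp_filter
sysctl net.ipv4.conf.eth0.rp_filter
# 1 = strict RPF, which can drop unencapsulated pod traffic arriving on an "unexpected" interface;
# 2 = loose, 0 = off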

ja-mitiankai commented 2 months ago

OK, thanks for your addition.