projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

bird: Netlink: Network is down, The new node cannot work #4273

Closed: dotbalo closed this issue 3 years ago

dotbalo commented 3 years ago

I have a k8s cluster that previously worked normally. I recently added several nodes. On the new nodes the bird connections are normal and the Calico Pod is also normal, but the Calico Pod log keeps printing: bird: Netlink: Network is down. While this happens, the new node cannot ping Pod IPs on other nodes. I opened an issue about this before, but it was not resolved. I have tried several ideas since then, and none of them fixed it. I am now out of ideas, because there is no other abnormal log, only: bird: Netlink: Network is down.

Expected Behavior

The bird: Netlink: Network is down error should stop, and the new node should be able to ping Pods on other nodes.

Current Behavior

I have a cluster as follows:

[root@k8s-master01 ~]# kubectl get node
NAME                   STATUS   ROLES    AGE     VERSION
k8s-master01      Ready    master   584d    v1.18.9
k8s-master02      Ready    master   340d    v1.18.9
k8s-master03      Ready    master   584d    v1.18.9
k8s-node01        Ready    node     584d    v1.18.9
k8s-node02        Ready    node     584d    v1.18.9
k8s-node236-230   Ready    <none>   2d21h   v1.18.9
k8s-node236-231   Ready    <none>   2d20h   v1.18.9

k8s-node236-230 and k8s-node236-231 are new nodes. Bird Status:

[root@k8s-node236-230 ~]# netstat -lantup |grep bird
tcp        0      0 0.0.0.0:179             0.0.0.0:*               LISTEN      4839/bird           
tcp        0      0 192.168.0.230:50649    192.168.0.178:179      ESTABLISHED 4839/bird           
tcp        0      0 192.168.0.230:51321    192.168.0.179:179      ESTABLISHED 4839/bird           
tcp        0      0 192.168.0.230:60227    192.168.0.175:179      ESTABLISHED 4839/bird           
tcp        0      0 192.168.0.230:39239    192.168.0.177:179      ESTABLISHED 4839/bird           
tcp        0      0 192.168.0.230:179      192.168.0.231:50469    ESTABLISHED 4839/bird           
tcp        0      0 192.168.0.230:33471    192.168.0.176:179      ESTABLISHED 4839/bird

k8s-master01:

[root@k8s-master01 ~]# netstat -lantup |grep bird
tcp        0      0 0.0.0.0:179             0.0.0.0:*               LISTEN      22731/bird          
tcp        0      0 192.168.0.177:179      192.168.0.179:49993    ESTABLISHED 22731/bird          
tcp        0      0 192.168.0.177:179      192.168.0.230:39239    ESTABLISHED 22731/bird          
tcp        0      0 192.168.0.177:56867    192.168.0.175:179      ESTABLISHED 22731/bird          
tcp        0      0 192.168.0.177:35795    192.168.0.176:179      ESTABLISHED 22731/bird          
tcp        0      0 192.168.0.177:179      192.168.0.178:47913    ESTABLISHED 22731/bird          
tcp        0      0 192.168.0.177:179      192.168.0.231:52831    ESTABLISHED 22731/bird

It all looks normal. But the new node has no routing information for the other nodes. The older nodes are normal.

[root@k8s-node236-230 ~]# ip route
default via 192.168.0.254 dev ens192 
192.168.0.0/24 dev ens192 proto kernel scope link src 192.168.0.230 
169.254.0.0/16 dev ens192 scope link metric 1002 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown 
blackhole 192.242.27.128/26 proto bird
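
One way to make the symptom above easier to spot is to filter the routing table for routes installed by BIRD. A node that has learned routes from its peers typically shows `proto bird` routes toward other nodes in addition to the blackhole route for its own block. The sketch below assumes the `ip route` output format shown in this thread; `remote_bird_routes` is a hypothetical helper name, not part of Calico or iproute2.

```shell
# Hypothetical helper: reads `ip route` output on stdin and prints only the
# routes installed by BIRD that are not the local blackhole route.
remote_bird_routes() {
  awk '/proto bird/ && $1 != "blackhole"'
}

# Usage on a node (prints nothing if BIRD installed no routes to other nodes):
#   ip route | remote_bird_routes
```

Run against the routing table shown above, this prints nothing, which matches the report that the new node has no routes to other nodes.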

Calico Status:

[root@k8s-master01 ~]# calicoctl node status
Calico process is running.

IPv4 BGP status
+----------------+-------------------+-------+------------+-------------+
|  PEER ADDRESS  |     PEER TYPE     | STATE |   SINCE    |    INFO     |
+----------------+-------------------+-------+------------+-------------+
| 192.168.0.178 | node-to-node mesh | up    | 2020-12-18 | Established |
| 192.168.0.179 | node-to-node mesh | up    | 2020-11-03 | Established |
| 192.168.0.175 | node-to-node mesh | up    | 2020-11-03 | Established |
| 192.168.0.176 | node-to-node mesh | up    | 2020-11-03 | Established |
| 192.168.0.230 | node-to-node mesh | up    | 2020-12-18 | Established |
| 192.168.0.231 | node-to-node mesh | up    | 2020-12-18 | Established |
+----------------+-------------------+-------+------------+-------------+

IPv6 BGP status
No IPv6 peers found.
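
For clusters with many peers, the table above can be checked mechanically. The following is a minimal sketch that assumes the table layout printed by `calicoctl node status` in this thread (pipe-separated columns, with INFO as the fifth column); `unestablished_peers` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `calicoctl node status` output on stdin and
# prints the address of every peer whose INFO column is not "Established".
unestablished_peers() {
  awk -F'|' '
    NF >= 6 && $2 !~ /PEER ADDRESS/ {
      peer = $2; info = $6
      gsub(/[ \t]/, "", peer)            # strip padding around the address
      gsub(/^[ \t]+|[ \t]+$/, "", info)  # trim the INFO column
      if (info != "Established") print peer
    }'
}

# Usage on a node:
#   calicoctl node status | unestablished_peers
```

In this thread every peer is Established, so the helper would print nothing. That is itself informative: the BGP sessions were up even though the new node had no routes, so session state alone did not reveal the problem.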

NetworkManager:

[root@k8s-master01 ~]# systemctl status NetworkManager
● NetworkManager.service - Network Manager
   Loaded: loaded (/usr/lib/systemd/system/NetworkManager.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:NetworkManager(8)

Possible Solution

  1. Change IPIP to BGP: Not solved.
  2. Add IP_AUTODETECTION_METHOD configuration: Not solved.
  3. Update Calico: Not solved.

Context

My newly added nodes cannot communicate with Pods on other nodes, and I hope to get this resolved.


caseydavenport commented 3 years ago

@dotbalo could you share the log file from the calico/node pod that is emitting these logs? That would be useful to see.

It would also be useful to see what might be occurring to the interface on that node.

i.e., run the following command on the problematic node to see what is happening to that interface:

ip monitor dev <CALICO_INTERFACE>

Where <CALICO_INTERFACE> is eth0, ens0, etc.: whatever interface Calico is using for BGP.
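
If it is not obvious which interface that is, one option is to look up the interface that holds the address BIRD is peering from (192.168.0.230 on the new node, per the netstat output earlier in the thread). This is a minimal sketch, assuming the one-line-per-address format of `ip -o -4 addr show`; `iface_for_ip` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `ip -o -4 addr show` output on stdin and prints
# the name of the interface that carries the IPv4 address given as $1.
iface_for_ip() {
  awk -v ip="$1" '{
    split($4, a, "/")          # $4 looks like 192.168.0.230/24
    if (a[1] == ip) { print $2; exit }
  }'
}

# Usage on the problematic node (the address is the one from this thread):
#   IFACE=$(ip -o -4 addr show | iface_for_ip 192.168.0.230)
#   ip monitor dev "$IFACE"
```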

dotbalo commented 3 years ago

@caseydavenport Sorry, I discovered the cause of this problem last night, so I closed this issue early. The fault was caused by the nodes having different modules loaded (from my ipvs module configuration). I loaded the ipip module on the new nodes, but the old nodes did not load it, and this mismatch caused the Calico error. Removing the ipip module returned everything to normal.

[root@k8s-node236-232 ~]# lsmod  | grep ipip
ipip                   16384  0 
tunnel4                16384  1 ipip
ip_tunnel              24576  1 ipip
[root@k8s-node236-232 ~]# modprobe -r ipip
[root@k8s-node236-232 ~]# lsmod  | grep ipip

Today I reconfirmed that this is indeed the reason. So can this issue be closed?
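
Since the cause reported here was a difference in loaded modules between old and new nodes, a comparison across nodes can surface it quickly. This is a minimal sketch that only inspects `lsmod` output; `module_loaded` is a hypothetical helper name, and the ssh loop in the comment assumes passwordless ssh to the nodes, which the thread does not state.

```shell
# Hypothetical helper: reads `lsmod` output on stdin and succeeds (exit 0)
# only if the module named in $1 appears in the first column.
module_loaded() {
  awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Usage, comparing an old node with a new one (node names from this thread):
#   for n in k8s-node01 k8s-node236-230; do
#     if ssh "$n" lsmod | module_loaded ipip; then
#       echo "$n: ipip loaded"
#     else
#       echo "$n: ipip not loaded"
#     fi
#   done
```

Note that `modprobe -r` only unloads the module for the current boot. If the module is listed in a boot-time module configuration (for example a file under /etc/modules-load.d/), it would be loaded again after a reboot, so it may be worth checking there as well.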

caseydavenport commented 3 years ago

@dotbalo ok great, thanks for closing the loop! I wasn't sure if you meant to close it or not.

oldthreefeng commented 3 years ago

Delete the ipip module to return to normal.

This helps me a lot.

jayXiu commented 7 months ago

Thank you ~