projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Changing the kubernetes node IP will cause the current node Pod IP to be unavailable and unrecoverable #8700

Closed Levi080513 closed 4 months ago

Levi080513 commented 5 months ago

In some scenarios, the IP of the machine hosting a k8s node may change. Kubelet detects the change and updates the Node CR, but Calico does not seem to handle this scenario correctly. The pods on this node become unreachable via pod IP and never recover.

To add: Calico is running in IPIP mode.

Expected Behavior

The pods on this node remain accessible via pod IP after the node IP changes.

Current Behavior

The pods on this node are not accessible via pod ip.

Possible Solution

Restart the calico-node pod on this node.

Steps to Reproduce (for bugs)

  1. Create a k8s cluster that uses the Calico CNI.
  2. Change the IP of the machine hosting a k8s node.
  3. Observe that the pods on this node are no longer accessible via pod IP.

Context

When I analyzed this problem, I found that automatic updating of the BGP IP is supported, but the problem seems to be here: https://github.com/projectcalico/calico/blob/5741d7df6dfe2453c41be46f4d990dd7b56b1d4c/node/pkg/lifecycle/startup/startup.go#L315-L366

The k8s node is fetched only once during startup and is never refreshed afterwards. So when the k8s node IP changes, the old IP is still used to match against the addresses on the network interfaces, which can never succeed. The monitor-addresses log confirms this. The machine's old IP is 10.255.2.214 and its new IP is 10.255.2.215. The monitor-addresses log looks like this:

2024-04-08 16:48:04.700 [WARNING][75] monitor-addresses/autodetection_methods.go 236: Unable to find matching host interface for IP 10.255.2.214
2024-04-08 16:48:04.700 [ERROR][75] monitor-addresses/autodetection_methods.go 185: Unable to parse CIDR 10.255.2.214 : invalid CIDR address: 10.255.2.214
2024-04-08 16:48:04.700 [WARNING][75] monitor-addresses/startup.go 516: Autodetection of IPv4 address failed, keeping existing value: 10.255.2.214/16
2024-04-08 16:48:04.709 [WARNING][75] monitor-addresses/startup.go 626: Unable to confirm IPv4 address 10.255.2.214 is assigned to this host

caseydavenport commented 5 months ago

2024-04-08 16:48:04.700 [ERROR][75] monitor-addresses/autodetection_methods.go 185: Unable to parse CIDR 10.255.2.214 : invalid CIDR address: 10.255.2.214

This seems to be a separate issue, where we're failing to parse the IP from the node since it's not in CIDR notation.

That said, I agree with your analysis of the issue here - it appears that, specifically for the k8s internal IP method of auto detection, we continue to use the stale node queried when calico/node first started rather than re-querying the node on each loop.

We probably want to move the API call to query the Node inside of the loop so that we're working with updated information on each iteration.
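A minimal sketch of the difference between the two approaches, in Go. This is not Calico's actual code: `nodeGetter`, `detectStale`, `detectFresh`, and `fakeAPI` are hypothetical names made up for illustration, and the fake API simply returns the old and then the new IP from the report.

```go
package main

import "fmt"

// nodeGetter is a hypothetical stand-in for the Kubernetes API call that
// returns a node's current internal IP. It is not a real Calico type.
type nodeGetter func(name string) (string, error)

// detectStale mimics the reported behavior: the node IP is fetched once,
// before the loop, and the same value is reused on every iteration.
func detectStale(get nodeGetter, name string, iterations int) ([]string, error) {
	ip, err := get(name)
	if err != nil {
		return nil, err
	}
	seen := make([]string, 0, iterations)
	for i := 0; i < iterations; i++ {
		seen = append(seen, ip)
	}
	return seen, nil
}

// detectFresh mimics the proposed change: the node is re-queried inside
// the loop, so each iteration works with the current IP.
func detectFresh(get nodeGetter, name string, iterations int) ([]string, error) {
	seen := make([]string, 0, iterations)
	for i := 0; i < iterations; i++ {
		ip, err := get(name)
		if err != nil {
			return nil, err
		}
		seen = append(seen, ip)
	}
	return seen, nil
}

// fakeAPI returns a nodeGetter that answers with ips[0] on the first call,
// ips[1] on the second, and so on, repeating the last entry afterwards.
// It simulates a node whose IP changes between queries.
func fakeAPI(ips ...string) nodeGetter {
	calls := 0
	return func(string) (string, error) {
		if len(ips) == 0 {
			return "", fmt.Errorf("no IPs configured")
		}
		i := calls
		if i >= len(ips) {
			i = len(ips) - 1
		}
		calls++
		return ips[i], nil
	}
}

func main() {
	stale, _ := detectStale(fakeAPI("10.255.2.214", "10.255.2.215"), "node-1", 2)
	fresh, _ := detectFresh(fakeAPI("10.255.2.214", "10.255.2.215"), "node-1", 2)
	fmt.Println("stale:", stale) // both iterations see the old IP
	fmt.Println("fresh:", fresh) // the second iteration sees the new IP
}
```

The trade-off is one extra API query per loop iteration in exchange for never acting on an outdated node object.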

Levi080513 commented 5 months ago

This seems to be a separate issue, where we're failing to parse the IP from the node since it's not in CIDR notation.

https://github.com/projectcalico/calico/blob/5741d7df6dfe2453c41be46f4d990dd7b56b1d4c/node/pkg/lifecycle/startup/autodetection/autodetection_methods.go#L199-L238

Calico matches the node IP against the network interfaces on the node and returns the CIDR of the matching interface. If no interface matches, the bare IP is returned as-is, without a prefix length, so the later CIDR parse fails. So as I understand it, this is the same issue.

Levi080513 commented 5 months ago

@caseydavenport If so, can I try to fix this?

caseydavenport commented 5 months ago

Right, yeah I guess because it's using the old IP it's failing to find a match. I would be happy to review a PR to fix this :+1:

Thanks for the good investigation!

Levi080513 commented 4 months ago

/close

https://github.com/projectcalico/calico/pull/8728 was merged.