projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

feature request: improve robustness of some pod networking situations in light of control plane failure #7403

Open doctorpangloss opened 1 year ago

doctorpangloss commented 1 year ago

When network connectivity between a worker and the control plane is lost, pod to WAN networking is also lost.

The behavior under such a situation should be configurable. Just as processes, including containers, do not stop running in this scenario, networking should not fail.

Expected Behavior

I should be able to specify that the last known policy on a worker still applies even if connectivity to the control plane is lost.

For example, suppose my pods can normally reach the WAN via the NAT network on a Windows node, and the last policy applied allows pods to reach WAN addresses. Pods should then still be able to reach WAN addresses even when the Windows node cannot reach the control plane.
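To make the request concrete, here is a minimal sketch of the "keep the last known policy" behavior with a configurable staleness limit. It is purely illustrative: `LastKnownPolicyCache`, `fetch`, and `max_stale_seconds` are hypothetical names, and this is not how Felix or the Windows data plane is actually implemented.

```python
import time
from typing import Callable, Optional


class LastKnownPolicyCache:
    """Hypothetical fail-static cache; not Calico's or Felix's actual code."""

    def __init__(
        self,
        fetch: Callable[[], dict],
        max_stale_seconds: Optional[float] = None,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch                  # assumed to raise ConnectionError when the control plane is unreachable
        self._max_stale = max_stale_seconds  # None means keep the last policy indefinitely
        self._clock = clock
        self._policy: Optional[dict] = None
        self._fetched_at: Optional[float] = None

    def current(self) -> Optional[dict]:
        """Return the freshest usable policy, or None if there is none."""
        try:
            self._policy = self._fetch()
            self._fetched_at = self._clock()
        except ConnectionError:
            # Control plane unreachable: fall through and reuse the cached policy.
            pass
        if self._policy is None:
            return None  # never synced, so there is nothing to fall back on
        age = self._clock() - self._fetched_at
        if self._max_stale is not None and age > self._max_stale:
            return None  # cached policy is older than the configured limit
        return self._policy
```

With `max_stale_seconds=None` the worker keeps enforcing the last policy for as long as the outage lasts, which is the behavior requested here. A finite value would give operators a bounded grace period instead; what happens once it expires is left to the caller.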

Current Behavior

On-premises Kubernetes distributions like Rancher and microk8s have many failure modes that result in loss of control plane connectivity, usually transiently. Worker pods abruptly lose connectivity, even if the underlying application has no meaningful dependency on Kubernetes besides scheduling. Since most users would prefer never to lose application networking, pod-to-WAN and WAN-to-pod networking should not fail.

Possible Solution

The policy engine implementations on Linux and Windows should not fail catastrophically when they lose access to the control plane.

Steps to Reproduce (for bugs)

  1. Create a Windows Calico-networked environment (VXLAN or win-bgp).
  2. Disconnect the Windows worker from the control plane.
  3. Observe pods can no longer reach the WAN.

Context

Control planes are not as robust as they seem. These errors can be catastrophic for unrelated applications.

coutinhop commented 1 year ago

Pinging @caseydavenport and @fasaxc as they have more background on this. I wonder if this is related to routes being cleared by Felix when control-plane connectivity is lost? Would an enhancement to it involve some kind of "grace period" a la what BGP has?

caseydavenport commented 1 year ago

I wouldn't expect Felix to de-program its routes and policies if it loses connection to the API server. I'd expect it to fail checks, maybe be restarted, and then decide not to do anything until it re-establishes a connection, leaving the data plane intact.

However, BGP does have timeouts involved here, so if the BGP peer loses its connection it may withdraw routes. This is BGP working as intended, and it does already have configurable timers for this IIRC.

I think we need to get a better picture of what specifically changed in this scenario to prevent networking from working. In general, we expect networking to stay intact for some period even if the control plane is down.

fasaxc commented 1 year ago

As above, VXLAN mode should stay static if Felix can't reach the API server. BGP is more interesting: this is Windows BGP rather than BIRD, and I'm not sure what it supports.