Severe regression in calico-3.16 with etcdv3 backend

Wiston999 commented 4 years ago

calico-node and calico-kube-controller pods starts spamming the following log lines after a full ETCD cluster restart (with etcd downtime):

2020-10-20 13:42:23.762 [ERROR][1] watcher.go 127: Watch channel error error=etcdserver: mvcc: required revision has been compacted
2020-10-20 13:42:23.763 [INFO][1] watchercache.go 96: Watch channel closed by remote - recreate watcher ListRoot="/calico/resources/v3/projectcalico.org/nodes"

Expected Behavior

Tested the same scenario with calico v3.15.3 and a small burst of error messages (about 10) are logged until etcd is back online

Current Behavior

Log messages are repeated endlessly at a rate of 6-9 per second per pod. Not sure if calico reconnects back to ETCD successfully or it remains disconnected from etcd and so the pods need to be restarted.

Possible Solution

If those error messages are expected, they should at least be logged with lower loglevel as they are currently being logged as ERROR.

Steps to Reproduce (for bugs)

Deploy calico version 3.16+ using etcd backend.
Stop all etcd cluster nodes
Restart etcd cluster

Context

Your Environment

Calico version 3.16.1 and 3.16.3
Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes 1.19.2
Operating System and version: Debian buster
ETCD version: 3.4.13

ravilr commented 4 years ago

@caseydavenport @neiljerram sorry to tag you directly. but this needs your attention.

we are also seeing this issue with multiple Calico installations(multiple Kube clusters) using

calico on etcdv3 backend.
Version anything >= 3.16

Symptoms we are seeing are, calico-node and calico-kube-controller pods never seem to recover from the watch channel error, once they lose etcd connection to etcd3 cluster backend. Seems to be easily reproducible by triggering a rolling restart of etcd cluster servers. After that, calico-node and controller pods remains in a 'watch channel error' loop and never seems to recover until the pods are deleted/recreated.

Once the calico pods go into above state, the etcd server's etcd_debugging_mvcc_watcher_total metric keeps going up and the corresponding etcd server go heap memory usage starts climbing up.

I'm yet to trace this back through stepping down to older version of calico (3.15 or 3.14) to see where this broken behavior got introduced. But, https://github.com/projectcalico/libcalico-go/pull/1247 could be where this regression was introduced. Please check.

ravilr commented 4 years ago

below logs are from calico-node pod running node-v3.16.1 version:

2020-10-17 02:09:24.948 [ERROR][79] felix/watcher.go 127: Watch channel error error=etcdserver: mvcc: required revision has been compacted
2020-10-17 02:09:24.948 [ERROR][79] felix/watcher.go 127: Watch channel error error=etcdserver: mvcc: required revision has been compacted
2020-10-17 02:09:24.948 [INFO][79] felix/watchercache.go 96: Watch channel closed by remote - recreate watcher ListRoot="/calico/resources/v3/projectcalico.org/globalnetworkpolicies"
2020-10-17 02:09:24.948 [ERROR][79] felix/watcher.go 127: Watch channel error error=etcdserver: mvcc: required revision has been compacted
2020-10-17 02:09:24.948 [INFO][79] felix/watchercache.go 96: Watch channel closed by remote - recreate watcher ListRoot="/calico/resources/v3/projectcalico.org/globalnetworksets"

when calico-node or the singleton calico-kube-controller pods gets into this bad state, they keep spewing these errors for watches on all paths and also, all watchers felix/watcher, tunnel-ip-allocator/watcher, confd/watcher all logging the same error in continous loop, generating GBs of logs in matter of hours.. and the pods created by cni plugin are never advertised by bird or programmed by Felix, as such the newly created pods on such nodes ends up having broken networking...

ravilr commented 4 years ago

I tested the etcd cluster rolling restart with calico on v3.15.3. Both the calico-kube-controller and calico-node pods self-recover from watch channel closure/disruption without issues.

So, the regression seems to be introduced since >= v3.16.0. And seem to be https://github.com/projectcalico/libcalico-go/pull/1247 which is eating up watch errors in lib/backend/etcd3/watcher.go and seems to match with my observation where the number of etcd watchers created on the etcd servers keeps going up once the lib gets into this state.

@Wiston999 can you please update this issue title to 'severe regression in calico-3.16 with etcdv3 backend' ?

Wiston999 commented 4 years ago

@Wiston999 can you please update this issue title to 'severe regression in calico-3.16 with etcdv3 backend' ?

Sure, thanks for your analysis.

Sarga commented 4 years ago

I have same issue, etcd 3.4.14 and calico 3.16.3

nelljerram commented 4 years ago

@ravilr @Wiston999 @Sarga Please can you clarify:

Is there any symptom beyond the crazy amount of logging? ("Severe regression" sounds like it would mean something more functional, as well as the crazy logging.)
Is there ever recovery once the etcd cluster is fully up again? Original description says "Not sure if calico reconnects back to ETCD successfully or it remains disconnected from etcd and so the pods need to be restarted." and @ravilr 's comments seem to indicate no recovery, but I'd like to be sure.

Wiston999 commented 4 years ago

Hi, I can expose the following 3 scenarios:

I have also experienced networking errors in new pods (No route to host, checking with ip route, the route from pod IP to cali* interface in the node is missing) created while calico is spamming those log messages after some hours, fixed when calico pods are restarted.
New pods' network are properly setup if the pods are created just after etcd is restarted. It is, there is logspam but new pods are setup correctly. So I cannot ensure that the logspam is 100% related to pod network problems or some kind of mixed conditions.
I've tested etcd restart with calico 3.15.3 and there is no logspam, just a few error lines (10-15) until etcd is back online.

ravilr commented 4 years ago

This is a severe regression because:

calico pods never recover from Watch channel errror's even after the etcd servers are online/after a etcd cluster rolling restart. Only manual intervention of deleting the calico pods, on which DaemonSet/Deployment Kube controllers creating a new ones, fixes the issue.
Also when it hit this bug, calico pods are creating a lot of new watchers against etcd server in quick succession in never ending loop, resulting in massive ballooning of etcd server's memory usage. If left unattended, this could cause etcd server downtime, so triggers a cascading failure..

shcherbak commented 4 years ago

the same issue with etcd 3.4.13 and calico 3.16.4

hjkatz commented 4 years ago

We are also experiencing this same issue.

Versions:

kubernetes: 3.16.13
calico: 3.16.3
etcd: 3.4.13

We see graphs very similar to this report issue in etcd: https://github.com/etcd-io/etcd/issues/8387

They suggest that there are some client-side metrics for monitoring the unexpected creation of additional watchers.

Steps to reproduce errors:

Upgrade calico-node and calico-controller to 3.16.3
Restart etcd servers in succession (or otherwise create downtime for etcd)
Observe the problem in the calico-node logs
Restart calico-node Pods and the issue goes away

carsten-langer commented 4 years ago

We have the same symptoms with Calico 3.16.1 and etcd 3.4.3 and observe the infinite loops both in calico-kube-controllers pod and calico-node pods.

Similar to what @ravilr writes, for us, we never have recovery of Calico after etcd is up again. Only remedy is to restart both the calico-kube-controllers deploy and calico-node daemonset.

Our impact is:

huge memory increase in etcd and thus spikes of etcd delay potentially due to garbage collection in etcd, leading even to k8s-apiserver restarts due to timeouts on lease renewal towards etcd.
huge increase in CPU load in etcd proportionally/linearily following how many calico-nodes are in this infinite loop situation. This is nicely visible: if you do a rolling restart of the calico-node pods, the CPU load of etcd proportionally goes down again with the number of calico-node pods being restarted and thus not in infinite loop anymore. Which to me proves that calico is really sending out all these requests to etcd, thus is still connected to etcd.
Similar to what @Wiston999 writes, some normal pods, when created during such a infinite loop scenario, have network connection problems right after they are created, as if Calico was not able to correctly set up the network due to an overload in the watcher/etcd area. Such pod network will then not come up until we delete the pod and try again. Then it is again a question of probability if the network comes up.

robbrockbank commented 4 years ago

I believe @ravilr has correctly pointed to a PR containing a bug in some of the watcher error handling. In particular errors caused by revisions that are obsolete due to compaction will not trigger a full re-sync and therefore we'll keep trying to watch from the same bad revision.

I have put up a fix in libcalico-go. https://github.com/projectcalico/libcalico-go/pull/1337/files

Will push to get this reviewed and merged. Watch this space.

projectcalico / calico