Closed by Wiston999 4 years ago
@caseydavenport @neiljerram Sorry to tag you directly, but this needs your attention.
We are also seeing this issue with multiple Calico installations (multiple Kube clusters).
The symptoms we are seeing: calico-node and calico-kube-controller pods never seem to recover from the watch channel error once they lose their connection to the etcd3 cluster backend. This seems to be easily reproducible by triggering a rolling restart of the etcd cluster servers. After that, the calico-node and controller pods remain in a 'watch channel error' loop and never recover until the pods are deleted/recreated.
Once the Calico pods are in the above state, the etcd server's etcd_debugging_mvcc_watcher_total metric keeps going up and the corresponding etcd server's Go heap memory usage starts climbing.
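For anyone who wants to watch that metric climb, here is a minimal sketch that polls an etcd member's Prometheus /metrics endpoint and prints the watcher gauge. The endpoint URL below is a placeholder; adjust it (and any TLS settings) for your cluster.

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
	"time"
)

func main() {
	// Placeholder endpoint: etcd serves Prometheus metrics on its client/metrics URL.
	const metricsURL = "http://127.0.0.1:2379/metrics"

	for {
		resp, err := http.Get(metricsURL)
		if err != nil {
			fmt.Println("fetch error:", err)
		} else {
			sc := bufio.NewScanner(resp.Body)
			for sc.Scan() {
				line := sc.Text()
				// Print the watcher gauge so its growth over time is visible.
				if strings.HasPrefix(line, "etcd_debugging_mvcc_watcher_total") {
					fmt.Println(time.Now().Format(time.RFC3339), line)
				}
			}
			resp.Body.Close()
		}
		time.Sleep(30 * time.Second)
	}
}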
I have yet to trace this back by stepping down to an older version of Calico (3.15 or 3.14) to see where this broken behavior was introduced, but https://github.com/projectcalico/libcalico-go/pull/1247 could be where the regression came in. Please check.
The logs below are from a calico-node pod running node-v3.16.1:
2020-10-17 02:09:24.948 [ERROR][79] felix/watcher.go 127: Watch channel error error=etcdserver: mvcc: required revision has been compacted
2020-10-17 02:09:24.948 [ERROR][79] felix/watcher.go 127: Watch channel error error=etcdserver: mvcc: required revision has been compacted
2020-10-17 02:09:24.948 [INFO][79] felix/watchercache.go 96: Watch channel closed by remote - recreate watcher ListRoot="/calico/resources/v3/projectcalico.org/globalnetworkpolicies"
2020-10-17 02:09:24.948 [ERROR][79] felix/watcher.go 127: Watch channel error error=etcdserver: mvcc: required revision has been compacted
2020-10-17 02:09:24.948 [INFO][79] felix/watchercache.go 96: Watch channel closed by remote - recreate watcher ListRoot="/calico/resources/v3/projectcalico.org/globalnetworksets"
When calico-node or the singleton calico-kube-controllers pod gets into this bad state, it keeps spewing these errors for watches on all paths. All watchers (felix/watcher, tunnel-ip-allocator/watcher, confd/watcher) log the same error in a continuous loop, generating GBs of logs in a matter of hours. Pods created by the CNI plugin are never advertised by BIRD or programmed by Felix, so newly created pods on such nodes end up with broken networking.
I tested the etcd cluster rolling restart with Calico v3.15.3. Both the calico-kube-controllers and calico-node pods self-recover from the watch channel closure/disruption without issues.
So the regression seems to have been introduced in >= v3.16.0, and https://github.com/projectcalico/libcalico-go/pull/1247 appears to be the culprit: it eats watch errors in lib/backend/etcd3/watcher.go, which matches my observation that the number of etcd watchers created on the etcd servers keeps going up once the library gets into this state.
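To make that failure mode concrete, here is a simplified sketch of the pattern described above, written directly against the upstream go.etcd.io/etcd/clientv3 API. It is an illustration only, not the actual libcalico-go code; the function and variable names are made up.

package watchloop

import (
	"context"
	"log"

	"go.etcd.io/etcd/clientv3"
)

// watchForever keeps recreating a watch from the same revision. If staleRev has
// already been compacted on the server, every recreated watch fails immediately
// with "required revision has been compacted", the loop spins, and the server
// accumulates watcher registrations - matching the symptoms in this issue.
func watchForever(ctx context.Context, cli *clientv3.Client, prefix string, staleRev int64) {
	for {
		wch := cli.Watch(ctx, prefix, clientv3.WithPrefix(), clientv3.WithRev(staleRev))
		for wresp := range wch {
			if err := wresp.Err(); err != nil {
				// The error is logged and then dropped; staleRev is never refreshed.
				log.Printf("Watch channel error error=%v", err)
				break
			}
			// ... process wresp.Events ...
		}
		if ctx.Err() != nil {
			return
		}
		// "recreate watcher" from the exact same (possibly compacted) revision.
	}
}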
@Wiston999 can you please update this issue title to 'severe regression in calico-3.16 with etcdv3 backend' ?
Sure, thanks for your analysis.
I have the same issue with etcd 3.4.14 and Calico 3.16.3.
@ravilr @Wiston999 @Sarga Please can you clarify:
Is there any symptom beyond the crazy amount of logging? ("Severe regression" sounds like it would mean something more functional, as well as the crazy logging.)
Is there ever recovery once the etcd cluster is fully up again? Original description says "Not sure if calico reconnects back to ETCD successfully or it remains disconnected from etcd and so the pods need to be restarted." and @ravilr 's comments seem to indicate no recovery, but I'd like to be sure.
Hi, I can expose the following scenario: pods created while Calico has been spamming those log messages for some hours end up with broken networking (checking ip route, the route from the pod IP to the cali* interface on the node is missing); this is fixed when the Calico pods are restarted. This is a severe regression because:
Calico pods never recover from the 'Watch channel error' loop, even after the etcd servers are back online following an etcd cluster rolling restart. Only manual intervention, deleting the Calico pods so that the DaemonSet/Deployment controllers create new ones, fixes the issue.
Also, when hit by this bug, the Calico pods create a lot of new watchers against the etcd servers in quick succession in a never-ending loop, resulting in massive ballooning of the etcd servers' memory usage. If left unattended, this could cause etcd server downtime and trigger a cascading failure.
The same issue with etcd 3.4.13 and Calico 3.16.4.
We are also experiencing this same issue.
Versions:
We see graphs very similar to this reported issue in etcd: https://github.com/etcd-io/etcd/issues/8387
They suggest that there are some client-side metrics for monitoring the unexpected creation of additional watchers.
Steps to reproduce errors:
We have the same symptoms with Calico 3.16.1 and etcd 3.4.3 and observe the infinite loops both in calico-kube-controllers pod and calico-node pods.
Similar to what @ravilr writes, we never see Calico recover after etcd is up again. The only remedy is to restart both the calico-kube-controllers deployment and the calico-node daemonset.
Our impact is:
I believe @ravilr has correctly pointed to a PR containing a bug in some of the watcher error handling. In particular, errors caused by revisions that are obsolete due to compaction will not trigger a full re-sync, and therefore we'll keep trying to watch from the same bad revision.
I have put up a fix in libcalico-go. https://github.com/projectcalico/libcalico-go/pull/1337/files
Will push to get this reviewed and merged. Watch this space.
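For readers following along, here is a minimal sketch of the intended behaviour, written against the upstream go.etcd.io/etcd/clientv3 API. It is not the actual libcalico-go watcher code or the content of the fix PR, just an illustration: when the watch fails because the revision has been compacted, fall back to a fresh list and restart the watch from the current revision instead of retrying the stale one.

package watchresync

import (
	"context"
	"log"

	"go.etcd.io/etcd/clientv3"
)

// watchWithResync watches a prefix and, whenever the watch fails because the
// requested revision has been compacted, performs a full resync (list the
// prefix, pick up the current revision) instead of retrying the stale revision.
func watchWithResync(ctx context.Context, cli *clientv3.Client, prefix string) error {
	var rev int64 // 0 means "do a fresh list before watching"
	for {
		if rev == 0 {
			resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix())
			if err != nil {
				return err
			}
			// ... reconcile local state from resp.Kvs here ...
			rev = resp.Header.Revision + 1
		}

		wch := cli.Watch(clientv3.WithRequireLeader(ctx), prefix,
			clientv3.WithPrefix(), clientv3.WithRev(rev))

		for wresp := range wch {
			if wresp.CompactRevision != 0 {
				// The revision we were watching from no longer exists on the
				// server; force a full resync rather than retrying the same rev.
				log.Printf("watch revision compacted at %d; resyncing", wresp.CompactRevision)
				rev = 0
				break
			}
			if err := wresp.Err(); err != nil {
				log.Printf("watch channel error: %v", err)
				break // transient error: recreate the watch from the last good rev
			}
			for _, ev := range wresp.Events {
				// ... handle the event ...
				rev = ev.Kv.ModRevision + 1
			}
		}

		if ctx.Err() != nil {
			return ctx.Err()
		}
	}
}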
calico-node and calico-kube-controller pods start spamming 'Watch channel error' log lines after a full ETCD cluster restart (with etcd downtime).
Expected Behavior
Tested the same scenario with Calico v3.15.3: a small burst of error messages (about 10) is logged until etcd is back online.
Current Behavior
Log messages are repeated endlessly at a rate of 6-9 per second per pod. Not sure if calico reconnects back to ETCD successfully or it remains disconnected from etcd and so the pods need to be restarted.
Possible Solution
If those error messages are expected, they should at least be logged at a lower log level; they are currently logged as ERROR.
Steps to Reproduce (for bugs)
Context
Your Environment