It might be a side discussion, but from the logs it looks like the helm-operator is fetching APIs which should not be related to our Helm chart, e.g. https://172.21.0.1:443/apis/elasticsearch.k8s.elastic.co/v1beta1, and these requests are causing some client-side throttling.
Could this be related?
If your operator is hitting some API that you've never heard of, that's a pretty good indicator that you've generated your operator incorrectly. It looks like you might be getting rate limited due to hammering the API, and that's causing the leader elections to time out. What helm chart are you using and what commands did you run to generate your operator?
The operator was created using the tutorial on the operator-sdk page:
operator-sdk init --domain cps.deepsearch.ibm.com --plugins helm
operator-sdk create api --group apps --version v1alpha1 --kind KgAmqp
Is "KgAmqp" a pre-existing Helm chart or are you just making a void Helm chart like the example? I can't get the void operator example to fail like this.
I suspect that you are getting rate-limited, but it looks like it's happening client-side (on the controller) rather than at the API server. This can be configured via flags passed to the controller-manager - is there anything funky when you look at the startup command there? "--kube-api-qps" in particular, as it looks like that's what sets the client-side rate limit for the controller-manager.
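A quick way to check which flags the manager was actually started with, assuming the operator runs as a scaffolded Deployment (the namespace and Deployment name below are placeholders):

# Print each container's name and args from the operator Deployment
kubectl -n <operator-namespace> get deployment <operator-deployment> \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.args}{"\n"}{end}'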
Our current intuition is that the error is visible only on a large OCP 4.6 cluster (60+ nodes, 2500+ Pods, etc.).
As advised, we tried introducing selectors matching the labels of some KgAmqp CRs.
When the controller was not matching any CR, the logs were quiet and the memory usage was very low.
After patching a few (not all) KgAmqp CRs, we saw a huge increase in memory usage, and the logs show 1) constant "Reconciled release" messages, and 2) "client-side throttling" on APIs which we would not expect to be queried (tekton.dev, knative, etc.). Those APIs definitely don't match our labels and don't contain the CRD defined in the watches.yaml.
For reference, here is the watches.yaml:
# Use the 'create api' subcommand to add watches to this file.
- group: apps.cps.deepsearch.ibm.com
  version: v1alpha1
  kind: KgAmqp
  chart: helm-charts/kgamqp
  watchDependentResources: false # adding this doesn't help
  selector: # adding this doesn't help, when
    matchLabels:
      app.kubernetes.io/name: kgamqp
      app.kubernetes.io/managed-by: ibm-cps-operator
#+kubebuilder:scaffold:watch
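For illustration, with this selector the operator should only react to CRs that carry both labels; a minimal sketch of such a KgAmqp object (name and namespace are made up):

apiVersion: apps.cps.deepsearch.ibm.com/v1alpha1
kind: KgAmqp
metadata:
  name: example-kgamqp
  namespace: example-namespace
  labels:
    app.kubernetes.io/name: kgamqp
    app.kubernetes.io/managed-by: ibm-cps-operator
spec: {}  # chart values would go here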
Regarding the previous questions: the Helm chart was started from the vanilla example and modified by a) adding a few more values, b) removing the HPA, and c) adding a ConfigMap and Secrets.
The controller (reading from the running Deployment) has the following args:
args:
- --health-probe-bind-address=:8081
- --metrics-bind-address=127.0.0.1:8080
- --leader-elect
- --leader-election-id=ibm-cps-operator
@dolfim Thanks for raising the issue. With the current information we are also not able to figure out the reason behind this error, and as you have mentioned, since the client-side throttling is happening only on large clusters it would be difficult for us to reproduce. Brainstorming about this in our community meeting, we could think of a few pointers:
Set --zap-log-level (ref) to increase the verbosity so that we could have more logs to debug.
Thanks for looking at our issue and brainstorming about possibilities.
kubectl get crd shows 152 CRDs.
The args are now:
args:
- '--health-probe-bind-address=:8081'
- '--metrics-bind-address=127.0.0.1:8080'
- '--leader-elect'
- '--leader-election-id=ibm-cps-operator'
- '--zap-log-level=debug'
The debug logs now look like:
{"level":"info","ts":1632210241.791085,"logger":"cmd","msg":"Version","Go Version":"go1.16.8","GOOS":"linux","GOARCH":"amd64","helm-operator":"v1.11.0","commit":"28dcd12a776d8a8ff597e1d8527b08792e7312fd"}
{"level":"info","ts":1632210241.7931075,"logger":"cmd","msg":"Watch namespaces not configured by environment variable WATCH_NAMESPACE or file. Watching all namespaces.","Namespace":""}
I0921 07:44:03.538887 1 request.go:668] Waited for 1.046670127s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/storage.k8s.io/v1?timeout=32s
{"level":"info","ts":1632210245.8048182,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"127.0.0.1:8080"}
{"level":"info","ts":1632210245.8914208,"logger":"helm.controller","msg":"Watching resource","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp","namespace":"","reconcilePeriod":"1m0s"}
I0921 07:44:05.892824 1 leaderelection.go:243] attempting to acquire leader lease openshift-operators/ibm-cps-operator...
{"level":"info","ts":1632210245.8931193,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
I0921 07:44:22.484699 1 leaderelection.go:253] successfully acquired lease openshift-operators/ibm-cps-operator
{"level":"debug","ts":1632210262.4847672,"logger":"controller-runtime.manager.events","msg":"Normal","object":{"kind":"ConfigMap","namespace":"openshift-operators","name":"ibm-cps-operator","uid":"4600a34a-c21e-4698-b83f-0bcbdfc4929c","apiVersion":"v1","resourceVersion":"610631877"},"reason":"LeaderElection","message":"ibm-cps-operator-controller-manager-557c88f7f6-mhlt2_be7ef8ca-2fb9-41b1-9d2d-c3d044ab3633 became leader"}
{"level":"info","ts":1632210262.4856992,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting EventSource","source":"kind source: apps.cps.deepsearch.ibm.com/v1alpha1, Kind=KgAmqp"}
{"level":"info","ts":1632210262.485977,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting Controller"}
{"level":"info","ts":1632210262.9877303,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting workers","worker count":16}
{"level":"debug","ts":1632210262.9914088,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-edda45b6-kgamqp-90f46298","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210262.9916945,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-30b90812","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210262.9957469,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-f21574fe-kgamqp-5f323a8a","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210262.998314,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-bd768688-kgamqp-59041f3c","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.091478,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-6dc398bc-kgamqp-5e842217","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.0931582,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-0bbf559a","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.093546,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-fa7b5983","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.0950687,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-801751ea","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.0951493,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-bd768688-kgamqp-7f18f401","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.0968616,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-617ffb14-kgamqp-5b8ab839","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.190764,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-bd768688-kgamqp-17e384d1","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.193185,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-a4f4d947","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.0953085,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-1f739374","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.4915593,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-99cf5236-kgamqp-72952c4d","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.491967,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-bd768688-kgamqp-fa5a863e","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
{"level":"debug","ts":1632210263.4956002,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-26239ca0-kgamqp-503b8ba5","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
I0921 07:44:25.491421 1 request.go:668] Waited for 1.078316488s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/node.k8s.io/v1beta1?timeout=32s
I0921 07:44:35.494616 1 request.go:668] Waited for 2.301755739s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/rbac.authorization.k8s.io/v1beta1?timeout=32s
I0921 07:44:50.992501 1 request.go:668] Waited for 1.000852157s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/scheduling.k8s.io/v1beta1?timeout=32s
E0921 07:44:53.894135 1 leaderelection.go:325] error retrieving resource lock openshift-operators/ibm-cps-operator: Get "https://172.21.0.1:443/api/v1/namespaces/openshift-operators/configmaps/ibm-cps-operator": context deadline exceeded
I0921 07:44:53.990918 1 leaderelection.go:278] failed to renew lease openshift-operators/ibm-cps-operator: timed out waiting for the condition
{"level":"error","ts":1632210293.9922085,"logger":"cmd","msg":"Manager exited non-zero.","Namespace":"","error":"leader election lost","stacktrace":"github.com/operator-framework/operator-sdk/internal/cmd/helm-operator/run.NewCmd.func1\n\t/workspace/internal/cmd/helm-operator/run/cmd.go:74\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897\nmain.main\n\t/workspace/cmd/helm-operator/main.go:40\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"}
It looks like it acquires the leader election lease once, and then fails to renew it.
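While the pod crash-loops, the lock itself can be inspected; with a ConfigMap-based lock (as in the events above) the current holder and the last renew time live in the standard client-go leader annotation. The namespace and ConfigMap name below are taken from the logs:

# Show who holds the leader-election lock and when it was last renewed
kubectl -n openshift-operators get configmap ibm-cps-operator \
  -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'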
Since the timeout seems to be on the ConfigMap, I tried counting all of them:
# count configmaps
❯ kubectl get cm --all-namespaces -o name |wc -l
1211
At the moment the controller doesn't survive more than 2 minutes, so I cannot inspect the API calls it makes while the memory usage is rising.
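One way to sample the pod's memory during those two minutes, assuming metrics are available via kubectl top and the Deployment carries the default control-plane=controller-manager scaffold label:

# Poll the operator pod's CPU/memory every 5 seconds until it crashes
watch -n 5 kubectl -n openshift-operators top pod -l control-plane=controller-manager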
I don't think setting --zap-log-level=debug introduced much more verbose output. The only debug entries are "Reconciling" and "Reconciled release".
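Since the container keeps restarting, the throttling and leader-election lines can also be pulled from the previous (crashed) run; the label and the "manager" container name below are the scaffold defaults and may differ:

# Find the operator pod, then read the crashed container's logs
kubectl -n openshift-operators get pods -l control-plane=controller-manager
kubectl -n openshift-operators logs <pod-name> -c manager --previous | grep -E 'throttling|leaderelection'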
As posted before, I see lots of API requests which (at least to me) look weird for our controller:
https://172.21.0.1:443/apis/migration.k8s.io/v1alpha1?timeout=32s
https://172.21.0.1:443/apis/autoscaling/v2beta1?timeout=32s
https://172.21.0.1:443/apis/serving.knative.dev/v1alpha1?timeout=32s
https://172.21.0.1:443/apis/image.openshift.io/v1?timeout=32s
https://172.21.0.1:443/apis/elasticsearch.k8s.elastic.co/v1?timeout=32s
https://172.21.0.1:443/apis/snapshot.storage.k8s.io/v1?timeout=32s
https://172.21.0.1:443/apis/cloudcredential.openshift.io/v1?timeout=32s
https://172.21.0.1:443/apis/operators.coreos.com/v1?timeout=32s
https://172.21.0.1:443/apis/extensions/v1beta1?timeout=32s
https://172.21.0.1:443/apis/autoscaling/v2beta2?timeout=32s
https://172.21.0.1:443/apis/whereabouts.cni.cncf.io/v1alpha1?timeout=32s
https://172.21.0.1:443/apis/operators.coreos.com/v1alpha1?timeout=32s
https://172.21.0.1:443/apis/triggers.tekton.dev/v1alpha1?timeout=32s
https://172.21.0.1:443/apis/kibana.k8s.elastic.co/v1beta1?timeout=32s
I don't know the inner logic of the helm-controller, but I don't really see a reason for it to query things like Tekton, Knative, etc. I can imagine, though, that if the controller is caching the output of all Tekton jobs, this might explain the origin of the memory usage. Does anybody understand why those APIs are called? Is this maybe a Helm SDK issue?
When the helm controller comes up it dynamically queries the API to build a spec for talking to the cluster, so it might hit a bunch of APIs that look weird. Going to have to do some more digging.
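For context, the URLs above look like the standard Kubernetes discovery round trip: one GET per served API group/version, independent of what the chart contains. On CRD-heavy clusters a similar burst (and the same client-side throttling message) can be reproduced with kubectl alone:

# Enumerate every group/version the API server serves; with 150+ CRDs this can
# print the same "Waited for ... due to client-side throttling" lines
kubectl api-resources -v=6 2>&1 | grep -E 'throttling|GET'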
So, trying to reproduce this on my rinky-dink minikube with 150+ CRDs deployed, my controller comes up fine with no hint of self-rate-limiting. I suspect that some OpenShift-specific stuff or configuration might be causing this.
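For anyone trying to reproduce this on a small cluster, one rough way to inflate API discovery is to register a pile of throwaway CRDs (group and kind names below are made up):

# Create 150 dummy CRDs so that discovery has many extra group/versions to enumerate
for i in $(seq 1 150); do
cat <<EOF | kubectl apply -f -
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets${i}.test${i}.example.com
spec:
  group: test${i}.example.com
  names:
    kind: Widget${i}
    listKind: Widget${i}List
    plural: widgets${i}
    singular: widget${i}
  scope: Namespaced
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true
EOF
done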
Tried this again with the cluster also saturated with ConfigMaps, still unable to reproduce this. Would it be possible to get access to the cluster you're experiencing this error on? There's not much we can do locally without the ability to reproduce the error.
Closing this as the user is no longer experiencing the problem and we're unable to reproduce it.
Bug Report
What did you do?
Our operator is based on the helm-operator v1.11.0. When the manager is running, we get constant CrashLoops with the message leader election lost. Here are the logs produced by all the restarts:
Environment
Operator type:
/language helm
Kubernetes cluster type:
OpenShift 4.7.
$ operator-sdk version
$ kubectl version