operator-framework / operator-sdk

SDK for building Kubernetes applications. Provides high level APIs, useful abstractions, and project scaffolding.
https://sdk.operatorframework.io
Apache License 2.0

Helm operator leader election lost #5186

Closed · dolfim closed this 2 years ago

dolfim commented 3 years ago

Bug Report

What did you do?

Our operator is based on the helm-operator v1.11.0. When the manager is running, we get a constant CrashLoopBackOff with the message "leader election lost".

Here are the logs produced by each restart:

{"level":"info","ts":1630651003.0890665,"logger":"cmd","msg":"Version","Go Version":"go1.16.7","GOOS":"linux","GOARCH":"amd64","helm-operator":"v1.11.0","commit":"28dcd12a776d8a8ff597e1d8527b08792e7312fd"}
{"level":"info","ts":1630651003.09068,"logger":"cmd","msg":"Watch namespaces not configured by environment variable WATCH_NAMESPACE or file. Watching all namespaces.","Namespace":""}
I0903 06:36:44.244793       1 request.go:668] Waited for 1.039457424s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/elasticsearch.k8s.elastic.co/v1beta1?timeout=32s
{"level":"info","ts":1630651006.893125,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"127.0.0.1:8080"}
{"level":"info","ts":1630651006.8952067,"logger":"helm.controller","msg":"Watching resource","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp","namespace":"","reconcilePeriod":"1m0s"}
{"level":"info","ts":1630651006.8960173,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
I0903 06:36:46.896255       1 leaderelection.go:243] attempting to acquire leader lease openshift-operators/ibm-cps-operator...
I0903 06:37:05.245608       1 leaderelection.go:253] successfully acquired lease openshift-operators/ibm-cps-operator
{"level":"info","ts":1630651025.2462952,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting EventSource","source":"kind source: apps.cps.deepsearch.ibm.com/v1alpha1, Kind=KgAmqp"}
{"level":"info","ts":1630651025.2464244,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting Controller"}
{"level":"info","ts":1630651025.7481437,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting workers","worker count":16}
I0903 06:37:08.088098       1 request.go:668] Waited for 1.030828222s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/operators.coreos.com/v1alpha1?timeout=32s
I0903 06:37:19.237634       1 request.go:668] Waited for 1.048805637s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/tuned.openshift.io/v1?timeout=32s
I0903 06:37:31.488584       1 request.go:668] Waited for 1.093715185s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/certificates.k8s.io/v1beta1?timeout=32s
E0903 06:37:41.490874       1 leaderelection.go:361] Failed to update lock: Put "https://172.21.0.1:443/api/v1/namespaces/openshift-operators/configmaps/ibm-cps-operator": context deadline exceeded
I0903 06:37:41.491073       1 leaderelection.go:278] failed to renew lease openshift-operators/ibm-cps-operator: timed out waiting for the condition
{"level":"error","ts":1630651061.4911356,"logger":"cmd","msg":"Manager exited non-zero.","Namespace":"","error":"leader election lost","stacktrace":"github.com/operator-framework/operator-sdk/internal/cmd/helm-operator/run.NewCmd.func1\n\t/workspace/internal/cmd/helm-operator/run/cmd.go:74\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897\nmain.main\n\t/workspace/cmd/helm-operator/main.go:40\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"}

Environment

Operator type:

/language helm

Kubernetes cluster type:

OpenShift 4.7.

$ operator-sdk version

operator-sdk version: "v1.11.0", commit: "28dcd12a776d8a8ff597e1d8527b08792e7312fd", kubernetes version: "1.20.2", go version: "go1.16.7", GOOS: "darwin", GOARCH: "amd64"

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"clean", BuildDate:"2021-07-15T21:04:39Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0+558d959", GitCommit:"558d959332b3f1f7bf786673bf294e6e0932bb18", GitTreeState:"clean", BuildDate:"2021-07-13T20:45:42Z", GoVersion:"go1.15.14", Compiler:"gc", Platform:"linux/amd64"}
dolfim commented 3 years ago

It might be a side discussion, but from the logs it looks like the helm-operator is fetching APIs that should be unrelated to our Helm chart (https://172.21.0.1:443/apis/elasticsearch.k8s.elastic.co/v1beta1, etc.), and these requests are producing the client-side throttling. Could this be related?

jberkhahn commented 2 years ago

If your operator is hitting some API that you've never heard of, that's a pretty good indicator that you've generated your operator incorrectly. It looks like you might be getting rate limited due to hammering the API, and that's causing the leader elections to time out. What helm chart are you using and what commands did you run to generate your operator?
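
For context on the failure mode: the election loop in the logs is client-go's leader election, which must renew its ConfigMap lock within a fixed deadline. Below is a minimal sketch of that loop (not the operator's actual code; controller-runtime's defaults of 15s/10s/2s are assumed). If every API request is queued behind a client-side rate limiter, the renewing PUT can miss RenewDeadline and the process exits with "leader election lost".

package main

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// ConfigMap-based lock, matching the lock resource seen in the logs above.
	lock, err := resourcelock.New(
		resourcelock.ConfigMapsResourceLock,
		"openshift-operators", "ibm-cps-operator",
		client.CoreV1(), client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: os.Getenv("HOSTNAME")},
	)
	if err != nil {
		panic(err)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // controller-runtime default
		RenewDeadline: 10 * time.Second, // the renewing PUT must finish within this window
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* start controllers */ },
			// A missed renewal lands here and the manager exits non-zero.
			OnStoppedLeading: func() { os.Exit(1) },
		},
	})
}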

dolfim commented 2 years ago

The operator was created using the tutorial on the operator-sdk page:

operator-sdk init --domain cps.deepsearch.ibm.com --plugins helm
operator-sdk create api --group apps --version v1alpha1 --kind KgAmqp
jberkhahn commented 2 years ago

Is "KgAmqp" a pre-existing Helm chart or are you just making a void Helm chart like the example? I can't get the void operator example to fail like this.

I suspect that you are getting rate-limited, but it looks like it's happening client-side (on the controller) rather than by the API server. This can be configured via flags passed to the controller-manager. Is there anything unusual when you look at its startup command? Check "--kube-api-qps" in particular, as it looks like that's what sets the client-side rate limit for the controller-manager.
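
To illustrate the limit in question (a sketch, not the operator's code): client-go's default client-side rate limiter allows 5 requests/sec with a burst of 10, set via rest.Config's QPS and Burst fields. Whether the helm-operator binary accepts a flag like --kube-api-qps to override them is an assumption to verify against its help output; in plain client-go the knobs look like this:

package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newClient builds a clientset with a widened client-side rate limit.
func newClient() (*kubernetes.Clientset, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	cfg.QPS = 50    // client-go default is 5 when left at zero
	cfg.Burst = 100 // client-go default is 10 when left at zero
	return kubernetes.NewForConfig(cfg)
}

func main() {
	if _, err := newClient(); err != nil {
		panic(err)
	}
}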

dolfim commented 2 years ago

Our current intuition is that the error shows up only on a large OCP 4.6 cluster (60+ nodes, 2500+ Pods, etc.).

As advised, we tried introducing selectors matching the labels of some KgAmqp CRs. While the controller was not matching any CR, the logs were quiet and the memory usage was very low. After patching a few (not all) KgAmqp CRs, we saw a huge increase in memory usage, and the logs show 1) constant "Reconciled release" entries, and 2) "client-side throttling" for APIs we would not expect to be queried (tekton.dev, knative, etc.). Those APIs certainly do not match our labels and do not contain the CRD defined in the watches.yaml. For reference, here is the watches.yaml:

# Use the 'create api' subcommand to add watches to this file.
- group: apps.cps.deepsearch.ibm.com
  version: v1alpha1
  kind: KgAmqp
  chart: helm-charts/kgamqp
  watchDependentResources: false # adding this doesn't help
  selector: # adding this doesn't help
    matchLabels:
      app.kubernetes.io/name: kgamqp
      app.kubernetes.io/managed-by: ibm-cps-operator
#+kubebuilder:scaffold:watch


Regarding the previous questions:

The Helm chart was started from the vanilla example and modified to a) add a few more values, b) remove the HPA, and c) add a ConfigMap and Secrets.

The controller (read from the running Deployment) has the following args:

args:
- --health-probe-bind-address=:8081
- --metrics-bind-address=127.0.0.1:8080
- --leader-elect
- --leader-election-id=ibm-cps-operator
varshaprasad96 commented 2 years ago

@dolfim Thanks for raising the issue. With the current information we are not able to figure out the reason behind this error either, and since, as you mentioned, the client-side throttling happens only on large clusters, it would be difficult for us to reproduce. After brainstorming about this in our community meeting, we have a few pointers:

  1. How many CRDs are installed in the cluster?
  2. Could you try setting --zap-log-level (ref) to increase the verbosity, so that we have more logs to debug?
  3. If possible, it would be helpful if you could share the operator code. We can try reproducing it locally, or at least verify whether there is something that could be improved or is wrong.
dolfim commented 2 years ago

Thanks for looking at our issue and brainstorming about possibilities.

  1. kubectl get crd shows 152 CRDs
  2. We launched the controller with the following flags
          args:
            - '--health-probe-bind-address=:8081'
            - '--metrics-bind-address=127.0.0.1:8080'
            - '--leader-elect'
            - '--leader-election-id=ibm-cps-operator'
            - '--zap-log-level=debug'

    The debug logs now look like:

    {"level":"info","ts":1632210241.791085,"logger":"cmd","msg":"Version","Go Version":"go1.16.8","GOOS":"linux","GOARCH":"amd64","helm-operator":"v1.11.0","commit":"28dcd12a776d8a8ff597e1d8527b08792e7312fd"}
    {"level":"info","ts":1632210241.7931075,"logger":"cmd","msg":"Watch namespaces not configured by environment variable WATCH_NAMESPACE or file. Watching all namespaces.","Namespace":""}
    I0921 07:44:03.538887       1 request.go:668] Waited for 1.046670127s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/storage.k8s.io/v1?timeout=32s
    {"level":"info","ts":1632210245.8048182,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"127.0.0.1:8080"}
    {"level":"info","ts":1632210245.8914208,"logger":"helm.controller","msg":"Watching resource","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp","namespace":"","reconcilePeriod":"1m0s"}
    I0921 07:44:05.892824       1 leaderelection.go:243] attempting to acquire leader lease openshift-operators/ibm-cps-operator...
    {"level":"info","ts":1632210245.8931193,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
    I0921 07:44:22.484699       1 leaderelection.go:253] successfully acquired lease openshift-operators/ibm-cps-operator
    {"level":"debug","ts":1632210262.4847672,"logger":"controller-runtime.manager.events","msg":"Normal","object":{"kind":"ConfigMap","namespace":"openshift-operators","name":"ibm-cps-operator","uid":"4600a34a-c21e-4698-b83f-0bcbdfc4929c","apiVersion":"v1","resourceVersion":"610631877"},"reason":"LeaderElection","message":"ibm-cps-operator-controller-manager-557c88f7f6-mhlt2_be7ef8ca-2fb9-41b1-9d2d-c3d044ab3633 became leader"}
    {"level":"info","ts":1632210262.4856992,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting EventSource","source":"kind source: apps.cps.deepsearch.ibm.com/v1alpha1, Kind=KgAmqp"}
    {"level":"info","ts":1632210262.485977,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting Controller"}
    {"level":"info","ts":1632210262.9877303,"logger":"controller-runtime.manager.controller.kgamqp-controller","msg":"Starting workers","worker count":16}
    {"level":"debug","ts":1632210262.9914088,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-edda45b6-kgamqp-90f46298","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
    {"level":"debug","ts":1632210262.9916945,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-30b90812","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
    {"level":"debug","ts":1632210262.9957469,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-f21574fe-kgamqp-5f323a8a","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
    {"level":"debug","ts":1632210262.998314,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-bd768688-kgamqp-59041f3c","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
    {"level":"debug","ts":1632210263.091478,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-6dc398bc-kgamqp-5e842217","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
    {"level":"debug","ts":1632210263.0931582,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-0bbf559a","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
    {"level":"debug","ts":1632210263.093546,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-fa7b5983","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
    {"level":"debug","ts":1632210263.0950687,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-801751ea","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
    {"level":"debug","ts":1632210263.0951493,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-bd768688-kgamqp-7f18f401","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
    {"level":"debug","ts":1632210263.0968616,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-617ffb14-kgamqp-5b8ab839","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
    {"level":"debug","ts":1632210263.190764,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-bd768688-kgamqp-17e384d1","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
    {"level":"debug","ts":1632210263.193185,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-a4f4d947","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
    {"level":"debug","ts":1632210263.0953085,"logger":"helm.controller","msg":"Reconciling","namespace":"foc-mvp-deepsearch","name":"cps-617ffb14-kgamqp-1f739374","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
    {"level":"debug","ts":1632210263.4915593,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-99cf5236-kgamqp-72952c4d","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
    {"level":"debug","ts":1632210263.491967,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-bd768688-kgamqp-fa5a863e","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
    {"level":"debug","ts":1632210263.4956002,"logger":"helm.controller","msg":"Reconciling","namespace":"deepsearch-dev","name":"cps-26239ca0-kgamqp-503b8ba5","apiVersion":"apps.cps.deepsearch.ibm.com/v1alpha1","kind":"KgAmqp"}
    I0921 07:44:25.491421       1 request.go:668] Waited for 1.078316488s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/node.k8s.io/v1beta1?timeout=32s
    I0921 07:44:35.494616       1 request.go:668] Waited for 2.301755739s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/rbac.authorization.k8s.io/v1beta1?timeout=32s
    I0921 07:44:50.992501       1 request.go:668] Waited for 1.000852157s due to client-side throttling, not priority and fairness, request: GET:https://172.21.0.1:443/apis/scheduling.k8s.io/v1beta1?timeout=32s
    E0921 07:44:53.894135       1 leaderelection.go:325] error retrieving resource lock openshift-operators/ibm-cps-operator: Get "https://172.21.0.1:443/api/v1/namespaces/openshift-operators/configmaps/ibm-cps-operator": context deadline exceeded
    I0921 07:44:53.990918       1 leaderelection.go:278] failed to renew lease openshift-operators/ibm-cps-operator: timed out waiting for the condition
    {"level":"error","ts":1632210293.9922085,"logger":"cmd","msg":"Manager exited non-zero.","Namespace":"","error":"leader election lost","stacktrace":"github.com/operator-framework/operator-sdk/internal/cmd/helm-operator/run.NewCmd.func1\n\t/workspace/internal/cmd/helm-operator/run/cmd.go:74\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897\nmain.main\n\t/workspace/cmd/helm-operator/main.go:40\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"}

It looks like it acquires the leader lease once, and then fails to renew it.
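
For reference, the renew window that times out here is tunable in controller-runtime; whether the v1.11 helm-operator binary exposes matching flags is an assumption to check against its help output. A minimal Go sketch of widening it (the election ID and namespace below are the ones from this issue):

package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Longer deadlines give a throttled renew request more time to get
	// through before leadership is declared lost.
	leaseDuration := 60 * time.Second
	renewDeadline := 45 * time.Second
	retryPeriod := 10 * time.Second

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "ibm-cps-operator",
		LeaderElectionNamespace: "openshift-operators",
		LeaseDuration:           &leaseDuration,
		RenewDeadline:           &renewDeadline,
		RetryPeriod:             &retryPeriod,
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}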

Since the timeout seems to be on the ConfigMap, I tried counting all of them:

# count configmaps
❯ kubectl get cm --all-namespaces -o name |wc -l
    1211

At the moment the controller doesn't survive more than ~2 minutes, so I cannot inspect the API calls made while the memory usage is climbing.

dolfim commented 2 years ago

I don't think --zap-log-level=debug introduced more verbose output. The only debug entries are "Reconciling" and "Reconciled release".

As posted before, I see lots of API requests that (at least to me) look weird for our controller:

https://172.21.0.1:443/apis/migration.k8s.io/v1alpha1?timeout=32s
https://172.21.0.1:443/apis/autoscaling/v2beta1?timeout=32s
https://172.21.0.1:443/apis/serving.knative.dev/v1alpha1?timeout=32s
https://172.21.0.1:443/apis/image.openshift.io/v1?timeout=32s
https://172.21.0.1:443/apis/elasticsearch.k8s.elastic.co/v1?timeout=32s
https://172.21.0.1:443/apis/snapshot.storage.k8s.io/v1?timeout=32s
https://172.21.0.1:443/apis/cloudcredential.openshift.io/v1?timeout=32s
https://172.21.0.1:443/apis/operators.coreos.com/v1?timeout=32s
https://172.21.0.1:443/apis/extensions/v1beta1?timeout=32s
https://172.21.0.1:443/apis/autoscaling/v2beta2?timeout=32s
https://172.21.0.1:443/apis/whereabouts.cni.cncf.io/v1alpha1?timeout=32s
https://172.21.0.1:443/apis/operators.coreos.com/v1alpha1?timeout=32s
https://172.21.0.1:443/apis/triggers.tekton.dev/v1alpha1?timeout=32s
https://172.21.0.1:443/apis/kibana.k8s.elastic.co/v1beta1?timeout=32s

I don't know the inner logic of the helm-operator, but I don't really see a reason for it to query things like Tekton, Knative, etc. I can imagine, though, that if the controller is caching the output of all Tekton jobs, that might explain the origin of the memory usage. Does anybody understand why those APIs are called? Is this perhaps a Helm SDK issue?

jberkhahn commented 2 years ago

When the Helm controller comes up, it dynamically queries the API server to build a spec for talking to the cluster, so it may hit a bunch of APIs that look weird. Going to have to do some more digging.
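
A rough sketch of that discovery pass (assuming client-go's discovery client is what sits underneath; this is illustrative, not the operator's actual code). Each group/version on the cluster costs one GET, e.g. /apis/tekton.dev/v1alpha1, so with 150+ CRDs the requests are numerous enough to trip the default client-side rate limiter:

package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// One GET per group/version; these are the throttled requests that
	// show up in the operator's logs.
	_, resources, err := dc.ServerGroupsAndResources()
	if err != nil {
		// Discovery can return partial results alongside an error.
		fmt.Println("partial discovery:", err)
	}
	for _, list := range resources {
		fmt.Println(list.GroupVersion, "->", len(list.APIResources), "resources")
	}
}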

jberkhahn commented 2 years ago

So, trying to reproduce this on my rinky-dink minikube with 150+ CRDs deployed, my controller comes up fine with no hint of client-side rate limiting. I suspect that some OpenShift-specific behavior or configuration might be causing this.

jberkhahn commented 2 years ago

Tried this again with the cluster also saturated with ConfigMaps; still unable to reproduce. Would it be possible to get access to the cluster you're experiencing this error on? There's not much we can do locally without the ability to reproduce the error.

jberkhahn commented 2 years ago

Closing this as the user is no longer experiencing the problem and we're unable to reproduce it.