kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

failed to renew lease kube-system/cluster-autoscaler: failed to tryAcquireOrRenew #1653

Closed suneeta-mall closed 4 years ago

suneeta-mall commented 5 years ago

I am running Kubernetes 1.12.5 with etcd3 and cluster-autoscaler v1.2.2 (on AWS), and my cluster is healthy with everything operational. After some scaling activity, cluster-autoscaler goes into a crash loop with the following error:

F0205 23:32:52.241542       1 main.go:384] lost master
goroutine 1 [running]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.stacks(0xc000022100, 0xc000574000, 0x37, 0xee)
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:828 +0xd4
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.(*loggingT).output(0x4333560, 0xc000000003, 0xc00056e000, 0x429c819, 0x7, 0x180, 0x0)
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:779 +0x306
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.(*loggingT).printf(0x4333560, 0x3, 0x26f2036, 0xb, 0x0, 0x0, 0x0)
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:678 +0x14b
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.Fatalf(0x26f2036, 0xb, 0x0, 0x0, 0x0)
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:1207 +0x67
main.main.func3()
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:384 +0x47
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1(0xc000668000)
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:163 +0x40
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc000668000, 0x29c4b00, 0xc000591dc0)
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:172 +0x112
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.RunOrDie(0x29c4b40, 0xc000046040, 0x29cbd20, 0xc0001e6a20, 0x37e11d600, 0x2540be400, 0x77359400, 0xc00001f030, 0x27baac0, 0x0, ...)
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:184 +0x99
main.main()
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:372 +0x5cf
I0205 23:32:52.241724       1 factory.go:33] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"cluster-autoscaler", UID:"e78ccdca-2440-11e9-8514-0a1153ba0cc4", APIVersion:"v1", ResourceVersion:"6949892", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' cluster-autoscaler-57f79874cf-c45xb stopped leading
I0205 23:32:52.745013       1 auto_scaling_groups.go:124] Registering ASG XXXX

Everything else in the cluster seems to work perfectly fine; the masters, the cluster, and etcd are all healthy.
Is there any way to resurrect/resolve this issue?

MaciekPytel commented 5 years ago

This suggests either CA has a problem reaching the apiserver or the apiserver is unhealthy. Can you check whether the same happens to other system components (e.g. kube-controller-manager)? They use the same generic Kubernetes leader election library that CA uses. Usually when I see this problem it's because of an overloaded apiserver, and it impacts multiple controllers.
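
A rough check along these lines can confirm whether other leader-elected controllers are affected (a sketch only; the pod names are placeholders for whatever actually runs on your masters):

```
# Is kube-controller-manager also struggling to renew its lease?
kubectl -n kube-system logs kube-controller-manager-<master-node> | grep -i leaderelection

# Inspect the lock object CA uses on these versions (an Endpoints object) and check
# the control-plane.alpha.kubernetes.io/leader annotation for the last renew time
kubectl -n kube-system get endpoints cluster-autoscaler -o yaml
```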

suneeta-mall commented 5 years ago

@MaciekPytel That's what I thought too, but the rest of the cluster, including all kube-system components, works fine. None of them have restarted.

aleksandra-malinowska commented 5 years ago

To rule out version skew as the cause (Kubernetes 1.12.5 with Cluster Autoscaler 1.2.2), can you please try a newer version of the autoscaler? We recommend matching the Cluster Autoscaler minor version to the Kubernetes minor version (so 1.12.x for Kubernetes 1.12); see the compatibility table in the README for the recommended versions.
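
For example, something along these lines should work, assuming the Deployment and container are both named cluster-autoscaler in kube-system (the tag is illustrative; pick the latest 1.12.x patch release):

```
# Point the deployment at an autoscaler release matching the cluster's minor version
kubectl -n kube-system set image deployment/cluster-autoscaler \
  cluster-autoscaler=k8s.gcr.io/cluster-autoscaler:v1.12.3
```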

suneeta-mall commented 5 years ago

Ah ok, will try 1.12. Thanks for letting me know @aleksandra-malinowska, I will try out the new version.

suneeta-mall commented 5 years ago

@aleksandra-malinowska That did not help; I observed the same crash-loop behaviour. Interestingly, the problem surfaced only when the number of nodes the autoscaler managed was about 200 or more. Every time I brought the node count down from the 200-1k range to 150 or less, the autoscaler recovered and functioned properly. The rest of the kube-system components remained functional throughout. Does this help in identifying where the bottleneck might be? I can confirm I have run various versions of the autoscaler, ranging from 1.12.x to 1.13.1, and see the same behaviour: the autoscaler goes into a crash frenzy when the number of nodes is roughly 200 or more and recovers when it comes back down.

miry commented 5 years ago

@suneeta-mall can you provide logs with strace for CA 1.12? It would make it easier to pinpoint the problem. Can you also provide the deployment script? It would help to understand which options were enabled and whether you have memory limits, etc. It is also useful to have the full logs.

Possible problem: too many queries, and kube-apiserver/etcd cannot handle them. You can monitor the logs, CPU, and memory of etcd and the apiserver.
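
For example (a sketch only: kubectl top needs metrics-server, and the control-plane pod names are placeholders that vary by installer):

```
# Resource usage of control-plane pods
kubectl top pods -n kube-system | grep -E 'apiserver|etcd'

# Recent apiserver and etcd logs from a master node
kubectl -n kube-system logs kube-apiserver-<master-node> --tail=200
kubectl -n kube-system logs <etcd-pod-name> --tail=200
```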

suneeta-mall commented 5 years ago

@miry Yeah sure, I will work on getting the logs. Here's the deployment script:

---
apiVersion: v1
kind: ServiceAccount
metadata: 
  labels: 
    "k8s-addon": "cluster-autoscaler.addons.k8s.io"
    "k8s-app": "cluster-autoscaler"
  name: "cluster-autoscaler"
  namespace: "kube-system"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata: 
  labels: 
    "k8s-addon": "cluster-autoscaler.addons.k8s.io"
    "k8s-app": "cluster-autoscaler"
  name: "cluster-autoscaler"
rules: 
  - 
    apiGroups: 
      - ""
    resources: 
      - events
      - endpoints
    verbs: 
      - create
      - patch
  - 
    apiGroups: 
      - ""
    resources: 
      - pods/eviction
    verbs: 
      - create
  - 
    apiGroups: 
      - ""
    resources: 
      - pods/status
    verbs: 
      - update
  - 
    apiGroups: 
      - ""
    resourceNames: 
      - "cluster-autoscaler"
    resources: 
      - endpoints
    verbs: 
      - get
      - update
  - 
    apiGroups: 
      - ""
    resources: 
      - nodes
    verbs: 
      - watch
      - list
      - get
      - update
  - 
    apiGroups: 
      - ""
    resources: 
      - pods
      - services
      - replicationcontrollers
      - persistentvolumeclaims
      - persistentvolumes
    verbs: 
      - watch
      - list
      - get
  - 
    apiGroups: 
      - extensions
    resources: 
      - replicasets
      - daemonsets
    verbs: 
      - watch
      - list
      - get
  - 
    apiGroups: 
      - policy
    resources: 
      - poddisruptionbudgets
    verbs: 
      - watch
      - list
  - 
    apiGroups: 
      - apps
    resources: 
      - statefulsets
    verbs: 
      - watch
      - list
      - get
  - 
    apiGroups: 
      - storage.k8s.io
    resources: 
      - storageclasses
    verbs: 
      - watch
      - list
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata: 
  labels: 
    "k8s-addon": "cluster-autoscaler.addons.k8s.io"
    "k8s-app": "cluster-autoscaler"
  name: "cluster-autoscaler"
  namespace: "kube-system"
rules: 
  - 
    apiGroups: 
      - ""
    resources: 
      - configmaps
    verbs: 
      - create
  - 
    apiGroups: 
      - ""
    resourceNames: 
      - "cluster-autoscaler-status"
    resources: 
      - configmaps
    verbs: 
      - delete
      - get
      - update
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata: 
  labels: 
    "k8s-addon": "cluster-autoscaler.addons.k8s.io"
    "k8s-app": "cluster-autoscaler"
  name: "cluster-autoscaler"
roleRef: 
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: "cluster-autoscaler"
subjects: 
  - 
    kind: ServiceAccount
    name: "cluster-autoscaler"
    namespace: "kube-system"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata: 
  labels: 
    "k8s-addon": "cluster-autoscaler.addons.k8s.io"
    "k8s-app": "cluster-autoscaler"
  name: "cluster-autoscaler"
  namespace: "kube-system"
roleRef: 
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: "cluster-autoscaler"
subjects: 
  - 
    kind: ServiceAccount
    name: "cluster-autoscaler"
    namespace: "kube-system"
---
apiVersion: apps/v1
kind: Deployment
metadata: 
  labels: 
    app: "cluster-autoscaler"
  name: "cluster-autoscaler"
  namespace: "kube-system"
spec: 
  replicas: 1
  selector: 
    matchLabels: 
      app: "cluster-autoscaler"
  template: 
    metadata: 
      annotations: 
        ad.datadoghq.com/nginx.logs: "[{\"source\":\"autoscaler\",\"service\":\"autoscaler\"}]"
        prometheus.io/port: "8085"
        prometheus.io/scrape: "true"
        scheduler.alpha.kubernetes.io/tolerations: "[{\"key\":\"dedicated\", \"value\":\"master\"}]"
      labels: 
        app: "cluster-autoscaler"
        "k8s-addon": "cluster-autoscaler.addons.k8s.io"
    spec: 
      containers: 
        - 
          command: 
            - "./cluster-autoscaler"
            - "--v=4"
            - "--stderrthreshold=info"
            - "--cloud-provider=aws"
            - "--skip-nodes-with-system-pods=false"
            - "--skip-nodes-with-local-storage=false"
            - "--expander=most-pods"
            - "--ignore-daemonsets-utilization=true"
            - "--ignore-mirror-pods-utilization=true"
            - "--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/mycluster.com"
          env: 
            - 
              name: AWS_REGION
              value: "ap-southeast-2"
          image: "k8s.gcr.io/cluster-autoscaler:v1.13.1"
          imagePullPolicy: Always
          livenessProbe: 
            httpGet: 
              path: "/health-check"
              port: 8085
          name: "cluster-autoscaler"
          readinessProbe: 
            httpGet: 
              path: "/health-check"
              port: 8085
          resources: 
            limits: 
              cpu: 100m
              memory: 300Mi
            requests: 
              cpu: 100m
              memory: 300Mi
          volumeMounts: 
            - 
              mountPath: "/etc/ssl/certs/ca-certificates.crt"
              name: "ssl-certs"
              readOnly: true
      dnsPolicy: Default
      nodeSelector: 
        kubernetes.io/role: master
      serviceAccountName: "cluster-autoscaler"
      tolerations: 
        - 
          effect: NoSchedule
          key: "node-role.kubernetes.io/master"
      volumes: 
        - 
          hostPath: 
            path: "/etc/ssl/certs/ca-certificates.crt"
          name: "ssl-certs"

Are there any instructions on getting logs with strace when the issue results in a crash? I assume you mean wrapping the autoscaler command with strace and sending the logs; is that enough, or are there more specific details you are after?

As for possible problems, yes, I agree it's certainly possible that the apiserver is getting too many queries, but all other cluster resources, including kube-system components and my own workload, seem to chug along okay. To my knowledge it's only the autoscaler that fails. Is it possible the autoscaler is making too many calls and getting rate-limited? I have not seen much in the logs to indicate that, but I will keep an eye on it and update with what I find.
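
One quick check, assuming the default klog verbosity: client-go logs its own client-side rate limiting, so throttling would show up in the autoscaler's logs, e.g.:

```
# client-go prints "Throttling request took ..." when it rate-limits itself
kubectl -n kube-system logs deploy/cluster-autoscaler | grep -i "throttling request"
```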

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 5 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

alejandrox1 commented 5 years ago

@suneeta-mall sorry for the delay in responding here. Could you include the logs for the failed deploys, i.e., kubectl logs cluster-autoscaler-pod -n kube-system -p, please?

/remove-lifecycle rotten

suneeta-mall commented 5 years ago

@alejandrox1 The log is already attached in the description (see the "lost master" trace above). The kube masters and all other kube components seem to function fine; only the autoscaler fails.

alejandrox1 commented 5 years ago

@suneeta-mall how did you create the cluster? Would you happen to have a copy of the code somewhere?

suneeta-mall commented 5 years ago

@alejandrox1 It was created with kops on AWS. Anything specific you are looking for? A very basic version can be created with the following snippet, which is the foundation of the k8s setup used in this case. The etcd version is 3.x.

kops create cluster ${NAME} \
    --cloud aws \
    --master-zones ${ZONES} \
    --master-size m4.xlarge \
    --node-size m4.xlarge \
    --zones $ZONES \
    --topology public \
    --networking flannel \
    --kubernetes-version 1.12.8 \
    --dns-zone XXX \
    --encrypt-etcd-storage    

Sytten commented 5 years ago

I had a similar issue on my cluster (using EKS):

F0802 00:10:57.242174 1 main.go:384] lost master
I0802 00:10:57.242128 1 leaderelection.go:249] failed to renew lease kube-system/cluster-autoscaler: failed to tryAcquireOrRenew context deadline exceeded
I0802 00:10:57.244543 1 factory.go:33] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"cluster-autoscaler", UID:"1fc342a0-4b63-11e9-b984-02635bc9a4cc", APIVersion:"v1", ResourceVersion:"27196690", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' cluster-autoscaler-aws-cluster-autoscaler-59fbbcb794-7kzfv stopped leading

Then the pod died and restarted. It seems to be a hiccup, but I would like to know why it happened.

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

Pluies commented 4 years ago

We're running into similar issues on a very "scaly" EKS cluster here (quite a bit of up-and-down activity during the day); our other, more stable clusters do not seem to run into the issue. I've also noticed that this pod sometimes gets OOMKilled, so I'll try to add more memory first and will report back if it helped 👍

Sytten commented 4 years ago

/remove-lifecycle stale

elutsky commented 4 years ago

Happened for us as well. Cluster: v1.15.4, Cloud: Azure, Autoscaler version: 1.15.2.

I1123 18:51:25.870541       1 scale_down.go:771] No candidates for scale down
I1123 18:51:47.848093       1 leaderelection.go:281] failed to renew lease kube-system/cluster-autoscaler: failed to tryAcquireOrRenew context deadline exceeded
F1123 18:51:47.848126       1 main.go:406] lost master
goroutine 1 [running]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.stacks(0x4cb5f01, 0x3, 0xc000678000, 0x37)
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:900 +0xb1
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.(*loggingT).output(0x4cb5fa0, 0xc000000003, 0xc000477340, 0x4c19bb1, 0x7, 0x196, 0x0)
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:815 +0xe6
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.(*loggingT).printf(0x4cb5fa0, 0x3, 0x2b62471, 0xb, 0x0, 0x0, 0x0)
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:727 +0x14e
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.Fatalf(...)
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:1309
main.main.func3()
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:406 +0x5c
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1(0xc00026c7e0)
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:193 +0x40
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc00026c7e0, 0x2ff65e0, 0xc0001ca740)
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:202 +0x10f
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.RunOrDie(0x2ff6620, 0xc0000cc018, 0x3026ee0, 0xc0002ec280, 0x37e11d600, 0x2540be400, 0x77359400, 0xc00040f3e0, 0x2c39cc8, 0x0, ...)
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:214 +0x96
main.main()
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:394 +0x6ec

goroutine 19 [syscall, 241 minutes]: os/signal.signal_recv(0x0) /usr/local/go/src/runtime/sigqueue.go:139 +0x9c os/signal.loop() /usr/local/go/src/os/signal/signal_unix.go:23 +0x22 created by os/signal.init.0 /usr/local/go/src/os/signal/signal_unix.go:29 +0x41

goroutine 20 [chan receive]: k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.(*loggingT).flushDaemon(0x4cb5fa0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:1035 +0x8b created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.init.0 /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:404 +0x6c

goroutine 50 [IO wait, 241 minutes]: internal/poll.runtime_pollWait(0x7fc633d894f0, 0x72, 0x0) /usr/local/go/src/runtime/netpoll.go:182 +0x56 internal/poll.(pollDesc).wait(0xc0004fa198, 0x72, 0x0, 0x0, 0x2b5d3c7) /usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x9b internal/poll.(pollDesc).waitRead(...) /usr/local/go/src/internal/poll/fd_poll_runtime.go:92 internal/poll.(FD).Accept(0xc0004fa180, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0) /usr/local/go/src/internal/poll/fd_unix.go:384 +0x1ba net.(netFD).accept(0xc0004fa180, 0x28e75a0, 0x50, 0xc00038ef50) /usr/local/go/src/net/fd_unix.go:238 +0x42 net.(TCPListener).accept(0xc0000d01f8, 0xc000070700, 0x7fc633dd9b28, 0xc0002a8000) /usr/local/go/src/net/tcpsock_posix.go:139 +0x32 net.(TCPListener).AcceptTCP(0xc0000d01f8, 0x40dc28, 0x30, 0x28e75a0) /usr/local/go/src/net/tcpsock.go:247 +0x48 net/http.tcpKeepAliveListener.Accept(0xc0000d01f8, 0x28e75a0, 0xc000417710, 0x263bcc0, 0x4c9af30) /usr/local/go/src/net/http/server.go:3264 +0x2f net/http.(Server).Serve(0xc0003845b0, 0x2ff2ae0, 0xc0000d01f8, 0x0, 0x0) /usr/local/go/src/net/http/server.go:2859 +0x22d net/http.(Server).ListenAndServe(0xc0003845b0, 0xc0003845b0, 0xd) /usr/local/go/src/net/http/server.go:2797 +0xe4 net/http.ListenAndServe(...) /usr/local/go/src/net/http/server.go:3037 main.main.func1(0xc00038e000) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:359 +0x10d created by main.main /gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:356 +0x258

goroutine 12 [chan receive]: k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch.(*Broadcaster).loop(0xc0001cb6c0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch/mux.go:207 +0x66 created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch.NewBroadcaster /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch/mux.go:75 +0xcc

goroutine 151 [select, 2 minutes]: k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch.func2(0xc0004ec140, 0xc000186000, 0xc001306d20, 0xc0009515c0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:235 +0x150 created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:229 +0x246

goroutine 13 [chan receive]: k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.(eventBroadcasterImpl).StartEventWatcher.func1(0x2fc08c0, 0xc00051ac00, 0xc00040f3a0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:268 +0xa4 created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.(eventBroadcasterImpl).StartEventWatcher /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:266 +0x6e

goroutine 11 [runnable]: sync.(Cond).Broadcast(0xc0000d4380) /usr/local/go/src/sync/cond.go:73 +0x91 k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(clientConnReadLoop).processWindowUpdate(0xc000e81fb8, 0xc0009bb200, 0x0, 0x0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:2255 +0xf8 k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(clientConnReadLoop).run(0xc000e81fb8, 0x2c38850, 0xc00001dfb8) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:1727 +0x6ea k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(ClientConn).readLoop(0xc0000a3500) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:1607 +0x76 created by k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(*Transport).newClientConn /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:670 +0x637

goroutine 114 [select, 6 minutes]: k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).watchHandler(0xc0004ec0a0, 0x2fc0880, 0xc000d8e340, 0xc001173cc0, 0xc0000d2fc0, 0xc000186000, 0x0, 0x0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:329 +0x1d9 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch(0xc0004ec0a0, 0xc000186000, 0x0, 0x0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:300 +0x879 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).Run.func1() /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:124 +0x33 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000694f78) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001173f78, 0x3b9aca00, 0x0, 0x1, 0xc000186000) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.Until(...) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).Run(0xc0004ec0a0, 0xc000186000) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:123 +0x16b created by k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes.NewUnschedulablePodInNamespaceLister /gopath/src/k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:190 +0x1eb

goroutine 14 [select]: k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(ClientConn).roundTrip(0xc0000a3500, 0xc000737d00, 0x0, 0x0, 0x0, 0x0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:1081 +0x8cc k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(Transport).RoundTripOpt(0xc000144d80, 0xc000737d00, 0xc000807200, 0x6bda66, 0x0, 0xc00015f7a0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:444 +0x159 k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(Transport).RoundTrip(...) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:406 k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.noDialH2RoundTripper.RoundTrip(0xc000144d80, 0xc000737d00, 0xc0015b6c80, 0x5, 0xc00015f828) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:2536 +0x3f net/http.(Transport).roundTrip(0xc00015f680, 0xc000737d00, 0x248fe20, 0xc00041ef01, 0xc0008a6580) /usr/local/go/src/net/http/transport.go:430 +0xe90 net/http.(Transport).RoundTrip(0xc00015f680, 0xc000737d00, 0x2b645a5, 0xd, 0xc0008a6650) /usr/local/go/src/net/http/roundtrip.go:17 +0x35 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/transport.(bearerAuthRoundTripper).RoundTrip(0xc000442960, 0xc000737c00, 0x2b607b9, 0xa, 0xc0008a64d8) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/transport/round_trippers.go:317 +0x268 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/transport.(userAgentRoundTripper).RoundTrip(0xc00047c2e0, 0xc000737b00, 0xc00047c2e0, 0x0, 0x0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/transport/round_trippers.go:167 +0x1c2 net/http.send(0xc000737b00, 0x2fb5660, 0xc00047c2e0, 0x0, 0x0, 0x0, 0xc0004e5550, 0xc0008078d0, 0x1, 0x0) /usr/local/go/src/net/http/client.go:250 +0x461 net/http.(Client).send(0xc000442990, 0xc000737b00, 0x0, 0x0, 0x0, 0xc0004e5550, 0x0, 0x1, 0xc000cc85a0) /usr/local/go/src/net/http/client.go:174 +0xfb net/http.(Client).do(0xc000442990, 0xc000737b00, 0x0, 0x0, 0x0) /usr/local/go/src/net/http/client.go:641 +0x279 net/http.(Client).Do(0xc000442990, 0xc000737b00, 0x0, 0x39, 0x2fb34c0) /usr/local/go/src/net/http/client.go:509 +0x35 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/rest.(Request).request(0xc001824300, 0xc000807b80, 0x0, 0x0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/rest/request.go:737 +0x330 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/rest.(Request).Do(0xc001824300, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...) 
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/rest/request.go:809 +0xc5 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/kubernetes/typed/core/v1.(events).CreateWithEventNamespace(0xc00035bc20, 0xc001597180, 0xc00007fdd0, 0x14d9b8e, 0xc00007fdc8) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/kubernetes/typed/core/v1/event_expansion.go:57 +0x25d k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/kubernetes/typed/core/v1.(EventSinkImpl).Create(0xc00040f3c0, 0xc001597180, 0x280c8c0, 0xc001330320, 0x2ff6ea0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/kubernetes/typed/core/v1/event_expansion.go:155 +0x3d k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.recordEvent(0x2ff1220, 0xc00040f3c0, 0xc001597180, 0x0, 0x0, 0x0, 0xc000096000, 0xc00035bca0, 0x1) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:221 +0x12d k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.recordToSink(0x2ff1220, 0xc00040f3c0, 0xc001096780, 0xc00035bca0, 0xc00051ac30, 0x2540be400) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:189 +0x179 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.(eventBroadcasterImpl).StartRecordingToSink.func1(0xc001096780) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:171 +0x5c k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.(eventBroadcasterImpl).StartEventWatcher.func1(0x2fc08c0, 0xc00051ade0, 0xc00051adb0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:275 +0xe8 created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.(*eventBroadcasterImpl).StartEventWatcher /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:266 +0x6e

goroutine 128 [select, 2 minutes]: k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch.func2(0xc0004ec500, 0xc000186000, 0xc000c85b00, 0xc000a190e0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:235 +0x150 created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:229 +0x246

goroutine 150 [select, 2 minutes]: k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch.func2(0xc0004ec0a0, 0xc000186000, 0xc0001873e0, 0xc0000d2fc0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:235 +0x150 created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:229 +0x246

goroutine 83 [chan receive]: main.run(0xc00038e000) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:325 +0x1eb main.main.func2(0x2ff65e0, 0xc0001ca740) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:403 +0x2a created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:200 +0xec

goroutine 115 [select, 6 minutes]: k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).watchHandler(0xc0004ec140, 0x2fc0880, 0xc0001819c0, 0xc001175cc0, 0xc0009515c0, 0xc000186000, 0x0, 0x0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:329 +0x1d9 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch(0xc0004ec140, 0xc000186000, 0x0, 0x0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:300 +0x879 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).Run.func1() /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:124 +0x33 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000364f78) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001175f78, 0x3b9aca00, 0x0, 0x1, 0xc000186000) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.Until(...) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).Run(0xc0004ec140, 0xc000186000) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:123 +0x16b created by k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes.NewScheduledPodLister /gopath/src/k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:214 +0x1d9

bostadynamics commented 4 years ago

We got the same kind of message, and we have a config similar to what @suneeta-mall posted in terms of memory and CPU requests (300Mi RAM, 100m CPU). I don't know the details, but my issue got solved by cleaning up all the completed pods in the cluster. I had about 5-8k pods, and even running kubectl get pods --all-namespaces took a long while. After deleting the unneeded pods, everything is back to working correctly. Also, like @Pluies, I had 3 clusters with the same config but only one of them had the issue.
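
For reference, a minimal sketch of that kind of cleanup (it only targets pods that have already finished; consider a --dry-run first):

```
# Remove completed (Succeeded) pods across all namespaces
kubectl delete pods --all-namespaces --field-selector=status.phase=Succeeded

# Failed pods can be cleaned up the same way once they are no longer needed
kubectl delete pods --all-namespaces --field-selector=status.phase=Failed
```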

chusAlvarez commented 4 years ago

After v1.17.0, some permissions need to be added to the RBAC ClusterRole:

  - apiGroups:
    - storage.k8s.io
    resources:
    - storageclasses
    - csinodes
    verbs:
    - watch
    - list
    - get
  - apiGroups:
    - coordination.k8s.io
    resources:
    - leases
    verbs:
    - watch
    - list
    - get
    - create
    - patch
    - update

tkbrex commented 4 years ago

I have the same problem:

I0514 05:08:51.277989       1 leaderelection.go:281] failed to renew lease kube-system/cluster-autoscaler: failed to tryAcquireOrRenew context deadline exceeded
F0514 05:08:51.278016       1 main.go:409] lost master

I am running auto scaler version 1.15.6

For what it's worth, if I do the following, it crashes less often. I think it really cuts down the k8s API calls, so there is less chance of crashing.

        - --leader-elect=false

chaitushiva commented 4 years ago

I have also seen that most people run a single replica (replicas: 1) of CA and forget to check the default value of leader-elect, which is true according to the FAQ:

leader-elect: "Start a leader election client and gain leadership before executing the main loop. Enable this when running replicated components for high availability." Default: true

If this is set to false, as @tkbrex replied, the election process is disabled and we will not see this "lost master" error.

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 4 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 4 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 4 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/1653#issuecomment-714820699):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
gabegorelick commented 3 years ago

> I have also seen most people are running on replicas(1) of CA and forgetting to check the default value for leader-elect=true according to the FAQs

Is disabling leader election really recommended? All of the official examples I'm aware of specify replicas: 1 but keep the default value for leader-elect.

Even when running replicas: 1, wouldn't leader election be necessary during rolling updates of the CA deployment? Otherwise, I would think there'd be periods where you could have multiple CA pods stepping on each other.
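
One way to keep the at-most-one guarantee without relying on leader election is to switch the Deployment to the Recreate strategy, so the old pod is stopped before the new one starts. A sketch (this trades a short gap in autoscaling during updates for that guarantee):

```
# Stop the old cluster-autoscaler pod before starting the new one on updates
# (rollingUpdate has to be cleared when changing the strategy type)
kubectl -n kube-system patch deployment cluster-autoscaler \
  --patch '{"spec":{"strategy":{"type":"Recreate","rollingUpdate":null}}}'
```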

svaranasi-corporate commented 3 years ago

We're seeing the same issue on our EKS cluster with 40+ nodes, running 1.16.5.

I0111 09:12:15.398008       1 reflector.go:496] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:356: Watch close - *v1.StatefulSet total 0 items received

E0111 09:12:26.102040       1 leaderelection.go:356] Failed to update lock: Put https://172.20.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler: context deadline exceeded                                                                                                

I0111 09:12:27.499348       1 event.go:278] Event(v1.ObjectReference{Kind:"Lease", Namespace:"", Name:"", UID:"", APIVersion:"coordination.k8s.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' cluster-autoscaler-7564f9cf59-q287j stopped leading

I0111 09:12:27.698012       1 leaderelection.go:277] failed to renew lease kube-system/cluster-autoscaler: timed out waiting for the condition

F0111 09:12:28.597994       1 main.go:426] lost master

/reopen /remove-lifecycle rotten

k8s-ci-robot commented 3 years ago

@svaranasi-traderev: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubernetes/autoscaler/issues/1653#issuecomment-757974316):

> We're seeing the same issue on our EKS cluster with 40+ nodes, running 1.16.5.
>
> ```
> I0111 09:12:15.398008       1 reflector.go:496] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:356: Watch close - *v1.StatefulSet total 0 items received
> E0111 09:12:26.102040       1 leaderelection.go:356] Failed to update lock: Put https://172.20.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler: context deadline exceeded
> I0111 09:12:27.499348       1 event.go:278] Event(v1.ObjectReference{Kind:"Lease", Namespace:"", Name:"", UID:"", APIVersion:"coordination.k8s.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' cluster-autoscaler-7564f9cf59-q287j stopped leading
> I0111 09:12:27.698012       1 leaderelection.go:277] failed to renew lease kube-system/cluster-autoscaler: timed out waiting for the condition
> F0111 09:12:28.597994       1 main.go:426] lost master
> ```
>
> /reopen
> /remove-lifecycle rotten

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
yongzhang commented 3 years ago

I have the same issue on EKS 1.19.

rramadoss4 commented 3 years ago

We have the same issue in EKS 1.17

mkjmkumar commented 3 years ago

I am facing the same error with image cluster-autoscaler:v1.19.1 on EKS 1.19. After applying the suggestion below (--leader-elect=false), the pod went into CrashLoopBackOff and OOM.

/reopen /remove-lifecycle rotten

k8s-ci-robot commented 3 years ago

@mkjmkumar: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubernetes/autoscaler/issues/1653#issuecomment-874147819):

> I am facing same error with image:cluster-autoscaler:v1.19.1 on EKS 1.19
> After applying the suggestion below
> - --leader-elect=false
> but after that CrashLoopBackOff and OOM
>
> /reopen
> /remove-lifecycle rotten

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
mkjmkumar commented 3 years ago

Somehow below worked for me

julienlau commented 2 years ago

Maybe the issue was solved with a dirty hack and should be re-opened? As gabegorelick said before, I cannot see how it can be safe to use --leader-elect=false. Surely there is corrupted state written in etcd that should be sorted out properly?

> Is disabling leader election really recommended? All of the official examples I'm aware of specify replicas: 1 but keep the default value for leader-elect.

> Even when running replicas: 1, wouldn't leader election be necessary during rolling updates of the CA deployment? Otherwise, I would think there'd be periods where you could have multiple CA pods stepping on each other.

jkl373 commented 2 years ago

Seeing the issue with the autoscaler deployed in a GCP Kubernetes cluster. Is this resolved, or was it closed due to inactivity?

alex0z1 commented 2 years ago

I had this issue with the autoscaler, with the CPU limit set to 100m:

E0325 00:25:02.404766       1 leaderelection.go:361] Failed to update lock: Put "https://<API>/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": context deadline exceeded
I0325 00:25:02.404822       1 leaderelection.go:278] failed to renew lease kube-system/cluster-autoscaler: timed out waiting for the condition
F0325 00:25:02.404843       1 main.go:450] lost master
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc000182001, 0xc0002e01e0, 0x37, 0xed)
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1021 +0xb8
...
...

Setting the limit to 1 CPU solved the issue (it needs more CPU when it starts). So in my case it was CPU throttling, and it slowed down the autoscaler itself.
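
A minimal sketch of that change against the Deployment posted earlier in the thread (the value is what worked in this particular report, not an official recommendation):

```
# Raise the CPU limit so leader-election renewals are not starved by CPU throttling
kubectl -n kube-system set resources deployment cluster-autoscaler --limits=cpu=1
```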

mhemken-vts commented 2 years ago

We're seeing this on EKS 1.21.

NoamY8 commented 2 years ago

We're facing the same error with image cluster-autoscaler:v1.18 on EKS 1.20.

binlialfie commented 2 years ago

Same issue here using k8s.gcr.io/autoscaling/cluster-autoscaler:v1.25.0 on EKS 1.22.

decipher27 commented 1 year ago

Seeing a weird behaviour with cluster-autoscaler; not sure what exactly is causing this. Autoscaler version: 1.21.1. Noticed a few restarts; no resource limits/requests set for CPU.

Describing the cluster-autoscaler pod shows:


    State:          Running
      Started:      Fri, 28 Oct 2022 18:05:10 +0530
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Fri, 28 Oct 2022 17:56:37 +0530
      Finished:     Fri, 28 Oct 2022 18:02:19 +0530
    Ready:          True
    Restart Count:  36

----------------------------------
Logs: 
```
1028 12:32:10.414618       1 reflector.go:530] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Pod total 0 items received
I1028 12:32:10.414628       1 reflector.go:530] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:338: Watch close - *v1.Job total 8 items received
I1028 12:32:10.414642       1 reflector.go:530] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:329: Watch close - *v1.ReplicationController total 0 items received
I1028 12:32:10.413723       1 reflector.go:530] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:188: Watch close - *v1.Pod total 9 items received
I1028 12:32:10.414657       1 reflector.go:530] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.ReplicaSet total 13 items received
E1028 12:32:12.445308       1 leaderelection.go:325] error retrieving resource lock kube-system/cluster-autoscaler: Get "https://182..xs.x.x:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": dial tcp 182..xs.x.x:443: connect: connection refused
E1028 12:32:15.453424       1 leaderelection.go:325] error retrieving resource lock kube-system/cluster-autoscaler: Get "https://182..xs.x.x:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": dial tcp 172.20.186.41:443: connect: connection refused
E1028 12:32:17.469406       1 leaderelection.go:325] error retrieving resource lock kube-system/cluster-autoscaler: Get "https://182..xs.x.x:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": dial tcp 172.20.186.41:443: connect: connection refused
E1028 12:32:19.457301       1 leaderelection.go:325] error retrieving resource lock kube-system/cluster-autoscaler: Get "https://182..xs.x.x:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": dial tcp 182..xs.x.x:443: connect: connection refused
I1028 12:32:19.832254       1 leaderelection.go:278] failed to renew lease kube-system/cluster-autoscaler: timed out waiting for the condition
F1028 12:32:19.832296       1 main.go:450] lost master
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc00000e001, 0xc0010267e0, 0x37, 0xd7)
```

IzzyMusa commented 1 year ago

This issue happens when an instance can't update the lock object (a ConfigMap, a Lease resource, or an Endpoints object) within the lease duration. The default lease duration is 15 seconds. If an instance can't update the lock object within that time, it assumes it has lost the lease, a new leader is elected, and the pod terminates itself.
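
If the apiserver is slow rather than unreachable, relaxing the leader-election timings gives the renew loop more headroom. These are the standard client-go leader-election flags that cluster-autoscaler exposes; the values below are illustrative, not tuned recommendations:

```
# Defaults are 15s lease duration, 10s renew deadline, 2s retry period.
./cluster-autoscaler \
  --leader-elect=true \
  --leader-elect-lease-duration=60s \
  --leader-elect-renew-deadline=40s \
  --leader-elect-retry-period=10s
```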

nncloud2023 commented 1 year ago

Thanks for the fix and tips.

I had the same issue on an AWS cluster; the cluster autoscaler was running fine but then went into a loop of restarts, reporting errors similar to those quoted above. I increased the CPU from "0.1" to "0.5", redeployed the autoscaler, and now it is stable.

My suspicion is that the issue was caused by a scheduled scale-down that perhaps overloaded the resources.

Either way, moving on.