Closed suneeta-mall closed 4 years ago
This suggests either CA has a problem reaching the apiserver or the apiserver is unhealthy. Can you check if the same happens to other system components (e.g. kube-controller-manager)? They use the same generic Kubernetes leader election library that CA uses. Usually when I see this problem it's because of an overloaded apiserver, and it impacts multiple controllers.
@MaciekPytel That's what I thought, but the rest of the cluster, including all kube-system components, works fine. None of them has restarted.
To rule out version skew as the cause (Kubernetes 1.12.5 and Cluster Autoscaler 1.2.2), can you please try using a newer version of the autoscaler, per our recommended versions?
Ah ok, will try 1.12. Thanks for letting me know @aleksandra-malinowska, I will try out the new version.
@aleksandra-malinowska That did not help; I observed the same crash-loop behaviour. It was interesting to note that the problem surfaced only when the number of nodes the autoscaler managed was about 200 or more. Every time I brought the number of nodes down from the 200-1k range to 150 or fewer, the autoscaler recovered and functioned properly. The rest of the kube-system components remained functional throughout. Does this help in identifying where the bottleneck would be? I can confirm I have run various versions of the autoscaler ranging from 1.12.X to 1.13.1 and am seeing the same behaviour: the autoscaler goes into a crash frenzy when the number of nodes is >~200 and recovers when it comes down.
@suneeta-mall can you provide logs with strace for CA 1.12? It would make it easier to find the problem. Also, can you provide the deployment script? It would help to understand what options were enabled, whether you have memory limits, etc. It is also useful to have full logs.
Possible problems: too many queries, so kube-apiserver and etcd could not handle them. You can monitor the logs, CPU, and memory of etcd and the apiserver.
@miry Yeah sure, I will work on getting the logs. Here's the deployment script:
```yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
  name: cluster-autoscaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
  name: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["events", "endpoints"]
    verbs: ["create", "patch"]
  - apiGroups: [""]
    resources: ["pods/eviction"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["pods/status"]
    verbs: ["update"]
  - apiGroups: [""]
    resourceNames: ["cluster-autoscaler"]
    resources: ["endpoints"]
    verbs: ["get", "update"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["watch", "list", "get", "update"]
  - apiGroups: [""]
    resources:
      - pods
      - services
      - replicationcontrollers
      - persistentvolumeclaims
      - persistentvolumes
    verbs: ["watch", "list", "get"]
  - apiGroups: ["extensions"]
    resources: ["replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["watch", "list"]
  - apiGroups: ["apps"]
    resources: ["statefulsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["watch", "list", "get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
  name: cluster-autoscaler
  namespace: kube-system
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create"]
  - apiGroups: [""]
    resourceNames: ["cluster-autoscaler-status"]
    resources: ["configmaps"]
    verbs: ["delete", "get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
  name: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
  name: cluster-autoscaler
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: cluster-autoscaler
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      annotations:
        ad.datadoghq.com/nginx.logs: '[{"source":"autoscaler","service":"autoscaler"}]'
        prometheus.io/port: "8085"
        prometheus.io/scrape: "true"
        scheduler.alpha.kubernetes.io/tolerations: '[{"key":"dedicated", "value":"master"}]'
      labels:
        app: cluster-autoscaler
        k8s-addon: cluster-autoscaler.addons.k8s.io
    spec:
      containers:
        - command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-system-pods=false
            - --skip-nodes-with-local-storage=false
            - --expander=most-pods
            - --ignore-daemonsets-utilization=true
            - --ignore-mirror-pods-utilization=true
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/mycluster.com
          env:
            - name: AWS_REGION
              value: ap-southeast-2
          image: k8s.gcr.io/cluster-autoscaler:v1.13.1
          imagePullPolicy: Always
          livenessProbe:
            httpGet:
              path: /health-check
              port: 8085
          name: cluster-autoscaler
          readinessProbe:
            httpGet:
              path: /health-check
              port: 8085
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          volumeMounts:
            - mountPath: /etc/ssl/certs/ca-certificates.crt
              name: ssl-certs
              readOnly: true
      dnsPolicy: Default
      nodeSelector:
        kubernetes.io/role: master
      serviceAccountName: cluster-autoscaler
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
      volumes:
        - hostPath:
            path: /etc/ssl/certs/ca-certificates.crt
          name: ssl-certs
```
Are there any instructions for getting logs with strace when the issue results in a crash? I assume you mean wrapping the autoscaler command with strace and sending the logs; is that enough, or are there more specific details you are after?
As for possible problems, yes, I agree it's certainly possible that the apiserver is getting too many queries, but all other cluster resources, including kube-system resources and my own workload, seem to chug along okay. Only the autoscaler fails, to my knowledge. Is it possible the autoscaler is making too many calls and getting rate-limited? I have not seen much in the logs to indicate that, but I will keep an eye on it and update with what I find.
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten
@suneeta-mall sorry for the delay in responding here. Could you include the logs for the failed deploys, i.e., kubectl logs cluster-autoscaler-pod -n kube-system -p, please?
/remove-lifecycle rotten
@alejandrox1 The log is already attached in the description; see the "Lost master" part. The kube master and all other kube components seem to function fine except the autoscaler.
@suneeta-mall how did you create the cluster? Would you happen to have a copy of the code somewhere?
@alejandrox1 It was created with kops on AWS. Anything specific you are looking for? A very basic version can be created with the following snippet, which is the foundation of the k8s cluster used in this case. ETCD version 3.X:
```shell
kops create cluster ${NAME} \
  --cloud aws \
  --master-zones ${ZONES} \
  --master-size m4.xlarge \
  --node-size m4.xlarge \
  --zones $ZONES \
  --topology public \
  --networking flannel \
  --kubernetes-version 1.12.8 \
  --dns-zone XXX \
  --encrypt-etcd-storage
```
I had a similar issue on my cluster (using EKS):
```
F0802 00:10:57.242174 1 main.go:384] lost master
I0802 00:10:57.242128 1 leaderelection.go:249] failed to renew lease kube-system/cluster-autoscaler: failed to tryAcquireOrRenew context deadline exceeded
I0802 00:10:57.244543 1 factory.go:33] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"cluster-autoscaler", UID:"1fc342a0-4b63-11e9-b984-02635bc9a4cc", APIVersion:"v1", ResourceVersion:"27196690", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' cluster-autoscaler-aws-cluster-autoscaler-59fbbcb794-7kzfv stopped leading
```
Then the pod died and restarted. It seems to be a hiccup, but I would like to know why that happened.
We're running into similar issues on a very "scaly" EKS cluster here (quite a bit of up-and-down activity during the day); our other, more stable clusters do not seem to run into the issue. I've also noticed that this pod sometimes gets OOMKilled, so I'll try to add more memory first and will report back if it helped 👍
/remove-lifecycle stale
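If OOM kills turn out to be the trigger, the first knob is the container's memory settings in the Deployment. A sketch with illustrative values only (the manifest earlier in this thread used 300Mi; the right number depends on cluster size):

```yaml
# Illustrative fragment for the cluster-autoscaler container spec.
# 600Mi is an assumption for demonstration, not a recommended value.
resources:
  requests:
    cpu: 100m
    memory: 600Mi
  limits:
    cpu: 100m
    memory: 600Mi
```

Keeping requests equal to limits gives the pod Guaranteed QoS, which also makes it less likely to be evicted under node pressure.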
Happened for us as well. Cluster: v1.15.4, Cloud: Azure, Autoscaler version: 1.15.2.
I1123 18:51:25.870541 1 scale_down.go:771] No candidates for scale down
I1123 18:51:47.848093 1 leaderelection.go:281] failed to renew lease kube-system/cluster-autoscaler: failed to tryAcquireOrRenew context deadline exceeded
F1123 18:51:47.848126 1 main.go:406] lost master
goroutine 1 [running]: k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.stacks(0x4cb5f01, 0x3, 0xc000678000, 0x37) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:900 +0xb1 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.(loggingT).output(0x4cb5fa0, 0xc000000003, 0xc000477340, 0x4c19bb1, 0x7, 0x196, 0x0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:815 +0xe6 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.(loggingT).printf(0x4cb5fa0, 0x3, 0x2b62471, 0xb, 0x0, 0x0, 0x0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:727 +0x14e k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.Fatalf(...) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:1309 main.main.func3() /gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:406 +0x5c k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(LeaderElector).Run.func1(0xc00026c7e0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:193 +0x40 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(LeaderElector).Run(0xc00026c7e0, 0x2ff65e0, 0xc0001ca740) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:202 +0x10f k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.RunOrDie(0x2ff6620, 0xc0000cc018, 0x3026ee0, 0xc0002ec280, 0x37e11d600, 0x2540be400, 0x77359400, 0xc00040f3e0, 0x2c39cc8, 0x0, ...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:214 +0x96 main.main() /gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:394 +0x6ec
goroutine 19 [syscall, 241 minutes]: os/signal.signal_recv(0x0) /usr/local/go/src/runtime/sigqueue.go:139 +0x9c os/signal.loop() /usr/local/go/src/os/signal/signal_unix.go:23 +0x22 created by os/signal.init.0 /usr/local/go/src/os/signal/signal_unix.go:29 +0x41
goroutine 20 [chan receive]: k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.(*loggingT).flushDaemon(0x4cb5fa0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:1035 +0x8b created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.init.0 /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:404 +0x6c
goroutine 50 [IO wait, 241 minutes]: internal/poll.runtime_pollWait(0x7fc633d894f0, 0x72, 0x0) /usr/local/go/src/runtime/netpoll.go:182 +0x56 internal/poll.(pollDesc).wait(0xc0004fa198, 0x72, 0x0, 0x0, 0x2b5d3c7) /usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x9b internal/poll.(pollDesc).waitRead(...) /usr/local/go/src/internal/poll/fd_poll_runtime.go:92 internal/poll.(FD).Accept(0xc0004fa180, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0) /usr/local/go/src/internal/poll/fd_unix.go:384 +0x1ba net.(netFD).accept(0xc0004fa180, 0x28e75a0, 0x50, 0xc00038ef50) /usr/local/go/src/net/fd_unix.go:238 +0x42 net.(TCPListener).accept(0xc0000d01f8, 0xc000070700, 0x7fc633dd9b28, 0xc0002a8000) /usr/local/go/src/net/tcpsock_posix.go:139 +0x32 net.(TCPListener).AcceptTCP(0xc0000d01f8, 0x40dc28, 0x30, 0x28e75a0) /usr/local/go/src/net/tcpsock.go:247 +0x48 net/http.tcpKeepAliveListener.Accept(0xc0000d01f8, 0x28e75a0, 0xc000417710, 0x263bcc0, 0x4c9af30) /usr/local/go/src/net/http/server.go:3264 +0x2f net/http.(Server).Serve(0xc0003845b0, 0x2ff2ae0, 0xc0000d01f8, 0x0, 0x0) /usr/local/go/src/net/http/server.go:2859 +0x22d net/http.(Server).ListenAndServe(0xc0003845b0, 0xc0003845b0, 0xd) /usr/local/go/src/net/http/server.go:2797 +0xe4 net/http.ListenAndServe(...) /usr/local/go/src/net/http/server.go:3037 main.main.func1(0xc00038e000) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:359 +0x10d created by main.main /gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:356 +0x258
goroutine 12 [chan receive]: k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch.(*Broadcaster).loop(0xc0001cb6c0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch/mux.go:207 +0x66 created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch.NewBroadcaster /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch/mux.go:75 +0xcc
goroutine 151 [select, 2 minutes]: k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch.func2(0xc0004ec140, 0xc000186000, 0xc001306d20, 0xc0009515c0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:235 +0x150 created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:229 +0x246
goroutine 13 [chan receive]: k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.(eventBroadcasterImpl).StartEventWatcher.func1(0x2fc08c0, 0xc00051ac00, 0xc00040f3a0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:268 +0xa4 created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.(eventBroadcasterImpl).StartEventWatcher /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:266 +0x6e
goroutine 11 [runnable]: sync.(Cond).Broadcast(0xc0000d4380) /usr/local/go/src/sync/cond.go:73 +0x91 k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(clientConnReadLoop).processWindowUpdate(0xc000e81fb8, 0xc0009bb200, 0x0, 0x0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:2255 +0xf8 k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(clientConnReadLoop).run(0xc000e81fb8, 0x2c38850, 0xc00001dfb8) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:1727 +0x6ea k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(ClientConn).readLoop(0xc0000a3500) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:1607 +0x76 created by k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(*Transport).newClientConn /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:670 +0x637
goroutine 114 [select, 6 minutes]: k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).watchHandler(0xc0004ec0a0, 0x2fc0880, 0xc000d8e340, 0xc001173cc0, 0xc0000d2fc0, 0xc000186000, 0x0, 0x0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:329 +0x1d9 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch(0xc0004ec0a0, 0xc000186000, 0x0, 0x0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:300 +0x879 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).Run.func1() /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:124 +0x33 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000694f78) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001173f78, 0x3b9aca00, 0x0, 0x1, 0xc000186000) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.Until(...) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).Run(0xc0004ec0a0, 0xc000186000) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:123 +0x16b created by k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes.NewUnschedulablePodInNamespaceLister /gopath/src/k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:190 +0x1eb
goroutine 14 [select]: k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(ClientConn).roundTrip(0xc0000a3500, 0xc000737d00, 0x0, 0x0, 0x0, 0x0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:1081 +0x8cc k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(Transport).RoundTripOpt(0xc000144d80, 0xc000737d00, 0xc000807200, 0x6bda66, 0x0, 0xc00015f7a0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:444 +0x159 k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(Transport).RoundTrip(...) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:406 k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.noDialH2RoundTripper.RoundTrip(0xc000144d80, 0xc000737d00, 0xc0015b6c80, 0x5, 0xc00015f828) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:2536 +0x3f net/http.(Transport).roundTrip(0xc00015f680, 0xc000737d00, 0x248fe20, 0xc00041ef01, 0xc0008a6580) /usr/local/go/src/net/http/transport.go:430 +0xe90 net/http.(Transport).RoundTrip(0xc00015f680, 0xc000737d00, 0x2b645a5, 0xd, 0xc0008a6650) /usr/local/go/src/net/http/roundtrip.go:17 +0x35 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/transport.(bearerAuthRoundTripper).RoundTrip(0xc000442960, 0xc000737c00, 0x2b607b9, 0xa, 0xc0008a64d8) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/transport/round_trippers.go:317 +0x268 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/transport.(userAgentRoundTripper).RoundTrip(0xc00047c2e0, 0xc000737b00, 0xc00047c2e0, 0x0, 0x0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/transport/round_trippers.go:167 +0x1c2 net/http.send(0xc000737b00, 0x2fb5660, 0xc00047c2e0, 0x0, 0x0, 0x0, 0xc0004e5550, 0xc0008078d0, 0x1, 0x0) /usr/local/go/src/net/http/client.go:250 +0x461 
net/http.(Client).send(0xc000442990, 0xc000737b00, 0x0, 0x0, 0x0, 0xc0004e5550, 0x0, 0x1, 0xc000cc85a0) /usr/local/go/src/net/http/client.go:174 +0xfb net/http.(Client).do(0xc000442990, 0xc000737b00, 0x0, 0x0, 0x0) /usr/local/go/src/net/http/client.go:641 +0x279 net/http.(Client).Do(0xc000442990, 0xc000737b00, 0x0, 0x39, 0x2fb34c0) /usr/local/go/src/net/http/client.go:509 +0x35 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/rest.(Request).request(0xc001824300, 0xc000807b80, 0x0, 0x0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/rest/request.go:737 +0x330 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/rest.(Request).Do(0xc001824300, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/rest/request.go:809 +0xc5 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/kubernetes/typed/core/v1.(events).CreateWithEventNamespace(0xc00035bc20, 0xc001597180, 0xc00007fdd0, 0x14d9b8e, 0xc00007fdc8) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/kubernetes/typed/core/v1/event_expansion.go:57 +0x25d k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/kubernetes/typed/core/v1.(EventSinkImpl).Create(0xc00040f3c0, 0xc001597180, 0x280c8c0, 0xc001330320, 0x2ff6ea0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/kubernetes/typed/core/v1/event_expansion.go:155 +0x3d k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.recordEvent(0x2ff1220, 0xc00040f3c0, 0xc001597180, 0x0, 0x0, 0x0, 0xc000096000, 0xc00035bca0, 0x1) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:221 +0x12d k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.recordToSink(0x2ff1220, 0xc00040f3c0, 0xc001096780, 0xc00035bca0, 0xc00051ac30, 0x2540be400) 
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:189 +0x179 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.(eventBroadcasterImpl).StartRecordingToSink.func1(0xc001096780) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:171 +0x5c k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.(eventBroadcasterImpl).StartEventWatcher.func1(0x2fc08c0, 0xc00051ade0, 0xc00051adb0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:275 +0xe8 created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.(*eventBroadcasterImpl).StartEventWatcher /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:266 +0x6e
goroutine 128 [select, 2 minutes]: k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch.func2(0xc0004ec500, 0xc000186000, 0xc000c85b00, 0xc000a190e0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:235 +0x150 created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:229 +0x246
goroutine 150 [select, 2 minutes]: k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch.func2(0xc0004ec0a0, 0xc000186000, 0xc0001873e0, 0xc0000d2fc0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:235 +0x150 created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:229 +0x246
goroutine 83 [chan receive]: main.run(0xc00038e000) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:325 +0x1eb main.main.func2(0x2ff65e0, 0xc0001ca740) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:403 +0x2a created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:200 +0xec
goroutine 115 [select, 6 minutes]: k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).watchHandler(0xc0004ec140, 0x2fc0880, 0xc0001819c0, 0xc001175cc0, 0xc0009515c0, 0xc000186000, 0x0, 0x0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:329 +0x1d9 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch(0xc0004ec140, 0xc000186000, 0x0, 0x0) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:300 +0x879 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).Run.func1() /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:124 +0x33 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000364f78) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001175f78, 0x3b9aca00, 0x0, 0x1, 0xc000186000) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.Until(...) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).Run(0xc0004ec140, 0xc000186000) /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:123 +0x16b created by k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes.NewScheduledPodLister /gopath/src/k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:214 +0x1d9
We got the same kind of message, and we have a similar config to what @suneeta-mall posted in terms of memory and CPU requests (300Mi RAM, 100m CPU). I don't know the details, but my issue was solved by cleaning up all the completed pods from the cluster. I had about 5-8k pods, and even running kubectl get pods --all-namespaces took a long while. After deleting the unneeded pods, all is back to working correctly. Also, like @Pluies, I had 3 clusters with the same config but only one of them had the issue.
After v1.17.0, some permissions need to be added to rbac ClusterRole:
```yaml
- apiGroups: ["storage.k8s.io"]
  resources: ["storageclasses", "csinodes"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["watch", "list", "get", "create", "patch", "update"]
```
I have the same problem:

```
I0514 05:08:51.277989 1 leaderelection.go:281] failed to renew lease kube-system/cluster-autoscaler: failed to tryAcquireOrRenew context deadline exceeded
F0514 05:08:51.278016 1 main.go:409] lost master
```

I am running autoscaler version 1.15.6.
For what it's worth, if I set the following, it crashes less often. I think it really cuts down the Kubernetes API calls, so there is less chance of crashing:
- --leader-elect=false
I have also seen most people running a single replica of CA and forgetting to check the default value of leader-elect=true, according to the FAQs:

| Parameter | Description | Default |
|---|---|---|
| leader-elect | Start a leader election client and gain leadership before executing the main loop. Enable this when running replicated components for high availability | true |

If this is set to false, as replied by @tkbrex, the election process is disabled and we will not see this "lost master" error.
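For reference, disabling the election is a single extra argument on the container in the Deployment manifest. A sketch only; the surrounding args are abbreviated, and as later comments in this thread point out, turning the election off has trade-offs during rolling updates:

```yaml
# Fragment of the cluster-autoscaler container spec.
command:
  - ./cluster-autoscaler
  - --v=4
  - --leader-elect=false   # skips lease acquisition/renewal entirely
```

With the election disabled, CA never calls tryAcquireOrRenew, so the "lost master" fatal path cannot trigger; the cost is that nothing prevents two CA pods from acting at once if more than one is ever running.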
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close
@fejta-bot: Closing this issue.
> I have also seen most people are running on replicas(1) of CA and forgetting to check the default value for leader-elect=true according to the FAQs

Is disabling leader election really recommended? All of the official examples I'm aware of specify `replicas: 1` but keep the default value for `leader-elect`. Even when running `replicas: 1`, wouldn't leader election be necessary during rolling updates of the CA deployment? Otherwise, I would think there'd be periods where you could have multiple CA pods stepping on each other.
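One way to address the rolling-update concern without leader election is to have the Deployment kill the old pod before starting the new one. A sketch (this trades a brief gap in autoscaling for the guarantee that only one CA pod is ever active):

```yaml
# Fragment of the cluster-autoscaler Deployment spec.
spec:
  replicas: 1
  strategy:
    type: Recreate   # old pod terminates before the replacement starts
```

With the default `RollingUpdate` strategy, old and new pods briefly overlap, which is exactly the window leader election is meant to protect.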
We're seeing the same issue on our EKS cluster with 40+ nodes, running 1.16.5.

```
I0111 09:12:15.398008 1 reflector.go:496] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:356: Watch close - *v1.StatefulSet total 0 items received
E0111 09:12:26.102040 1 leaderelection.go:356] Failed to update lock: Put https://172.20.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler: context deadline exceeded
I0111 09:12:27.499348 1 event.go:278] Event(v1.ObjectReference{Kind:"Lease", Namespace:"", Name:"", UID:"", APIVersion:"coordination.k8s.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' cluster-autoscaler-7564f9cf59-q287j stopped leading
I0111 09:12:27.698012 1 leaderelection.go:277] failed to renew lease kube-system/cluster-autoscaler: timed out waiting for the condition
F0111 09:12:28.597994 1 main.go:426] lost master
```
/reopen /remove-lifecycle rotten
@svaranasi-traderev: You can't reopen an issue/PR unless you authored it or you are a collaborator.
I have the same issue on EKS 1.19.
We have the same issue in EKS 1.17
I am facing the same error with image cluster-autoscaler:v1.19.1 on EKS 1.19, after applying the suggestion below.
/reopen /remove-lifecycle rotten
@mkjmkumar: You can't reopen an issue/PR unless you authored it or you are a collaborator.
Somehow the below worked for me.
Maybe the issue was solved with a dirty hack and should be re-opened? As said before by gabegorelick, I cannot see how it can be safe to use --leader-elect=false. There is obviously a corrupted state written in etcd that should be sorted out properly?
> Is disabling leader election really recommended? All of the official examples I'm aware of specify `replicas: 1` but keep the default value for `leader-elect`. Even when running `replicas: 1`, wouldn't leader election be necessary during rolling updates of the CA deployment? Otherwise, I would think there'd be periods where you could have multiple CA pods stepping on each other.
Seeing the issue with the autoscaler deployed in a GCP Kubernetes cluster. Is this resolved, or did it get closed due to inactivity?
I had this issue with autoscaler , with CPU limit set to 100m
E0325 00:25:02.404766 1 leaderelection.go:361] Failed to update lock: Put "https://<API>/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": context deadline exceeded
I0325 00:25:02.404822 1 leaderelection.go:278] failed to renew lease kube-system/cluster-autoscaler: timed out waiting for the condition
F0325 00:25:02.404843 1 main.go:450] lost master
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc000182001, 0xc0002e01e0, 0x37, 0xed)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1021 +0xb8
...
...
Setting the limit to 1 CPU solved the issue (the autoscaler needs more CPU when it starts). So in my case it was CPU throttling, and it slowed down the autoscaler itself.
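The fix described above would look roughly like the following resources block in the CA container spec. The request value is an assumption for illustration; only the 100m-to-1-CPU limit change comes from this comment:

```yaml
# Sketch of the CPU-throttling fix: raise the CPU limit so the
# autoscaler is not throttled during its CPU-heavy startup phase
# and can renew its leader-election lease in time.
resources:
  requests:
    cpu: 100m       # illustrative value
    memory: 300Mi   # illustrative value
  limits:
    cpu: "1"        # was 100m; throttling at 100m caused missed lease renewals
    memory: 300Mi
```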
We're seeing this on EKS 1.21.
we're facing same error with image:cluster-autoscaler:v1.18 on EKS 1.20
same issue here using k8s.gcr.io/autoscaling/cluster-autoscaler:v1.25.0 in eks 1.22
Seeing weird behaviour with cluster-autoscaler, not sure what's exactly causing this. Autoscaler version: 1.21.1. Noticed a number of restarts; no resource limits/requests set for CPU.
Describe the cluster-autoscaler pod shows:
```
State:          Running
  Started:      Fri, 28 Oct 2022 18:05:10 +0530
Last State:     Terminated
  Reason:       Error
  Exit Code:    255
  Started:      Fri, 28 Oct 2022 17:56:37 +0530
  Finished:     Fri, 28 Oct 2022 18:02:19 +0530
Ready:          True
Restart Count:  36
```
----------------------------------
Logs:
```
1028 12:32:10.414618 1 reflector.go:530] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Pod total 0 items received
I1028 12:32:10.414628 1 reflector.go:530] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:338: Watch close - *v1.Job total 8 items received
I1028 12:32:10.414642 1 reflector.go:530] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:329: Watch close - *v1.ReplicationController total 0 items received
I1028 12:32:10.413723 1 reflector.go:530] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:188: Watch close - *v1.Pod total 9 items received
I1028 12:32:10.414657 1 reflector.go:530] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.ReplicaSet total 13 items received
E1028 12:32:12.445308 1 leaderelection.go:325] error retrieving resource lock kube-system/cluster-autoscaler: Get "https://182..xs.x.x:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": dial tcp 182..xs.x.x:443: connect: connection refused
E1028 12:32:15.453424 1 leaderelection.go:325] error retrieving resource lock kube-system/cluster-autoscaler: Get "https://182..xs.x.x:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": dial tcp 172.20.186.41:443: connect: connection refused
E1028 12:32:17.469406 1 leaderelection.go:325] error retrieving resource lock kube-system/cluster-autoscaler: Get "https://182..xs.x.x:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": dial tcp 172.20.186.41:443: connect: connection refused
E1028 12:32:19.457301 1 leaderelection.go:325] error retrieving resource lock kube-system/cluster-autoscaler: Get "https://182..xs.x.x:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": dial tcp 182..xs.x.x:443: connect: connection refused
I1028 12:32:19.832254 1 leaderelection.go:278] failed to renew lease kube-system/cluster-autoscaler: timed out waiting for the condition
F1028 12:32:19.832296 1 main.go:450] lost master
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc00000e001, 0xc0010267e0, 0x37, 0xd7)```
This issue happens when an instance can't update the lock object (a ConfigMap, Lease resource, or Endpoints object) within the lease duration. The default lease duration is 15 seconds. If an instance can't renew the lock within that time, it assumes it has lost leadership, a new leader is elected, and the old leader's pod terminates itself.
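These timings can be relaxed via the standard leader-election flags the autoscaler inherits from the generic Kubernetes leader election library. A sketch of the container args; the flag names exist in the leader-election config, but the values here are illustrative, not recommendations:

```yaml
# Container args sketch: loosen the leader-election deadlines so a
# briefly throttled pod or slow apiserver connection does not cost
# the instance its lease and crash-loop it.
command:
  - ./cluster-autoscaler
  - --leader-elect=true
  - --leader-elect-lease-duration=60s  # library default is 15s
  - --leader-elect-renew-deadline=40s  # library default is 10s
  - --leader-elect-retry-period=5s     # library default is 2s
```

Longer deadlines mean slower failover to a new leader, so this trades responsiveness for tolerance of transient slowness.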
Thanks for the fix and tips.
I had the same issue on an AWS cluster: the cluster autoscaler was running fine but then went into a loop of restarts, reporting errors similar to those quoted above. I increased the CPU from "0.1" to "0.5", redeployed the autoscaler, and now it is stable.
My suspicion is that a scheduled scale-down overloaded the resource, maybe.
Either way, moving on to the next thing.
I am running Kubernetes 1.12.5 with etcd3 and cluster-autoscaler v1.2.2 (on AWS), and my cluster is healthy with everything operational. After some scaling activity, the cluster autoscaler goes into a crash loop with the error below. Everything in the cluster seems to work perfectly fine, and the masters, cluster, and etcd are all healthy.
Is there any way to resurrect/resolve this issue?