kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0

Pod restarts without clear reason (timeout when doing GET to configmap) #3126

Open gals-ma opened 1 year ago

gals-ma commented 1 year ago

Describe the bug: aws-lb-controller restarts unexpectedly (this has happened multiple times already) when doing a GET to the configmap.

Steps to reproduce: Unknown

Expected outcome: A retry mechanism

Environment

Additional Context: Pod logs before the restart:

E0327 15:21:45.005515       1 leaderelection.go:325] error retrieving resource lock kube-system/aws-load-balancer-controller-leader: Get "https://172.10.0.1:443/api/v1/namespaces/kube-system/configmaps/aws-load-balancer-controller-leader": context deadline exceeded
I0327 15:21:45.005561       1 leaderelection.go:278] failed to renew lease kube-system/aws-load-balancer-controller-leader: timed out waiting for the condition
{"level":"error","ts":1679930505.0055985,"logger":"setup","msg":"problem running manager","error":"leader election lost"}
{"level":"info","ts":1679930505.005635,"logger":"controller.service","msg":"Shutdown signal received, waiting for all workers to finish"}
oliviassss commented 1 year ago

@gals-ma, can you provide more info on this error? Before it occurred, were there any upgrades, deletions, or anything else? Can you provide more logs from before the error lines, so we can better understand the situation? You can also send the logs to k8s-alb-controller-triage AT amazon.com

gals-ma commented 1 year ago

> @gals-ma, can you provide more info on this error? Before it occurred, were there any upgrades, deletions, or anything else? Can you provide more logs from before the error lines, so we can better understand the situation? You can also send the logs to k8s-alb-controller-triage AT amazon.com

@oliviassss Nothing specific happened at the same time; it happens to us from time to time. I also saw this log: leaderelection.go:278] failed to renew lease kube-system/aws-load-balancer-controller-leader: timed out waiting for the condition

In addition, is there any reason to have 2 replicas of the lb-controller? Isn't there a problem with quorum when electing a leader?

gals-ma commented 1 year ago

> @gals-ma, can you provide more info on this error? Before it occurred, were there any upgrades, deletions, or anything else? Can you provide more logs from before the error lines, so we can better understand the situation? You can also send the logs to k8s-alb-controller-triage AT amazon.com

Can you also please share more information about this error? What exactly does it mean? Is there a way to increase the timeout of the leader-election check?

kishorj commented 1 year ago

@gals-ma, the two replicas are in active-standby mode. The issue is not with the controller itself; the API server is not responding to the controller's requests. It could be due to network connectivity issues between the controller and the API server, or to security group (SG) permissions preventing access. Does your controller recover eventually?
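
If it helps to narrow this down, one way to rule out the network path is to probe the API server from a pod scheduled alongside the controller. A minimal client-go sketch (a standalone check, not part of the controller; the 10s timeout roughly matches the leader-election renew deadline seen in the logs):

```go
// apiserver-probe: measure how long the API server takes to answer from in-cluster.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	// Roughly the leader-election renew deadline; requests slower than this
	// would also cause the controller to lose its lease.
	cfg.Timeout = 10 * time.Second

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	start := time.Now()
	v, err := client.Discovery().ServerVersion()
	if err != nil {
		fmt.Printf("API server did not answer within %s: %v\n", time.Since(start), err)
		return
	}
	fmt.Printf("API server %s answered in %s\n", v.GitVersion, time.Since(start))
}
```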

gals-ma commented 1 year ago

@kishorj @oliviassss So after talking with AWS, it turned out the issue was actually due to etcd being defragmented, and the load-balancer-controller times out reaching the etcd server.

So my questions are: 1) Why does the LB controller need to contact the etcd server? 2) Is there a way to increase the timeout (or add a retry mechanism) to avoid the restarts?

Thanks again.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

lqhl commented 10 months ago

I also encountered this issue. The pod gets restarted by k8s, but I'm not sure what is affected. /remove-lifecycle stale

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

mengqiy commented 5 months ago

> Why does the LB controller need to contact the etcd server?

Every k8s controller that uses leader election relies on the API server to elect a leader and renew the lease. The API server uses etcd as its backing store.
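
For context, this is roughly what that renewal loop looks like with client-go's leader-election helpers. A minimal sketch; the lock name, namespace, and durations are illustrative, not the controller's actual configuration:

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The lock is an ordinary API object (a Lease here), so every acquire/renew
	// is a request to the API server, which in turn reads and writes etcd.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "example-controller-leader", Namespace: "kube-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("HOSTNAME")},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // how long a lease is considered valid
		RenewDeadline: 10 * time.Second, // the leader must renew within this window
		RetryPeriod:   2 * time.Second,  // interval between renew attempts
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// start reconcilers here
			},
			OnStoppedLeading: func() {
				// If renewals time out past RenewDeadline (e.g. the API server or
				// etcd is slow), leadership is lost and the process exits; that is
				// the pod restart reported in this issue.
				os.Exit(1)
			},
		},
	})
}
```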

> Is there a way to increase the timeout (or add a retry mechanism) to avoid the restarts?

The ALB controller uses controller-runtime, which supports setting the lease duration and retry period.

It's expected to see a restart when the leader loses its lease.

Related discussion: https://github.com/kubernetes-sigs/controller-runtime/issues/1774#issuecomment-1011763856
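
For reference, those knobs are exposed through controller-runtime's manager options. A minimal sketch; the durations below are illustrative overrides, not the values the controller ships with (controller-runtime's defaults are a 15s lease, 10s renew deadline, and 2s retry period):

```go
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// newManager shows where the leader-election timing knobs live.
func newManager() (manager.Manager, error) {
	leaseDuration := 30 * time.Second // lease validity
	renewDeadline := 20 * time.Second // leader must renew before this elapses
	retryPeriod := 5 * time.Second    // interval between renew attempts

	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "aws-load-balancer-controller-leader",
		LeaderElectionNamespace: "kube-system",
		LeaseDuration:           &leaseDuration,
		RenewDeadline:           &renewDeadline,
		RetryPeriod:             &retryPeriod,
	})
}

func main() {
	mgr, err := newManager()
	if err != nil {
		panic(err)
	}
	_ = mgr // register controllers with mgr, then call mgr.Start(ctrl.SetupSignalHandler())
}
```

A longer renew deadline leaves more headroom for slow API server responses (for example during etcd defragmentation), at the cost of a slower failover to the standby replica.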

mengqiy commented 5 months ago

/remove-lifecycle stale

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

nileshgadgi commented 2 months ago

I'm also facing this issue with the ALB controller; here are the logs I found before the restart (thanks to pod-restart-info-collector):

2024-05-21T10:31:59 E0521 10:31:59       1 leaderelection.go:330] error retrieving resource lock kube-system/aws-load-balancer-controller-leader: Get "https://172.20.0.1:443/api/v1/namespaces/kube-system/configmaps/aws-load-balancer-controller-leader": context deadline exceeded
2024-05-21T10:31:59 I0521 10:31:59       1 leaderelection.go:283] failed to renew lease kube-system/aws-load-balancer-controller-leader: timed out waiting for the condition
2024-05-21T10:31:59 {"level":"error","ts":"2024-05-19T10:31:59Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}

Could someone help with this?

chahin-healthhelper commented 2 months ago

> I'm also facing this issue with the ALB controller; here are the logs I found before the restart (thanks to pod-restart-info-collector):
>
> 2024-05-21T10:31:59 E0521 10:31:59       1 leaderelection.go:330] error retrieving resource lock kube-system/aws-load-balancer-controller-leader: Get "https://172.20.0.1:443/api/v1/namespaces/kube-system/configmaps/aws-load-balancer-controller-leader": context deadline exceeded
> 2024-05-21T10:31:59 I0521 10:31:59       1 leaderelection.go:283] failed to renew lease kube-system/aws-load-balancer-controller-leader: timed out waiting for the condition
> 2024-05-21T10:31:59 {"level":"error","ts":"2024-05-19T10:31:59Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}
>
> Could someone help with this?

Same here! Any update on this?

My logs, BTW:

2024-05-22 14:26:59.832 {"level":"error","ts":"2024-05-22T13:26:59Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}

2024-05-22 14:26:59.830 I0522 13:26:59.830195       1 leaderelection.go:283] failed to renew lease kube-system/aws-load-balancer-controller-leader: timed out waiting for the condition

2024-05-22 14:26:59.829 E0522 13:26:59.829397       1 leaderelection.go:367] Failed to update lock: Put "https://10.100.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/aws-load-balancer-controller-leader": context deadline exceeded

2024-05-22 14:26:56.847 E0522 13:26:56.846970       1 leaderelection.go:367] Failed to update lock: etcdserver: request timed out

nileshgadgi commented 1 month ago

> I'm also facing this issue with the ALB controller; here are the logs I found before the restart (thanks to pod-restart-info-collector):
>
> 2024-05-21T10:31:59 E0521 10:31:59       1 leaderelection.go:330] error retrieving resource lock kube-system/aws-load-balancer-controller-leader: Get "https://172.20.0.1:443/api/v1/namespaces/kube-system/configmaps/aws-load-balancer-controller-leader": context deadline exceeded
> 2024-05-21T10:31:59 I0521 10:31:59       1 leaderelection.go:283] failed to renew lease kube-system/aws-load-balancer-controller-leader: timed out waiting for the condition
> 2024-05-21T10:31:59 {"level":"error","ts":"2024-05-19T10:31:59Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}
>
> Could someone help with this?

@oliviassss Can you help us with this issue? If there is anything we have to configure in AWS EKS, please suggest it. Thanks in advance!

k8s-triage-robot commented 2 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten