kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0

healthz endpoint never comes up despite service functioning #281

Closed kaseyalusi closed 6 years ago

kaseyalusi commented 6 years ago

Using the Helm chart, the readiness and liveness checks never pass because the <>:8080/healthz endpoint never comes up. <>:8080/metrics is live and the controller functions properly after removing the liveness check. We are able to create multiple services using ALBs and everything is groovy.

Based on the logs, it looks like one of the loops kicked off from controller.Configure is stuck. The last log message from ALB-INGRESS (with DEBUG turned on) is: log.go:48] [ALB-INGRESS] [controller] [INFO]: Ingress class set to alb

Setting up a port-forward to 8080 and curling http://localhost:8080/healthz (or state) returns a 404, which makes me think it never got to the step where those handlers are created.
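For reference, the probes in question are plain HTTP checks against the controller container, roughly the shape sketched below (illustrative values, not necessarily the exact defaults the chart renders):

```yaml
# Sketch of the probe configuration at issue; values are illustrative,
# not the exact defaults rendered by alb-ingress-controller-helm.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  timeoutSeconds: 1
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  timeoutSeconds: 1
```

Since /metrics answers on the same port while /healthz 404s, the HTTP server itself is up; it is specifically the health handlers that never get registered.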

alock commented 6 years ago

The chart we are using is alb-ingress-controller-helm-0.0.9.

willejs commented 6 years ago

@kaseyalusi I am looking into this now. I think we need more debug logs to figure out where it's getting stuck, and then I'll add some error handling to surface these issues in the future.

willejs commented 6 years ago

@kaseyalusi @alock I ended up running the latest tag, 1.0-alpha.7, which includes better logging to debug my issues. Give that a go and it will work. Once 1.0 is released I'm sure they will bump the version in the Helm chart.

kaseyalusi commented 6 years ago

Hey @willejs, thanks for looking into this. I deployed the 1.0-alpha.7 tag, but with that image the controller is getting 403s trying to use the AWS APIs... we are using kube2iam for authentication (see the sketch after the logs), and with the 0.8 tag everything works just fine.

I1208 18:26:16.307510       1 session.go:31] [ALB-INGRESS] [session] [INFO]: Request: elasticloadbalancing/&{DescribeLoadBalancers POST / %!s(*request.Paginator=&{[Marker] [NextMarker]  }) %!s(func(*request.Request) error=<nil>)}, Payload: {
I1208 18:26:16.307524       1 session.go:31] [ALB-INGRESS] [session] [INFO]:
I1208 18:26:16.307527       1 session.go:31] [ALB-INGRESS] [session] [INFO]: }
I1208 18:26:17.479102       1 session.go:31] [ALB-INGRESS] [session] [INFO]: Request: ec2/&{DescribeTags POST / %!s(*request.Paginator=&{[NextToken] [NextToken] MaxResults }) %!s(func(*request.Request) error=<nil>)}, Payload: {
I1208 18:26:17.479122       1 session.go:31] [ALB-INGRESS] [session] [INFO]:   Filters: [{
I1208 18:26:17.479126       1 session.go:31] [ALB-INGRESS] [session] [INFO]:       Name: "resource-id",
I1208 18:26:17.479129       1 session.go:31] [ALB-INGRESS] [session] [INFO]:       Values: ["sg-XXX"]
I1208 18:26:17.479132       1 session.go:31] [ALB-INGRESS] [session] [INFO]:     }]
I1208 18:26:17.479135       1 session.go:31] [ALB-INGRESS] [session] [INFO]: }
I1208 18:26:17.479271       1 session.go:31] [ALB-INGRESS] [session] [INFO]: Request: ec2/&{DescribeTags POST / %!s(*request.Paginator=&{[NextToken] [NextToken] MaxResults }) %!s(func(*request.Request) error=<nil>)}, Payload: {
I1208 18:26:17.479287       1 session.go:31] [ALB-INGRESS] [session] [INFO]:   Filters: [{
I1208 18:26:17.479294       1 session.go:31] [ALB-INGRESS] [session] [INFO]:       Name: "resource-id",
I1208 18:26:17.479303       1 session.go:31] [ALB-INGRESS] [session] [INFO]:       Values: ["sg-XXX"]
I1208 18:26:17.479323       1 session.go:31] [ALB-INGRESS] [session] [INFO]:     }]
I1208 18:26:17.479333       1 session.go:31] [ALB-INGRESS] [session] [INFO]: }
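For anyone comparing setups: kube2iam hands the controller its credentials via a role annotation on the pod template, roughly as sketched below (the role name is a placeholder), so the 403s presumably mean the annotated role is missing permissions that the 1.0-alpha.7 code paths now call.

```yaml
# Sketch of the kube2iam wiring on the controller Deployment.
# The annotation is the standard kube2iam mechanism; the role name is a placeholder.
spec:
  template:
    metadata:
      annotations:
        iam.amazonaws.com/role: alb-ingress-controller   # placeholder role name
```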
tyrannasaurusbanks commented 6 years ago

Hi @kaseyalusi, are you still having problems? I ask because I'm using 1.0-alpha.7 with kiam and everything looks good for us. We did have to add a bunch of IAM permissions to our kiam configuration when we upgraded from 0.X to 1.0-alpha.Y (mainly around WAF and some extra EC2 actions), so maybe try the newer build again with more open IAM permissions?

Side note: we did have some problems running an old version of the Helm chart. The liveness probe was failing because, we suspect, the AWS API calls made from /healthz were being rate limited. I've just submitted a PR to prompt a discussion on this (https://github.com/kubernetes-sigs/aws-alb-ingress-controller/pull/406).

FloatingGhost commented 6 years ago

I've run into a similar issue: the ALB is allocated correctly and the routes are set up, but the /healthz endpoint never comes up, so the pod gets endlessly restarted. I'll hack around it for now by adding a stupidly long timeout.
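Concretely, that workaround is just a very forgiving liveness probe, something like the sketch below (all values are arbitrary):

```yaml
# Workaround sketch: stretch the liveness probe so the pod survives long enough
# for /healthz to (eventually) start answering. All values are arbitrary.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 300
  periodSeconds: 60
  timeoutSeconds: 30
  failureThreshold: 10
```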

Looking at the AWS debug logs, it's trying to call waf-regional/GetWebACLForResource, but that endpoint doesn't exist in the region I'm running in (eu-west-2), which might be the root cause on my end.

bigkraig commented 6 years ago

There is discussion in #439 about disabling services that are not supported in some regions. In the meantime, I think WAF can be taken out of the health check.

I've modified how the healthz endpoint works in #439 to run the AWS tests on an interval, outside of the /healthz endpoint.