kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0

Deleting and adding the same Ingress does not work #208

Closed olitheolix closed 6 years ago

olitheolix commented 6 years ago

Hi,

I am on quay.io/coreos/alb-ingress-controller:b30d8d28.

I can create the Ingress and it works as expected. I can then delete that ingress and all resources will be cleaned up. However, if I then create that exact same ingress a second time, it fails.

The controller will create a target group but no ALB because it cannot find listeners. Here is the output with obfuscated resource IDs:

E0914 23:47:00.247097       1 rule.go:170] [ALB-INGRESS] [foo/foo-ingress] [ERROR]: Failed Rule creation. Rule: {  Actions: [{      TargetGroupArn: "arn:aws:elasticloadbalancing:something:targetgroup/something",      Type: "forward"    }],  Conditions: [{      Field: "host-header",      Values: ["apidemo.foo.net"]    },{      Field: "path-pattern",      Values: ["/"]    }],  IsDefault: false,  Priority: "1"} | Error: ListenerNotFound: Listener 'arn:aws:elasticloadbalancing:something:listener/app/something' not found
E0914 23:47:00.247192       1 rule.go:170] [ALB-INGRESS] [foo/foo-ingress] [ERROR]:         status code: 400, request id: 0050c181-99a7-11e7-a5c8-210a34bc05db
E0914 23:47:00.247207       1 albingress.go:305] [ALB-INGRESS] [foo/foo-ingress] [ERROR]: Failed to reconcile state on this ingress
E0914 23:47:00.247216       1 albingress.go:307] [ALB-INGRESS] [foo/foo-ingress] [ERROR]:  - ListenerNotFound: Listener 'arn:aws:elasticloadbalancing:ap-southeast-1:541765997109:listener/app/something' not found
E0914 23:47:00.247222       1 albingress.go:307] [ALB-INGRESS] [foo/foo-ingress] [ERROR]:   status code: 400, request id: 0050c181-99a7-11e7-a5c8-210a34bc05db
I0914 23:48:36.192718       1 event.go:218] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"foo", Name:"foo-ingress", UID:"8b197fc5-99a6-11e7-a4fd-0614b9538e06", APIVersion:"extensions", ResourceVersion:"24225", FieldPath:""}): type: 'Normal' reason: 'DELETE' Ingress foo/foo-ingress
I0914 23:48:36.488710       1 controller.go:477] backend reload required

If I reload the controller it works again.

joshrosso commented 6 years ago

Haven't experienced this yet.

Can I get your ingress manifest?

olitheolix commented 6 years ago

I can reproduce this with the echoserver example:

kubectl create -f default-backend.yaml
kubectl create -f alb-ingress-controller.yaml   # Changed image to b30d8d28

kubectl create -f echoservice/echoserver-namespace.yaml
kubectl create -f echoservice/echoserver-deployment.yaml
kubectl create -f echoservice/echoserver-service.yaml
kubectl create -f echoservice/echoserver-ingress.yaml

# Wait until AWS declares ALB "active", then delete ingress.
kubectl delete -f echoservice/echoserver-ingress.yaml

# All resources got cleaned up and everything is fine at this point.

# Re-creating the ingress will trigger the error on my cluster
kubectl create -f echoservice/echoserver-ingress.yaml

joshrosso commented 6 years ago

@olitheolix Thanks for the details.

I'm working on writing a walkthrough/docs now using echoserver as an example.

I'll see if this is reproducible with the newest build.

joshrosso commented 6 years ago

@olitheolix I haven't been able to reproduce this. I've worked through a test example I've written a few times now and at the end I've deleted and recreated the ingress without issue.

A few things to note:

Let me know if you have the opportunity to test. As best I can tell, this was fixed in the newest version.

olitheolix commented 6 years ago

@joshrosso Thank you for the detailed writeup. I will try it tomorrow and let you know.

Just to confirm: the writeup references this controller which uses the v0.8 image. Is this really the version you want me to use? It is rather old and may fight with external-dns over DNS entries.

Also, in Step 5, you create the echoserver-service twice. Is this a typo?

joshrosso commented 6 years ago

@olitheolix yw.

> Is this really the version you want me to use?

No, this walkthrough is pointed at master so that when it's merged, it'll include all the updated manifests in continued_stabilization. Until we're merged into master, you'll want to use the examples from that branch.

> Also, in Step 5, you create the echoserver-service twice. Is this a typo?

Typo, removed. Thanks!

olitheolix commented 6 years ago

@joshrosso The problem persists, but I think I know what causes it. When I delete the Ingress resource, the controller will also delete the two security groups it created. However, one of them cannot be deleted immediately because the network interface it is attached to takes quite some time to shut down (up to 1 min for me). This is visible in the logs as a warning:

W0921 01:11:37.441349       1 loadbalancer.go:419] [ALB-INGRESS] [echoserver/echoserver] [WARN]: Failed in deletion of managed SG: DependencyViolation: resource sg-6b85310d has a dependent object
W0921 01:11:37.441367       1 loadbalancer.go:419] [ALB-INGRESS] [echoserver/echoserver] [WARN]:        status code: 400, request id: a33247fb-992a-414e-bf46-756659e323ff.

When I then re-create the ingress, all the errors I mentioned are due to a missing security group. The error does not occur if I manually delete the SG after the network interface has shut down. This error tends not to occur after I have done this two or three times manually; it may have something to do with AWS resources not being "warm", but that is pure speculation on my part.
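The manual cleanup described above could be scripted roughly as follows. This is only a workaround sketch, not something the controller does: the `retry` helper is a hypothetical name, and the attempt count and delay are guesses based on the "up to 1 min" observation.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND repeatedly until it exits 0 or MAX_ATTEMPTS runs have failed.
# Intended for calls that fail transiently, such as deleting a security group
# while its network interface is still shutting down (DependencyViolation).
retry() {
  local max_attempts="$1"
  local delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

For example, `retry 12 10 aws ec2 delete-security-group --group-id sg-6b85310d` would keep trying for about two minutes, using the security group ID from the warning above.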

If the above error does not occur, then another one might. When I create the same ingress immediately after deleting it, the controller, according to the logs, skips the ALB creation step and tries to modify a non-existent ALB.

As an example, normally the logs start with an ALB creation request like so:

I0921 01:32:41.626346       1 event.go:218] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"echoserver", Name:"echoserver", UID:"c29720b3-9e6c-11e7-946f-06fc05f45184", APIVersion:"extensions", ResourceVersion:"45925", FieldPath:""}): type: 'Normal' reason: 'CREATE' Ingress echoserver/echoserver
I0921 01:26:49.871295       1 controller.go:477] backend reload required
I0921 01:26:49.871432       1 loadbalancer.go:168] [ALB-INGRESS] [echoserver/echoserver] [INFO]: Start ELBV2 (ALB) creation.
I0921 01:26:49.871634       1 session.go:31] [ALB-INGRESS] [session] [INFO]: Request: ec2/&{DescribeSecurityGroups POST / %!s(*request.Paginator=<nil>) %!s(func(*request.Request) error=<nil>)}, Payload: {
I0921 01:26:49.871644       1 session.go:31] [ALB-INGRESS] [session] [INFO]:   Filters: [{

When I create the ingress immediately after I deleted it, the logs start with a modification request like so:

I0921 01:27:25.025645       1 event.go:218] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"echoserver", Name:"echoserver", UID:"05e1c63f-9e6c-11e7-946f-06fc05f45184", APIVersion:"extensions", ResourceVersion:"44842", FieldPath:""}): type: 'Normal' reason: 'CREATE' Ingress echoserver/echoserver
I0921 01:27:26.537962       1 controller.go:477] backend reload required
I0921 01:27:26.538196       1 loadbalancer.go:185] [ALB-INGRESS] [echoserver/echoserver] [INFO]: Start ELBV2 (ALB) modification.
I0921 01:27:26.538289       1 loadbalancer.go:360] [ALB-INGRESS] [echoserver/echoserver] [INFO]: Start ELBV2 tag modification.

This fails because no ALB has been created yet.

If I pause between deleting and creating the ingress, this error does not seem to occur.
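Instead of pausing by hand, the pause could be made explicit by polling until the old ALB is really gone before re-creating the ingress. The sketch below is an assumption-laden workaround: `wait_until_gone` is a hypothetical helper, and it treats any non-zero exit from the probe as "resource not found", so a credentials or network failure would be misread as "gone".

```shell
# wait_until_gone TIMEOUT_SECONDS INTERVAL_SECONDS PROBE [ARGS...]
# Polls PROBE until it exits non-zero, which is taken to mean the resource
# no longer exists. Returns 0 once the probe fails, or 1 if the probe still
# succeeds after roughly TIMEOUT_SECONDS of waiting.
wait_until_gone() {
  local timeout="$1"
  local interval="$2"
  shift 2
  local waited=0
  while "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "still present after ${waited}s: $*" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

A possible use, where `$ALB_NAME` is a placeholder for whatever name the controller gave the load balancer: `wait_until_gone 120 5 aws elbv2 describe-load-balancers --names "$ALB_NAME" && kubectl create -f echoservice/echoserver-ingress.yaml`.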

In case it is relevant: Tectonic cluster, t2.medium, ap-southeast-1.

olitheolix commented 6 years ago

@joshrosso The writeup works with a few minor tweaks:

Other than that, it works like a charm - thank you :+1:

joshrosso commented 6 years ago

Thanks for the detailed follow-up @olitheolix.

Regarding the delay in security group deletion: an update just went in today to make it more resilient (https://github.com/coreos/alb-ingress-controller/pull/197/commits/5de55937d8e232195568673a8cf9839606d3d94b).

> Ingress manifest: change scheme from internal to internet-facing

Thanks, done (https://github.com/coreos/alb-ingress-controller/commit/5751e141ab4b587de49c1df615e01e36e83f0bd2).

> Mention that the cluster name must be <11 characters (why is that, btw?).

We've just removed this limitation recently (https://github.com/coreos/alb-ingress-controller/issues/198).

> Mention tags for security group, just like you did for subnets.

Thanks, will take a look tomorrow.

olitheolix commented 6 years ago

@joshrosso Thank you. Any idea when/which image will contain these changes?

joshrosso commented 6 years ago

@olitheolix We're attempting to include all of these in the official 1.0 release.

This is being worked on as part of #197. While this work isn't done, there are some alpha cuts happening that you can see at quay.io/repository/coreos/alb-ingress-controller.

An example being the image: 1.0-alpha.1.