FlorisFeddema opened this issue 1 year ago
Note that this causes downtime when used with external-dns, because external-dns removes the DNS records when the old leader clears the ingress status.
/remove-kind bug
@longwuyuan I have updated the issue with more information and all parts from the template. Hope it is sufficient now 😄
Would you say that new connections from browsers/others fail, but established connections do not see disruption?
As far as I know there are no disruptions to existing connections.
The problem is that the leader thinks it's the last remaining controller pod, which triggers the cleanup procedure. It clears the ingress status (the load balancer IP) until the status is restored by the new leader. This is expected if, for example, you uninstall ingress-nginx. But in an upgrade scenario the ingress status should remain, because there are still controllers active that receive traffic and there is a controller ready to become the new leader.
The main problem is that other tools, for example external-dns, use the ingress status to register DNS records. The ingress is marked as inactive (no IP in its status), so external-dns deletes the records.
@rouke-broersma thanks. I think you are stating a very valid and reasonable point.
/triage accepted
The next step is to get actionable data inserted here, hence I asked if you know the precise behaviour of new connections vs. established connections.
This will have a huge impact in terms of the work needed, so we can help the developers by fleshing out the issue with small, precise details.
Then, either in this project's dev Slack channel or in the community meetings, we need to ask whether upgrades with multiple replicas can be non-disruptive at all to begin with (the leader going away, and other aspects).
@longwuyuan we have already identified where the regression happened and the fix seems simple: use a different set of labels to determine the existence of replicas during leader shutdown. This is already implemented; the check is just using the wrong labels currently.
Needed change:
The check here needs to be extended to also exclude app.kubernetes.io/version and helm.sh/chart, since these labels contain specific version numbers. These labels were not present on the pods when this check was written, so they were not considered; they were added in #9732, where I guess the impact on this check was missed.
Or the check should be modified to only consider specifically picked labels. Probably this set:
app.kubernetes.io/component: controller
app.kubernetes.io/instance: xxxx-ingress
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: ingress-nginx
app.kubernetes.io/part-of: ingress-nginx
This should still be stable enough, since you should not run multiple instances of ingress-nginx in the same namespace and this check is only performed in the controller's own namespace.
Or the chart label change should be reverted until a better fix is implemented.
@Gacko @strongjz which of the proposals would be most in line with your expectations regarding the previous PR?
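For illustration, here is a minimal client-go sketch of the second proposal (checking a fixed, version-independent label set). This is not the actual ingress-nginx code; the function shape and the release name "xxxx-ingress" are placeholders.

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/kubernetes"
)

func isRunningMultiplePods(ctx context.Context, client kubernetes.Interface, namespace string) (bool, error) {
	// Only use labels that do not change across chart/app versions,
	// deliberately excluding helm.sh/chart and app.kubernetes.io/version.
	selector := labels.Set{
		"app.kubernetes.io/component":  "controller",
		"app.kubernetes.io/instance":   "xxxx-ingress", // placeholder release name
		"app.kubernetes.io/managed-by": "Helm",
		"app.kubernetes.io/name":       "ingress-nginx",
		"app.kubernetes.io/part-of":    "ingress-nginx",
	}.AsSelector().String()

	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: selector,
	})
	if err != nil {
		return false, err
	}

	// More than one matching pod means another replica (old or new version)
	// is still around, so the leader must not clean up the ingress status
	// on shutdown.
	return len(pods.Items) > 1, nil
}
```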
Dunno if I'm eligible for deciding or giving input on this, but first of all I'd like to say that I proposed https://github.com/kubernetes/ingress-nginx/pull/9732 only to align the pod's labels with those of other resources, plus to improve the possibilities for selecting them by labels in daily business, by other services, or in special use cases we might not know of.
Anyway: adding those labels in the PR was non-breaking, removing them now might be. So simply reverting this PR also brings some risk. At the time of implementing it, I was not aware of the isRunningMultiplePods functionality, sorry for that!
I feel like additionally excluding the labels known to be version-specific and to break this functionality is a better decision than hard-coding a set of labels, as I guess the chance of the latter changing at some point in the future is bigger than for the former, simply given their number.
On the other hand, one could also configure their own labels to be added to the pods by using controller.labels or commonLabels from the values.yaml.
So while writing this I'm more and more feeling like simply using the same labels the service uses for selecting pods, i.e. ingress-nginx.selectorLabels, would be the best decision, especially because changing them someday definitely is a breaking change requiring extended effort and investigation, so the chances that one also thinks of isRunningMultiplePods might be higher.
One more thing: right now isRunningMultiplePods is bound to one using this Helm chart. Is this a requirement and the only supported way? Because otherwise one could also use their own chart and just the image provided here, which might end up with different labels and similar problems. So in the end it could also be a good idea to make those labels configurable, even if just by handing over the selector labels by default.
Agreed, ingress-nginx.selectorLabels makes the most sense; that would probably require a new startup option to pass those to the controller. I don't know how significant a change like that is for the ingress-nginx project.
In the intermediate term I would personally propose checking specifically for:
app.kubernetes.io/component: controller
app.kubernetes.io/instance: xxxx-ingress
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: ingress-nginx
app.kubernetes.io/part-of: ingress-nginx
Because these are well-known and, in principle, stable labels and values. This is even indicated in the comments on isRunningMultiplePods.
These labels are also all implemented in the static config provided by ingress-nginx:
https://github.com/kubernetes/ingress-nginx/tree/main/deploy/static/provider/cloud
With the exception of app.kubernetes.io/managed-by, but we could simply consider the label being absent as equal to the label also being absent on the other pods.
In the long term we could then supply the controller with the service's selectorLabels so they are fully in line in case someone creates their own service or doesn't use the provided Helm chart.
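A hypothetical sketch of how such selector labels could be handed to the controller as a startup option; the flag name --controller-pod-selector is invented here for illustration, not an existing ingress-nginx flag:

```go
package main

import (
	"flag"
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
)

func main() {
	// e.g. --controller-pod-selector="app.kubernetes.io/name=ingress-nginx,app.kubernetes.io/component=controller"
	raw := flag.String("controller-pod-selector", "",
		"label selector identifying controller pods (hypothetical flag)")
	flag.Parse()

	// Parse and validate the selector once at startup.
	selector, err := labels.Parse(*raw)
	if err != nil {
		panic(fmt.Errorf("invalid selector: %w", err))
	}

	// The parsed selector could then replace the pod's own label set in
	// the shutdown check, so user-added or version-specific pod labels no
	// longer influence the result.
	fmt.Println("using selector:", selector.String())
}
```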
Ehm, random shower thought: why not use the endpoints of the service configured to publish to the Ingress resources?
Like this: the load balancer address published to the Ingress resources reconciled by an Ingress NGINX Controller is configured via the --publish-service flag. So in the same way we could just check the endpoints of this service. If the leader is the last one remaining, it's fine to remove this load balancer address from the reconciled Ingress resources.
No labels, no exceptions involved, just plain Kubernetes concepts. Or am I missing something?
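A rough client-go sketch of that idea, assuming the controller already knows the --publish-service namespace and name; the function and variable names are illustrative, not actual ingress-nginx code:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// leaderIsLastRemaining returns true when the publish service has at most
// one ready endpoint address left, i.e. only the shutting-down leader itself.
func leaderIsLastRemaining(ctx context.Context, client kubernetes.Interface, namespace, service string) (bool, error) {
	ep, err := client.CoreV1().Endpoints(namespace).Get(ctx, service, metav1.GetOptions{})
	if err != nil {
		return false, err
	}

	ready := 0
	for _, subset := range ep.Subsets {
		ready += len(subset.Addresses)
	}

	// No labels involved: if other pods (old or new version) are still
	// ready behind the service, the load balancer address must stay on
	// the Ingress resources.
	return ready <= 1, nil
}
```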
@Gacko that makes sense, sounds like a good long-term solution. Much more stable than relying on user-editable labels.
@longwuyuan @strongjz would it be acceptable in the immediate term to modify isRunningMultiplePods to use the labels
app.kubernetes.io/component: controller
app.kubernetes.io/instance: xxxx-ingress
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: ingress-nginx
app.kubernetes.io/part-of: ingress-nginx
And in the long term to implement the change suggested by @Gacko?
The current implementation regularly interrupts the external-dns integration, and I'm surprised more people haven't noticed this.
If it's acceptable we can implement this change.
Hello,
I ended up here because last week we updated the ingress-nginx Helm chart from 4.7.1 to 4.8.3 and had downtime in all of our critical applications (more than 100 EKS clusters).
It took us quite a while to understand the issue, because we saw external-dns making Route 53 requests to delete and create records and we did not understand why. Now we do :-(
For now we will "fix it" by changing the external-dns policy to "upsert-only" instead of "sync"... which will leave lots of garbage in Route 53, but it seems the only way to cope with ingress-nginx upgrades without causing downtime... at least until this is fixed.
Is there any way to push this fix to a higher priority? This is a real dealbreaker for us.
Thanks
Hello,
as a temporary fix, could we just pass "--update-status-on-shutdown=false" somewhere in the Helm chart values?
What bad effects would this cause? (Because this way we could keep external-dns with the sync policy and not have garbage entries polluting our DNS records.)
Thanks
update-status-on-shutdown can be set via the chart values:

```yaml
controller:
  extraArgs:
    update-status-on-shutdown: 'false'
```
The consequence is that if the ingress controller is not running, your Ingress resources will still contain the external IP in their status and external-dns will not remove the DNS records. But it's unlikely that you would completely uninstall ingress-nginx yet keep the ingresses.
@rouke-broersma Thanks, that's what we thought as well, so for now it's a simple quick fix. However, it will need a restart of the deployment, which will cause the issue once again, right?
Unless you also update ingress-nginx at the same time, the selector labels should stay the same, so the 'am I the last leader' check should function correctly in this case and it should not be a problem. Unless you modify the labels yourself with something like an internal build number; then it might be a problem.
@rouke-broersma thank you so much. Hope the main issue gets fixed soon, but for now this fix will help.
Well, I'm not too deep into the controller's Go code yet, but I could give it a try.
/assign
Just wondering if there was any progress on this issue - we just experienced the same problem updating some development clusters (with only two replicas, so the chance of the leader being the last is high). Or is the recommendation to use the extra-args setting suggested above?
We use a deployment system that adds metadata labels on every deployment; these labels have different values on each deploy, so if we were to exclude some labels as a fix, I would still experience this issue (see https://github.com/kubernetes/ingress-nginx/issues/10475#issuecomment-1756870130).
Using static labels is more rigid and prone to break later, but as a more immediate fix it seems better to me.
https://github.com/kubernetes/ingress-nginx/issues/10475#issuecomment-1757195481
@Gacko Is there potential for a race here? Is it possible that the leader starts shutting down before another pod has come all the way up and updated the endpoints on the service?
It's a chicken-or-egg situation, it seems to me.
Since the main problem is external-dns temporarily messing around with the DNS records, this looks like a desired feature rather than a bug fix. I could be wrong. Please wait for comments from the developers.
/kind feature
Ingress-nginx removes the load balancer IP from the ingress, indicating that the ingress is no longer being served, so external-dns removes the DNS record to avoid dangling DNS records and so on. External-dns is fulfilling its purpose. However, ingress-nginx is still available, so the load balancer IP should not have been removed. The only reason this happened is that the logic in ingress-nginx to determine whether the leader is the last survivor has been flawed since a certain change. This is not only a bug, but also a regression. I really don't understand how you can see this any other way after all the discussion on this issue.
There is no chicken and egg; ingress-nginx is wrong here, not external-dns.
ok, the owner of the issue can type "/remove-kind feature" on one line and then type "/kind bug" on another line
/remove-kind feature
/kind bug
/retitle Ingress Status: Wrong labels when checking for rollover or shutdown.
+1
We ran into the same issue, with something I did not see mentioned here yet:
We run a controller in our cluster that adds ec2-instance-id and ec2-instance-type labels to all Pods when they are scheduled. I know this is a custom, non-standard thing, but it makes our ingress-nginx-controller always clean up on shutdown, no matter whether it is a new ReplicaSet or the same one, since no two pods will ever have matching labels due to zone anti-affinity (when we run two replicas, each replica will always have a different instance ID).
This then causes the same result, with external-dns removing the DNS records.
This is particularly bad when the ingress-nginx controller pods are shut down for "expected" reasons, like node scaling or node rotation.
Anyway, I also agree with the selectorLabels option; that would solve it for us by ignoring those extra labels we add to Pods.
+1 This happened to us in production :(
I am surprised there are not more people encountering this, given how popular both projects are.
+1 We experienced this issue as well in production.
Would anyone from the core team be able to comment on whether the workaround suggested by others is the recommended course of action, or whether a fix is in the pipeline?
During an ingress-nginx Helm upgrade with multiple replicas, the controller sometimes removes the load balancer IP status from all Ingress resources. After a few seconds the load balancer IPs are added to the ingresses again. This happens because the leader controller thinks it is the last pod left and should clean up the ingress status. It only happens if the last pod of the old version is the leader. The controller checks whether there are any pods left with the same labels, but Helm adds the 'helm.sh/chart=ingress-nginx-4.8.0' and 'app.kubernetes.io/version: 1.9.0' labels, and those change when upgrading the Helm chart. Because it finds no other controller pods, the leader thinks it is the last one and needs to clean up the ingress statuses.
What happened:
(check kubectl get ingress and there will be no address for the ingress). Between the last two steps there is no load balancer IP on the ingresses. Thus tools that use this information, like external-dns, will remove the DNS entry and the ingress domain will not be resolvable for a while.
This results in the following (no address on the ingress):
What you expected to happen:
We expect the leader controller to see the pods from the new version and know it is not the last pod. This means it should not clean up the ingress statuses.
We would expect, during the entire upgrade:
NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): v1.9.1
Kubernetes version (use kubectl version): v1.27.3
Environment:
- Kernel (e.g. uname -a): 5.15.126.1-1.cm2
- kubectl version
- kubectl get nodes -o wide
How was the ingress-nginx-controller installed:
We use ArgoCD with the ingress-nginx helm chart version 4.8.1 (or 4.8.0 before upgrade).
The following values are used:
Current State of the controller:
How to reproduce this issue:
Install the ingress-nginx helm chart:
Create an ingress and wait until it has a load balancer IP:
Upgrade the helm chart:
Check whether the ingress resource still has a load balancer IP during the upgrade:
We would expect:
We get:
Since it does not trigger on every upgrade, these steps may need to be repeated; the issue can also be reproduced in reverse order by downgrading the version.
Anything else we need to know:
The logs of the old leader pod show the following error:
During shutdown this check should be true, but it is not: https://github.com/kubernetes/ingress-nginx/blob/8ce61bdc6761f04d0ce617b9125255e9a147a20c/internal/ingress/status/status.go#L130
In the function it checks the labels on the pod and gets all pods with these exact labels. But when the version is changed, the new pods will have other labels and thus the check fails. It should only check the labels that do not change during version updates. The selector labels should probably work in this case, since these are also used by the controller service to route traffic to the controller pods. The labels on the pods in the Helm chart were changed in this change: https://github.com/kubernetes/ingress-nginx/pull/9732. But since it does not trigger on every upgrade and only affects services that use the load balancer IP from the ingress resources, not everyone will notice it on every upgrade.
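For illustration, a rough client-go sketch (not the actual status.go code) of the flawed selector construction described here, together with the exclusion-based fix discussed above:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/kubernetes"
)

func otherControllerPodsExist(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) (bool, error) {
	// Approximation of the current behavior: the check matches on *all*
	// labels of the running pod. With helm.sh/chart and
	// app.kubernetes.io/version on the pods, old and new pods never match
	// each other during an upgrade.
	//
	// Fix sketch: drop the version-specific keys before building the selector.
	stable := labels.Set{}
	for k, v := range pod.Labels {
		if k == "helm.sh/chart" || k == "app.kubernetes.io/version" {
			continue // version-specific, changes on every chart upgrade
		}
		stable[k] = v
	}

	pods, err := client.CoreV1().Pods(pod.Namespace).List(ctx, metav1.ListOptions{
		LabelSelector: stable.AsSelector().String(),
	})
	if err != nil {
		return false, err
	}

	// With the version labels excluded, the old leader still sees the new
	// pods and correctly skips the status cleanup.
	return len(pods.Items) > 1, nil
}
```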