kubernetes / ingress-nginx

Ingress NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

Ingress Status: Wrong labels when checking for rollover or shutdown. #10475

Open FlorisFeddema opened 1 year ago

FlorisFeddema commented 1 year ago

During an ingress-nginx helm upgrade with multiple replicas, the controller sometimes removes the loadbalancerIP status from all ingress resources. After a few seconds the loadbalancerIPs are added to the ingresses again. This happens because the leader controller thinks it is the last pod left and should clean up the ingress status. It only happens if the last pod of the old version is the leader. The controller checks whether there are any pods left with the same labels, but helm adds the 'helm.sh/chart=ingress-nginx-4.8.0' and 'app.kubernetes.io/version: 1.9.0' labels, and those change when the helm chart is upgraded. Because it finds no other controller pods, it thinks it is the last one and needs to clean up the ingress statuses.

What happened:

  1. We have 2 pods of chart version 4.8.0 and all ingresses have a status with a loadbalancerIP.
  2. We run the helm upgrade command to upgrade the pods to chart 4.8.1.
  3. This creates a new replicaset and does a rolling upgrade of the pods.
  4. A new pod (chart version 4.8.1) is created and becomes healthy.
  5. One of the old pods (chart version 4.8.0) gets removed by Kubernetes; the leader is still one of the old pods.
  6. The second new pod (chart version 4.8.1) is created and becomes healthy.
  7. The last old pod (chart version 4.8.0) gets removed by Kubernetes. Because there are no pods with the same labels anymore, it removes the ingress statuses. (This can be seen by running kubectl get ingress: there will be no address for the ingress.)
  8. One of the new pods (chart version 4.8.1) is elected as leader and updates the ingress statuses with the loadbalancerIP.

Between the last two steps there is no loadbalancerIP on the ingresses. Tools that use this information, such as external-dns, will therefore remove the DNS entry, and the ingress domain will not be resolvable for a while.

This results in the following (no address on the ingress):

kubectl get ingress
NAME                    CLASS   HOSTS                     ADDRESS         PORTS     AGE
test-ingress     nginx    example.com                    80, 443   90d

What you expected to happen:

We expect the leader controller to see the pods from the new version and know it is not the last pod, which means it should not clean up the ingress statuses.

  1. We have 2 pods of chart version 4.8.0 and all ingresses have a status with a loadbalancerIP.
  2. We run the helm upgrade command to upgrade the pods to chart 4.8.1.
  3. This creates a new replicaset and does a rolling upgrade of the pods.
  4. A new pod (chart version 4.8.1) is created and becomes healthy.
  5. One of the old pods (chart version 4.8.0) gets removed by Kubernetes; the leader is still one of the old pods.
  6. The second new pod (chart version 4.8.1) is created and becomes healthy.
  7. The last old pod (chart version 4.8.0) gets removed by Kubernetes. It sees the new pods and knows it is not the last one.
  8. One of the new pods (chart version 4.8.1) is elected as leader and updates the ingress statuses with the loadbalancerIP.

During the entire upgrade we would expect:

NAME                    CLASS   HOSTS                     ADDRESS         PORTS     AGE
test-ingress     nginx   *    zz.zz.zz.zz    80, 443   90d

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): v1.9.1

Kubernetes version (use kubectl version): v1.27.3

Environment:

kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.4", GitCommit:"fa3d7990104d7c1f16943a67f11b154b71f6a132", GitTreeState:"clean", BuildDate:"2023-07-19T12:20:54Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"8b6cfe2c7c54ae110e0c2dbcc52b468bc08bf5f6", GitTreeState:"clean", BuildDate:"2023-07-28T22:18:46Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}

kubectl get nodes -o wide
NAME                                 STATUS   ROLES   AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE            KERNEL-VERSION     CONTAINER-RUNTIME
aks-d2adsweu1-38256127-vmss000000    Ready    agent   7d22h   v1.27.3   10.110.17.100   <none>        CBL-Mariner/Linux   5.15.126.1-1.cm2   containerd://1.6.22
aks-d2adsweu1-38256127-vmss000001    Ready    agent   7d22h   v1.27.3   10.110.18.39    <none>        CBL-Mariner/Linux   5.15.126.1-1.cm2   containerd://1.6.22
aks-d2adsweu1-38256127-vmss000002    Ready    agent   5d20h   v1.27.3   10.110.16.10    <none>        CBL-Mariner/Linux   5.15.126.1-1.cm2   containerd://1.6.22
aks-systemweu1-22699295-vmss000000   Ready    agent   8d      v1.27.3   10.110.16.114   <none>        CBL-Mariner/Linux   5.15.126.1-1.cm2   containerd://1.6.22
aks-systemweu1-22699295-vmss000001   Ready    agent   8d      v1.27.3   10.110.16.223   <none>        CBL-Mariner/Linux   5.15.126.1-1.cm2   containerd://1.6.22

How was the ingress-nginx-controller installed:

We use ArgoCD with the ingress-nginx helm chart version 4.8.1 (or 4.8.0 before upgrade).

The following values are used:

controller:
  resources:
    requests:
      memory: 150Mi
    limits:
      memory: 300Mi
  replicaCount: 2
  allowSnippetAnnotations: true
  service:
    externalTrafficPolicy: 'Local'
    external:
      enabled: true

Current State of the controller:

kubectl describe ingressclasses
Name:         nginx
Labels:       app.kubernetes.io/component=controller
              app.kubernetes.io/instance=zzz-zzz-ingress
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=ingress-nginx
              app.kubernetes.io/part-of=ingress-nginx
              app.kubernetes.io/version=1.9.1
              helm.sh/chart=ingress-nginx-4.8.1
Annotations:  argocd.argoproj.io/tracking-id: zzz-zzz-ingress:networking.k8s.io/IngressClass:ingresscontroller/nginx
              ingressclass.kubernetes.io/is-default-class: true
Controller:   k8s.io/ingress-nginx
Events:       <none>

kubectl -n ingresscontroller get all
NAME                                                                  READY   STATUS    RESTARTS   AGE     IP              NODE                                 NOMINATED NODE   READINESS GATES
pod/zzz-zzz-ingress-ingress-nginx-controller-7849497db8gmxmz   1/1     Running   0          5d20h   10.110.16.227   aks-systemweu1-22699295-vmss000001   <none>           <none>
pod/zzz-zzz-ingress-ingress-nginx-controller-7849497db8pjwsf   1/1     Running   0          5d20h   10.110.16.188   aks-systemweu1-22699295-vmss000000   <none>           <none>

NAME                                                                TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)                      AGE    SELECTOR
service/zzz-zzz-ingress-ingress-nginx-controller             LoadBalancer   10.0.18.29     zz.zz.zz.zz 80:31057/TCP,443:31848/TCP   386d   app.kubernetes.io/component=controller,app.kubernetes.io/instance=zzz-zzz-ingress,app.kubernetes.io/name=ingress-nginx
service/zzz-zzz-ingress-ingress-nginx-controller-admission   ClusterIP      10.0.49.26     <none>          443/TCP                      388d   app.kubernetes.io/component=controller,app.kubernetes.io/instance=zzz-zzz-ingress,app.kubernetes.io/name=ingress-nginx
service/zzz-zzz-ingress-ingress-nginx-controller-metrics     ClusterIP      10.0.155.249   <none>          10254/TCP                    388d   app.kubernetes.io/component=controller,app.kubernetes.io/instance=zzz-zzz-ingress,app.kubernetes.io/name=ingress-nginx

NAME                                                              READY   UP-TO-DATE   AVAILABLE   AGE    CONTAINERS   IMAGES
  SELECTOR
deployment.apps/zzz-zzz-ingress-ingress-nginx-controller   2/2     2            2           388d   controller   registry/registry.k8s.io/ingress-nginx/controller:v1.9.1   app.kubernetes.io/component=controller,app.kubernetes.io/instance=zzz-zzz-ingress,app.kubernetes.io/name=ingress-nginx

NAME                                                                         DESIRED   CURRENT   READY   AGE     CONTAINERS   IMAGES
         SELECTOR
replicaset.apps/zzz-zzz-ingress-ingress-nginx-controller-7849497db8   2         2         2       5d20h   controller   registry/registry.k8s.io/ingress-nginx/controller:v1.9.1   app.kubernetes.io/component=controller,app.kubernetes.io/instance=zzz-zzz-ingress,app.kubernetes.io/name=ingress-nginx,pod-template-hash=7849497db8

kubectl -n ingresscontroller describe po zzz-zzz-ingress-ingress-nginx-controller-7849497db8
Name:                 zzz-zzz-ingress-ingress-nginx-controller-7849497db8gmxmz
Namespace:            ingresscontroller
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      zzz-zzz-ingress-ingress-nginx
Node:                 aks-systemweu1-22699295-vmss000001/10.110.16.223
Start Time:           Wed, 04 Oct 2023 16:48:00 +0200
Labels:               app.kubernetes.io/component=controller
                      app.kubernetes.io/instance=zzz-zzz-ingress
                      app.kubernetes.io/managed-by=Helm
                      app.kubernetes.io/name=ingress-nginx
                      app.kubernetes.io/part-of=ingress-nginx
                      app.kubernetes.io/version=1.9.1
                      helm.sh/chart=ingress-nginx-4.8.1
                      pod-template-hash=7849497db8
Annotations:          kubectl.kubernetes.io/restartedAt: 2023-07-11T13:59:40Z
                      zzz/logging-module: nginx
Status:               Running
SeccompProfile:       RuntimeDefault
IP:                   10.110.16.227
IPs:
  IP:           10.110.16.227
Controlled By:  ReplicaSet/zzz-zzz-ingress-ingress-nginx-controller-7849497db8
Containers:
  controller:
    Container ID:  containerd://be51dd263b59a482b9ce53843d956b1d9ebb2d4a88678fa845cd426f19132c3c
    Image:         registry/registry.k8s.io/ingress-nginx/controller:v1.9.1
    Image ID:      registry/registry.k8s.io/ingress-nginx/controller@sha256:65c804ad254ac378d316919687b782850dd36c1f677f1115a1db29da59376f18
    Ports:         80/TCP, 443/TCP, 10254/TCP, 8443/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP
    Args:
      /nginx-ingress-controller
      --publish-service=$(POD_NAMESPACE)/zzz-zzz-ingress-ingress-nginx-controller
      --election-id=zzz-zzz-ingress-ingress-nginx-leader
      --controller-class=k8s.io/ingress-nginx
      --ingress-class=nginx
      --configmap=$(POD_NAMESPACE)/zzz-zzz-ingress-ingress-nginx-controller
      --validating-webhook=:8443
      --validating-webhook-certificate=/usr/local/certificates/cert
      --validating-webhook-key=/usr/local/certificates/key
      --ingress-class-by-name=true
      --default-ssl-certificate=certmanager/default-public-wildcard-tls-secret
      --enable-ssl-passthrough=false
    State:          Running
      Started:      Wed, 04 Oct 2023 16:48:01 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  350Mi
    Requests:
      cpu:      100m
      memory:   175Mi
    Liveness:   http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=5
    Readiness:  http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       zzz-zzz-ingress-ingress-nginx-controller-7849497db8gmxmz (v1:metadata.name)
      POD_NAMESPACE:  ingresscontroller (v1:metadata.namespace)
      LD_PRELOAD:     /usr/local/lib/libmimalloc.so
    Mounts:
      /usr/local/certificates/ from webhook-cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-stjv4 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  webhook-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  zzz-zzz-ingress-ingress-nginx-admission
    Optional:    false
  kube-api-access-stjv4:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   3607
    ConfigMapName:            kube-root-ca.crt
    ConfigMapOptional:        <nil>
    DownwardAPI:              true
QoS Class:                    Burstable
Node-Selectors:               zzz/workload=zzz
                              kubernetes.io/os=linux
Tolerations:                  CriticalAddonsOnly op=Exists
                              node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                              node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints:  kubernetes.io/hostname:ScheduleAnyway when max skew 1 is exceeded for selector app.kubernetes.io/component=controller,app.kubernetes.io/instance=zzz-zzz-ingress,app.kubernetes.io/name=ingress-nginx
                              topology.kubernetes.io/zone:DoNotSchedule when max skew 1 is exceeded for selector app.kubernetes.io/component=controller,app.kubernetes.io/instance=zzz-zzz-ingress,app.kubernetes.io/name=ingress-nginx
Events:                       <none>

Name:                 zzz-zzz-ingress-ingress-nginx-controller-7849497db8pjwsf
Namespace:            ingresscontroller
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      zzz-zzz-ingress-ingress-nginx
Node:                 aks-systemweu1-22699295-vmss000000/10.110.16.114
Start Time:           Wed, 04 Oct 2023 16:48:11 +0200
Labels:               app.kubernetes.io/component=controller
                      app.kubernetes.io/instance=zzz-zzz-ingress
                      app.kubernetes.io/managed-by=Helm
                      app.kubernetes.io/name=ingress-nginx
                      app.kubernetes.io/part-of=ingress-nginx
                      app.kubernetes.io/version=1.9.1
                      helm.sh/chart=ingress-nginx-4.8.1
                      pod-template-hash=7849497db8
Annotations:          kubectl.kubernetes.io/restartedAt: 2023-07-11T13:59:40Z
                     zzz/logging-module: nginx
Status:               Running
SeccompProfile:       RuntimeDefault
IP:                   10.110.16.188
IPs:
  IP:           10.110.16.188
Controlled By:  ReplicaSet/zzz-zzz-ingress-ingress-nginx-controller-7849497db8
Containers:
  controller:
    Container ID:  containerd://f931f28240cabc99c4cd417125d6330af42d0d49d96188ccf5048181648bf404
    Image:         registry/registry.k8s.io/ingress-nginx/controller:v1.9.1
    Image ID:     registry/registry.k8s.io/ingress-nginx/controller@sha256:65c804ad254ac378d316919687b782850dd36c1f677f1115a1db29da59376f18
    Ports:         80/TCP, 443/TCP, 10254/TCP, 8443/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP
    Args:
      /nginx-ingress-controller
      --publish-service=$(POD_NAMESPACE)/zzz-zzz-ingress-ingress-nginx-controller
      --election-id=zzz-zzz-ingress-ingress-nginx-leader
      --controller-class=k8s.io/ingress-nginx
      --ingress-class=nginx
      --configmap=$(POD_NAMESPACE)/zzz-zzz-ingress-ingress-nginx-controller
      --validating-webhook=:8443
      --validating-webhook-certificate=/usr/local/certificates/cert
      --validating-webhook-key=/usr/local/certificates/key
      --ingress-class-by-name=true
      --default-ssl-certificate=certmanager/default-public-wildcard-tls-secret
      --enable-ssl-passthrough=false
    State:          Running
      Started:      Wed, 04 Oct 2023 16:48:12 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  350Mi
    Requests:
      cpu:      100m
      memory:   175Mi
    Liveness:   http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=5
    Readiness:  http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       zzz-zzz-ingress-ingress-nginx-controller-7849497db8pjwsf (v1:metadata.name)
      POD_NAMESPACE:  ingresscontroller (v1:metadata.namespace)
      LD_PRELOAD:     /usr/local/lib/libmimalloc.so
    Mounts:
      /usr/local/certificates/ from webhook-cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qlpls (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  webhook-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  zzz-zzz-ingress-ingress-nginx-admission
    Optional:    false
  kube-api-access-qlpls:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   3607
    ConfigMapName:            kube-root-ca.crt
    ConfigMapOptional:        <nil>
    DownwardAPI:              true
QoS Class:                    Burstable
Node-Selectors:              zzz/workload=zzz
                              kubernetes.io/os=linux
Tolerations:                  CriticalAddonsOnly op=Exists
                              node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                              node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints:  kubernetes.io/hostname:ScheduleAnyway when max skew 1 is exceeded for selector app.kubernetes.io/component=controller,app.kubernetes.io/instance=zzz-zzz-ingress,app.kubernetes.io/name=ingress-nginx
                              topology.kubernetes.io/zone:DoNotSchedule when max skew 1 is exceeded for selector app.kubernetes.io/component=controller,app.kubernetes.io/instance=zzz-zzz-ingress,app.kubernetes.io/name=ingress-nginx
Events:                       <none>

kubectl -n ingresscontroller describe svc zzz-zzz-ingress-ingress-nginx-controller
Name:                     zzz-zzz-ingress-ingress-nginx-controller
Namespace:                ingresscontroller
Labels:                   app.kubernetes.io/component=controller
                          app.kubernetes.io/instance=zzz-zzz-ingress
                          app.kubernetes.io/managed-by=Helm
                          app.kubernetes.io/name=ingress-nginx
                          app.kubernetes.io/part-of=ingress-nginx
                          app.kubernetes.io/version=1.9.1
                          helm.sh/chart=ingress-nginx-4.8.1
Annotations:              argocd.argoproj.io/tracking-id: zzz-zzz-ingress:/Service:ingresscontroller/zzz-zzz-ingress-ingress-nginx-controller
                          service.beta.kubernetes.io/azure-load-balancer-ipv4: zz.zz.zz.zz
                          service.beta.kubernetes.io/azure-load-balancer-resource-group: zzz
Selector:                 app.kubernetes.io/component=controller,app.kubernetes.io/instance=zzz-zzz-ingress,app.kubernetes.io/name=ingress-nginx
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.0.18.29
IPs:                      10.0.18.29
LoadBalancer Ingress:     zz.zz.zz.zz
Port:                     http  80/TCP
TargetPort:               http/TCP
NodePort:                 http  31057/TCP
Endpoints:                10.110.16.188:80,10.110.16.227:80
Port:                     https  443/TCP
TargetPort:               https/TCP
NodePort:                 https  31848/TCP
Endpoints:                10.110.16.188:443,10.110.16.227:443
Session Affinity:         None
External Traffic Policy:  Local
HealthCheck NodePort:     31796
Events:                   <none>

How to reproduce this issue:

Install the ingress-nginx helm chart:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx --set controller.replicaCount=2 --version 4.8.0

Create an ingress and wait until it has a loadbalancerIP:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: test-ingress
spec:
  defaultBackend:
    service:
      name: test
      port:
        number: 80

Upgrade the helm chart:

helm upgrade ingress-nginx ingress-nginx/ingress-nginx --set controller.replicaCount=2 --version 4.8.1

Check whether the ingress resource still has a loadbalancerIP during the upgrade:

k get ingress

We would expect:

NAME                    CLASS   HOSTS                     ADDRESS         PORTS     AGE
test-ingress     nginx   *    zz.zz.zz.zz     80, 443   90d

We get:

NAME                    CLASS   HOSTS                     ADDRESS         PORTS     AGE
test-ingress     nginx    *                 80, 443   90d

Since the issue does not trigger on every upgrade, these steps may need to be repeated; it can also be reproduced by downgrading the version.

Anything else we need to know:

The logs of the old leader pod show the following error:

I1004 12:29:58.913037       7 status.go:135] "removing value from ingress status" address=[{"ip":"xx.xx.xx.xx"}]

During shutdown this check should be true, but it is not: https://github.com/kubernetes/ingress-nginx/blob/8ce61bdc6761f04d0ce617b9125255e9a147a20c/internal/ingress/status/status.go#L130

In this function the controller reads the labels on its own pod and lists all pods with exactly these labels:

app.kubernetes.io/component: controller
app.kubernetes.io/instance: xxx-ingress
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: ingress-nginx
app.kubernetes.io/part-of: ingress-nginx
app.kubernetes.io/version: 1.9.0
helm.sh/chart: ingress-nginx-4.8.0

But when the version is changed, the new pods have different labels and the check fails. It should only check the labels that do not change between versions. The selector labels would probably work here, since they are also used by the controller service to route traffic to the controller pods. The labels on the pods in the helm chart were changed in https://github.com/kubernetes/ingress-nginx/pull/9732. Since the issue does not trigger on every upgrade and only affects tools that consume the loadbalancerIP from the ingress resources, not everyone will notice it.
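For illustration, here is a minimal client-go style sketch of the kind of check described above (assumed names, not the actual controller source): the shutting-down leader builds a selector from its own complete label set and counts matching pods, so any label that changes with the chart version (helm.sh/chart, app.kubernetes.io/version) hides the new replicas from it.

package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/kubernetes"
)

// otherReplicasExist lists pods carrying *all* of this pod's labels and
// reports whether more than one is left. When helm.sh/chart and
// app.kubernetes.io/version change during an upgrade, the new pods no
// longer match the old leader's selector, the count drops to one, and the
// leader wrongly concludes it is the last replica.
func otherReplicasExist(ctx context.Context, client kubernetes.Interface, namespace string, podLabels map[string]string) (bool, error) {
	selector := labels.SelectorFromSet(podLabels).String()
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: selector,
	})
	if err != nil {
		return false, err
	}
	return len(pods.Items) > 1, nil
}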

rouke-broersma commented 1 year ago

Note that this causes downtime when used with external-dns, because external-dns removes the DNS record as soon as the old leader clears the status.

longwuyuan commented 1 year ago

/remove-kind bug

FlorisFeddema commented 1 year ago

@longwuyuan I have updated the issue with more information and all parts from the template. Hope it is sufficient now 😄

longwuyuan commented 1 year ago

Would you say that new connections from browsers/others fail but established connections do not see disruption?

rouke-broersma commented 1 year ago

Would you say that new connections from browsers/others fail but established connections do not see disruption?

As far as I know there are no disruptions for existing connections.

The problem is that the leader thinks it's the last leader, which triggers the cleanup procedure. It clears the ingress status (loadbalancerIP) until it's restored by the new leader. This is expected if, for example, you uninstall ingress-nginx. But in an upgrade scenario the ingress status should remain, because there are still controllers active that receive traffic and there is a controller ready to become the new leader.

The main problem is that other tools, for example external-dns, use the ingress status to register DNS. The ingress is marked as inactive (no IP in the status), so external-dns deletes the records.

longwuyuan commented 1 year ago

@rouke-broersma thanks. I think you are stating a very valid and reasonable point.

/triage accepted

Next step is to get actionable data inserted here. Hence I asked if you know the precise behaviour of new-connections vs established connections.

This will have a huge impact in terms of work needed, so we can help the developers by fleshing out the issue with small, detailed data.

Then, either in the dev Slack channel for this project or in the community meetings, we need to ask whether upgrades with multiple replicas can be non-disruptive at all to begin with (the leader going away and other aspects).

rouke-broersma commented 1 year ago

@longwuyuan we have already identified where the regression happened and the fix seems simple: use a different set of labels to determine the existence of replicas during leader shutdown. This is already implemented; the check is just using the wrong labels currently.

Needed change:

https://github.com/kubernetes/ingress-nginx/blob/8ce61bdc6761f04d0ce617b9125255e9a147a20c/internal/ingress/status/status.go#L236

The check here needs to be extended to also exclude app.kubernetes.io/version and helm.sh/chart, since these labels contain specific version numbers. These labels were not present on the pods when this check was written and so were not considered, but they were added in #9732, where I guess the impact on this check was missed.

Or the check should be modified to only consider specifically picked labels. Probably this set:

app.kubernetes.io/component: controller
app.kubernetes.io/instance: xxxx-ingress
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: ingress-nginx
app.kubernetes.io/part-of: ingress-nginx

This should still be stable enough since you should not run multiple instances of ingress-nginx in the same namespace and this check is only performed in the controller's own namespace.
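As a rough sketch of this option (plain Go, illustrative only, not the actual controller code), the check could reduce the pod's own labels to the stable, well-known keys before building the pod selector:

// stableSelectorLabels keeps only the label keys that do not change between
// chart versions, so the "am I the last replica" check survives a rolling
// chart upgrade. The key list mirrors the set proposed above.
func stableSelectorLabels(podLabels map[string]string) map[string]string {
	stableKeys := []string{
		"app.kubernetes.io/component",
		"app.kubernetes.io/instance",
		"app.kubernetes.io/managed-by",
		"app.kubernetes.io/name",
		"app.kubernetes.io/part-of",
	}
	out := map[string]string{}
	for _, k := range stableKeys {
		if v, ok := podLabels[k]; ok {
			out[k] = v
		}
	}
	return out
}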

Or the chart label change should be reverted until a better fix is implemented.

@Gacko @strongjz which of the proposals would be most in line with your expectations regarding the previous PR?

Gacko commented 1 year ago

Dunno if I'm eligible for deciding or giving input on this, but first of all I'd like to say that I proposed https://github.com/kubernetes/ingress-nginx/pull/9732 only to align the pod's labels with other resources and to improve the possibilities for selecting them by label in daily business, by other services, or in special use cases we might not know of.

Anyway: adding those labels in the PR was non-breaking; removing them now might be. So simply reverting this PR also brings some risk. At the time of implementing it, I was not aware of the isRunningMultiplePods functionality, sorry for that!

I feel like additionally excluding the labels known to be version specific and to break this functionality is a better decision than hard-coding a set of labels, as I guess the chance of the latter being changed at some point in the future is bigger than for the former, simply because of how many there are.

On the other hand one could also configure their own labels being added to the pods by using controller.labels or commonLabels from the values.yaml.

So while writing this I'm more and more feeling like simply using the same labels the service uses for selecting pods, i.e. ingress-nginx.selectorLabels, would be the best decision, especially because changing them someday is definitely a breaking change requiring extended effort and investigation, so the chances that one also thinks of isRunningMultiplePods are higher.

Gacko commented 1 year ago

One more: right now isRunningMultiplePods is bound to using this Helm chart. Is this a requirement and the only supported way? Otherwise one could also use their own chart with just the image provided here, which might end up with different labels and similar problems. So in the end it could also be a good idea to make those labels configurable, even if just by handing over the selector labels by default.

rouke-broersma commented 1 year ago

Agreed, ingress-nginx.selectorLabels makes the most sense; that would probably require a new startup option to be able to pass them to the controller. I don't know how significant a change something like that is for the ingress-nginx project.

In the intermediate term I would personally propose checking specifically for

app.kubernetes.io/component: controller
app.kubernetes.io/instance: xxxx-ingress
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: ingress-nginx
app.kubernetes.io/part-of: ingress-nginx

Because these are well-known and in principle stable labels and values. This is even indicated in the comments on isRunningMultiplePods.

These labels are also all implemented in the static config provided by ingress-nginx:

https://github.com/kubernetes/ingress-nginx/tree/main/deploy/static/provider/cloud

With the exception of app.kubernetes.io/managed-by, but we could simply treat the label being absent as equal to it also being absent on the other pods.

In the long term we could then supply the controller with the service's selectorLabels so they are fully in line, in case someone creates their own service or doesn't use the provided helm chart.

Gacko commented 1 year ago

Ehm, random shower thought: why not use the endpoints of the service that is configured to publish to the Ingress resources?

Like: The load balancer address published to the Ingress resources reconciled by an Ingress NGINX Controller is configured via the --publish-service flag. So in the same way we could just check the endpoints of this service. If the leader is the last one remaining, it's fine to remove this load balancer address from the reconciled Ingress resources.

No labels, no exceptions involved, just plain Kubernetes concepts. Or am I missing something?
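A minimal sketch of that idea (assuming client-go; names and wiring are illustrative, not the actual controller code): before clearing the status on shutdown, the leader checks whether the Service named by --publish-service still has ready endpoints other than itself.

package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// publishServiceStillServed looks at the Endpoints of the --publish-service
// Service. If any ready address other than this pod's IP remains, the
// shutting-down leader is not the last controller and should keep the
// published address on the reconciled Ingress resources.
func publishServiceStillServed(ctx context.Context, client kubernetes.Interface, namespace, svcName, ownPodIP string) (bool, error) {
	ep, err := client.CoreV1().Endpoints(namespace).Get(ctx, svcName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	for _, subset := range ep.Subsets {
		for _, addr := range subset.Addresses {
			if addr.IP != ownPodIP {
				return true, nil
			}
		}
	}
	return false, nil
}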

rouke-broersma commented 1 year ago

@Gacko that makes sense, sounds like a good long term solution. Much more stable than relying on user-editable labels.

rouke-broersma commented 12 months ago

@longwuyuan @strongjz would it be acceptable in the immediate term to modify the isRunningMultiplePods to use the labels

app.kubernetes.io/component: controller
app.kubernetes.io/instance: xxxx-ingress
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: ingress-nginx
app.kubernetes.io/part-of: ingress-nginx

And in the long term to implement the change suggested by @Gacko?

The current implementation interrupts externaldns integration regularly and I'm surprised more people haven't noticed this.

If it's acceptable we can implement this change.

flaviomoringa commented 11 months ago

Hello,

I ended up here because last week we updated the nginx-ingress helm chart from 4.7.1 to 4.8.3, and had downtime in all of our critical applications (more than 100 EKS clusters).

It took us quite a while to understand the issue, because we saw external-dns making route53 requests to delete and create records and we did not understand why. Now we do :-(

For now we will "fix it" by changing the external-dns policy to "upsert-only" instead of "sync"... which will leave lots of garbage in route53, but it seems the only way to cope with nginx-ingress upgrades without causing downtime... at least until this is fixed.

Is there any way to push this fix to a higher priority? This is a real dealbreaker for us.

Thanks

flaviomoringa commented 11 months ago

Hello,

as a temporary fix, could we just pass "--update-status-on-shutdown=false" somewhere in the helm chart values?

What bad effects would this cause? (Because this way we could keep external-dns with the sync policy and not have garbage entries polluting our DNS records.)

Thanks

rouke-broersma commented 11 months ago

update-status-on-shutdown

controller:
  extraArgs:
    update-status-on-shutdown: 'false'

The consequence is that if the ingress controller is not running, your ingresses will still contain the external IP in their status and external-dns will not remove the DNS record. But it's unlikely that you would completely uninstall ingress-nginx yet keep the ingresses.

flaviomoringa commented 11 months ago

@rouke-broersma Thanks, that's what we thought also, so for now it's a simple quick fix. It will however need a restart of the deployment, which will cause the issue once again, right?

rouke-broersma commented 11 months ago

@rouke-broersma Thanks, that's what we thought also, so for now it's a simple quick fix. It will however need a restart of the deployment, which will cause the issue once again, right?

Unless you also update ingress-nginx at the same time, the selector labels should stay the same, so the "am I the last leader" check should function correctly in this case and it should not be a problem. Unless you modify the labels yourself with something like an internal build number; then it might be a problem.

flaviomoringa commented 11 months ago

@rouke-broersma thank you so much. Hope the main issue gets fixed soon, but for now this fix will help.

Gacko commented 11 months ago

Well, I'm not too deep into the controller's Go code yet, but I could give it a try.

/assign

baileydoestech commented 9 months ago

Just wondering if there has been any progress on this issue - we just experienced the same problem updating some development clusters (with only two replicas, so the chance of the leader being the last pod is high). Or is the recommendation to use the suggested extra args setting?

alapidas commented 9 months ago

We use a deployment system that adds metadata labels on every deployment - these labels have different values on each deploy, so if we were to exclude some labels as a fix, I would still experience this issue (see https://github.com/kubernetes/ingress-nginx/issues/10475#issuecomment-1756870130).

Using static labels is more rigid and prone to break later, but as a more immediate fix, seems better to me.

alapidas commented 9 months ago

https://github.com/kubernetes/ingress-nginx/issues/10475#issuecomment-1757195481

@Gacko Is there potential for a race here? Is it possible that the leader starts a shutdown before another pod has come all the way up and updated the endpoint on the service?

longwuyuan commented 9 months ago

It's a chicken-or-egg situation, it seems to me.

Since the main problem is external-dns temporarily messing around with the DNS records, this looks like a desired feature rather than a bug fix. I could be wrong. Please wait for comments from the developers.

/kind feature

rouke-broersma commented 9 months ago

It's a chicken-or-egg situation, it seems to me.

Since the main problem is external-dns temporarily messing around with the DNS records, this looks like a desired feature rather than a bug fix. I could be wrong. Please wait for comments from the developers.

/kind feature

Ingress-nginx removes the load balancer IP from the ingress, indicating that the ingress is no longer being served, so external-dns removes the DNS record. This is to avoid dangling DNS records, etc. External-dns is fulfilling its purpose. However, ingress-nginx is still available, so the load balancer IP should not have been removed. The only reason this happens is that the logic in ingress-nginx for determining whether the leader is the last survivor has been flawed since a certain change. This is not only a bug, but also a regression. I really don't understand how you can see this any other way after all the discussion on this issue.

There is no chicken and egg; ingress-nginx is wrong here, not external-dns.

longwuyuan commented 9 months ago

ok, the owner of the issue can type "/remove-kind feature" on one line and then type "/kind bug", on another line

FlorisFeddema commented 9 months ago

/remove-kind feature /kind bug

Gacko commented 8 months ago

/retitle Ingress Status: Wrong labels when checking for rollover or shutdown.

DocX commented 6 months ago

+1

We ran into the same issue, with something I did not see mentioned here yet:

We run a controller in our cluster that adds ec2-instance-id and ec2-instance-type labels to all pods when they are scheduled. I know this is a custom, non-standard thing. But it makes our ingress-nginx controller always clean up on shutdown, no matter whether it is a new replicaset or the same one, since no two pods will ever have matching labels due to zone anti-affinity (when we run 2 replicas, each replica will always have a different instance id).

This then causes the same result with external-dns removing the DNS records.

This is particularly bad when the ingress-nginx controller pods are shut down for "expected" reasons, like node scaling or node rotation.

Anyway, I also agree with the selectorLabels option; that would solve it for us by ignoring those extra labels we add to pods.

maxsargentdev commented 4 months ago

+1 This happened to us in production :(

I am surprised there are not more people encountering this given how popular both projects are.

amdove commented 4 months ago

+1 We experienced this issue as well in production.

baileydoestech commented 4 months ago

Just wondering if there has been any progress on this issue - we just experienced the same problem updating some development clusters (with only two replicas, so the chance of the leader being the last pod is high). Or is the recommendation to use the suggested extra args setting?

Would anyone from the core team be able to comment on whether the workaround suggested by others is the recommended course of action, or whether a fix is in the pipeline?