kubernetes / ingress-nginx

Ingress NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0
17.41k stars 8.24k forks source link

Breaking change in helm chart 4.4.1 / controller 1.5.2 #9464

Closed JVMartin closed 1 year ago

JVMartin commented 1 year ago

As soon as we automatically upgraded to 4.4.1, many of our services were inaccessible. Looked for common denominator, and discovered that Ingress definitions that use multiple paths/prefixes broke in strange and bizarre ways.

They would serve fake cert instead of cert-manager brokered cert, and if we bypassed the fake cert with e.g. curl --insecure, they would nginx 404 on all paths/prefixes.

Here is a sample broken Ingress (identifying information redacted):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: test-api
  namespace: test-production
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/rewrite-target: /$1
spec:
  tls:
  - hosts:
    - test.com
    secretName: test-cert
  rules:
  - host: test.com
    http:
      paths:
      - path: /?(.*)
        pathType: Prefix
        backend:
          service:
            name: mirror
            port:
              number: 80
      - path: /v1/?(.*)
        pathType: Prefix
        backend:
          service:
            name: test-api
            port:
              number: 80

This is on DOKS v1.24.4.

This caused major outages for us across the board, and was extremely insidious to debug.

There were no errors in the ingress controller pods. No errors in the ingress descriptions (k describe ingress). No issue modifying the ingress definitions (k apply -f). No issues in the compiled nginx configs in the ingress controller pods. Because it was so insidious to debug this issue, we had a long downtime and huge impact.

Reverting to 4.4.0 immediately solved the issues.

rikatz commented 1 year ago

Reason: https://github.com/kubernetes/ingress-nginx/pull/9309/files

We had to do some proper validation on paths, and "?" character was left off. I will check with @strongjz and do a quick patch

strongjz commented 1 year ago

/priority critical-urgent /triage accepted

strongjz commented 1 year ago

https://github.com/kubernetes/ingress-nginx/pull/9465

JVMartin commented 1 year ago

Thanks so much for the quick turnaround on this @rikatz @strongjz

I'm out of my depth, but just wanted to voice - even if this is a bug with e.g. the Ingress yaml/config parser thinking ? isn't a valid character in the regexp of a Prefix path definition, shouldn't we still have seen errors from somewhere when we were debugging? (Something like kubectl describe ingress saying "invalid path", or an error from kubectl logs on the ingress controller pods)?

Perhaps a separate issue would be to increase visibility into any potential similar bugs to this?

Or perhaps the "validation" part of the code thought it was okay, but some sanity check post-validation (deeper in the call stack) didn't and so there was just no visibility whatsoever into the underlying issue to me as a user, and so all that is needed is to fix the bug bc it isn't really possible to surface this kind of error to the user?

rikatz commented 1 year ago

yes, this is actually a bigger problem we have on ingress, the feedback we provide for users is "poor" :)

we had some plans to add this feedback on events (describe wont show anything as we don't have status fields on Ingress).

The validation we could also do is on webhooks, but in your case the ingress object was already on apiserver, so it is "after" anything we could do.

It is still our plan to provide some better feedbacks to users on why some Ingress couldn't be reconciled, add some pre validation before entering the sync loop.

Thanks for raising this bug :D We are rolling back and re-planning how to release this, and also we will have some minor release soon allowing regex to be part only of "ImplementationSpecific" types

strongjz commented 1 year ago

1.5.2 Github release was removed, and the helm chart has rolled forward to 4.4.2 with controller version 1.5.1

JVMartin commented 1 year ago

@rikatz Thanks for that insight.

I think the validations on the webhooks you describe must not exist (or otherwise help during this bug), because while we had 4.4.1 deployed, we were making small tweaks to the Ingress and applying them (kubectl apply -f ingress.yaml) and the apiserver was happily accepting them with no errors (though we couldn't get anything to actually work with regexp paths until we discovered the helm release and rolled back to 4.4.0).

Are you saying that the nginx ingress controller is no longer going to support regexp in Prefix path types? I'll have to research ImplementationSpecific type, thanks.

JVMartin commented 1 year ago

Ohh sorry I misread - you said the validations we could also do, as in they don't exist yet. Got it.

mycrEEpy commented 1 year ago

Don't know if it's a good idea but maybe you should also remove the container image for v1.5.2? Automation systems like Renovate will still pick up this new version and suggest the update to the user or worse, automatically roll it out due to being a patch version.

plevart commented 1 year ago

So there currently isn't a release for kubernetes 1.26 available?

longwuyuan commented 1 year ago

@plevart You can see that the CI tests for K8S v1.26 but yes, we need to wait for the next release https://github.com/kubernetes/ingress-nginx/blob/2ea010986f8348b259468506a3118dcc34f54b38/.github/workflows/ci.yaml#L340

longwuyuan commented 1 year ago

Don't know if it's a good idea but maybe you should also remove the container image for v1.5.2? Automation systems like Renovate will still pick up this new version and suggest the update to the user or worse, automatically roll it out due to being a patch version.

@mycrEEpy did you mean to remove from user environments or remove from k8s registry ?

mycrEEpy commented 1 year ago

@longwuyuan I meant the k8s registry, as tools like Renovate scan the tags API of the registries. But as already said, this might be a big problem for people who are currently running v1.5.2 (assuming they don't have any problems with #9465) in their environments.

longwuyuan commented 1 year ago

ok thanks. If you have the URL, can you try a docker pull and update here. We pulled back the release and may be I could try a docker pull myself too. But please help and do that docker pull yourself and update.

mycrEEpy commented 1 year ago

@longwuyuan The tag v1.5.2 is still listed for the registry k8s.gcr.io/ingress-nginx/controller:

~$ reg tags k8s.gcr.io/ingress-nginx/controller
INFO[0000] domain: k8s.gcr.io                           
INFO[0000] server address: k8s.gcr.io                   
sha256-34ee929b111ffc7aa426ffd409af44da48e5a0eea1eb2207994d9e0c0882d143.sig
sha256-3870522ed937c9efb94bfa31a7eb16009831567a0d4cbe01846fc5486d622655.sig
sha256-4ba73c697770664c1e00e9f968de14e08f606ff961c76e5d7033a4a9c593c629.sig
sha256-54f7fe2c6c5a9db9a0ebf1131797109bb7a4d91f56b9b362bde2abd237dd1974.sig
sha256-5516d103a9c2ecc4f026efbd4b40662ce22dc1f824fb129ed121460aaa5c47f8.sig
sha256-7059739637c30865f74cae403fffa55c2cb6d9779cd15654480dd0e4f850d536.sig
sha256-92115f5062568ebbcd450cd2cf9bffdef8df9fc61e7d5868ba8a7c9d773e0961.sig
sha256-d1707ca76d3b044ab8a28277a2466a02100ee9f58a86af1535a3edf9323ea1b5.sig
sha256-d8196e3bc1e72547c5dec66d6556c0ff92a23f6d0919b206be170bc90d5f9185.sig
v0.34.0
v0.34.1
.....
v1.4.0
v1.5.1
v1.5.2
longwuyuan commented 1 year ago

show kubectl -n ingress-nginx get po -o yaml | grep -i image

longwuyuan commented 1 year ago

@strongjz I think the GCP console shows only staging registry. Would you know of a way to remove the v1.5.2 layers !

strongjz commented 1 year ago

Images once they are moved to production can not be removed. please make sure you are using the tag in TAG and/or the container digests in the Helm or static deploys.

mycrEEpy commented 1 year ago

@strongjz That's what I thought. But even when using digests, automation tools will still find the broken v1.5.2 via the registry API and suggest updates to users. It's not your problem but, I just wanted to let you know.

strongjz commented 1 year ago

Thanks for the feedback, it's unfortunate and I apologize for the inconvenience. We're working on improving the release.