nginxinc / kubernetes-ingress

NGINX and NGINX Plus Ingress Controllers for Kubernetes
https://docs.nginx.com/nginx-ingress-controller
Apache License 2.0
4.65k stars 1.96k forks

Pods restart loop with error "[emerg] 23#23: bind() to 0.0.0.0:80 failed (13: Permission denied)" in latest chart/version for daemonset #3932

Closed · brian-provenzano closed this issue 1 year ago

brian-provenzano commented 1 year ago

Describe the bug: Using the latest image and Helm chart and upgrading from v2.4.2, I am getting permission denied errors in the nginx pods, which causes constant restarts. The issue appears to revolve around the recent securityContext changes in PR 3722 and PR 3573.

To Reproduce Steps to reproduce the behavior:

  1. Deploy v3.1.1 (Chart 0.17.1) in the daemonset configuration using helm template... then kubectl apply - see the sample values.yaml for our settings.
  2. Pods do not start successfully and restart continuously.
  3. View the logs on a restarting pod and you will see 2023/05/22 17:08:45 [emerg] 23#23: bind() to 0.0.0.0:80 failed (13: Permission denied)
  4. If I change daemonset.spec.template.spec.containers.securityContext.allowPrivilegeEscalation to true (the current setting is false in the chart template) and restart the DaemonSet, it works fine and the pods start - see the sketch after this list. This appears to be the same setting that was present in v2.4.2, which we currently run without issue.
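For reference, here is roughly where that field sits in the rendered DaemonSet (an illustrative sketch only - the container name and surrounding fields are placeholders, not the chart's exact output):

spec:
  template:
    spec:
      containers:
        - name: nginx-ingress                 # placeholder container name
          securityContext:
            allowPrivilegeEscalation: false   # value rendered by chart 0.17.1; flipping this to true is the workaround from step 4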

Expected behavior: I expect the pods to start successfully even with the new securityContext in place.

Your environment

Additional context: I can provide more information if needed. I would adjust daemonset.spec.template.spec.containers.securityContext.allowPrivilegeEscalation to true to fix this ourselves (albeit reverting to the less secure setup that was present in v2.4.2), but that param is not configurable in the chart.
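A minimal sketch of that out-of-band workaround, assuming the DaemonSet is named nginx-ingress and lives in the nginx-ingress namespace (both names are placeholders):

kubectl -n nginx-ingress patch daemonset nginx-ingress --type=json \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/securityContext/allowPrivilegeEscalation", "value": true}]'

The patch only survives until the manifests are re-applied from the chart output, so it is a stopgap rather than a fix.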

v3.1.1 images tried: nginx/nginx-ingress:3.1.1-ubi and public.ecr.aws/nginx/nginx-ingress:3.1.1-ubi (we use the AWS ECR Public image due to Docker Hub rate limits).

test-values.yaml.txt

github-actions[bot] commented 1 year ago

Hi @brian-provenzano thanks for reporting!

Be sure to check out the docs and the Contributing Guidelines while you wait for a human to take a look at this :slightly_smiling_face:

Cheers!

vepatel commented 1 year ago

Hi @brian-provenzano, I tested this on NGINX Ingress Controller v3.1.1 on k8s 1.27:

/nginx/kubernetes-ingress/deployments/helm-chart|72473392⚡ ⇒  k logs test-release-nginx-ingress-controller-4vkdg | grep Version=
NGINX Ingress Controller Version=3.1.1 Commit=72473392d14cb0971de4b916a8db9bb675a16634 Date=2023-05-04T23:50:20Z DirtyState=false Arch=linux/amd64 Go=go1.20.3

/nginx/kubernetes-ingress/deployments/helm-chart|72473392⚡ ⇒  k get pods
NAME                                          READY   STATUS    RESTARTS   AGE
test-release-nginx-ingress-controller-4vkdg   1/1     Running   0          5m21s
test-release-nginx-ingress-controller-9ckjh   1/1     Running   0          5m21s
test-release-nginx-ingress-controller-lt6mj   1/1     Running   0          5m21s

/nginx/kubernetes-ingress/deployments/helm-chart|72473392⚡ ⇒  k get pods test-release-nginx-ingress-controller-4vkdg -o yaml | grep allowPrivilegeEscalation
      allowPrivilegeEscalation: false

Can you please make sure you're on the correct release tag when running helm install... or kubectl apply...?

helm cmd used: helm install test-release --set controller.kind=daemonset --set controller.nginxplus=false --set controller.image.repository=nginx/nginx-ingress --set controller.image.tag="3.1.1" --set controller.image.pullPolicy=Always .
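If it helps, here is a quick sanity check on what the chart actually renders before applying it (a sketch; run from the chart directory with your own values file name):

helm template test-release . -f values.yaml --set controller.kind=daemonset \
  | grep -E 'image:|allowPrivilegeEscalation'

That should show the 3.1.1 image tag alongside allowPrivilegeEscalation: false if the chart and image versions line up.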

brianehlert commented 1 year ago

Specifically, the 3.1.1 patch includes changes to how the NET_BIND_SERVICE capability is handled (see https://docs.nginx.com/nginx-ingress-controller/releases/#nginx-ingress-controller-311), so the Helm chart / manifests must match the container version.
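As a quick check, you can confirm the image itself carries the file capability on the nginx binary (a sketch; getcap may not be present in every image variant, and /usr/sbin/nginx is the assumed binary path):

docker run --rm --entrypoint getcap nginx/nginx-ingress:3.1.1 /usr/sbin/nginx
# expect output similar to: /usr/sbin/nginx cap_net_bind_service=ep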

brian-provenzano commented 1 year ago

We run helm template... then kubectl apply (actually it is run through Spinnaker). I used 3.1.1-ubi from Docker Hub and the same image from the public ECR. I corrected the version in the original post.

I will double check my work though to be sure and get back to you asap...

brian-provenzano commented 1 year ago

OK - I tried with these images: nginx/nginx-ingress:3.1.1 and nginx/nginx-ingress:3.1.1-ubi. I have attached a copy of the DaemonSet I tried that uses the nginx/nginx-ingress:3.1.1 image, which still does not work for us (the pods throw the permission error previously described).

Testing process: I edited the DaemonSet on the cluster to use the nginx/nginx-ingress:3.1.1 image (which launched new pods), but I still get the permission error in the pod logs and the pods constantly restart. If I change allowPrivilegeEscalation to true, all is fine.

Could this be some issue in how our nodes are configured? AMI, OS etc? We are using custom Ubuntu CIS AMIs and not the official AWS EKS optimized AMIs.
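One way to narrow that down would be to compare what the runtime actually hands the container on a working node versus a failing one, e.g. (a sketch; the pod name is a placeholder and the pod has to stay up long enough to exec into):

kubectl -n nginx-ingress exec <pod-name> -- grep -E 'Cap|NoNewPrivs' /proc/1/status
# decode the hex masks with: capsh --decode=<hex>  (CAP_NET_BIND_SERVICE is capability number 10)

Differences in the Cap* masks between a working and a failing node would point at the node/runtime setup rather than the chart.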

Logs from a pod that successfully starts/runs once I change to allowPrivilegeEscalation: true:

NGINX Ingress Controller Version=3.1.1 Commit=72473392d14cb0971de4b916a8db9bb675a16634 Date=2023-05-04T23:50:20Z DirtyState=false Arch=linux/amd64 Go=go1.20.3
I0523 16:51:05.622911       1 flags.go:294] Starting with flags: ["-nginx-plus=false" "-nginx-reload-timeout=60000" "-enable-app-protect=false" "-enable-app-protect-dos=false" "-nginx-configmaps=nginx-ingress/nginx-config" "-default-server-tls-secret=nginx-ingress/nginx-ingress-secret" "-ingress-class=nginx" "-health-status=false" "-health-status-uri=/nginx-health" "-nginx-debug=false" "-v=1" "-nginx-status=false" "-report-ingress-status" "-external-service=nginx-ingress-external" "-enable-leader-election=true" "-leader-election-lock-name=kdp-core-nginx-ingress-leader-election" "-enable-prometheus-metrics=false" "-prometheus-metrics-listen-port=9113" "-prometheus-tls-secret=" "-enable-service-insight=false" "-service-insight-listen-port=9114" "-service-insight-tls-secret=" "-enable-custom-resources=true" "-enable-snippets=true" "-include-year=false" "-disable-ipv6=false" "-enable-tls-passthrough=false" "-enable-preview-policies=false" "-enable-cert-manager=false" "-enable-oidc=false" "-enable-external-dns=false" "-ready-status=true" "-ready-status-port=8081" "-enable-latency-metrics=false"]
I0523 16:51:05.629088       1 main.go:234] Kubernetes version: 1.23.17
I0523 16:51:05.635203       1 main.go:380] Using nginx version: nginx/1.23.4
I0523 16:51:05.739233       1 main.go:776] Pod label updated: nginx-ingress-q2bvf
2023/05/23 16:51:05 [notice] 18#18: using the "epoll" event method
2023/05/23 16:51:05 [notice] 18#18: nginx/1.23.4
2023/05/23 16:51:05 [notice] 18#18: built by gcc 11.2.1 20220127 (Red Hat 11.2.1-9) (GCC)
2023/05/23 16:51:05 [notice] 18#18: OS: Linux 5.4.0-1100-aws
2023/05/23 16:51:05 [notice] 18#18: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2023/05/23 16:51:05 [notice] 18#18: start worker processes
2023/05/23 16:51:05 [notice] 18#18: start worker process 22
2023/05/23 16:51:05 [notice] 18#18: start worker process 23
2023/05/23 16:51:05 [notice] 18#18: start worker process 24
2023/05/23 16:51:05 [notice] 18#18: start worker process 25
2023/05/23 16:51:05 [notice] 18#18: start worker process 26
2023/05/23 16:51:05 [notice] 18#18: start worker process 27
2023/05/23 16:51:05 [notice] 18#18: start worker process 28
2023/05/23 16:51:05 [notice] 18#18: start worker process 29
2023/05/23 16:51:05 [notice] 18#18: start worker process 30
2023/05/23 16:51:05 [notice] 18#18: start worker process 31
2023/05/23 16:51:05 [notice] 18#18: start worker process 32
2023/05/23 16:51:05 [notice] 18#18: start worker process 33
2023/05/23 16:51:05 [notice] 18#18: start worker process 34
2023/05/23 16:51:05 [notice] 18#18: start worker process 35
2023/05/23 16:51:05 [notice] 18#18: start worker process 36
2023/05/23 16:51:05 [notice] 18#18: start worker process 37
...

Logs from a pod when allowPrivilegeEscalation: false (pod does not start/restarts constantly):

NGINX Ingress Controller Version=3.1.1 Commit=72473392d14cb0971de4b916a8db9bb675a16634 Date=2023-05-04T23:50:20Z DirtyState=false Arch=linux/amd64 Go=go1.20.3
I0523 16:49:08.587514       1 flags.go:294] Starting with flags: ["-nginx-plus=false" "-nginx-reload-timeout=60000" "-enable-app-protect=false" "-enable-app-protect-dos=false" "-nginx-configmaps=nginx-ingress/nginx-config" "-default-server-tls-secret=nginx-ingress/nginx-ingress-secret" "-ingress-class=nginx" "-health-status=false" "-health-status-uri=/nginx-health" "-nginx-debug=false" "-v=1" "-nginx-status=false" "-report-ingress-status" "-external-service=nginx-ingress-external" "-enable-leader-election=true" "-leader-election-lock-name=kdp-core-nginx-ingress-leader-election" "-enable-prometheus-metrics=false" "-prometheus-metrics-listen-port=9113" "-prometheus-tls-secret=" "-enable-service-insight=false" "-service-insight-listen-port=9114" "-service-insight-tls-secret=" "-enable-custom-resources=true" "-enable-snippets=true" "-include-year=false" "-disable-ipv6=false" "-enable-tls-passthrough=false" "-enable-preview-policies=false" "-enable-cert-manager=false" "-enable-oidc=false" "-enable-external-dns=false" "-ready-status=true" "-ready-status-port=8081" "-enable-latency-metrics=false"]
I0523 16:49:08.593176       1 main.go:234] Kubernetes version: 1.23.17
I0523 16:49:08.601693       1 main.go:380] Using nginx version: nginx/1.23.4
I0523 16:49:08.635197       1 main.go:776] Pod label updated: nginx-ingress-dbwbh
2023/05/23 16:49:08 [emerg] 24#24: bind() to 0.0.0.0:80 failed (13: Permission denied)

nginx-ingress-ds.yaml.txt

brianehlert commented 1 year ago

We have had issues with Helm upgrades in the past where changes to rbac.yaml (or, on OpenShift, scc.yaml) are not processed properly due to how Helm performs the upgrade.

I see that you are using a daemonset instead of a deployment. Do you get a different result if you use a deployment? I am curious.
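If you are going the helm template route, something like this should produce the deployment variant with otherwise identical values (a sketch; the values file name is a placeholder):

helm template test-release . -f test-values.yaml --set controller.kind=deployment | kubectl apply -f -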

brian-provenzano commented 1 year ago

OK - I will give that a try and report back - shouldn't take long to test

brian-provenzano commented 1 year ago

Same issue - no change in behavior as a deployment. Attached is a copy of the deployment.

pod logs when deployed as a deployment (same as before):

NGINX Ingress Controller Version=3.1.1 Commit=72473392d14cb0971de4b916a8db9bb675a16634 Date=2023-05-04T23:50:20Z DirtyState=false Arch=linux/amd64 Go=go1.20.3
I0523 20:47:37.302872       1 flags.go:294] Starting with flags: ["-nginx-plus=false" "-nginx-reload-timeout=60000" "-enable-app-protect=false" "-enable-app-protect-dos=false" "-nginx-configmaps=nginx-ingress/nginx-config" "-default-server-tls-secret=nginx-ingress/nginx-ingress-secret" "-ingress-class=nginx" "-health-status=false" "-health-status-uri=/nginx-health" "-nginx-debug=false" "-v=1" "-nginx-status=false" "-report-ingress-status" "-external-service=nginx-ingress-external" "-enable-leader-election=true" "-leader-election-lock-name=kdp-core-nginx-ingress-leader-election" "-enable-prometheus-metrics=false" "-prometheus-metrics-listen-port=9113" "-prometheus-tls-secret=" "-enable-service-insight=false" "-service-insight-listen-port=9114" "-service-insight-tls-secret=" "-enable-custom-resources=true" "-enable-snippets=true" "-include-year=false" "-disable-ipv6=false" "-enable-tls-passthrough=false" "-enable-preview-policies=false" "-enable-cert-manager=false" "-enable-oidc=false" "-enable-external-dns=false" "-ready-status=true" "-ready-status-port=8081" "-enable-latency-metrics=false"]
I0523 20:47:37.393542       1 main.go:234] Kubernetes version: 1.23.17
I0523 20:47:37.400536       1 main.go:380] Using nginx version: nginx/1.23.4
I0523 20:47:37.432189       1 main.go:776] Pod label updated: nginx-ingress-77d64565d8-mttlk
2023/05/23 20:47:37 [emerg] 24#24: bind() to 0.0.0.0:80 failed (13: Permission denied)

Again if I change to allowPrivilegeEscalation: true it works fine.

NGINX Ingress Controller Version=3.1.1 Commit=72473392d14cb0971de4b916a8db9bb675a16634 Date=2023-05-04T23:50:20Z DirtyState=false Arch=linux/amd64 Go=go1.20.3
I0523 20:55:42.888299       1 flags.go:294] Starting with flags: ["-nginx-plus=false" "-nginx-reload-timeout=60000" "-enable-app-protect=false" "-enable-app-protect-dos=false" "-nginx-configmaps=nginx-ingress/nginx-config" "-default-server-tls-secret=nginx-ingress/nginx-ingress-secret" "-ingress-class=nginx" "-health-status=false" "-health-status-uri=/nginx-health" "-nginx-debug=false" "-v=1" "-nginx-status=false" "-report-ingress-status" "-external-service=nginx-ingress-external" "-enable-leader-election=true" "-leader-election-lock-name=kdp-core-nginx-ingress-leader-election" "-enable-prometheus-metrics=false" "-prometheus-metrics-listen-port=9113" "-prometheus-tls-secret=" "-enable-service-insight=false" "-service-insight-listen-port=9114" "-service-insight-tls-secret=" "-enable-custom-resources=true" "-enable-snippets=true" "-include-year=false" "-disable-ipv6=false" "-enable-tls-passthrough=false" "-enable-preview-policies=false" "-enable-cert-manager=false" "-enable-oidc=false" "-enable-external-dns=false" "-ready-status=true" "-ready-status-port=8081" "-enable-latency-metrics=false"]
I0523 20:55:42.895868       1 main.go:234] Kubernetes version: 1.23.17
I0523 20:55:42.907961       1 main.go:380] Using nginx version: nginx/1.23.4
I0523 20:55:42.935903       1 main.go:776] Pod label updated: nginx-ingress-86bfb79447-4pnh6
2023/05/23 20:55:42 [notice] 25#25: using the "epoll" event method
2023/05/23 20:55:42 [notice] 25#25: nginx/1.23.4
2023/05/23 20:55:42 [notice] 25#25: built by gcc 10.2.1 20210110 (Debian 10.2.1-6)
2023/05/23 20:55:42 [notice] 25#25: OS: Linux 5.4.0-1100-aws
2023/05/23 20:55:42 [notice] 25#25: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2023/05/23 20:55:42 [notice] 25#25: start worker processes
2023/05/23 20:55:42 [notice] 25#25: start worker process 26
2023/05/23 20:55:42 [notice] 25#25: start worker process 27
2023/05/23 20:55:42 [notice] 25#25: start worker process 28
2023/05/23 20:55:42 [notice] 25#25: start worker process 29
2023/05/23 20:55:42 [notice] 25#25: start worker process 30
2023/05/23 20:55:42 [notice] 25#25: start worker process 31
2023/05/23 20:55:42 [notice] 25#25: start worker process 32
2023/05/23 20:55:42 [notice] 25#25: start worker process 33
2023/05/23 20:55:42 [notice] 25#25: start worker process 34
2023/05/23 20:55:42 [notice] 25#25: start worker process 35
2023/05/23 20:55:42 [notice] 25#25: start worker process 36
2023/05/23 20:55:42 [notice] 25#25: start worker process 37
2023/05/23 20:55:42 [notice] 25#25: start worker process 38
2023/05/23 20:55:42 [notice] 25#25: start worker process 39
2023/05/23 20:55:42 [notice] 25#25: start worker process 40
2023/05/23 20:55:42 [notice] 25#25: start worker process 41

nginx-ingress-deployment.yaml.txt

vepatel commented 1 year ago

Weird - it's working for me with default values on both GKE and AKS with helm chart 0.17.1. GKE: https://github.com/nginxinc/kubernetes-ingress/issues/3932#issuecomment-1558927659. AKS: in this scenario I performed an upgrade from 2.4.2 to 3.1.1:

/nginx/kubernetes-ingress/deployments/helm-chart|72473392⚡ ⇒  k get pods
NAME                                          READY   STATUS    RESTARTS   AGE
test-release-nginx-ingress-controller-2bn59   1/1     Running   0          12s
test-release-nginx-ingress-controller-5kj6z   1/1     Running   0          12s
test-release-nginx-ingress-controller-w596l   1/1     Running   0          12s

/nginx/kubernetes-ingress/deployments/helm-chart|72473392⚡ ⇒  k describe daemonsets.apps test-release-nginx-ingress-controller 
Name:           test-release-nginx-ingress-controller
Selector:       app.kubernetes.io/instance=test-release,app.kubernetes.io/name=nginx-ingress
Node-Selector:  <none>
Labels:         app.kubernetes.io/instance=test-release
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=nginx-ingress
                app.kubernetes.io/version=3.1.1
                helm.sh/chart=nginx-ingress-0.17.1
Annotations:    deprecated.daemonset.template.generation: 1
                meta.helm.sh/release-name: test-release
                meta.helm.sh/release-namespace: default

/nginx/kubernetes-ingress/deployments/helm-chart|72473392⚡ ⇒  k get pods test-release-nginx-ingress-controller-jrlm6 -o yaml | grep allowPrivilegeEscalation 
      allowPrivilegeEscalation: false                

I'll try EKS with official EKS optimized Amazon Linux 2 instances later.

brian-provenzano commented 1 year ago

Alright, I am starting to think it is something unique to our environment.

I did the following:

One other possible variable here is that our container runtime is still Docker on 1.23, besides the fact that we are not using the official AWS EKS AMIs. I think the current EKS AMIs built for 1.23 use containerd...?
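For what it's worth, the runtime each node reports is visible directly from kubectl:

kubectl get nodes -o wide
# the CONTAINER-RUNTIME column shows e.g. docker://20.10.x vs containerd://1.6.x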

Anyway, I am going to run another test on one of our other 1.23 clusters created with our IaC (Terraform, not eksctl; our custom Ubuntu AMI with the Docker runtime), but it appears to be an issue on my end. Sorry about the wild goose chase here :(

I am guessing we can close this for now and I can report back if anything changes...

values-test-nginx.yaml.txt

brianehlert commented 1 year ago

It is fine to leave this open until you resolve it. I think we all learn from these kinds of things.

vepatel commented 1 year ago

Thanks @brian-provenzano for checking, I'll close this for now 👍🏼

justbert commented 9 months ago

We're running into the same error on 3.3.2. We're building our own image to include some extra modules/capabilities. When our image is built with Docker this issue does not happen; however, when it's built with Kaniko, it does.

vepatel commented 7 months ago

@justbert we'll be adding an option to modify the securityContext via Helm in 3.5.0, so hopefully that should solve your issue.

justbert commented 7 months ago

Found the issue! (I should have updated my comment.) It seems Kaniko doesn't copy over extended file attributes, whereas Docker does, which means the CAP_NET_BIND_SERVICE file capability was missing from the nginx binary. Extended attribute handling is not a well-defined part of the COPY command, which (as we can see) causes issues.
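For anyone else building custom images, a quick way to spot the difference is to compare the file capability on the nginx binary in both builds (a sketch; the image tags are placeholders and getcap has to be present in the image):

docker run --rm --entrypoint getcap custom-nic:docker-build /usr/sbin/nginx
# /usr/sbin/nginx cap_net_bind_service=ep
docker run --rm --entrypoint getcap custom-nic:kaniko-build /usr/sbin/nginx
# no output: the xattr, and with it the file capability, was dropped during COPY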