NGF Pod fails to become ready due to nginx reload failure: "failed to send the HUP signal to NGINX main: operation not permitted"

nginxinc / nginx-gateway-fabric

NGINX Gateway Fabric provides an implementation for the Gateway API using NGINX as the data plane.

Apache License 2.0


Describe the bug

In some environments, the NGINX Gateway Fabric fails to report as ready. The nginx-gateway logs report an error reloading NGINX:

{"level":"error","ts":"2024-03-12T02:21:19Z","logger":"eventLoop.eventHandler","msg":"Failed to update NGINX configuration","batchID":1,"error":"failed to reload NGINX: failed to send the HUP signal to NGINX main: operation not permitted"

This is because the control plane does not have the proper permissions to reload NGINX.

Workaround

To work around this issue, set allowPrivilegeEscalation to true on the nginx-gateway container.

If using Helm, you can set the nginxGateway.securityContext.allowPrivilegeEscalation value. If using the manifests directly, you can update this field under the nginx-gateway container’s securityContext.
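As a rough sketch of the two options above, the Helm value path named in this issue could be written in a values file like this (the file name values.yaml is only an assumption, and any other values you already set would sit alongside it):

```yaml
# values.yaml (hypothetical file name)
# Maps to nginxGateway.securityContext.allowPrivilegeEscalation
nginxGateway:
  securityContext:
    allowPrivilegeEscalation: true
```

For the manifests, the relevant fragment of the Deployment's Pod spec would look roughly like the following. Only the container name and the securityContext field come from this issue; the rest of your manifest is elided and should be left as it is.

```yaml
# Fragment of the Pod template spec; other fields omitted
containers:
- name: nginx-gateway
  securityContext:
    allowPrivilegeEscalation: true
```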

Open Questions

So far we have been unable to reproduce this issue on kind or any managed Kubernetes platform. How can we reproduce?
What is the root cause of this permissions issue? Is there a cluster setting that can be tweaked?

Related issues:

A similar error can be produced by deploying on OpenShift, deploying any example, deleting the resources, and waiting a little while:

{"level":"error","ts":"2024-06-13T18:49:14Z","logger":"eventLoop.eventHandler","msg":"Failed to update NGINX configuration","batchID":16,"error":"failed to reload NGINX: reload unsuccessful: no new NGINX worker processes started for config version 5. Please check the NGINX container logs for possible configuration issues: context deadline exceeded","stacktrace":"github.com/nginxinc/nginx-gateway-fabric/internal/mode/static.(*eventHandlerImpl).HandleEventBatch\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/mode/static/handler.go:223\ngithub.com/nginxinc/nginx-gateway-fabric/internal/framework/events.(*EventLoop).Start.func1.1\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/framework/events/loop.go:74"}

This is also fixed by setting allowPrivilegeEscalation to true.

NGF Pod fails to become ready due to nginx reload failure: "failed to send the HUP signal to NGINX main: operation not permitted" #1695