nginxinc / nginx-gateway-fabric

NGINX Gateway Fabric provides an implementation for the Gateway API using NGINX as the data plane.
Apache License 2.0

NGF Pod fails to become ready due to nginx reload failure: "failed to send the HUP signal to NGINX main: operation not permitted" #1695

Open kate-osborn opened 7 months ago

kate-osborn commented 7 months ago

Describe the bug

In some environments, the NGINX Gateway Fabric Pod fails to report as ready. The nginx-gateway logs report an error reloading NGINX:

{"level":"error","ts":"2024-03-12T02:21:19Z","logger":"eventLoop.eventHandler","msg":"Failed to update NGINX configuration","batchID":1,"error":"failed to reload NGINX: failed to send the HUP signal to NGINX main: operation not permitted"

This is due to the control plane not having the proper permissions to reload NGINX.

Workaround

To resolve this issue you will need to set allowPrivilegeEscalation to true.

If using Helm, you can set the nginxGateway.securityContext.allowPrivilegeEscalation value. If using the manifests directly, you can update this field under the nginx-gateway container’s securityContext.
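As a rough sketch of the two options above, the Helm value can be set in a values file like this. Only the nginxGateway.securityContext.allowPrivilegeEscalation key comes from this issue; any other values your installation uses are left out here.

```yaml
# values.yaml fragment (sketch): enables privilege escalation for the
# nginx-gateway container so the control plane can signal NGINX.
nginxGateway:
  securityContext:
    allowPrivilegeEscalation: true
```

If you deploy from the manifests instead, the equivalent change would look roughly like the fragment below. It assumes the standard Kubernetes Deployment layout (spec.template.spec.containers) and shows only the relevant field; keep the rest of the container's existing securityContext and other fields as they are in your manifest.

```yaml
# Deployment fragment (sketch): only the field discussed in this issue is shown.
spec:
  template:
    spec:
      containers:
        - name: nginx-gateway
          securityContext:
            allowPrivilegeEscalation: true
```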


bjee19 commented 4 months ago

A similar error can be reproduced by deploying on OpenShift, deploying any example, deleting the resources, and waiting a little while:

{"level":"error","ts":"2024-06-13T18:49:14Z","logger":"eventLoop.eventHandler","msg":"Failed to update NGINX configuration","batchID":16,"error":"failed to reload NGINX: reload unsuccessful: no new NGINX worker processes started for config version 5. Please check the NGINX container logs for possible configuration issues: context deadline exceeded","stacktrace":"github.com/nginxinc/nginx-gateway-fabric/internal/mode/static.(*eventHandlerImpl).HandleEventBatch\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/mode/static/handler.go:223\ngithub.com/nginxinc/nginx-gateway-fabric/internal/framework/events.(*EventLoop).Start.func1.1\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/framework/events/loop.go:74"}

This is also fixed by setting allowPrivilegeEscalation to true.