nginxinc / kubernetes-ingress

NGINX and NGINX Plus Ingress Controllers for Kubernetes
https://docs.nginx.com/nginx-ingress-controller
Apache License 2.0

NIC Pod fails to bind to unix socket when NGINX master process exits unexpectedly and does not clean up #4604

Open shaun-nx opened 1 year ago

shaun-nx commented 1 year ago

Describe the bug When the NGINX master process exits unexpectedly (e.g. the process is killed using kill -9 <master-process-pid>), system files generated by NGINX are not cleaned up.

This issue describes the impact of unix socket files persisting in /var/lib/nginx after the NGINX master process exits unexpectedly.

Log output when the NGINX master process exits unexpectedly:

E1102 09:38:53.243649       1 main.go:501] nginx command exited with an error: signal: killed
I1102 09:38:53.243740       1 main.go:511] Shutting down the controller
I1102 09:38:53.244035       1 main.go:521] Exiting with a status: 1

To Reproduce Steps to reproduce the behavior:

  1. Deploy all the necessary prerequisites outlined in the Installation with Manifests docs.
  2. Deploy the Deployment manifest below, which is configured with a volume of type emptyDir: {} and a volumeMount for /var/lib/nginx:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx-ingress
      namespace: nginx-ingress
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: nginx-ingress
      template:
        metadata:
          labels:
            app: nginx-ingress
            app.kubernetes.io/name: nginx-ingress
        spec:
          serviceAccountName: nginx-ingress
          automountServiceAccountToken: true
          securityContext:
            seccompProfile:
              type: RuntimeDefault
          volumes:
          - name: nginx-lib
            emptyDir: {}
          containers:
          - image: nginx/nginx-ingress:3.3.1
            imagePullPolicy: IfNotPresent
            name: nginx-ingress
            ports:
            - name: http
              containerPort: 80
            - name: https
              containerPort: 443
            - name: readiness-port
              containerPort: 8081
            - name: prometheus
              containerPort: 9113
            readinessProbe:
              httpGet:
                path: /nginx-ready
                port: readiness-port
              periodSeconds: 1
            resources:
              requests:
                cpu: "100m"
                memory: "128Mi"
            securityContext:
              allowPrivilegeEscalation: false
              runAsUser: 101 #nginx
              runAsNonRoot: true
              capabilities:
                drop:
                - ALL
                add:
                - NET_BIND_SERVICE
            volumeMounts:
            - mountPath: /var/lib/nginx
              name: nginx-lib
            env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            args:
              - -nginx-configmaps=$(POD_NAMESPACE)/nginx-config
  3. Attach a debug container to the running NGINX Ingress Controller pod using kubectl debug -it -n <ic-namespace> <ic-pod> --image=busybox:1.28 --target=nginx-ingress
  4. Within the debug container, run ps -ef to get the process ID of the NGINX master process
  5. Stop the NGINX master process using kill -9 <master-process-pid>
  6. View the logs of the NGINX Ingress Controller pod and see NGINX fail to bind to unix sockets.

Expected behavior NGINX Ingress Controller is able to recover and operate normally after exiting unexpectedly.

Your environment

Additional context Full deployment manifest used: Log output

NGINX Ingress Controller Version=3.3.1 Commit=0f828bb5f4159d7fb52bcff0159d1ddd99f16f87 Date=2023-10-13T16:23:42Z DirtyState=false Arch=linux/arm64 Go=go1.21.3
I1102 09:38:54.316209       1 flags.go:297] Starting with flags: ["-nginx-configmaps=nginx-ingress/nginx-config"]
I1102 09:38:54.320330       1 main.go:236] Kubernetes version: 1.27.4
I1102 09:38:54.328891       1 main.go:382] Using nginx version: nginx/1.25.2
I1102 09:38:54.337340       1 main.go:782] Pod label updated: nginx-ingress-64f9fcdb96-dpgsk
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-config-version.sock failed (98: Address already in use)
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address already in use)
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-418-server.sock failed (98: Address already in use)
2023/11/02 09:38:54 [notice] 16#16: try again to bind() after 500ms
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-config-version.sock failed (98: Address already in use)
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address already in use)
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-418-server.sock failed (98: Address already in use)
2023/11/02 09:38:54 [notice] 16#16: try again to bind() after 500ms
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-config-version.sock failed (98: Address already in use)
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address already in use)
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-418-server.sock failed (98: Address already in use)
2023/11/02 09:38:54 [notice] 16#16: try again to bind() after 500ms
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-config-version.sock failed (98: Address already in use)
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address already in use)
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-418-server.sock failed (98: Address already in use)
2023/11/02 09:38:54 [notice] 16#16: try again to bind() after 500ms
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-config-version.sock failed (98: Address already in use)
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address already in use)
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-418-server.sock failed (98: Address already in use)
2023/11/02 09:38:54 [notice] 16#16: try again to bind() after 500ms
2023/11/02 09:38:54 [emerg] 16#16: still could not bind()
F1102 09:39:54.341336       1 manager.go:288] Could not get newest config version: could not get expected version: 0 after 1m0s
github-actions[bot] commented 1 year ago

Hi @shaun-nx thanks for reporting!

Be sure to check out the docs and the Contributing Guidelines while you wait for a human to take a look at this :slightly_smiling_face:

Cheers!