kubernetes / ingress-nginx

Ingress-NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

Config Tries to Be Loaded Before Secrets Have Been Injected Into Pod #9593

Open Evesy opened 1 year ago

Evesy commented 1 year ago

What happened:

During startup, we observed Nginx emitting emergency-level logs because the configuration referenced certificate files that had not yet been written into the pod.

What you expected to happen:

ingress-nginx should fully write secrets to the pod before attempting to start up

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): 1.4.0

Kubernetes version (use kubectl version): 1.24.8

Environment:

How to reproduce this issue: This hasn't been reproducible in a smaller test environment yet; it only seems to happen on our cluster with ~1000 ingresses. We've been on 1.4 for some time now, and this is the first time we've observed the issue during an nginx rollout.

k8s-ci-robot commented 1 year ago

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

longwuyuan commented 1 year ago

This is not showing a bug.

/remove-kind bug

The error message states:

SSL: error:0908F066:PEM routines:get_header_and_data:bad end line

I suspect that is related to that PEM file, which in turn could be related to the auth-tls-secret annotation.

You could create another app and ingress with a vanilla nginx:alpine image and see if a simple ingress with no extra annotations works. If it does, you can then add that annotation and see whether the previously working ingress fails.

longwuyuan commented 1 year ago

this file /etc/ingress-controller/ssl/ca-ingress-nginx-cloudflare-origin-pull-ca.pem

Evesy commented 1 year ago

@longwuyuan Is this not showing a bug?

The file (/etc/ingress-controller/ssl/ca-ingress-nginx-cloudflare-origin-pull-ca.pem) is loaded in by Nginx based on the annotation: nginx.ingress.kubernetes.io/auth-tls-secret: ingress-nginx/cloudflare-origin-pull-ca

The referenced Kubernetes secret, ingress-nginx/cloudflare-origin-pull-ca, does not change during a rolling restart of Nginx. The data in the secret is static and sound, and ingress-nginx also eventually loads it correctly without intervention.

This leads me to think ingress-nginx is attempting to validate/load the nginx config, which references that PEM on disk, before ingress-nginx has actually read the secret and written it to its local filesystem.

What are your thoughts?

longwuyuan commented 1 year ago

Hi @Evesy, thanks for reporting this. What is needed is complete, detailed data on that error.

With a Cloudflare CA involved in your post, I think there is a lot to be considered, so the fine details of the problem will help a lot. Cloudflare CA, full chains for auth, and so on are a specialist's area.

github-actions[bot] commented 1 year ago

This is stale, but we won't close it automatically. Just bear in mind the maintainers may be busy with other tasks and will reach your issue ASAP. If you have any question or request to prioritize this, please reach #ingress-nginx-dev on Kubernetes Slack.

Restless-ET commented 1 year ago

Hello 👋

By chance, do you have any other findings around this @Evesy ?

I believe I am experiencing a similar situation, but with the CA CRL file instead of the CRT (the secret referenced by the annotation holds both "ca.crt" and "ca.crl").

I've confirmed it's happening on versions 1.1.3, 1.3.1, 1.4.0, and 1.5.1, although on v1.1.3 the logging format appears slightly different.

Evesy commented 1 year ago

Hey @Restless-ET, unfortunately we haven't seen a recurrence since I raised this, and I was never able to reproduce it reliably either.

Restless-ET commented 1 year ago

Yes, I experience the same... when I release a new version or simply do a rollout restart it doesn't happen every time and even when it does it's not for all the controller pods.

It doesn't seem to affect functionality on any of the configured endpoints, so I guess at this stage it is really more about reducing log noise (and detecting actual problems more quickly) than anything else.

Anyway, thank you for getting back on this. :)

613andred commented 1 year ago

This problem has severely impacted us in the past. I have only now been able to compile the information and replicate the problem.

I also believe it's the same underlying issue causing #10234 and #10265

Our context

The following makes this issue occur more often

Symptoms

Error: UPGRADE FAILED: failed to create resource: admission webhook "validate.nginx.ingress.kubernetes.io" denied the request:
-------------------------------------------------------------------------------
Error: exit status 1
2023/06/06 16:55:24 [emerg] 4002#4002: SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/test-mtls-truststore.pem") failed (SSL: error:0B084088:x509 certificate routines:X509_load_cert_crl_file:no certificate or crl found)
nginx: [emerg] SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/test-mtls-truststore.pem") failed (SSL: error:0B084088:x509 certificate routines:X509_load_cert_crl_file:no certificate or crl found)
nginx: configuration file /tmp/nginx/nginx-cfg636383756 test failed

or

2023/02/06 17:24:42 [emerg] 34#34: SSL_load_client_CA_file("/etc/ingress-controller/ssl/test-mtls-truststore.pem") failed (SSL: error:0908F066:PEM routines:get_header_and_data:bad end line)
nginx: [emerg] SSL_load_client_CA_file("/etc/ingress-controller/ssl/test-mtls-truststore.pem") failed (SSL: error:0908F066:PEM routines:get_header_and_data:bad end line)

Replicating the problem

I did the following in minikube

#!/bin/bash
VERSION=4.7.1
NS=ingress-test

# Install ingress controller
helm upgrade nginx ingress-nginx/ingress-nginx -i --version ${VERSION} -n ${NS} --create-namespace

echo Wait for ingress controller to be live
until kubectl wait -n ${NS} --for=condition=Ready pod --selector app.kubernetes.io/component=controller
do
  sleep 1
done

# Create large truststore (increased likelihood of race condition)
cat << EOF | kubectl apply -n ${NS} -f - --server-side
apiVersion: v1
data:
  ca.crt: |
$(cat /etc/ssl/certs/ca-certificates.crt | base64 | sed "s/^/    /")
kind: Secret
metadata:
  name: truststore
type: Opaque
EOF

# Create ingress
cat <<EOF | kubectl apply -n ${NS} -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/auth-tls-pass-certificate-to-upstream: "true"
    nginx.ingress.kubernetes.io/auth-tls-secret: ingress-test/truststore
    nginx.ingress.kubernetes.io/auth-tls-verify-client: "on"
    nginx.ingress.kubernetes.io/auth-tls-verify-depth: "1"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    update-time: ""
  name: ingress
spec:
  ingressClassName: nginx
  rules:
  - host: dummy.host.com
    http:
      paths:
      - backend:
          service:
            name: dummy-service
            port:
              number: 8080
        path: /
        pathType: ImplementationSpecific
  tls:
  - hosts:
    - dummy.host.com
EOF

Use 2 terminals

Terminal 1

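Exec into the controller pod: kubectl exec -n ingress-test -it deployment.apps/nginx-ingress-nginx-controller -- bash

Run the following script, which counts consecutive reads where the truststore checksum matches the initial one and reports each time it differs: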

expected_md5=$(md5sum /etc/ingress-controller/ssl/ca-ingress-test-truststore.pem)
cnt=0
while true
do
  if [[ "$(md5sum /etc/ingress-controller/ssl/ca-ingress-test-truststore.pem)" == "${expected_md5}" ]] ; then
    let cnt++
  else
    echo "success count: $cnt"
    cnt=0
    echo "failure! $(date)"
  fi
done

outputs:

success count: 1272
failure! Thu Aug 24 15:41:06 UTC 2023
success count: 663
failure! Thu Aug 24 15:41:12 UTC 2023
success count: 402
failure! Thu Aug 24 15:41:16 UTC 2023
success count: 392
failure! Thu Aug 24 15:41:19 UTC 2023
success count: 246
failure! Thu Aug 24 15:41:22 UTC 2023

Or run the following loop, which repeats the nginx -t config test that the controller performs internally to validate the config:

cnt=0
while true
do
  if nginx -tq ; then
    let cnt++
  else
    echo "success count: $cnt"
    cnt=0
    echo "failure! $(date)"
  fi
done

outputs:

2023/08/24 15:50:18 [emerg] 4320#4320: SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/ca-ingress-test-truststore.pem") failed (SSL: error:04800066:PEM routines::bad end line error:05880009:x509 certificate routines::PEM lib)
nginx: [emerg] SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/ca-ingress-test-truststore.pem") failed (SSL: error:04800066:PEM routines::bad end line error:05880009:x509 certificate routines::PEM lib)
nginx: configuration file /etc/nginx/nginx.conf test failed
success count: 9
failure! Thu Aug 24 15:50:18 UTC 2023
2023/08/24 15:50:21 [emerg] 4332#4332: SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/ca-ingress-test-truststore.pem") failed (SSL: error:04800066:PEM routines::bad end line error:05880009:x509 certificate routines::PEM lib)
nginx: [emerg] SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/ca-ingress-test-truststore.pem") failed (SSL: error:04800066:PEM routines::bad end line error:05880009:x509 certificate routines::PEM lib)
nginx: configuration file /etc/nginx/nginx.conf test failed
success count: 7
failure! Thu Aug 24 15:50:21 UTC 2023
2023/08/24 15:50:24 [emerg] 4347#4347: SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/ca-ingress-test-truststore.pem") failed (SSL: error:04800066:PEM routines::bad end line error:05880009:x509 certificate routines::PEM lib)
nginx: [emerg] SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/ca-ingress-test-truststore.pem") failed (SSL: error:04800066:PEM routines::bad end line error:05880009:x509 certificate routines::PEM lib)
nginx: configuration file /etc/nginx/nginx.conf test failed
success count: 9
failure! Thu Aug 24 15:50:24 UTC 2023
2023/08/24 15:50:25 [emerg] 4352#4352: SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/ca-ingress-test-truststore.pem") failed (SSL: error:04800066:PEM routines::bad end line error:05880009:x509 certificate routines::PEM lib)
nginx: [emerg] SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/ca-ingress-test-truststore.pem") failed (SSL: error:04800066:PEM routines::bad end line error:05880009:x509 certificate routines::PEM lib)
nginx: configuration file /etc/nginx/nginx.conf test failed
success count: 2
failure! Thu Aug 24 15:50:25 UTC 2023

Terminal 2

Once the monitoring is running in terminal 1, create an update storm by constantly patching the Ingress resource.

while true; do
    kubectl patch -n ingress-test ingress ingress  --type merge --patch "metadata: {annotations: {update-time: \"$(date)\"}}"
done

outputs:

ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched (no change)
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched (no change)
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched (no change)
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched (no change)
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched (no change)
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched (no change)

Causes

Mitigations that we applied

Example, where ingress is the namespace that contains the Secret mtls-truststore:

metadata:
  annotations:
    nginx.ingress.kubernetes.io/auth-tls-secret: ingress/mtls-truststore

Possible solutions

func ConfigureCACert(name string, ca []byte, sslCert *ingress.SSLCert) error {
    caName := fmt.Sprintf("ca-%v.pem", name)
+   tmpFileName := fmt.Sprintf("%v/.%v", file.DefaultSSLDirectory, caName)
    fileName := fmt.Sprintf("%v/%v", file.DefaultSSLDirectory, caName)

+   // Perform atomic write by doing a write followed by a rename (unix only)
-   err := os.WriteFile(fileName, ca, 0644)
+   err := os.WriteFile(tmpFileName, ca, 0644)

+   if err == nil {
+       err = os.Rename(tmpFileName, fileName)
+   }

    if err != nil {
        return fmt.Errorf("could not write CA file %v: %v", fileName, err)
    }

    sslCert.CAFileName = fileName

    klog.V(3).InfoS("Created CA Certificate for Authentication", "path", fileName)

    return nil
}

Other potentially affected pieces of code:

I am willing to provide a PR with fixes if you can provide some guidance on my proposed solution(s).

qds-x commented 6 months ago

To add some information on this: we are able to consistently reproduce the issue by deploying ingresses with the following annotations:

  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: HTTPS
    nginx.ingress.kubernetes.io/proxy-ssl-name: non-existent-service.user-xx-yy-sandbox.svc.cluster.local
    nginx.ingress.kubernetes.io/proxy-ssl-secret: user-xx-yy-sandbox/dummy-proxy-ssl-secret
    nginx.ingress.kubernetes.io/proxy-ssl-verify: "on"
    nginx.ingress.kubernetes.io/proxy-ssl-verify-depth: "2"

Attempts to deploy many such ingresses simultaneously give errors such as:

-------------------------------------------------------------------------------

        * admission webhook "validate.nginx.ingress.kubernetes.io" denied the request: 
-------------------------------------------------------------------------------
Error: exit status 1
2024/03/04 12:27:39 [warn] 2185398#2185398: the "http2_max_field_size" directive is obsolete, use the "large_client_header_buffers" directive instead in /tmp/nginx/nginx-cfg16865532:145
nginx: [warn] the "http2_max_field_size" directive is obsolete, use the "large_client_header_buffers" directive instead in /tmp/nginx/nginx-cfg16865532:145
2024/03/04 12:27:39 [warn] 2185398#2185398: the "http2_max_header_size" directive is obsolete, use the "large_client_header_buffers" directive instead in /tmp/nginx/nginx-cfg16865532:146
nginx: [warn] the "http2_max_header_size" directive is obsolete, use the "large_client_header_buffers" directive instead in /tmp/nginx/nginx-cfg16865532:146
2024/03/04 12:27:39 [warn] 2185398#2185398: the "http2_max_requests" directive is obsolete, use the "keepalive_requests" directive instead in /tmp/nginx/nginx-cfg16865532:147
nginx: [warn] the "http2_max_requests" directive is obsolete, use the "keepalive_requests" directive instead in /tmp/nginx/nginx-cfg16865532:147
2024/03/04 12:27:39 [emerg] 2185398#2185398: SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/ca-user-xx-yy-sandbox-dummy-proxy-ssl-secret.pem") failed (SSL: error:04800066:PEM routines::bad end line error:05880009:x509 certificate routines::PEM lib)
nginx: [emerg] SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/ca-user-xx-yy-sandbox-dummy-proxy-ssl-secret.pem") failed (SSL: error:04800066:PEM routines::bad end line error:05880009:x509 certificate routines::PEM lib)
nginx: configuration file /tmp/nginx/nginx-cfg16865532 test failed

Observations:

As our ingresses only need a single shared CA bundle which doesn't change often, our workaround right now is to mount said bundle as a configmap into the nginx pods, then use a configuration snippet to turn on TLS verification to the backend pods, referencing the mounted bundle:

    nginx.ingress.kubernetes.io/configuration-snippet: |
      proxy_ssl_trusted_certificate           /path/to/mounted/bundle.pem;
      proxy_ssl_verify                        on;
      proxy_ssl_verify_depth                  2;
      proxy_ssl_name                          non-existent-service.user-xx-yy-sandbox.svc.cluster.local; 

This seems to dodge the race condition but is far from ideal, not least because enabling configuration snippets exposes vulnerabilities.

I created a helm chart which consistently reproduces the issue. It deploys a placeholder secret, then deploys many ingresses with the above annotations which reference said secret.

longwuyuan commented 2 weeks ago

Hi,

It seems clear that an event such as a controller rollout, where existing controller pods terminate and new controller pods are created, is required to cause this. Another trigger seems to be a large volume of ingresses with the relevant annotation that injects secrets. I see that some comments also concur that a race condition is not ruled out.

To state the obvious, just one or a few ingresses syncing concurrently does not cause this problem. It is also obvious that users who have mTLS secrets in ingresses, either in large volumes or involved in rollouts during upgrades, need a better experience.

But the project is extremely short on resources and there is no developer time available to work on this. If a PR is submitted then it is likely that it will get reviewed, but an e2e test that mirrors the conditions in a kind cluster is an absolute requirement. I see the need for lots of certs there.

The project's resources are prioritized toward securing the controller by default and implementing the Gateway API. We have actually deprecated features that fall well outside the Ingress API spec, such as TCP/UDP forwarding.

But the best step forward would be for you to join the community meeting, announcing your intent to do so beforehand, and to discuss this in the ingress-nginx-dev channel of the Kubernetes Slack. It would help a lot.