kubernetes / ingress-nginx

Ingress-NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

Admission webhook denied intermittently (depending on disk I/O load?) #10234

Closed: koolfy closed this issue 1 week ago

koolfy commented 1 year ago

What happened:

When deploying many environments, the Ingress objects use the following annotation: nginx.ingress.kubernetes.io/auth-tls-secret: ingress/stg-ca. This "ingress/stg-ca" secret is not modified or recreated during these deployments, but it sometimes seems to have difficulties being read during admission webhook validation, producing errors like:

* admission webhook "validate.nginx.ingress.kubernetes.io" denied the request: 
Error: exit status 1
nginx: [emerg] SSL_load_client_CA_file("/etc/ingress-controller/ssl/ca-ingress-stg-ca.pem") failed (SSL: error:04800066:PEM routines::bad end line)
nginx: configuration file /tmp/nginx/nginx-cfg2258806396 test failed

Error: exit status 1
nginx: [emerg] SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/ca-ingress-stg-ca.pem") failed (SSL: error:05800088:x509 certificate routines::no certificate or crl found)
nginx: configuration file /tmp/nginx/nginx-cfg907486804 test failed

The admission webhook sometimes (often) fails with these errors on busy but otherwise functional clusters, when deploying new environments with known-to-be-valid ingress objects. Re-running the same deployment succeeds on some occasions, confirming there is nothing fundamentally wrong with the ingress objects themselves.
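For reference, the Ingress objects being admitted look roughly like the sketch below. Everything except the auth-tls-secret annotation (which points at the pre-existing shared CA secret) is a hypothetical placeholder: name, namespace, host, TLS secret, and backend are illustrative only.

# Hypothetical sketch of one of the Ingress objects involved; only the
# auth-tls-secret annotation reflects the actual setup, all other names
# and the server TLS secret are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  namespace: stg-env-1
  annotations:
    nginx.ingress.kubernetes.io/auth-tls-secret: ingress/stg-ca
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - app-1.stg.example.com
    secretName: app-1-tls   # hypothetical server certificate secret
  rules:
  - host: app-1.stg.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app
            port:
              number: 80
EOF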

It might be some form of race condition depending on how busy the k8s cluster is (disk I/O)?

What you expected to happen:

The admission webhook should fail only if the ingress objects produce malformed configurations or otherwise invalid certificates; it should not produce false-positive failures on deployments.

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):

NGINX Ingress controller
  Release:       v1.8.1
  Build:         dc88dce9ea5e700f3301d16f971fa17c6cfe757d
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.21.6

Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2-gke.2100", GitCommit:"00dd416d1e3300d98717d48686c7cde7cb5dd6b5", GitTreeState:"clean", BuildDate:"2023-06-14T09:21:52Z", GoVersion:"go1.20.4 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}

Environment:

ingress-nginx ingress 69 2023-06-30 22:19:19.263591445 +0000 UTC deployed ingress-nginx-4.7.1 1.8.1

USER-SUPPLIED VALUES:
controller:
  config:
    enable-opentelemetry: "true"
    log-format-escape-json: "true"
    log-format-upstream: |
      {

        "requestId": "$req_id",
        "proxyXForwardedFor": "$proxy_add_x_forwarded_for",
        "proxyUpstreamName": "$proxy_upstream_name",
        "proxyUpstreamAddr": "$upstream_addr",
        "requestMethod": "$request_method",
        "requestUrl": "$host$uri?$args",
        "status": "$status",
        "requestSize": "$request_length",
        "responseSize": "$upstream_response_length",
        "userAgent": "$http_user_agent",
        "remoteIp": "$realip_remote_addr",
        "serverIp": "$remote_addr",
        "referer": "$http_referer",
        "latency": "$upstream_response_time"
      }
    opentelemetry-config: /etc/nginx/opentelemetry.toml
    opentelemetry-location-operation-name: $namespace/$service_name
    opentelemetry-operation-name: $request_method $service_name $uri
    otel-sampler: AlwaysOn
    otel-sampler-parent-based: "true"
    otel-sampler-ration: 1
    otel-service-name: ingress-nginx
    otlp-collector-host: otel-collector.opentelemetry.svc
    server-snippet: |
      opentelemetry_attribute "ingress.namespace" "$namespace";
      opentelemetry_attribute "ingress.service_name" "$service_name";
      opentelemetry_attribute "ingress.name" "$ingress_name";
      opentelemetry_attribute "ingress.upstream" "$proxy_upstream_name";
    ssl-ciphers: EECDH+AESGCM:EDH+AESGCM
    ssl-config: TLSv1.2 TLSv1.3
    ssl-ecdh-curve: secp384r1
    use-gzip: "true"
  image:
    chroot: true
  metrics:
    enabled: true
    prometheusRule:
      enabled: true
      rules:
      - alert: NGINXConfigFailed
        annotations:
          description: bad ingress config - nginx config test failed
          summary: uninstall the latest ingress changes to allow config reloads to
            resume
        expr: count(nginx_ingress_controller_config_last_reload_successful == 0) >
          0
        for: 1s
        labels:
          severity: critical
      - alert: NGINXCertificateExpiry
        annotations:
          description: ssl certificate(s) will expire in less then a week
          summary: renew expiring certificates to avoid downtime
        expr: (avg(nginx_ingress_controller_ssl_expire_time_seconds) by (host) - time())
          < 604800
        for: 1s
        labels:
          severity: critical
      - alert: NGINXTooMany500s
        annotations:
          description: Too many 5XXs
          summary: More than 5% of all requests returned 5XX, this requires your attention
        expr: 100 * sum(rate(nginx_ingress_controller_requests{status=~"5.+"}[3m]))
          / sum(rate(nginx_ingress_controller_requests[3m])) > 5
        for: 1m
        labels:
          business: true
          severity: warning
      - alert: NGINXTooMany400s
        annotations:
          description: Too many 4XXs
          summary: More than 5% of all requests returned 4XX over last 10 minutes,
            this requires your attention
        expr: 100 * sum(rate(nginx_ingress_controller_requests{status=~"4.+"}[3m]))
          / sum(rate(nginx_ingress_controller_requests[3m])) > 50
        for: 15m
        labels:
          business: true
          severity: warning
      - alert: NGINXSlowQueries
        annotations:
          description: Server respond too slowly
          summary: Only {{ $value }}% of all requests respond in less than 2.5 seconds
        expr: 100 * ( sum(rate(nginx_ingress_controller_request_duration_seconds_bucket{le="2.5"}[3m]))
          / sum(rate(nginx_ingress_controller_request_duration_seconds_count[3m]))
          ) < 90
        for: 10m
        labels:
          business: true
          severity: warning
    serviceMonitor:
      enabled: true
  opentelemetry:
    enabled: true
  replicaCount: 2
  service:
    externalTrafficPolicy: Local
    loadBalancerIP: XX.XX.XX.XX
Name:         nginx
Labels:       app.kubernetes.io/component=controller
              app.kubernetes.io/instance=ingress-nginx
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=ingress-nginx
              app.kubernetes.io/part-of=ingress-nginx
              app.kubernetes.io/version=1.8.1
              helm.sh/chart=ingress-nginx-4.7.1
              helm.toolkit.fluxcd.io/name=ingress-nginx
              helm.toolkit.fluxcd.io/namespace=ingress
Annotations:  meta.helm.sh/release-name: ingress-nginx
              meta.helm.sh/release-namespace: ingress
Controller:   k8s.io/ingress-nginx
Events:       <none>
--SNIP--
    Args:
      /nginx-ingress-controller
      --publish-service=$(POD_NAMESPACE)/ingress-nginx-controller
      --election-id=ingress-nginx-leader
      --controller-class=k8s.io/ingress-nginx
      --ingress-class=nginx
      --configmap=$(POD_NAMESPACE)/ingress-nginx-controller
      --validating-webhook=:8443
      --validating-webhook-certificate=/usr/local/certificates/cert
      --validating-webhook-key=/usr/local/certificates/key
    State:          Running
      Started:      Mon, 10 Jul 2023 16:11:04 +0200
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:      100m
      memory:   90Mi
    Liveness:   http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=5
    Readiness:  http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       ingress-nginx-controller-7c4ffdf7b8-czhs2 (v1:metadata.name)
      POD_NAMESPACE:  ingress (v1:metadata.namespace)
      LD_PRELOAD:     /usr/local/lib/libmimalloc.so
    Mounts:
      /chroot/modules_mount from modules (rw)
      /usr/local/certificates/ from webhook-cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lf9f2 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  modules:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  webhook-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ingress-nginx-admission
    Optional:    false
  kube-api-access-lf9f2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
--SNIP--

Events:
  Type    Reason  Age                  From                      Message
  ----    ------  ----                 ----                      -------
  Normal  RELOAD  65s (x724 over 11d)  nginx-ingress-controller  NGINX reload triggered due to a change in configuration

How to reproduce this issue: Not 100% sure; the problem is definitely not deterministic.

Anything else we need to know: It might happen more often at hours when there is notable disk I/O on the system partition of the Kubernetes nodes, but it still probably shouldn't fail like this.

k8s-ci-robot commented 1 year ago

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
longwuyuan commented 1 year ago

/triage needs-information

It would help a lot if you could write a step-by-step guide to reproduce this problem in a minikube or kind cluster. That would provide insight into whether this is a problem with the controller or with the resources allocated to the pods.
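A rough starting point for such a repro could be something like the sketch below. It assumes a kind or minikube cluster with ingress-nginx (validating webhook enabled) and a valid CA secret already installed at ingress/stg-ca; all other names are hypothetical, and additional disk I/O load on the node may be needed to trigger the suspected race.

#!/usr/bin/env bash
# Hypothetical repro sketch: create many Ingresses that all reference the same
# auth-tls-secret in quick succession and report any spurious admission denials.
# Assumes ingress-nginx is installed with the validating webhook enabled and a
# valid CA secret already exists at ingress/stg-ca. ingress.yaml is an Ingress
# like the one sketched in the issue description, with NAMESPACE and HOST
# placeholder tokens substituted below.
set -euo pipefail

for i in $(seq 1 50); do
  ns="stg-env-${i}"
  kubectl create namespace "${ns}" --dry-run=client -o yaml | kubectl apply -f -
  sed -e "s/NAMESPACE/${ns}/g" -e "s/HOST/app-${i}.stg.example.com/g" ingress.yaml \
    | kubectl apply -f - || echo "admission denied for ${ns}"
done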

longwuyuan commented 1 year ago

/remove-kind bug

github-actions[bot] commented 1 year ago

This issue is stale, but we won't close it automatically; just bear in mind that the maintainers may be busy with other tasks and will get to your issue as soon as possible. If you have any questions or would like to request prioritization, please reach out on #ingress-nginx-dev on Kubernetes Slack.

longwuyuan commented 1 week ago

"No certificate not found" and "bad end line" would be errors when reading the secret is a problem. so in the absence of enough information to pinpoint the root-cause, it can be assumed that the event occured during new deployments and the secret was not ready when the ingress was attempted to be created. Or the I/O caused a failed read of the secret.

/close

k8s-ci-robot commented 1 week ago

@longwuyuan: Closing this issue.

In response to [this](https://github.com/kubernetes/ingress-nginx/issues/10234#issuecomment-2349169059):

> "No certificate or CRL found" and "bad end line" are the errors you would get when reading the secret fails. So in the absence of enough information to pinpoint the root cause, it can be assumed that the event occurred during a new deployment and the secret was not yet ready when the ingress creation was attempted, or that disk I/O caused a failed read of the secret.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.