kubernetes / ingress-nginx

Ingress-NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

Potential memory leak in OpenSSL #7647

Closed Lyt99 closed 11 months ago

Lyt99 commented 3 years ago

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): 0.44.0 & 0.49.0

Kubernetes version (use kubectl version): 1.18.8

Environment:

What happened:

We've encountered a memory issue in both 0.44.0 and 0.49.0. Some of the ingress pods show high memory usage, while others stay at a normal level.

[screenshot: per-pod memory usage, with some ingress-nginx pods far above the others]
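The same per-pod skew can also be seen from the CLI if metrics-server is available; roughly like this (the label selector assumes the standard chart labels and may differ in your install):

kubectl top pod -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx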

We did some diagnosis on the pod, and it shows that one of the nginx workers had gained a large amount of memory.

[screenshot: per-process memory inside the pod, one nginx worker with a very high RSS]

The incoming traffic is balanced, about 100 requests per second, and the connection counts between pods are of the same order of magnitude (from 10k+ to 100k+).

We then used pmap -x <pid> to get details of the memory. There were lots of tiny anonymous blocks in the memory map.

[screenshot: pmap -x output with many small anonymous mappings]
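Roughly how we gathered this inside the pod; the pod name and worker PID below are placeholders, and flags may vary slightly since the image ships busybox tools:

kubectl exec -it -n ingress-nginx <controller-pod> -- sh
# inside the pod: list the nginx worker PIDs, then dump the mappings of the busy one
ps aux | grep 'nginx: worker'
pmap -x <worker-pid> | sort -k3 -n | tail -n 20   # largest resident mappings last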

We made a coredump and took a look at this memory area; most of its content seems to be related to TLS certs. We also tried running memleak on the process, with the results below:

[16:18:49] Top 10 stacks with outstanding allocations:
    300580 bytes in 15029 allocations from stack
        CRYPTO_strdup+0x30 [libcrypto.so.1.1]
        [unknown]
    462706 bytes in 375 allocations from stack
        [unknown] [libcrypto.so.1.1]
    507864 bytes in 9069 allocations from stack
        CRYPTO_zalloc+0xa [libcrypto.so.1.1]
        [unknown]
    536576 bytes in 131 allocations from stack
        [unknown] [libcrypto.so.1.1]
    848638 bytes in 333 allocations from stack
        ngx_alloc+0xf [nginx]
        [unknown]
    2100720 bytes in 22253 allocations from stack
        CRYPTO_zalloc+0xa [libcrypto.so.1.1]
    3074792 bytes in 888 allocations from stack
        BUF_MEM_grow+0x81 [libcrypto.so.1.1]
    3496960 bytes in 4398 allocations from stack
        posix_memalign+0x1a [ld-musl-x86_64.so.1]
    5821440 bytes in 9096 allocations from stack
        [unknown] [libssl.so.1.1]
        [unknown]
    9060080 bytes in 22605 allocations from stack
        CRYPTO_zalloc+0xa [libcrypto.so.1.1]
[16:18:58] Top 10 stacks with outstanding allocations:
    287280 bytes in 14364 allocations from stack
        CRYPTO_strdup+0x30 [libcrypto.so.1.1]
        [unknown]
    393216 bytes in 96 allocations from stack
        [unknown] [libcrypto.so.1.1]
    396428 bytes in 322 allocations from stack
        [unknown] [libcrypto.so.1.1]
    486080 bytes in 8680 allocations from stack
        CRYPTO_zalloc+0xa [libcrypto.so.1.1]
        [unknown]
    724916 bytes in 286 allocations from stack
        ngx_alloc+0xf [nginx]
        [unknown]
    1949832 bytes in 20300 allocations from stack
        CRYPTO_zalloc+0xa [libcrypto.so.1.1]
    2032380 bytes in 727 allocations from stack
        BUF_MEM_grow+0x81 [libcrypto.so.1.1]
    3760256 bytes in 5049 allocations from stack
        posix_memalign+0x1a [ld-musl-x86_64.so.1]
    5575680 bytes in 8712 allocations from stack
        [unknown] [libssl.so.1.1]
        [unknown]
    8525968 bytes in 20572 allocations from stack
        CRYPTO_zalloc+0xa [libcrypto.so.1.1]
[16:19:06] Top 10 stacks with outstanding allocations:
    716420 bytes in 35821 allocations from stack
        CRYPTO_strdup+0x30 [libcrypto.so.1.1]
        [unknown]
    782336 bytes in 191 allocations from stack
        [unknown] [libcrypto.so.1.1]
    885218 bytes in 721 allocations from stack
        [unknown] [libcrypto.so.1.1]
    1233680 bytes in 22030 allocations from stack
        CRYPTO_zalloc+0xa [libcrypto.so.1.1]
        [unknown]
    1761982 bytes in 775 allocations from stack
        ngx_alloc+0xf [nginx]
        [unknown]
    3814396 bytes in 1525 allocations from stack
        BUF_MEM_grow+0x81 [libcrypto.so.1.1]
    4298576 bytes in 48880 allocations from stack
        CRYPTO_zalloc+0xa [libcrypto.so.1.1]
    11922816 bytes in 15455 allocations from stack
        posix_memalign+0x1a [ld-musl-x86_64.so.1]
    14005760 bytes in 21884 allocations from stack
        [unknown] [libssl.so.1.1]
        [unknown]
    21036912 bytes in 49333 allocations from stack
        CRYPTO_zalloc+0xa [libcrypto.so.1.1]

Here are more samples: m.log
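For reference, the traces above come from the BCC memleak tool attached to the busy worker, roughly like this; the tool path, worker PID and interval are placeholders, and bcc-tools usually has to run on the node rather than inside the pod:

# report outstanding allocations of the worker every 10 seconds
/usr/share/bcc/tools/memleak -p <worker-pid> 10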

Finally, we moved the cert to the load balancer provided by the cloud, and it's working fine now, but we still have no clue why this could happen.

The leak happens in nginx, on TLS connections. We tried rebuilding the image to upgrade the libraries to the newest versions (for OpenSSL, 1.1.1l-r0), but that didn't help.

What you expected to happen:

no memory leak with TLS

How to reproduce it:

I have no idea what makes the issue happen, and I can't reproduce it on another cluster.

Anything else we need to know:

So far, we haven't seen this issue with 0.30.0 (OpenSSL 1.1.1d-r3); I don't know whether it's a problem in newer OpenSSL.

/kind bug

longwuyuan commented 3 years ago

/remove-kind bug

Hi, let us wait until we get some helpful information that hints at a bug. Also, please provide the information asked for in the issue template.

We have been making changes for performance, and very soon we will release a build with changed controller components. If you test the current latest release and update this issue as per the template, it will help us get a better perspective.

/triage needs-information

lvauvillier commented 2 years ago

Hi, I have the same issue:

[screenshot: controller memory usage, 2021-09-25]

nginx -s reload temporarily solves the issue.
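For reference, I trigger the reload from outside the pod roughly like this (pod name is a placeholder); it recycles the worker processes, and the old workers' memory is released once they exit:

kubectl exec -n ingress-nginx <controller-pod> -- nginx -s reload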

Here is my info:

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):


NGINX Ingress controller
  Release:       v0.47.0
  Build:         7201e37633485d1f14dbe9cd7b22dd380df00a07
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.20.1


Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:59:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.9-gke.1001", GitCommit:"1fe18c314ed577f6047d2712a9d1c8e498e22381", GitTreeState:"clean", BuildDate:"2021-08-23T23:06:28Z", GoVersion:"go1.15.13b5", Compiler:"gc", Platform:"linux/amd64"}

Environment:

Helm: helm -n ingress-nginx get values ingress-nginx

USER-SUPPLIED VALUES:

controller:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
              - nginx-ingress
          topologyKey: kubernetes.io/hostname
        weight: 100
  config:
    use-gzip: true
  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels:
        release: kube-prometheus-stack
      enabled: true
      namespace: monitoring
  replicaCount: 2
  resources:
    requests:
      memory: 800Mi
  service:
    externalTrafficPolicy: Local

kubectl describe po -n ingress-nginx ingress-nginx-controller-788c5f7f88-d94pj

Name:         ingress-nginx-controller-788c5f7f88-d94pj
Namespace:    ingress-nginx
Priority:     0
Node:         gke-production-pool-1-66bb3111-sldn/10.132.0.4
Start Time:   Sat, 18 Sep 2021 17:17:13 +0200
Labels:       app.kubernetes.io/component=controller
              app.kubernetes.io/instance=ingress-nginx
              app.kubernetes.io/name=ingress-nginx
              pod-template-hash=788c5f7f88
Annotations:  kubectl.kubernetes.io/restartedAt: 2021-09-18T17:17:13+02:00
Status:       Running
IP:           10.52.3.39
IPs:
  IP:           10.52.3.39
Controlled By:  ReplicaSet/ingress-nginx-controller-788c5f7f88
Containers:
  controller:
    Container ID:  containerd://74fb58bce33d84fb54fb61a3a16772d6edf8858cc14a05c21d0feb79a90e8157
    Image:         k8s.gcr.io/ingress-nginx/controller:v0.47.0@sha256:a1e4efc107be0bb78f32eaec37bef17d7a0c81bec8066cdf2572508d21351d0b
    Image ID:      k8s.gcr.io/ingress-nginx/controller@sha256:a1e4efc107be0bb78f32eaec37bef17d7a0c81bec8066cdf2572508d21351d0b
    Ports:         80/TCP, 443/TCP, 10254/TCP, 8443/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP
    Args:
      /nginx-ingress-controller
      --publish-service=$(POD_NAMESPACE)/ingress-nginx-controller
      --election-id=ingress-controller-leader
      --ingress-class=nginx
      --configmap=$(POD_NAMESPACE)/ingress-nginx-controller
      --validating-webhook=:8443
      --validating-webhook-certificate=/usr/local/certificates/cert
      --validating-webhook-key=/usr/local/certificates/key
    State:          Running
      Started:      Sat, 18 Sep 2021 17:17:14 +0200
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:      100m
      memory:   800Mi
    Liveness:   http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=5
    Readiness:  http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       ingress-nginx-controller-788c5f7f88-d94pj (v1:metadata.name)
      POD_NAMESPACE:  ingress-nginx (v1:metadata.namespace)
      LD_PRELOAD:     /usr/local/lib/libmimalloc.so
    Mounts:
      /usr/local/certificates/ from webhook-cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from ingress-nginx-token-cn2nx (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  webhook-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ingress-nginx-admission
    Optional:    false
  ingress-nginx-token-cn2nx:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ingress-nginx-token-cn2nx
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>

kubectl describe svc -n ingress-nginx ingress-nginx-controller

Name:                     ingress-nginx-controller
Namespace:                ingress-nginx
Labels:                   app.kubernetes.io/component=controller
                          app.kubernetes.io/instance=ingress-nginx
                          app.kubernetes.io/managed-by=Helm
                          app.kubernetes.io/name=ingress-nginx
                          app.kubernetes.io/version=0.47.0
                          helm.sh/chart=ingress-nginx-3.34.0
Annotations:              cloud.google.com/neg: {"ingress":true}
                          meta.helm.sh/release-name: ingress-nginx
                          meta.helm.sh/release-namespace: ingress-nginx
Selector:                 app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
Type:                     LoadBalancer
IP Families:              <none>
IP:                       10.56.2.89
IPs:                      10.56.2.89
LoadBalancer Ingress:     xxx.xxx.xxx.xxx
Port:                     http  80/TCP
TargetPort:               http/TCP
NodePort:                 http  31463/TCP
Endpoints:                10.52.3.39:80,10.52.4.31:80
Port:                     https  443/TCP
TargetPort:               https/TCP
NodePort:                 https  30186/TCP
Endpoints:                10.52.3.39:443,10.52.4.31:443
Session Affinity:         None
External Traffic Policy:  Local
HealthCheck NodePort:     30802
Events:                   <none>

rikatz commented 2 years ago

/priority critical-urgent

I will look into this along with another possible "leak" that is happening.

I have received a suggestion to test using BoringSSL instead of OpenSSL when building the image (for FIPS compliance, etc.); maybe we can try that as well.

lvauvillier commented 2 years ago

I have the same memory leak issue with the latest version:

bash-5.1$ /nginx-ingress-controller --version
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v1.0.2
  Build:         2b8ed4511af75a7c41e52726b0644d600fc7961b
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.19.9

-------------------------------------------------------------------------------
[screenshots: controller memory usage showing the same leak on v1.0.2, 2021-09-30]

rikatz commented 2 years ago

Folks,

in case I generate an image of 0.49.3 (to be released) with the OpenResty OpenSSL patch applied, are you able to test it and provide some feedback on that?
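For whoever tests it: pointing the Helm chart at such a build would look roughly like this, where the registry and tag are placeholders for the test image (the digest is cleared so the tag is actually used):

helm upgrade ingress-nginx ingress-nginx/ingress-nginx -n ingress-nginx \
  --reuse-values \
  --set controller.image.registry=<test-registry> \
  --set controller.image.tag=<test-tag> \
  --set controller.image.digest=""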

strongjz commented 2 years ago

/kind bug
/triage accepted

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes/ingress-nginx/issues/7647#issuecomment-1063605154):

>The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
>This bot triages issues and PRs according to the following rules:
>- After 90d of inactivity, `lifecycle/stale` is applied
>- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
>- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
>You can:
>- Reopen this issue or PR with `/reopen`
>- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
>- Offer to help out with [Issue Triage][1]
>
>Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
>/close
>
>[1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

rosscdh commented 2 years ago

+1 still happening

strongjz commented 2 years ago

/reopen
/lifecycle frozen

k8s-ci-robot commented 2 years ago

@strongjz: Reopened this issue.

In response to [this](https://github.com/kubernetes/ingress-nginx/issues/7647#issuecomment-1232980135):

>/reopen
>/lifecycle frozen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

k8s-triage-robot commented 1 year ago

This issue is labeled with priority/critical-urgent but has not been updated in over 30 days, and should be re-triaged. Critical-urgent issues must be actively worked on as someone's top priority right now.

You can:

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-ci-robot commented 1 year ago

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
rikatz commented 11 months ago

/close

k8s-ci-robot commented 11 months ago

@rikatz: Closing this issue.

In response to [this](https://github.com/kubernetes/ingress-nginx/issues/7647#issuecomment-1758569773):

>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.