kubernetes / ingress-nginx

Ingress NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

nginx pods crash with OOM | Prometheus metric collector memory leak #10141

Open akhilesh-godi opened 1 year ago

akhilesh-godi commented 1 year ago

What happened: We run the nginx ingress controller in AWS EKS, under very high load (~250M requests per minute).

Under stress, responses from the metrics handler are delayed; a simple curl shows that it takes about 13s for the endpoint to respond.
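
(For reference, this is easy to measure with a timed curl against the metrics endpoint; a sketch, assuming the default metrics port 10254 and a reachable pod IP, so adjust for your deployment:)

time curl -s -o /dev/null http://<controller-pod-ip>:10254/metrics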

We notice that after an inflection point, the overall memory of the process starts to increase and keeps increasing until it hits OOM and crashes.

This is consistently reproducible under load.

The heap profile clearly reflects the same: [heap profile screenshots omitted]

Memory leak (48GB is the max memory per pod): [memory usage graph omitted]

Open FDs: [open file descriptor graph omitted]

Throughput (unreliable because the metrics endpoint is latent): [throughput screenshot omitted]

I strongly second the issue reported in https://github.com/kubernetes/ingress-nginx/issues/9738. However, the mitigation of excluding metrics is not feasible, as we have already excluded everything we can, and the mitigation provided in https://github.com/kubernetes/ingress-nginx/pull/9770 is not feasible either. Any further reduction would mean we run blind on what is happening inside the controller.

What you expected to happen: The metrics endpoint shouldn't be latent. There should be configurable timeouts. The goroutine/memory leak should be fixed.
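
As a sketch of what "configurable timeouts" could look like on a metrics listener (illustrative standalone code, not the controller's actual wiring; port and durations are arbitrary), a plain net/http server can already bound how long a scrape is allowed to take:

package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())

	srv := &http.Server{
		Addr:         ":10254",
		Handler:      mux,
		ReadTimeout:  5 * time.Second,  // bound time spent reading the scrape request
		WriteTimeout: 10 * time.Second, // bound time spent writing the response
	}
	_ = srv.ListenAndServe()
}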

longwuyuan commented 1 year ago

/triage accepted /priority important-longterm

Yes, there are open issues on performance, but not much progress is happening for lack of actionable info. We do have a k6 test option in CI, but that just runs k6 against a vanilla kind cluster workload.

If the small, difficult-to-find details are available as actionable info, I think some progress is possible. For one thing, the challenge is replicating the environment, not so much for load but more so for config.

akhilesh-godi commented 1 year ago

Ah! I see.

I'm not sure of the exact root cause yet. However, I have a feeling that the circumstances under which this happens might not be necessary to reproduce the leak.

I'm considering placing an artificial sleep in the handler on the metrics path, and having the client close the connection before the server responds to see if that helps. I haven't gotten my hands dirty with the code yet. I will report back in case I have any findings.
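
Concretely, the kind of toy I have in mind is something like this (a standalone sketch, not controller code; port and durations are arbitrary): a handler with an artificial sleep, plus a client that abandons the request well before the handler returns.

package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Fake "metrics" handler with an artificial delay, standing in for the latent endpoint.
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(15 * time.Second)
		fmt.Fprintln(w, "# fake metrics")
	})
	go func() { _ = http.ListenAndServe(":10254", nil) }()
	time.Sleep(200 * time.Millisecond) // give the listener a moment to come up

	// Client gives up after 2s, long before the handler finishes.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	req, _ := http.NewRequestWithContext(ctx, http.MethodGet, "http://127.0.0.1:10254/metrics", nil)
	_, err := http.DefaultClient.Do(req)
	fmt.Println("client gave up:", err)

	// The question for the real controller is whether abandoned scrapes like this
	// leave goroutines or memory behind; here the handler simply finishes late.
	time.Sleep(20 * time.Second)
}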

akhilesh-godi commented 1 year ago

Sorry, I hadn't been able to spend time on this again until today. I took a goroutine dump after noticing the goroutine leak. Here are further findings:

ip-10-108-95-131:/etc/nginx$ cat goroutine_4.out | grep 'semacquire' | wc -l
13678
ip-10-108-95-131:/etc/nginx$ cat goroutine_4.out | grep 'goroutine ' | wc -l
13780

A significant number of these are attributed to a lock that was probably not being released effectively. About 13678 out of 13780 goroutines are waiting on the lock:

goroutine 64208 [semacquire]:
sync.runtime_SemacquireMutex(0x193c180?, 0x0?, 0x1b80c81?)
        runtime/sema.go:77 +0x25                                                              
sync.(*Mutex).lockSlow(0xc0010b0010)                             
        sync/mutex.go:171 +0x165
sync.(*Mutex).Lock(...)                                   
        sync/mutex.go:90
github.com/prometheus/client_golang/prometheus.(*summary).Observe(0xc0010b0000, 0x3f789374bc6a7efa)
        github.com/prometheus/client_golang@v1.14.0/prometheus/summary.go:285 +0x6e
k8s.io/ingress-nginx/internal/ingress/metric/collectors.(*SocketCollector).handleMessage(0xc000135780, {0xc62f4f8000, 0x1729d6, 0x1ac000})
        k8s.io/ingress-nginx/internal/ingress/metric/collectors/socket.go:336 +0x10d9
k8s.io/ingress-nginx/internal/ingress/metric/collectors.handleMessages({0x7f9a986b8d28?, 0xc0fe3cac40}, 0xc2aea626b0)
        k8s.io/ingress-nginx/internal/ingress/metric/collectors/socket.go:529 +0xb7
created by k8s.io/ingress-nginx/internal/ingress/metric/collectors.(*SocketCollector).Start
        k8s.io/ingress-nginx/internal/ingress/metric/collectors/socket.go:402 +0xed

This should help with the repro and with the root cause. I'll keep this thread updated.
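
For anyone who wants to capture the same data: assuming profiling is enabled on the controller (the --profiling flag, which defaults to true) and the status port is the default 10254, the dump and the counts above can be reproduced roughly like this (namespace and pod name are placeholders):

kubectl -n ingress-nginx port-forward <controller-pod> 10254:10254 &
curl -s 'http://127.0.0.1:10254/debug/pprof/goroutine?debug=2' > goroutine.out
grep -c 'goroutine ' goroutine.out   # total goroutines
grep -c 'semacquire' goroutine.out   # goroutines blocked acquiring a lock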

akhilesh-godi commented 1 year ago

I've confirmed that removing the code for the summary metric ingress_upstream_latency_seconds, along with the references to the upstreamLatency variable, fixes the leak. Since this metric is already deprecated, it should be removed soon to fix this.

However, it is very odd that this is not reproducible at low throughput. I'll put some thought into why that might be.

akhilesh-godi commented 1 year ago

@odinsy, @domcyrus, @longwuyuan for 👀

ranediwa commented 9 months ago

Is this change coming anytime soon? Will it work if we just switch off collection of this metric?

exherb commented 7 months ago

any update on this issue?

torbjornvatn commented 5 months ago

I'm also curious about a fix for this one

avinashbo commented 4 months ago

is there any progress on this?

chengjoey commented 3 months ago

The reasons for the OOM are as follows:

  1. The socket collector has a for loop that continuously starts handleMessages goroutines, one per batch of messages read from the socket: https://github.com/kubernetes/ingress-nginx/blob/f19e9265b0ca266c7f2bc5e4d2ac137479e8b842/internal/ingress/metric/collectors/socket.go#L455-L465

  2. latencyMetric.Observe(stats.Latency): the deprecated latencyMetric is a summary, and every one of those goroutines has to grab its lock when calling Observe: https://github.com/kubernetes/ingress-nginx/blob/f19e9265b0ca266c7f2bc5e4d2ac137479e8b842/internal/ingress/metric/collectors/socket.go#L387-L394

  3. Inside client_golang, the summary's Observe spawns a flush goroutine, and the observers end up fighting for the lock:
https://github.com/prometheus/client_golang/blob/aa3c00d2ee32f97a06edc29716ae80ba0e713b9e/prometheus/summary.go#L306-L318

https://github.com/prometheus/client_golang/blob/aa3c00d2ee32f97a06edc29716ae80ba0e713b9e/prometheus/summary.go#L366-L377

This is why the number of goroutines skyrockets. Since latencyMetric is already documented as deprecated, I think we can delete this metric.
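
To make that pile-up concrete, here is a minimal standalone sketch of the same pattern (illustrative names and rates, not controller code): one goroutine is spawned per "batch", every batch observes into a summary with quantile objectives, and all of them serialize on the summary's internal mutex, so goroutines accumulate once the spawn rate outpaces the flush rate.

package main

import (
	"fmt"
	"runtime"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Quantile objectives make Observe do real work under the summary's mutex,
	// like the deprecated upstream latency summary does.
	latency := prometheus.NewSummary(prometheus.SummaryOpts{
		Name:       "demo_upstream_latency_seconds",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
	})

	// Producer: one goroutine per batch of observations, mimicking the for loop
	// in the socket collector that hands each socket read to handleMessages.
	ticker := time.NewTicker(100 * time.Microsecond)
	go func() {
		for range ticker.C {
			go func() {
				for i := 0; i < 10000; i++ {
					latency.Observe(0.123) // every call contends on the same mutex
				}
			}()
		}
	}()

	// Watch the goroutine count climb while producers outpace the serialized consumer.
	for i := 0; i < 30; i++ {
		fmt.Println("goroutines:", runtime.NumGoroutine())
		time.Sleep(time.Second)
	}
}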

sgburtsev commented 3 months ago

While deleting the deprecated metric is the right way forward, --exclude-socket-metrics=nginx_ingress_controller_ingress_upstream_latency_seconds works as a temporary solution for the problem.
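
For a non-Helm install, the flag can be added to the controller container's args in the Deployment, roughly like this (the container name and surrounding fields are placeholders; only the last flag is the relevant addition):

containers:
  - name: controller
    args:
      - /nginx-ingress-controller
      # ... existing flags ...
      - --exclude-socket-metrics=nginx_ingress_controller_ingress_upstream_latency_seconds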

longwuyuan commented 3 months ago

How to reproduce? I can't reproduce the OOM.

sgburtsev commented 3 months ago

How to reproduce? I can't reproduce the OOM.

In order to reproduce this behavior we had to use a c6a.8xlarge instance to push ~12k RPS through the ingress. We used nginx with return 200 as a testing backend, although any high-performance server should do the job.
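
The backend really can be that small; a minimal sketch (port and layout assumed, not necessarily the exact config used):

server {
    listen 8080;
    location / {
        return 200;
    }
}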

Here are some graphs of the before and after behavior: [before/after memory graphs omitted]

You don't actually need to wait for the OOM; the increased memory consumption is enough to detect the problem. Although, it is just a matter of time.

[memory growth graph omitted]

chengjoey commented 3 months ago

While deleting the deprecated metric is the right way forward, --exclude-socket-metrics=nginx_ingress_controller_ingress_upstream_latency_seconds works as a temporary solution for the problem.

Yes, it can be solved by excluding the metric, but that requires extra configuration to avoid the OOM. It is better to delete this metric directly and save the unnecessary operational step.

longwuyuan commented 3 months ago

How do we set expectations for the users who want and need the metric nginx_ingress_controller_ingress_upstream_latency_seconds?

longwuyuan commented 3 months ago

@chengjoey can you kindly help and post an explicit link to the doc that announces the deprecation of what you called "latencyMetric"?

chengjoey commented 3 months ago

@chengjoey can you kindly help and post an explicit link to the doc that announces the deprecation of what you called "latencyMetric"?

https://github.com/kubernetes/ingress-nginx/blob/b933310da5914adbf3f243b5cf902c245d5a0cef/docs/user-guide/monitoring.md?plain=1#L389-L391

The user guide clearly states that nginx_ingress_controller_ingress_upstream_latency_seconds is deprecated. Sorry, I wasn't clear enough.

torbjornvatn commented 3 months ago

Is there any way to specify the --exclude-socket-metrics=nginx_ingress_controller_ingress_upstream_latency_seconds flag when installing the controller with Helm? I couldn't find any value for excluding metrics, at least.

lucasfcnunes commented 3 months ago

@torbjornvatn

# ...
controller:
  # ...
  extraArgs:
    # ...
    exclude-socket-metrics: nginx_ingress_controller_ingress_upstream_latency_seconds

EDIT: changed after https://github.com/kubernetes/ingress-nginx/issues/10141#issuecomment-2288462110

sgburtsev commented 3 months ago

extraArgs is a map:

{{- range $key, $value := .Values.controller.extraArgs }}

So you need:

controller:
  extraArgs:
    exclude-socket-metrics: "nginx_ingress_controller_ingress_upstream_latency_seconds"
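
The same thing can also be set from the command line if you prefer not to edit values.yaml; a sketch, assuming the usual release and chart names:

helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
  --reuse-values \
  --set controller.extraArgs.exclude-socket-metrics=nginx_ingress_controller_ingress_upstream_latency_seconds
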
nielsole commented 2 weeks ago

Does this only affect nginx_ingress_controller_ingress_upstream_latency_seconds? Looking at the code surrounding the highlighted snippet (https://github.com/kubernetes/ingress-nginx/blob/f19e9265b0ca266c7f2bc5e4d2ac137479e8b842/internal/ingress/metric/collectors/socket.go#L387-L394), it seems like any histogram would face the same issue at higher load?

mmerrill3 commented 1 week ago

Does this only affect nginx_ingress_controller_ingress_upstream_latency_seconds? Looking at the code surrounding the highlighted snippet (https://github.com/kubernetes/ingress-nginx/blob/f19e9265b0ca266c7f2bc5e4d2ac137479e8b842/internal/ingress/metric/collectors/socket.go#L387-L394), it seems like any histogram would face the same issue at higher load?

I don't see the histogram Observe() implementation taking locks at https://github.com/prometheus/client_golang/blob/aa3c00d2ee32f97a06edc29716ae80ba0e713b9e/prometheus/histogram.go#L649-L700; it looks like the only metric here implemented as a summary was this one.
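
For comparison, here is a tiny standalone sketch of the histogram path (illustrative names, not a proposed patch): client_golang histograms record observations with atomic operations, so heavy concurrent Observe traffic does not queue on a mutex the way the summary's quantile tracking does.

package main

import (
	"fmt"
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	latency := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "demo_upstream_latency_seconds",
		Buckets: prometheus.DefBuckets,
	})

	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 10000; j++ {
				latency.Observe(0.042) // lock-free fast path: atomic bucket and count updates
			}
		}()
	}
	wg.Wait()
	fmt.Println("10M concurrent observations recorded without piling up on a mutex")
}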