akhilesh-godi opened 1 year ago
/triage accepted
/priority important-longterm
Yes, there are open issues on performance, but not much progress is happening for lack of actionable info. We do have a k6 test option in CI, but that just runs k6 against a vanilla kind cluster workload.
If the small, difficult-to-find details were available as actionable info, I think some progress would be possible. For one thing, replicating the environment, not so much for load but more so for config, is the challenge.
Ah! I see.
I'm not sure of the exact root cause yet. However, I have a feeling that the circumstances under which this happens might not be necessary to reproduce the leak.
I'm considering placing an artificial sleep in the handler on the metrics path, and having the client close the connection before the server responds to see if that helps. I haven't gotten my hands dirty with the code yet. I will report back in case I have any findings.
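For anyone who wants to poke at this in isolation, here is roughly the experiment I have in mind as a standalone Go sketch (the port, sleep duration, and payload are made up; this is not the controller's actual handler):

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"runtime"
	"time"
)

func main() {
	// Stand-in /metrics handler with an artificial sleep, to see whether
	// clients that give up before the response leave goroutines behind.
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(5 * time.Second) // artificial delay on the metrics path
		fmt.Fprintln(w, "# placeholder metrics payload")
	})
	go http.ListenAndServe(":9090", nil)
	time.Sleep(100 * time.Millisecond) // give the listener a moment to start

	// Clients that close the connection before the server responds.
	for i := 0; i < 100; i++ {
		go func() {
			ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
			defer cancel()
			req, _ := http.NewRequestWithContext(ctx, http.MethodGet, "http://127.0.0.1:9090/metrics", nil)
			_, _ = http.DefaultClient.Do(req) // times out, closing the connection early
		}()
	}

	// Watch whether the goroutine count drains back down once the handlers finish.
	for i := 0; i < 10; i++ {
		fmt.Println("goroutines:", runtime.NumGoroutine())
		time.Sleep(time.Second)
	}
}
```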
Sorry, I hadn't been able to spend time on this again until today. I took a goroutine dump after noticing the goroutine leak. Here are further findings:
ip-10-108-95-131:/etc/nginx$ cat goroutine_4.out | grep 'semacquire' | wc -l
13678
ip-10-108-95-131:/etc/nginx$ cat goroutine_4.out | grep 'goroutine ' | wc -l
13780
A significant fraction of these are attributed to a lock that was probably not released effectively: about 13678 out of 13780 goroutines are waiting on the lock.
goroutine 64208 [semacquire]:
sync.runtime_SemacquireMutex(0x193c180?, 0x0?, 0x1b80c81?)
runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc0010b0010)
sync/mutex.go:171 +0x165
sync.(*Mutex).Lock(...)
sync/mutex.go:90
github.com/prometheus/client_golang/prometheus.(*summary).Observe(0xc0010b0000, 0x3f789374bc6a7efa)
github.com/prometheus/client_golang@v1.14.0/prometheus/summary.go:285 +0x6e
k8s.io/ingress-nginx/internal/ingress/metric/collectors.(*SocketCollector).handleMessage(0xc000135780, {0xc62f4f8000, 0x1729d6, 0x1ac000})
k8s.io/ingress-nginx/internal/ingress/metric/collectors/socket.go:336 +0x10d9
k8s.io/ingress-nginx/internal/ingress/metric/collectors.handleMessages({0x7f9a986b8d28?, 0xc0fe3cac40}, 0xc2aea626b0)
k8s.io/ingress-nginx/internal/ingress/metric/collectors/socket.go:529 +0xb7
created by k8s.io/ingress-nginx/internal/ingress/metric/collectors.(*SocketCollector).Start
k8s.io/ingress-nginx/internal/ingress/metric/collectors/socket.go:402 +0xed
This should help with the repro and with the root cause. I'll keep this thread updated.
I've confirmed that removing the code corresponding to the summary metric ingress_upstream_latency_seconds, along with the corresponding references to the variable upstreamLatency, fixes the leak. Since this metric is deprecated, it should be removed soon to fix this.
However, it is very odd that this is not reproducible at low throughput. I'll put some thought into why that might be.
@odinsy, @domcyrus, @longwuyuan for 👀
Is this change coming anytime soon? Will it work if we just switch off collection of this metric?
any update on this issue?
I'm also curious about a fix for this one
is there any progress on this?
The reasons for the OOM are as follows:
The socket collector runs a for loop that continuously starts handleMessages goroutines:
https://github.com/kubernetes/ingress-nginx/blob/f19e9265b0ca266c7f2bc5e4d2ac137479e8b842/internal/ingress/metric/collectors/socket.go#L455-L465
latencyMetric.Observe(stats.Latency) on the deprecated latencyMetric means each of these goroutines has to grab a lock when calling the Observe method:
https://github.com/kubernetes/ingress-nginx/blob/f19e9265b0ca266c7f2bc5e4d2ac137479e8b842/internal/ingress/metric/collectors/socket.go#L387-L394
In the summary metric's Observe, these goroutines end up fighting for that lock:
https://github.com/prometheus/client_golang/blob/aa3c00d2ee32f97a06edc29716ae80ba0e713b9e/prometheus/summary.go#L306-L318
This is why the number of goroutines skyrockets. Since latencyMetric is already marked as deprecated in the documentation, I think we can delete this metric.
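To make that concrete, here is a simplified Go sketch of the shape of this code path (not the actual controller source; names and the socket path are illustrative, loosely following the files linked above):

```go
package main

import (
	"net"

	"github.com/prometheus/client_golang/prometheus"
)

// A Summary with quantile objectives, like the deprecated upstream-latency
// metric: its Observe is guarded by an internal sync.Mutex.
var upstreamLatency = prometheus.NewSummary(prometheus.SummaryOpts{
	Name:       "demo_ingress_upstream_latency_seconds",
	Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
})

func start(l net.Listener) {
	for {
		conn, err := l.Accept()
		if err != nil {
			continue
		}
		// One goroutine per accepted connection. Under high RPS these are
		// spawned faster than the mutex below can be acquired, so they
		// accumulate in semacquire, as in the goroutine dump earlier.
		go handleMessages(conn)
	}
}

func handleMessages(conn net.Conn) {
	defer conn.Close()
	buf := make([]byte, 64*1024)
	if _, err := conn.Read(buf); err != nil {
		return
	}
	// ...decode the batch of stats, then for each entry:
	upstreamLatency.Observe(0.0015) // blocks on the summary's internal lock
}

func main() {
	l, err := net.Listen("unix", "/tmp/demo-metrics.sock") // illustrative path
	if err != nil {
		panic(err)
	}
	start(l)
}
```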
While deleting the deprecated metric is the right way, --exclude-socket-metrics=nginx_ingress_controller_ingress_upstream_latency_seconds works as a temporary workaround for the problem.
How do you reproduce this? I can't reproduce the OOM.
How do you reproduce this? I can't reproduce the OOM. …
In order to reproduce this behavior we had to use a c6a.8xlarge to get ~12k RPS on the ingress. We used nginx with return 200 as the test backend, although any high-performance server should do the job.
Here are some graphs with the before and after behavior.
You don't actually need to wait for the OOM; the increased memory consumption is enough to detect the problem. Reaching OOM is just a matter of time.
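Since any high-performance server should do the job, a minimal Go stand-in for the return 200 test backend could look like this (port and handler are arbitrary); point the Ingress at it and drive ~10k+ RPS while watching the controller's memory and goroutine count:

```go
package main

import "net/http"

func main() {
	// Always answer 200 with an empty body, like an nginx "return 200" location.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Listen on an arbitrary port; expose it behind the Ingress under test.
	http.ListenAndServe(":8080", nil)
}
```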
While deleting the deprecated metric is the right way, --exclude-socket-metrics=nginx_ingress_controller_ingress_upstream_latency_seconds works as a temporary workaround for the problem.
Yes, it can be solved by excluding the metric, but that requires extra configuration to avoid the OOM. It is better to delete this metric directly and save the unnecessary operational step.
How do we set expectations for the users who want and need the metric nginx_ingress_controller_ingress_upstream_latency_seconds?
@chengjoey can you kindly help and post an explicit link to the doc that announces the deprecation of what you called "latencyMetric"?
@chengjoey can you kindly help and post an explicit link to the doc that announces the deprecation of what you called "latencyMetric"?
The user guide clearly states that nginx_ingress_controller_ingress_upstream_latency_seconds is deprecated. Sorry I wasn't clear enough.
Is there any way to specify the --exclude-socket-metrics=nginx_ingress_controller_ingress_upstream_latency_seconds flag when installing the controller with Helm?
I couldn't find any value for excluding metrics, at least.
@torbjornvatn
# ...
controller:
  # ...
  extraArgs:
    # ...
    exclude-socket-metrics: nginx_ingress_controller_ingress_upstream_latency_seconds
EDIT: changed after https://github.com/kubernetes/ingress-nginx/issues/10141#issuecomment-2288462110
ExtraArgs is a map:
{{- range $key, $value := .Values.controller.extraArgs }}
So you need:
controller:
  extraArgs:
    exclude-socket-metrics: "nginx_ingress_controller_ingress_upstream_latency_seconds"
Does this only affect nginx_ingress_controller_ingress_upstream_latency_seconds?
Looking at the code surrounding the highlighted snippet, https://github.com/kubernetes/ingress-nginx/blob/f19e9265b0ca266c7f2bc5e4d2ac137479e8b842/internal/ingress/metric/collectors/socket.go#L387-L394, it seems like any histogram would face the same issue at higher load?
Does this only affect nginx_ingress_controller_ingress_upstream_latency_seconds? Looking at the code surrounding the highlighted code snippet, it seems like any histogram would face the same issue at higher load? …
I don't see the histogram Observe() implementation taking locks at https://github.com/prometheus/client_golang/blob/aa3c00d2ee32f97a06edc29716ae80ba0e713b9e/prometheus/histogram.go#L649-L700; it looks like the only summary metric was this one.
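A quick, standalone way to see the difference (illustrative numbers, not a rigorous benchmark): a Summary configured with quantile objectives serialises Observe behind a mutex, while a Histogram's Observe only updates atomic counters, so only the summary degrades badly under many concurrent observers:

```go
package main

import (
	"fmt"
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// timeIt runs many concurrent observers against the given Observe func and
// reports how long they take in total.
func timeIt(name string, observe func(float64)) {
	const workers, perWorker = 64, 50_000
	var wg sync.WaitGroup
	start := time.Now()
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				observe(0.0015)
			}
		}()
	}
	wg.Wait()
	fmt.Printf("%s: %v for %d observations\n", name, time.Since(start), workers*perWorker)
}

func main() {
	sum := prometheus.NewSummary(prometheus.SummaryOpts{
		Name:       "demo_summary",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
	})
	hist := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "demo_histogram",
		Buckets: prometheus.DefBuckets,
	})
	timeIt("summary (mutex-guarded Observe)", sum.Observe)
	timeIt("histogram (atomic Observe)", hist.Observe)
}
```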
What happened: We run the nginx ingress controller in AWS EKS. We use the controller under very high load (~250M rpm).
When under stress, the metrics handler responses are delayed. A simple curl shows that it takes about 13s for the endpoint to respond.
We notice that after an inflection point, the overall memory of the process starts to increase and keeps increasing until it hits OOM and crashes.
This is consistently reproducible under load.
The heap profile clearly reflects the same:
Memory leak: (48GB is the max memory per pod)
Open FDs:
Throughput: (unreliable because metrics endpoint is latent)
I strongly second the issue reported in https://github.com/kubernetes/ingress-nginx/issues/9738. However, the mitigation of excluding metrics is not feasible, as we have already excluded all that we can, and the mitigation provided in https://github.com/kubernetes/ingress-nginx/pull/9770 is not feasible either. Any further reduction would mean we run blind on what is happening inside the controller.
What you expected to happen: The metrics endpoint shouldn't be latent. There should be configurable timeouts. The goroutine/memory leak should be fixed.
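On the configurable-timeouts point, what I have in mind is something like explicit server-side limits on the metrics listener; a rough sketch (the port and durations are placeholders, and this is not how the controller currently wires its metrics server):

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())

	srv := &http.Server{
		Addr:              ":10254", // placeholder metrics port
		Handler:           mux,
		ReadHeaderTimeout: 5 * time.Second,
		WriteTimeout:      15 * time.Second, // cap how long a single scrape may take
		IdleTimeout:       30 * time.Second,
	}
	log.Fatal(srv.ListenAndServe())
}
```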