opensearch-project / opensearch-k8s-operator

OpenSearch Kubernetes Operator
Apache License 2.0

Operator 2.6.0 Memory Leak #790

Closed mrvdcsg closed 1 month ago

mrvdcsg commented 2 months ago

What is the bug?

Over time, the operator container consumes more memory until it fails and is recreated. This issue was present in 2.5.0 and 2.5.1 as well. We reverted all clusters to use 2.4.0 and the problem went away.

How can one reproduce the bug?

Set up the 2.6.0 operator and deploy a small OpenSearch cluster with it. Monitor it over the next day or two and you'll see that the memory consumption of the operator manager pod continually increases. Higher-activity clusters grow faster and fail more often. A slow memory creep can be observed in under an hour.

What is the expected behavior?

Our clusters that use the 2.4.0 operator have garbage collection that runs regularly, and memory consumption stays well regulated between 20 MB and 50 MB. It has been stable for months.

What is your host/environment?

We are running OpenSearch 2.6.0 with operator 2.6.0 (now reverted to 2.4.0) on AKS (also observed on docker-desktop and Rancher).

Do you have any screenshots?

There are plenty of screenshots on issue #700 referencing 2.5.1 (which also had this leak). That ticket is for 2.4.0, but we haven't observed any issues with 2.4.0.

Do you have any additional context?

I've tested this with a small cluster without any additional plugins or even any indexes, to rule out a plugin causing the leak. This appears to be something in the operator itself. A development cluster started at 110 MB and overnight is now at 241 MB and climbing. I kept the memory limit of 500 MB the same, and the pod crashes when it hits that limit. Increasing the limit doesn't help; it just delays the crash and wastes resources.

Any help appreciated!

mrvdcsg commented 1 month ago

I think pprof could be useful in determining the root cause here. We are unable to upgrade past 2.4.0 until the memory leak is resolved. There is a prioritized issue for adding pprof support here: pprof

prudhvigodithi commented 1 month ago

[Triage] Thanks @mrvdcsg for reporting the bug. As you noticed, there is already an open issue for this bug. Can we club both issues under one umbrella? Thanks! Adding @swoehrl-mw @salyh @bbarani

mrvdcsg commented 1 month ago

[Triage] Thanks @mrvdcsg for reporting the bug. As you noticed, there is already an open issue for this bug. Can we club both issues under one umbrella? Thanks! Adding @swoehrl-mw @salyh @bbarani

I agree that this can be tracked under issue #700. That ticket reported a memory leak in 2.4.0, and I wanted to create visibility that the leak also exists in 2.5.1 and 2.6.0. I will close this ticket so it can be tracked on the other open issue.