opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

OpenSearch Data Nodes memory exhaustion after upgrade from 2.9 to 2.12 (JDK 21 upgrade) #12454

Open rlevytskyi opened 3 months ago

rlevytskyi commented 3 months ago

Describe the bug

Hello OpenSearch Team, we've just upgraded our OpenSearch cluster from version 2.9.0 to 2.12.0. Among other issues, we've noticed that OpenSearch is now consuming far more memory than the previous version, i.e. it became unusable with the same configuration, even after providing it with 15% more RAM. To make it responsive again, we had to close many indices.

Related component

Other

To Reproduce

  1. Have a 2.9 cluster of 4 data nodes with 112 GB heap (Xmx) and 13.6 TB of storage
  2. Fill it with 5500 indices (mostly small ones with 1 shard, but several big ones with 4 shards) up to 75% of capacity
  3. Upgrade 2.9 to 2.12 and add RAM to make it 128 GB
  4. See many GC messages in the logs and an almost inoperable cluster
  5. Close 2000 indices to make it work again

Expected behavior

We didn't expect a significant memory usage increase from the version upgrade.

Additional Details

Plugins: Security plugin for SAML authn and authz

Screenshots: Note the almost flat heap usage before the upgrade, the increase after the upgrade, and flat again after closing some indices. [screenshot: heap usage graph]


shwetathareja commented 3 months ago

Thanks @rlevytskyi for reporting the issue. Did you try taking a heap dump? It would help us debug further. (You can try with a smaller heap; the issue might reproduce faster in that case.)

Couple of questions:

  1. Are you running a cluster without dedicated cluster manager nodes?
  2. What is the cluster state size? You can check via the _cluster/state API output (see the command sketch below).
  3. How many shards are there overall?
  4. When you observed the JVM heap spiking, was it only during the upgrade from 2.9 to 2.12, or was it consistently high post-upgrade as well?
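
For reference, a quick way to gather these numbers (a rough sketch; the `localhost:9200` endpoint and unauthenticated HTTP access are assumptions, so adjust host, port, and auth for your cluster):

```sh
# approximate cluster state size in bytes (question 2)
curl -s localhost:9200/_cluster/state | wc -c

# total shard count across the cluster (question 3)
curl -s localhost:9200/_cat/shards | wc -l

# node roles and heap usage, to confirm whether cluster manager nodes are dedicated (question 1)
curl -s "localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent"
```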
rlevytskyi commented 3 months ago

Thank you @shwetathareja for your reply! Here are clarifications:

  1. Yes, we are running a non-dedicated cluster manager setup; we have four hosts running both data and master-eligible nodes, plus two coordinating nodes:

```
% curl logs:9200/_cat/nodes?s=name
d data - v480-data.company.com
m master - v480-master.company.com
d data - v481-data.company.com
m master * v481-master.company.com
d data - v482-data.company.com
m master - v482-master.company.com
d data - v483-data.company.com
m master - v483-master.company.com
- - - v484-coordinator.company.com
- - - v485-coordinator.company.com
```

  2. Quite a lot of output, about 193 MB of cluster state:

```
% curl logs:9200/_cluster/state | wc
       0    4989 193053405
```

  3. 26696, as reported by _cat/shards.
  4. It was hitting the top during the upgrade and also post-upgrade.
rlevytskyi commented 3 months ago

Re the heap dump, where should we collect it and when? Right now, we see nothing unusual on the data nodes. The coordinating nodes sometimes log something like:

```
[INFO ][o.o.i.b.HierarchyCircuitBreakerService] [v484-coordinator.company.com] attempting to trigger G1GC due to high heap usage [8204216264]
[INFO ][o.o.i.b.HierarchyCircuitBreakerService] [v484-coordinator.company.com] GC did bring memory usage down, before [8204216264], after [3248648136], allocations [71], duration [62]
```

but would a heap dump from them be useful?
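
One way to pick the node and the moment would be to watch per-node heap and dump whichever node stays close to its limit (a sketch; the endpoint and auth are assumptions):

```sh
# poll heap usage per node every 30 seconds, highest usage first; take the dump on the
# node whose heap.percent stays high even after GC messages like the ones above
watch -n 30 'curl -s "localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,heap.current,heap.max&s=heap.percent:desc"'
```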

reta commented 3 months ago

@rlevytskyi one of the major changes in 2.12 is that it is bundled with JDK 21 by default. Any chance you could downgrade the JDK to 17 for your deployment (it may need altering the Docker image) to eliminate the JDK version change as a suspect? Thank you.

rlevytskyi commented 3 months ago

Thank you Andriy for your reply. I've searched https://github.com/opensearch-project/OpenSearch and was unable to find an appropriate Dockerfile. Could you please point me to the right one?

reta commented 3 months ago

I think you need these: https://github.com/opensearch-project/opensearch-build/tree/main/docker/release/dockerfiles, but maybe a simpler way is to "inherit" from the 2.12 image and install/replace the JDK version to run with.

peternied commented 3 months ago

[Triage - attendees 1 2 3 4 5] @rlevytskyi Thanks for filing - we will keep this issue untriaged for 1 week and if it does not have a root cause we will close the issue.

The following were some recent investigations in the security plugin for your consideration.

rlevytskyi commented 3 months ago

I am unable to build an OpenSearch image yet. Moreover, the Dockerfile (https://github.com/opensearch-project/opensearch-build/blob/main/docker/release/dockerfiles/opensearch.al2.dockerfile) says:

> This dockerfile generates an AmazonLinux-based image containing an OpenSearch installation (1.x Only). Dockerfile for building an OpenSearch image. It assumes that the working directory contains these files: an OpenSearch tarball (opensearch.tgz), log4j2.properties, opensearch.yml, opensearch-docker-entrypoint.sh, opensearch-onetime-setup.sh.

First of all, it says "1.x Only". Second, it says that I have to put some files there, but I see no way to make sure I use exactly the same files you use.

So my question is: is there a way to build an image exactly like yours, to make sure we have the same configuration?

peternied commented 3 months ago

@rlevytskyi I believe the new file is right next to that dockerfile. Take a look at the README.md; maybe that will help if you are looking to construct a Docker image from a custom configuration.

Note: following "inherit from the 2.12 image and install/replace the JDK version to run with" seems easier, IMO.

peternied commented 3 months ago

@rlevytskyi I'm not sure if you've managed to capture and investigate a heap dump of the OpenSearch process; see this guide to capture that information in a Docker environment [1]. This will steer the investigation towards what is causing memory to be consumed. Heap dumps can also be used to compare 2.9 vs 2.12 and spot the difference.
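
In case it helps, a minimal sketch of capturing a dump from a containerized node (the container name `opensearch-node1` and the bundled-JDK path `/usr/share/opensearch/jdk` are assumptions; adjust for your deployment):

```sh
# find the OpenSearch JVM pid inside the container using the bundled JDK's jps
PID=$(docker exec opensearch-node1 /usr/share/opensearch/jdk/bin/jps | awk '/OpenSearch/ {print $1}')

# write a heap dump of live objects to a path writable by the opensearch user
docker exec opensearch-node1 /usr/share/opensearch/jdk/bin/jmap \
  -dump:live,format=b,file=/tmp/opensearch-heap.hprof "$PID"

# copy the dump out of the container for offline analysis (e.g. Eclipse MAT)
docker cp opensearch-node1:/tmp/opensearch-heap.hprof .
```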

rlevytskyi commented 3 months ago

Thank you Peter. However, I am neither a Java programmer nor a Docker enthusiast, and "inherit from the 2.12 image and install/replace the JDK version to run with" isn't entirely clear to me. As far as I understand, it could be achieved by changing the ENTRYPOINT to /bin/bash, starting a container, installing a new Java inside, setting JAVA_HOME and running OpenSearch. However, you need to rebuild the image to change the ENTRYPOINT, so we end up going in circles...

rlevytskyi commented 3 months ago

Re the heap dump, I managed to capture and even sanitize one using PayPal's tool https://github.com/paypal/heap-dump-tool. However, it's not feasible to get one right now because the cluster is running smoothly at the moment.

rlevytskyi commented 3 months ago

Thank you again @peternied for pointing out https://github.com/opensearch-project/opensearch-build/blob/main/docker/release/README.md. I managed to build the 2.12 image with the JDK 17 from 2.11.1. Have a nice weekend!

rlevytskyi commented 2 months ago

I managed to create an image based on 2.12 using the following Dockerfile:

```dockerfile
FROM opensearchproject/opensearch:2.12.0
USER root
RUN dnf install -y java-17-amazon-corretto
USER opensearch
ENV JAVA_HOME=/usr
```

Running it in a test installation doesn't reveal any memory usage difference. Looking forward to running a big (prod) installation with it. Do you think it is safe?
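
For anyone who wants to try the same, roughly how to build it and confirm which JDK the nodes end up running (the image tag, container name, and password below are placeholders):

```sh
# build the image from the Dockerfile above
docker build -t opensearch:2.12.0-jdk17 .

# single-node test run (2.12 images require an initial admin password)
docker run -d --name os-jdk17-test -p 9200:9200 \
  -e discovery.type=single-node \
  -e OPENSEARCH_INITIAL_ADMIN_PASSWORD='<strong-password>' \
  opensearch:2.12.0-jdk17

# confirm the substituted JDK is the one actually in use
curl -sk -u admin:'<strong-password>' \
  "https://localhost:9200/_nodes/jvm?filter_path=nodes.*.jvm.version,nodes.*.jvm.vm_vendor"
```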

peternied commented 2 months ago

[Triage - attendees 1 2 3 4 5]

> Do you think it is safe?

@rlevytskyi Without a root cause and bugfix it is hard to qualify what next steps to take. I would recommend doing testing and having a mitigation plan if something happens, but your mileage may vary.

> Thanks for filing - we will keep this issue untriaged for 1 week and if it does not have a root cause we will close the issue.

Since it has been a week and there is no root cause, we are closing out this issue. Feel free to open a new issue if you find a proximal cause from a heap analysis or a way to reproduce the leak.

tophercullen commented 4 weeks ago

Want to chime in and say we ran into something similar after upgrading to 2.12. Suddenly all sorts of previously normal operations were tripping the parent circuit breaker, and there were significantly more GC logs emitted by OpenSearch overall. The problem was most exacerbated by the snapshot and reindex APIs.
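
For anyone checking for the same symptom, the parent breaker trip counts show up in node stats (a sketch; the endpoint and auth are assumptions):

```sh
# per-node parent circuit breaker stats: estimated usage, limit, and "tripped" counts
curl -s "localhost:9200/_nodes/stats/breaker?filter_path=nodes.*.name,nodes.*.breakers.parent"
```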

I applied the image changes from @rlevytskyi to use JDK17 and it has completely solved the issues and symptoms we were seeing. Average heap dropped considerably and is much more stable.

dblock commented 3 weeks ago

Sounds like upgrading to JDK 21 is the change that caused this. Seems like a real problem. I am going to reopen this and edit the title to say something to this effect. @tophercullen do you think you can help us debug what's going on? There are a few suggestions above to take some heap dumps and compare.

tophercullen commented 3 weeks ago

Using the PayPal tool mentioned above to sanitize them, I've generated heap dumps from all nodes in a new standalone cluster (nothing else using it) while taking a full cluster snapshot, once on JDK 17 and twice on JDK 21. This is 24 files and ~5 GB compressed. I'm unsure what I'm supposed to be comparing between them.

From the stdout logging for the cluster, there were no GC logs with JDK 17, and a bunch with JDK 21. So it seems to be repeatable on an otherwise idle cluster, assuming that is not just a red herring.

You might also consider the reproducer in #12694. That seems fairly similar to our real use case and to the operations where we were seeing circuit breakers trip. Snapshots never directly tripped breakers and/or failed; they seemingly just exacerbated the problem.

dblock commented 3 weeks ago

Maybe @backslasht has some ideas about what to do with this next?

reta commented 3 weeks ago

> Using the PayPal tool mentioned above to sanitize them, I've generated heap dumps from all nodes in a new standalone cluster (nothing else using it) while taking a full cluster snapshot, once on JDK 17 and twice on JDK 21. This is 24 files and ~5 GB compressed. I'm unsure what I'm supposed to be comparing between them.

Maybe sharing a class histogram first could help (even as a screenshot), thanks @tophercullen.
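
A class histogram can be pulled straight from a running node without shipping the full dumps around (a sketch; the bundled-JDK path and `$PID` are assumptions, and Eclipse MAT can produce an equivalent histogram from the .hprof files you already have):

```sh
# top classes by instance count and shallow size on the live process; run once per
# JDK 17 node and once per JDK 21 node, then diff the outputs
/usr/share/opensearch/jdk/bin/jmap -histo:live "$PID" | head -n 40
```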

dblock commented 3 weeks ago

https://github.com/opensearch-project/OpenSearch/issues/12694 could be related

ansjcy commented 3 weeks ago

This might be related to this issue in the JDK: https://bugs.openjdk.org/browse/JDK-8297639. The G1UsePreventiveGC flag was introduced and set to true by default in JDK 17 (introduced in this commit, renamed in this commit). The related issue is https://bugs.openjdk.org/browse/JDK-8257774. It was introduced to solve

> ...bursts of short lived humongous object allocations. These bursts quickly consume all of the G1ReservePercent regions and then the rest of the free regions

In JDK 20, this flag was set to false by default, and in JDK 21 it was completely removed in https://bugs.openjdk.org/browse/JDK-8293861.

Summarizing the observations and reproduction efforts by the community around this JDK issue: removing this flag might have caused the memory increase when sending and receiving documents with chunks > 2 MB. On JDK 20 we can add the G1UsePreventiveGC flag back to bypass this issue, but on JDK 21 it is not an option anymore :( We either need to go back to JDK 20 with that flag enabled, or we need to explore other possible ways to fix this.
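
For what it's worth, you can check whether a given JDK build still recognizes the flag (and what its default is) before deciding on a path; this is a generic JVM check, not an OpenSearch-specific setting:

```sh
# lists the flag and its default on JDK versions that still have it;
# prints nothing on JDK 21, where the flag was removed
java -XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions \
     -XX:+PrintFlagsFinal -version | grep -i G1UsePreventiveGC
```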

reta commented 3 weeks ago

@ansjcy that was suggested before (I think on the forum) but we did not use -XX:+G1UsePreventiveGC (AFAIK)

dblock commented 3 weeks ago

@rlevytskyi @tophercullen Do you still have your repro? Could you try with JDK 21 and -XX:+G1UsePreventiveGC, please?

tophercullen commented 3 weeks ago

@dblock I can do what I did before: create a new cluster and populate it with data, run snapshots.

However, based on what @ansjcy provided, that option is no longer available in JDK 21. The OpenJDK issue tracker links to a similar issue with Elasticsearch in this regard, which also has no solution on JDK 21.

dblock commented 3 weeks ago

> However, based on what @ansjcy provided, that option is no longer available in JDK 21.

Yes, my bad for not reading carefully enough.

ansjcy commented 3 weeks ago

> but we did not use -XX:+G1UsePreventiveGC

No, but if I'm understanding correctly, this flag was enabled by default in g1_globals.hpp for G1GC in JDK 17.

Also, today I did some more experiments using https://github.com/kroepke/opensearch-jdk21-memory (thanks, @kroepke!). I ran bulk indexing (a 20 MB payload per request, ~5 MB per document) with a Docker-based setup, each run for 1 hour, in the following scenarios:

I captured the JVM usage results from the 1-hour runs:

[screenshot: JVM heap usage during the 1-hour runs]

The results show some, but not significant, impact from disabling the G1UsePreventiveGC flag on JDK 17, but there might be some unknown factors impacting the JVM usage on JDK 21 as well. We need to run even longer and heavier benchmark tests to understand this better.
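
For anyone who wants to approximate this workload without the full benchmark repo, here is a rough sketch of the kind of bulk request described above (~5 MB documents, ~20 MB per request); the index name `bulk-test` is arbitrary and the endpoint/auth are assumptions:

```sh
# build one ~20 MB _bulk body out of four ~5 MB random documents
DOC=$(head -c 3750000 /dev/urandom | base64 -w 0)   # ~5 MB after base64 expansion (GNU base64)
: > bulk.ndjson
for i in 1 2 3 4; do
  printf '{"index":{"_index":"bulk-test"}}\n{"blob":"%s"}\n' "$DOC" >> bulk.ndjson
done

# send it in a loop and watch heap usage / GC logs on the data nodes
while true; do
  curl -s -H 'Content-Type: application/x-ndjson' \
       -XPOST localhost:9200/_bulk --data-binary @bulk.ndjson > /dev/null
done
```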

backslasht commented 2 weeks ago

@ansjcy - Do you think G1UsePreventiveGC is the root cause, or is it something else?

@tophercullen - Can you please share the heap dumps?

@dblock - Is there a common share location where these heap dumps can be uploaded?

dblock commented 1 week ago

> @dblock - Is there a common share location where these heap dumps can be uploaded?

AFAIK no, we don't have a place to host outputs from individual runs. I would just make an S3 bucket and give access to the folks in this thread offline if they don't have a place to put these.
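
Something like the following would work for ad-hoc sharing (the bucket name is a placeholder; presigned URLs avoid having to manage IAM access for everyone in the thread):

```sh
# upload a compressed, sanitized dump and hand out a time-limited download link
aws s3 mb s3://opensearch-12454-heapdumps
aws s3 cp node1-jdk21.hprof.gz s3://opensearch-12454-heapdumps/
aws s3 presign s3://opensearch-12454-heapdumps/node1-jdk21.hprof.gz --expires-in 604800  # 7 days
```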