quarkusio / quarkus

Quarkus: Supersonic Subatomic Java.
https://quarkus.io
Apache License 2.0

Memory leak on TestContainers IT tests #41156

Open brunobat opened 4 months ago

brunobat commented 4 months ago

Describe the bug

When trying to run 10 IT tests with TestContainers in this PR: https://github.com/quarkusio/quarkus/pull/39032, the CI build fails with an OOM exception on all JDKs. Example: https://github.com/quarkusio/quarkus/actions/runs/9283986768/job/25545787368#step:16:4086

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  06:03 min
[INFO] Finished at: 2024-05-29T10:10:21Z
[INFO] ------------------------------------------------------------------------
[INFO] 327 goals, 235 executed, 92 from cache, saving at least 3m 27s
Error:  Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.2.5:test (default-test) on project quarkus-opentelemetry-deployment: 
Error:  
Error:  Please refer to /home/runner/work/quarkus/quarkus/extensions/opentelemetry/deployment/target/surefire-reports for the individual test results.
Error:  Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
Error: [ERROR] There was an error in the forked process
Error:  Java heap space
Error:  java.lang.OutOfMemoryError: Java heap space
Error:      at jdk.zipfs/jdk.nio.zipfs.ZipFileSystem.initCEN(ZipFileSystem.java:1552)
Error:      at jdk.zipfs/jdk.nio.zipfs.ZipFileSystem.<init>(ZipFileSystem.java:179)
Error:      at jdk.zipfs/jdk.nio.zipfs.ZipFileSystemProvider.getZipFileSystem(ZipFileSystemProvider.java:125)
Error:      at jdk.zipfs/jdk.nio.zipfs.ZipFileSystemProvider.newFileSystem(ZipFileSystemProvider.java:120)
Error:      at io.quarkus.fs.util.ZipUtils.newFileSystem(ZipUtils.java:228)
Error:      at io.quarkus.fs.util.ZipUtils.newFileSystem(ZipUtils.java:210)
Error:      at io.quarkus.paths.ArchivePathTree.openFs(ArchivePathTree.java:179)
Error:      at io.quarkus.paths.ArchivePathTree.apply(ArchivePathTree.java:117)
Error:      at io.quarkus.paths.PathTreeWithManifest.apply(PathTreeWithManifest.java:74)
Error:      at io.quarkus.bootstrap.resolver.maven.ApplicationDependencyTreeResolver.getExtensionInfoOrNull(ApplicationDependencyTreeResolver.java:566)
Error:      at io.quarkus.bootstrap.resolver.maven.ApplicationDependencyTreeResolver.getExtensionDependencyOrNull(ApplicationDependencyTreeResolver.java:489)
Error:      at io.quarkus.bootstrap.resolver.maven.ApplicationDependencyTreeResolver.visitRuntimeDependency(ApplicationDependencyTreeResolver.java:426)
Error:      at io.quarkus.bootstrap.resolver.maven.ApplicationDependencyTreeResolver.visitRuntimeDependencies(ApplicationDependencyTreeResolver.java:404)
Error:      at io.quarkus.bootstrap.resolver.maven.ApplicationDependencyTreeResolver.visitRuntimeDependency(ApplicationDependencyTreeResolver.java:469)
Error:      at io.quarkus.bootstrap.resolver.maven.ApplicationDependencyTreeResolver.visitRuntimeDependencies(ApplicationDependencyTreeResolver.java:404)
Error:      at io.quarkus.bootstrap.resolver.maven.ApplicationDependencyTreeResolver.visitRuntimeDependency(ApplicationDependencyTreeResolver.java:469)
Error:      at io.quarkus.bootstrap.resolver.maven.ApplicationDependencyTreeResolver.visitRuntimeDependencies(ApplicationDependencyTreeResolver.java:404)
Error:      at io.quarkus.bootstrap.resolver.maven.ApplicationDependencyTreeResolver.resolve(ApplicationDependencyTreeResolver.java:172)
Error:      at io.quarkus.bootstrap.resolver.BootstrapAppModelResolver.buildAppModel(BootstrapAppModelResolver.java:382)
Error:      at io.quarkus.bootstrap.resolver.BootstrapAppModelResolver.doResolveModel(BootstrapAppModelResolver.java:319)
Error:      at io.quarkus.bootstrap.resolver.BootstrapAppModelResolver.resolveManagedModel(BootstrapAppModelResolver.java:200)
Error:      at io.quarkus.bootstrap.BootstrapAppModelFactory.resolveAppModel(BootstrapAppModelFactory.java:294)
Error:      at io.quarkus.bootstrap.app.QuarkusBootstrap.bootstrap(QuarkusBootstrap.java:133)
Error:      at io.quarkus.test.QuarkusUnitTest.beforeAll(QuarkusUnitTest.java:651)
Error:      at org.junit.jupiter.engine.descriptor.ClassBasedTestDescriptor.lambda$invokeBeforeAllCallbacks$12(ClassBasedTestDescriptor.java:396)
Error:      at org.junit.jupiter.engine.descriptor.ClassBasedTestDescriptor$$Lambda$449/0x00007f367813c4d0.execute(Unknown Source)
Error:      at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
Error:      at org.junit.jupiter.engine.descriptor.ClassBasedTestDescriptor.invokeBeforeAllCallbacks(ClassBasedTestDescriptor.java:396)
Error:      at org.junit.jupiter.engine.descriptor.ClassBasedTestDescriptor.before(ClassBasedTestDescriptor.java:212)
Error:      at org.junit.jupiter.engine.descriptor.ClassBasedTestDescriptor.before(ClassBasedTestDescriptor.java:85)
Error:      at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:148)
Error:      at org.junit.platform.engine.support.hierarchical.NodeTestTask$$Lambda$367/0x00007f3678118dc0.execute(Unknown Source)
Error:  
Error:  org.apache.maven.surefire.booter.SurefireBooterForkException: There was an error in the forked process

Local execution reveals a memory leak:

Screenshot 2024-06-12 at 11 13 20

Initial analysis pointed to the new feature being added; however, that was just the tipping point. The bulk of the memory being used is related to the QuarkusTestExtension class or to TestContainers.

Screenshot 2024-06-12 at 11 11 19

When running all the tests, the containers are not immediately shut down. They linger in the system until the last test finishes, according to what I see in Podman Desktop.

I profiled the test execution and I'm attaching the JFR stream file here:

GrpcNoTLSNoCompressionTest_and_9_more_2024_06_12_102823.jfr.zip

Further analysis reveals this on QuarkusTestExtension:

Screenshot 2024-06-12 at 11 19 29

Expected behavior

No out-of-memory error when running multiple IT tests with TestContainers

Actual behavior

CI fails with an out-of-memory error

How to Reproduce?

  1. Run all these tests in IntelliJ IDEA in profiler mode: https://github.com/quarkusio/quarkus/blob/fb79c7c22ee746c904a95199390e295865024f73/integration-tests/opentelemetry-vertx-exporter/src/test/java/io/quarkus/it/opentelemetry/vertx/exporter

Output of uname -a or ver

Darwin xxxxxxxxxxx 22.6.0 Darwin Kernel Version 22.6.0: Mon Apr 22 20:49:37 PDT 2024; root:xnu-8796.141.3.705.2~1/RELEASE_ARM64_T6000 arm64

Output of java -version

openjdk version "21.0.2" 2024-01-16 LTS
OpenJDK Runtime Environment Temurin-21.0.2+13 (build 21.0.2+13-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.2+13 (build 21.0.2+13-LTS, mixed mode)

Quarkus version or git rev

git version 2.38.1

Build tool (ie. output of mvnw --version or gradlew --version)

Apache Maven 3.9.6 (bc0240f3c744dd6b6ec2920b3cd08dcc295161ae)
Maven home: /Users/xxxxxxx/.sdkman/candidates/maven/current
Java version: 21.0.2, vendor: Eclipse Adoptium, runtime: /Users/xxxxxxx/.sdkman/candidates/java/21.0.2-tem
Default locale: en_PT, platform encoding: UTF-8
OS name: "mac os x", version: "13.6.7", arch: "aarch64", family: "mac"

Additional information

Running Podman Desktop Version 1.10.3 (1.10.3)

See companion discussion on Zulip: https://quarkusio.zulipchat.com/#narrow/stream/187038-dev/topic/TestContainer.20IT.20tests.20out.20of.20memory

quarkus-bot[bot] commented 4 months ago

/cc @geoand (testing)

brunobat commented 4 months ago

Running the tests without GrpcWithTLSWithTrustAllWithCompressionTest and HttpWithTLSWithTrustAllWithCompressionTest leads to fewer allocations, proportional to the smaller number of tests being run:

Screenshot 2024-06-12 at 12 33 57

The profiling file:

GrpcNoTLSWithCompressionTest_and_7_more_2024_06_12_123021.jfr.zip

Also of note: in addition to the OpenTelemetry Collector container, TestContainers will launch 2 other types of containers per test:

These containers are not cleaned up right away. They only go away after all the TestContainers tests are finished:

Screenshot 2024-06-12 at 12 32 27
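
Not a statement about how these particular tests are wired, just a minimal sketch of explicit cleanup with a QuarkusTestResourceLifecycleManager: the container image, port, and property key below are assumptions, but stopping the container in stop() releases it as soon as the resource is torn down instead of leaving it to Ryuk at the end of the run.

import java.util.Map;

import org.testcontainers.containers.GenericContainer;
import org.testcontainers.utility.DockerImageName;

import io.quarkus.test.common.QuarkusTestResourceLifecycleManager;

// Hypothetical resource manager: starts a collector container and stops it
// explicitly in stop(), instead of leaving cleanup to Ryuk at JVM exit.
public class CollectorResource implements QuarkusTestResourceLifecycleManager {

    // image name and port are assumptions, not necessarily what these tests use
    private final GenericContainer<?> collector =
            new GenericContainer<>(DockerImageName.parse("otel/opentelemetry-collector:latest"))
                    .withExposedPorts(4317);

    @Override
    public Map<String, String> start() {
        collector.start();
        // property key is illustrative only
        return Map.of("quarkus.otel.exporter.otlp.endpoint",
                "http://" + collector.getHost() + ":" + collector.getMappedPort(4317));
    }

    @Override
    public void stop() {
        // explicit stop releases the container as soon as the resource is torn down
        collector.stop();
    }
}
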
brunobat commented 4 months ago

Also, on the configuration without GrpcWithTLSWithTrustAllWithCompressionTest and HttpWithTLSWithTrustAllWithCompressionTest, I got this memory dump while executing the 7th test:

https://drive.google.com/file/d/13hvrMHEYO4KWX13zDd5xV7BkGTqKxDNf/view?usp=drive_link

(The file is too big for GitHub)

geoand commented 4 months ago

https://github.com/quarkusio/quarkus/compare/main...geoand:%2341156-test?expand=1 is worth a test

brunobat commented 4 months ago

Configuration without GrpcWithTLSWithTrustAllWithCompressionTest and HttpWithTLSWithTrustAllWithCompressionTest, with export TESTCONTAINERS_RYUK_DISABLED=true:

[Uploading GrpcNoTLSWithCompressionTest_and_7_more_2024_06_12_143605.jfr.zip…]()

gsmet commented 4 months ago

@brunobat I think we will need the heap dumps for the tests, not the JFR file.

holly-cummins commented 4 months ago

I've also noticed recent memory increases in dev mode tests, which tipped the pact ecosystem CI tests over the edge on June 5th, and got worse again around June 10th. I have https://github.com/quarkusio/quarkus/issues/38814 open for a general frog-boiling in the memory requirements of dev/test mode.

gsmet commented 4 months ago

I'm slowly making progress here on a patch that might improve the situation but things are a bit messy: https://github.com/quarkusio/quarkus/pull/41172

Also see https://github.com/quarkusio/quarkus/issues/41233 .

holly-cummins commented 4 months ago

If you want a hungry test to play with ...

git clone git@github.com:quarkiverse/quarkus-pact.git
cd quarkus-pact
mvn install 
cd cross-extension-integration-tests/target/test-classes/projects/happy-everyone-all-together-processed  
<edit the pom.xml to use a 999-SNAPSHOT quarkus>
mvn compile quarkus:dev -Dquarkus.analytics.disabled=true -Dquarkus.http.test-port=8083 -Ddebug=false -Djvm.args='-Xmx140m'

That 140m setting passed on 3.11, but fails in 3.12 (you won't see an OOM in most cases, just test failures, but I'm pretty sure it's OOMing under the covers somewhere). In 3.6, 120m would work. With quarkus:test, 105m will pass and 95m will fail.

gsmet commented 4 months ago

It would be interesting to get a heap dump in both cases.

holly-cummins commented 4 months ago

It would be interesting to get a heap dump in both cases.

The heap dump thing is kind of interesting and complicated, IMO. There's another comment on it that I put somewhere ... A single heap dump isn't going to be useful, at all (and I know that's not what you're suggesting). So we'd need two heap dumps, from (say) Quarkus 3.6 and 3.7, or Quarkus 3.11 and 3.12.

Then we can compare them. But let's assume the problem isn't a memory leak, but just that ... we do more, so we have more allocations. So the question we want the dumps to answer isn't "which objects are leaking," but rather "which objects do I have now that I didn't have before?" or "what objects do I have more of now?" ... but there's then another challenge, which is "how do I get dumps at the exact same stage in the good/bad case?" and "what stage should I be getting dumps at?" The simplest thing to do is to set the heap to an identical, constrained value in the good and bad cases (because different heap sizes would seem like lunacy) and then wait for the OOM dump to be generated. But because the bad case is hungrier, it will OOM earlier than the good case, so it will be at a different stage in the Quarkus lifecycle, and there will be a ton of 'noise' in the delta between the two dumps. The objects on the heap will be different in the two cases, because the app is doing different things when the dump happens.

To avoid that issue you could start the applications with an unconstrained heap and let them fully finish starting, and then trigger a dump. In that case, you're avoiding the problem where the two apps are at different stages, but you have a new problem, which is that memory usage peaks during test execution. So you're not measuring 'at high tide', you're measuring after things have settled down. The offending objects might have already been freed and removed from the heap.
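
One way to sidestep the "same stage" problem, whatever heap size is used, is to trigger the dump programmatically at a well-defined point (say, right after startup, or between two specific tests) in both the good and the bad run. A minimal sketch using the HotSpot diagnostic MXBean (the file name and call site are arbitrary):

import java.io.IOException;
import java.lang.management.ManagementFactory;

import com.sun.management.HotSpotDiagnosticMXBean;

public final class HeapDumper {

    // Dumps the heap to the given file; live=true keeps only reachable objects,
    // which removes garbage noise from the comparison.
    public static void dump(String path, boolean live) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(path, live);
    }

    public static void main(String[] args) throws Exception {
        dump("after-startup-3.11.hprof", true); // hypothetical file name
    }
}

The same dump can also be taken from outside the process with jcmd <pid> GC.heap_dump <file>; either way, both versions end up with dumps from the same lifecycle stage, which keeps the MAT comparison less noisy.
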

And there's another question - what problem are we trying to solve? Maybe we did increase memory requirements, and we get lucky and pinpoint the exact classes responsible. But they're just the straw that broke the camel's back - maybe those new classes are doing something really useful. If so, we won't do anything with them after identifying them - we'd need to reduce memory requirements by finding something else in the heap which is waste that we can optimise away.

Or maybe the problem is a leak, and then we wouldn't want to compare 3.6 and 3.7, but 3.7 mid-test and 3.7 at the end of tests.

All of which is a long way of saying that I did gather various dumps when I was investigating the 3.6->3.7 increase, and after looking in MAT for a while, nothing was jumping out. I concluded that the increase in memory requirements wasn't a big jump, just an increase that tipped these tests over the threshold into an OOM, but investigating that would have needed decisions about what problem was being solved, and a different set of dumps.

gsmet commented 4 months ago

@holly-cummins In any case, I would be interested in having a dump for each version :). I agree we might not find anything, but there's something odd.

I wonder if it could be due to the new version of Netty because, in my branch, which logs something every time we try to access a closed CL, I see things like:

2024-06-17 19:42:59,363 ERROR [io.qua.boo.cla.Qua.lifecycle] (Finalizer) Class loader QuarkusClassLoader[type=BASE_RUNTIME, name=Quarkus Base Runtime ClassLoader: TEST for GrpcOpenInstrumentationDisabledTest (QuarkusUnitTest), id=285fbc53] has been closed and may not be accessed anymore: java.lang.RuntimeException: Created to log a stacktrace
    at io.quarkus.bootstrap.classloading.QuarkusClassLoader.ensureOpen(QuarkusClassLoader.java:727)
    at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:510)
    at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:505)
    at io.netty.util.internal.shaded.org.jctools.queues.ConcurrentCircularArrayQueue.isEmpty(ConcurrentCircularArrayQueue.java:71)
    at io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue.isEmpty(MpscArrayQueue.java:207)
    at io.netty.buffer.PoolThreadCache.checkCacheMayLeak(PoolThreadCache.java:236)
    at io.netty.buffer.PoolThreadCache.free(PoolThreadCache.java:227)
    at io.netty.buffer.PoolThreadCache$FreeOnFinalize.finalize(PoolThreadCache.java:500)
    at java.base/java.lang.System$2.invokeFinalize(System.java:2297)
    at java.base/java.lang.ref.Finalizer.runFinalizer(Finalizer.java:88)
    at java.base/java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:173)

that I wasn't seeing before.

So we have some ConcurrentCircularArrayQueue instances that are surviving past the lifetime of their class loader (not for long, as they apparently check that there are no leaks, but what is odd is that I see some of them trying to load classes from the bootstrap CL).
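
For readers following along, a stripped-down illustration of the kind of guard described above (not the actual QuarkusClassLoader code): the loader flags itself as closed and logs the caller's stack trace when it is still asked to load a class, which is how the finalizer-driven Netty loads in the trace above become visible.

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustration only: a class loader that logs where late accesses come from
// once it has been closed.
public class GuardedClassLoader extends ClassLoader implements AutoCloseable {

    private static final Logger LOG = Logger.getLogger(GuardedClassLoader.class.getName());

    private final AtomicBoolean closed = new AtomicBoolean();

    private void ensureOpen() {
        if (closed.get()) {
            // the exception exists only to capture the caller's stack trace
            LOG.log(Level.SEVERE,
                    "Class loader " + this + " has been closed and may not be accessed anymore",
                    new RuntimeException("Created to log a stacktrace"));
        }
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        ensureOpen();
        return super.loadClass(name, resolve);
    }

    @Override
    public void close() {
        closed.set(true);
    }
}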

franz1981 commented 4 months ago

I can help with this @gsmet: there are 2 problems here:

  1. Even though the finalizer doesn't have anything to do and the logger.isDebug at https://github.com/netty/netty/blob/53be385142f73b613e8729b709e2c1c296e8c63f/buffer/src/main/java/io/netty/buffer/PoolThreadCache.java#L237 is false, the finalizer still keeps the heap references of the caches alive
  2. I don't think it's a good idea to have the blocking thread pools use the pooled buffer caches - it will increase RSS dramatically (and heap usage as well); see #39062

I think both could be fixed on the Netty side, i.e.

  1. I can add a check on the debug logger level and clean up the cache references as soon as https://github.com/netty/netty/blob/53be385142f73b613e8729b709e2c1c296e8c63f/buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java#L543 kicks in, which should happen when the JBoss executor is shut down - this will make them eligible to be GCed without waiting for the finalizer pass, just a normal major GC (because they will likely be in the old gen...) - see the sketch after this list
  2. I can create a new system property that prevents non-event-loop threads which are still FastThreadLocalThreads from using thread-local caches - and we can use it on our side to reduce the footprint of the blocking thread pool
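
A tiny sketch of problem 1 and the proposed eager cleanup (plain Java, not Netty's actual code): anything reachable from an object with a finalizer stays on the heap until a finalizer pass runs, so clearing the expensive references as soon as the owner shuts down makes the bulk of the memory collectable by a normal GC.

// Illustration only: a large per-thread cache kept alive by a finalizer.
// Clearing the reference eagerly when the owner shuts down makes the memory
// reclaimable by a regular GC; the finalizer is then only a safety net.
final class ThreadBufferCache {

    private long[] cachedChunks = new long[64 * 1024]; // stand-in for the real buffer cache

    // called deterministically, e.g. when the owning executor/event loop is shut down
    void free() {
        cachedChunks = null; // the cache becomes a stale empty instance, but the memory is reclaimable
    }

    @SuppressWarnings({ "deprecation", "removal" })
    @Override
    protected void finalize() {
        free(); // ideally there is nothing left to release by the time this runs
    }
}
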
gsmet commented 4 months ago

@franz1981 so one problem with the finalizers is that they are actually loading classes and they are doing so from closed class loaders.

Ideally, it would be nice to get rid of them in the Quarkus case if we could. I can get you a stacktrace later if you need it.

franz1981 commented 4 months ago

Yeah, I think I can, regardless, expose a sys property to disable the finalizers and just rely on FastThreadLocal::onRemoval, which depends on the event loop threads being stopped instead. And that would help. I can clean up the costly fields as soon as the race to free them up is won, too, which would help defer the problem - because the pool cache will become a stale empty instance, at worst.
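
For reference, a rough sketch of the onRemoval-based variant using Netty's FastThreadLocal API (class name and sizes are made up): the cache is released when the thread local is removed, e.g. when a thread created by Netty's DefaultThreadFactory finishes and FastThreadLocal.removeAll() runs, so no finalizer is needed in the common case.

import io.netty.util.concurrent.FastThreadLocal;

// Sketch only: a per-thread cache released via onRemoval() when the thread's
// FastThreadLocal state is removed, instead of waiting for a finalizer.
public class PerThreadCacheHolder {

    private final FastThreadLocal<long[]> cache = new FastThreadLocal<>() {
        @Override
        protected long[] initialValue() {
            return new long[64 * 1024]; // stand-in for the real buffer cache
        }

        @Override
        protected void onRemoval(long[] value) {
            // free/clear the cache deterministically here; nothing left for a finalizer to do
        }
    };

    long[] currentThreadCache() {
        return cache.get();
    }
}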

franz1981 commented 4 months ago

Let me know if it is key for me to work on this so I can prioritize it, ok? @gsmet

ikovalyov commented 1 month ago

We are having this issue too

franz1981 commented 1 month ago

@ikovalyov how are things going? Did you open a separate issue or is it the same issue reported here?