brunobat opened 4 months ago
/cc @geoand (testing)
Running the tests without `GrpcWithTLSWithTrustAllWithCompressionTest` and `HttpWithTLSWithTrustAllWithCompressionTest` leads to fewer allocations, roughly proportional to the smaller number of tests being run:
The profiling file:
GrpcNoTLSWithCompressionTest_and_7_more_2024_06_12_123021.jfr.zip
Also of note: in addition to the OpenTelemetry Collector container, Testcontainers will launch 2 other types of containers per test:
These containers are not cleaned up right away. They only go away after all the Testcontainers tests have finished:
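For reference, a container that a test manages by hand disappears as soon as its test class finishes, instead of waiting for the end-of-run reaping. A minimal sketch, assuming JUnit 5 plus plain Testcontainers (the collector image and class name are illustrative, not what these tests actually use):

```java
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeAll;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.utility.DockerImageName;

class EagerCleanupTest {

    static final GenericContainer<?> COLLECTOR =
            new GenericContainer<>(DockerImageName.parse("otel/opentelemetry-collector:0.103.0"))
                    .withExposedPorts(4317);

    @BeforeAll
    static void startContainer() {
        COLLECTOR.start();
    }

    @AfterAll
    static void stopContainer() {
        // Explicit stop: without this, the container lingers until Ryuk
        // reaps it at the end of the whole run, as observed above.
        COLLECTOR.stop();
    }
}
```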
Also, on the configuration without `GrpcWithTLSWithTrustAllWithCompressionTest` and `HttpWithTLSWithTrustAllWithCompressionTest`, I got this memory dump while executing the 7th test:
https://drive.google.com/file/d/13hvrMHEYO4KWX13zDd5xV7BkGTqKxDNf/view?usp=drive_link
(The file is too big for GitHub.)
Configuration without `GrpcWithTLSWithTrustAllWithCompressionTest` and `HttpWithTLSWithTrustAllWithCompressionTest`, with `export TESTCONTAINERS_RYUK_DISABLED=true`:
GrpcNoTLSWithCompressionTest_and_7_more_2024_06_12_143605.jfr.zip
@brunobat I think we will need the heap dumps for the tests, not the JFR file.
I've also noticed recent memory increases in dev mode tests, which tipped the pact ecosystem CI tests over the edge on June 5th, and got worse again around June 10th. I have https://github.com/quarkusio/quarkus/issues/38814 open for a general frog-boiling in the memory requirements of dev/test mode.
I'm slowly making progress here on a patch that might improve the situation but things are a bit messy: https://github.com/quarkusio/quarkus/pull/41172
Also see https://github.com/quarkusio/quarkus/issues/41233 .
If you want a hungry test to play with ...
```
git clone git@github.com:quarkiverse/quarkus-pact.git
cd quarkus-pact
mvn install
cd cross-extension-integration-tests/target/test-classes/projects/happy-everyone-all-together-processed
# <edit the pom.xml to use a 999-SNAPSHOT quarkus>
mvn compile quarkus:dev -Dquarkus.analytics.disabled=true -Dquarkus.http.test-port=8083 -Ddebug=false -Djvm.args='-Xmx140m'
```
That 140m setting passed on 3.11, but fails in 3.12 (you won't see an OOM in most cases, just test failures, but I'm pretty sure it's OOMing under the covers somewhere). In 3.6, 120m would work. With `quarkus:test`, 105m will pass and 95m will fail.
It would be interesting to get a heap dump in both cases.
> It would be interesting to get a heap dump in both cases.
The heap dump thing is kind of interesting and complicated, IMO. There's another comment on it that I put somewhere ... A single heap dump isn't going to be useful, at all (and I know that's not what you're suggesting). So we'd need two heap dumps, from (say) Quarkus 3.6 and 3.7, or Quarkus 3.11 and 3.12.
Then we can compare them. But let's assume the problem isn't a memory leak, but just that ... we do more, so we have more allocations. So the question we want the dumps to answer isn't "which objects are leaking," but rather "which objects do I have now that I didn't have before?" or "what objects do I have more of now?"

But there's then another challenge, which is "how do I get dumps at the exact same stage in the good/bad case?" and "what stage should I be getting dumps at?" The simplest thing to do is to set the heap to an identical, constrained value in the good and bad cases (because different heap sizes would seem like lunacy) and then wait for the OOM dump to be generated. But because the bad case is hungrier, it will OOM earlier than the good case, so it will be at a different stage in the Quarkus lifecycle, and there will be a ton of 'noise' in the delta between the two dumps. The objects on the heap will be different in the two cases, because the app is doing different things when the dump happens.
To avoid that issue you could start the applications with an unconstrained heap and let them fully finish starting, and then trigger a dump. In that case, you're avoiding the problem where the two apps are at different stages, but you have a new problem, which is that memory usage peaks during test execution. So you're not measuring 'at high tide', you're measuring after things have settled down. The offending objects might have already been freed and removed from the heap.
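For reference, both capture strategies are mechanical to set up: the OOM-time dump comes from the standard HotSpot flags `-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<dir>`, and the settled, post-startup dump can be taken externally with `jcmd <pid> GC.heap_dump <file>` or programmatically. A minimal sketch of the programmatic route, using the HotSpot-specific `HotSpotDiagnosticMXBean` (the path and the point at which you call it are illustrative):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class SettledHeapDump {
    public static void dump() throws Exception {
        // Proxy to the diagnostic MXBean that HotSpot registers itself.
        HotSpotDiagnosticMXBean diagnostic = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // live=true forces a GC first and dumps only reachable objects,
        // which keeps unreachable garbage out of the good/bad delta.
        diagnostic.dumpHeap("/tmp/settled-" + System.currentTimeMillis() + ".hprof", true);
    }
}
```

Calling this at the same well-defined point in both runs (e.g., right after startup completes) is what makes the two dumps comparable.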
And there's another question - what problem are we trying to solve? Maybe we did increase memory requirements, and we get lucky and pinpoint the exact classes responsible. But they're just the straw that broke the camel's back - maybe those new classes are doing something really useful. If so, we won't do anything with them after identifying them - we'd need to reduce memory requirements by finding something else in the heap which is waste that we can optimise away.
Or maybe the problem is a leak, and then we wouldn't want to compare 3.6 and 3.7, but 3.7 mid-test and 3.7 at the end of tests.
All of which is a long way of saying that I did gather various dumps when I was investigating the 3.6->3.7 increase, and after looking in MAT for a while, nothing was jumping out. I concluded that the increase in memory requirements wasn't a big jump, just an increase that tipped these tests over the threshold into an OOM, but investigating that would have needed decisions about what problem was being solved, and a different set of dumps.
@holly-cummins In any case, I would be interested in having a dump for each version :). I agree we might not find anything, but there's something odd.
I wonder if it could be due to the new version of Netty because, in my branch, which logs something every time we try to access a closed CL, I see things like the following, which I wasn't seeing before:
```
2024-06-17 19:42:59,363 ERROR [io.qua.boo.cla.Qua.lifecycle] (Finalizer) Class loader QuarkusClassLoader[type=BASE_RUNTIME, name=Quarkus Base Runtime ClassLoader: TEST for GrpcOpenInstrumentationDisabledTest (QuarkusUnitTest), id=285fbc53] has been closed and may not be accessed anymore: java.lang.RuntimeException: Created to log a stacktrace
	at io.quarkus.bootstrap.classloading.QuarkusClassLoader.ensureOpen(QuarkusClassLoader.java:727)
	at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:510)
	at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:505)
	at io.netty.util.internal.shaded.org.jctools.queues.ConcurrentCircularArrayQueue.isEmpty(ConcurrentCircularArrayQueue.java:71)
	at io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue.isEmpty(MpscArrayQueue.java:207)
	at io.netty.buffer.PoolThreadCache.checkCacheMayLeak(PoolThreadCache.java:236)
	at io.netty.buffer.PoolThreadCache.free(PoolThreadCache.java:227)
	at io.netty.buffer.PoolThreadCache$FreeOnFinalize.finalize(PoolThreadCache.java:500)
	at java.base/java.lang.System$2.invokeFinalize(System.java:2297)
	at java.base/java.lang.ref.Finalizer.runFinalizer(Finalizer.java:88)
	at java.base/java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:173)
```
So we have some `ConcurrentCircularArrayQueue` instances that are surviving the lifetime of their class loader (not for long, as they apparently just check that there are no leaks, but what is odd is that I see some of them trying to load classes from the bootstrap CL).
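To make the mechanism concrete: if a finalizer is the first code path to touch some class, the JVM loads that class on the "Finalizer" thread via the defining loader of the finalizable object's class, and if that loader is a `QuarkusClassLoader` that was closed when the test finished, you get exactly the error above. A minimal, hypothetical illustration (not the actual Netty code; `LeakChecker` stands in for the jctools classes in the stack trace):

```java
// Hypothetical shape of the problem, for illustration only.
class CacheWithFinalize {
    @Override
    @SuppressWarnings({"deprecation", "removal"})
    protected void finalize() {
        // If LeakChecker hasn't been loaded yet, this line triggers
        // loadClass(...) on the class loader that defined CacheWithFinalize,
        // even if that loader has already been closed.
        LeakChecker.assertNoLeak(this);
    }
}

class LeakChecker {
    static void assertNoLeak(Object o) { /* e.g., check a queue is empty */ }
}
```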
I can help on this @gsmet: there are 2 problems here. I think both could be fixed on the Netty side, i.e. by allowing `FastThreadLocalThread` to NOT use thread-local caches - and we can use that on our side to reduce the footprint of the blocking thread pool.

@franz1981 so one problem with the finalizers is that they are actually loading classes, and they are doing so from closed class loaders.
Ideally, it would be nice to get rid of them in the Quarkus case if we could. I can get you a stacktrace later if you need it.
Yeah, I think I can, regardless, expose a sys property to disable the finalizers and just rely on `FastThreadLocal::onRemoval`, which depends on stopping the event loop threads instead. And that would help. I can clean up the costly fields as soon as the race to free them up is won, too, which would help to defer the problem - because the pool cache will become a stale empty instance, at worst.
Let me know if it is key for me to work on this so I can prioritize it, ok? @gsmet
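For context on the `FastThreadLocal::onRemoval` route @franz1981 mentions: that hook runs when the thread-local value is removed (in particular when a `FastThreadLocalThread` started via Netty's thread factory terminates), so cleanup happens deterministically instead of on the Finalizer thread. A minimal sketch of the pattern (the cache type is a stand-in, not Netty's actual field):

```java
import io.netty.util.concurrent.FastThreadLocal;
import java.util.ArrayDeque;

// Stand-in per-thread cache; cleanup happens in onRemoval() rather than
// in a finalize() method that may run after the class loader is closed.
final class ThreadCacheLocal extends FastThreadLocal<ArrayDeque<byte[]>> {
    @Override
    protected ArrayDeque<byte[]> initialValue() {
        return new ArrayDeque<>();
    }

    @Override
    protected void onRemoval(ArrayDeque<byte[]> cache) {
        cache.clear(); // deterministic cleanup while the thread is still alive
    }
}
```

If I remember right, Netty's pooled allocator can also be told not to give plain threads a cache at all via `-Dio.netty.allocator.useCacheForAllThreads=false`, which would shrink the number of caches exposed to that race in the first place.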
We are having this issue too.
@ikovalyov how are things going? Did you open a separate issue, or is it the same one reported here?
Describe the bug
When trying to run 10 IT tests with Testcontainers in this PR: https://github.com/quarkusio/quarkus/pull/39032, the CI build fails with an OOM exception on all JDKs. Example: https://github.com/quarkusio/quarkus/actions/runs/9283986768/job/25545787368#step:16:4086
Local execution reveals a memory leak:
Initial analysis pointed to the new feature being added; however, that was just a tipping point. The bulk of the memory being used is related to the `QuarkusTestExtension` class or to Testcontainers. When running all the tests, the containers are not immediately shut down. They linger in the system until the last test is finished, according to my view of Podman Desktop.
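If the lingering containers turn out to be the main cost, a standard mitigation is the Testcontainers singleton-container pattern: share one container across all test classes instead of launching new ones per test. A sketch under that assumption (image and class names are illustrative, not the ones this PR uses):

```java
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.utility.DockerImageName;

// Singleton-container pattern: the container starts once per JVM, on first
// use of this class, and is reclaimed by Ryuk when the JVM exits, instead
// of extra containers being launched and kept around per test class.
final class SharedCollector {

    static final GenericContainer<?> COLLECTOR =
            new GenericContainer<>(DockerImageName.parse("otel/opentelemetry-collector:0.103.0"))
                    .withExposedPorts(4317);

    static {
        COLLECTOR.start();
    }

    private SharedCollector() {
    }
}
```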
I profiled the test execution and I'm attaching the JFR stream file here:
GrpcNoTLSNoCompressionTest_and_9_more_2024_06_12_102823.jfr.zip
Further analysis reveals this on `QuarkusTestExtension`:

Expected behavior
No out of memory when running multiple IT tests with Testcontainers.
Actual behavior
CI fails with an out-of-memory error.
How to Reproduce?
Output of `uname -a` or `ver`

```
Darwin xxxxxxxxxxx 22.6.0 Darwin Kernel Version 22.6.0: Mon Apr 22 20:49:37 PDT 2024; root:xnu-8796.141.3.705.2~1/RELEASE_ARM64_T6000 arm64
```
Output of `java -version`

```
openjdk version "21.0.2" 2024-01-16 LTS
OpenJDK Runtime Environment Temurin-21.0.2+13 (build 21.0.2+13-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.2+13 (build 21.0.2+13-LTS, mixed mode)
```
Quarkus version or git rev

```
git version 2.38.1
```
Build tool (ie. output of `mvnw --version` or `gradlew --version`)

```
Apache Maven 3.9.6 (bc0240f3c744dd6b6ec2920b3cd08dcc295161ae)
Maven home: /Users/xxxxxxx/.sdkman/candidates/maven/current
Java version: 21.0.2, vendor: Eclipse Adoptium, runtime: /Users/xxxxxxx/.sdkman/candidates/java/21.0.2-tem
Default locale: en_PT, platform encoding: UTF-8
OS name: "mac os x", version: "13.6.7", arch: "aarch64", family: "mac"
```
Additional information
Running Podman Desktop Version 1.10.3 (1.10.3)
See companion discussion on Zulip: https://quarkusio.zulipchat.com/#narrow/stream/187038-dev/topic/TestContainer.20IT.20tests.20out.20of.20memory