quarkusio / quarkus

Quarkus: Supersonic Subatomic Java.
https://quarkus.io
Apache License 2.0
13.64k stars 2.64k forks source link

Netty finalizers load classes from closed class loaders #41607

Open gsmet opened 3 months ago

gsmet commented 3 months ago

@franz1981 @cescoffier I decided to create a specific issue for this one. I think it would have value for us to be able to fully control the Netty lifecycle instead of relying on finalizers. Obviously, I have no idea how easy it would be.

This is the stacktrace we obtain:

Class loader QuarkusClassLoader:QuarkusUnitTest ClassLoader for NonAppEndpointsDisabledTest@75b45343 has been closed and may not be accessed anymore
java.lang.Exception: Stack trace
    at java.base/java.lang.Thread.dumpStack(Thread.java:1389)
    at io.quarkus.bootstrap.classloading.QuarkusClassLoader.ensureOpen(QuarkusClassLoader.java:730)
    at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:509)
    at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:504)
    at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:556)
    at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:504)
    at io.netty.util.internal.shaded.org.jctools.queues.ConcurrentCircularArrayQueue.isEmpty(ConcurrentCircularArrayQueue.java:71)
    at io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue.isEmpty(MpscArrayQueue.java:207)
    at io.netty.buffer.PoolThreadCache.checkCacheMayLeak(PoolThreadCache.java:236)
    at io.netty.buffer.PoolThreadCache.free(PoolThreadCache.java:227)
    at io.netty.buffer.PoolThreadCache$FreeOnFinalize.finalize(PoolThreadCache.java:500)
    at java.base/java.lang.System$2.invokeFinalize(System.java:2297)
    at java.base/java.lang.ref.Finalizer.runFinalizer(Finalizer.java:88)
    at java.base/java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:173)

This is logged when trying to load IndexedQueueSizeUtil to check if the queue is empty.

While it might sound like something that is not that problematic because it works, we actually have a lot of problems coming from the fact that we don't enforce the closing of CLs. When a QuarkusClassLoader is closed, we shouldn't try to load anything from it anymore. This is not exactly easy and this is just one of the occurrences but I thought it would be better isolated in a specific issue.

quarkus-bot[bot] commented 3 months ago

/cc @cescoffier (netty), @franz1981 (netty), @jponge (netty)

gsmet commented 3 months ago

Note that this isn't the only issue I have with Netty.

I also get the following for some tests:

Class loader QuarkusClassLoader:QuarkusUnitTest ClassLoader for InvalidAttributesConfigTest@27494e46 has been closed and may not be accessed anymore
java.lang.Exception: Stack trace
    at java.base/java.lang.Thread.dumpStack(Thread.java:1389)
    at io.quarkus.bootstrap.classloading.QuarkusClassLoader.ensureOpen(QuarkusClassLoader.java:727)
    at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:501)
    at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:514)
    at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:503)
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:95)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:159)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
    at io.netty.util.concurrent.GlobalEventExecutor$TaskRunner.run(GlobalEventExecutor.java:262)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:840)
franz1981 commented 3 months ago

For the finalizer part I have sent a small fix - if you can give it a shot: https://github.com/franz1981/netty/tree/4.1_classloader_leak

Ideally we would like to "just" rely on FastThreadLocalThreads removal, and this could be implemented in a different PR, but the point is that it would assume that the cache is used only within such threads, which is not a safe assumption for many users, hence IDK how the Netty devs will react to such proposal - to open the proverbial "can of worms"

franz1981 commented 3 months ago

For the other one re scheduled tasks I think they are not cancelled, because https://github.com/netty/netty/pull/9580 shows that we are currently removing them on cancel.

franz1981 commented 3 months ago

@gsmet FYI https://github.com/netty/netty/pull/14155#issuecomment-2203111706

If is accepted - it means an additional property to us :"(

franz1981 commented 3 months ago

ok @gsmet https://github.com/netty/netty/pull/14155 is ready to go, so before is getting merged can you provide a feedback if it is working for your case? you need to set an additional sys property i.e. -Dio.netty.allocator.disableCacheFinalizersForFastThreadLocalThreads=true

which disable finalizers for both jboss blocking thread pool and event loops (which should be both FastThreadLocalThreads afaik, @cescoffier am I right?)

gsmet commented 3 months ago

@franz1981 so the issue mentioned in the description looks solved by your PR.

Still wondering if we could improve the unrelated situation I was mentioning in this comment: https://github.com/quarkusio/quarkus/issues/41607#issuecomment-2202876679

franz1981 commented 3 months ago

For that issue I have commented https://github.com/quarkusio/quarkus/issues/41607#issuecomment-2203053666 which means that we have some scheduled task which is stopped likely (just guessing, but I have no idea actually...)

gsmet commented 3 months ago

@cescoffier do you have any idea of what could schedule these tasks and what they could be?

gsmet commented 3 months ago

@franz1981 @cescoffier I added some breakpoints and AFAICS the tasks are from the GlobalEventExecutor. And I see a lot of cancelScheduledTasks() called for the NioEventLoop but I haven't seen a call to the method for GlobalEventExecutor.

Now it's hard to debug as things pop up everywhere but I wonder if it could be an interesting lead?

franz1981 commented 3 months ago

It seems that https://github.com/netty/netty/blob/ef2ed4e2cda6f764c661e24caccc833f9fd0b6e5/common/src/main/java/io/netty/util/concurrent/GlobalEventExecutor.java#L55 is a singleton event loop, single threaded, which, if has its single thread started, schedule a periodic task every each second to check if it should stop it, because idle. It is used as fallback executor, but i don't know why it cannot be unloaded...in theory after 1 second of inactivity, it should stop it's Thread. If you can, try debugging at https://github.com/netty/netty/blob/ef2ed4e2cda6f764c661e24caccc833f9fd0b6e5/common/src/main/java/io/netty/util/concurrent/GlobalEventExecutor.java#L262 to print if there's some task which keep on getting run and prevent it to stop.

The executor itself should stop its single thread if there is nothing to do, see https://github.com/netty/netty/blob/ef2ed4e2cda6f764c661e24caccc833f9fd0b6e5/common/src/main/java/io/netty/util/concurrent/GlobalEventExecutor.java#L278 (the break parts) - so we just need to understand why is not happening. When it will happen, it should be just an empty shell to unload.

gsmet commented 3 months ago

This very naive patch seems to improve things:

diff --git a/extensions/vertx/runtime/src/main/java/io/quarkus/vertx/core/runtime/VertxCoreRecorder.java b/extensions/vertx/runtime/src/main/java/io/quarkus/vertx/core/runtime/VertxCoreRecorder.java
index e0adae3bef1..30e8102f811 100644
--- a/extensions/vertx/runtime/src/main/java/io/quarkus/vertx/core/runtime/VertxCoreRecorder.java
+++ b/extensions/vertx/runtime/src/main/java/io/quarkus/vertx/core/runtime/VertxCoreRecorder.java
@@ -34,6 +34,7 @@

 import io.netty.channel.EventLoopGroup;
 import io.netty.util.concurrent.FastThreadLocal;
+import io.netty.util.concurrent.GlobalEventExecutor;
 import io.quarkus.arc.Arc;
 import io.quarkus.arc.InstanceHandle;
 import io.quarkus.runtime.ExecutorRecorder;
@@ -453,6 +454,11 @@ public void handle(AsyncResult<Void> ar) {
             }
             LateBoundMDCProvider.setMDCProviderDelegate(null);
             vertx = null;
+            try {
+                GlobalEventExecutor.INSTANCE.awaitInactivity(0, TimeUnit.SECONDS);
+            } catch (InterruptedException e) {
+                // ignore
+            }
         }
     }

Now, I think we would need a proper way to actually shutdown this executor. All the shutdown methods have been disabled for this class. I'm not entirely sure what I did ensures we won't have any more tasks executed.

franz1981 commented 3 months ago

The executor is not really "running": it lazily creates a single Thread to handle all tasks, which will stop after 1 second of inactivity (which is using a timer to detect it). While stopped it should be an empty shell, really.

Re https://github.com/quarkusio/quarkus/issues/41607#issuecomment-2208941664

beware that if there is no Thread running it will throw NPE, so we need to submit a task and right after await its inactivity. But...if there's something which keep on awake it or keeping it alive, we need to debug which task is.

At best this approach will force at least 1 second of pause between "restart": based on the configured -Dio.netty.globalEventExecutor.quietPeriodSeconds parameter, which is 1 by default.

cescoffier commented 3 months ago

I agree with Franz, we need to see why the executor is not shutdown nicely.

I don't know if possible (didn't check the code yet) but in dev mode, we could imagine using reflection to shutdown the executor properly.

Do you have a way to debug / trigger this @gsmet, I can try to have a look tomorrow or Monday.

gsmet commented 3 months ago

@franz1981

From what I can see, you have the GlobalEventExecutor#quietPeriodTask and then a bunch of tasks coming from: https://github.com/netty/netty/blob/ef2ed4e2cda6f764c661e24caccc833f9fd0b6e5/common/src/main/java/io/netty/util/concurrent/DefaultPromise.java#L500-L505

As for reproducing it, I'm trying to get my PR in but the reviewing guys are fighting... :)

gsmet commented 2 months ago

@cescoffier @franz1981 so to reproduce it, make sure you have built a tree that contains: https://github.com/quarkusio/quarkus/pull/41692

Then you can run:

./mvnw clean verify -f extensions/opentelemetry/deployment/ -Dstart-containers -Dtest-containers -Dquarkus-log-access-to-closed-class-loaders -Dtest=InvalidAttributesConfigTest,NonAppEndpointsDisabledTest

You should see several stack traces similar to the following when the second test starts (you need two tests to see the issue):

Class loader QuarkusClassLoader:QuarkusUnitTest ClassLoader for InvalidAttributesConfigTest@4d157787 has been closed and may not be accessed anymore
java.lang.Exception: Stack trace
    at java.base/java.lang.Thread.dumpStack(Thread.java:1389)
    at io.quarkus.bootstrap.classloading.QuarkusClassLoader.ensureOpen(QuarkusClassLoader.java:728)
    at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:514)
    at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:509)
    at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:520)
    at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:509)
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:95)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:159)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
    at io.netty.util.concurrent.GlobalEventExecutor$TaskRunner.run(GlobalEventExecutor.java:262)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:840)