Open gsmet opened 3 months ago
/cc @cescoffier (netty), @franz1981 (netty), @jponge (netty)
Note that this isn't the only issue I have with Netty.
I also get the following for some tests:
Class loader QuarkusClassLoader:QuarkusUnitTest ClassLoader for InvalidAttributesConfigTest@27494e46 has been closed and may not be accessed anymore
java.lang.Exception: Stack trace
at java.base/java.lang.Thread.dumpStack(Thread.java:1389)
at io.quarkus.bootstrap.classloading.QuarkusClassLoader.ensureOpen(QuarkusClassLoader.java:727)
at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:501)
at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:514)
at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:503)
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:95)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:159)
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
at io.netty.util.concurrent.GlobalEventExecutor$TaskRunner.run(GlobalEventExecutor.java:262)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:840)
For the finalizer part I have sent a small fix - if you can give it a shot: https://github.com/franz1981/netty/tree/4.1_classloader_leak
Ideally we would like to "just" rely on FastThreadLocalThread
s removal, and this could be implemented in a different PR, but the point is that it would assume that the cache is used only within such threads, which is not a safe assumption for many users, hence IDK how the Netty devs will react to such proposal - to open the proverbial "can of worms"
For the other one re scheduled tasks I think they are not cancelled, because https://github.com/netty/netty/pull/9580 shows that we are currently removing them on cancel.
@gsmet FYI https://github.com/netty/netty/pull/14155#issuecomment-2203111706
If is accepted - it means an additional property to us :"(
ok @gsmet https://github.com/netty/netty/pull/14155 is ready to go, so before is getting merged can you provide a feedback if it is working for your case?
you need to set an additional sys property i.e. -Dio.netty.allocator.disableCacheFinalizersForFastThreadLocalThreads=true
which disable finalizers for both jboss blocking thread pool and event loops (which should be both FastThreadLocalThread
s afaik, @cescoffier am I right?)
@franz1981 so the issue mentioned in the description looks solved by your PR.
Still wondering if we could improve the unrelated situation I was mentioning in this comment: https://github.com/quarkusio/quarkus/issues/41607#issuecomment-2202876679
For that issue I have commented https://github.com/quarkusio/quarkus/issues/41607#issuecomment-2203053666 which means that we have some scheduled task which is stopped likely (just guessing, but I have no idea actually...)
@cescoffier do you have any idea of what could schedule these tasks and what they could be?
@franz1981 @cescoffier I added some breakpoints and AFAICS the tasks are from the GlobalEventExecutor
. And I see a lot of cancelScheduledTasks()
called for the NioEventLoop
but I haven't seen a call to the method for GlobalEventExecutor
.
Now it's hard to debug as things pop up everywhere but I wonder if it could be an interesting lead?
It seems that https://github.com/netty/netty/blob/ef2ed4e2cda6f764c661e24caccc833f9fd0b6e5/common/src/main/java/io/netty/util/concurrent/GlobalEventExecutor.java#L55 is a singleton event loop, single threaded, which, if has its single thread started, schedule a periodic task every each second to check if it should stop it, because idle. It is used as fallback executor, but i don't know why it cannot be unloaded...in theory after 1 second of inactivity, it should stop it's Thread. If you can, try debugging at https://github.com/netty/netty/blob/ef2ed4e2cda6f764c661e24caccc833f9fd0b6e5/common/src/main/java/io/netty/util/concurrent/GlobalEventExecutor.java#L262 to print if there's some task which keep on getting run and prevent it to stop.
The executor itself should stop its single thread if there is nothing to do, see https://github.com/netty/netty/blob/ef2ed4e2cda6f764c661e24caccc833f9fd0b6e5/common/src/main/java/io/netty/util/concurrent/GlobalEventExecutor.java#L278 (the break
parts) - so we just need to understand why is not happening.
When it will happen, it should be just an empty shell to unload.
This very naive patch seems to improve things:
diff --git a/extensions/vertx/runtime/src/main/java/io/quarkus/vertx/core/runtime/VertxCoreRecorder.java b/extensions/vertx/runtime/src/main/java/io/quarkus/vertx/core/runtime/VertxCoreRecorder.java
index e0adae3bef1..30e8102f811 100644
--- a/extensions/vertx/runtime/src/main/java/io/quarkus/vertx/core/runtime/VertxCoreRecorder.java
+++ b/extensions/vertx/runtime/src/main/java/io/quarkus/vertx/core/runtime/VertxCoreRecorder.java
@@ -34,6 +34,7 @@
import io.netty.channel.EventLoopGroup;
import io.netty.util.concurrent.FastThreadLocal;
+import io.netty.util.concurrent.GlobalEventExecutor;
import io.quarkus.arc.Arc;
import io.quarkus.arc.InstanceHandle;
import io.quarkus.runtime.ExecutorRecorder;
@@ -453,6 +454,11 @@ public void handle(AsyncResult<Void> ar) {
}
LateBoundMDCProvider.setMDCProviderDelegate(null);
vertx = null;
+ try {
+ GlobalEventExecutor.INSTANCE.awaitInactivity(0, TimeUnit.SECONDS);
+ } catch (InterruptedException e) {
+ // ignore
+ }
}
}
Now, I think we would need a proper way to actually shutdown this executor. All the shutdown methods have been disabled for this class. I'm not entirely sure what I did ensures we won't have any more tasks executed.
The executor is not really "running": it lazily creates a single Thread to handle all tasks, which will stop after 1 second of inactivity (which is using a timer to detect it). While stopped it should be an empty shell, really.
Re https://github.com/quarkusio/quarkus/issues/41607#issuecomment-2208941664
beware that if there is no Thread
running it will throw NPE, so we need to submit a task and right after await its inactivity. But...if there's something which keep on awake it or keeping it alive, we need to debug which task is.
At best this approach will force at least 1 second of pause between "restart": based on the configured -Dio.netty.globalEventExecutor.quietPeriodSeconds
parameter, which is 1 by default.
I agree with Franz, we need to see why the executor is not shutdown nicely.
I don't know if possible (didn't check the code yet) but in dev mode, we could imagine using reflection to shutdown the executor properly.
Do you have a way to debug / trigger this @gsmet, I can try to have a look tomorrow or Monday.
@franz1981
From what I can see, you have the GlobalEventExecutor#quietPeriodTask
and then a bunch of tasks coming from: https://github.com/netty/netty/blob/ef2ed4e2cda6f764c661e24caccc833f9fd0b6e5/common/src/main/java/io/netty/util/concurrent/DefaultPromise.java#L500-L505
As for reproducing it, I'm trying to get my PR in but the reviewing guys are fighting... :)
@cescoffier @franz1981 so to reproduce it, make sure you have built a tree that contains: https://github.com/quarkusio/quarkus/pull/41692
Then you can run:
./mvnw clean verify -f extensions/opentelemetry/deployment/ -Dstart-containers -Dtest-containers -Dquarkus-log-access-to-closed-class-loaders -Dtest=InvalidAttributesConfigTest,NonAppEndpointsDisabledTest
You should see several stack traces similar to the following when the second test starts (you need two tests to see the issue):
Class loader QuarkusClassLoader:QuarkusUnitTest ClassLoader for InvalidAttributesConfigTest@4d157787 has been closed and may not be accessed anymore
java.lang.Exception: Stack trace
at java.base/java.lang.Thread.dumpStack(Thread.java:1389)
at io.quarkus.bootstrap.classloading.QuarkusClassLoader.ensureOpen(QuarkusClassLoader.java:728)
at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:514)
at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:509)
at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:520)
at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:509)
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:95)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:159)
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
at io.netty.util.concurrent.GlobalEventExecutor$TaskRunner.run(GlobalEventExecutor.java:262)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:840)
@franz1981 @cescoffier I decided to create a specific issue for this one. I think it would have value for us to be able to fully control the Netty lifecycle instead of relying on finalizers. Obviously, I have no idea how easy it would be.
This is the stacktrace we obtain:
This is logged when trying to load
IndexedQueueSizeUtil
to check if the queue is empty.While it might sound like something that is not that problematic because it works, we actually have a lot of problems coming from the fact that we don't enforce the closing of CLs. When a
QuarkusClassLoader
is closed, we shouldn't try to load anything from it anymore. This is not exactly easy and this is just one of the occurrences but I thought it would be better isolated in a specific issue.