quarkusio / quarkus

Quarkus: Supersonic Subatomic Java.
https://quarkus.io
Apache License 2.0
13.77k stars 2.68k forks source link

RunnerClassLoader - Some classes are loaded multiple times #42874

Open gsmet opened 2 months ago

gsmet commented 2 months ago

I think there is still some issue with the new RunnerClassLoader implementation from https://github.com/quarkusio/quarkus/pull/42139.

I have already mentioned in some other work that I was under the impression something was odd with class loading stuff being a bit more present than in 3.13. I was experimenting with another patch today and just trying to start the very simple Quarkus GitHub Bot app in prod mode.

Again I saw the ClassLoader being a bit more present than in 3.13.

I did something quite simple, I applied this patch:

diff --git a/independent-projects/bootstrap/runner/src/main/java/io/quarkus/bootstrap/runner/RunnerClassLoader.java b/independent-projects/bootstrap/runner/src/main/java/io/quarkus/bootstrap/runner/RunnerClassLoader.java
index 071d77ffef1..4f6fdb1895f 100644
--- a/independent-projects/bootstrap/runner/src/main/java/io/quarkus/bootstrap/runner/RunnerClassLoader.java
+++ b/independent-projects/bootstrap/runner/src/main/java/io/quarkus/bootstrap/runner/RunnerClassLoader.java
@@ -88,6 +88,9 @@ public Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundExce
         if (loaded != null) {
             return loaded;
         }
+
+        System.out.println("Loading class: " + name);
+
         final ClassLoadingResource[] resources;
         if (packageName == null) {
             resources = resourceDirectoryMap.get("");

While most classes are loaded once, I have several classes loaded multiple times, up to 16 times for some of them:

     16 Loading class: io.vertx.core.http.impl.HttpUtils
     16 Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler
     16 Loading class: io.quarkus.vertx.http.runtime.VertxHttpRecorder$WebDeploymentVerticle$3
     16 Loading class: io.netty.util.NetUtil
      8 Loading class: io.vertx.core.net.impl.SocketAddressImpl
      6 Loading class: io.vertx.core.http.impl.HttpServerImpl
      5 Loading class: io.vertx.core.net.impl.TCPServerBase

(the rest of the classes are loaded only once)

Reverting https://github.com/quarkusio/quarkus/pull/42139 brings us back to normal AFAICS.

It doesn't sound like a desired behavior to me?

gsmet commented 2 months ago

/cc @geoand @mariofusco @franz1981 @Sanne for your awareness

geoand commented 2 months ago

It definitely warrants a thorough investigation

gsmet commented 2 months ago

Note that it's not stable, another run just now gave me:

     16 Loading class: io.vertx.core.http.impl.HttpUtils
     16 Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler
     16 Loading class: io.quarkus.vertx.http.runtime.VertxHttpRecorder$WebDeploymentVerticle$3
     16 Loading class: io.netty.util.NetUtil
     12 Loading class: io.vertx.core.net.impl.SocketAddressImpl
      8 Loading class: io.vertx.core.net.impl.TCPServerBase
      6 Loading class: io.vertx.core.http.impl.HttpServerImpl

No idea why they are all related to Vert.x/Netty.

gsmet commented 2 months ago

/cc @cescoffier @maxandersen too

franz1981 commented 2 months ago

https://github.com/quarkusio/quarkus/pull/42139/files#diff-a8ea1b8267b00e0c29800f9e3d324797e9b62667b590fe3c3c8ad81f708f44f4R196

I remember that apart from the concurrency of the algorithm, this part has changed as well

franz1981 commented 2 months ago

@gsmet the events of class loading happen in burst or are separated from each others? Because being now concurrent (truly concurrent) maybe N loading are initiated but only one really succeeded...

gsmet commented 2 months ago

They happen all at once for each class and very close to each other.

See this attached file: class-loading-events-main.txt

franz1981 commented 2 months ago

@gsmet I believe it depends by the number of event loops which initiate the loading by accessing the classes, try printing the current thread too...

gsmet commented 2 months ago

Here is an extract with the thread name (not all of them but you get the idea):

Loading class: io.vertx.core.http.impl.HttpServerImpl from thread: vert.x-eventloop-thread-0
Loading class: io.vertx.core.http.impl.HttpServerImpl from thread: vert.x-eventloop-thread-1
Loading class: io.vertx.core.http.impl.HttpServerImpl from thread: vert.x-eventloop-thread-2
Loading class: io.vertx.core.net.impl.TCPServerBase from thread: vert.x-eventloop-thread-0
Loading class: io.vertx.core.net.impl.TCPServerBase from thread: vert.x-eventloop-thread-1
Loading class: io.vertx.core.net.impl.TCPServerBase from thread: vert.x-eventloop-thread-2
Loading class: io.vertx.core.net.impl.TCPServerBase from thread: vert.x-eventloop-thread-4
Loading class: io.vertx.core.net.impl.TCPServerBase from thread: vert.x-eventloop-thread-3
Loading class: io.vertx.core.net.impl.TCPServerBase from thread: vert.x-eventloop-thread-5
Loading class: io.vertx.core.net.impl.TCPServerBase from thread: vert.x-eventloop-thread-6
Loading class: io.vertx.core.net.SocketAddress from thread: vert.x-eventloop-thread-0
Loading class: io.vertx.core.spi.metrics.TCPMetrics from thread: vert.x-eventloop-thread-0
Loading class: io.vertx.core.spi.metrics.NetworkMetrics from thread: vert.x-eventloop-thread-0
Loading class: io.vertx.core.net.impl.NetServerImpl from thread: vert.x-eventloop-thread-0
Loading class: io.vertx.core.impl.future.Mapping from thread: main
Loading class: io.vertx.core.impl.future.Operation from thread: main
Loading class: io.vertx.core.impl.future.FutureImpl$4 from thread: main
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-13
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-15
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-0
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-14
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-6
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-5
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-7
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-1
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-8
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-3
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-4
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-2
gsmet commented 2 months ago

BTW, what concerns me a bit is that I have seen the following pattern several times in various flame graphs of various applications and always with main in this position so while it could be an artifact of the method I used to get the flamegraph, it still looks a bit odd to me:

Loading a given class in 3.13

Screenshot from 2024-08-29 19-04-52

Loading the same class in main

Screenshot from 2024-08-29 19-04-45

The class loader stuff looks a lot more present in main. And it's not the first time I observe that, I have seen it in different flamegraphs of different applications lately.

franz1981 commented 2 months ago

I suggest to capture traces via event=Java_java_lang_ClassLoader_defineClass1

Adding the threads option, so you will see the threads which performed the class loading and count the number of samples.

If they match, the actual number of class loading event is the same.

I have the gut feeling (to verify) that we are not observing multiple class loading but just a racing between threads while printing that we initiated a class load event, with a single winner performing it, but I need to look at the code to make sure of it.

Sanne commented 2 months ago

I believe it's a consequence of the new design: it's not using locks, rather letting it fail (and ignore) on duplicate class definition.

Not ideal, but a tradeoff which was chosen intentionally.

gsmet commented 2 months ago

But why are we paying this tradeoff when not using virtual threads?

I can understand we want a new design for this particular use case but why make the standard case worse?

gsmet commented 2 months ago

I have the gut feeling (to verify) that we are not observing multiple class loading but just a racing between threads while printing that we initiated a class load event, with a single winner performing it, but I need to look at the code to make sure of it.

Sorry, when I said the classes are loaded several times, I don't mean that we define them several times. What I'm saying is that we will load the content of the class several times, and if it has been defined already it will fail in defineClass and we will load the class again with findLoadedClass.

Basically what we are doing 16 times is to read the class bytes from the ClassLoadingResource.

That being said, I'm not sure it explains what I have seen in my flame graphs over and over these past few weeks. I have other things to do atm and if you have a strong opinion of how this particular additional pressure (if it's actually real) should be chased (the ones from the flamegraphs), please do.

Maybe it's a false alarm but again, I have seen it consistently for the past weeks and always in main compared to previous versions.

mariofusco commented 2 months ago

I agree with @Sanne's and @franz1981's analysis: the behaviour found by @gsmet is admittedly not ideal but it is a trade-off that is the direct consequence of having flagged this class loader as parallel capable plus the fully non-blocking design required to support virtual threads. This is also empirically demonstrated by the fact that the maximum number of parallel class loading in Guillaume experiments is 16 which is probably also the number of cores of his machine.

But why are we paying this tradeoff when not using virtual threads?

This is a fair observation and indeed one of the first implementation that I attempted had 2 different execution paths depending on the nature of the thread accessing the class loader. I abandoned that idea for a few reasons:

  1. it made the algorithm even more complex and hard to debug than how it already is
  2. it required to keep in the class loader all the locks and data structures necessary for both cases thus making the class loader itself heavier and more memory demanding
  3. (more important) it cannot work if both native and virtual threads try to access concurrently to this class loader

The only other fix (or hack) that could allow us to mitigate this problem that comes me to mind is putting together a short list of the classes having the highest probability of triggering this problem and preload them when the class loader is created. I have the feeling (but correct me if I'm wrong) that they are always more or less the same, depending on the quarkus/vertx/netty stack and not by the specific test case used by @gsmet to investigate this issue. I admit that this is an ugly workaround (and feels a bit like cheating) but this class loader is already specifically designed for quarkus and far from being a general purpose one, so if we could leverage our knowledge of our architecture why shouldn't we?

franz1981 commented 2 months ago

@mariofusco https://github.com/quarkusio/quarkus/pull/42139/files#diff-f20997e90aae1d2fe046b512defb1a3e14215b7a834177d43369542bae99a7f6R104 it shows that we still have separate paths for these....

franz1981 commented 2 months ago

I think instead this "problem" could be solved actually: https://github.com/mariofusco/quarkus/blob/0c479cecdbd6625edc9146bc958b483886f378d4/independent-projects/bootstrap/runner/src/main/java/io/quarkus/bootstrap/runner/RunnerClassLoader.java#L111C24-L111C35

here the defineClass can fail with linkage error due to concurrently defining a classs; this is where we can decide to have virtual threads to fail and normal threads to synchronize vs the resource data (byte[]? maybe) before making definePackage and defineClass to happen. This will save the exception to occur assuming the 2 threads are using the same byte[] (which is likely, given that we use reference counting for this reason!). And perf-wise should still be "decent" because the concurrency level and granularity will be enormously better than what we had with a non-parallel class laoder.

wdyt @gsmet @Sanne @mariofusco ?

mariofusco commented 2 months ago

@mariofusco https://github.com/quarkusio/quarkus/pull/42139/files#diff-f20997e90aae1d2fe046b512defb1a3e14215b7a834177d43369542bae99a7f6R104 it shows that we still have separate paths for these....

True, that's on a lower level than the point where @gsmet was counting the number of class loading requests and more internal though. Moreover it doesn't require the use of any read/write lock as the original implementation did.

gsmet commented 2 months ago

I think we need to be careful not making things slower by trying to detect a narrow case. Maybe we can live with this additional loading of resources, I don't know.

@franz1981 were you able to reproduce the CL activity that looks more intense in main compared to 3.13 that I was showing in the flame graph?

I was observing that with https://github.com/quarkusio/quarkus-github-bot but I have seen the same with other apps too. It might just be a matter of sampling but I have seen it three times now always with main vs 3.13.

franz1981 commented 2 months ago

I think we need to be careful not making things slower by trying to detect a narrow case. Maybe we can live with this additional loading of resources, I don't know.

we don't load but define class more times, failing due to linking error

@franz1981 were you able to reproduce the CL activity that looks more intense in main compared to 3.13 that I was showing in the flame graph?

If you detected it because of the number of cores of your machine, is not needed - it is by design that would happen - including throwing exception to fallback on the existing defined class

The thing is that the solution I've proposed at https://github.com/quarkusio/quarkus/issues/42874#issuecomment-2320508629 can be prone to deadlock(s) - @dmlloyd can help me here, actually.

Another solution instead is to leverage the existing algorithm which distinguish the thread which has loaded the resource data the first time from who is concurrently doing it.

gsmet commented 2 months ago

we don't load but define class more times, failing due to linking error

I said loading resources, we are reading the byte[] multiple times.

If you detected it because of the number of cores of your machine, is not needed - it is by design that would happen - including throwing exception to fallback on the existing defined class

No, the flamegraphs that I posted above are about loading some Jackson classes, they are not affected by this very issue which only happens when starting the Vert.x HTTP server on the multiple event loop threads.

dmlloyd commented 2 months ago

I think instead this "problem" could be solved actually: https://github.com/mariofusco/quarkus/blob/0c479cecdbd6625edc9146bc958b483886f378d4/independent-projects/bootstrap/runner/src/main/java/io/quarkus/bootstrap/runner/RunnerClassLoader.java#L111C24-L111C35

here the defineClass can fail with linkage error due to concurrently defining a classs; this is where we can decide to have virtual threads to fail and normal threads to synchronize vs the resource data (byte[]? maybe) before making definePackage and defineClass to happen. This will save the exception to occur assuming the 2 threads are using the same byte[] (which is likely, given that we use reference counting for this reason!). And perf-wise should still be "decent" because the concurrency level and granularity will be enormously better than what we had with a non-parallel class laoder.

wdyt @gsmet @Sanne @mariofusco ?

If you're talking about rejecting class loading from virtual threads to avoid pinning, I think this is not necessarily a bad idea, but it will require us to preload some classes, won't it?

franz1981 commented 2 months ago

I said loading resources, we are reading the byte[] multiple times.

@gsmet sorry I've misread it - this shouldn't really happen because ref counting should handle, if the resources are not yet closed - to share the existing byte[] (eventually waiting till it is made available) , see https://github.com/quarkusio/quarkus/blob/main/independent-projects/bootstrap/runner/src/main/java/io/quarkus/bootstrap/runner/JarFileReference.java#L91

If this is not happening, something is wrong with the algorithm (@mariofusco ) or the cache where we store it was evicted. But given that you said it was happening in batch I suppose is the former.

franz1981 commented 2 months ago

@gsmet Please try add the System.out at:

https://github.com/quarkusio/quarkus/blob/708a7624102fc4a2ee06c0a0349632b67ab40366/independent-projects/bootstrap/runner/src/main/java/io/quarkus/bootstrap/runner/RunnerClassLoader.java#L110

System.out.println(Thread.currentThread() + " is defining " + name + " using the data: " + System.identityHashCode(data))

and at

https://github.com/quarkusio/quarkus/blob/708a7624102fc4a2ee06c0a0349632b67ab40366/independent-projects/bootstrap/runner/src/main/java/io/quarkus/bootstrap/runner/RunnerClassLoader.java#L154

as

System.out.println(Thread.currentThread() + " has failed to define " + name)

If, as I would expect, the data's identity hashcode is the same, I would expect 15/16 concurrent define to fail with linkage error. If the data identity hashcode would be different, instead, I need to think more what could have happened.

Regardless this issue, as agreed with @Sanne some time ago, I think we need some custom JFR events here to emit what the class loader is doing (can I delegate this to you @mariofusco ?). These events, together with the existing JFR class loading/defining events, should be enough to both write regression tests and eventually collect them for perf purposes.

franz1981 commented 2 months ago

My proposal here @mariofusco :

This should avoid the storm of linkage errors/repeated byte[] reads on the same jar res. Clearly this is a problem with laptop machines and big servers, where the concurrent behaviour is more evident and possible