gsmet opened 2 months ago
/cc @geoand @mariofusco @franz1981 @Sanne for your awareness
It definitely warrants a thorough investigation
Note that it's not stable, another run just now gave me:
16 Loading class: io.vertx.core.http.impl.HttpUtils
16 Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler
16 Loading class: io.quarkus.vertx.http.runtime.VertxHttpRecorder$WebDeploymentVerticle$3
16 Loading class: io.netty.util.NetUtil
12 Loading class: io.vertx.core.net.impl.SocketAddressImpl
8 Loading class: io.vertx.core.net.impl.TCPServerBase
6 Loading class: io.vertx.core.http.impl.HttpServerImpl
No idea why they are all related to Vert.x/Netty.
/cc @cescoffier @maxandersen too
I remember that, apart from the concurrency of the algorithm, this part has changed as well.
@gsmet do the class loading events happen in bursts, or are they separated from each other? Because the loading is now truly concurrent, maybe N loads are initiated but only one really succeeds...
They happen all at once for each class and very close to each other.
See this attached file: class-loading-events-main.txt
@gsmet I believe it depends on the number of event loops which initiate the loading by accessing the classes; try printing the current thread too...
Here is an extract with the thread name (not all of them but you get the idea):
Loading class: io.vertx.core.http.impl.HttpServerImpl from thread: vert.x-eventloop-thread-0
Loading class: io.vertx.core.http.impl.HttpServerImpl from thread: vert.x-eventloop-thread-1
Loading class: io.vertx.core.http.impl.HttpServerImpl from thread: vert.x-eventloop-thread-2
Loading class: io.vertx.core.net.impl.TCPServerBase from thread: vert.x-eventloop-thread-0
Loading class: io.vertx.core.net.impl.TCPServerBase from thread: vert.x-eventloop-thread-1
Loading class: io.vertx.core.net.impl.TCPServerBase from thread: vert.x-eventloop-thread-2
Loading class: io.vertx.core.net.impl.TCPServerBase from thread: vert.x-eventloop-thread-4
Loading class: io.vertx.core.net.impl.TCPServerBase from thread: vert.x-eventloop-thread-3
Loading class: io.vertx.core.net.impl.TCPServerBase from thread: vert.x-eventloop-thread-5
Loading class: io.vertx.core.net.impl.TCPServerBase from thread: vert.x-eventloop-thread-6
Loading class: io.vertx.core.net.SocketAddress from thread: vert.x-eventloop-thread-0
Loading class: io.vertx.core.spi.metrics.TCPMetrics from thread: vert.x-eventloop-thread-0
Loading class: io.vertx.core.spi.metrics.NetworkMetrics from thread: vert.x-eventloop-thread-0
Loading class: io.vertx.core.net.impl.NetServerImpl from thread: vert.x-eventloop-thread-0
Loading class: io.vertx.core.impl.future.Mapping from thread: main
Loading class: io.vertx.core.impl.future.Operation from thread: main
Loading class: io.vertx.core.impl.future.FutureImpl$4 from thread: main
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-13
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-15
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-0
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-14
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-6
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-5
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-7
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-1
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-8
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-3
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-4
Loading class: io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler from thread: vert.x-eventloop-thread-2
BTW, what concerns me a bit is that I have seen the following pattern several times in various flame graphs of various applications, and always with main in this position. So while it could be an artifact of the method I used to get the flame graph, it still looks a bit odd to me:
The class loader stuff looks a lot more present in main. And it's not the first time I observe that; I have seen it in different flame graphs of different applications lately.
I suggest capturing traces via `event=Java_java_lang_ClassLoader_defineClass1`, adding the threads option, so you will see the threads which performed the class loading and can count the number of samples.
If they match, the actual number of class loading events is the same.
I have a gut feeling (to be verified) that we are not observing multiple class loadings but just a race between threads while printing that we initiated a class load event, with a single winner actually performing it, but I need to look at the code to make sure of it.
I believe it's a consequence of the new design: it's not using locks, rather letting it fail (and ignore) on duplicate class definition.
Not ideal, but a tradeoff which was chosen intentionally.
But why are we paying this tradeoff when not using virtual threads?
I can understand we want a new design for this particular use case but why make the standard case worse?
I have a gut feeling (to be verified) that we are not observing multiple class loadings but just a race between threads while printing that we initiated a class load event, with a single winner actually performing it, but I need to look at the code to make sure of it.
Sorry, when I said the classes are loaded several times, I don't mean that we define them several times. What I'm saying is that we will load the content of the class several times, and if it has been defined already it will fail in `defineClass` and we will load the class again with `findLoadedClass`.
Basically what we are doing 16 times is to read the class bytes from the `ClassLoadingResource`.
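To make the sequence concrete, here is a minimal sketch of the pattern (illustrative only, not the actual `RunnerClassLoader` code; `readClassBytes` is a made-up stand-in for the `ClassLoadingResource` lookup):

```java
// Minimal sketch of the pattern described above (illustrative only, not the actual
// RunnerClassLoader code): every racing thread reads the class bytes, a single thread
// wins the definition, and the losers fall back to the class the winner defined.
class RacyDefiningClassLoader extends ClassLoader {

    static {
        registerAsParallelCapable();
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        byte[] bytes = readClassBytes(name); // each racing thread pays for this read
        try {
            // only one concurrent definition of the same class can succeed
            return defineClass(name, bytes, 0, bytes.length);
        } catch (LinkageError alreadyDefined) {
            // another thread won the race: reuse the class it defined
            Class<?> winner = findLoadedClass(name);
            if (winner != null) {
                return winner;
            }
            throw alreadyDefined;
        }
    }

    // hypothetical helper standing in for the ClassLoadingResource lookup
    private byte[] readClassBytes(String name) throws ClassNotFoundException {
        throw new ClassNotFoundException(name);
    }
}
```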
That being said, I'm not sure it explains what I have seen in my flame graphs over and over these past few weeks. I have other things to do at the moment, so if you have a strong opinion on how this particular additional pressure (if it's actually real) should be chased (the one from the flame graphs), please do.
Maybe it's a false alarm but, again, I have seen it consistently for the past weeks, and always in main compared to previous versions.
I agree with @Sanne's and @franz1981's analysis: the behaviour found by @gsmet is admittedly not ideal, but it is a trade-off that is the direct consequence of having flagged this class loader as parallel capable, plus the fully non-blocking design required to support virtual threads. This is also empirically demonstrated by the fact that the maximum number of parallel class loadings in Guillaume's experiments is 16, which is probably also the number of cores of his machine.
But why are we paying this tradeoff when not using virtual threads?
This is a fair observation, and indeed one of the first implementations that I attempted had 2 different execution paths depending on the nature of the thread accessing the class loader. I abandoned that idea for a few reasons:
The only other fix (or hack) that could allow us to mitigate this problem that comes to my mind is putting together a short list of the classes having the highest probability of triggering this problem and preloading them when the class loader is created. I have the feeling (but correct me if I'm wrong) that they are always more or less the same, determined by the quarkus/vertx/netty stack and not by the specific test case used by @gsmet to investigate this issue. I admit that this is an ugly workaround (and feels a bit like cheating), but this class loader is already specifically designed for Quarkus and far from being a general purpose one, so if we can leverage our knowledge of our architecture, why shouldn't we?
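To illustrate the idea (a rough sketch only; `HotClassPreloader` is a made-up name and the hard-coded class names are simply taken from the logs above, not a vetted list):

```java
import java.util.List;

// Rough illustration of the preloading idea: eagerly resolve a short, hard-coded list of
// classes right after the class loader is created, so the event loop threads never race
// on them. The class names below are taken from the logs above, not a vetted list.
final class HotClassPreloader {

    static void preload(ClassLoader runnerClassLoader) {
        List<String> hotClasses = List.of(
                "io.vertx.core.net.impl.TCPServerBase",
                "io.vertx.core.http.impl.HttpServerImpl",
                "io.vertx.core.http.impl.HttpServerImpl$HttpStreamHandler");
        for (String name : hotClasses) {
            try {
                // initialize = false: define and resolve the class without running static initializers
                Class.forName(name, false, runnerClassLoader);
            } catch (ClassNotFoundException ignored) {
                // not every application has all of these classes on its classpath
            }
        }
    }
}
```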
@mariofusco https://github.com/quarkusio/quarkus/pull/42139/files#diff-f20997e90aae1d2fe046b512defb1a3e14215b7a834177d43369542bae99a7f6R104 it shows that we still have separate paths for these....
I think instead this "problem" could be solved actually: https://github.com/mariofusco/quarkus/blob/0c479cecdbd6625edc9146bc958b483886f378d4/independent-projects/bootstrap/runner/src/main/java/io/quarkus/bootstrap/runner/RunnerClassLoader.java#L111C24-L111C35
here the `defineClass` can fail with a linkage error due to concurrently defining a class; this is where we can decide to have virtual threads fail and normal threads synchronize on the resource data (`byte[]`? maybe) before making `definePackage` and `defineClass` happen. This will save the exception from occurring, assuming the 2 threads are using the same `byte[]` (which is likely, given that we use reference counting for this reason!). And perf-wise it should still be "decent" because the concurrency level and granularity will be enormously better than what we had with a non-parallel class loader.
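A very rough sketch of how that could look (assuming Java 21+ and that the racing threads really do share the same `byte[]` instance; all names here are illustrative, not the actual `RunnerClassLoader` code):

```java
// Very rough sketch of the idea above (illustrative only, not the actual RunnerClassLoader
// code): virtual threads keep the lock-free "define and fall back on LinkageError" path,
// while platform threads serialize on the shared, ref-counted class bytes before defining.
abstract class HybridDefiningClassLoader extends ClassLoader {

    static {
        registerAsParallelCapable();
    }

    protected Class<?> defineOrGetExisting(String name, byte[] classBytes) {
        if (Thread.currentThread().isVirtual()) {
            // virtual threads must not block, so they race and the losers fall back
            return tryDefine(name, classBytes);
        }
        // platform threads: the racing threads are assumed to share the same byte[] instance,
        // so synchronizing on it serializes definePackage/defineClass per resource
        synchronized (classBytes) {
            Class<?> existing = findLoadedClass(name);
            return existing != null ? existing : tryDefine(name, classBytes);
        }
    }

    private Class<?> tryDefine(String name, byte[] classBytes) {
        try {
            return defineClass(name, classBytes, 0, classBytes.length);
        } catch (LinkageError alreadyDefined) {
            Class<?> winner = findLoadedClass(name);
            if (winner != null) {
                return winner;
            }
            throw alreadyDefined;
        }
    }
}
```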
wdyt @gsmet @Sanne @mariofusco ?
@mariofusco https://github.com/quarkusio/quarkus/pull/42139/files#diff-f20997e90aae1d2fe046b512defb1a3e14215b7a834177d43369542bae99a7f6R104 it shows that we still have separate paths for these....
True, though that's at a lower level than the point where @gsmet was counting the number of class loading requests, and more internal. Moreover, it doesn't require the use of any read/write lock as the original implementation did.
I think we need to be careful not to make things slower by trying to detect a narrow case. Maybe we can live with this additional loading of resources, I don't know.
@franz1981 were you able to reproduce the CL activity that looks more intense in main compared to 3.13 that I was showing in the flame graph?
I was observing that with https://github.com/quarkusio/quarkus-github-bot but I have seen the same with other apps too. It might just be a matter of sampling but I have seen it three times now, always with main vs 3.13.
I think we need to be careful not to make things slower by trying to detect a narrow case. Maybe we can live with this additional loading of resources, I don't know.
we don't load the class more times, we define it more times, failing due to a linkage error
@franz1981 were you able to reproduce the CL activity that looks more intense in main compared to 3.13 that I was showing in the flame graph?
If you detected it because of the number of cores of your machine, that is not needed - it would happen by design, including throwing an exception to fall back on the already defined class
The thing is that the solution I've proposed at https://github.com/quarkusio/quarkus/issues/42874#issuecomment-2320508629 can be prone to deadlock(s) - @dmlloyd can help me here, actually.
Another solution instead is to leverage the existing algorithm, which distinguishes the thread that loaded the resource data the first time from the ones concurrently doing it.
we don't load the class more times, we define it more times, failing due to a linkage error
I said loading resources, we are reading the byte[] multiple times.
If you detected it because of the number of cores of your machine, that is not needed - it would happen by design, including throwing an exception to fall back on the already defined class
No, the flame graphs that I posted above are about loading some Jackson classes; they are not affected by this very issue, which only happens when starting the Vert.x HTTP server on the multiple event loop threads.
I think instead this "problem" could be solved actually: https://github.com/mariofusco/quarkus/blob/0c479cecdbd6625edc9146bc958b483886f378d4/independent-projects/bootstrap/runner/src/main/java/io/quarkus/bootstrap/runner/RunnerClassLoader.java#L111C24-L111C35
here the `defineClass` can fail with a linkage error due to concurrently defining a class; this is where we can decide to have virtual threads fail and normal threads synchronize on the resource data (`byte[]`? maybe) before making `definePackage` and `defineClass` happen. This will save the exception from occurring, assuming the 2 threads are using the same `byte[]` (which is likely, given that we use reference counting for this reason!). And perf-wise it should still be "decent" because the concurrency level and granularity will be enormously better than what we had with a non-parallel class loader.
wdyt @gsmet @Sanne @mariofusco ?
If you're talking about rejecting class loading from virtual threads to avoid pinning, I think this is not necessarily a bad idea, but it will require us to preload some classes, won't it?
I said loading resources, we are reading the byte[] multiple times.
@gsmet sorry, I misread it - this shouldn't really happen because, as long as the resources are not yet closed, ref counting should handle sharing the existing `byte[]` (eventually waiting until it is made available), see https://github.com/quarkusio/quarkus/blob/main/independent-projects/bootstrap/runner/src/main/java/io/quarkus/bootstrap/runner/JarFileReference.java#L91
If this is not happening, either something is wrong with the algorithm (@mariofusco) or the cache where we store it was evicted. But given that you said it was happening in batches, I suppose it is the former.
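For clarity, here is a toy sketch of the kind of ref counting being described; it is unrelated to the actual JarFileReference implementation, just an illustration of the intended sharing behaviour:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy sketch of the ref-counting idea described above (not the actual JarFileReference code):
// concurrent readers share the already-loaded bytes instead of re-reading them, and the
// underlying resource can only be released once the last reader is done with it.
final class SharedResourceData {

    private final byte[] data;
    private final AtomicInteger refCount = new AtomicInteger(1); // the creator holds one reference

    SharedResourceData(byte[] data) {
        this.data = data;
    }

    // Returns the shared bytes if the resource is still open, or null if it was already
    // closed and the caller has to re-open/re-read it.
    byte[] acquire() {
        int current;
        do {
            current = refCount.get();
            if (current == 0) {
                return null;
            }
        } while (!refCount.compareAndSet(current, current + 1));
        return data;
    }

    void release() {
        if (refCount.decrementAndGet() == 0) {
            // last reference gone: this is where the underlying jar entry could be closed
        }
    }
}
```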
@gsmet Please try adding the System.out at:
System.out.println(Thread.currentThread() + " is defining " + name + " using the data: " + System.identityHashCode(data))
and at
as
System.out.println(Thread.currentThread() + " has failed to define " + name)
If, as I would expect, the data's identity hashcode is the same, I would expect 15/16 concurrent defines to fail with a linkage error.
If, instead, the data's identity hashcode is different, I need to think more about what could have happened.
Regardless of this issue, as agreed with @Sanne some time ago, I think we need some custom JFR events here to emit what the class loader is doing (can I delegate this to you, @mariofusco?). These events, together with the existing JFR class loading/defining events, should be enough both to write regression tests and eventually to collect them for perf purposes.
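For reference, a minimal sketch of what such a custom event could look like (all names are made up), to be committed around the defineClass call:

```java
import jdk.jfr.Category;
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

// Minimal sketch of a custom JFR event for class loader activity; all names are made up.
// The class loader would create one instance per definition attempt, call begin()/end()
// around defineClass, and commit() it with the outcome.
@Name("io.quarkus.bootstrap.ClassDefinition")
@Label("Runner class definition")
@Category({ "Quarkus", "Class Loading" })
class ClassDefinitionEvent extends Event {

    @Label("Class name")
    String className;

    @Label("Won the definition race")
    boolean defined;
}
```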
My proposal here @mariofusco :
This should avoid the storm of linkage errors/repeated `byte[]` reads on the same jar resource. Clearly this is a problem both on laptop machines and on big servers, where the concurrent behaviour is more evident and more likely.
I think there is still some issue with the new `RunnerClassLoader` implementation from https://github.com/quarkusio/quarkus/pull/42139.
I have already mentioned in some other work that I was under the impression something was odd, with class loading stuff being a bit more present than in 3.13. I was experimenting with another patch today and just trying to start the very simple Quarkus GitHub Bot app in prod mode.
Again I saw the ClassLoader being a bit more present than in 3.13.
I did something quite simple, I applied this patch:
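(The exact diff isn't reproduced here; in essence it printed a "Loading class:" line on every class resolution, something along the lines of the sketch below, with the precise location of the print inside the class loader being an assumption.)

```java
// Sketch of the kind of instrumentation applied (not the exact diff): print every class
// resolution attempt so the output can be sorted and counted per class name. The exact
// place where the print was added in RunnerClassLoader is an assumption.
abstract class TracingClassLoader extends ClassLoader {

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        System.out.println("Loading class: " + name);
        return super.findClass(name); // the real patch calls into the existing lookup logic instead
    }
}
```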
While most classes are loaded once, I have several classes loaded multiple times, up to 16 times for some of them:
(the rest of the classes are loaded only once)
Reverting https://github.com/quarkusio/quarkus/pull/42139 brings us back to normal AFAICS.
It doesn't sound like desired behavior to me?