quarkusio / quarkus

Quarkus: Supersonic Subatomic Java.
https://quarkus.io
Apache License 2.0

3.11.0 causes carrier thread pinning regression (compared to 3.10.2) while loading classes #40917

Open imperatorx opened 3 months ago

imperatorx commented 3 months ago

Describe the bug

reproducer below

After upgrading my app from 3.10.2 to 3.11.0 and changing nothing else, I suddenly get deadlocks. I turned on virtual thread pinning logs and indeed my carrier threads got pinned. Interestingly enough, there is no <- monitor entry in the stack, so I think this is not caused by synchronization but maybe by a native call? This might be related to the class loader changes in 3.11:

Pinned thread trace log:

Thread[#67,ForkJoinPool-1-worker-1,5,CarrierThreads]
    java.base/java.lang.VirtualThread$VThreadContinuation.onPinned(VirtualThread.java:185)
    java.base/jdk.internal.vm.Continuation.onPinned0(Continuation.java:393)
    java.base/java.lang.VirtualThread.park(VirtualThread.java:592)
    java.base/java.lang.System$2.parkVirtualThread(System.java:2639)
    java.base/jdk.internal.misc.VirtualThreads.park(VirtualThreads.java:54)
    java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:219)
    java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
    java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1079)
    java.base/java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:738)
    io.quarkus.bootstrap.runner.JarResource.readLockAcquireAndGetJarReference(JarResource.java:156)
    io.quarkus.bootstrap.runner.JarResource.getResourceData(JarResource.java:72)
    io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:105)
    io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:71)
    org.acme.GreetingResource.lambda$init$0(GreetingResource.java:32)
    java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
    java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
    java.base/java.lang.VirtualThread.run(VirtualThread.java:311)

Expected behavior

Should not pin carrier threads, as was the case before (3.10.2).

Actual behavior

Pins carrier threads. This happens in parallel in the actual application, so it ends up pinning all threads and deadlocking the application.

How to Reproduce?

Minimal reproducer:

var cl = Thread.currentThread()
        .getContextClassLoader();
System.out.println(cl.getClass());

try (var exec = Executors.newVirtualThreadPerTaskExecutor()) {
    for (int i = 0; i < 2; i++) {
        exec.submit(() -> {
            try {
                cl.loadClass("an application class name here");
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
    }
}

Output of uname -a or ver

win64

Output of java -version

build 21+35-2513

Quarkus version or git rev

3.11.0

Build tool (ie. output of mvnw --version or gradlew --version)

mvn

Additional information

No response

dmlloyd commented 3 months ago

I can only hypothesize, but my hypothesis is that perhaps the JVM has an internal lock (or more likely, an invisible native call frame) which is being acquired in between the org.acme.GreetingResource.lambda$init$0(GreetingResource.java:32) frame and the io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:71) frame.

Maybe @franz1981 has an idea?

dmlloyd commented 3 months ago

I'll try and think of some way to prove this hypothesis...

dmlloyd commented 3 months ago

Looking into the JDK source, it appears that the stack walker used to print the pinned stack trace does not include the SHOW_HIDDEN_FRAMES flag, which I think strengthens my hypothesis considerably.

franz1981 commented 3 months ago

We do have a class loader declared as thread-unsafe, so maybe it is the synchronization performed by the JVM itself that is not well-behaving here...?

quarkus-bot[bot] commented 3 months ago

/cc @cescoffier (virtual-threads), @ozangunalp (virtual-threads)

franz1981 commented 3 months ago

[screenshot: the two commits between the mentioned versions]

@dmlloyd @Sanne @geoand these seem to be the only two commits separating the two versions mentioned.

I was wrong about it being thread-unsafe, because registerAsParallelCapable() was added in those commits (together with the relevant related changes to make it possible - which I still have to learn/look at).

dmlloyd commented 3 months ago

I don't think that does much more than register the classloader's class into a map, which in turn determines which object is being synchronized on (in this case, none).
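
For reference, a minimal sketch (not the Quarkus code) of what that registration affects: for a parallel-capable loader, ClassLoader.getClassLoadingLock(name) hands out a per-class-name lock object instead of the loader instance, and a loadClass override that never synchronizes uses no class loading lock at all:

class SketchLoader extends ClassLoader {
    static {
        // Only records SketchLoader in an internal JDK registry;
        // it does not add or remove any locking by itself.
        registerAsParallelCapable();
    }

    Object classLoadingLockFor(String name) {
        // Parallel capable     -> a distinct lock object per class name.
        // Not parallel capable -> the loader instance itself (this).
        return getClassLoadingLock(name);
    }
}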

I think that we may want to revisit the idea of blocking inside the class loader (in this case, we are using a shared lock on the resource map). I believe we should change our assumptions and assume that because class loading may happen in a virtual thread, we cannot block (wait) inside of a class loading operation. (Yes I realize that loading a resource may be a blocking operation, but it is generally not a waiting operation.)

franz1981 commented 3 months ago

I don't think that does much more than register the classloader's class into a map, which in turn determines which object is being synchronized on (in this case, none).

yeah I meant that my previous statement about not being parallel capable was very wrong :P

franz1981 commented 3 months ago

re

I think that we may want to revisit the idea of blocking inside the class loader

I first need to understand what "pinning" means here, looking at https://github.com/openjdk/jdk/blob/jdk-21%2B35/src/java.base/share/classes/jdk/internal/vm/Continuation.java#L385 to match what the user is doing.

franz1981 commented 3 months ago

@dmlloyd I think that what the JDK does to detect whether pinning is happening is to check, on a (failed) preemption, the context of the frame, see https://github.com/openjdk/jdk/blob/jdk-21%2B35/src/java.base/share/classes/jdk/internal/vm/Continuation.java#L69-L71

So it would be "nice" to have the detailed pinning reason in the output...

dmlloyd commented 3 months ago

Right, I can only infer the possibility of a hidden native frame based on what's happening at the time, but we don't really know without either the pinned reason or full stack trace.

franz1981 commented 3 months ago

Exactly, and I've found this -> https://github.com/openjdk/jdk/blob/5abc02927b480a85fadecf8d03850604510276e4/src/hotspot/share/runtime/continuationFreezeThaw.cpp#L205-L207

I could debug it via gdb, if needed (poor me .-. very rusty on this)

dmlloyd commented 3 months ago

As a first try, it might be simpler to use --patch-module to patch the JDK PinnedThreadPrinter with more stack walker options.

dmlloyd commented 3 months ago

The pinned thread printer could easily be enhanced to also print the pinning reason as well. I wonder if it is worth opening a bug and a PR?

geoand commented 3 months ago

Is there a JFR event that we could use to find the reason for the pinning?

franz1981 commented 3 months ago

@geoand https://sap.github.io/SapMachine/jfrevents/21.html#virtualthreadpinned

The interesting bit is that you can configure a threshold to check whether it is a temporary pinning or not.

E.g. read/write reentrant locks in the past used to spin a bit sometimes, actually causing the thread to "pin"; IDK if this is still the case.
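
As an illustration (not something from this issue), the same event can also be consumed in-process with the JFR streaming API; the zero threshold below just surfaces even very short pinnings:

import java.time.Duration;
import jdk.jfr.consumer.RecordingStream;

public class PinnedEventWatcher {
    public static void main(String[] args) {
        try (RecordingStream rs = new RecordingStream()) {
            // Report every jdk.VirtualThreadPinned event, with its stack trace.
            rs.enable("jdk.VirtualThreadPinned")
              .withThreshold(Duration.ZERO)
              .withStackTrace();
            rs.onEvent("jdk.VirtualThreadPinned", event -> {
                System.out.println("Pinned for " + event.getDuration());
                System.out.println(event.getStackTrace());
            });
            rs.start(); // blocks; use startAsync() when embedding in an application
        }
    }
}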

franz1981 commented 3 months ago

@imperatorx what is printed with jdk.tracePinnedThreads=full?

imperatorx commented 3 months ago

@imperatorx what is printed with jdk.tracePinnedThreads=full?

full pin trace:

Thread[#74,ForkJoinPool-1-worker-1,5,CarrierThreads]
    java.base/java.lang.VirtualThread$VThreadContinuation.onPinned(VirtualThread.java:185)
    java.base/jdk.internal.vm.Continuation.onPinned0(Continuation.java:393)
    java.base/java.lang.VirtualThread.park(VirtualThread.java:592)
    java.base/java.lang.System$2.parkVirtualThread(System.java:2639)
    java.base/jdk.internal.misc.VirtualThreads.park(VirtualThreads.java:54)
    java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:219)
    java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
    java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1079)
    java.base/java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:738)
    io.quarkus.bootstrap.runner.JarResource.readLockAcquireAndGetJarReference(JarResource.java:156)
    io.quarkus.bootstrap.runner.JarResource.getResourceData(JarResource.java:72)
    io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:105)
    io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:71)
    org.acme.GreetingResource.lambda$init$3(GreetingResource.java:50)
    java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
    java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
    java.base/java.lang.VirtualThread.run(VirtualThread.java:311)

JFR stack frames printed:

{
  method = java.lang.VirtualThread.parkOnCarrierThread(boolean, long)
  lineNumber = 687
  bytecodeIndex = 116
  type = "Interpreted"
}

{
  method = java.lang.VirtualThread.park()
  lineNumber = 603
  bytecodeIndex = 200
  type = "Interpreted"
}

{
  method = java.lang.System$2.parkVirtualThread()
  lineNumber = 2639
  bytecodeIndex = 17
  type = "Interpreted"
}

{
  method = jdk.internal.misc.VirtualThreads.park()
  lineNumber = 54
  bytecodeIndex = 3
  type = "Interpreted"
}

{
  method = java.util.concurrent.locks.LockSupport.park(Object)
  lineNumber = 219
  bytecodeIndex = 16
  type = "Interpreted"
}

{
  method = java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer$Node, int, boolean, boolean, boolean, long)
  lineNumber = 754
  bytecodeIndex = 361
  type = "Interpreted"
}

{
  method = java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(int)
  lineNumber = 1079
  bytecodeIndex = 15
  type = "Interpreted"
}

{
  method = java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock()
  lineNumber = 738
  bytecodeIndex = 5
  type = "Interpreted"
}

{
  method = io.quarkus.bootstrap.runner.JarResource.readLockAcquireAndGetJarReference()
  lineNumber = 156
  bytecodeIndex = 4
  type = "Interpreted"
}

{
  method = io.quarkus.bootstrap.runner.JarResource.getResourceData(String)
  lineNumber = 72
  bytecodeIndex = 1
  type = "Interpreted"
}

{
  method = io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(String, boolean)
  lineNumber = 105
  bytecodeIndex = 159
  type = "Interpreted"
}

{
  method = io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(String)
  lineNumber = 71
  bytecodeIndex = 3
  type = "Interpreted"
}

{
  method = org.acme.GreetingResource.lambda$init$3(ClassLoader)
  lineNumber = 50
  bytecodeIndex = 1
  type = "Interpreted"
}

{
  method = org.acme.GreetingResource$$Lambda+0x0000024c211e1248.904209509.run()
  lineNumber = -1
  bytecodeIndex = 4
  type = "Interpreted"
}

{
  method = java.util.concurrent.Executors$RunnableAdapter.call()
  lineNumber = 572
  bytecodeIndex = 4
  type = "Interpreted"
}

{
  method = java.util.concurrent.FutureTask.run()
  lineNumber = 317
  bytecodeIndex = 39
  type = "Interpreted"
}

{
  method = java.lang.VirtualThread.runWith(Object, Runnable)
  lineNumber = 341
  bytecodeIndex = 5
  type = "Interpreted"
}

{
  method = java.lang.VirtualThread.run(Runnable)
  lineNumber = 311
  bytecodeIndex = 63
  type = "Interpreted"
}

{
  method = java.lang.VirtualThread$VThreadContinuation$1.run()
  lineNumber = 192
  bytecodeIndex = 8
  type = "Interpreted"
}

{
  method = jdk.internal.vm.Continuation.enter0()
  lineNumber = 320
  bytecodeIndex = 4
  type = "Interpreted"
}

{
  method = jdk.internal.vm.Continuation.enter(Continuation, boolean)
  lineNumber = 312
  bytecodeIndex = 1
  type = "Interpreted"
}

{
  method = jdk.internal.vm.Continuation.enterSpecial(Continuation, boolean, boolean)
  lineNumber = -1
  bytecodeIndex = 0
  type = "Native"
}
imperatorx commented 3 months ago

When I run it with the java 23 loom early acces JDK, the JFR event has an additional message:

pinnedReason = "Native frame or <clinit> on stack"

franz1981 commented 3 months ago

It looks to me like a reentrancy case (because I expect no other threads are holding the write lock), likely due to a write lock already being held and a read lock that cannot be acquired (hence leading to the park).

franz1981 commented 3 months ago

I've tried to reproduce this on 3.11.0 with a hello world endpoint:

@Path("/hello")
public class GreetingResource {

    private static class Fun {

    }

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public String hello() {
        var cl = Thread.currentThread()
                .getContextClassLoader();
        System.out.println(cl.getClass());
        try(var exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 2; i++) {
                exec.submit(() -> {
                    try {
                        cl.loadClass("profiling.workshop.greeting.GreetingResource$Fun");
                        System.out.println("Loaded");
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
        }
        return "hello";
    }
}

on

Java(TM) SE Runtime Environment (build 21+35-LTS-2513)
Java HotSpot(TM) 64-Bit Server VM (build 21+35-LTS-2513, mixed mode, sharing)

without much luck. I've used

curl http://localhost:8080/hello

a few times, both running the quarkus application with -Djdk.tracePinnedThreads=short and without it, without getting the same outcome as in this issue (a deadlock or any log about pinned threads).

@imperatorx Can you share the full reproducer, so I can check whether it is any different from what I've done? Did you produce a high load of requests instead?

franz1981 commented 3 months ago

Aha, found why: class loading of inner classes is not the same as loading top-level application classes .-. dumb me :D

That said, switching to

@Path("/hello")
public class GreetingResource {

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public String hello() {
        var cl = Thread.currentThread()
                .getContextClassLoader();
        System.out.println(cl.getClass());
        try(var exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 2; i++) {
                exec.submit(() -> {
                    try {
                        cl.loadClass("profiling.workshop.greeting.Fun");
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
        }
        return "hello";
    }

}

or increasing the iteration count to 10_000 still didn't cause the deadlock.

imperatorx commented 3 months ago

The deadlock happens in the actual application, with more complex stuff going on; the reproducer just shows that io.quarkus.bootstrap.runner.RunnerClassLoader pins the carrier thread in 3.11 where it did not in 3.10. The thread park is not a problem in itself, but there is a native frame or a static class initializer somewhere on the stack (not visible in the pin trace though) that is causing the pinning of the carrier thread.

franz1981 commented 3 months ago

the reproducer is just to reproduce that the io.quarkus.bootstrap.runner.RunnerClassLoader pins the carrier thread

Got it: so @dmlloyd @Sanne @geoand, from what I can see here:

I'm not quite sure about the deadlock here, but that's why I would like a reproducer that at least makes the deadlock happen, because under heavy contention lock/unlock can certainly cause some form of monopolization due to spinning.

@imperatorx Is the reproducer you posted able to produce the pinning stack trace you posted? Because that's what I cannot reproduce...

imperatorx commented 3 months ago

Yes, it prints the trace. Here is a complete reproducer project; I've reproduced it on a 16-core Windows machine: https://github.com/imperatorx/quarkus-reproducer-pin/blob/master/src/main/java/org/acme/GreetingResource.java

I've also added JFR frame logs to it.

franz1981 commented 3 months ago

From what I can see reading the existing implementation, defineClass (which is native) can subsequently trigger new class loads (making it effectively recursive) - and if the runner class loader then has to compete (against other concurrent load attempts) for the rw lock on the JarResource, it can end up waiting while a native method is on the stack... does that make sense @dmlloyd @mariofusco @geoand?

A similar thing can happen if a new class load happens within a <clinit> and hits contention on any of the existing JarResource locks.
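
As a standalone illustration of that second case (my own sketch, not Quarkus code): a class whose <clinit> has to park on a contended read lock while running on a virtual thread, which is exactly the "Native frame or <clinit> on stack" situation reported above:

import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ClinitPinningSketch {
    static final ReentrantReadWriteLock LOCK = new ReentrantReadWriteLock();

    static class Loaded {
        static {
            // Stand-in for JarResource.readLockAcquireAndGetJarReference():
            // parking here happens while <clinit> is still on the stack.
            LOCK.readLock().lock();
            LOCK.readLock().unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        LOCK.writeLock().lock(); // force the read lock taken in <clinit> to park
        try (var exec = Executors.newVirtualThreadPerTaskExecutor()) {
            exec.submit(() -> { new Loaded(); }); // triggers <clinit> on a virtual thread
            Thread.sleep(500); // run with -Djdk.tracePinnedThreads=full to see the pinned trace
            LOCK.writeLock().unlock();
        }
    }
}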

mariofusco commented 3 months ago

@franz1981 @dmlloyd @geoand I'm available to further investigate this and I will start giving a look at it immediately. If I understand correctly the pinning is caused (or at least put in evidence) by something that changed on our classloader between version 3.10.2 and 3.11.0. Do you already have any idea of which could be the problematic commit or change?

geoand commented 3 months ago

@franz1981 yeah that does sound reasonable to me.

@mariofusco cool, thanks! I am assuming that the commit that introduced the issue is 27102380b3caa867e195a2ec84c70aeac860d782

dmlloyd commented 3 months ago

The issue was always there, the commit just changes the timing of it for some reason.

geoand commented 3 months ago

Right, there should be nothing wrong with the ClassLoading algorithm itself

franz1981 commented 3 months ago

The issue was always there, the commit just changes the timing of it for some reason.

Maybe: we made "load jar file + defineClass" no longer atomic (as a single atomic op), because from my understanding the changes in the class loader no longer make loadClass a "fully locked op".

mariofusco commented 3 months ago

Not sure if I'm looking at a red herring, but for now the only thing that I can confirm is that not configuring the RunnerClassLoader with registerAsParallelCapable() prevents the pinning. Of course I'm still digging into this, trying to fully understand the root cause of the problem, but wouldn't it be safer to send a PR that, at least temporarily, simply removes that statement?

geoand commented 3 months ago

We can certainly have that as a backup option

dmlloyd commented 3 months ago

The issue was always there, the commit just changes the timing of it for some reason.

Maybe: we made "load jar file + defineClass" no longer atomic (as a single atomic op), because from my understanding the changes in the class loader no longer make loadClass a "fully locked op".

Yeah, this makes sense: if the lock granularity of class loading is exactly the same as the lock granularity of the shared JAR lock, then the shared JAR lock would only ever be taken when the class loading lock is also taken (during class loading anyway). In this case we never actually needed the shared lock, as we could have just used the class loading lock instead for an exactly identical result.

The bug to fix remains unchanged though: we should not have a lock inside the class loading code path. It should be atomic in some other (non-blocking) way.
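
To make the idea concrete, a hypothetical sketch (an illustration only, not a real implementation) of publishing the open jar through an AtomicReference so that nothing on the class loading path ever parks:

import java.io.IOException;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicReference;
import java.util.jar.JarFile;

final class NonBlockingJarHolder {
    private final Path jarPath;
    private final AtomicReference<JarFile> ref = new AtomicReference<>();

    NonBlockingJarHolder(Path jarPath) {
        this.jarPath = jarPath;
    }

    JarFile get() throws IOException {
        while (true) {
            JarFile existing = ref.get();
            if (existing != null) {
                return existing;
            }
            JarFile fresh = new JarFile(jarPath.toFile());
            if (ref.compareAndSet(null, fresh)) {
                return fresh;  // this thread published the jar, no waiting involved
            }
            fresh.close();     // lost the race: discard the duplicate and retry
        }
    }

    // The genuinely hard part is left out: a real implementation still has to
    // make sure no reader returned by get() is using the JarFile when it is closed.
    void close() throws IOException {
        JarFile existing = ref.getAndSet(null);
        if (existing != null) {
            existing.close();
        }
    }
}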

mariofusco commented 3 months ago

The bug to fix remains unchanged though: we should not have a lock inside the class loading code path. It should be atomic in some other (non-blocking) way.

I agree and tried to implement this same idea here. Please give it a look.

geoand commented 3 months ago

Taking a step back for a bit, here is a basic example of the layout of the fast-jar (which uses RunnerClassLoader and JarResource):

[screenshot: example layout of a fast-jar application]

The reason we have the locks on JarResource is to ensure that we don't keep too many jars open, which would increase the memory footprint.

One thing I have thought of in the past is to get rid of the jars entirely and just use directories which correspond to the jars and would be "exploded" when the jar is built. This intuitively seems to me to overcome a lot of the issues we are facing (including this one); however, I would like to invite folks to challenge this idea for potential drawbacks. Furthermore, even if we do decide it makes sense, it's not something we want to implement in a micro release, as it requires various non-trivial changes.

dmlloyd commented 3 months ago

How much memory footprint are we actually talking about? Does it matter in practice?

geoand commented 3 months ago

@Sanne might remember as he is the one that added the ability to close jars and keep a "live set" of them open

geoand commented 3 months ago

Just as an FYI, https://github.com/quarkusio/quarkus/pull/41023 reverts the change in 3.11

Sanne commented 3 months ago

@Sanne might remember as he is the one that added the ability to close jars and keep a "live set" of them open

Essentially what I had noticed is that keeping a jar open implies keeping a ZipStream open (obviously), and to allow random access within such a stream, the whole header section of the zip stream is kept in memory. Add to this the file handles... For a system in which we can assume that, after bootstrap+warmup, class loading events become rare sightings, I'd favour releasing memory over class-loading latency.

I don't remember the exact data amounts - it certainly did scale with the number of jars. And even if I did remember, I wouldn't trust such details as they tend to change with JVM improvements. I'd wager it might be better nowadays, however the principle still applies IMO .. we don't need to keep them open.

I'm also not convinced that we need parallel classloading capabilities as these are "just" the application classes, and mostly loaded by a single bootstrap thread. Did that change actually manifest in tangible benefits?

dmlloyd commented 3 months ago

I seem to recall that we did notice a performance difference. Also there's the ever-present risk of ABA deadlocks in the event that there is any kind of circularity happening when locking is in use.

I think that we could definitely look at a different release strategy (time based for example). I don't think the tradeoff is between latency and size though. If properly designed, most class loading operations should have low latency and high throughput, while still being able to release unused resources. But it does need to be designed in such a way that deadlocks and virtual thread stalls (like this one) are not possible.

geoand commented 3 months ago

Coming back to what I said a few days ago, wouldn't using directories instead of jars overcome all these issues?

Sanne commented 3 months ago

@geoand yes, I like your idea. Worth exploring for sure.

FWIW when I was an early JBoss user I always deployed my code like that, over rsync. Much faster than packing things in WARs & EARs and uploading them 😉 That was JBoss AS 4.2 in ~2007. It supported "exploded archives" already.

franz1981 commented 3 months ago

@Sanne

I'm also not convinced that we need parallel classloading capabilities as these are "just" the application classes, and mostly loaded by a single bootstrap thread

We don't need it, indeed: we had no reports that it was required. The only reason it happened, IIRC, was to "fix" the issue mentioned at https://github.com/smallrye/smallrye-context-propagation/pull/443#issuecomment-2042560065 - but that was already fixed by https://github.com/quarkusio/quarkus/pull/39988

In short: not necessary in its current form.

mariofusco commented 3 months ago

I will try to recap the current situation:

  1. At the moment there is no evidence that we would gain any relevant advantage from adding parallel capabilities to this classloader.
  2. Even though we could probably make it work, the proposed lock-free implementation necessary for virtual threads compatibility is quite hard to understand and validate ( ... and I'm the one who wrote it :smile: ) and could cause other maintenance problems in the future.
  3. We agree that there is a more straightforward solution consisting of getting rid of the jars and exploding them into directories.

If my premises are all correct what I suggest is:

  1. Closing my pull request ... 3 wasted working days, but at least I learnt something, and I hope that @Sanne will offer me a beer to recover from this when we meet in London :)
  2. Also reverting on main the commit that flagged the classloader as parallel capable, like we already did on branch 3.11.
  3. Closing this issue accordingly.
  4. Opening a feature request to implement the directory-based solution suggested by @geoand (I'm available to work on it).
geoand commented 3 months ago

We agree that there is a more straightforward solution consisting of getting rid of the jars and exploding them into directories.

I would still like people to try and poke holes in this idea... I personally don't see why it would not work better, but others might :).

Also, this solution would definitely be more involved (in terms of how many things need to be changed) than fixes such as this one.

mariofusco commented 3 months ago

We agree that there is a more straightforward solution consisting of getting rid of the jars and exploding them into directories.

I would still like people to try and poke holes in this idea... I personally don't see why it would not work better, but others might :).

Also, this solution would definitely be more involved (in terms of how many things need to be changed) than fixes such as this one.

Yes, we need to further validate this idea and check the implications, but it is only the (optional) point 4 of the actions I suggested, and I think points 1 to 3 could be taken regardless of it.

I'm also not convinced that we need parallel classloading capabilities as these are "just" the application classes, and mostly loaded by a single bootstrap thread. Did that change actually manifest in tangible benefits?

@Sanne In reality this is for sure the main point of this whole discussion. The first thing that we need to address is figuring out whether we need a parallel classloader or not. See premise 1 of my former comment. I had a quick chat with @franz1981 this morning about it, and he thinks that, regardless of the smallrye fix for which the parallel classloader was introduced, there could be other relevant use cases where it could be necessary, or at least a nice to have. Can you please help us double-check this? If @franz1981 is right, and since @geoand's suggestion is a more long-term solution, the only alternative would be finalizing my pull request, making it as good as possible, and merging it.

Sanne commented 3 months ago

Also there's the ever-present risk of ABA deadlocks in the event that there is any kind of circularity happening when locking is in use.

@dmlloyd this concerns me, and I believe you mentioned this in the past as well - I'm not aware of us getting into potential deadlocks with the existing code, but that's more likely due to me not having understood the scenario you have in mind.

Do you think you could describe a potential scenario in such a way that someone here might try to create some tests for it?

FWIW when I last reviewed the design (and it's been a while - I have not been able to look carefully at the current state) we only had reentrant locks protecting resource state changes, locks would not be held for longer than such state transitions and (I believe) not held during operations that could trigger loading of other classes - at least not from a different thread, which would have been necessary to get in trouble with the reentrant locks.

vsevel commented 3 months ago

The app gets blocked after we see Thread[#66,ForkJoinPool-1-worker-2,5,CarrierThreads] java.base/java.lang.VirtualThread$VThreadContinuation.onPinned(VirtualThread.java:183). We have 53 threads saying: waiting for quarkus-virtual-thread-17@16977 to release lock on <0x4265> (a io.quarkus.bootstrap.runner.RunnerClassLoader). The liveness probe does not answer, and we see timeouts on the Kafka consumer poll. Some examples of blocked threads:

"micrometer-kafka-metrics@16921" tid=0x64 nid=NA waiting for monitor entry
  java.lang.Thread.State: BLOCKED
              blocks micrometer-kafka-metrics@16901
              blocks micrometer-kafka-metrics@16912
              blocks micrometer-kafka-metrics@16918
              blocks micrometer-kafka-metrics@16924
              blocks micrometer-kafka-metrics@16927
              blocks micrometer-kafka-metrics@16930
              waiting for quarkus-virtual-thread-17@16977 to release lock on <0x4265> (a io.quarkus.bootstrap.runner.RunnerClassLoader)
                at io.micrometer.core.instrument.Tags$ArrayIterator.<init>(Tags.java:131)
                at io.micrometer.core.instrument.Tags.iterator(Tags.java:128)
                at io.micrometer.core.instrument.Tags.of(Tags.java:224)
...

or

"smallrye-kafka-consumer-thread-2@16897" tid=0x4b nid=NA waiting for monitor entry
  java.lang.Thread.State: BLOCKED
              waiting for quarkus-virtual-thread-17@16977 to release lock on <0x4265> (a io.quarkus.bootstrap.runner.RunnerClassLoader)
                at io.smallrye.mutiny.groups.UniOnItem.delayIt(UniOnItem.java:188)
                at io.smallrye.mutiny.helpers.ExponentialBackoff.lambda$randomExponentialBackoffFunction$0(ExponentialBackoff.java:48)
                at io.smallrye.mutiny.helpers.ExponentialBackoff$$Lambda/0x00007fd03090fa88.apply(Unknown Source:-1)
                at io.smallrye.context.impl.wrappers.SlowContextualFunction.apply(SlowContextualFunction.java:21)
                at io.smallrye.mutiny.groups.MultiOnItem.lambda$transformToUni$6(MultiOnItem.java:268)

or

"executor-thread-1@16867" tid=0x1f nid=NA waiting for monitor entry
  java.lang.Thread.State: BLOCKED
              waiting for quarkus-virtual-thread-17@16977 to release lock on <0x4265> (a io.quarkus.bootstrap.runner.RunnerClassLoader)
                at com.x.HealthCheckService_Bean.proxy(Unknown Source:-1)
                at com.x.HealthCheckService_Bean.get(Unknown Source:-1)
                at com.x.HealthCheckService_Bean.get(Unknown Source:-1)
                at io.quarkus.arc.impl.ArcContainerImpl.beanInstanceHandle(ArcContainerImpl.java:554)
                at io.quarkus.arc.impl.ArcContainerImpl.beanInstanceHandle(ArcContainerImpl.java:534)
                at io.quarkus.arc.impl.ArcContainerImpl.beanInstanceHandle(ArcContainerImpl.java:567)
                at io.quarkus.arc.impl.ArcContainerImpl.instance(ArcContainerImpl.java:339)
                at x.HealthCheckService_ScheduledInvoker_autoCheckAll_01fef9f6dc57a56c94b1934cb946de439403b427.invokeBean(Unknown Source:-1)

We have to wait for the liveness probe to kill the pod before processing can restart. That is harsh.

mariofusco commented 3 months ago

The app gets blocked after we see Thread[#66,ForkJoinPool-1-worker-2,5,CarrierThreads] java.base/java.lang.VirtualThread$VThreadContinuation.onPinned(VirtualThread.java:183). We have 53 threads saying: waiting for quarkus-virtual-thread-17@16977 to release lock on <0x4265> (a io.quarkus.bootstrap.runner.RunnerClassLoader). The liveness probe does not answer, and we see timeouts on the Kafka consumer poll.

This seems to be definitely a bug of the parallel RunnerClassLoader implementation, and indeed we (temporarily?) reverted it to the original sequential version. What isn't clear to me at the moment is why this is happening only with virtual threads: if there is a potential deadlock condition, it should probably happen regardless of the kind of threads. The only (admittedly partial and not strong enough) explanation that I can see at the moment is that with virtual threads we have a higher degree of parallelism, thus making the problem more evident and easier to reproduce.

About the reproducer: it would be great if you could provide one, or at least give us some more hints on how to create one on our own, for instance by providing the full stack traces of the Quarkus application when the deadlock is detected. I will try to write a proper unit test for the parallel RunnerClassLoader anyway (see the sketch below); otherwise any potential fix would be a blind one.
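
As a rough illustration of what such a stress test could look like (placeholder class names and loader wiring, not the actual test):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hammer one class loader from many virtual threads and fail if the loads
// don't all complete, which is how the reported deadlock (or pinning-induced
// starvation) would show up.
class ParallelClassLoadStressTest {

    static void stress(ClassLoader cl, List<String> classNames) throws Exception {
        try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<Class<?>>> results = new ArrayList<>();
            for (int i = 0; i < 10_000; i++) {
                String name = classNames.get(i % classNames.size());
                results.add(exec.submit(() -> cl.loadClass(name)));
            }
            for (Future<Class<?>> f : results) {
                f.get(30, TimeUnit.SECONDS); // a timeout here means the loads are stuck
            }
        }
    }
}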

As a final note, these days I'm working on an alternative lock-free implementation of the parallel RunnerClassLoader. @vsevel it would be great if you could have a chance to try that pull request in your environment and tell us whether it solves the problem and makes the deadlocks disappear or not.