imperatorx opened this issue 5 months ago
it would be great if you could provide one
not likely. when it happens, it is always close to startup (10 seconds after the app has started). the app consumes messages at a decent rate (25 to 50 messages per second) with virtual thread max-concurrency: 5.
full stack traces of the Quarkus application when the deadlock is detected.
a thread dump you mean?
it would be great if you could have a chance to try that pull request in your environment and tell us whether it solves the problem and makes the deadlocks disappear or not.
sure. no problem.
full stack traces of the Quarkus application when the deadlock is detected.
a thread dump you mean?
Yes sorry, I meant a complete thread dump of the application when the deadlock happens.
Thanks a lot for that thread dump. I must admit that it looks quite weird though and I'm still trying to make sense of it. What I mean is that I see 53 threads saying
java.lang.Thread.State: BLOCKED
waiting for quarkus-virtual-thread-17@16977 to release lock on <0x4265> (a io.quarkus.bootstrap.runner.RunnerClassLoader)
but at the same time I don't see any trace of that virtual thread quarkus-virtual-thread-17@16977 anywhere in the dump, nor is there any other thread, virtual or not, with a stack reporting something like locked <0x4265>, so I still have no clue what is holding that lock or why it is not releasing it.
@vsevel for now I will continue my investigations, mostly trying to reproduce the same problem with a more isolated test case. Please let me know when you have a chance to try my pull request and how it works for you.
but at the same time I don't see any trace of that virtual thread quarkus-virtual-thread-17@16977 anywhere in the dump, nor is there any other thread, virtual or not, with a stack reporting something like locked <0x4265>, so I still have no clue what is holding that lock or why it is not releasing it.
same for me. I could not find the VT holding the lock. we created the dump by connecting with the remote debugger from intellij into our ocp cluster using port forward, and executing Run > Debugging Actions > Get Thread Dump. I do not know if this dump has limited visibility.
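As a hedged aside: a debugger/IDE-style dump may not list all virtual threads, especially unmounted ones. On JDK 21 a dump that does include them can be produced with jcmd <pid> Thread.dump_to_file -format=json <file>, or programmatically as in the sketch below (the class name and output path are made up for illustration):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of this issue: writes a JSON thread dump
// that lists virtual threads as well as platform threads (JDK 21+).
public class FullThreadDump {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // The target file must not already exist; the JSON output groups
        // virtual threads under their scheduler.
        diagnostics.dumpThreads("/tmp/quarkus-threads.json",
                HotSpotDiagnosticMXBean.ThreadDumpFormat.JSON);
    }
}
```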
Please let me know when you have a chance to try my pull request and how it works for you.
where is it?
FYI I also created https://github.com/quarkusio/quarkus/issues/41313. This is not related to class loading, but I need help investigating it.
Please let me know when you have a chance to try my pull request and how it works for you.
where is it?
https://github.com/quarkusio/quarkus/pull/40942
FYI I also created #41313. This is not related to class loading, but I need help investigating it.
I will take a look asap, but I would like to make sense of this issue first.
Sadly the thread dump analysis could be unfruitful due to how RW locks work, see https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/util/concurrent/locks/ReentrantReadWriteLock.java#L336
It stores the first uncontended reader thread, but only for performance reasons; there is no concept of "read ownership", because a read lock, unlike a write lock, has no exclusive owner. This means a thread dump cannot print the read-lock holders that are preventing writers (and anyone queued behind them) from being unblocked. (Remember that no write lock can be acquired while a read lock is already held.) I believe that's the deadlock happening here.
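To make that concrete, here is a tiny standalone sketch (not Quarkus code; the class and thread names are made up) showing why a dump cannot identify the reader that is blocking a writer:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadLockOwnershipSketch {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

        Thread reader = new Thread(() -> {
            rwLock.readLock().lock(); // never released, simulating a stuck reader
            try {
                Thread.sleep(Long.MAX_VALUE);
            } catch (InterruptedException ignored) {
            }
        }, "stuck-reader");
        reader.start();
        Thread.sleep(200); // let the reader grab the read lock first

        Thread writer = new Thread(() -> {
            // Blocks forever: no write lock can be acquired while a read lock is held.
            // A thread dump shows this thread parked in AbstractQueuedSynchronizer,
            // but the reader's stack contains no "locked <...>" entry pointing at it.
            rwLock.writeLock().lock();
        }, "blocked-writer");
        writer.start();

        // Intentionally left hanging: take a thread dump now and try to find the read-lock owner.
    }
}
```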
Please let me know when you have a chance to try my pull request and how it works for you.
where is it?
#40942
@vsevel Did you have a chance to try my pull request? I'm still trying to reproduce the problem in isolation, but no luck so far :(
Did you have a chance to try my pull request? I'm still trying to reproduce the problem in isolation, but no luck so far :(
I will spend some time on it over the next 2 days
Thanks for that link!
but class loading and initialization is something that needs to be re-examined next
I assume Alan means from the JDK folks?
Yep, but that also means that using reentrant locks while in a native call (i.e. defineClass) "maybe" won't be a problem in subsequent releases, just as with synchronized blocks.
That means I won't worry about the pinning event, but only about fixing the deadlocking issue.
After a call with @mariofusco we found how to trigger it reliably: there's no need to call <ClassLoader>::loadClass directly (that alone is not seen as pinning by the JVM); the real reason pinning happens is that we have native stack frames on the carrier, e.g. Class::forName or just NameOfTheClass.class.getName().
That's a very interesting find...
Any late linkage will cause it; for example, calling a method that returns a Foo will cause Foo to be loaded if it isn't already loaded, plus the corresponding hidden native frame.
Now the funny part is the deadlock: if waiting on a lock pins the carrier, we just need some recursive class definitions to enter the locks in the right order while exhausting the number of carriers, and it's done :P
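A minimal standalone sketch of that mechanism, assuming JDK 21 (the SlowLoader below is made up and is not the Quarkus RunnerClassLoader): Class.forName puts a native frame on the virtual thread's stack, so when the loader then parks on a j.u.c. lock the thread cannot unmount and the carrier stays pinned; running with -Djdk.tracePinnedThreads=full should report the pinned park, without any <- monitor frame involved.

```java
import java.util.concurrent.locks.ReentrantLock;

public class PinnedClassLoadSketch {

    // Made-up loader: serializes loads with a ReentrantLock, like a lock-guarded classloader would.
    static class SlowLoader extends ClassLoader {
        private final ReentrantLock lock = new ReentrantLock();

        SlowLoader() {
            super(SlowLoader.class.getClassLoader());
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            lock.lock(); // the second virtual thread parks here, beneath the native Class.forName0 frame -> pinned
            try {
                Thread.sleep(1_000); // pretend defining the class is slow; this park is pinned too
                return super.loadClass(name, resolve);
            } catch (InterruptedException e) {
                throw new ClassNotFoundException(name, e);
            } finally {
                lock.unlock();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        SlowLoader loader = new SlowLoader();
        Runnable task = () -> {
            try {
                // Class.forName enters native code before delegating to SlowLoader.loadClass
                Class.forName("java.util.UUID", true, loader);
            } catch (ClassNotFoundException ignored) {
            }
        };
        Thread t1 = Thread.ofVirtual().name("vt-1").start(task);
        Thread t2 = Thread.ofVirtual().name("vt-2").start(task);
        t1.join();
        t2.join();
    }
}
```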
we were able to spend some time on this issue.
we confirm @mariofusco that with your branch we have not been able to reproduce the deadlock across 8 runs. but we still see the pinning issue when going through RunnerClassLoader. I understand it is more of an issue inside the OpenJDK. correct?
just to make sure, we reverted the image to the one built with 3.8, and on the first run we got some pinning + deadlock.
we were able to spend some time on this issue. we confirm @mariofusco that with your branch we have not been able to reproduce the deadlock across 8 runs.
This is great news. Thanks for giving it a try.
but we still see the pinning issue when going through RunnerClassLoader. I understand it is more of an issue inside the OpenJDK. correct?
Yes, this is not completely related even though still unexpected. Can you please paste the stack trace of the pinning using my classloader?
just to make sure, we reverted the image to the one built with 3.8, and on the first run we got some pinning + deadlock.
This is also unexpected to be honest, or probably I'm missing something. If you reverted to 3.8 it means that you're using the RunnerClassLoader when it was still sequential, in other words well before it was registered as parallel capable in version 3.11.0. Do you mean that the sequential version of the old classloader can also suffer from deadlocks? If so, the revert we did recently to that sequential version is practically useless. Can you please double check?
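For readers following along, a minimal sketch (illustration only, not the Quarkus code) of what being registered as parallel capable changes:

```java
// Hypothetical loader for illustration.
public class ParallelCapableLoader extends ClassLoader {
    static {
        // Without this call, ClassLoader.getClassLoadingLock(name) returns `this`,
        // so concurrent loadClass calls serialize on a single monitor (the
        // "sequential" behaviour discussed above); with it, locking is per class name.
        registerAsParallelCapable();
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        // With parallel capability, two threads loading *different* classes can
        // reach this point concurrently, each holding its own per-name lock.
        throw new ClassNotFoundException(name);
    }
}
```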
Can you please paste the stack trace of the pinning using my classloader?
I believe it is this issue: https://github.com/quarkusio/quarkus/issues/41314
can you please double check?
let me check with the dev team.
Can you please paste the stack trace of the pinning using my classloader?
I believe it is this issue: #41314
Ok, this makes sense now: it is a totally different pinning problem, completely unrelated to the classloader.
can you please double check?
short answer (before we re-test) is: I am definitely sure, because they always stayed on the LTS, and the behavior they were seeing is the stack trace with the runner classloader that we searched for across the quarkus issues. and that is how we found this current issue.
I finally recreated in isolation the virtual thread pinning condition originally reported here.
Given all that has been discussed so far and the feedback provided by @vsevel, I don't see any reason to further hold this pull request, which removes the read/write lock from the RunnerClassLoader.
If so, the revert we did recently to that sequential version is practically useless
I can confirm it is not useless, it fixed the original issue.
we tested it again and were able to reproduce it:
2024-06-26 09:28:18,231 INFO [io.quarkus] (main) {} x 1.1.6-SNAPSHOT on JVM (powered by Quarkus 3.8.4) started in 6.142s. Listening on: http://0.0.0.0:8080 ...
...
Thread[#80,ForkJoinPool-1-worker-2,5,CarrierThreads]
java.base/java.lang.VirtualThread$VThreadContinuation.onPinned(VirtualThread.java:183)
java.base/jdk.internal.vm.Continuation.onPinned0(Continuation.java:393)
java.base/java.lang.VirtualThread.park(VirtualThread.java:582)
java.base/java.lang.System$2.parkVirtualThread(System.java:2643)
java.base/jdk.internal.misc.VirtualThreads.park(VirtualThreads.java:54)
java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:219)
java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1079)
java.base/java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:738)
io.quarkus.bootstrap.runner.JarResource.readLockAcquireAndGetJarReference(JarResource.java:143)
io.quarkus.bootstrap.runner.JarResource.getResourceData(JarResource.java:72)
io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:99)
io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:65)
io.quarkus.cache.runtime.CacheInterceptor.getArcCacheInterceptionContext(CacheInterceptor.java:82)
io.quarkus.cache.runtime.CacheInterceptor.getInterceptionContext(CacheInterceptor.java:61)
io.quarkus.cache.runtime.CacheResultInterceptor.intercept(CacheResultInterceptor.java:43)
io.quarkus.cache.runtime.CacheResultInterceptor_Bean.intercept(Unknown Source)
io.quarkus.arc.impl.InterceptorInvocation.invoke(InterceptorInvocation.java:42)
io.quarkus.arc.impl.AroundInvokeInvocationContext.perform(AroundInvokeInvocationContext.java:30)
...
the thread dump shows multiple stacks (22 threads blocked) such as:
"vert.x-eventloop-thread-0@16949" tid=0x1a nid=NA waiting for monitor entry
java.lang.Thread.State: BLOCKED
waiting for quarkus-virtual-thread-16@17044 to release lock on <0x4293> (a io.quarkus.bootstrap.runner.RunnerClassLoader)
at io.vertx.core.net.impl.pool.SimpleConnectionPool$Evict.execute(SimpleConnectionPool.java:537)
at io.vertx.core.net.impl.pool.SimpleConnectionPool$Evict.execute(SimpleConnectionPool.java:500)
at io.vertx.core.net.impl.pool.CombinerExecutor.submit(CombinerExecutor.java:67)
at io.vertx.core.net.impl.pool.SimpleConnectionPool.execute(SimpleConnectionPool.java:244)
at io.vertx.core.net.impl.pool.SimpleConnectionPool.evict(SimpleConnectionPool.java:549)
at io.vertx.core.http.impl.SharedClientHttpStreamEndpoint.checkExpired(SharedClientHttpStreamEndpoint.java:115)
at io.vertx.core.http.impl.HttpClientImpl.lambda$static$1(HttpClientImpl.java:106)
at io.vertx.core.http.impl.HttpClientImpl$$Lambda/0x00007fd358434700.accept(Unknown Source:-1)
at java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4783)
at io.vertx.core.net.impl.pool.ConnectionManager.forEach(ConnectionManager.java:33)
at io.vertx.core.http.impl.HttpClientImpl.checkExpired(HttpClientImpl.java:150)
at io.vertx.core.http.impl.HttpClientImpl.access$100(HttpClientImpl.java:48)
at io.vertx.core.http.impl.HttpClientImpl$PoolChecker.handle(HttpClientImpl.java:144)
at io.vertx.core.http.impl.HttpClientImpl$PoolChecker.handle(HttpClientImpl.java:132)
at io.vertx.core.impl.VertxImpl$InternalTimerHandler.handle(VertxImpl.java:1056)
at io.vertx.core.impl.VertxImpl$InternalTimerHandler.handle(VertxImpl.java:1027)
at io.vertx.core.impl.ContextImpl.emit(ContextImpl.java:328)
at io.vertx.core.impl.ContextImpl.emit(ContextImpl.java:321)
at io.vertx.core.impl.ContextInternal.emit(ContextInternal.java:209)
at io.vertx.core.impl.VertxImpl$InternalTimerHandler.run(VertxImpl.java:1045)
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.runWith(Thread.java:1596)
at java.lang.Thread.run(Thread.java:1583)
but I can't see quarkus-virtual-thread-16@17044 anywhere.
we did some additional tests with your branch, hoping to capture a full thread dump when detecting the pinning. unfortunately (for this test) we did not reproduce the pinning with your fix.
Describe the bug
reproducer below
After upgrading my app to 3.11.0 from 3.10.2 and changing nothing else, suddenly I get deadlocks. I turned on virtual thread pinning logs and indeed my carrier threads got pinned. Interestingly enough, there is no <- monitor in the stack, so I think this is not caused by synchronization but maybe by a native call? This might be related to the class loader changes in 3.11.
Expected behavior
Should not pin carrier threads, as before.
Actual behavior
Pins carrier threads. This happens in parallel in the actual application, so it ends up pinning all threads and deadlocking the application.
How to Reproduce?
Minimal reproducer:
Output of uname -a or ver
win64
Output of java -version
build 21+35-2513
Quarkus version or git rev
3.11.0
Build tool (i.e. output of mvnw --version or gradlew --version)
mvn
Additional information
No response