imperatorx opened this issue 5 months ago
it would be great if you could provide one
not likely. when it happens, it is always close to startup (10 seconds after the app has started). the app consumes messages at a decent rate (25 to 50 messages per second) with virtual thread max-concurrency: 5.
full stack traces of the Quarkus application when the deadlock is detected.
a thread dump you mean?
it would be great if you could have a chance to try that pull request in your environment and tell us whether it solves the problem and makes the deadlocks disappear or not.
sure. no problem.
full stack traces of the Quarkus application when the deadlock is detected.
a thread dump you mean?
Yes sorry, I meant a complete thread dump of the application when the deadlock happens.
Thanks a lot for that thread dump. I must admit that it looks quite weird though and I'm still trying to make sense of it. What I mean is that I see 53 threads saying
java.lang.Thread.State: BLOCKED
waiting for quarkus-virtual-thread-17@16977 to release lock on <0x4265> (a io.quarkus.bootstrap.runner.RunnerClassLoader)
but at the same time I don't see any trace of that virtual thread quarkus-virtual-thread-17@16977 anywhere in the dump, nor is there any other thread, virtual or not, with a stack reporting something like locked <0x4265>, so I still have no clue what is holding that lock or why it is not releasing it.
@vsevel for now I will continue my investigations, mostly trying to reproduce the same problem with a more isolated test case. Please let me know when you have a chance to try my pull request and how it works for you.
but at the same time I don't see any trace of that virtual thread quarkus-virtual-thread-17@16977 anywhere in the dump, nor is there any other thread, virtual or not, with a stack reporting something like locked <0x4265>, so I still have no clue what is holding that lock or why it is not releasing it.
same for me. I could not find the VT holding the lock. we created the dump by connecting with the remote debugger from intellij into our ocp cluster using port forward, and executing Run > Debugging Actions > Get Thread Dump. I do not know if this dump has limited visibility.
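As a hedged aside: a debugger/IDE-style dump may not list all virtual threads, especially unmounted ones. On JDK 21 a dump that does include them can be produced with jcmd <pid> Thread.dump_to_file -format=json <file>, or programmatically as in the sketch below (the class name and output path are made up for illustration):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of this issue: writes a JSON thread dump
// that lists virtual threads as well as platform threads (JDK 21+).
public class FullThreadDump {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // The target file must not already exist; the JSON output groups
        // virtual threads under their scheduler.
        diagnostics.dumpThreads("/tmp/quarkus-threads.json",
                HotSpotDiagnosticMXBean.ThreadDumpFormat.JSON);
    }
}
```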
Please let me know when you have a chance to try my pull request and how it works for you.
where is it?
FYI I also created https://github.com/quarkusio/quarkus/issues/41313. This is not related to class loading, but I need help investigating it.
Please let me know when you have a chance to try my pull request and how it works for you.
where is it?
https://github.com/quarkusio/quarkus/pull/40942
FYI I also created #41313. This is not related to class loading, but I need help investigating it.
I will take a look asap, but I would like to make sense of this issue first.
Sadly the thread dump analysis could be unfruitful due to how RW locks work, see https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/util/concurrent/locks/ReentrantReadWriteLock.java#L336
It stores the first uncontended reader thread, but only for performance reasons; there is no concept of "read ownership", because a read lock, unlike a write lock, has no exclusive owner. This means a thread dump cannot print the read-lock holders that are preventing writers (and anyone queued behind them) from being unblocked. (Remember that no write lock can be acquired while a read lock is already held.) I believe that's the deadlock happening here.
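To make that concrete, here is a tiny standalone sketch (not Quarkus code; the class and thread names are made up) showing why a dump cannot identify the reader that is blocking a writer:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadLockOwnershipSketch {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

        Thread reader = new Thread(() -> {
            rwLock.readLock().lock(); // never released, simulating a stuck reader
            try {
                Thread.sleep(Long.MAX_VALUE);
            } catch (InterruptedException ignored) {
            }
        }, "stuck-reader");
        reader.start();
        Thread.sleep(200); // let the reader grab the read lock first

        Thread writer = new Thread(() -> {
            // Blocks forever: no write lock can be acquired while a read lock is held.
            // A thread dump shows this thread parked in AbstractQueuedSynchronizer,
            // but the reader's stack contains no "locked <...>" entry pointing at it.
            rwLock.writeLock().lock();
        }, "blocked-writer");
        writer.start();

        // Intentionally left hanging: take a thread dump now and try to find the read-lock owner.
    }
}
```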
Please let me know when you have a chance to try my pull request and how it works for you.
where is it?
#40942
@vsevel Did you have a chance to try my pull request? I'm still trying to reproduce the problem in isolation, but no luck so far :(
Did you have a chance to try my pull request? I'm still trying to reproduce the problem in isolation, but no luck so far :(
I will spend some time on it over the next 2 days
Thanks for that link!
but class loading and initialization is something that needs to be re-examined next
I assume Alan means from the JDK folks?
Yep, but that also means that using reentrant locks while in a native call (i.e. defineClass) "maybe" won't be a problem in subsequent releases, just as with synchronized blocks.
That means I won't worry about the pinning event, but only about fixing the deadlocking issue.
After a call with @mariofusco we found how to trigger it reliably: there's no need to call <ClassLoader>::loadClass directly (that alone is not seen as pinning by the JVM); the real reason pinning happens is that we have native stack frames on the carrier, e.g. Class::forName or just NameOfTheClass.class.getName().
That's a very interesting find...
Any late linkage will cause it; for example, calling a method that returns a Foo will cause Foo to be loaded if it isn't already loaded, plus the corresponding hidden native frame.
Now the funny part is the deadlock: if waiting on a lock pins the carrier, we just need some recursive class definitions to enter the locks in the right order while exhausting the number of carriers, and it's done :P
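A minimal standalone sketch of that mechanism, assuming JDK 21 (the SlowLoader below is made up and is not the Quarkus RunnerClassLoader): Class.forName puts a native frame on the virtual thread's stack, so when the loader then parks on a j.u.c. lock the thread cannot unmount and the carrier stays pinned; running with -Djdk.tracePinnedThreads=full should report the pinned park, without any <- monitor frame involved.

```java
import java.util.concurrent.locks.ReentrantLock;

public class PinnedClassLoadSketch {

    // Made-up loader: serializes loads with a ReentrantLock, like a lock-guarded classloader would.
    static class SlowLoader extends ClassLoader {
        private final ReentrantLock lock = new ReentrantLock();

        SlowLoader() {
            super(SlowLoader.class.getClassLoader());
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            lock.lock(); // the second virtual thread parks here, beneath the native Class.forName0 frame -> pinned
            try {
                Thread.sleep(1_000); // pretend defining the class is slow; this park is pinned too
                return super.loadClass(name, resolve);
            } catch (InterruptedException e) {
                throw new ClassNotFoundException(name, e);
            } finally {
                lock.unlock();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        SlowLoader loader = new SlowLoader();
        Runnable task = () -> {
            try {
                // Class.forName enters native code before delegating to SlowLoader.loadClass
                Class.forName("java.util.UUID", true, loader);
            } catch (ClassNotFoundException ignored) {
            }
        };
        Thread t1 = Thread.ofVirtual().name("vt-1").start(task);
        Thread t2 = Thread.ofVirtual().name("vt-2").start(task);
        t1.join();
        t2.join();
    }
}
```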
we were able to spend some time on this issue.
we confirm @mariofusco that with your branch we have not been able to reproduce the deadlock across 8 runs. but we still see the pinning issue when going through RunnerClassLoader. I understand it is more of an issue inside the OpenJDK. correct?
just to make sure, we reverted the image to the one built with 3.8, and on the first run we got some pinning + deadlock.
we were able to spend some time on this issue. we confirm @mariofusco that with your branch we have not been able to reproduce the deadlock across 8 runs.
This is great news. Thanks for giving it a try.
but we still see the pinning issue when going through RunnerClassLoader. I understand it is more of an issue inside the OpenJDK. correct?
Yes, this is not completely related even though still unexpected. Can you please paste the stack trace of the pinning using my classloader?
just to make sure, we reverted the image to the one built with 3.8, and on the first run we got some pinning + deadlock.
This is also unexpected to be honest, or probably I'm missing something. If you reverted to 3.8 it means that you're using the RunnerClassLoader when it was still sequential, in other words well before it was registered as parallel capable in version 3.11.0. Do you mean that the sequential version of the old classloader can also suffer from deadlocks? If so, the revert we did recently to that sequential version is practically useless. Can you please double check?
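For readers following along, a minimal sketch (illustration only, not the Quarkus code) of what being registered as parallel capable changes:

```java
// Hypothetical loader for illustration.
public class ParallelCapableLoader extends ClassLoader {
    static {
        // Without this call, ClassLoader.getClassLoadingLock(name) returns `this`,
        // so concurrent loadClass calls serialize on a single monitor (the
        // "sequential" behaviour discussed above); with it, locking is per class name.
        registerAsParallelCapable();
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        // With parallel capability, two threads loading *different* classes can
        // reach this point concurrently, each holding its own per-name lock.
        throw new ClassNotFoundException(name);
    }
}
```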
Can you please paste the stack trace of the pinning using my classloader?
I believe it is this issue: https://github.com/quarkusio/quarkus/issues/41314
can you please double check?
let me check with the dev team.
Can you please paste the stack trace of the pinning using my classloader?
I believe it is this issue: #41314
Ok, this makes sense now: it is a totally different pinning problem, completely unrelated to the classloader.
can you please double check?
short answer (before we re-test) is: I am definitely sure, because they always stayed on the LTS, and the behavior they were seeing is the stack trace with the runner classloader that we searched for across the quarkus issues. and that is how we found this current issue.
I finally recreated in isolation the virtual thread pinning condition originally reported here.
Given all that has been discussed so far and the feedback provided by @vsevel, I don't see any reason to further hold this pull request, which removes the read/write lock from the RunnerClassLoader.
If so, the revert we did recently to that sequential version is practically useless
I can confirm it is not useless, it fixed the original issue.
we tested it again and were able to reproduce it:
2024-06-26 09:28:18,231 INFO [io.quarkus] (main) {} x 1.1.6-SNAPSHOT on JVM (powered by Quarkus 3.8.4) started in 6.142s. Listening on: http://0.0.0.0:8080 ...
...
Thread[#80,ForkJoinPool-1-worker-2,5,CarrierThreads]
java.base/java.lang.VirtualThread$VThreadContinuation.onPinned(VirtualThread.java:183)
java.base/jdk.internal.vm.Continuation.onPinned0(Continuation.java:393)
java.base/java.lang.VirtualThread.park(VirtualThread.java:582)
java.base/java.lang.System$2.parkVirtualThread(System.java:2643)
java.base/jdk.internal.misc.VirtualThreads.park(VirtualThreads.java:54)
java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:219)
java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1079)
java.base/java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:738)
io.quarkus.bootstrap.runner.JarResource.readLockAcquireAndGetJarReference(JarResource.java:143)
io.quarkus.bootstrap.runner.JarResource.getResourceData(JarResource.java:72)
io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:99)
io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:65)
io.quarkus.cache.runtime.CacheInterceptor.getArcCacheInterceptionContext(CacheInterceptor.java:82)
io.quarkus.cache.runtime.CacheInterceptor.getInterceptionContext(CacheInterceptor.java:61)
io.quarkus.cache.runtime.CacheResultInterceptor.intercept(CacheResultInterceptor.java:43)
io.quarkus.cache.runtime.CacheResultInterceptor_Bean.intercept(Unknown Source)
io.quarkus.arc.impl.InterceptorInvocation.invoke(InterceptorInvocation.java:42)
io.quarkus.arc.impl.AroundInvokeInvocationContext.perform(AroundInvokeInvocationContext.java:30)
...
the thread dump shows multiple stacks (22 threads blocked) such as:
"vert.x-eventloop-thread-0@16949" tid=0x1a nid=NA waiting for monitor entry
java.lang.Thread.State: BLOCKED
waiting for quarkus-virtual-thread-16@17044 to release lock on <0x4293> (a io.quarkus.bootstrap.runner.RunnerClassLoader)
at io.vertx.core.net.impl.pool.SimpleConnectionPool$Evict.execute(SimpleConnectionPool.java:537)
at io.vertx.core.net.impl.pool.SimpleConnectionPool$Evict.execute(SimpleConnectionPool.java:500)
at io.vertx.core.net.impl.pool.CombinerExecutor.submit(CombinerExecutor.java:67)
at io.vertx.core.net.impl.pool.SimpleConnectionPool.execute(SimpleConnectionPool.java:244)
at io.vertx.core.net.impl.pool.SimpleConnectionPool.evict(SimpleConnectionPool.java:549)
at io.vertx.core.http.impl.SharedClientHttpStreamEndpoint.checkExpired(SharedClientHttpStreamEndpoint.java:115)
at io.vertx.core.http.impl.HttpClientImpl.lambda$static$1(HttpClientImpl.java:106)
at io.vertx.core.http.impl.HttpClientImpl$$Lambda/0x00007fd358434700.accept(Unknown Source:-1)
at java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4783)
at io.vertx.core.net.impl.pool.ConnectionManager.forEach(ConnectionManager.java:33)
at io.vertx.core.http.impl.HttpClientImpl.checkExpired(HttpClientImpl.java:150)
at io.vertx.core.http.impl.HttpClientImpl.access$100(HttpClientImpl.java:48)
at io.vertx.core.http.impl.HttpClientImpl$PoolChecker.handle(HttpClientImpl.java:144)
at io.vertx.core.http.impl.HttpClientImpl$PoolChecker.handle(HttpClientImpl.java:132)
at io.vertx.core.impl.VertxImpl$InternalTimerHandler.handle(VertxImpl.java:1056)
at io.vertx.core.impl.VertxImpl$InternalTimerHandler.handle(VertxImpl.java:1027)
at io.vertx.core.impl.ContextImpl.emit(ContextImpl.java:328)
at io.vertx.core.impl.ContextImpl.emit(ContextImpl.java:321)
at io.vertx.core.impl.ContextInternal.emit(ContextInternal.java:209)
at io.vertx.core.impl.VertxImpl$InternalTimerHandler.run(VertxImpl.java:1045)
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.runWith(Thread.java:1596)
at java.lang.Thread.run(Thread.java:1583)
but I can't see quarkus-virtual-thread-16@17044 anywhere.
we did some additional tests with your branch, hoping to capture a full thread dump when detecting the pinning. unfortunately (for this test) we did not reproduce the pinning with your fix.
Describe the bug
reproducer below
After upgrading my app to 3.11.0 from 3.10.2 and changing nothing else, suddenly I get deadlocks. I turned on virtual thread pinning logs and indeed my carrier threads got pinned. Interestingly enough, there is no <- monitor in the stack, so I think this is not caused by synchronization but maybe by a native call? This might be related to the class loader changes in 3.11.
Expected behavior
Should not pin carrier threads, as before.
Actual behavior
Pins carrier threads. This happens in parallel in the actual application, so it ends up pinning all threads and deadlocking the application.
How to Reproduce?
Minimal reproducer:
Output of uname -a or ver
win64
Output of java -version
build 21+35-2513
Quarkus version or git rev
3.11.0
Build tool (i.e. output of mvnw --version or gradlew --version)
mvn
Additional information
No response