Open DanielThomas opened 2 weeks ago
If an unmounted virtual thread using RealScopedSpan is next line for the lock, but all carriers are currently occupied by pinned VTs in RealSpan.finish, the application will deadlock:
The RealScopedSpan
does not need a synchronized block since it does not do anything with the Span
directly, why it is needed?
While the monitor pinning limitation will be addressed in future OpenJDK releases, in the meantime there's a good case for switching this class to ReentrantLock to guard the MutableSpan to ensure compatibility with virtual threads.
This is coming from the previous point: the RealScopedSpan
does not need synchronized block but the underlying span could be mutated later on, and in this case the synchronized (mutableSpan)
block is going to be used by the component in question. This pattern does not fit ReentrantLock
usage
I don't quite follow? I didn't suggest RealScopedSpan
should be synchronized, just the fact that there's two callers for the CountBoundedQueue
lock, one that will pin and one that will suspend, which can lead to this deadlock.
Rather, RealSpan
should switch to ReentrantLock
instead of synchronized
to guard MutableSpan
, because it will allow those threads to suspend, rather than pin. The switch to lightweight locking to address the monitor pinning limitation is still some time off, and this is almost certain to cause deadlocks when run on virtual threads currently.
The reasoning behind keeping synchronized
are provided here https://github.com/openzipkin/brave/issues/1426 (plus, the recent updates on progress of the Java monitors https://mail.openjdk.org/pipermail/loom-dev/2024-May/006632.html , you may be interested in trying your examples with Loom early access builds https://jdk.java.net/loom/).
Yes, people should not wholesale replace synchronized
with ReentrantLock
, pinning itself is not a concern, unless it surrounds other long running blocking operations or has the potential to deadlock.
Here, the latter is true, this was a real-world application that deadlocked in production. Any Lock
provides the same memory visibility guarantees as synchronized, so can provide the same semantics, while ensuring that there isn't pinning around the lock in CountBoundedQueue
.
We track loom nightly in fact, and of course, lightweight locks avoid the pinning, but that won't land until JDK 24 or later.
Here, the latter is true, this was a real-world application that deadlocked in production.
This part certainly deserves a fix, I will try to take a look shortly
@DanielThomas could you please share if you tune themessageTimeout
setting in your application [1] or the default value (1s) is being used? Thank you.
[1] https://github.com/openzipkin/zipkin-reporter-java?tab=readme-ov-file#tuning
I think so, I believe we're just getting the default from upstream. We're using Brave via Micrometer/Spring. If I remember correctly, every data fetcher in DGS is observed, so these calls are quite frequent.
Incidentally, there's some more discussion on the expected effectiveness of lightweight monitors versus j.u.l here:
https://mail.openjdk.org/pipermail/loom-dev/2024-July/006885.html
Yeah, I am looking into the way to address the root cause, meantime (as I workaround, not a solution), could you try please to lower the messageTimeout
so the lock held by AsyncReporter
would be much smaller (and consequently, the synchronized
may not be able to block all carrier threads):
@Bean
AsyncZipkinSpanHandler asyncZipkinSpanHandler(BytesMessageSender sender,
BytesEncoder<MutableSpan> mutableSpanBytesEncoder) {
return AsyncZipkinSpanHandler.newBuilder(sender).messageTimeout(100, TimeUnit.MILLIS).build(mutableSpanBytesEncoder);
}
Thank you.
We've moved those applications back off virtual threads and avoiding further adoption for now, because avoiding pinning is the only way to avoid this deadlock. The lock only has to be held and callers queued for the lock in the "wrong" order for this to occur, the lock only needs to be held for very short time.
Describe the Bug
We observed a deadlock with Virtual Threads and Brave/Zipkin. In short, there are two paths to CountBoundedQueue.offer when finishing a span. RealSpan.finish has a synchronized block, where RealScopedSpan.finish does not.
If an unmounted virtual thread using
RealScopedSpan
is next line for the lock, but all carriers are currently occupied by pinned VTs inRealSpan.finish
, the application will deadlock:https://gist.github.com/DanielThomas/dddd850f7e491cac3a2dd734978f4267
Steps to Reproduce
See https://gist.github.com/DanielThomas/0b099c5f208d7deed8a83bf5fc03179e for a reduced example.
Expected Behaviour
While the monitor pinning limitation will be addressed in future OpenJDK releases, in the meantime there's a good case for switching this class to
ReentrantLock
to guard theMutableSpan
to ensure compatibility with virtual threads.