openzipkin / brave

Java distributed tracing implementation compatible with Zipkin backend services.
Apache License 2.0
2.35k stars 714 forks source link

Potential deadlock between zipkin-reporter and Brave when used with Virtual Threads #1440

Open DanielThomas opened 2 weeks ago

DanielThomas commented 2 weeks ago

Describe the Bug

We observed a deadlock with Virtual Threads and Brave/Zipkin. In short, there are two paths to CountBoundedQueue.offer when finishing a span. RealSpan.finish has a synchronized block, where RealScopedSpan.finish does not.

If an unmounted virtual thread using RealScopedSpan is next line for the lock, but all carriers are currently occupied by pinned VTs in RealSpan.finish, the application will deadlock:

https://gist.github.com/DanielThomas/dddd850f7e491cac3a2dd734978f4267

Steps to Reproduce

See https://gist.github.com/DanielThomas/0b099c5f208d7deed8a83bf5fc03179e for a reduced example.

Expected Behaviour

While the monitor pinning limitation will be addressed in future OpenJDK releases, in the meantime there's a good case for switching this class to ReentrantLock to guard the MutableSpan to ensure compatibility with virtual threads.

reta commented 2 weeks ago

If an unmounted virtual thread using RealScopedSpan is next line for the lock, but all carriers are currently occupied by pinned VTs in RealSpan.finish, the application will deadlock:

The RealScopedSpan does not need a synchronized block since it does not do anything with the Span directly, why it is needed?

While the monitor pinning limitation will be addressed in future OpenJDK releases, in the meantime there's a good case for switching this class to ReentrantLock to guard the MutableSpan to ensure compatibility with virtual threads.

This is coming from the previous point: the RealScopedSpan does not need synchronized block but the underlying span could be mutated later on, and in this case the synchronized (mutableSpan) block is going to be used by the component in question. This pattern does not fit ReentrantLock usage

DanielThomas commented 2 weeks ago

I don't quite follow? I didn't suggest RealScopedSpan should be synchronized, just the fact that there's two callers for the CountBoundedQueue lock, one that will pin and one that will suspend, which can lead to this deadlock.

Rather, RealSpan should switch to ReentrantLock instead of synchronized to guard MutableSpan, because it will allow those threads to suspend, rather than pin. The switch to lightweight locking to address the monitor pinning limitation is still some time off, and this is almost certain to cause deadlocks when run on virtual threads currently.

reta commented 2 weeks ago

The reasoning behind keeping synchronized are provided here https://github.com/openzipkin/brave/issues/1426 (plus, the recent updates on progress of the Java monitors https://mail.openjdk.org/pipermail/loom-dev/2024-May/006632.html , you may be interested in trying your examples with Loom early access builds https://jdk.java.net/loom/).

DanielThomas commented 2 weeks ago

Yes, people should not wholesale replace synchronized with ReentrantLock, pinning itself is not a concern, unless it surrounds other long running blocking operations or has the potential to deadlock.

Here, the latter is true, this was a real-world application that deadlocked in production. Any Lock provides the same memory visibility guarantees as synchronized, so can provide the same semantics, while ensuring that there isn't pinning around the lock in CountBoundedQueue.

We track loom nightly in fact, and of course, lightweight locks avoid the pinning, but that won't land until JDK 24 or later.

reta commented 2 weeks ago

Here, the latter is true, this was a real-world application that deadlocked in production.

This part certainly deserves a fix, I will try to take a look shortly

reta commented 5 days ago

@DanielThomas could you please share if you tune themessageTimeout setting in your application [1] or the default value (1s) is being used? Thank you.

[1] https://github.com/openzipkin/zipkin-reporter-java?tab=readme-ov-file#tuning

DanielThomas commented 4 days ago

I think so, I believe we're just getting the default from upstream. We're using Brave via Micrometer/Spring. If I remember correctly, every data fetcher in DGS is observed, so these calls are quite frequent.

Incidentally, there's some more discussion on the expected effectiveness of lightweight monitors versus j.u.l here:

https://mail.openjdk.org/pipermail/loom-dev/2024-July/006885.html

reta commented 3 days ago

Yeah, I am looking into the way to address the root cause, meantime (as I workaround, not a solution), could you try please to lower the messageTimeout so the lock held by AsyncReporter would be much smaller (and consequently, the synchronized may not be able to block all carrier threads):

@Bean
AsyncZipkinSpanHandler asyncZipkinSpanHandler(BytesMessageSender sender,
        BytesEncoder<MutableSpan> mutableSpanBytesEncoder) {
    return AsyncZipkinSpanHandler.newBuilder(sender).messageTimeout(100, TimeUnit.MILLIS).build(mutableSpanBytesEncoder);
}

Thank you.

DanielThomas commented 20 hours ago

We've moved those applications back off virtual threads and avoiding further adoption for now, because avoiding pinning is the only way to avoid this deadlock. The lock only has to be held and callers queued for the lock in the "wrong" order for this to occur, the lock only needs to be held for very short time.