Closed stepanv closed 2 years ago
To work around this problem, one can instruct the reactive stream to cancel on a different thread. But that needs to be done after the particular thread executes MonoCollectList.cancel()
. That is to declare it before the .collectList()
statement.
var publish = Flux.using(
() -> driver.rxSession(SessionConfig.forDatabase(GraphDatabaseSettings.DEFAULT_DATABASE_NAME)),
session -> session.readTransaction(tx -> Flux.from(tx.run("call db.ping()").records())),
rxSession -> {
rxSession.close();
}
)
.cancelOn(Schedulers.parallel())
.collectList();
note that this wouldn't work as the parallel
thread would get into a deadlock:
.collectList()
.cancelOn(Schedulers.parallel());
but note that this work around works only thanks to the fact that other implementations of the Subscription
interface do not use the same lock for onNext()
and cancel()
.
Thank you for detailed explanation of the problem. We will see if we can improve it.
In the meantime, have you had a chance to try the following update? https://github.com/reactor/reactor-core/pull/3053
@injectives , yep that Simon's improvement prevents the deadlock I reported.
However, as he is pointing out, you guys should be ready to accept concurrent cancel()
and onNext()
calls. This is not possible now as you're using the same monitor (this
) as these methods in BasicPullResponseHandler.java
are synchronized
. Furthermore, you keep the lock while calling upstream (in cancel()
) and downstream (in onRecord()
) reactive stream.
@stepanv, would you mind giving the following update a go? https://github.com/neo4j/neo4j-java-driver/pull/1233 I have checked it locally and it seems to prevent this issue from happening.
Hi, I'm sorry for the late response.
With the 4.4.6
version, I'm no longer hitting the deadlock. Thanks!
We detected a deadlock in our production setup between 2 threads, each locking same locks but in the opposite order.
See the 2 callstacks from a thread dump we collected:
Here you can see that
Neo4jDriverIO-2-3
is emitting data and while executing the on next chain it first locksBasicPullResponseHandler
monitor in the methodonRecord()
and later it tries to lock theMonoCollectListSubscriber
monitor in the methodonNext()
default-nioEventLoopGroup-1-2
which is running on an external event (in this case it's netty publishing a channel close which in turn makes Micronaut to cancel the reactive chain) and locks theMonoCollectListSubscriber
while runningcancel()
and tries to lockBasicPullResponseHandler
in thecancel()
methodthe threads happen to end up in a deadlock.
Steps to reproduce
import java.time.Duration;
/**
You have as much time as you need to collect the thread dump. */ class Neo4jDeadlockTest {
@Test void neo4j_deadlock_withMonoList() { // Given Config config = Config.builder() .withoutEncryption() .withLogging(Logging.slf4j()) .withEventLoopThreads(5) .build();
} }
main
method tries to cancel the subscription via theblock(Duration.ofSeconds(5))
main
thread and theNeo4jDriver
thread got into a deadlockThis reproducer is just a demonstration of how one can reproduce the problem completely synthetically. By using the breakpoint you're causing the threads to interleave in the exact wrong order and ending in a deadlock.
We hit this deadlock while running a standard business logic in our production on AWS. (No breakpoints needed :) )
Expected behavior
The neo4j
BasicPullResponseHandler
should not get into a deadlock with Reactor primitives.