alievmirza opened this issue 1 month ago
Thank you for your feedback.
Could you provide the test code for reproduction? I couldn't replicate the issue on my machine. So, could you please add a related test and send a PR?
Oh, sorry, I gave the wrong name for the base test class (fixed that in the description); the test is com.alipay.sofa.jraft.core.NodeTest#testNodeTaskOverload. To reproduce, just run the test in a loop until it fails. For that purpose I used Edit Run Configuration in IntelliJ IDEA for the test: Modify options -> Repeat -> Until Failure (a plain-JUnit alternative is sketched below). You also need to comment out this code in com.alipay.sofa.jraft.core.NodeTest#setupNodeTest so that it is possible to run the test several times:
```java
StorageOptionsFactory.registerRocksDBTableFormatConfig(GROUP_ID, RocksDBLogStorage.class,
    StorageOptionsFactory.getDefaultRocksDBTableConfig().setBlockCacheSize(256 * SizeUnit.MB));
```
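For reference, a rough plain-JUnit alternative to IDEA's Repeat/Until Failure option could look like the following sketch (a hypothetical helper class, not part of sofa-jraft; it assumes JUnit 4 on the classpath):

```java
import org.junit.runner.JUnitCore;
import org.junit.runner.Request;
import org.junit.runner.Result;

// Hypothetical helper: re-run the single test method in a loop until it fails.
public class RepeatUntilFailure {
    public static void main(String[] args) throws Exception {
        Request request = Request.method(
            Class.forName("com.alipay.sofa.jraft.core.NodeTest"), "testNodeTaskOverload");
        for (int run = 1; ; run++) {
            Result result = new JUnitCore().run(request);
            if (!result.wasSuccessful()) {
                System.out.println("Failed on run " + run);
                break;
            }
            System.out.println("Run " + run + " passed");
        }
    }
}
```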
I can also provide a thread dump: https://gist.github.com/alievmirza/07443fd2ae4e3f7db70fadaa06f4cc1c
Thank you, I have reproduced it on my machine.
The producer thread, which handles the com.alipay.sofa.jraft.core.NodeImpl.executeApplyingTasks(NodeImpl.java:1409) method, owns the write lock, and the consumer thread then tries to acquire the same write lock. A deadlock occurs like this due to the disruptor overload.
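To make the cycle concrete, here is a minimal standalone sketch (not JRaft code, just the same lock-plus-bounded-queue shape): the producer blocks on a full queue while holding the write lock that the consumer needs before it can drain the queue.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustration only: the producer holds the write lock and blocks on queue.put();
// the consumer cannot free a slot because it needs the same write lock.
public class DeadlockSketch {
    static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    static final BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(2); // tiny queue, like the test

    public static void main(String[] args) {
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    Runnable task = queue.take();
                    lock.writeLock().lock(); // consumer needs the write lock to apply the task
                    try {
                        task.run();
                    } finally {
                        lock.writeLock().unlock();
                    }
                }
            } catch (InterruptedException ignored) {
            }
        });
        consumer.start();

        lock.writeLock().lock(); // producer takes the write lock first ...
        try {
            for (int i = 0; i < 10; i++) {
                queue.put(() -> { }); // ... and blocks here once the queue is full: deadlock
            }
        } catch (InterruptedException ignored) {
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```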
Good catch! It's a known bug. Related issues #138 and #720 have been addressed and a fix has been attempted at https://github.com/sofastack/sofa-jraft/pull/764.
It seems that we still have some work to do.
Do you think this issue is dangerous, or is it quite rare in real scenarios? Do you have any plans or ideas for how to properly fix it?
> Do you think this issue is dangerous, or is it quite rare in real scenarios? Do you have any plans or ideas for how to properly fix it?
In real scenarios, it is quite uncommon. It occurs only when the node is overloaded, and in such circumstances, manual intervention like adding new nodes or resources may be the only viable solution.
How can it be fixed? I have not considered it carefully. One potential solution is to replace some of the places that lock on Node#writeLock with a tryLock, enabling other tasks to make progress.
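As a rough illustration of that idea only (a hypothetical fragment, not the actual NodeImpl code and not the fix that was eventually tried), the apply path could reject the task with a busy status instead of blocking on the write lock:

```java
// Hypothetical fragment sketching the tryLock idea; names follow NodeImpl but the code is illustrative.
if (!this.writeLock.tryLock()) {
    // Node is overloaded: fail fast so the caller can retry, instead of blocking while the queues drain.
    Utils.runClosureInThread(done, new Status(RaftError.EBUSY, "Node is overloaded, try again later."));
    return;
}
try {
    // ... append the entries as before, while holding the write lock ...
} finally {
    this.writeLock.unlock();
}
```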
I tried a fix in #1109, which makes applying tasks fail fast when the log manager is overloaded.
Describe the bug
There is a deadlock in NodeImpl when working with full LogManagerImpl#diskQueue, FSMCallerImpl#taskQueue and NodeImpl#writeLock.
1. NodeImpl#executeApplyingTasks() takes NodeImpl.writeLock and calls LogManager.appendEntries().
2. LogManager tries to enqueue a task to diskQueue, which is full, hence it blocks until a task gets consumed from diskQueue.
3. diskQueue is consumed by StableClosureEventHandler.
4. StableClosureEventHandler tries to enqueue a task to FSMCallerImpl#taskQueue, which is also full, so this also blocks until a task gets consumed from FSMCallerImpl#taskQueue.
5. FSMCallerImpl#taskQueue is consumed by ApplyTaskHandler.
6. ApplyTaskHandler calls NodeImpl#onConfigurationChangeDone(), which tries to take NodeImpl#writeLock.
As a result, there is a deadlock: NodeImpl#writeLock -> LogManager#diskQueue -> FSMCallerImpl#taskQueue -> NodeImpl#writeLock (disruptors are used as blocking queues in JRaft, so, when full, they act like locks; a standalone sketch of this blocking behaviour is included after the reproduction steps below).

This was caught by com.alipay.sofa.jraft.core.NodeTest#testNodeTaskOverload, which uses extremely short disruptors (2 items max each).

Steps to reproduce
Run com.alipay.sofa.jraft.core.NodeTest#testNodeTaskOverload in a loop several times; on my local machine it is reproducible within 50-100 runs.
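To illustrate the remark above that a full disruptor behaves like a lock, here is a small standalone sketch against the LMAX Disruptor library (not JRaft code; the ring size of 2 mirrors the test configuration):

```java
import com.lmax.disruptor.EventFactory;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

// Standalone sketch: with a stuck consumer and a 2-slot ring buffer, publishEvent()
// parks the producer thread once the ring is full, which is the blocking edge in the cycle above.
public class FullDisruptorBlocks {
    static class Event {
        long value;
    }

    public static void main(String[] args) {
        Disruptor<Event> disruptor = new Disruptor<>(
                (EventFactory<Event>) Event::new, 2, DaemonThreadFactory.INSTANCE);
        disruptor.handleEventsWith((event, sequence, endOfBatch) -> Thread.sleep(60_000)); // consumer never drains
        RingBuffer<Event> ringBuffer = disruptor.start();

        for (int i = 0; i < 4; i++) {
            System.out.println("publishing " + i);
            // The third call never returns: the producer waits for the stuck consumer to free a slot.
            ringBuffer.publishEvent((event, sequence, arg) -> event.value = arg, (long) i);
        }
    }
}
```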
Environment
java -version: openjdk version "11.0.23"
uname -a: macOS 14.5