sofastack / sofa-jraft

A production-grade Java implementation of the Raft consensus algorithm.
https://www.sofastack.tech/projects/sofa-jraft/
Apache License 2.0

JRaft deadlock bug... #138

Closed shiftyman closed 5 years ago

shiftyman commented 5 years ago

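When LogManagerImpl's diskQueue is full, NodeImpl blocks in executeApplyingTasks at the line that calls this.logManager.appendEntries. At that point the Jraft-NodeImpl-Disruptor thread is holding LogManagerImpl's write lock.

When the Jraft-LogManager-Disruptor thread reaches the setDisk call inside onEvent, it tries to acquire that same write lock. The result is a deadlock: the producer cannot publish until the consumer frees a slot, and the consumer cannot finish its event until the producer releases the lock.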

Thread stacks attached:

"Jraft-LogManager-Disruptor-0" #40 daemon prio=5 os_prio=0 tid=0x00007fc1bb149000 nid=0x56a7e waiting on condition [0x00007fc06ccf8000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)

"Jraft-NodeImpl-Disruptor-0" #39 daemon prio=5 os_prio=0 tid=0x00007fc1bab76000 nid=0x56a58 runnable [0x00007fc06fefd000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338)
    at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:137)
    at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:105)
    at com.lmax.disruptor.RingBuffer.publishEvent(RingBuffer.java:450)
    at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.offerEvent(LogManagerImpl.java:319)
    at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.appendEntries(LogManagerImpl.java:302)
    at com.alipay.sofa.jraft.core.NodeImpl.executeApplyingTasks(NodeImpl.java:1083)
    at com.alipay.sofa.jraft.core.NodeImpl.access$200(NodeImpl.java:116)
    at com.alipay.sofa.jraft.core.NodeImpl$LogEntryAndClosureHandler.onEvent(NodeImpl.java:223)
    at com.alipay.sofa.jraft.core.NodeImpl$LogEntryAndClosureHandler.onEvent(NodeImpl.java:206)
    at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:129)
    at java.lang.Thread.run(Thread.java:745)

Locked ownable synchronizers:
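
For readers who want to see the pattern in isolation, here is a minimal sketch of the stall described above, using only JDK classes. It is not JRaft code: the class name `LockedPublishSketch`, the capacity-1 `ArrayBlockingQueue` standing in for the full diskQueue, and the timeouts are all assumptions made for illustration. The real code blocks indefinitely; the bounded waits here exist only so the demo terminates.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the NodeImpl -> LogManagerImpl interaction; not JRaft code.
final class LockedPublishSketch {

    /** Returns true when neither side made progress, i.e. the stall was reproduced. */
    static boolean reproduceStall() {
        // Stand-in for a full diskQueue: capacity 1, already occupied.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1);
        queue.add("pending-entry");
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CountDownLatch producerHoldsLock = new CountDownLatch(1);
        CountDownLatch consumerGaveUp = new CountDownLatch(1);
        AtomicBoolean produced = new AtomicBoolean(false);
        AtomicBoolean consumed = new AtomicBoolean(false);

        Thread producer = new Thread(() -> {
            lock.writeLock().lock();
            try {
                producerHoldsLock.countDown();
                // Keep trying to publish while holding the lock, until the consumer gives up.
                while (!produced.get() && consumerGaveUp.getCount() > 0) {
                    produced.set(queue.offer("new-entry", 50, TimeUnit.MILLISECONDS));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.writeLock().unlock();
            }
        }, "producer");

        Thread consumer = new Thread(() -> {
            try {
                producerHoldsLock.await();
                // The consumer needs the same write lock before it can free a slot.
                if (lock.writeLock().tryLock(200, TimeUnit.MILLISECONDS)) {
                    try {
                        consumed.set(queue.poll() != null);
                    } finally {
                        lock.writeLock().unlock();
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                consumerGaveUp.countDown();
            }
        }, "consumer");

        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !produced.get() && !consumed.get();
    }

    public static void main(String[] args) {
        System.out.println("stall reproduced: " + reproduceStall());
    }
}
```

The producer never publishes because the queue stays full, and the consumer never drains because it cannot get the lock. Remove the timeouts and the two threads wait on each other forever.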

killme2008 commented 5 years ago

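Yes, this problem currently exists. At root, jraft lacks a back-pressure mechanism for incoming traffic, and under overload every disruptor falls back on the default blocking strategy. If you need a fix urgently, you can work around it for now by increasing RaftOptions#disruptorBufferSize. With the default of 16384, we have not yet hit an overload condition in our stress tests.

The core issue is that we still need to work out a complete back-pressure strategy. We also need to systematically revisit the disruptor blocking strategy; using timeouts and similar mechanisms more flexibly would be a better fit. Thanks for the feedback.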
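
To make the timeout idea concrete, here is a minimal sketch of a bounded-wait submit that reports "busy" instead of blocking forever. It does not describe what jraft does or will do: `BackPressureSketch`, `Result`, and `submit` are hypothetical names, and a JDK `ArrayBlockingQueue` stands in for the disruptor ring buffer. With the LMAX Disruptor itself, the closest non-blocking call would be `RingBuffer.tryPublishEvent`, which returns `false` when there is no free capacity.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of timeout-based back-pressure; not JRaft code.
final class BackPressureSketch {
    enum Result { ACCEPTED, BUSY }

    private final BlockingQueue<String> queue;
    private final long timeoutMillis;

    BackPressureSketch(int capacity, long timeoutMillis) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.timeoutMillis = timeoutMillis;
    }

    /** Tries to enqueue within the timeout and reports BUSY rather than blocking indefinitely. */
    Result submit(String task) {
        try {
            boolean accepted = queue.offer(task, timeoutMillis, TimeUnit.MILLISECONDS);
            return accepted ? Result.ACCEPTED : Result.BUSY;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Result.BUSY;
        }
    }

    public static void main(String[] args) {
        BackPressureSketch sketch = new BackPressureSketch(2, 50);
        System.out.println(sketch.submit("a")); // ACCEPTED
        System.out.println(sketch.submit("b")); // ACCEPTED
        System.out.println(sketch.submit("c")); // BUSY: the queue is full and nothing drains it
    }
}
```

The caller that receives BUSY can then reject the request upstream or retry later, which is what pushes the pressure back toward the client instead of parking an internal thread.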

shiftyman commented 5 years ago

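My personal impression: jraft's processing flow looks like a pipeline, but it uses quite a few locks, and different threads along the pipeline contend for the same locks. That makes deadlocks more likely and also costs performance. I'd suggest checking whether the locking can be untangled so that the pipeline stages are decoupled from each other as far as locks are concerned and work can flow through smoothly. Apart from interacting through the "transport links" between stages, they shouldn't constrain one another.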
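
One possible reading of this suggestion, sketched with JDK classes only: a stage mutates shared state under the lock, releases it, and only then performs the potentially blocking hand-off to the next stage. The names `UnlockedHandoffSketch`, `append`, and `drainOne` are hypothetical, and this is not how jraft is implemented. The sketch assumes a single producer; with several producers, publishing outside the lock could reorder entries, so a real design would need another way to preserve ordering.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration of decoupling a blocking hand-off from a lock; not JRaft code.
final class UnlockedHandoffSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final BlockingQueue<Long> queue = new ArrayBlockingQueue<>(1);
    private long lastLogIndex; // guarded by lock

    /** Producer stage: update state under the lock, then hand off with no lock held. */
    boolean append() throws InterruptedException {
        long index;
        lock.writeLock().lock();
        try {
            index = ++lastLogIndex;
        } finally {
            lock.writeLock().unlock();
        }
        // The wait for a free slot happens outside the lock, so the consumer can still run.
        return queue.offer(index, 2, TimeUnit.SECONDS);
    }

    /** Consumer stage: takes the same lock before freeing a queue slot. */
    Long drainOne() {
        lock.writeLock().lock();
        try {
            return queue.poll();
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Returns true when both stages make progress even though the queue starts full. */
    static boolean demo() {
        UnlockedHandoffSketch sketch = new UnlockedHandoffSketch();
        sketch.queue.add(0L); // the queue starts full
        AtomicBoolean drained = new AtomicBoolean(false);
        Thread consumer = new Thread(() -> drained.set(sketch.drainOne() != null), "consumer");
        try {
            consumer.start();
            boolean accepted = sketch.append();
            consumer.join();
            return accepted && drained.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("both stages progressed: " + demo());
    }
}
```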

killme2008 commented 5 years ago

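Going lock-free is quite hard. jraft has already optimized its locking as far as it reasonably can, for example by using read-write locks and keeping lock scopes as narrow as possible, but the correctness of the raft algorithm itself requires some synchronized sections. For now this can only be treated as a long-term optimization direction.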