relab/hotstuff

liveness problem when encountering view changes #24

Closed joe-zxh closed 3 years ago

joe-zxh commented 3 years ago

Dear authors,

Hi, I think this repository is excellent work. However, I have encountered a liveness problem when using the round-robin mode.

With the default settings, I run a 4-node cluster and one client. Normally, CPU usage of each hotstuff server is about 200% per instance. But after running hotstuff for some time, CPU usage drops below 10% and the hotstuff servers stop serving requests.

I think this is because view changes are happening: some blocks cannot collect enough certificates, and the blockchain diverges from one server to another.

Setting the view timeout to a lower value (e.g., 10 ms) increases the probability of encountering this liveness problem.

This has been bothering me a lot, and I am looking forward to hearing from you. Thanks!

joe-zxh commented 3 years ago

It turned out to be a deadlock problem. I fixed it by changing the structure of data.Block so that it is protected by the HotStuffCore's mutex rather than by its own mutex.

A later block should wait until all of its parent blocks have arrived.

// PutBlock stores b once its parent is available.
// The caller must already hold hs.Mut (hs.Mut.Lock()).
func (hs *HotStuffCore) PutBlock(b *data.Block) {
    for {
        if _, ok := hs.Blocks.Get(b.ParentHash); ok {
            // The parent is present; store b and wake any goroutines
            // waiting for it in GetBlock.
            hs.Blocks.Put(b)
            hs.waitProposal.Broadcast()
            return
        }
        // The parent has not been put yet; wait for another proposal.
        // Wait releases hs.Mut while blocked and reacquires it on wake-up.
        hs.waitProposal.Wait()
    }
}

// GetBlock returns the block with the given hash, blocking until it has
// been stored. The caller must already hold hs.Mut (hs.Mut.Lock()).
func (hs *HotStuffCore) GetBlock(hash data.BlockHash) *data.Block {
    for {
        if block, ok := hs.Blocks.Get(hash); ok {
            return block
        }
        hs.waitProposal.Wait()
    }
}
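
For context, here is a minimal sketch of how the shared mutex and condition variable could be wired together, and how a caller would use PutBlock. This is only an illustration, not the repository's actual code: the blockStore interface, the constructor, and onReceiveProposal are placeholder names I made up for this example.

// Illustrative only; assumes `import "sync"`.
// blockStore is a placeholder interface matching the Get/Put calls used above;
// the real store type in the repository may differ.
type blockStore interface {
    Get(hash data.BlockHash) (*data.Block, bool)
    Put(b *data.Block)
}

type HotStuffCore struct {
    Mut          sync.Mutex
    waitProposal *sync.Cond // bound to Mut via sync.NewCond
    Blocks       blockStore
}

func NewHotStuffCore(blocks blockStore) *HotStuffCore {
    hs := &HotStuffCore{Blocks: blocks}
    hs.waitProposal = sync.NewCond(&hs.Mut)
    return hs
}

// Caller side: hold hs.Mut around PutBlock; Wait() inside PutBlock releases
// the mutex while blocked and reacquires it before returning.
func (hs *HotStuffCore) onReceiveProposal(b *data.Block) {
    hs.Mut.Lock()
    defer hs.Mut.Unlock()
    hs.PutBlock(b) // blocks until b.ParentHash is present in hs.Blocks
}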

This modification works in normal operation without the liveness problem, but it might introduce a safety problem in Byzantine cases.
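
For what it's worth, one simple variation (purely a sketch of my own, not code from this repository; the getBlockWithTimeout name and its behavior are made up) is to bound the wait so a replica does not block forever on a parent block that never arrives. This does not address any deeper safety concern, only the indefinite blocking.

// Hypothetical bounded-wait variant of GetBlock; assumes `import "time"`.
// sync.Cond has no timed wait, so a timer broadcasts after the deadline and
// waiters re-check it on wake-up. The caller must hold hs.Mut, as above.
func (hs *HotStuffCore) getBlockWithTimeout(hash data.BlockHash, d time.Duration) (*data.Block, bool) {
    deadline := time.Now().Add(d)
    timer := time.AfterFunc(d, func() {
        hs.Mut.Lock()
        hs.waitProposal.Broadcast() // wake waiters so they notice the deadline
        hs.Mut.Unlock()
    })
    defer timer.Stop()

    for {
        if block, ok := hs.Blocks.Get(hash); ok {
            return block, true
        }
        if time.Now().After(deadline) {
            return nil, false // parent never arrived; let the caller decide how to react
        }
        hs.waitProposal.Wait() // releases hs.Mut while blocked
    }
}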

johningve commented 3 years ago

Likely fixed in the new version.