harmony-one / harmony

The core protocol of harmony
https://harmony.one
GNU Lesser General Public License v3.0

shard1 node can't catch up with shard1 on stn #3548

Open LeoHChen opened 3 years ago

LeoHChen commented 3 years ago

Describe the bug

I was doing testing on STN. One node on shard1, after an epoch change, can't catch up with peers anymore. https://watchdog.hmny.io/report-stn#version-v6760-v3.1.1-15-gefb107e9

54.202.26.122

Consensus was ongoing in shard1, but this node can never catch up. It seems to be stuck in some state.

LeoHChen commented 3 years ago

Did you check the wrong node?

The node out of sync is: 54.202.26.122

LeoHChen commented 3 years ago
$ hmy blockchain latest-headers --node=54.202.26.122:9500
{
  "id": "0",
  "jsonrpc": "2.0",
  "result": {
    "beacon-chain-header": {
      "block-header-hash": "0x33dd8b09fc6dcca6e0a16a14673402b55343b142d6af723e6b66f59ea5f30278",
      "block-number": 6074,
      "epoch": 5,
      "shard-id": 0,
      "view-id": 6074
    },
    "shard-chain-header": {
      "block-header-hash": "0xae4638ea915d9ae3d113317d07e1b224610636d7a1e5bf03190a8bc8229eecea",
      "block-number": 5865,
      "epoch": 5,
      "shard-id": 1,
      "view-id": 5865
    }
  }
}
JackyWYX commented 3 years ago

Ah, my bad. So this seems to be an unexpected fork issue: this stuck node (also a leader) has an unexpected fork at height 5865.

From the sync log

{"level":"error","error":"rpc error: code = Unknown desc = [SYNC] GetBlockHashes Request cannot find startHash 0x8cb7cac49a334011ada7bde77a21d07113583b84e652031a035509ecc954b2ce","target":"54.70.143.184:6000","caller":"/home/lc/go/src/github.com/harmony-one/harmony/api/service/syncing/downloader/client.go:55","time":"2021-02-24T08:17:46.428598422Z","message":"[SYNC] GetBlockHashes query failed"}

And we can see that the forked block is exactly the block that is jammed:

$ ./hmy --node=54.202.26.122:9500 blockchain block-by-number 5865
{
  "id": "0",
  "jsonrpc": "2.0",
  "result": {
    "difficulty": 0,
    "epoch": "0x5",
    "extraData": "0x",
    "gasLimit": "0x4c4b400",
    "gasUsed": "0x0",
    "hash": "0x8cb7cac49a334011ada7bde77a21d07113583b84e652031a035509ecc954b2ce",
    "logsBloom": "0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000",
    "miner": "one1wh4p0kuc7unxez2z8f82zfnhsg4ty6dupqyjt2",
    "mixHash": "0x0000000000000000000000000000000000000000000000000000000000000000",
    "nonce": 0,
    "number": "0x16e9",
    "parentHash": "0x9669930636b2136e342081f16328a0fa62433f2473b3b19f85abbe1bd43c881e",
    "receiptsRoot": "0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421",
    "size": "0x2a3",
    "stakingTransactions": [],
    "stateRoot": "0x0dc16459518d48adfdc7f9c649548f228e47ad01a2e7cfd8505688b436d7b9d1",
    "timestamp": "0x6035f59b",
    "transactions": [],
    "transactionsRoot": "0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421",
    "uncles": [],
    "viewID": "0x16e9"
  }
}
JackyWYX commented 3 years ago

The fork seems related to how the network was stopped on the last run. But there is a lack of evidence from the last run, since every time a harmony node restarts, it destroys the existing log. It would be better to change zerolog to append mode instead of creating a new logger.

{"level":"info","blockNum":5865,"numPubKeys":10,"mode":"Listening","caller":"/home/lc/go/src/github.com/harmony-one/harmony/node/node.go:1111","time":"2021-02-24T06:44:28.041215909Z","message":"[InitConsensusWithValidators] Successfully updated public keys"}

@rlan35 Please take a look at this issue. Thanks

LeoHChen commented 3 years ago

The log is still available at latest/zerolog-harmony.log.2021-02-24T01:16:06Z

JackyWYX commented 3 years ago

One possible scenario from my guess could be:

  1. The leader node commits the block to db and sends out the onCommitted message.
  2. The rest of the nodes stopped before processing the onCommitted message, so the block is not inserted.
  3. The network ends in a state where only the leader node has a block 1 higher than the others.
  4. Nodes restarted. Only the leader has block 5865; the others have 5864.
  5. This block will never be synced in the gRPC sync logic, since the block hash query uses a max-vote algorithm. And since we have 9 DNS nodes for STN, block 5865 apparently does not hold the majority vote. The block is not synced.
  6. The leader proposes 5866 while the others think they are missing one block. But this missing block cannot be retrieved by any means.
  7. The leader times out and a view change is triggered, leaving the leader node alone looping on this forked block.

If this is the case, I would be proud to say this issue will not happen in decentralized sync :)

JackyWYX commented 3 years ago

Evidence from logs of another side chain node

  1. When started, the node is working on consensus for block 5865 while the leader is working on 5866, so the leader node and the rest of the nodes do not agree with each other.
{"level":"debug","myBlock":5865,"myViewID":5865,"phase":"Announce","mode":"Normal","leaderKey":"4bf54264c1bfa68ca201f756e882f49e1e8aaa5ddf42deaf4690bc3977497e245af40f3ad4003d7a6121614f13033b0b","caller":"/home/lc/go/src/github.com/harmony-one/harmony/consensus/checks.go:100","time":"2021-02-24T06:45:13.384376274Z","message":"[OnAnnounce] Announce message received again"}
{"level":"debug","myBlock":5865,"myViewID":5865,"phase":"Announce","mode":"Normal","MsgViewID":5866,"MsgBlockNum":5866,"caller":"/home/lc/go/src/github.com/harmony-one/harmony/consensus/validator.go:37","time":"2021-02-24T06:45:20.386683576Z","message":"[OnAnnounce] Announce message Added"}
  2. As consensus timed out, a view change is triggered.
{"level":"warn","myBlock":5865,"myViewID":5865,"phase":"Announce","mode":"Normal","caller":"/home/lc/go/src/github.com/harmony-one/harmony/consensus/consensus_v2.go:328","time":"2021-02-24T06:46:09.289620263Z","message":"[ConsensusMainLoop] Ops Consensus Timeout!!!"}
{"level":"warn","myBlock":5865,"myViewID":5865,"phase":"Announce","mode":"ViewChanging","nextViewID":5869,"viewChangingID":5869,"timeoutDuration":27000,"NextLeader":"2e9aa982036860eccb0880702c5d71665761f8d4e6ab5f3d8c3aee25b3e68a2c7eaa3cd85972c7f9a3c19d3fed3d5d01","caller":"/home/lc/go/src/github.com/harmony-one/harmony/consensus/view_change.go:254","time":"2021-02-24T06:46:09.28976685Z","message":"[startViewChange]"}

This can be the proof of my last guess.

JackyWYX commented 3 years ago

Solution

This is an edge case where a minority holds the latest block, but DNS sync always syncs blockchains by max vote. So currently there is no easy fix in the DNS sync design.

But this problem can be solved in decentralized sync with sync stream (https://github.com/harmony-one/harmony/pull/3535), where sync always targets the highest block number as long as the given block data is valid. For more reference on the decentralized sync logic, please check out the code here.

LeoHChen commented 3 years ago

We shall leave this ticket open until we fix it. It has an in-depth analysis, and we now roughly know how to reproduce this situation. We've seen this issue on testnet and mainnet once or twice.

Once we have the decentralized syncing feature, we shall test this scenario on STN.