Open LeoHChen opened 3 years ago
You checked the wrong node?
The node out of sync is: 54.202.26.122
$ hmy blockchain latest-headers --node=54.202.26.122:9500
{
"id": "0",
"jsonrpc": "2.0",
"result": {
"beacon-chain-header": {
"block-header-hash": "0x33dd8b09fc6dcca6e0a16a14673402b55343b142d6af723e6b66f59ea5f30278",
"block-number": 6074,
"epoch": 5,
"shard-id": 0,
"view-id": 6074
},
"shard-chain-header": {
"block-header-hash": "0xae4638ea915d9ae3d113317d07e1b224610636d7a1e5bf03190a8bc8229eecea",
"block-number": 5865,
"epoch": 5,
"shard-id": 1,
"view-id": 5865
}
}
}
Ah, my bad. So this seems to be an unexpected fork issue: this stuck node (also a leader) has an unexpected fork at height 5865.
From the sync log:
{"level":"error","error":"rpc error: code = Unknown desc = [SYNC] GetBlockHashes Request cannot find startHash 0x8cb7cac49a334011ada7bde77a21d07113583b84e652031a035509ecc954b2ce","target":"54.70.143.184:6000","caller":"/home/lc/go/src/github.com/harmony-one/harmony/api/service/syncing/downloader/client.go:55","time":"2021-02-24T08:17:46.428598422Z","message":"[SYNC] GetBlockHashes query failed"}
And we can see that the forked block is exactly the block that is jammed:
$ ./hmy --node=54.202.26.122:9500 blockchain block-by-number 5865
{
"id": "0",
"jsonrpc": "2.0",
"result": {
"difficulty": 0,
"epoch": "0x5",
"extraData": "0x",
"gasLimit": "0x4c4b400",
"gasUsed": "0x0",
"hash": "0x8cb7cac49a334011ada7bde77a21d07113583b84e652031a035509ecc954b2ce",
"logsBloom": "0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000",
"miner": "one1wh4p0kuc7unxez2z8f82zfnhsg4ty6dupqyjt2",
"mixHash": "0x0000000000000000000000000000000000000000000000000000000000000000",
"nonce": 0,
"number": "0x16e9",
"parentHash": "0x9669930636b2136e342081f16328a0fa62433f2473b3b19f85abbe1bd43c881e",
"receiptsRoot": "0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421",
"size": "0x2a3",
"stakingTransactions": [],
"stateRoot": "0x0dc16459518d48adfdc7f9c649548f228e47ad01a2e7cfd8505688b436d7b9d1",
"timestamp": "0x6035f59b",
"transactions": [],
"transactionsRoot": "0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421",
"uncles": [],
"viewID": "0x16e9"
}
}
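Two details in this output confirm the diagnosis: the hash of the node's block 5865 is exactly the `startHash` the sync peers reported as unknown, and the hex quantities `number`/`viewID` of `0x16e9` decode to 5865, matching the jammed height in the logs. A small sketch to double-check both (the hash values are copied from the outputs above):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// hexToUint decodes an 0x-prefixed JSON-RPC quantity such as "0x16e9".
func hexToUint(s string) (uint64, error) {
	return strconv.ParseUint(strings.TrimPrefix(s, "0x"), 16, 64)
}

func main() {
	// Hash of block 5865 on the stuck node (from the block-by-number output).
	localHash := "0x8cb7cac49a334011ada7bde77a21d07113583b84e652031a035509ecc954b2ce"
	// startHash that sync peers reported as unknown (from the GetBlockHashes error).
	rejectedStart := "0x8cb7cac49a334011ada7bde77a21d07113583b84e652031a035509ecc954b2ce"

	n, err := hexToUint("0x16e9")
	if err != nil {
		panic(err)
	}
	fmt.Println(n, localHash == rejectedStart) // → 5865 true
}
```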
It seems the fork is related to how the network was stopped on the last run. But there is a lack of evidence from the last run, since every time the harmony node restarts, it destroys the existing log. It would be better to switch zerolog to append mode instead of creating a new logger each time.
{"level":"info","blockNum":5865,"numPubKeys":10,"mode":"Listening","caller":"/home/lc/go/src/github.com/harmony-one/harmony/node/node.go:1111","time":"2021-02-24T06:44:28.041215909Z","message":"[InitConsensusWithValidators] Successfully updated public keys"}
@rlan35 Please take a look at this issue. Thanks
The log is still available at latest/zerolog-harmony.log.2021-02-24T01:16:06Z
One possible scenario, from my guess, could be: some nodes never received the onCommitted message, and thus the block was not inserted on them. If this is the case, I would be proud to say this issue will not happen in decentralized sync :)
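To make the guessed scenario concrete, here is a toy model (hypothetical types, not Harmony's actual consensus code) of what the logs below show: if the committed message for a height is dropped, the block is never inserted, so the height stays put while the consensus timer keeps firing view changes:

```go
package main

import "fmt"

// replica is a toy validator state: height advances only when the committed
// message for the next block actually arrives.
type replica struct {
	height uint64
	viewID uint64
}

// onCommitted tries to insert block blockNum; if the message was dropped in
// transit, nothing happens and the replica stays at its old height.
func (r *replica) onCommitted(blockNum uint64, delivered bool) {
	if delivered && blockNum == r.height+1 {
		r.height = blockNum
		r.viewID = blockNum
	}
}

// onTimeout models the consensus timer firing: the view changes but the
// height does not, matching "myBlock" staying fixed while viewIDs advance.
func (r *replica) onTimeout() { r.viewID++ }

func main() {
	r := &replica{height: 5864, viewID: 5864}
	r.onCommitted(5865, false) // committed message for 5865 never arrives
	r.onTimeout()
	r.onTimeout()
	fmt.Println(r.height, r.viewID) // → 5864 5866: the view moves on, the height is jammed
}
```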
{"level":"debug","myBlock":5865,"myViewID":5865,"phase":"Announce","mode":"Normal","leaderKey":"4bf54264c1bfa68ca201f756e882f49e1e8aaa5ddf42deaf4690bc3977497e245af40f3ad4003d7a6121614f13033b0b","caller":"/home/lc/go/src/github.com/harmony-one/harmony/consensus/checks.go:100","time":"2021-02-24T06:45:13.384376274Z","message":"[OnAnnounce] Announce message received again"}
{"level":"debug","myBlock":5865,"myViewID":5865,"phase":"Announce","mode":"Normal","MsgViewID":5866,"MsgBlockNum":5866,"caller":"/home/lc/go/src/github.com/harmony-one/harmony/consensus/validator.go:37","time":"2021-02-24T06:45:20.386683576Z","message":"[OnAnnounce] Announce message Added"}
{"level":"warn","myBlock":5865,"myViewID":5865,"phase":"Announce","mode":"Normal","caller":"/home/lc/go/src/github.com/harmony-one/harmony/consensus/consensus_v2.go:328","time":"2021-02-24T06:46:09.289620263Z","message":"[ConsensusMainLoop] Ops Consensus Timeout!!!"}
{"level":"warn","myBlock":5865,"myViewID":5865,"phase":"Announce","mode":"ViewChanging","nextViewID":5869,"viewChangingID":5869,"timeoutDuration":27000,"NextLeader":"2e9aa982036860eccb0880702c5d71665761f8d4e6ab5f3d8c3aee25b3e68a2c7eaa3cd85972c7f9a3c19d3fed3d5d01","caller":"/home/lc/go/src/github.com/harmony-one/harmony/consensus/view_change.go:254","time":"2021-02-24T06:46:09.28976685Z","message":"[startViewChange]"}
The logs above can serve as evidence for my last guess.
This is an edge case where a minority holds the latest block, but DNS sync always syncs to the chain with the maximum vote. So currently there is no easy fix within the DNS sync design.
But this problem can be solved in decentralized sync with sync stream (https://github.com/harmony-one/harmony/pull/3535), where sync always advances to the highest block number as long as the block data is valid. For more reference on the decentralized sync logic, please check out the code here.
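To illustrate the design difference with a toy model (not the actual sync code; the tip values are made up to mirror the edge case described above): max-vote selection picks the tip most peers agree on, so a lone node holding the true latest block loses, while a highest-valid-block rule would pick it up:

```go
package main

import "fmt"

// peerTip is the chain tip a peer advertises.
type peerTip struct {
	height uint64
	hash   string
}

// maxVoteTip returns the tip advertised by the most peers (DNS-sync style).
func maxVoteTip(tips []peerTip) peerTip {
	votes := map[peerTip]int{}
	var best peerTip
	for _, t := range tips {
		votes[t]++
		if votes[t] > votes[best] {
			best = t
		}
	}
	return best
}

// highestValidTip returns the highest tip whose block data verifies
// (stream-sync style); verify stands in for real block validation.
func highestValidTip(tips []peerTip, verify func(peerTip) bool) peerTip {
	var best peerTip
	for _, t := range tips {
		if t.height > best.height && verify(t) {
			best = t
		}
	}
	return best
}

func main() {
	tips := []peerTip{
		{5865, "hashA"}, // minority: holds the latest block
		{5864, "hashB"}, {5864, "hashB"}, {5864, "hashB"}, // majority: one block behind
	}
	fmt.Println(maxVoteTip(tips).height) // → 5864: the minority's latest block loses the vote
	fmt.Println(highestValidTip(tips, func(peerTip) bool { return true }).height) // → 5865
}
```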
We shall leave this ticket open until we fix it. It contains an in-depth analysis, and we now roughly know how to reproduce this situation. We've seen this issue on testnet and mainnet once or twice.
Once we have the decentralized syncing feature, we shall test this scenario on STN.
Describe the bug
I was testing on STN. After an epoch change, one node on shard 1 could not catch up with its peers anymore. https://watchdog.hmny.io/report-stn#version-v6760-v3.1.1-15-gefb107e9
54.202.26.122
Consensus was ongoing on shard 1, but this node could never catch up. It seems to be stuck in some state.