ggarri opened this issue 5 years ago
Tendermint headers are populated in the following part of the Tendermint code:
github.com/tendermint/tendermint/state/state.go:MakeBlock()
```go
if height == 1 {
	timestamp = state.LastBlockTime // genesis time
} else {
	timestamp = MedianTime(commit, state.LastValidators)
}

// Fill rest of header with state data.
block.Header.Populate(
	state.Version.Consensus, state.ChainID,
	timestamp, state.LastBlockID, state.LastBlockTotalTx+block.NumTxs,
	state.Validators.Hash(), state.NextValidators.Hash(),
	state.ConsensusParams.Hash(), state.AppHash, state.LastResultsHash,
	proposerAddress,
)
```
Due to the design of the Tendermint ABCI interface, it is not possible to reject a block during consensus when the application layer does not consider it valid. In my opinion that is a clear limitation of Tendermint, so I created an issue requesting that feature: https://github.com/tendermint/tendermint/issues/3755
The block time is populated with the median time of every validator's precommit, weighted by each validator's voting power, as we see in the following implementation:
```go
func MedianTime(commit *types.Commit, validators *types.ValidatorSet) time.Time {
	weightedTimes := make([]*tmtime.WeightedTime, len(commit.Precommits))
	totalVotingPower := int64(0)
	for i, vote := range commit.Precommits {
		if vote != nil {
			_, validator := validators.GetByIndex(vote.ValidatorIndex)
			totalVotingPower += validator.VotingPower
			weightedTimes[i] = tmtime.NewWeightedTime(vote.Timestamp, validator.VotingPower)
		}
	}
	return tmtime.WeightedMedian(weightedTimes, totalVotingPower)
}
```
According to the answer given by the Tendermint team in the issue above (https://github.com/tendermint/tendermint/issues/3755), it should be IMPOSSIBLE to generate two blocks within the same second if time_iota_ms >= 1000. BUT we were able to reproduce it even with the "right setup".
The time_iota_ms value is used in the voteTime() method as an addition to the proposal block time, as we see in the following code:
```go
func (cs *ConsensusState) voteTime() time.Time {
	now := tmtime.Now()
	minVoteTime := now
	// TODO: We should remove next line in case we don't vote for v in case cs.ProposalBlock == nil,
	// even if cs.LockedBlock != nil. See https://github.com/tendermint/spec.
	timeIotaMs := time.Duration(cs.state.ConsensusParams.Block.TimeIotaMs) * time.Millisecond
	if cs.LockedBlock != nil {
		// See the BFT time spec https://tendermint.com/docs/spec/consensus/bft-time.html
		minVoteTime = cs.LockedBlock.Time.Add(timeIotaMs)
	} else if cs.ProposalBlock != nil {
		minVoteTime = cs.ProposalBlock.Time.Add(timeIotaMs)
	}
	if now.After(minVoteTime) {
		return now
	}
	return minVoteTime
}
```
At the moment the invalid block was created, the consensus setup was as follows:
```toml
timeout_propose = "2s"
timeout_propose_delta = "500ms"
timeout_prevote = "2s"
timeout_prevote_delta = "500ms"
timeout_precommit = "2s"
timeout_precommit_delta = "500ms"
timeout_commit = "2s"
blocktime_iota = "1s"
```
Therefore it was possible that a timeout situation overrode the "default behaviour" of time_iota_ms = 1000 mentioned above. BUT according to the Tendermint documentation, those timeout values are only applied once +2/3 of the votes have been received, as verified in the following code:
github.com/tendermint/tendermint/config
```go
// Propose returns the amount of time to wait for a proposal
func (cfg *ConsensusConfig) Propose(round int) time.Duration {
	return time.Duration(
		cfg.TimeoutPropose.Nanoseconds()+cfg.TimeoutProposeDelta.Nanoseconds()*int64(round),
	) * time.Nanosecond
}

// Prevote returns the amount of time to wait for straggler votes after receiving any +2/3 prevotes
func (cfg *ConsensusConfig) Prevote(round int) time.Duration {
	return time.Duration(
		cfg.TimeoutPrevote.Nanoseconds()+cfg.TimeoutPrevoteDelta.Nanoseconds()*int64(round),
	) * time.Nanosecond
}

// Precommit returns the amount of time to wait for straggler votes after receiving any +2/3 precommits
func (cfg *ConsensusConfig) Precommit(round int) time.Duration {
	return time.Duration(
		cfg.TimeoutPrecommit.Nanoseconds()+cfg.TimeoutPrecommitDelta.Nanoseconds()*int64(round),
	) * time.Nanosecond
}

// Commit returns the amount of time to wait for straggler votes after receiving +2/3 precommits for a single block (ie. a commit).
func (cfg *ConsensusConfig) Commit(t time.Time) time.Time {
	return t.Add(cfg.TimeoutCommit)
}
```
Therefore we could rule out that the issue was caused by low values on the timeout_* settings.
After a team discussion about how to solve this issue, we reached the following conclusions: based on the points above, the solution cannot involve complex or hacky code, and it has to rely on the Tendermint protocol. Nor can we remove the entire Ethereum header verification.
The proposed solution is:
After the fix was applied, we identified another time correction in the Ethereum code, which forces nodes to sleep when blocks are timestamped in the future. This improves the proposed solution, as it slows the node down and adjusts block times without letting the block time drift too far into the future (a maximum of 4 seconds).
github.com/ethereum/go-ethereum/miner/worker.go
```go
func (w *worker) commitNewWork(interrupt *int32, noempty bool, timestamp int64) {
	w.mu.RLock()
	defer w.mu.RUnlock()

	tstart := time.Now()
	parent := w.chain.CurrentBlock()

	if parent.Time() >= uint64(timestamp) {
		timestamp = int64(parent.Time() + 1)
	}
	// this will ensure we're not going off too far in the future
	if now := time.Now().Unix(); timestamp > now+1 {
		wait := time.Duration(timestamp-now) * time.Second
		log.Info("Mining too far in the future", "wait", common.PrettyDuration(wait))
		time.Sleep(wait)
	}
	// ...
```
Lightchain uses Ethereum as the blockchain storage, and because of that our application needs to comply with Ethereum's restrictions. One of them is that a block's timestamp cannot equal its parent's: two consecutive blocks cannot be created within the same second, otherwise the check fails with the error shown in the log output below and causes a consensus failure from which the node cannot recover.

Log output
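The failing restriction boils down to a strict monotonicity check on header timestamps. A minimal sketch of that kind of check (a hypothetical simplified version for illustration, not the actual go-ethereum header validation):

```go
package main

import (
	"errors"
	"fmt"
)

// errTimestampNotGreater stands in for the consensus error raised when a
// child block does not advance the clock past its parent.
var errTimestampNotGreater = errors.New("timestamp equals parent's")

// verifyTimestamp is a simplified stand-in for the header check:
// each block's Unix timestamp must be strictly greater than its parent's.
func verifyTimestamp(parent, child uint64) error {
	if child <= parent {
		return errTimestampNotGreater
	}
	return nil
}

func main() {
	fmt.Println(verifyTimestamp(1561975200, 1561975200)) // same second: rejected
	fmt.Println(verifyTimestamp(1561975200, 1561975201)) // next second: accepted
}
```

So any Tendermint commit that yields two blocks with the same second-resolution timestamp is guaranteed to be rejected on the Ethereum side.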
How to reproduce it
To reproduce this issue in a safe manner, we are going to use a Standalone network and apply the following consensus config values: change the genesis or the consensus params to use a time_iota_ms of less than 1000ms. After that we run the workload test as follows. It may require more than one attempt.
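For example, a genesis fragment like the following makes the collision easy to hit (the field layout follows Tendermint's genesis consensus_params; the 100ms value is illustrative, any value below 1000 should do):

```json
{
  "consensus_params": {
    "block": {
      "time_iota_ms": "100"
    }
  }
}
```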
Sample wal logs of failure