skalenetwork / skale-consensus

Running at the very core of the SKL network, SKALE BFT consensus is a universal, modern, modular, high-performance, asynchronous, provably secure, agent-based Proof-of-Stake blockchain consensus engine written in C++17. It includes a provably secure embedded Oracle and is used by SKALE elastic blockchains. It is easy and flexible enough to implement your own blockchain or smart contract platform. BLS signatures and Binary Asynchronous Consensus are the main building blocks.
https://docs.skale.network/technology/consensus-spec
GNU Affero General Public License v3.0

Consensus restarts after 25 minutes instead of the 3-hour interval after start #828

Open oleksandrSydorenkoJ opened 6 months ago

oleksandrSydorenkoJ commented 6 months ago

Describe the bug: Consensus has two built-in timers that trigger an automatic restart when the node is disconnected from the majority of nodes.

  1. STUCK_RESTART_INTERVAL_MS - triggers after 3 hours from the last mined block.
  2. HEALTHCHECK_ON_START_RETRY_TIME_SEC - starts after the Skaled launch and lasts for 1500 seconds.
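To make the interaction of the two timers concrete, here is a minimal sketch. The constant names and durations come from this report; the actual definitions in the skale-consensus sources may differ. `healthcheckWindowMs`, `RestartCause`, `RestartDecision` and `nextRestartAfterStart` are hypothetical names used only to model the reported behaviour, not the real implementation.

```cpp
#include <cstdint>

// Durations as stated in this issue; the real definitions may differ.
constexpr std::uint64_t STUCK_RESTART_INTERVAL_MS = 3ULL * 60 * 60 * 1000;  // 3 hours
constexpr std::uint64_t HEALTHCHECK_ON_START_RETRY_TIME_SEC = 1500;         // 25 minutes

// Hypothetical helper: the healthcheck window expressed in milliseconds.
constexpr std::uint64_t healthcheckWindowMs() {
    return HEALTHCHECK_ON_START_RETRY_TIME_SEC * 1000;
}

// Hypothetical model: which timer ends a freshly started skaled that never
// reaches the majority of nodes.
enum class RestartCause { HealthcheckOnStart, StuckConsensus };

struct RestartDecision {
    RestartCause cause;
    std::uint64_t afterMs;
};

// Both timers are running after a start, so the shorter deadline wins.
constexpr RestartDecision nextRestartAfterStart() {
    return healthcheckWindowMs() < STUCK_RESTART_INTERVAL_MS
               ? RestartDecision{RestartCause::HealthcheckOnStart, healthcheckWindowMs()}
               : RestartDecision{RestartCause::StuckConsensus, STUCK_RESTART_INTERVAL_MS};
}
```

Under these assumptions the healthcheck window (1,500,000 ms) is always shorter than the stuck interval (10,800,000 ms), which matches the 25-minute restarts described in this report.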

If consensus loses the majority and restarts after 3 hours, the next Skaled restart is triggered by HEALTHCHECK_ON_START_RETRY_TIME_SEC, that is, after 25 minutes rather than 3 hours. This complicates chain recovery after a crash: downloading a large snapshot may be physically impossible within 25 minutes, yet possible within 3 hours. The result of https://github.com/skalenetwork/internal-support/issues/51

Note: Every Skaled instance that was restarted automatically without the majority of nodes will be restarted again in 3 hours. Every Skaled instance that was restarted with the majority of nodes after /issues/51 will be restarted every 25 minutes.

Preconditions: An active medium-type schain (16 nodes). At least 1 chain on the node.

Version: skalenetwork/schain:3.17.1, skalenetwork/schain:3.18.0-beta.0

Steps to reproduce

  1. Stop 6 containers on the schain
  2. Wait for 3 hours and restart one of the 10 active containers, on Node A
  3. Wait for 25 minutes and check the skaled logs of the restarted container on Node A
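Why stopping 6 of 16 containers leaves the chain without a majority can be checked with a little arithmetic. The sketch below assumes that "majority" means the usual BFT supermajority of more than two thirds of the nodes; this report does not state the exact threshold, and `requiredMajority` and `hasMajority` are hypothetical helper names.

```cpp
#include <cstdint>

// Assumption: the majority is a BFT supermajority, i.e. strictly more than
// 2/3 of all nodes. Integer division rounds down, so +1 gives the threshold.
constexpr std::uint32_t requiredMajority(std::uint32_t totalNodes) {
    return 2 * totalNodes / 3 + 1;
}

// True if the number of active nodes reaches the assumed threshold.
constexpr bool hasMajority(std::uint32_t activeNodes, std::uint32_t totalNodes) {
    return activeNodes >= requiredMajority(totalNodes);
}

// 16-node chain with 6 containers stopped: 10 active nodes, 11 required.
static_assert(requiredMajority(16) == 11, "16 nodes need 11 for a supermajority");
static_assert(!hasMajority(16 - 6, 16), "10 active nodes are not enough");
```

With this threshold, the 10 remaining containers fall one node short, so Node A restarts into a chain that cannot mine blocks.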

Expected behavior: Consensus should wait 3 hours before restarting itself if there is no majority of active nodes.

Actual state: Consensus restarts after 25 minutes on Node A when there is no majority of nodes.

message (40).txt

kladkogex commented 5 months ago

Moving to 2.5 as we don't have time for it in 2.4.