skalenetwork / skale-consensus

Running at the very core of the SKL network, SKALE BFT consensus is a universal, modern, modular, high-performance, asynchronous, provably secure, agent-based Proof-of-Stake blockchain consensus engine written in C++17. It includes a provably secure embedded Oracle and is used by SKALE elastic blockchains. It is easy and flexible enough to serve as the basis for your own blockchain or smart contract platform. BLS signatures and Binary Asynchronous Consensus are the main building blocks.
https://docs.skale.network/technology/consensus-spec
GNU Affero General Public License v3.0

Skaled on the lagging node waits only 6 seconds to receive the batch of block #830

Open oleksandrSydorenkoJ opened 5 months ago

oleksandrSydorenkoJ commented 5 months ago

Describe the bug

Related to https://github.com/skalenetwork/skaled/issues/1669

When an archival node is spun up, startup consistently begins with catchup. With a large number of blocks, the sending node may need around 30 seconds to form the binary block batch, but skaled on the archival node (the receiver) expects to receive the blocks within just 6 seconds.
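The failure mode described here, a receiver that stops waiting before the sender has anything to send, can be sketched with a plain POSIX `poll()` read deadline. This is only an illustration under stated assumptions: `waitReadable` and `batchArrivesInTime` are hypothetical names, a pipe stands in for the TCP connection between the two nodes, and none of this is taken from the skaled or skale-consensus sources.

```cpp
#include <poll.h>
#include <unistd.h>

// Returns true if `fd` becomes readable within `timeoutMs` milliseconds,
// false if the deadline expires first or poll() fails.
bool waitReadable(int fd, int timeoutMs) {
    pollfd request{};
    request.fd = fd;
    request.events = POLLIN;
    int ready = poll(&request, 1, timeoutMs);
    return ready == 1 && (request.revents & POLLIN) != 0;
}

// Hypothetical demo: a pipe plays the role of the connection.
// `senderReady` models whether the sender finished preparing the batch
// before the receiver's read deadline expired.
bool batchArrivesInTime(bool senderReady, int timeoutMs) {
    int fds[2];
    if (pipe(fds) != 0) {
        return false;
    }
    bool writeOk = true;
    if (senderReady) {
        const char batch[] = "serialized-batch";
        writeOk = write(fds[1], batch, sizeof(batch)) > 0;
    }
    // The receiver gives up as soon as the deadline passes, no matter how
    // close the sender is to finishing.
    bool arrived = writeOk && waitReadable(fds[0], timeoutMs);
    close(fds[0]);
    close(fds[1]);
    return arrived;
}
```

With a sender that is still busy, `batchArrivesInTime(false, 50)` reports a timeout; once data is already waiting, `batchArrivesInTime(true, 50)` succeeds. The same reasoning applies at full scale: any read deadline shorter than the sender's preparation time fails on every attempt.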

Versions: skaled:3.18.0

Environment: an active Schain of MEDIUM type with at least 1 million blocks (almost no load, about 20k transactions in total), debug-behavior-apis enabled, and log level DEBUG; an archival node with an IP whitelisted for the chain and debug log level.

To Reproduce

  1. Init the archival node
    skale sync-node init --archive --catchup --historic-state init-env
  2. Check the skaled logs on the archival node
  3. Check the skaled logs on the core node where the archival node requested the batch of blocks

Expected behavior: Skaled should wait for the default 2 minutes to receive a large block batch.

Actual state: Skaled on the archival node (receiver) waits only 10 seconds, and the core node (sender) hangs for 30 seconds while serializing the binary batch of blocks.

Logs: The archival node:

[2024-03-04 15:25:41.425] [16:main] [error] 0:!Exception: CatchupClientAgent:Catchupc step 2: can not read catchup response
[2024-03-04 15:25:41.426] [16:main] [error] 0: !Caused by: nlohmann:Read catchup response:Could not read header len from:1.1.1.1
[2024-03-04 15:25:41.426] [16:main] [error] 0:  !Caused by: IO:Peer read timeout
[2024-03-04 15:25:52.689] [16:main] [error] 0:!Exception: CatchupClientAgent:Catchupc step 2: can not read catchup response
[2024-03-04 15:25:52.689] [16:main] [error] 0: !Caused by: nlohmann:Read catchup response:Could not read header len from:2.2.2.2
[2024-03-04 15:25:52.689] [16:main] [error] 0:  !Caused by: IO:Peer read timeout
[2024-03-04 15:26:03.697] [16:main] [error] 0:!Exception: CatchupClientAgent:Catchupc step 2: can not read catchup response
[2024-03-04 15:26:03.697] [16:main] [error] 0: !Caused by: nlohmann:Read catchup response:Could not read header len from:3.3.3.3
[2024-03-04 15:26:03.697] [16:main] [error] 0:  !Caused by: IO:Peer read timeout
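To see how often the receiver gives up, the gaps between the three failed attempts can be computed from the timestamps in the log above. The helper names `millisOfDay` and `gapsMs` are hypothetical, and the sketch assumes all lines fall on the same day. Each gap covers everything between two consecutive errors (connecting to the next peer, waiting for the response, any pause between attempts), so it is only an upper bound on the read timeout itself.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Parses the leading "[YYYY-MM-DD HH:MM:SS.mmm]" stamp of a log line and
// returns milliseconds since midnight, or -1 if the line has no such stamp.
long millisOfDay(const std::string& line) {
    int year, month, day, hour, minute, second, millis;
    int matched = std::sscanf(line.c_str(), "[%d-%d-%d %d:%d:%d.%d]",
                              &year, &month, &day, &hour, &minute, &second, &millis);
    if (matched != 7) {
        return -1;
    }
    return ((hour * 60L + minute) * 60L + second) * 1000L + millis;
}

// Returns the gaps, in milliseconds, between consecutive timestamped lines.
// Lines without a timestamp are skipped.
std::vector<long> gapsMs(const std::vector<std::string>& lines) {
    std::vector<long> gaps;
    long previous = -1;
    for (const auto& line : lines) {
        long current = millisOfDay(line);
        if (current < 0) {
            continue;
        }
        if (previous >= 0) {
            gaps.push_back(current - previous);
        }
        previous = current;
    }
    return gaps;
}
```

Fed the first line of each of the three attempts (15:25:41.425, 15:25:52.689, 15:26:03.697), `gapsMs` returns 11264 and 11008, so the receiver moves on to the next peer roughly every 11 seconds.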

The core node: 3_18_0_skaled_core_prepare_serialized_batch.txt

PolinaKiporenko commented 5 months ago

@oleksandrSydorenkoJ please check on version 3.17.1

oleksandrSydorenkoJ commented 5 months ago

The same result on 3.17.1:

[2024-03-05 12:14:14.354] [config] [warning] Node:21:Thread:140545280173824:ptr<std::createBlockCatchupResponse has been stuck for 24211 ms
[2024-03-05 12:14:15.354] [config] [warning] Node:21:Thread:140545280173824:CatchupServerAgent::processNextAvailableConnection has been stuck for 25264 ms
[2024-03-05 12:14:15.354] [config] [warning] Node:21:Thread:140545280173824:ptr<std::createBlockCatchupResponse has been stuck for 25211 ms
[2024-03-05 12:14:15.668] [21:main] [info] 1069569:RETURNED_CATCHUP_BLOCKS:15062:CRT:25525
[2024-03-05 12:14:15.726] [21:main] [error] 1069569:!Exception: CatchupServerAgent:Could not send serialized binary
[2024-03-05 12:14:15.726] [21:main] [error] 1069569: !Caused by: IO:Destination unexpectedly closed connection
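The `!Exception` / `!Caused by` pairs in these logs read like a nested exception chain: the sender finishes building the batch, tries to send it, and finds that the receiver has already closed the connection. A minimal sketch of how such a chain can be built and flattened with standard C++ (`std::throw_with_nested`, `std::rethrow_if_nested`) follows. The function names are hypothetical and this is not how skale-consensus's own exception classes are implemented; only the message texts are copied from the log.

```cpp
#include <cstddef>
#include <exception>
#include <stdexcept>
#include <string>

// Appends e.what() to `out`, then recurses into the nested cause (if any),
// indenting one extra space per level like the log lines above.
void appendChain(const std::exception& e, std::string& out, int depth = 0) {
    out += std::string(static_cast<std::size_t>(depth), ' ');
    out += (depth == 0 ? "!Exception: " : "!Caused by: ");
    out += e.what();
    out += "\n";
    try {
        std::rethrow_if_nested(e);
    } catch (const std::exception& cause) {
        appendChain(cause, out, depth + 1);
    }
}

// Hypothetical reproduction of the sender-side failure: the low-level IO
// error is wrapped in a higher-level agent error before being reported.
std::string describeSendFailure() {
    std::string out;
    try {
        try {
            throw std::runtime_error("IO:Destination unexpectedly closed connection");
        } catch (...) {
            std::throw_with_nested(
                std::runtime_error("CatchupServerAgent:Could not send serialized binary"));
        }
    } catch (const std::exception& e) {
        appendChain(e, out);
    }
    return out;
}
```

Reading the chain from the innermost cause outward is usually the quickest way to find the root problem. Here the innermost message shows that the peer closed the connection, which fits a receiver that timed out while the batch was still being prepared.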
PolinaKiporenko commented 5 months ago

To unblock QA, prepare a custom build with the timeout increased to 8.
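Rather than rebuilding for every timeout value, one option is to read the value from configuration with a safe fallback. The sketch below is an assumption-heavy illustration: `readTimeoutFrom` is a hypothetical helper, the value is treated as seconds, and no such setting is known to exist in skaled.

```cpp
#include <charconv>
#include <chrono>
#include <cstring>
#include <system_error>

// Parses `rawValue` as a positive whole number of seconds.
// Returns `fallback` when the value is missing, malformed, or not positive.
std::chrono::seconds readTimeoutFrom(const char* rawValue, std::chrono::seconds fallback) {
    if (rawValue == nullptr) {
        return fallback;
    }
    long parsed = 0;
    const char* end = rawValue + std::strlen(rawValue);
    auto [stop, error] = std::from_chars(rawValue, end, parsed);
    // Reject partial parses such as "8s" so that typos do not silently apply.
    if (error != std::errc{} || stop != end || parsed <= 0) {
        return fallback;
    }
    return std::chrono::seconds(parsed);
}
```

A caller could pass in the result of `std::getenv` on a variable of its choosing, for example `readTimeoutFrom(std::getenv("CATCHUP_READ_TIMEOUT_SEC"), std::chrono::seconds(6))`, where the variable name is invented for this example and the fallback of 6 mirrors the wait reported in this issue.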