skalenetwork / skaled

Running more than 20 production blockchains, SKALED is Ethereum-compatible, high performance C++ Proof-of-Stake client, tools and libraries. Uses SKALE consensus as a blockchain consensus core. Includes dynamic Oracle. Implements file storage and retrieval as an EVM extension.
https://skale.network
GNU General Public License v3.0
84 stars 40 forks source link

Skaled stucks without load #1558

Closed oleksandrSydorenkoJ closed 1 year ago

oleksandrSydorenkoJ commented 1 year ago

Preconditions: Active schain medium type with MTM and IMA storageDestructionPatchTimestamp not reached

Version: 15 nodes with 3.16.1 (2.1.1 release) 1 node with schain:3.17.0-beta.1 (2.2 release)

To Reproduce

  1. Update existing skaled to version 3.17.0-beta.1
  2. wait for the 12 hours without load
  3. Check skaled container

Expected behavior Skaled should mine blocks without any errors or stuck

Actual state: Skaled works for 1-12 hours after updating on existing chain and stuck. Then after 3 hours of stuck skaled restarts themself and Skale_admin restarts the skaled container

Logs: skaled_stuck_without_load.txt

DmytroNazarenko commented 1 year ago

skaled: 3.17.0-beta.3

oleksandrSydorenkoJ commented 1 year ago

Actual for skalenetwork/schain:3.17.0-beta.3 Network: Regression skale chains:

skale_schain_howling-hoedus skale_schain_wet-third-darling-skale skale_schain_portly-passionate-sirius skale_schain_wan-tasty-kitalpha

Logs are the same as in the issue description

dimalit commented 1 year ago

In stability tests, skaled is hanging in:

(gdb) bt
#0  __lll_clocklock_wait (futex=futex@entry=0x563050e34a78, val=val@entry=2, clockid=clockid@entry=0, abstime=abstime@entry=0x7f8167fedf80, private=0) at lll_timedlock_wait.c:59
#1  0x00007f81bc5619ed in __pthread_mutex_clocklock_common (mutex=0x563050e34a78, clockid=0, abstime=0x7f8167fedf80) at ../nptl/pthread_mutex_timedlock.c:102
#2  0x0000563049dbe185 in Schain::lockWithDeadLockCheck(char const*) ()
#3  0x0000563049dcba0a in Schain::blockCommitArrived(block_id, schain_index, std::shared_ptr<ThresholdSignature> const&, std::shared_ptr<ThresholdSignature>) ()
#4  0x0000563049dcd345 in Schain::finalizeDecidedAndSignedBlock(block_id, schain_index, std::shared_ptr<ThresholdSignature> const&) ()
#5  0x0000563049d942e8 in BlockConsensusAgent::processBlockSignMessage(std::shared_ptr<BlockSignBroadcastMessage> const&) ()
#6  0x0000563049d99743 in BlockConsensusAgent::routeAndProcessMessage(std::shared_ptr<MessageEnvelope> const&) ()
#7  0x0000563049dbfa56 in Schain::messageThreadProcessingLoop(Schain*) ()
#8  0x00007f81bbe57de4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f81bc55e609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#10 0x00007f81bbb44133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Investigation shows that mutex is already held by the same thread that is trying to aquire it (msgThreadProcLoop)

dimalit commented 1 year ago

According to logs, seems that block 974 is being finalized at the time, when consensus for block 980 is in progress:

[2023-06-26 16:04:02.872] [14:main] [info] 978:BLOCK_COMMITED: PRPSR:10:BID: 979:ROOT:36577835925986395181903272758699247291699113671171082077337267100898101223037:HASH:f847a0a4:BLOCK_TXS:1:DMSG:3336:TPRPS:18:MPRPS:4:RPRPS:0:TXS:80:TXLS:18:MGS:6863:INSTS:178:BPS:0:HDRS:6836:SOCK:2:FDS:339:PRT:0:BTA:1096:BSA:0:TPS:0:LWT:0:LRT:0:LWC:2660:LRC:2468:KNWN:1:CONS:0:DSDS:256:SET:0:SBT:0:SEC:24:SBC:8:ZSC:1:EPT:447:STAMP:1687795436.775
2023-06-26 16:04:03.266856   Block sealed #979 (#06dd78b7…)
[2023-06-26 16:04:13.503] [14:main] [debug] 979:PROPOSING BLOCK NUMBER:980
[2023-06-26 16:04:13.953] [14:consensus] [info] 979:BLOCK_DECIDED_AND_SIGNED:PRPSR:1:BID:974:SIG:20892137513370084295824968989061188211391615941869555962928829914852891107424:8856190432790389831680910683801609919568622451199713342913232356240048977533:20490500875327635338370796894778739133464062598808202104049555347224932396379:0
[2023-06-26 16:04:14.294] [14:consensus] [info] 979:BLOCK_DECIDED_AND_SIGNED:PRPSR:1:BID:975:SIG:2952194333437052141251647092380833939100918507124520684088091640936555683068:1674420779825680358564702022181517551877178729194593672321596027415538175538:20137123426698495255190511070911827066664925687651983158680364777219850998448:0
[2023-06-26 16:04:14.614] [14:consensus] [info] 979:BLOCK_DECIDED_AND_SIGNED:PRPSR:1:BID:976:SIG:3868667100210540792425723086730981975074261162357015439958527438236511666800:6714417471158721971694694564736902903759571342154901373511771098966496232359:17540808435982603921592013175894781330057869400733811994488892245707331161328:2
[2023-06-26 16:04:14.953] [14:consensus] [info] 979:BLOCK_DECIDED_AND_SIGNED:PRPSR:9:BID:977:SIG:17374596522146186259209972467124310425481132788499160144058703737094175324505:4323158809886881107092045657799011426046423185742289601005073916157381238441:12664131547456714996303432010995275417537913449798973089526879041411657779941:13
[2023-06-26 16:04:15.290] [14:consensus] [info] 979:BLOCK_DECIDED_AND_SIGNED:PRPSR:7:BID:978:SIG:7474791963182931724862399790060646907714757952299654171236525528377423338223:12337842090621270094487375796522891242371204571755673525000930502017457701627:12020284420819102636365187929448113983196205656618692970939226169587519239308:0
[2023-06-26 16:04:15.600] [14:consensus] [info] 979:BLOCK_DECIDED_AND_SIGNED:PRPSR:10:BID:979:SIG:21773760400719147695814554242925889601661503566397950795653597233411276829414:8291724388374780661798935221582983411686489382427032488474917455652321530646:16827152828499902221208043224260095130029370207576789499642265294819702938160:1
[2023-06-26 16:04:15.938] [14:consensus] [info] 979:BLOCK_DECIDED_AND_SIGNED:PRPSR:9:BID:980:SIG:15985519099672012041283088925611037962361475350059241602711649994409383392745:9890396918213389382988277775877196306870888604445491833285181950589066113230:19747462843521696003770892560394005104520742583388032376233823652099866951722:1
[2023-06-26 16:04:15.938] [14:main] [info] 979:FINALIZING_BLOCK:BID:980
[2023-06-26 16:04:34.047] [14:main] [info] 979:CONSENSUS_STARTED:PROPOSING: 0000000000000000
...
[2023-06-26 16:04:39.212] [config] [warning] Node:14:Thread:139790531925760:Schain::finalizeDecidedAndSignedBlock has been stuck for 23274 ms
[2023-06-26 16:04:39.212] [config] [warning] Node:14:Thread:139790531925760:Schain::blockCommitArrived has been stuck for 20251 ms
DmytroNazarenko commented 1 year ago

3.17.0-beta.4

oleksandrSydorenkoJ commented 1 year ago

Verified on Regression network Version: 3.17.0-beta.4 Skale chain name: wan-tasty-kitalpha Skaled works without stuck at least 5 hours

2023-06-30 11:19:00.518172   Skaled status: setExitState: StartAgain to true
2023-06-30 11:19:55.653122   Opened blockchain DB. Latest: #cfc07e6f… 1277875
2023-06-30 11:19:55.668743   Latest snapshots init: 1277584 1277875 -> 1277584
2023-06-30 11:19:55.668788   Init last snapshot creation time: 1688115602
2023-06-30 11:19:55.676204   Block sealed #1278072 (#de39c584…)

[2023-06-30 16:38:05.355] [7:main] [info] 1282249:CWT:2470:TLWT:10162:SBPT:266
2023-06-30 16:38:05.359461   createBlock ID = #1282250