Closed oleksandrSydorenkoJ closed 1 year ago
skaled: 3.17.0-beta.3
Actual for skalenetwork/schain:3.17.0-beta.3 Network: Regression skale chains:
skale_schain_howling-hoedus skale_schain_wet-third-darling-skale skale_schain_portly-passionate-sirius skale_schain_wan-tasty-kitalpha
Logs are the same as in the issue description
In stability tests, skaled is hanging in:
(gdb) bt
#0 __lll_clocklock_wait (futex=futex@entry=0x563050e34a78, val=val@entry=2, clockid=clockid@entry=0, abstime=abstime@entry=0x7f8167fedf80, private=0) at lll_timedlock_wait.c:59
#1 0x00007f81bc5619ed in __pthread_mutex_clocklock_common (mutex=0x563050e34a78, clockid=0, abstime=0x7f8167fedf80) at ../nptl/pthread_mutex_timedlock.c:102
#2 0x0000563049dbe185 in Schain::lockWithDeadLockCheck(char const*) ()
#3 0x0000563049dcba0a in Schain::blockCommitArrived(block_id, schain_index, std::shared_ptr<ThresholdSignature> const&, std::shared_ptr<ThresholdSignature>) ()
#4 0x0000563049dcd345 in Schain::finalizeDecidedAndSignedBlock(block_id, schain_index, std::shared_ptr<ThresholdSignature> const&) ()
#5 0x0000563049d942e8 in BlockConsensusAgent::processBlockSignMessage(std::shared_ptr<BlockSignBroadcastMessage> const&) ()
#6 0x0000563049d99743 in BlockConsensusAgent::routeAndProcessMessage(std::shared_ptr<MessageEnvelope> const&) ()
#7 0x0000563049dbfa56 in Schain::messageThreadProcessingLoop(Schain*) ()
#8 0x00007f81bbe57de4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9 0x00007f81bc55e609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#10 0x00007f81bbb44133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Investigation shows that mutex is already held by the same thread that is trying to aquire it (msgThreadProcLoop)
According to logs, seems that block 974 is being finalized at the time, when consensus for block 980 is in progress:
[2023-06-26 16:04:02.872] [14:main] [info] 978:BLOCK_COMMITED: PRPSR:10:BID: 979:ROOT:36577835925986395181903272758699247291699113671171082077337267100898101223037:HASH:f847a0a4:BLOCK_TXS:1:DMSG:3336:TPRPS:18:MPRPS:4:RPRPS:0:TXS:80:TXLS:18:MGS:6863:INSTS:178:BPS:0:HDRS:6836:SOCK:2:FDS:339:PRT:0:BTA:1096:BSA:0:TPS:0:LWT:0:LRT:0:LWC:2660:LRC:2468:KNWN:1:CONS:0:DSDS:256:SET:0:SBT:0:SEC:24:SBC:8:ZSC:1:EPT:447:STAMP:1687795436.775
2023-06-26 16:04:03.266856 Block sealed #979 (#06dd78b7…)
[2023-06-26 16:04:13.503] [14:main] [debug] 979:PROPOSING BLOCK NUMBER:980
[2023-06-26 16:04:13.953] [14:consensus] [info] 979:BLOCK_DECIDED_AND_SIGNED:PRPSR:1:BID:974:SIG:20892137513370084295824968989061188211391615941869555962928829914852891107424:8856190432790389831680910683801609919568622451199713342913232356240048977533:20490500875327635338370796894778739133464062598808202104049555347224932396379:0
[2023-06-26 16:04:14.294] [14:consensus] [info] 979:BLOCK_DECIDED_AND_SIGNED:PRPSR:1:BID:975:SIG:2952194333437052141251647092380833939100918507124520684088091640936555683068:1674420779825680358564702022181517551877178729194593672321596027415538175538:20137123426698495255190511070911827066664925687651983158680364777219850998448:0
[2023-06-26 16:04:14.614] [14:consensus] [info] 979:BLOCK_DECIDED_AND_SIGNED:PRPSR:1:BID:976:SIG:3868667100210540792425723086730981975074261162357015439958527438236511666800:6714417471158721971694694564736902903759571342154901373511771098966496232359:17540808435982603921592013175894781330057869400733811994488892245707331161328:2
[2023-06-26 16:04:14.953] [14:consensus] [info] 979:BLOCK_DECIDED_AND_SIGNED:PRPSR:9:BID:977:SIG:17374596522146186259209972467124310425481132788499160144058703737094175324505:4323158809886881107092045657799011426046423185742289601005073916157381238441:12664131547456714996303432010995275417537913449798973089526879041411657779941:13
[2023-06-26 16:04:15.290] [14:consensus] [info] 979:BLOCK_DECIDED_AND_SIGNED:PRPSR:7:BID:978:SIG:7474791963182931724862399790060646907714757952299654171236525528377423338223:12337842090621270094487375796522891242371204571755673525000930502017457701627:12020284420819102636365187929448113983196205656618692970939226169587519239308:0
[2023-06-26 16:04:15.600] [14:consensus] [info] 979:BLOCK_DECIDED_AND_SIGNED:PRPSR:10:BID:979:SIG:21773760400719147695814554242925889601661503566397950795653597233411276829414:8291724388374780661798935221582983411686489382427032488474917455652321530646:16827152828499902221208043224260095130029370207576789499642265294819702938160:1
[2023-06-26 16:04:15.938] [14:consensus] [info] 979:BLOCK_DECIDED_AND_SIGNED:PRPSR:9:BID:980:SIG:15985519099672012041283088925611037962361475350059241602711649994409383392745:9890396918213389382988277775877196306870888604445491833285181950589066113230:19747462843521696003770892560394005104520742583388032376233823652099866951722:1
[2023-06-26 16:04:15.938] [14:main] [info] 979:FINALIZING_BLOCK:BID:980
[2023-06-26 16:04:34.047] [14:main] [info] 979:CONSENSUS_STARTED:PROPOSING: 0000000000000000
...
[2023-06-26 16:04:39.212] [config] [warning] Node:14:Thread:139790531925760:Schain::finalizeDecidedAndSignedBlock has been stuck for 23274 ms
[2023-06-26 16:04:39.212] [config] [warning] Node:14:Thread:139790531925760:Schain::blockCommitArrived has been stuck for 20251 ms
3.17.0-beta.4
Verified on Regression network Version: 3.17.0-beta.4 Skale chain name: wan-tasty-kitalpha Skaled works without stuck at least 5 hours
2023-06-30 11:19:00.518172 Skaled status: setExitState: StartAgain to true
2023-06-30 11:19:55.653122 Opened blockchain DB. Latest: #cfc07e6f… 1277875
2023-06-30 11:19:55.668743 Latest snapshots init: 1277584 1277875 -> 1277584
2023-06-30 11:19:55.668788 Init last snapshot creation time: 1688115602
2023-06-30 11:19:55.676204 Block sealed #1278072 (#de39c584…)
[2023-06-30 16:38:05.355] [7:main] [info] 1282249:CWT:2470:TLWT:10162:SBPT:266
2023-06-30 16:38:05.359461 createBlock ID = #1282250
Preconditions: Active schain medium type with MTM and IMA storageDestructionPatchTimestamp not reached
Version: 15 nodes with 3.16.1 (2.1.1 release) 1 node with schain:3.17.0-beta.1 (2.2 release)
To Reproduce
Expected behavior Skaled should mine blocks without any errors or stuck
Actual state: Skaled works for 1-12 hours after updating on existing chain and stuck. Then after 3 hours of stuck skaled restarts themself and Skale_admin restarts the skaled container
Logs: skaled_stuck_without_load.txt