gnolang / gno

Gno: An interpreted, stack-based Go virtual machine to build succinct and composable apps + Gno.land: a blockchain for timeless code and fair open-source
https://gno.land/
Other
875 stars 355 forks source link

make `make _test.pkg.bft` consistent, especially on CI (`tm2/test`) #1320

Closed moul closed 3 months ago

moul commented 10 months ago

This is the test with the highest flappiness/flakiness.

Related with #202

gfanton commented 9 months ago

I've extracted two stack traces from two different CI logs.

stacktrace 1

``` 2023-11-15T01:48:36.1098642Z panic: test timed out after 20m0s 2023-11-15T01:48:36.1099607Z running tests: 2023-11-15T01:48:36.1107254Z TestSetValidBlockOnDelayedPrevote (19m39s) 2023-11-15T01:48:36.1108122Z TestStateProposerSelection0 (19m39s) 2023-11-15T01:48:36.1108524Z 2023-11-15T01:48:36.1108655Z goroutine 27498 [running]: 2023-11-15T01:48:36.1109331Z testing.(*M).startAlarm.func1() 2023-11-15T01:48:36.1109932Z /opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:2259 +0x3b9 2023-11-15T01:48:36.1110489Z created by time.goFunc 2023-11-15T01:48:36.1110948Z /opt/hostedtoolcache/go/1.21.3/x64/src/time/sleep.go:176 +0x2d 2023-11-15T01:48:36.1111344Z 2023-11-15T01:48:36.1111493Z goroutine 1 [chan receive, 18 minutes]: 2023-11-15T01:48:36.1111895Z testing.tRunner.func1() 2023-11-15T01:48:36.1112411Z /opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1561 +0x489 2023-11-15T01:48:36.1112989Z testing.tRunner(0xc0001851e0, 0xc00019bbb8) 2023-11-15T01:48:36.1113566Z /opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1601 +0x138 2023-11-15T01:48:36.1114269Z testing.runTests(0xc0001e1040?, {0xe3e840, 0x38, 0x38}, {0xc00022a4c0?, 0xc00022a480?, 0xe44980?}) 2023-11-15T01:48:36.1114990Z /opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:2052 +0x445 2023-11-15T01:48:36.1115467Z testing.(*M).Run(0xc0001e1040) 2023-11-15T01:48:36.1116047Z /opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1925 +0x636 2023-11-15T01:48:36.1116761Z github.com/gnolang/gno/tm2/pkg/bft/consensus.TestMain(0x9d6e6d?) 2023-11-15T01:48:36.1117440Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/replay_test.go:44 +0xf4 2023-11-15T01:48:36.1117943Z main.main() 2023-11-15T01:48:36.1118196Z _testmain.go:193 +0x1e6 2023-11-15T01:48:36.1118378Z 2023-11-15T01:48:36.1118502Z goroutine 18 [syscall, 19 minutes]: 2023-11-15T01:48:36.1118831Z os/signal.signal_recv() 2023-11-15T01:48:36.1119275Z /opt/hostedtoolcache/go/1.21.3/x64/src/runtime/sigqueue.go:152 +0x29 2023-11-15T01:48:36.1119733Z os/signal.loop() 2023-11-15T01:48:36.1120168Z /opt/hostedtoolcache/go/1.21.3/x64/src/os/signal/signal_unix.go:23 +0x13 2023-11-15T01:48:36.1120724Z created by os/signal.Notify.func1.1 in goroutine 8 2023-11-15T01:48:36.1121284Z /opt/hostedtoolcache/go/1.21.3/x64/src/os/signal/signal.go:151 +0x1f 2023-11-15T01:48:36.1121657Z 2023-11-15T01:48:36.1121830Z goroutine 53 [sync.RWMutex.RLock, 19 minutes]: 2023-11-15T01:48:36.1122391Z sync.runtime_SemacquireRWMutexR(0xaaf460?, 0x60?, 0xa?) 2023-11-15T01:48:36.1123096Z /opt/hostedtoolcache/go/1.21.3/x64/src/runtime/sema.go:82 +0x25 2023-11-15T01:48:36.1123663Z sync.(*RWMutex).RLock(...) 2023-11-15T01:48:36.1124079Z /opt/hostedtoolcache/go/1.21.3/x64/src/sync/rwmutex.go:71 2023-11-15T01:48:36.1124784Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).GetRoundState(0xc00017c000) 2023-11-15T01:48:36.1125976Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:228 +0x78 2023-11-15T01:48:36.1126740Z github.com/gnolang/gno/tm2/pkg/bft/consensus.TestStateProposerSelection0(0xc0002c5380) 2023-11-15T01:48:36.1127702Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state_test.go:84 +0x23e 2023-11-15T01:48:36.1128237Z testing.tRunner(0xc0002c5380, 0xa19c80) 2023-11-15T01:48:36.1128744Z /opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1595 +0xff 2023-11-15T01:48:36.1129251Z created by testing.(*T).Run in goroutine 1 2023-11-15T01:48:36.1129779Z /opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1648 +0x3ad 2023-11-15T01:48:36.1130148Z 2023-11-15T01:48:36.1130307Z goroutine 67 [sync.RWMutex.RLock, 19 minutes]: 2023-11-15T01:48:36.1130773Z sync.runtime_SemacquireRWMutexR(0x20?, 0x1?, 0xc00002ac40?) 2023-11-15T01:48:36.1131333Z /opt/hostedtoolcache/go/1.21.3/x64/src/runtime/sema.go:82 +0x25 2023-11-15T01:48:36.1131789Z sync.(*RWMutex).RLock(...) 2023-11-15T01:48:36.1132186Z /opt/hostedtoolcache/go/1.21.3/x64/src/sync/rwmutex.go:71 2023-11-15T01:48:36.1132864Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).GetRoundState(0xc0000fa000) 2023-11-15T01:48:36.1133644Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:228 +0x78 2023-11-15T01:48:36.1134434Z github.com/gnolang/gno/tm2/pkg/bft/consensus.TestSetValidBlockOnDelayedPrevote(0xc0002e0000?) 2023-11-15T01:48:36.1135275Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state_test.go:1093 +0x458 2023-11-15T01:48:36.1135818Z testing.tRunner(0xc0002e0340, 0xa19c00) 2023-11-15T01:48:36.1136444Z /opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1595 +0xff 2023-11-15T01:48:36.1136948Z created by testing.(*T).Run in goroutine 1 2023-11-15T01:48:36.1137471Z /opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1648 +0x3ad 2023-11-15T01:48:36.1137840Z 2023-11-15T01:48:36.1137974Z goroutine 101 [chan receive, 19 minutes]: 2023-11-15T01:48:36.1138536Z github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).setupCloseHandler.func1() 2023-11-15T01:48:36.1139249Z /home/runner/work/gno/gno/tm2/pkg/autofile/autofile_unix.go:17 +0x47 2023-11-15T01:48:36.1140031Z created by github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).setupCloseHandler in goroutine 100 2023-11-15T01:48:36.1141070Z /home/runner/work/gno/gno/tm2/pkg/autofile/autofile_unix.go:16 +0xc5 2023-11-15T01:48:36.1141451Z 2023-11-15T01:48:36.1141556Z goroutine 102 [select]: 2023-11-15T01:48:36.1142054Z github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).closeFileRoutine(0xc000298d20) 2023-11-15T01:48:36.1142865Z /home/runner/work/gno/gno/tm2/pkg/autofile/autofile.go:93 +0x65 2023-11-15T01:48:36.1143607Z created by github.com/gnolang/gno/tm2/pkg/autofile.OpenAutoFile in goroutine 100 2023-11-15T01:48:36.1144297Z /home/runner/work/gno/gno/tm2/pkg/autofile/autofile.go:75 +0x196 2023-11-15T01:48:36.1144660Z 2023-11-15T01:48:36.1144763Z goroutine 103 [select]: 2023-11-15T01:48:36.1145260Z github.com/gnolang/gno/tm2/pkg/bft/wal.(*baseWAL).processFlushTicks(0xc0001e80e0) 2023-11-15T01:48:36.1145921Z /home/runner/work/gno/gno/tm2/pkg/bft/wal/wal.go:152 +0x5d 2023-11-15T01:48:36.1146584Z created by github.com/gnolang/gno/tm2/pkg/bft/wal.(*baseWAL).OnStart in goroutine 100 2023-11-15T01:48:36.1147260Z /home/runner/work/gno/gno/tm2/pkg/bft/wal/wal.go:146 +0xbf 2023-11-15T01:48:36.1147588Z 2023-11-15T01:48:36.1147727Z goroutine 85 [chan receive, 19 minutes]: 2023-11-15T01:48:36.1148438Z github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).setupCloseHandler.func1() 2023-11-15T01:48:36.1149439Z /home/runner/work/gno/gno/tm2/pkg/autofile/autofile_unix.go:17 +0x47 2023-11-15T01:48:36.1150225Z created by github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).setupCloseHandler in goroutine 84 2023-11-15T01:48:36.1151018Z /home/runner/work/gno/gno/tm2/pkg/autofile/autofile_unix.go:16 +0xc5 2023-11-15T01:48:36.1151481Z 2023-11-15T01:48:36.1151592Z goroutine 86 [select]: 2023-11-15T01:48:36.1152087Z github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).closeFileRoutine(0xc000330910) 2023-11-15T01:48:36.1152799Z /home/runner/work/gno/gno/tm2/pkg/autofile/autofile.go:93 +0x65 2023-11-15T01:48:36.1153476Z created by github.com/gnolang/gno/tm2/pkg/autofile.OpenAutoFile in goroutine 84 2023-11-15T01:48:36.1154596Z /home/runner/work/gno/gno/tm2/pkg/autofile/autofile.go:75 +0x196 2023-11-15T01:48:36.1154964Z 2023-11-15T01:48:36.1155092Z goroutine 104 [select, 19 minutes]: 2023-11-15T01:48:36.1155769Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*timeoutTicker).timeoutRoutine(0xc000318660) 2023-11-15T01:48:36.1156646Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/ticker.go:96 +0x18a 2023-11-15T01:48:36.1157533Z created by github.com/gnolang/gno/tm2/pkg/bft/consensus.(*timeoutTicker).OnStart in goroutine 100 2023-11-15T01:48:36.1158456Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/ticker.go:51 +0x8e 2023-11-15T01:48:36.1158903Z 2023-11-15T01:48:36.1159030Z goroutine 105 [select, 10 minutes]: 2023-11-15T01:48:36.1159819Z github.com/gnolang/gno/tm2/pkg/events.SubscribeFiltered.SubscribeFilteredOn.func1({0x95ce80, 0xc0001e0000}) 2023-11-15T01:48:36.1160751Z /home/runner/work/gno/gno/tm2/pkg/events/subscribe.go:52 +0x279 2023-11-15T01:48:36.1162007Z github.com/gnolang/gno/tm2/pkg/events.(*eventSwitch).FireEvent(0xc0001e8070, {0x95ce80, 0xc0001e0000}) 2023-11-15T01:48:36.1162962Z /home/runner/work/gno/gno/tm2/pkg/events/events.go:107 +0xe6 2023-11-15T01:48:36.1163896Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).addVote(0xc0000fa000, 0xc0001e0000, {0x0, 0x0}) 2023-11-15T01:48:36.1164981Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:1559 +0x6ec 2023-11-15T01:48:36.1166018Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).tryAddVote(0xc0000fa000, 0xf?, {0x0?, 0x0?}) 2023-11-15T01:48:36.1166838Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:1483 +0x5c 2023-11-15T01:48:36.1167735Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).handleMsg(0xc0000fa000, {{0xaaaf60, 0xc0003161f0}, {0x0, 0x0}}) 2023-11-15T01:48:36.1168747Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:691 +0x3a5 2023-11-15T01:48:36.1169551Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).receiveRoutine(0xc0000fa000, 0x0) 2023-11-15T01:48:36.1170359Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:650 +0x4e6 2023-11-15T01:48:36.1171157Z created by github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).OnStart in goroutine 100 2023-11-15T01:48:36.1172047Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:344 +0x565 2023-11-15T01:48:36.1172431Z 2023-11-15T01:48:36.1172635Z goroutine 87 [select]: 2023-11-15T01:48:36.1173202Z github.com/gnolang/gno/tm2/pkg/bft/wal.(*baseWAL).processFlushTicks(0xc00012a1c0) 2023-11-15T01:48:36.1173876Z /home/runner/work/gno/gno/tm2/pkg/bft/wal/wal.go:152 +0x5d 2023-11-15T01:48:36.1174535Z created by github.com/gnolang/gno/tm2/pkg/bft/wal.(*baseWAL).OnStart in goroutine 84 2023-11-15T01:48:36.1175200Z /home/runner/work/gno/gno/tm2/pkg/bft/wal/wal.go:146 +0xbf 2023-11-15T01:48:36.1175515Z 2023-11-15T01:48:36.1175634Z goroutine 88 [select, 19 minutes]: 2023-11-15T01:48:36.1176228Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*timeoutTicker).timeoutRoutine(0xc00009e420) 2023-11-15T01:48:36.1177000Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/ticker.go:96 +0x18a 2023-11-15T01:48:36.1177773Z created by github.com/gnolang/gno/tm2/pkg/bft/consensus.(*timeoutTicker).OnStart in goroutine 84 2023-11-15T01:48:36.1178558Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/ticker.go:51 +0x8e 2023-11-15T01:48:36.1179037Z 2023-11-15T01:48:36.1179168Z goroutine 89 [select, 10 minutes]: 2023-11-15T01:48:36.1179877Z github.com/gnolang/gno/tm2/pkg/events.SubscribeToEventOn.SubscribeFilteredOn.func2({0x99d440, 0xc000330af0}) 2023-11-15T01:48:36.1180898Z /home/runner/work/gno/gno/tm2/pkg/events/subscribe.go:52 +0x279 2023-11-15T01:48:36.1181656Z github.com/gnolang/gno/tm2/pkg/events.(*eventSwitch).FireEvent(0xc00012a150, {0x99d440, 0xc000330af0}) 2023-11-15T01:48:36.1182395Z /home/runner/work/gno/gno/tm2/pkg/events/events.go:107 +0xe6 2023-11-15T01:48:36.1183229Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).addProposalBlockPart(0xc00017c000, 0x20?, {0x0?, 0x0?}) 2023-11-15T01:48:36.1184225Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:1446 +0x456 2023-11-15T01:48:36.1185133Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).handleMsg(0xc00017c000, {{0xaaaf80, 0xc00031aab0}, {0x0, 0x0}}) 2023-11-15T01:48:36.1186036Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:679 +0x16f 2023-11-15T01:48:36.1186819Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).receiveRoutine(0xc00017c000, 0x0) 2023-11-15T01:48:36.1187607Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:650 +0x4e6 2023-11-15T01:48:36.1188381Z created by github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).OnStart in goroutine 84 2023-11-15T01:48:36.1189171Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:344 +0x565 2023-11-15T01:48:36.1189797Z FAIL github.com/gnolang/gno/tm2/pkg/bft/consensus 1200.017s 2023-11-15T01:48:36.1190394Z ? github.com/gnolang/gno/tm2/pkg/bft/consensus/config [no test files] 2023-11-15T01:48:36.4972987Z === RUN TestPeerCatchupRounds 2023-11-15T01:48:36.4973521Z === PAUSE TestPeerCatchupRounds 2023-11-15T01:48:36.4973866Z === CONT TestPeerCatchupRounds 2023-11-15T01:48:36.4995471Z --- PASS: TestPeerCatchupRounds (0.00s) 2023-11-15T01:48:36.4995826Z PASS 2023-11-15T01:48:36.4999711Z coverage: 21.7% of statements 2023-11-15T01:48:36.5012639Z ok github.com/gnolang/gno/tm2/pkg/bft/consensus/types 0.009s coverage: 21.7% of statements 2023-11-15T01:48:36.5031561Z ? github.com/gnolang/gno/tm2/pkg/bft/fail [no test files] ```

stacktrace 2

``` 2023-11-15T10:41:50.0083327Z panic: test timed out after 20m0s 2023-11-15T10:41:50.0084063Z running tests: 2023-11-15T10:41:50.0084729Z TestSetValidBlockOnDelayedPrevote (19m39s) 2023-11-15T10:41:50.0085339Z 2023-11-15T10:41:50.0085566Z goroutine 29115 [running]: 2023-11-15T10:41:50.0086193Z testing.(*M).startAlarm.func1() 2023-11-15T10:41:50.0088611Z /opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:2259 +0x3b9 2023-11-15T10:41:50.0089187Z created by time.goFunc 2023-11-15T10:41:50.0089609Z /opt/hostedtoolcache/go/1.21.3/x64/src/time/sleep.go:176 +0x2d 2023-11-15T10:41:50.0089964Z 2023-11-15T10:41:50.0090104Z goroutine 1 [chan receive, 19 minutes]: 2023-11-15T10:41:50.0090463Z testing.tRunner.func1() 2023-11-15T10:41:50.0090932Z /opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1561 +0x489 2023-11-15T10:41:50.0091841Z testing.tRunner(0xc000167380, 0xc00019bbb8) 2023-11-15T10:41:50.0092375Z /opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1601 +0x138 2023-11-15T10:41:50.0093090Z testing.runTests(0xc0001e1040?, {0xe3e840, 0x38, 0x38}, {0xc00022a4c0?, 0xc00022a480?, 0xe44980?}) 2023-11-15T10:41:50.0093814Z /opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:2052 +0x445 2023-11-15T10:41:50.0094294Z testing.(*M).Run(0xc0001e1040) 2023-11-15T10:41:50.0094757Z /opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1925 +0x636 2023-11-15T10:41:50.0095365Z github.com/gnolang/gno/tm2/pkg/bft/consensus.TestMain(0x9d6e6d?) 2023-11-15T10:41:50.0096032Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/replay_test.go:44 +0xf4 2023-11-15T10:41:50.0096532Z main.main() 2023-11-15T10:41:50.0096786Z _testmain.go:193 +0x1e6 2023-11-15T10:41:50.0096964Z 2023-11-15T10:41:50.0097091Z goroutine 10 [syscall, 19 minutes]: 2023-11-15T10:41:50.0097420Z os/signal.signal_recv() 2023-11-15T10:41:50.0097863Z /opt/hostedtoolcache/go/1.21.3/x64/src/runtime/sigqueue.go:152 +0x29 2023-11-15T10:41:50.0098328Z os/signal.loop() 2023-11-15T10:41:50.0098762Z /opt/hostedtoolcache/go/1.21.3/x64/src/os/signal/signal_unix.go:23 +0x13 2023-11-15T10:41:50.0099311Z created by os/signal.Notify.func1.1 in goroutine 8 2023-11-15T10:41:50.0099863Z /opt/hostedtoolcache/go/1.21.3/x64/src/os/signal/signal.go:151 +0x1f 2023-11-15T10:41:50.0100234Z 2023-11-15T10:41:50.0100390Z goroutine 57 [sync.RWMutex.RLock, 19 minutes]: 2023-11-15T10:41:50.0100858Z sync.runtime_SemacquireRWMutexR(0x20?, 0x1?, 0xc000304480?) 2023-11-15T10:41:50.0101416Z /opt/hostedtoolcache/go/1.21.3/x64/src/runtime/sema.go:82 +0x25 2023-11-15T10:41:50.0101858Z sync.(*RWMutex).RLock(...) 2023-11-15T10:41:50.0102254Z /opt/hostedtoolcache/go/1.21.3/x64/src/sync/rwmutex.go:71 2023-11-15T10:41:50.0102947Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).GetRoundState(0xc000004300) 2023-11-15T10:41:50.0103755Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:228 +0x78 2023-11-15T10:41:50.0104663Z github.com/gnolang/gno/tm2/pkg/bft/consensus.TestSetValidBlockOnDelayedPrevote(0xc00048aea0?) 2023-11-15T10:41:50.0105521Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state_test.go:1093 +0x458 2023-11-15T10:41:50.0106068Z testing.tRunner(0xc00048b1e0, 0xa19c00) 2023-11-15T10:41:50.0106570Z /opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1595 +0xff 2023-11-15T10:41:50.0107073Z created by testing.(*T).Run in goroutine 1 2023-11-15T10:41:50.0107804Z /opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1648 +0x3ad 2023-11-15T10:41:50.0108411Z 2023-11-15T10:41:50.0108643Z goroutine 101 [chan receive, 19 minutes]: 2023-11-15T10:41:50.0109628Z github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).setupCloseHandler.func1() 2023-11-15T10:41:50.0110868Z /home/runner/work/gno/gno/tm2/pkg/autofile/autofile_unix.go:17 +0x47 2023-11-15T10:41:50.0112264Z created by github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).setupCloseHandler in goroutine 100 2023-11-15T10:41:50.0114061Z /home/runner/work/gno/gno/tm2/pkg/autofile/autofile_unix.go:16 +0xc5 2023-11-15T10:41:50.0114757Z 2023-11-15T10:41:50.0115084Z goroutine 102 [select]: 2023-11-15T10:41:50.0115971Z github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).closeFileRoutine(0xc0002a4320) 2023-11-15T10:41:50.0117203Z /home/runner/work/gno/gno/tm2/pkg/autofile/autofile.go:93 +0x65 2023-11-15T10:41:50.0118350Z created by github.com/gnolang/gno/tm2/pkg/autofile.OpenAutoFile in goroutine 100 2023-11-15T10:41:50.0119575Z /home/runner/work/gno/gno/tm2/pkg/autofile/autofile.go:75 +0x196 2023-11-15T10:41:50.0120200Z 2023-11-15T10:41:50.0120383Z goroutine 103 [select]: 2023-11-15T10:41:50.0121228Z github.com/gnolang/gno/tm2/pkg/bft/wal.(*baseWAL).processFlushTicks(0xc0001e80e0) 2023-11-15T10:41:50.0122329Z /home/runner/work/gno/gno/tm2/pkg/bft/wal/wal.go:152 +0x5d 2023-11-15T10:41:50.0123477Z created by github.com/gnolang/gno/tm2/pkg/bft/wal.(*baseWAL).OnStart in goroutine 100 2023-11-15T10:41:50.0124784Z /home/runner/work/gno/gno/tm2/pkg/bft/wal/wal.go:146 +0xbf 2023-11-15T10:41:50.0125358Z 2023-11-15T10:41:50.0125571Z goroutine 104 [select, 19 minutes]: 2023-11-15T10:41:50.0126610Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*timeoutTicker).timeoutRoutine(0xc0002fc900) 2023-11-15T10:41:50.0127965Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/ticker.go:96 +0x18a 2023-11-15T10:41:50.0129349Z created by github.com/gnolang/gno/tm2/pkg/bft/consensus.(*timeoutTicker).OnStart in goroutine 100 2023-11-15T10:41:50.0130734Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/ticker.go:51 +0x8e 2023-11-15T10:41:50.0131369Z 2023-11-15T10:41:50.0131553Z goroutine 105 [select, 10 minutes]: 2023-11-15T10:41:50.0132707Z github.com/gnolang/gno/tm2/pkg/events.SubscribeFiltered.SubscribeFilteredOn.func1({0x95ce80, 0xc00018c000}) 2023-11-15T10:41:50.0134178Z /home/runner/work/gno/gno/tm2/pkg/events/subscribe.go:52 +0x279 2023-11-15T10:41:50.0135504Z github.com/gnolang/gno/tm2/pkg/events.(*eventSwitch).FireEvent(0xc0001e8070, {0x95ce80, 0xc00018c000}) 2023-11-15T10:41:50.0136829Z /home/runner/work/gno/gno/tm2/pkg/events/events.go:107 +0xe6 2023-11-15T10:41:50.0138226Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).addVote(0xc000004300, 0xc00018c000, {0x0, 0x0}) 2023-11-15T10:41:50.0139715Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:1559 +0x6ec 2023-11-15T10:41:50.0141215Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).tryAddVote(0xc000004300, 0xf?, {0x0?, 0x0?}) 2023-11-15T10:41:50.0142694Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:1483 +0x5c 2023-11-15T10:41:50.0144263Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).handleMsg(0xc000004300, {{0xaaaf60, 0xc0000b21e0}, {0x0, 0x0}}) 2023-11-15T10:41:50.0145901Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:691 +0x3a5 2023-11-15T10:41:50.0146959Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).receiveRoutine(0xc000004300, 0x0) 2023-11-15T10:41:50.0147766Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:650 +0x4e6 2023-11-15T10:41:50.0148945Z created by github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).OnStart in goroutine 100 2023-11-15T10:41:50.0149760Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:344 +0x565 2023-11-15T10:41:50.0150353Z FAIL github.com/gnolang/gno/tm2/pkg/bft/consensus 1200.017s 2023-11-15T10:41:50.0150959Z ? github.com/gnolang/gno/tm2/pkg/bft/consensus/config [no test files] ```

From these traces, we can deduce that there is a deadlock that can appear in multiple tests from state_test.go:

The first conflicting lock appears when calling ConsensusState.GetRoundState(), which in turn calls ConsensusState.mtx.RLock().

The second lock appears in ConsensusState.handleMsg. After locking ConsensusState.mtx, the method attempts to add a vote, ultimately reaching the eventSwitch.FireEvent method. This method will block because there are no consumers for the event, and as a result, the mutex is never unlocked.

example on TestSetValidBlockOnDelayedPrevote:

func TestSetValidBlockOnDelayedPrevote(t *testing.T) {
    t.Parallel()

    cs1, vss := randConsensusState(4)
    vs2, vs3, vs4 := vss[1], vss[2], vss[3]
    height, round := cs1.Height, cs1.Round

    partSize := types.BlockPartSizeBytes

    proposalCh := subscribe(cs1.evsw, cstypes.EventCompleteProposal{})
    timeoutWaitCh := subscribe(cs1.evsw, cstypes.EventTimeoutWait{})
    newRoundCh := subscribe(cs1.evsw, cstypes.EventNewRound{})
    validBlockCh := subscribe(cs1.evsw, cstypes.EventNewValidBlock{})
    addr := cs1.privValidator.GetPubKey().Address()

    /// A vote subscriber is created here, which ultimately receive an event that is never consumed by this test.

    voteCh := subscribeToVoter(cs1, addr)

    // ... more code

    ensureNewValidBlock(validBlockCh, height, round)

    /// deadlock here
    rs = cs1.GetRoundState()

    assert.True(t, bytes.Equal(rs.ValidBlock.Hash(), propBlockHash))
    assert.True(t, rs.ValidBlockParts.Header().Equals(propBlockParts.Header()))
    assert.True(t, rs.ValidRound == round)
}
jaekwon commented 8 months ago

After locking ConsensusState.mtx, the method attempts to add a vote, ultimately reaching the eventSwitch.FireEvent method. This method will block because there are no consumers for the event, and as a result, the mutex is never unlocked.

The most explicit solution is to consume the event, synchronously. And with more events we may want to create helpers to consume events, but always synchronously/deterministically.

So this should be reverted https://github.com/gnolang/gno/pull/1385/files as it masks opportunities to do the above. Eventually solutions like https://github.com/gnolang/gno/pull/1385/files will start to fail in unexpected ways (when the buffer becomes full at high workload).

If you can't figure out the root cause then I can work on it, but in any case we should revert 1385 first.

moul commented 8 months ago

@jaekwon, @gfanton has opened #1441 to revert the PR. I request you to keep it open instead of merging. The situation was causing a blockage for everyone, so I prefer that we handle this case gradually and effectively on the side.

I promise that we will not launch mainnet without resolving the underlying problem.

Let's discuss this further during our next call.

@petar-dambovaliev, could you please add this low-level problem to your list of topics to review? I think it would be a great fit for you.

moul commented 8 months ago

As mentioned here, I have reverted the previous hack.

Now, our priority should be to fix it instead of allowing it to grow over time. Let's seek help from @jaekwon and discuss it during one of the upcoming team meetings.

@petar-dambovaliev, could you take a look in the next few days? I have a feeling that you can identify the issue quickly with your fast debugging skills.

We need to determine the cause of the ratio increase in the failing tests. Did we introduce it by increasing parallelism or is it something else?

Feel free to share tips on the fastest way to reproduce without running the entire lengthy test suite.

cc @zivkovicmilos @gfanton.

gfanton commented 8 months ago

I will also try to give another look, but let's make sure to have a review from @jaekwon this time

piux2 commented 8 months ago

@gfanton Starting from which commit/version did we have this issue on CI? If it did not appear in an earlier version, some change must have caused the testing case to fail.

gfanton commented 8 months ago

@piux2 I didn't attempt to do a bisect; that was on my to-do list. However, since I'm not able to reproduce the bug locally, I want to focus on that first. Based on what others are saying, it seems like this test has been failing for a long time.

piux2 commented 8 months ago

@gfanton

I reviewed the error message and logs, I doubt the testing behavior are rooted in pkg/bft test itself.

https://github.com/gnolang/gno/actions/runs/6918163494/job/18820116982?pr=1309

The panic error message in the trace above is from following code. go test -v -p 1 -timeout=20m -coverprofile=coverage.out -covermode=atomic ./pkg/bft/..

"test timed out after 20m0s" https://github.com/gnolang/gno/blob/968f89e2bce754e8a11db7a5a9813f4473caefe1/gnovm/cmd/gno/test.go#L198

Basically it panics if the entire CI suite can not complete in 20mins.

If we look at the start time and end time of pkg/bft CI log. It is about 20 mins.

root cause

There is another deeper problem. How could a cmd/gno/test.go leak the error message to tm2/pkg/bft test log and take the input from tm2 test?

It turns out the commands.IO passed in here wrapping a global os.Stdout and os.Stdin which could cause the issue because all go programs share the same os.Stdin and os.Stdout.

https://github.com/gnolang/gno/blob/968f89e2bce754e8a11db7a5a9813f4473caefe1/gnovm/cmd/gno/test.go#L41

Usually Golang os.Stdin and os.Stdout have separate *os.File file descriptor in its own process space. However, if all these instances are deployed on the same docker container, they could all reading from and writing to the same container’s stdin / stidout stream.

the workflow running concurrently on docker image ghcr.io/gnolang/gno

https://github.com/gnolang/gno/blob/968f89e2bce754e8a11db7a5a9813f4473caefe1/.github/workflows/tm2.yml#L20

WDYT?

gfanton commented 8 months ago

@piux2

I reviewed the error message and logs, I doubt the testing behavior are rooted in pkg/bft test itself.

gnolang/gno/actions/runs/6918163494/job/18820116982?pr=1309

This is definitely coming from bft state_test.go, if you check the log from the PR you sent and extract the panic stack trace you get this:

2023-11-19T03:46:40.2211188Z 2023/11/19 03:46:40 [WARN] EventSwitch subscriber test-client blocked on {Vote{0:06EE40FAB2DD 1/04/1(Prevote) B9BDB6E2B856 7B92917CA6E6 @ 2023-11-19T03:45:30.216887417Z}} for 40s
2023-11-19T03:48:00.2219207Z 2023/11/19 03:48:00 [WARN] EventSwitch subscriber test-client blocked on {Vote{0:06EE40FAB2DD 1/04/1(Prevote) B9BDB6E2B856 7B92917CA6E6 @ 2023-11-19T03:45:30.216887417Z}} for 1m20s
2023-11-19T03:50:40.2222972Z 2023/11/19 03:50:40 [WARN] EventSwitch subscriber test-client blocked on {Vote{0:06EE40FAB2DD 1/04/1(Prevote) B9BDB6E2B856 7B92917CA6E6 @ 2023-11-19T03:45:30.216887417Z}} for 2m40s
2023-11-19T03:56:00.2225112Z 2023/11/19 03:56:00 [WARN] EventSwitch subscriber test-client blocked on {Vote{0:06EE40FAB2DD 1/04/1(Prevote) B9BDB6E2B856 7B92917CA6E6 @ 2023-11-19T03:45:30.216887417Z}} for 5m20s
2023-11-19T04:05:06.4032211Z    github.com/gnolang/gno/tm2/pkg/bft/consensus    coverage: 73.4% of statements
2023-11-19T04:05:06.4086930Z panic: test timed out after 20m0s
2023-11-19T04:05:06.4087896Z running tests:
2023-11-19T04:05:06.4088502Z    TestProposeValidBlock (19m37s)
2023-11-19T04:05:06.4088911Z 
2023-11-19T04:05:06.4089117Z goroutine 30501 [running]:
2023-11-19T04:05:06.4089671Z testing.(*M).startAlarm.func1()
2023-11-19T04:05:06.4090605Z    /opt/hostedtoolcache/go/1.20.10/x64/src/testing/testing.go:2241 +0x3c5
2023-11-19T04:05:06.4091492Z created by time.goFunc
2023-11-19T04:05:06.4092178Z    /opt/hostedtoolcache/go/1.20.10/x64/src/time/sleep.go:176 +0x32
2023-11-19T04:05:06.4092804Z 
2023-11-19T04:05:06.4093032Z goroutine 1 [chan receive, 19 minutes]:
2023-11-19T04:05:06.4093611Z testing.tRunner.func1()
2023-11-19T04:05:06.4094430Z    /opt/hostedtoolcache/go/1.20.10/x64/src/testing/testing.go:1542 +0x4a5
2023-11-19T04:05:06.4095442Z testing.tRunner(0xc00016b1e0, 0xc000187bf8)
2023-11-19T04:05:06.4096457Z    /opt/hostedtoolcache/go/1.20.10/x64/src/testing/testing.go:1582 +0x144
2023-11-19T04:05:06.4097915Z testing.runTests(0xc0001df4a0?, {0xe414e0, 0x38, 0x38}, {0xc0002285c0?, 0xc000228580?, 0xe48660?})
2023-11-19T04:05:06.4099383Z    /opt/hostedtoolcache/go/1.20.10/x64/src/testing/testing.go:2034 +0x489
2023-11-19T04:05:06.4100420Z testing.(*M).Run(0xc0001df4a0)
2023-11-19T04:05:06.4101298Z    /opt/hostedtoolcache/go/1.20.10/x64/src/testing/testing.go:1906 +0x63a
2023-11-19T04:05:06.4102478Z github.com/gnolang/gno/tm2/pkg/bft/consensus.TestMain(0x9f64f2?)
2023-11-19T04:05:06.4103668Z    /home/runner/work/gno/gno/tm2/pkg/bft/consensus/replay_test.go:44 +0xe5
2023-11-19T04:05:06.4104481Z main.main()
2023-11-19T04:05:06.4104881Z    _testmain.go:190 +0x1ef
2023-11-19T04:05:06.4105162Z 
2023-11-19T04:05:06.4105361Z goroutine 18 [syscall, 19 minutes]:
2023-11-19T04:05:06.4105888Z os/signal.signal_recv()
2023-11-19T04:05:06.4106603Z    /opt/hostedtoolcache/go/1.20.10/x64/src/runtime/sigqueue.go:152 +0x2f
2023-11-19T04:05:06.4107745Z os/signal.loop()
2023-11-19T04:05:06.4108470Z    /opt/hostedtoolcache/go/1.20.10/x64/src/os/signal/signal_unix.go:23 +0x19
2023-11-19T04:05:06.4109301Z created by os/signal.Notify.func1.1
2023-11-19T04:05:06.4110127Z    /opt/hostedtoolcache/go/1.20.10/x64/src/os/signal/signal.go:151 +0x2a
2023-11-19T04:05:06.4110747Z 
2023-11-19T04:05:06.4110969Z goroutine 20836 [chan receive, 19 minutes]:
2023-11-19T04:05:06.4111930Z github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).setupCloseHandler.func1()
2023-11-19T04:05:06.4113176Z    /home/runner/work/gno/gno/tm2/pkg/autofile/autofile_unix.go:17 +0x4d
2023-11-19T04:05:06.4114317Z created by github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).setupCloseHandler
2023-11-19T04:05:06.4115505Z    /home/runner/work/gno/gno/tm2/pkg/autofile/autofile_unix.go:16 +0xca
2023-11-19T04:05:06.4116290Z 
2023-11-19T04:05:06.4116550Z goroutine 66 [sync.RWMutex.RLock, 19 minutes]:
2023-11-19T04:05:06.4117353Z sync.runtime_SemacquireRWMutexR(0x1?, 0x80?, 0x20?)
2023-11-19T04:05:06.4118294Z    /opt/hostedtoolcache/go/1.20.10/x64/src/runtime/sema.go:82 +0x26
2023-11-19T04:05:06.4119077Z sync.(*RWMutex).RLock(...)
2023-11-19T04:05:06.4119766Z    /opt/hostedtoolcache/go/1.20.10/x64/src/sync/rwmutex.go:71
2023-11-19T04:05:06.4120981Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).GetRoundState(0xc000004f00)
2023-11-19T04:05:06.4122302Z    /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:228 +0x7e
2023-11-19T04:05:06.4123666Z github.com/gnolang/gno/tm2/pkg/bft/consensus.TestProposeValidBlock(0xc0002de340)
2023-11-19T04:05:06.4125143Z    /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state_test.go:1059 +0xca5
2023-11-19T04:05:06.4126051Z testing.tRunner(0xc0002de340, 0xa3a478)
2023-11-19T04:05:06.4126974Z    /opt/hostedtoolcache/go/1.20.10/x64/src/testing/testing.go:1576 +0x10b
2023-11-19T04:05:06.4127805Z created by testing.(*T).Run
2023-11-19T04:05:06.4128612Z    /opt/hostedtoolcache/go/1.20.10/x64/src/testing/testing.go:1629 +0x3ea
2023-11-19T04:05:06.4129278Z 
2023-11-19T04:05:06.4129499Z goroutine 20840 [select, 11 minutes]:
2023-11-19T04:05:06.4130542Z github.com/gnolang/gno/tm2/pkg/events.SubscribeFilteredOn.func1({0x97f900, 0xc0050712c0})
2023-11-19T04:05:06.4131964Z    /home/runner/work/gno/gno/tm2/pkg/events/subscribe.go:52 +0x28a
2023-11-19T04:05:06.4133329Z github.com/gnolang/gno/tm2/pkg/events.(*eventSwitch).FireEvent(0xc002348c40, {0x97f900, 0xc0050712c0})
2023-11-19T04:05:06.4134655Z    /home/runner/work/gno/gno/tm2/pkg/events/events.go:107 +0xec
2023-11-19T04:05:06.4136059Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).addVote(0xc000004f00, 0xc0050712c0, {0x0, 0x0})
2023-11-19T04:05:06.4137565Z    /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:1559 +0x777
2023-11-19T04:05:06.4139029Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).tryAddVote(0xc000004f00, 0xf?, {0x0?, 0x0?})
2023-11-19T04:05:06.4140505Z    /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:1483 +0x65
2023-11-19T04:05:06.4142118Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).handleMsg(0xc000004f00, {{0xac72e0, 0xc0090c1580}, {0x0, 0x0}})
2023-11-19T04:05:06.4143744Z    /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:691 +0x3c5
2023-11-19T04:05:06.4145110Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).receiveRoutine(0xc000004f00, 0x0)
2023-11-19T04:05:06.4146502Z    /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:650 +0x4e6
2023-11-19T04:05:06.4147921Z created by github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).OnStart
2023-11-19T04:05:06.4149209Z    /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:344 +0x598
2023-11-19T04:05:06.4149886Z 
2023-11-19T04:05:06.4150070Z goroutine 20838 [select]:
2023-11-19T04:05:06.4150950Z github.com/gnolang/gno/tm2/pkg/bft/wal.(*baseWAL).processFlushTicks(0xc000a259d0)
2023-11-19T04:05:06.4152124Z    /home/runner/work/gno/gno/tm2/pkg/bft/wal/wal.go:152 +0x65
2023-11-19T04:05:06.4153149Z created by github.com/gnolang/gno/tm2/pkg/bft/wal.(*baseWAL).OnStart
2023-11-19T04:05:06.4154366Z    /home/runner/work/gno/gno/tm2/pkg/bft/wal/wal.go:146 +0xca
2023-11-19T04:05:06.4154964Z 
2023-11-19T04:05:06.4155153Z goroutine 20837 [select]:
2023-11-19T04:05:06.4156010Z github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).closeFileRoutine(0xc005cce4b0)
2023-11-19T04:05:06.4157231Z    /home/runner/work/gno/gno/tm2/pkg/autofile/autofile.go:93 +0x69
2023-11-19T04:05:06.4158261Z created by github.com/gnolang/gno/tm2/pkg/autofile.OpenAutoFile
2023-11-19T04:05:06.4159336Z    /home/runner/work/gno/gno/tm2/pkg/autofile/autofile.go:75 +0x18a
2023-11-19T04:05:06.4159972Z 
2023-11-19T04:05:06.4160192Z goroutine 20839 [select, 19 minutes]:
2023-11-19T04:05:06.4160929Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*timeoutTicker).timeoutRoutine(0xc005b137a0)
2023-11-19T04:05:06.4162187Z    /home/runner/work/gno/gno/tm2/pkg/bft/consensus/ticker.go:96 +0x196
2023-11-19T04:05:06.4163033Z created by github.com/gnolang/gno/tm2/pkg/bft/consensus.(*timeoutTicker).OnStart
2023-11-19T04:05:06.4163842Z    /home/runner/work/gno/gno/tm2/pkg/bft/consensus/ticker.go:51 +0x96
2023-11-19T04:05:06.4164530Z FAIL   github.com/gnolang/gno/tm2/pkg/bft/consensus    1200.019s
2023-11-19T04:05:06.4165234Z ?      github.com/gnolang/gno/tm2/pkg/bft/consensus/config [no test files]

From this stack trace, we can see that the test has been stuck in TestProposeValidBlock for 19 minutes, more specifically at tm2/pkg/bft/consensus/state_test.go:1059:

panic: test timed out after 20m0s
running tests:
    TestProposeValidBlock (19m37s)

The multiple [WARN] logs indicate that a vote event has been stuck and not consumed at tm2/pkg/events/subscribe.go:60:

[WARN] EventSwitch subscriber test-client blocked on {Vote{0:06EE40FAB2DD 1/04/1(Prevote) for 1m20s
[WARN] EventSwitch subscriber test-client blocked on {Vote{0:06EE40FAB2DD 1/04/1(Prevote) for 2m40s
[WARN] EventSwitch subscriber test-client blocked on {Vote{0:06EE40FAB2DD 1/04/1(Prevote) for 5m20s

The panic timeout isn't originating from gno but from the golang test suite, likely from github.com/golang/go/src/testing/testing.go:2366. Because the -timeout=20m parameter is specified in the go test command when running tests from the command you sent.

So it's unlikely to be linked with an stdout/stdin issue, or even related to gnovm.

petar-dambovaliev commented 8 months ago

I am gonna take this over.

moul commented 8 months ago

According to @zivkovicmilos, the short-term solution is #1495. For the long-term solution, Milos wants to collaborate with @petar-dambovaliev on a larger refactoring.

A potential temporary solution to provide more comfort for everyone is to label the tests as "flappy" so that they are checked intermittently instead of always (https://github.com/gnolang/gno/issues/202). However, it is advisable to avoid relying on this temporary system.

jaekwon commented 8 months ago
diff --git a/tm2/pkg/bft/consensus/state_test.go b/tm2/pkg/bft/consensus/state_test.go
index 35877837..713be1df 100644
--- a/tm2/pkg/bft/consensus/state_test.go
+++ b/tm2/pkg/bft/consensus/state_test.go
@@ -1055,6 +1055,7 @@ func TestProposeValidBlock(t *testing.T) {
        t.Log("### ONTO ROUND 4")

        ensureNewProposal(proposalCh, height, round)
+       ensurePrevote(voteCh, height, round)

        rs = cs1.GetRoundState()
        assert.True(t, bytes.Equal(rs.ProposalBlock.Hash(), propBlockHash))

^ that should fix it for the test TestProposeValidBlock. And likewise for similar flappy BFT tests. There are probably no sync/lock bugs in Tendermint core bft; the reason why it is locked on cs1.GetRoundState() is because the consensus engine is locked trying to write an event (in this case a prevote event).

You will find that the deadlock happens toward the end of the test. This is because the race condition usually only manifests at the end where we forgot to read an event from a channel (and "ensure*" consumes these), and these are relatively easy to fix by looking at similar tests (or the same one above) and looking at the order of ensure calls there for each round.

It's also possible that the deadlock happens within a round in the middle of a test. In that case the fix should probably be easy: probably just need to move one or more ensure* calls up before a call like cs1.GetRoundState().

The WARN statements: "[WARN] EventSwitch subscriber test-client blocked on " were put there to help fix these tests. So please use them and also keep the WARN logging because it will come handy in the future.

The goroutine 20840 [select, 11 minutes]: for events.SubscribeFilteredOn is misleading because it's actually for 11m plus the time that produced the prior WARN messages: 40s + 1m20s + 2m40s + 5m20s = 600s = 10m. So this event send was actually stuck for 21 minutes it seems.

In the end, our tests will have precise event consumption testing, which is good. Any solution that tries to mask the problem by buffering the event system is just going to end up encountering some other more complex problem later. So please keep the BFT tests as is and keep adding ensure methods to completely cover all events and the order they are produced.

moul commented 8 months ago

I attempted multiple times to make these tests fail locally, but was unsuccessful, even with -count 100.

Does anyone know how we can write comprehensive tests that can fail locally and not only on the CI? One idea is to use the following pseudocode:

for keyword1 in keywords {
    for keyword2 in keywords {
        setPerTestTimeout(...)
        node.init()
        node.add(keyword1)
        node.add(keyword2)
        node.processEvents()
        assertEmptyQueue()
    }
}
jaekwon commented 8 months ago

The problem stems from performance differences and spurts. That's not something you can generally detect for unless you use some tool to modify the timing of various goroutines randomly and allow for specifying behavior. I'm not suggesting that at all, that is crazy lol.

For the BFT tests we can do this:

--- a/tm2/pkg/bft/consensus/state_test.go
+++ b/tm2/pkg/bft/consensus/state_test.go
@@ -1055,6 +1055,9 @@ func TestProposeValidBlock(t *testing.T) {
        t.Log("### ONTO ROUND 4")

        ensureNewProposal(proposalCh, height, round)
+       //ensurePrevote(voteCh, height, round)
+.      longEnough := 10 * time.Second
+       time.Sleep(longEnough)

        rs = cs1.GetRoundState()
        assert.True(t, bytes.Equal(rs.ProposalBlock.Hash(), propBlockHash))

Wait long enough before any last state reads or assertions on BFT tests for the consensus round. Locally this should get yours to fail too. 10 seconds sounds be enough on a local laptop for BFT state calculations to go through. On CI maybe it needs to be a bit longer. But we can optionally enable this by setting a flag that optionally searchces for these cases rather than it being the norm. Nobody wants BFT test cases to each take 10 seconds longer lol.

zivkovicmilos commented 3 months ago

This has been fixed as part of #1515