Closed moul closed 3 months ago
I've extracted two stack traces from two different CI logs.
From these traces, we can deduce that there is a deadlock that can appear in multiple tests from state_test.go:
The first conflicting lock appears when calling ConsensusState.GetRoundState(), which in turn calls ConsensusState.mtx.RLock().
The second lock appears in ConsensusState.handleMsg: after locking ConsensusState.mtx, the method attempts to add a vote, ultimately reaching eventSwitch.FireEvent. That method blocks because there are no consumers for the event, so the mutex is never unlocked.
An example, in TestSetValidBlockOnDelayedPrevote:
func TestSetValidBlockOnDelayedPrevote(t *testing.T) {
	t.Parallel()

	cs1, vss := randConsensusState(4)
	vs2, vs3, vs4 := vss[1], vss[2], vss[3]
	height, round := cs1.Height, cs1.Round

	partSize := types.BlockPartSizeBytes

	proposalCh := subscribe(cs1.evsw, cstypes.EventCompleteProposal{})
	timeoutWaitCh := subscribe(cs1.evsw, cstypes.EventTimeoutWait{})
	newRoundCh := subscribe(cs1.evsw, cstypes.EventNewRound{})
	validBlockCh := subscribe(cs1.evsw, cstypes.EventNewValidBlock{})
	addr := cs1.privValidator.GetPubKey().Address()
	/// A vote subscriber is created here; it ultimately receives an event that is never consumed by this test.
	voteCh := subscribeToVoter(cs1, addr)

	// ... more code

	ensureNewValidBlock(validBlockCh, height, round)

	/// deadlock here
	rs = cs1.GetRoundState()
	assert.True(t, bytes.Equal(rs.ValidBlock.Hash(), propBlockHash))
	assert.True(t, rs.ValidBlockParts.Header().Equals(propBlockParts.Header()))
	assert.True(t, rs.ValidRound == round)
}
The most explicit solution is to consume the event synchronously. As more events are added, we may want to create helpers to consume them, but always synchronously/deterministically.
So https://github.com/gnolang/gno/pull/1385/files should be reverted, as it masks opportunities to do the above. Eventually, solutions like https://github.com/gnolang/gno/pull/1385/files will start to fail in unexpected ways (when the buffer becomes full under high workload).
If you can't figure out the root cause then I can work on it, but in any case we should revert 1385 first.
@jaekwon, @gfanton has opened #1441 to revert the PR. I ask that you keep it open instead of merging it. The situation was blocking everyone, so I prefer that we handle this case gradually and effectively on the side.
I promise that we will not launch mainnet without resolving the underlying problem.
Let's discuss this further during our next call.
@petar-dambovaliev, could you please add this low-level problem to your list of topics to review? I think it would be a great fit for you.
As mentioned here, I have reverted the previous hack.
Now, our priority should be to fix it instead of allowing it to grow over time. Let's seek help from @jaekwon and discuss it during one of the upcoming team meetings.
@petar-dambovaliev, could you take a look in the next few days? I have a feeling that you can identify the issue quickly with your fast debugging skills.
We need to determine the cause of the ratio increase in the failing tests. Did we introduce it by increasing parallelism or is it something else?
Feel free to share tips on the fastest way to reproduce without running the entire lengthy test suite.
cc @zivkovicmilos @gfanton.
I will also try to give another look, but let's make sure to have a review from @jaekwon this time
@gfanton Starting from which commit/version did we have this issue on CI? If it did not appear in an earlier version, some change must have caused the testing case to fail.
@piux2 I didn't attempt to do a bisect; that was on my to-do list. However, since I'm not able to reproduce the bug locally, I want to focus on that first. Based on what others are saying, it seems like this test has been failing for a long time.
@gfanton
I reviewed the error message and logs; I doubt the failing behavior is rooted in the pkg/bft tests themselves.
https://github.com/gnolang/gno/actions/runs/6918163494/job/18820116982?pr=1309
The panic error message in the trace above comes from the following command: go test -v -p 1 -timeout=20m -coverprofile=coverage.out -covermode=atomic ./pkg/bft/..
"test timed out after 20m0s" https://github.com/gnolang/gno/blob/968f89e2bce754e8a11db7a5a9813f4473caefe1/gnovm/cmd/gno/test.go#L198
Basically, it panics if the entire CI suite cannot complete in 20 minutes.
If we look at the start and end times of the pkg/bft CI log, it is about 20 minutes.
There is another, deeper problem: how could cmd/gno/test.go leak its error message into the tm2/pkg/bft test log and take input from the tm2 test?
It turns out the commands.IO passed in here wraps the global os.Stdout and os.Stdin, which could cause the issue, because all Go programs share the same os.Stdin and os.Stdout.
Usually, Go's os.Stdin and os.Stdout have separate *os.File file descriptors in their own process space. However, if all these instances are deployed in the same Docker container, they could all be reading from and writing to the same container's stdin/stdout stream.
The workflow runs concurrently on the docker image ghcr.io/gnolang/gno.
WDYT?
@piux2
I reviewed the error message and logs, I doubt the testing behavior are rooted in pkg/bft test itself.
This is definitely coming from bft state_test.go; if you check the log from the PR you sent and extract the panic stack trace, you get this:
2023-11-19T03:46:40.2211188Z 2023/11/19 03:46:40 [WARN] EventSwitch subscriber test-client blocked on {Vote{0:06EE40FAB2DD 1/04/1(Prevote) B9BDB6E2B856 7B92917CA6E6 @ 2023-11-19T03:45:30.216887417Z}} for 40s
2023-11-19T03:48:00.2219207Z 2023/11/19 03:48:00 [WARN] EventSwitch subscriber test-client blocked on {Vote{0:06EE40FAB2DD 1/04/1(Prevote) B9BDB6E2B856 7B92917CA6E6 @ 2023-11-19T03:45:30.216887417Z}} for 1m20s
2023-11-19T03:50:40.2222972Z 2023/11/19 03:50:40 [WARN] EventSwitch subscriber test-client blocked on {Vote{0:06EE40FAB2DD 1/04/1(Prevote) B9BDB6E2B856 7B92917CA6E6 @ 2023-11-19T03:45:30.216887417Z}} for 2m40s
2023-11-19T03:56:00.2225112Z 2023/11/19 03:56:00 [WARN] EventSwitch subscriber test-client blocked on {Vote{0:06EE40FAB2DD 1/04/1(Prevote) B9BDB6E2B856 7B92917CA6E6 @ 2023-11-19T03:45:30.216887417Z}} for 5m20s
2023-11-19T04:05:06.4032211Z github.com/gnolang/gno/tm2/pkg/bft/consensus coverage: 73.4% of statements
2023-11-19T04:05:06.4086930Z panic: test timed out after 20m0s
2023-11-19T04:05:06.4087896Z running tests:
2023-11-19T04:05:06.4088502Z TestProposeValidBlock (19m37s)
2023-11-19T04:05:06.4088911Z
2023-11-19T04:05:06.4089117Z goroutine 30501 [running]:
2023-11-19T04:05:06.4089671Z testing.(*M).startAlarm.func1()
2023-11-19T04:05:06.4090605Z /opt/hostedtoolcache/go/1.20.10/x64/src/testing/testing.go:2241 +0x3c5
2023-11-19T04:05:06.4091492Z created by time.goFunc
2023-11-19T04:05:06.4092178Z /opt/hostedtoolcache/go/1.20.10/x64/src/time/sleep.go:176 +0x32
2023-11-19T04:05:06.4092804Z
2023-11-19T04:05:06.4093032Z goroutine 1 [chan receive, 19 minutes]:
2023-11-19T04:05:06.4093611Z testing.tRunner.func1()
2023-11-19T04:05:06.4094430Z /opt/hostedtoolcache/go/1.20.10/x64/src/testing/testing.go:1542 +0x4a5
2023-11-19T04:05:06.4095442Z testing.tRunner(0xc00016b1e0, 0xc000187bf8)
2023-11-19T04:05:06.4096457Z /opt/hostedtoolcache/go/1.20.10/x64/src/testing/testing.go:1582 +0x144
2023-11-19T04:05:06.4097915Z testing.runTests(0xc0001df4a0?, {0xe414e0, 0x38, 0x38}, {0xc0002285c0?, 0xc000228580?, 0xe48660?})
2023-11-19T04:05:06.4099383Z /opt/hostedtoolcache/go/1.20.10/x64/src/testing/testing.go:2034 +0x489
2023-11-19T04:05:06.4100420Z testing.(*M).Run(0xc0001df4a0)
2023-11-19T04:05:06.4101298Z /opt/hostedtoolcache/go/1.20.10/x64/src/testing/testing.go:1906 +0x63a
2023-11-19T04:05:06.4102478Z github.com/gnolang/gno/tm2/pkg/bft/consensus.TestMain(0x9f64f2?)
2023-11-19T04:05:06.4103668Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/replay_test.go:44 +0xe5
2023-11-19T04:05:06.4104481Z main.main()
2023-11-19T04:05:06.4104881Z _testmain.go:190 +0x1ef
2023-11-19T04:05:06.4105162Z
2023-11-19T04:05:06.4105361Z goroutine 18 [syscall, 19 minutes]:
2023-11-19T04:05:06.4105888Z os/signal.signal_recv()
2023-11-19T04:05:06.4106603Z /opt/hostedtoolcache/go/1.20.10/x64/src/runtime/sigqueue.go:152 +0x2f
2023-11-19T04:05:06.4107745Z os/signal.loop()
2023-11-19T04:05:06.4108470Z /opt/hostedtoolcache/go/1.20.10/x64/src/os/signal/signal_unix.go:23 +0x19
2023-11-19T04:05:06.4109301Z created by os/signal.Notify.func1.1
2023-11-19T04:05:06.4110127Z /opt/hostedtoolcache/go/1.20.10/x64/src/os/signal/signal.go:151 +0x2a
2023-11-19T04:05:06.4110747Z
2023-11-19T04:05:06.4110969Z goroutine 20836 [chan receive, 19 minutes]:
2023-11-19T04:05:06.4111930Z github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).setupCloseHandler.func1()
2023-11-19T04:05:06.4113176Z /home/runner/work/gno/gno/tm2/pkg/autofile/autofile_unix.go:17 +0x4d
2023-11-19T04:05:06.4114317Z created by github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).setupCloseHandler
2023-11-19T04:05:06.4115505Z /home/runner/work/gno/gno/tm2/pkg/autofile/autofile_unix.go:16 +0xca
2023-11-19T04:05:06.4116290Z
2023-11-19T04:05:06.4116550Z goroutine 66 [sync.RWMutex.RLock, 19 minutes]:
2023-11-19T04:05:06.4117353Z sync.runtime_SemacquireRWMutexR(0x1?, 0x80?, 0x20?)
2023-11-19T04:05:06.4118294Z /opt/hostedtoolcache/go/1.20.10/x64/src/runtime/sema.go:82 +0x26
2023-11-19T04:05:06.4119077Z sync.(*RWMutex).RLock(...)
2023-11-19T04:05:06.4119766Z /opt/hostedtoolcache/go/1.20.10/x64/src/sync/rwmutex.go:71
2023-11-19T04:05:06.4120981Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).GetRoundState(0xc000004f00)
2023-11-19T04:05:06.4122302Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:228 +0x7e
2023-11-19T04:05:06.4123666Z github.com/gnolang/gno/tm2/pkg/bft/consensus.TestProposeValidBlock(0xc0002de340)
2023-11-19T04:05:06.4125143Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state_test.go:1059 +0xca5
2023-11-19T04:05:06.4126051Z testing.tRunner(0xc0002de340, 0xa3a478)
2023-11-19T04:05:06.4126974Z /opt/hostedtoolcache/go/1.20.10/x64/src/testing/testing.go:1576 +0x10b
2023-11-19T04:05:06.4127805Z created by testing.(*T).Run
2023-11-19T04:05:06.4128612Z /opt/hostedtoolcache/go/1.20.10/x64/src/testing/testing.go:1629 +0x3ea
2023-11-19T04:05:06.4129278Z
2023-11-19T04:05:06.4129499Z goroutine 20840 [select, 11 minutes]:
2023-11-19T04:05:06.4130542Z github.com/gnolang/gno/tm2/pkg/events.SubscribeFilteredOn.func1({0x97f900, 0xc0050712c0})
2023-11-19T04:05:06.4131964Z /home/runner/work/gno/gno/tm2/pkg/events/subscribe.go:52 +0x28a
2023-11-19T04:05:06.4133329Z github.com/gnolang/gno/tm2/pkg/events.(*eventSwitch).FireEvent(0xc002348c40, {0x97f900, 0xc0050712c0})
2023-11-19T04:05:06.4134655Z /home/runner/work/gno/gno/tm2/pkg/events/events.go:107 +0xec
2023-11-19T04:05:06.4136059Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).addVote(0xc000004f00, 0xc0050712c0, {0x0, 0x0})
2023-11-19T04:05:06.4137565Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:1559 +0x777
2023-11-19T04:05:06.4139029Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).tryAddVote(0xc000004f00, 0xf?, {0x0?, 0x0?})
2023-11-19T04:05:06.4140505Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:1483 +0x65
2023-11-19T04:05:06.4142118Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).handleMsg(0xc000004f00, {{0xac72e0, 0xc0090c1580}, {0x0, 0x0}})
2023-11-19T04:05:06.4143744Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:691 +0x3c5
2023-11-19T04:05:06.4145110Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).receiveRoutine(0xc000004f00, 0x0)
2023-11-19T04:05:06.4146502Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:650 +0x4e6
2023-11-19T04:05:06.4147921Z created by github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).OnStart
2023-11-19T04:05:06.4149209Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:344 +0x598
2023-11-19T04:05:06.4149886Z
2023-11-19T04:05:06.4150070Z goroutine 20838 [select]:
2023-11-19T04:05:06.4150950Z github.com/gnolang/gno/tm2/pkg/bft/wal.(*baseWAL).processFlushTicks(0xc000a259d0)
2023-11-19T04:05:06.4152124Z /home/runner/work/gno/gno/tm2/pkg/bft/wal/wal.go:152 +0x65
2023-11-19T04:05:06.4153149Z created by github.com/gnolang/gno/tm2/pkg/bft/wal.(*baseWAL).OnStart
2023-11-19T04:05:06.4154366Z /home/runner/work/gno/gno/tm2/pkg/bft/wal/wal.go:146 +0xca
2023-11-19T04:05:06.4154964Z
2023-11-19T04:05:06.4155153Z goroutine 20837 [select]:
2023-11-19T04:05:06.4156010Z github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).closeFileRoutine(0xc005cce4b0)
2023-11-19T04:05:06.4157231Z /home/runner/work/gno/gno/tm2/pkg/autofile/autofile.go:93 +0x69
2023-11-19T04:05:06.4158261Z created by github.com/gnolang/gno/tm2/pkg/autofile.OpenAutoFile
2023-11-19T04:05:06.4159336Z /home/runner/work/gno/gno/tm2/pkg/autofile/autofile.go:75 +0x18a
2023-11-19T04:05:06.4159972Z
2023-11-19T04:05:06.4160192Z goroutine 20839 [select, 19 minutes]:
2023-11-19T04:05:06.4160929Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*timeoutTicker).timeoutRoutine(0xc005b137a0)
2023-11-19T04:05:06.4162187Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/ticker.go:96 +0x196
2023-11-19T04:05:06.4163033Z created by github.com/gnolang/gno/tm2/pkg/bft/consensus.(*timeoutTicker).OnStart
2023-11-19T04:05:06.4163842Z /home/runner/work/gno/gno/tm2/pkg/bft/consensus/ticker.go:51 +0x96
2023-11-19T04:05:06.4164530Z FAIL github.com/gnolang/gno/tm2/pkg/bft/consensus 1200.019s
2023-11-19T04:05:06.4165234Z ? github.com/gnolang/gno/tm2/pkg/bft/consensus/config [no test files]
From this stack trace, we can see that the test has been stuck in TestProposeValidBlock for 19 minutes, more specifically at tm2/pkg/bft/consensus/state_test.go:1059:
panic: test timed out after 20m0s
running tests:
TestProposeValidBlock (19m37s)
The multiple [WARN] logs indicate that a vote event has been stuck, not consumed, at tm2/pkg/events/subscribe.go:60:
[WARN] EventSwitch subscriber test-client blocked on {Vote{0:06EE40FAB2DD 1/04/1(Prevote) for 1m20s
[WARN] EventSwitch subscriber test-client blocked on {Vote{0:06EE40FAB2DD 1/04/1(Prevote) for 2m40s
[WARN] EventSwitch subscriber test-client blocked on {Vote{0:06EE40FAB2DD 1/04/1(Prevote) for 5m20s
The panic timeout isn't originating from gno but from the golang test suite, likely from github.com/golang/go/src/testing/testing.go:2366, because the -timeout=20m parameter is specified in the go test command you sent.
So it's unlikely to be linked to a stdout/stdin issue, or even related to gnovm.
I am gonna take this over.
According to @zivkovicmilos, the short-term solution is #1495. For the long-term solution, Milos wants to collaborate with @petar-dambovaliev on a larger refactoring.
A potential temporary solution to provide more comfort for everyone is to label the tests as "flappy" so that they are checked intermittently instead of always (https://github.com/gnolang/gno/issues/202). However, it is advisable to avoid relying on this temporary system.
diff --git a/tm2/pkg/bft/consensus/state_test.go b/tm2/pkg/bft/consensus/state_test.go
index 35877837..713be1df 100644
--- a/tm2/pkg/bft/consensus/state_test.go
+++ b/tm2/pkg/bft/consensus/state_test.go
@@ -1055,6 +1055,7 @@ func TestProposeValidBlock(t *testing.T) {
t.Log("### ONTO ROUND 4")
ensureNewProposal(proposalCh, height, round)
+ ensurePrevote(voteCh, height, round)
rs = cs1.GetRoundState()
assert.True(t, bytes.Equal(rs.ProposalBlock.Hash(), propBlockHash))
^ that should fix it for the test TestProposeValidBlock, and likewise for similar flappy BFT tests. There are probably no sync/lock bugs in Tendermint core bft; the reason it is locked on cs1.GetRoundState() is that the consensus engine is blocked trying to write an event (in this case, a prevote event).
You will find that the deadlock happens toward the end of the test. This is because the race condition usually only manifests at the end, where we forgot to read an event from a channel (the "ensure*" helpers consume these). These cases are relatively easy to fix by looking at similar tests (or the same one above) and checking the order of ensure* calls for each round.
It's also possible that the deadlock happens within a round, in the middle of a test. In that case the fix should still be easy: probably just move one or more ensure* calls up, before a call like cs1.GetRoundState().
The WARN statements ("[WARN] EventSwitch subscriber test-client blocked on ") were put there to help fix these tests. So please use them, and also keep the WARN logging, because it will come in handy in the future.
The "goroutine 20840 [select, 11 minutes]:" entry for events.SubscribeFilteredOn is misleading, because it's actually 11m plus the time that produced the prior WARN messages: 40s + 1m20s + 2m40s + 5m20s = 600s = 10m. So this event send was apparently stuck for about 21 minutes.
In the end, our tests will have precise event consumption testing, which is good. Any solution that tries to mask the problem by buffering the event system is just going to end up encountering some other more complex problem later. So please keep the BFT tests as is and keep adding ensure methods to completely cover all events and the order they are produced.
I attempted multiple times to make these tests fail locally, but was unsuccessful, even with -count 100.
Does anyone know how we can write comprehensive tests that can fail locally and not only on the CI? One idea is to use the following pseudocode:
for keyword1 in keywords {
for keyword2 in keywords {
setPerTestTimeout(...)
node.init()
node.add(keyword1)
node.add(keyword2)
node.processEvents()
assertEmptyQueue()
}
}
The problem stems from performance differences and spurts. That's not something you can generally detect unless you use some tool that randomly modifies the timing of various goroutines and allows specifying behavior. I'm not suggesting that at all, that is crazy lol.
For the BFT tests we can do this:
--- a/tm2/pkg/bft/consensus/state_test.go
+++ b/tm2/pkg/bft/consensus/state_test.go
@@ -1055,6 +1055,9 @@ func TestProposeValidBlock(t *testing.T) {
t.Log("### ONTO ROUND 4")
ensureNewProposal(proposalCh, height, round)
+ //ensurePrevote(voteCh, height, round)
+ longEnough := 10 * time.Second
+ time.Sleep(longEnough)
rs = cs1.GetRoundState()
assert.True(t, bytes.Equal(rs.ProposalBlock.Hash(), propBlockHash))
Wait long enough before any final state reads or assertions on BFT tests for the consensus round. Locally this should get yours to fail too; 10 seconds should be enough on a local laptop for the BFT state calculations to go through. On CI it may need to be a bit longer. But we can enable this optionally, via a flag that searches for these cases, rather than making it the norm. Nobody wants BFT test cases to each take 10 seconds longer lol.
This has been fixed as part of #1515.
This is the test with the highest flappiness/flakiness.
Related with #202