relab / raft

Raft Consensus with Gorums
MIT License
1 stars 1 forks source link

Tests in raft/raftgorums deadlocks #11

Open meling opened 6 years ago

meling commented 6 years ago

I get the following when running the tests in raftgorums:

make test
go test -v
=== RUN   TestMinMax
=== RUN   TestMinMax/0,0
=== RUN   TestMinMax/0,1
=== RUN   TestMinMax/1,0
--- PASS: TestMinMax (0.00s)
    --- PASS: TestMinMax/0,0 (0.00s)
    --- PASS: TestMinMax/0,1 (0.00s)
    --- PASS: TestMinMax/1,0 (0.00s)
=== RUN   TestRandomTimeout
--- PASS: TestRandomTimeout (0.00s)
=== RUN   TestHandleAppendEntriesRequest
=== RUN   TestHandleAppendEntriesRequest/reject_lower_term
=== RUN   TestHandleAppendEntriesRequest/successfully_append_entry
=== RUN   TestHandleAppendEntriesRequest/successfully_overwrite_entry
=== RUN   TestHandleAppendEntriesRequest/successfully_overwrite_entries
=== RUN   TestHandleAppendEntriesRequest/successful_on_already_committed_but_ignore_entries
=== RUN   TestHandleAppendEntriesRequest/raft_paper_fig_7_follower_a
=== RUN   TestHandleAppendEntriesRequest/raft_paper_fig_7_follower_b
=== RUN   TestHandleAppendEntriesRequest/raft_paper_fig_7_follower_c
=== RUN   TestHandleAppendEntriesRequest/raft_paper_fig_7_follower_d
=== RUN   TestHandleAppendEntriesRequest/raft_paper_fig_7_follower_e
--- PASS: TestHandleAppendEntriesRequest (0.00s)
    --- PASS: TestHandleAppendEntriesRequest/reject_lower_term (0.00s)
    --- PASS: TestHandleAppendEntriesRequest/successfully_append_entry (0.00s)
    --- PASS: TestHandleAppendEntriesRequest/successfully_overwrite_entry (0.00s)
    --- PASS: TestHandleAppendEntriesRequest/successfully_overwrite_entries (0.00s)
    --- PASS: TestHandleAppendEntriesRequest/successful_on_already_committed_but_ignore_entries (0.00s)
    --- PASS: TestHandleAppendEntriesRequest/raft_paper_fig_7_follower_a (0.00s)
    --- PASS: TestHandleAppendEntriesRequest/raft_paper_fig_7_follower_b (0.00s)
    --- PASS: TestHandleAppendEntriesRequest/raft_paper_fig_7_follower_c (0.00s)
    --- PASS: TestHandleAppendEntriesRequest/raft_paper_fig_7_follower_d (0.00s)
    --- PASS: TestHandleAppendEntriesRequest/raft_paper_fig_7_follower_e (0.00s)
=== RUN   TestLeaderElection
=== RUN   TestLeaderElection/elect,_leader:_7,_n:_7

(it does not finish the test... )

meling commented 6 years ago

Just adding some more data points:

WARNING: DATA RACE
Write at 0x00c4201cf131 by goroutine 53:
  testing.(*common).FailNow()
      /usr/local/Cellar/go/1.9.2/libexec/src/testing/testing.go:543 +0x4a
  github.com/relab/raft/raftgorums_test.(*testServer).Run()
      /Users/meling/Dropbox/work/go/src/github.com/relab/raft/raftgorums/integration_test.go:157 +0x143

Previous write at 0x00c4201cf131 by goroutine 52:
  testing.(*common).FailNow()
      /usr/local/Cellar/go/1.9.2/libexec/src/testing/testing.go:543 +0x4a
  github.com/relab/raft/raftgorums_test.(*testServer).Run()
      /Users/meling/Dropbox/work/go/src/github.com/relab/raft/raftgorums/integration_test.go:157 +0x143

Goroutine 53 (running) created at:
  github.com/relab/raft/raftgorums_test.testElectLeader()
      /Users/meling/Dropbox/work/go/src/github.com/relab/raft/raftgorums/integration_test.go:184 +0x4e9
  github.com/relab/raft/raftgorums_test.TestLeaderElection.func1()
      /Users/meling/Dropbox/work/go/src/github.com/relab/raft/raftgorums/integration_test.go:43 +0x8f
  testing.tRunner()
      /usr/local/Cellar/go/1.9.2/libexec/src/testing/testing.go:746 +0x16c

Goroutine 52 (finished) created at:
  github.com/relab/raft/raftgorums_test.testElectLeader()
      /Users/meling/Dropbox/work/go/src/github.com/relab/raft/raftgorums/integration_test.go:184 +0x4e9
  github.com/relab/raft/raftgorums_test.TestLeaderElection.func1()
      /Users/meling/Dropbox/work/go/src/github.com/relab/raft/raftgorums/integration_test.go:43 +0x8f
  testing.tRunner()
      /usr/local/Cellar/go/1.9.2/libexec/src/testing/testing.go:746 +0x16c
==================
==================
WARNING: DATA RACE
Write at 0x00c4201cf131 by goroutine 58:
  testing.(*common).FailNow()
      /usr/local/Cellar/go/1.9.2/libexec/src/testing/testing.go:543 +0x4a
  github.com/relab/raft/raftgorums_test.(*testServer).Run()
      /Users/meling/Dropbox/work/go/src/github.com/relab/raft/raftgorums/integration_test.go:157 +0x143

Previous write at 0x00c4201cf131 by goroutine 52:
  testing.(*common).FailNow()
      /usr/local/Cellar/go/1.9.2/libexec/src/testing/testing.go:543 +0x4a
  github.com/relab/raft/raftgorums_test.(*testServer).Run()
      /Users/meling/Dropbox/work/go/src/github.com/relab/raft/raftgorums/integration_test.go:157 +0x143

Goroutine 58 (running) created at:
  github.com/relab/raft/raftgorums_test.testElectLeader()
      /Users/meling/Dropbox/work/go/src/github.com/relab/raft/raftgorums/integration_test.go:184 +0x4e9
  github.com/relab/raft/raftgorums_test.TestLeaderElection.func1()
      /Users/meling/Dropbox/work/go/src/github.com/relab/raft/raftgorums/integration_test.go:43 +0x8f
  testing.tRunner()
      /usr/local/Cellar/go/1.9.2/libexec/src/testing/testing.go:746 +0x16c

Goroutine 52 (finished) created at:
  github.com/relab/raft/raftgorums_test.testElectLeader()
      /Users/meling/Dropbox/work/go/src/github.com/relab/raft/raftgorums/integration_test.go:184 +0x4e9
  github.com/relab/raft/raftgorums_test.TestLeaderElection.func1()
      /Users/meling/Dropbox/work/go/src/github.com/relab/raft/raftgorums/integration_test.go:43 +0x8f
  testing.tRunner()
      /usr/local/Cellar/go/1.9.2/libexec/src/testing/testing.go:746 +0x16c
meling commented 6 years ago

The culprit is line 190 in the integration test (and similar line for the other subtests of TestLeaderElection). It appears that the Apply() function is never called on the state machine, and hence the commit index isn't advanced beyond its initial zero value. @s111 could you take a look and let me know where the workload is applied to advance the commits. I'm still learning the code base. Thanks.

meling commented 6 years ago

Just logging the issues that I come across as I debug this:

FATA[0000] grpc: Server.RegisterService after Server.Serve for "gorums.Raft"
exit status 1
meling commented 6 years ago

The code in this test setup is horribly convoluted; too many dependent goroutines that needs to synchronize, and I don't understand why it was organized like this. Need to rewrite this stuff. Another problem is that logging was disabled for the tests, and that resulted in only exit status 1, without any real error.

meling commented 6 years ago

I've tried to run the tests both on Linux and macOS, just to see if there was some OS specific issue, but they deadlock on both. Did the leader election tests ever work at all?! If so, how do you run them? Some input would be helpful here @s111.

s111 commented 6 years ago

These integration tests have race conditions which were never resolved. I remember writing them mainly to find bugs and they did help me find and resolve a couple. I'll see if I can find some time to look at these, though, I don't think they are easy to resolve.