goraft / raft

UNMAINTAINED: A Go implementation of the Raft distributed consensus protocol.
MIT License
2.43k stars 480 forks source link

raftd sample cluster reaches un-startable state: panic: assertion failed: leader.elected.at.same.term.345 #240

Closed glycerine closed 10 years ago

glycerine commented 10 years ago

On osx 10.9.4 mavericks with go1.3.1

I started a 3 node raftd cluster, then killed off various servers. After I killed all servers and tried to restart them, I am now wedged and cannot start the 3rd server without getting a panic about leader.elected.at.same.term.345


Jasons-MacBook-Pro:ceptor jasona$ raftd -v -p 4003 localhost:4001 ~/node.3
Verbose logging enabled.
2014/10/25 22:38:24 Initializing Raft Server: localhost:4001
2014/10/25 22:38:24 Recovered from log
2014/10/25 22:38:24 Initializing HTTP server
2014/10/25 22:38:24 Listening at: http://localhost:4003
panic: assertion failed: leader.elected.at.same.term.345

goroutine 20 [running]:
runtime.panic(0x2a80a0, 0xc208090cb0)
    /Users/jasona/pkg/go1.3.1/go/src/pkg/runtime/panic.c:279 +0xf5
github.com/goraft/raft._assert(0x399c00, 0x3e1e30, 0x1f, 0x6aad18, 0x1, 0x1)
    /home/jaten/go/src/github.com/goraft/raft/util.go:60 +0xcc
github.com/goraft/raft.(*server).processAppendEntriesRequest(0xc208002fc0, 0xc208018960, 0xc208018960, 0\
xc208018960)
    /home/jaten/go/src/github.com/goraft/raft/server.go:948 +0x367
github.com/goraft/raft.(*server).leaderLoop(0xc208002fc0)
    /home/jaten/go/src/github.com/goraft/raft/server.go:849 +0x67a
github.com/goraft/raft.(*server).loop(0xc208002fc0)
    /home/jaten/go/src/github.com/goraft/raft/server.go:609 +0x3d1
github.com/goraft/raft.func·007()
    /home/jaten/go/src/github.com/goraft/raft/server.go:470 +0x5d
created by github.com/goraft/raft.(*server).Start
    /home/jaten/go/src/github.com/goraft/raft/server.go:471 +0x3b8

goroutine 16 [IO wait]:
net.runtime_pollWait(0x653050, 0x72, 0x0)
    /Users/jasona/pkg/go1.3.1/go/src/pkg/runtime/netpoll.goc:146 +0x66
net.(*pollDesc).Wait(0xc208089a30, 0x72, 0x0, 0x0)
    /Users/jasona/pkg/go1.3.1/go/src/pkg/net/fd_poll_runtime.go:84 +0x46
net.(*pollDesc).WaitRead(0xc208089a30, 0x0, 0x0)
    /Users/jasona/pkg/go1.3.1/go/src/pkg/net/fd_poll_runtime.go:89 +0x42
net.(*netFD).accept(0xc2080899d0, 0x43a8c8, 0x0, 0x651418, 0x23)
    /Users/jasona/pkg/go1.3.1/go/src/pkg/net/fd_unix.go:419 +0x343
net.(*TCPListener).AcceptTCP(0xc208034188, 0x10f1d3, 0x0, 0x0)
    /Users/jasona/pkg/go1.3.1/go/src/pkg/net/tcpsock_posix.go:234 +0x5d
net/http.tcpKeepAliveListener.Accept(0xc208034188, 0x0, 0x0, 0x0, 0x0)
    /Users/jasona/pkg/go1.3.1/go/src/pkg/net/http/server.go:1947 +0x4b
net/http.(*Server).Serve(0xc208004a80, 0x653100, 0xc208034188, 0x0, 0x0)
    /Users/jasona/pkg/go1.3.1/go/src/pkg/net/http/server.go:1698 +0x91
net/http.(*Server).ListenAndServe(0xc208004a80, 0x0, 0x0)
    /Users/jasona/pkg/go1.3.1/go/src/pkg/net/http/server.go:1688 +0x14d
github.com/goraft/raftd/server.(*Server).ListenAndServe(0xc208050200, 0x38eb70, 0x0, 0x0, 0x0)
    /home/jaten/go/src/github.com/goraft/raftd/server/server.go:120 +0xc01
main.main()
    /home/jaten/go/src/github.com/goraft/raftd/main.go:66 +0x4a9

goroutine 19 [finalizer wait]:
... and many more goroutine stacks (omitted) ...

full logs and data dirs here: https://drive.google.com/file/d/0BxpncsKbxeGWR2RTb3BOR2tiaWM/view?usp=sharing

smuir commented 10 years ago

You're not restarting the clients with the correct command line arguments. raftd interprets the first non-option argument as the data directory, and ignores subsequent ones silently. In your invocation, raftd -v -p 4003 localhost:4001 ~/node.3 you forgot the -join argument preceding localhost:4001 and thus a directory called 'localhost:4001' is actually used for the configuration. You can confirm this by reading the log files for clients 2 and 3 and seeing that happen, for example:

Jasons-MacBook-Pro:ceptor jasona$ raftd -trace -p 4002 localhost:4001 ~/node.2
Raft trace debugging enabled.
2014/10/25 22:04:01 Initializing Raft Server: localhost:4001
[raft]22:04:01.045543 [40536ee Term:0] readConf.open  localhost:4001/conf
[raft]22:04:01.045557 log.open.open  localhost:4001/log
[raft]22:04:01.045606 log.open.create  localhost:4001/log
[raft]22:04:01.045616 [40536ee Term:0] start as a new raft server

You thus have two clients sharing that configuration directory, and all kinds of weird things will happen.

glycerine commented 10 years ago

Makes sense, thanks. The -join flag was omitted because raftd objects to it, but I didn't realize the address following it was the argument to join. Thanks again. - Jason