tidwall / uhaha

High Availability Raft Framework for Go
MIT License
604 stars 35 forks source link

Leader election issues? #6

Open prologic opened 3 years ago

prologic commented 3 years ago

HI @tidwall 👋

Getting back to playing with your library again and making some updates/fixes to my bitraft I noticed some strange behaviour I'm wondering if you can help figure out.

I form a 3-node cluster:

Node 1:

james@Jamess-MacBook-Pro
Mon Jun 07 11:22:17
~/tmp/bitraft
 (upgrade_finn_uhaha) 0
$ ./bitraft -i 1 -p ./data1 -b 0.0.0.0:4920
INFO[0000] 63366:S 07 Jun 2021 11:22:21.566 # starting bitraft version 0.0.1@HEAD
INFO[0000] 63366:S 07 Jun 2021 11:22:21.861 * synchronized time
INFO[0000] 63366:S 07 Jun 2021 11:22:21.861 * server listening at [::]:4920
INFO[0000] 63366:S 07 Jun 2021 11:22:21.861 * server advertising as 0.0.0.0:4920
INFO[0000] 63366:S 07 Jun 2021 11:22:21.935 * initial configuration: index=0 servers=[]
INFO[0000] 63366:S 07 Jun 2021 11:22:21.935 * bootstrapping new cluster
INFO[0000] 63366:F 07 Jun 2021 11:22:21.935 * entering follower state: follower="Node at [::]:4920 [Follower]" leader=
INFO[0001] 63366:F 07 Jun 2021 11:22:23.366 # heartbeat timeout reached, starting election: last-leader=
INFO[0001] 63366:C 07 Jun 2021 11:22:23.366 * entering candidate state: node="Node at [::]:4920 [Candidate]" term=2
INFO[0001] 63366:C 07 Jun 2021 11:22:23.366 * election won: tally=1
INFO[0001] 63366:L 07 Jun 2021 11:22:23.366 * entering leader state: leader="Node at [::]:4920 [Leader]"
INFO[0002] 63366:L 07 Jun 2021 11:22:23.937 * logs loaded: ready for commands
INFO[0044] 63366:L 07 Jun 2021 11:23:06.027 * updating configuration: command=AddStaging server-id=2 server-addr=0.0.0.0:4921 servers="[{Suffrage:Voter ID:1 Address:0.0.0.0:4920} {Suffrage:Voter ID:2 Address:0.0.0.0:4921}]"
INFO[0044] 63366:L 07 Jun 2021 11:23:06.027 * added peer, starting replication: peer=2
INFO[0044] 63366:L 07 Jun 2021 11:23:06.029 # appendEntries rejected, sending older logs: peer="{Voter 2 0.0.0.0:4921}" next=1
INFO[0044] 63366:L 07 Jun 2021 11:23:06.030 * pipelining replication: peer="{Voter 2 0.0.0.0:4921}"
INFO[0106] 63366:L 07 Jun 2021 11:24:07.976 * updating configuration: command=AddStaging server-id=3 server-addr=0.0.0.0:4922 servers="[{Suffrage:Voter ID:1 Address:0.0.0.0:4920} {Suffrage:Voter ID:2 Address:0.0.0.0:4921} {Suffrage:Voter ID:3 Address:0.0.0.0:4922}]"
INFO[0106] 63366:L 07 Jun 2021 11:24:07.976 * added peer, starting replication: peer=3
INFO[0106] 63366:L 07 Jun 2021 11:24:07.978 # appendEntries rejected, sending older logs: peer="{Voter 3 0.0.0.0:4922}" next=1
INFO[0106] 63366:L 07 Jun 2021 11:24:07.982 * pipelining replication: peer="{Voter 3 0.0.0.0:4922}"

INFO[0300] 63366:L 07 Jun 2021 11:27:22.554 * aborting pipeline replication: peer="{Voter 3 0.0.0.0:4922}"
INFO[0300] 63366:L 07 Jun 2021 11:27:22.565 # failed to heartbeat to: peer=0.0.0.0:4922 error=EOF
INFO[0301] 63366:L 07 Jun 2021 11:27:23.055 # failed to contact: server-id=3 time=501.441291ms
INFO[0301] 63366:L 07 Jun 2021 11:27:23.531 # failed to contact: server-id=3 time=977.02433ms
INFO[0302] 63366:L 07 Jun 2021 11:27:23.622 # failed to appendEntries to: peer="{Voter 3 0.0.0.0:4922}" error="dial tcp 0.0.0.0:4922: connect: connection refused"
INFO[0302] 63366:L 07 Jun 2021 11:27:23.682 # failed to heartbeat to: peer=0.0.0.0:4922 error="dial tcp 0.0.0.0:4922: connect: connection refused"
INFO[0302] 63366:L 07 Jun 2021 11:27:24.007 # failed to contact: server-id=3 time=1.453318097s
INFO[0303] 63366:L 07 Jun 2021 11:27:24.633 # failed to appendEntries to: peer="{Voter 3 0.0.0.0:4922}" error="dial tcp 0.0.0.0:4922: connect: connection refused"
INFO[0303] 63366:L 07 Jun 2021 11:27:24.867 # failed to heartbeat to: peer=0.0.0.0:4922 error="dial tcp 0.0.0.0:4922: connect: connection refused"
INFO[0304] 63366:L 07 Jun 2021 11:27:25.644 # failed to appendEntries to: peer="{Voter 3 0.0.0.0:4922}" error="dial tcp 0.0.0.0:4922: connect: connection refused"
INFO[0304] 63366:L 07 Jun 2021 11:27:25.672 * pipelining replication: peer="{Voter 3 0.0.0.0:4922}"
INFO[0304] 63366:L 07 Jun 2021 11:27:26.007 # failed to heartbeat to: peer=0.0.0.0:4922 error="dial tcp 0.0.0.0:4922: connect: connection refused"

Node 2:

james@Jamess-MacBook-Pro
Mon Jun 07 11:22:46
~/tmp/bitraft
 (upgrade_finn_uhaha) 0
$ ./bitraft -i 2 -p ./data2 -b 0.0.0.0:4921 -j 127.0.0.1:4920
INFO[0000] 63379:S 07 Jun 2021 11:23:05.679 # starting bitraft version 0.0.1@HEAD
INFO[0000] 63379:S 07 Jun 2021 11:23:05.956 * synchronized time
INFO[0000] 63379:S 07 Jun 2021 11:23:05.956 * server listening at [::]:4921
INFO[0000] 63379:S 07 Jun 2021 11:23:05.956 * server advertising as 0.0.0.0:4921
INFO[0000] 63379:S 07 Jun 2021 11:23:06.027 * initial configuration: index=0 servers=[]
INFO[0000] 63379:S 07 Jun 2021 11:23:06.027 * joining existing cluster at 127.0.0.1:4920
INFO[0000] 63379:F 07 Jun 2021 11:23:06.027 * entering follower state: follower="Node at [::]:4921 [Follower]" leader=
INFO[0000] 63379:F 07 Jun 2021 11:23:06.029 # failed to get previous log: previous-index=214 last-index=0 error="log not found"

Node 3:

james@Jamess-MacBook-Pro
Mon Jun 07 11:27:22
~/tmp/bitraft
 (upgrade_finn_uhaha) 130
$ ./bitraft -i 3 -p ./data3 -b 0.0.0.0:4922 -j 127.0.0.1:4920
INFO[0000] 63413:S 07 Jun 2021 11:27:25.137 # starting bitraft version 0.0.1@HEAD
INFO[0000] 63413:S 07 Jun 2021 11:27:25.427 * synchronized time
INFO[0000] 63413:S 07 Jun 2021 11:27:25.427 * server listening at [::]:4922
INFO[0000] 63413:S 07 Jun 2021 11:27:25.427 * server advertising as 0.0.0.0:4922
INFO[0000] 63413:S 07 Jun 2021 11:27:25.534 * initial configuration: index=522 servers="[{Suffrage:Voter ID:1 Address:0.0.0.0:4920} {Suffrage:Voter ID:2 Address:0.0.0.0:4921} {Suffrage:Voter ID:3 Address:0.0.0.0:4922}]"
INFO[0000] 63413:S 07 Jun 2021 11:27:25.534 # ignoring join request because server already belongs to a cluster
INFO[0000] 63413:F 07 Jun 2021 11:27:25.534 * entering follower state: follower="Node at [::]:4922 [Follower]" leader=

Then I query the cluster with RAFT LEADER and RAFT SERVER LIST:

$ telnet localhost 4920
Trying ::1...
Connected to localhost.
Escape character is '^]'.
RAFT LEADER
$0

RAFT SERVER LIST
*3
*6
$2
id
$1
1
$7
address
$12
0.0.0.0:4920
$6
leader
$5
false
*6
$2
id
$1
2
$7
address
$12
0.0.0.0:4921
$6
leader
$5
false
*6
$2
id
$1
3
$7
address
$12
0.0.0.0:4922
$6
leader
$5
false

It would appear nobody knows who the leader is? However writes do work as well as reads and there no obvious errors on Node 1 (besides what you see above).

SET foo bar
+OK
GET foo
$3
bar
tidwall commented 3 years ago

Hi @prologic

I just tested on my side and I wasn't able to get a blank result.

I think that this issue was addresses in v0.8.0.

tidwall commented 3 years ago

Though unlikely, perhaps there's an inconsistency in the library due to the platform. I would like to reproduce what you are seeing. Is the bitraft branch that you are seeing the error with publicly available?

tidwall commented 3 years ago

I noticed that your '0.0.0.0:4920' is advertising both as '[::]:4920' and '0.0.0.0:4920' in your logs. I don't know if this is the cause or not, but I just pushed a commit that disables the ipv6 listener. For my use cases, I don't need ipv6. And it should clear up the mismatch.

prologic commented 3 years ago

Is the bitraft branch that you are seeing the error with publicly available?

Yes. https://github.com/prologic/bitraft/tree/upgrade_finn_uhaha

prologic commented 3 years ago

Okay I'll try upgrading the version of uhaha I'm using and report back. Thanks!

pims commented 3 years ago

I noticed that your '0.0.0.0:4920' is advertising both as '[::]:4920' and '0.0.0.0:4920' in your logs. I don't know if this is the cause or not, but I just pushed a commit that disables the ipv6 listener. For my use cases, I don't need ipv6. And it should clear up the mismatch.

That was the issue for me. Upgrading fixed it.