Open viniciusartur opened 5 years ago
Thanks for reporting @viniciusartur! Without having checked the code, I think the problem is that Consul defaults to the hostname for the `NodeName`, which in turn is used by Serf. What you describe might be a bug; I am not sure yet. Maybe you can work around it by explicitly providing a node name with `-node`.
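A minimal sketch of that workaround, assuming the agent is launched from a wrapper script. The helper name `consul_agent_args`, the node name, and the data directory are hypothetical; only the `-server`, `-node`, and `-data-dir` agent flags come from Consul itself.

```python
from typing import List


def consul_agent_args(node_name: str, data_dir: str) -> List[str]:
    """Build an argv for `consul agent` that pins the node name explicitly,
    so the agent does not fall back to the machine's current hostname."""
    if not node_name:
        raise ValueError("node_name must be non-empty")
    return [
        "consul", "agent", "-server",
        f"-node={node_name}",
        f"-data-dir={data_dir}",
    ]


if __name__ == "__main__":
    # Hypothetical stable name and path, chosen only for illustration.
    print(" ".join(consul_agent_args("consul-server-3", "/opt/consul")))
```

Pinning the name this way only helps if the same value is used on every restart; it is not a fix for a cluster that has already recorded two names for one server.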
I tried to debug this. Here are my findings, which may be helpful.
As @viniciusartur mentioned, `consul operator raft list-peers` shows different output each time.
Things I noted:
1) NodeName is changing
2) ID is the same
3) Voter is false (the node is a follower)
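The observations above can be checked mechanically by comparing two captures of the `list-peers` output. This is a rough sketch: it assumes the usual whitespace-separated table with `Node` and `ID` as the first two columns, and the node names `node-a`, `node-old`, and `node-new` are made up for illustration (the IDs and addresses are taken from the logs in this thread).

```python
from typing import Dict, Tuple


def parse_list_peers(output: str) -> Dict[str, str]:
    """Map raft server ID -> node name from list-peers text output.
    Assumes a header row followed by rows whose first two columns
    are the node name and the server ID."""
    rows = [ln for ln in output.splitlines() if ln.strip()]
    peers: Dict[str, str] = {}
    for row in rows[1:]:  # skip the header row
        fields = row.split()
        if len(fields) >= 2:
            peers[fields[1]] = fields[0]
    return peers


def renamed_peers(before: Dict[str, str],
                  after: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return IDs present in both captures whose node name differs."""
    return {
        sid: (before[sid], after[sid])
        for sid in before.keys() & after.keys()
        if before[sid] != after[sid]
    }


# Two hypothetical captures taken a few seconds apart.
CAPTURE_1 = """\
Node      ID                                    Address          State     Voter  RaftProtocol
node-a    d7604537-5ad7-5cc9-8deb-34851162e7b5  172.18.0.2:8300  leader    true   3
node-old  9ba93dec-371e-a43a-fdf0-d3c97685adc9  172.18.0.5:8300  follower  false  3
"""
CAPTURE_2 = CAPTURE_1.replace("node-old", "node-new")

if __name__ == "__main__":
    print(renamed_peers(parse_list_peers(CAPTURE_1), parse_list_peers(CAPTURE_2)))
```

Keying on the ID rather than the name is deliberate, since the ID is the one field that stays stable while the name flaps.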
In the leader logs, the follower mentioned above keeps being removed and added:
2019/09/07 18:53:13 [INFO] consul: removing server by ID: "9ba93dec-371e-a43a-fdf0-d3c97685adc9"
2019/09/07 18:53:13 [INFO] raft: Updating configuration with RemoveServer (9ba93dec-371e-a43a-fdf0-d3c97685adc9, ) to [{Suffrage:Voter ID:d7604537-5ad7-5cc9-8deb-34851162e7b5 Address:172.18.0.2:8300} {Suffrage:Voter ID:596724a9-7f24-cdf8-3f66-b0c72d6a006d Address:172.18.0.3:8300} {Suffrage:Voter ID:2207987c-22aa-054e-5ec5-6f934e76ed64 Address:172.18.0.4:8300} {Suffrage:Voter ID:de91d962-ba0b-8b12-4b3e-5469f683b0ae Address:172.18.0.6:8300}]
2019/09/07 18:53:13 [INFO] raft: Removed peer 9ba93dec-371e-a43a-fdf0-d3c97685adc9, stopping replication after 543
2019/09/07 18:53:13 [INFO] raft: aborting pipeline replication to peer {Nonvoter 9ba93dec-371e-a43a-fdf0-d3c97685adc9 172.18.0.5:8300}
2019/09/07 18:53:13 [INFO] raft: Updating configuration with AddNonvoter (9ba93dec-371e-a43a-fdf0-d3c97685adc9, 172.18.0.5:8300) to [{Suffrage:Voter ID:d7604537-5ad7-5cc9-8deb-34851162e7b5 Address:172.18.0.2:8300} {Suffrage:Voter ID:596724a9-7f24-cdf8-3f66-b0c72d6a006d Address:172.18.0.3:8300} {Suffrage:Voter ID:2207987c-22aa-054e-5ec5-6f934e76ed64 Address:172.18.0.4:8300} {Suffrage:Voter ID:de91d962-ba0b-8b12-4b3e-5469f683b0ae Address:172.18.0.6:8300} {Suffrage:Nonvoter ID:9ba93dec-371e-a43a-fdf0-d3c97685adc9 Address:172.18.0.5:8300}]
2019/09/07 18:53:13 [INFO] raft: Added peer 9ba93dec-371e-a43a-fdf0-d3c97685adc9, starting replication
2019/09/07 18:53:13 [WARN] raft: AppendEntries to {Nonvoter 9ba93dec-371e-a43a-fdf0-d3c97685adc9 172.18.0.5:8300} rejected, sending older logs (next: 544)
2019/09/07 18:53:13 [INFO] raft: pipelining replication to peer {Nonvoter 9ba93dec-371e-a43a-fdf0-d3c97685adc9 172.18.0.5:8300}
2019/09/07 18:53:13 [INFO] consul: removing server by address: "172.18.0.5:8300"
2019/09/07 18:53:13 [ERR] consul: failed to remove raft peer '172.18.0.5:8300': operation not supported with current protocol version
2019/09/07 18:53:13 [ERR] consul: failed to reconcile member: {4cb4ea2e3fa9 172.18.0.5 8301 map[acls:0 build:1.6.0:944cc710 dc:dc1 expect:5 id:1c7e8c48-178d-902d-6ea4-78ac4d869904 port:8300 raft_vsn:3 role:consul segment: vsn:2 vsn_max:3 vsn_min:2 wan_join_port:8302] left 1 5 2 2 5 4}: operation not supported with current protocol version
If an agent with the same ID joins, should we reject it? Or should we delete the old entry in `consul members` if it is already in the `left` state?
I tried joining the agent with a different node ID using the `-node-id` flag. It joined successfully and became a voter/follower. But `NodeName` was still changing in the `consul operator raft list-peers` output, and the leader still logs this error:
2019/09/07 19:17:35 [INFO] consul: removing server by address: "172.18.0.5:8300"
2019/09/07 19:17:35 [ERR] consul: failed to remove raft peer '172.18.0.5:8300': operation not supported with current protocol version
2019/09/07 19:17:35 [ERR] consul: failed to reconcile member: {4cb4ea2e3fa9 172.18.0.5 8301 map[acls:0 build:1.6.0:944cc710 dc:dc1 expect:5 id:1c7e8c48-178d-902d-6ea4-78ac4d869904 port:8300 raft_vsn:3 role:consul segment: vsn:2 vsn_max:3 vsn_min:2 wan_join_port:8302] left 1 5 2 2 5 4}: operation not supported with current protocol version
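To get a feel for how often the leader cycles through this, the log lines can be tallied per server. The sketch below is only an illustration: the regular expressions are written against the exact messages quoted in this thread (Consul 1.6.0) and may not match other versions, and the `summarize` helper and event labels are hypothetical names.

```python
import re
from collections import Counter
from typing import Iterable

# Patterns derived from the log lines quoted in this thread.
PATTERNS = {
    "remove_by_id": re.compile(r'consul: removing server by ID: "([0-9a-f-]+)"'),
    "raft_removed": re.compile(r"raft: Removed peer ([0-9a-f-]+),"),
    "raft_added": re.compile(r"raft: Added peer ([0-9a-f-]+),"),
    "remove_failed": re.compile(r"consul: failed to remove raft peer '([^']+)'"),
}


def summarize(lines: Iterable[str]) -> Counter:
    """Count (event, server ID or address) pairs found in leader log lines."""
    counts: Counter = Counter()
    for line in lines:
        for event, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(event, match.group(1))] += 1
                break
    return counts


# Lines copied from the leader log above.
SAMPLE = [
    '2019/09/07 18:53:13 [INFO] consul: removing server by ID: "9ba93dec-371e-a43a-fdf0-d3c97685adc9"',
    "2019/09/07 18:53:13 [INFO] raft: Removed peer 9ba93dec-371e-a43a-fdf0-d3c97685adc9, stopping replication after 543",
    "2019/09/07 18:53:13 [INFO] raft: Added peer 9ba93dec-371e-a43a-fdf0-d3c97685adc9, starting replication",
    "2019/09/07 18:53:13 [ERR] consul: failed to remove raft peer '172.18.0.5:8300': operation not supported with current protocol version",
]

if __name__ == "__main__":
    for (event, subject), n in sorted(summarize(SAMPLE).items()):
        print(f"{event:14} {subject} x{n}")
```

A steadily growing `raft_removed` / `raft_added` pair for the same ID is the flapping described above.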
I'm having the same problem currently - a server temporarily changed hostname. I'm seeing the same behaviour for both consul and nomad, so I assume the problem is indeed at the serf layer.
I'm not sure how to recover from this situation, as the affected server is failing to rejoin the cluster (it keeps getting removed).
I reproduced the issue easily in a test cluster.
After attempting a bunch of things to repair the cluster and allow the node to join, I managed to fix it by stopping consul/nomad and then manually trimming the start of the `nomad/server/serf/snapshot` file (and for consul, the `serf/*.snapshot` files) to remove any mention of the "bad" hostname. I did this one-by-one on each of the servers in the cluster.
This eventually enabled the problematic node to successfully join (for some reason it took a couple of attempts in production).
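For anyone wanting to script that cleanup, here is a rough sketch. It is not the exact procedure described above: instead of trimming the start of the file by hand, it drops every line that mentions the bad hostname as a whole word. It assumes the Serf snapshot is a line-oriented text file (entries such as `alive: <name> <addr>`), that the agent is stopped first, and that `drop_hostname_lines` and the sample hostnames are hypothetical names for illustration. Treat it as untested against a real cluster and keep the backup it writes.

```python
import shutil
import tempfile
from pathlib import Path


def drop_hostname_lines(snapshot: Path, bad_name: str) -> int:
    """Remove every line that mentions bad_name as a whole token.
    Writes a .bak copy next to the file first and returns the number
    of lines removed. Run only while the agent is stopped."""
    backup = snapshot.with_name(snapshot.name + ".bak")
    shutil.copy2(snapshot, backup)
    lines = snapshot.read_text(encoding="utf-8").splitlines(keepends=True)
    kept = [ln for ln in lines if bad_name not in ln.split()]
    snapshot.write_text("".join(kept), encoding="utf-8")
    return len(lines) - len(kept)


if __name__ == "__main__":
    # Demo on a throwaway file; the contents are made-up sample entries.
    with tempfile.TemporaryDirectory() as tmp:
        snap = Path(tmp) / "local.snapshot"
        snap.write_text(
            "alive: bad-host 172.18.0.5:8301\n"
            "alive: good-host 172.18.0.2:8301\n"
            "not-alive: bad-host\n",
            encoding="utf-8",
        )
        print(drop_hostname_lines(snap, "bad-host"), "lines removed")
        print(snap.read_text(encoding="utf-8"), end="")
```

Matching on whole tokens avoids deleting entries for a different node whose name merely contains the bad hostname as a substring.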
Overview of the Issue
When a follower is restarted after its hostname has changed, it causes many error messages in the leader's log and in its own log, both over time and whenever there is a new election.
Reproduction Steps
Steps to reproduce this issue:
Raft peers flapping
If you watch the list of peers on each node, you can see that the follower flaps between the old name and the new name.
It shows one of these two outputs:
or
Log Fragments
The server starts throwing many error logs similar to this one:
From time to time, both the leader and the follower whose hostname changed throw error messages like this:
Leader:
The follower: