hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

force-leave not sufficient for EnsureRegistration failed with reserved node name #9939

Open pjbakker opened 3 years ago

pjbakker commented 3 years ago

Overview of the Issue

I have a 3-node Consul cluster running Consul 1.9.4 (revision 10bb6cb3b). All nodes come up, but one node (hostname: consul-server-02) keeps logging EnsureRegistration warnings.

The HashiCorp Consul-Errors-And-Warnings article says to run consul force-leave consul-server-02 on one of the other live nodes and then rejoin.

Running force-leave and then restarting consul-server-02 does not let it update the node ID registered for node name consul-server-02.

Expected behaviour: node name consul-server-02 is cleanly re-registered under its new node ID.

Reproduction Steps

I'm not sure how to reproduce the current state of the Consul cluster; I don't think I did anything particularly unusual.

But at this point:

  1. On consul-server-02: consul agent -config-dir /etc/consul.d/ -retry-join 192.168.0.10, resulting in WARN messages as shown below
  2. On consul-server-01: consul force-leave consul-server-02
  3. On consul-server-02: consul agent -config-dir /etc/consul.d/ -retry-join 192.168.0.10, still resulting in WARN messages as shown below

Consul info for both Server 1 and Server 2

Server 1 info

```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 10bb6cb3
    version = 1.9.4
consul:
    acl = disabled
    bootstrap = false
    known_datacenters = 1
    leader = true
    leader_addr = 192.168.0.10:8300
    server = true
raft:
    applied_index = 46001
    commit_index = 46001
    fsm_pending = 0
    last_contact = 0
    last_log_index = 46001
    last_log_term = 52
    last_snapshot_index = 32781
    last_snapshot_term = 51
    latest_configuration = [{Suffrage:Voter ID:44087325-f32d-acf4-34b5-45aa2ee5bbae Address:192.168.0.12:8300} {Suffrage:Voter ID:89927e54-0654-414f-f0d7-9546655e2f5c Address:192.168.0.10:8300} {Suffrage:Voter ID:598fc162-eab2-1d04-5bd9-9d360fad15c3 Address:192.168.0.11:8300}]
    latest_configuration_index = 0
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 52
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 322
    max_procs = 2
    os = linux
    version = go1.15.8
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 20
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1047
    members = 11
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 154
    members = 3
    query_queue = 0
    query_time = 1
```

Server 2 info

```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 10bb6cb3
    version = 1.9.4
consul:
    acl = disabled
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr = 192.168.0.10:8300
    server = true
raft:
    applied_index = 45988
    commit_index = 45988
    fsm_pending = 87
    last_contact = 69.558881ms
    last_log_index = 45988
    last_log_term = 52
    last_snapshot_index = 32785
    last_snapshot_term = 51
    latest_configuration = [{Suffrage:Voter ID:44087325-f32d-acf4-34b5-45aa2ee5bbae Address:192.168.0.12:8300} {Suffrage:Voter ID:89927e54-0654-414f-f0d7-9546655e2f5c Address:192.168.0.10:8300} {Suffrage:Nonvoter ID:598fc162-eab2-1d04-5bd9-9d360fad15c3 Address:192.168.0.11:8300}]
    latest_configuration_index = 0
    num_peers = 0
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 52
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 84
    max_procs = 2
    os = linux
    version = go1.15.8
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 20
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1047
    members = 11
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 154
    members = 3
    query_queue = 0
    query_time = 1
```

Operating system and Environment details

OS: Ubuntu 20.04

$ uname -a
Linux consul-server-01 5.4.0-70-generic #78-Ubuntu SMP Fri Mar 19 13:29:52 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Log Fragments

The following warning is logged repeatedly after the agent starts:

    2021-03-28T12:26:29.349Z [WARN]  agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "598fc162-eab2-1d04-5bd9-9d360fad15c3": Node name consul-server-02 is reserved by node cdbe853e-fe92-5668-a995-2ebe091e0500 with name consul-server-02 (192.168.0.11)"
pjbakker commented 3 years ago

It's even weirder than I thought. At the very end of the log messages I get a reversed warning:

    2021-03-28T12:44:14.450Z [WARN]  agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "598fc162-eab2-1d04-5bd9-9d360fad15c3": Node name consul-02 is reserved by node cdbe853e-fe92-5668-a995-2ebe091e0500 with name consul-02 (192.168.1.11)"
    2021-03-28T12:44:14.614Z [WARN]  agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "cdbe853e-fe92-5668-a995-2ebe091e0500": Node name consul-02 is reserved by node 598fc162-eab2-1d04-5bd9-9d360fad15c3 with name consul-02 (192.168.1.11)"

I get the first WARN about a hundred times, and then the warnings end with a final one in which the node IDs are reversed...

Please help!

dnephin commented 3 years ago

Thank you for reporting this problem! It sounds like the catalog still has the node registered, even after the leave.

I think you can resolve this by using the /v1/catalog/deregister API endpoint to deregister the node. Once it is removed from the catalog it should be possible to register it again. For the payload, specifying just the node should work: {"Node": "consul-02"}.
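Something along these lines should do it with curl (just a sketch, assuming the default HTTP API address 127.0.0.1:8500 and the node name from your logs; adjust both as needed):

    # Deregister the stale node from the catalog; the datacenter defaults to the
    # local agent's datacenter when omitted from the payload.
    curl --request PUT \
        --data '{"Node": "consul-02"}' \
        http://127.0.0.1:8500/v1/catalog/deregister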

If there is still an agent running with this node name it will re-register itself, so I would first check that all the nodes have unique names.
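One quick way to check is to compare the live agents against the catalog, for example (again a sketch against the default API address):

    # Live agents and their names/addresses (look for duplicate names)
    consul members

    # Catalog view, including node IDs (shows which ID currently holds each name)
    curl http://127.0.0.1:8500/v1/catalog/nodes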

pjbakker commented 3 years ago

Thanks. That worked.

All nodes indeed have unique names.

DanielYWoo commented 1 year ago

I run Consul in K8s, where the deployments are destroyed and re-synced from a git repo and pods are rescheduled frequently. I see this kind of error in all my clusters. Does anybody know how to fix it automatically, without human intervention?