pjbakker opened 3 years ago
It's even weirder than I thought. At the end of the log messages I get a reversed warning:
```
2021-03-28T12:44:14.450Z [WARN] agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "598fc162-eab2-1d04-5bd9-9d360fad15c3": Node name consul-02 is reserved by node cdbe853e-fe92-5668-a995-2ebe091e0500 with name consul-02 (192.168.1.11)"
2021-03-28T12:44:14.614Z [WARN] agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "cdbe853e-fe92-5668-a995-2ebe091e0500": Node name consul-02 is reserved by node 598fc162-eab2-1d04-5bd9-9d360fad15c3 with name consul-02 (192.168.1.11)"
```
I get the first WARN like a hundred times, and then the warnings end with this last one where the node IDs are reversed.
Please help!
Thank you for reporting this! It sounds like the problem is that the catalog still has the node registered, even after the leave.
I think to resolve this you can use the /v1/catalog/deregister API endpoint to deregister the node. Once it is removed from the catalog it should be possible to register it again. For the payload, specifying just the node should work: {"Node": "consul-02"}.
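For reference, a minimal sketch of that call with curl, assuming the default agent HTTP address 127.0.0.1:8500 and no ACL token (ACLs are disabled in this cluster per the consul info output below); adjust as needed:
```
# Remove the stale node entry from the catalog. Deregistering with only
# "Node" set removes the node along with all of its services and checks.
curl --request PUT \
  --data '{"Node": "consul-02"}' \
  http://127.0.0.1:8500/v1/catalog/deregister
```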
If there is still an agent running with this node name, it will re-register itself, so I would first check that all the nodes have unique names.
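A quick way to check for duplicates (a sketch, assuming CLI access to one of the live agents):
```
# List cluster members as seen by serf; duplicate node names show up here.
consul members

# Show catalog nodes together with their node IDs, which makes it easy to
# spot two IDs competing for the same name.
consul catalog nodes -detailed
```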
Thanks. That worked.
All nodes indeed have unique names.
I run Consul in K8s, where clusters are destroyed and re-synced from a git repo, and pods are rescheduled frequently. I see this kind of error in all my clusters. Does anybody know how to fix it automatically, without human intervention?
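One possible direction, building only on the /v1/catalog/deregister endpoint mentioned above, would be a periodic cleanup job. This is just a sketch, not an endorsed fix: the selection rule (deregister any node whose serfHealth check is critical) and the CONSUL_ADDR variable are my assumptions.
```
#!/bin/sh
# Rough sketch of automated cleanup: deregister catalog entries for nodes
# whose serfHealth check is critical. A node that is merely restarting will
# re-register itself, so run this with care.
CONSUL_ADDR="${CONSUL_ADDR:-http://127.0.0.1:8500}"

# /v1/health/state/critical lists all checks currently in the critical state.
curl -s "$CONSUL_ADDR/v1/health/state/critical" |
  jq -r '.[] | select(.CheckID == "serfHealth") | .Node' |
  while read -r node; do
    curl -s --request PUT \
      --data "{\"Node\": \"$node\"}" \
      "$CONSUL_ADDR/v1/catalog/deregister"
  done
```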
Overview of the Issue
I have a 3-node Consul cluster with Consul 1.9.4, revision 10bb6cb3b. All nodes come up, but one node (hostname: consul-server-02) is spawning EnsureRegistration warnings.
The HashiCorp Consul-Errors-And-Warnings article tells us we should just run `consul force-leave consul-server-02` on one of the other live nodes and then rejoin. Running force-leave and then restarting consul-server-02 does not allow it to update the node ID for node name consul-server-02.
Expected behaviour: a clean update of the consul-server-02 node name to the new node ID.
Reproduction Steps
I'm not sure how to reproduce the current Consul cluster state; I don't think I did anything utterly weird. But at this point:
1. `consul agent -config-dir /etc/consul.d/ -retry-join 192.168.0.10`, resulting in WARN messages as shown below
2. `consul force-leave consul-server-02`
3. `consul agent -config-dir /etc/consul.d/ -retry-join 192.168.0.10`, still resulting in WARN messages as shown below

Consul info for both Server 1 and Server 2
Server 1 info
```
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease =
	revision = 10bb6cb3
	version = 1.9.4
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 192.168.0.10:8300
	server = true
raft:
	applied_index = 46001
	commit_index = 46001
	fsm_pending = 0
	last_contact = 0
	last_log_index = 46001
	last_log_term = 52
	last_snapshot_index = 32781
	last_snapshot_term = 51
	latest_configuration = [{Suffrage:Voter ID:44087325-f32d-acf4-34b5-45aa2ee5bbae Address:192.168.0.12:8300} {Suffrage:Voter ID:89927e54-0654-414f-f0d7-9546655e2f5c Address:192.168.0.10:8300} {Suffrage:Voter ID:598fc162-eab2-1d04-5bd9-9d360fad15c3 Address:192.168.0.11:8300}]
	latest_configuration_index = 0
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 52
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 322
	max_procs = 2
	os = linux
	version = go1.15.8
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 20
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1047
	members = 11
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 154
	members = 3
	query_queue = 0
	query_time = 1
```
Server 2 info
```
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease =
	revision = 10bb6cb3
	version = 1.9.4
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = 192.168.0.10:8300
	server = true
raft:
	applied_index = 45988
	commit_index = 45988
	fsm_pending = 87
	last_contact = 69.558881ms
	last_log_index = 45988
	last_log_term = 52
	last_snapshot_index = 32785
	last_snapshot_term = 51
	latest_configuration = [{Suffrage:Voter ID:44087325-f32d-acf4-34b5-45aa2ee5bbae Address:192.168.0.12:8300} {Suffrage:Voter ID:89927e54-0654-414f-f0d7-9546655e2f5c Address:192.168.0.10:8300} {Suffrage:Nonvoter ID:598fc162-eab2-1d04-5bd9-9d360fad15c3 Address:192.168.0.11:8300}]
	latest_configuration_index = 0
	num_peers = 0
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 52
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 84
	max_procs = 2
	os = linux
	version = go1.15.8
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 20
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1047
	members = 11
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 154
	members = 3
	query_queue = 0
	query_time = 1
```
Operating system and Environment details
OS: Ubuntu 20.04
Log Fragments
This line is logged many times right after startup (see the EnsureRegistration WARN lines quoted at the top of this thread).