
Automated deregistration for dead nodes from catalog #14874

Open jawnsy opened 1 year ago

jawnsy commented 1 year ago

Is your feature request related to a problem? Please describe.

When saving a snapshot (consul snapshot save /tmp/abc.snap) and restoring it into a different Kubernetes cluster (consul snapshot restore /tmp/abc.snap), Consul keeps catalog entries for the dead nodes from the original cluster, even though the active member list is updated appropriately.
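
For reference, here is a minimal sketch of the cross-cluster move, assuming kubectl contexts named "old" and "new" point at the two clusters and the server pods are named as in the output below (adjust pod names and namespaces to your installation; kubectl cp also requires tar in the pod image):

$ kubectl --context old exec consul-consul-server-0 -- consul snapshot save /tmp/abc.snap
$ kubectl --context old cp consul-consul-server-0:/tmp/abc.snap ./abc.snap
$ kubectl --context new cp ./abc.snap consul-consul-server-0:/tmp/abc.snap
$ kubectl --context new exec consul-consul-server-0 -- consul snapshot restore /tmp/abc.snap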

For example, I have some nodes in a 10.62.xx subnet:

{"@level":"warn","@message":"EnsureRegistration failed","@module":"agent.fsm","@timestamp":"2022-10-01T23:02:02.273498Z","error":"failed inserting node: Error while renaming Node ID: \"13d052e0-8176-8bb8-482e-24271958b68f\": Node name consul-consul-server-2 is reserved by node a710f143-23d5-678c-a487-b9db6ea7f98e with name consul-consul-server-2 (10.62.1.53)"}

The catalog shows these nodes:

$ consul catalog nodes
Node                                             ID        Address       DC
consul-consul-server-0                           386fa727  10.62.1.51    dc1
consul-consul-server-1                           71e732a7  10.62.1.52    dc1
consul-consul-server-2                           a710f143  10.62.1.53    dc1

However, the member list does not contain those addresses, because we moved the Consul installation from a Kubernetes cluster running in the 10.62 subnet to one in the 10.12 subnet:

$ consul members
Node                                             Address            Status  Type    Build   Protocol  DC   Partition  Segment
consul-consul-server-0                           10.12.48.48:8301   alive   server  1.13.2  2         dc1  default    <all>
consul-consul-server-1                           10.12.48.157:8301  alive   server  1.13.2  2         dc1  default    <all>
consul-consul-server-2                           10.12.49.37:8301   alive   server  1.13.2  2         dc1  default    <all>

The solution to the above error messages is to manually deregister the stale nodes from the catalog, which appears to be possible only through the REST API, not the consul command:

$ curl --request PUT --data '{"Node":"consul-consul-server-0"}' -v http://localhost:8500/v1/catalog/deregister
$ curl --request PUT --data '{"Node":"consul-consul-server-1"}' -v http://localhost:8500/v1/catalog/deregister
$ curl --request PUT --data '{"Node":"consul-consul-server-2"}' -v http://localhost:8500/v1/catalog/deregister
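
Until deregistration is automated, this can be scripted. Below is a rough sketch (not an official workflow) that deregisters every catalog node whose address no longer appears in the member list; it assumes bash and jq are available, the agent is reachable on localhost:8500, and ACLs permit catalog writes:

$ members=$(consul members | awk 'NR>1 {split($2,a,":"); print a[1]}')
$ curl -s http://localhost:8500/v1/catalog/nodes \
    | jq -r '.[] | "\(.Node) \(.Address)"' \
    | while read -r node addr; do
        # Skip entries whose address still matches a live member;
        # anything else is a stale entry left over from the old cluster.
        if ! grep -qx "$addr" <<< "$members"; then
          echo "deregistering stale catalog entry $node ($addr)"
          curl -s --request PUT --data "{\"Node\":\"$node\"}" \
            http://localhost:8500/v1/catalog/deregister
        fi
      done

Once the stale entries are removed, the live agents re-register themselves through anti-entropy, which matches the behavior shown below.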

After the deregistration is complete, the new nodes appear in the catalog and the log messages stop:

$ consul catalog nodes
Node                                          ID        Address       DC
consul-consul-server-0                        db1e1511  10.12.33.6    dc1
consul-consul-server-1                        6aa338cf  10.12.33.131  dc1
consul-consul-server-2                        13d052e0  10.12.32.6    dc1

Feature Description

Essentially, the problem is that snapshots appear to contain the node catalog, so restoring a snapshot into a different cluster leaves stale node entries behind and produces the error messages shown above.

There may be several different approaches to solving this.

Use Case(s)

Anyone moving a Consul installation (this one backs a Vault installation) or using the snapshot save/restore capability for backups would be affected by this problem.

Contributions

No

jawnsy commented 1 year ago

Possibly a duplicate of:

https://github.com/hashicorp/consul/issues/9939
https://github.com/hashicorp/consul-k8s/issues/1266
https://github.com/hashicorp/consul-k8s/issues/319

t-eckert commented 1 year ago

Hi @jawnsy, thank you for opening your first issue on the Consul on Kubernetes repo, and for the detail you included. I think this is a very sensible feature request.

Because the solution will likely be implemented at a Consul level and not a Consul on Kubernetes level, I am going to transfer the issue to the Consul repository so they can have a look at it.