hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Duplicate Consul Service after IP address change #19553

Open eshcheglov opened 10 months ago

eshcheglov commented 10 months ago

Nomad version

Nomad v1.6.1
BuildDate 2023-07-21T13:49:42Z
Revision 515895c7690cdc72278018dc5dc58aca41204ccc

Operating system and Environment details

Ubuntu 20.04 aarch64

Issue

If the IP address of the node where Nomad is installed changes, Nomad does not remove its old registration from Consul. That is, where you previously had one 'Nomad' service instance in Consul, after an IP change there will be two; change the address again and there will be three.
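For anyone triaging the same behavior, here is a minimal sketch for listing the registered `nomad` service instances and manually deregistering a stale one via Consul's HTTP API (assuming the default agent address 127.0.0.1:8500 and `jq` installed; the `<service-id>` placeholder is whatever ID the first command prints):

```
# List every registered instance of the "nomad" service with its ID and
# address, to spot entries that still point at the old IP.
curl -s http://127.0.0.1:8500/v1/catalog/service/nomad \
  | jq -r '.[] | "\(.ServiceID) \(.ServiceAddress // .Address)"'

# Manually deregister a stale instance on the agent that owns it.
curl -s -X PUT "http://127.0.0.1:8500/v1/agent/service/deregister/<service-id>"
```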

Reproduction steps

1. Start Nomad on a node that gets address A.A.A.A; Nomad registers itself in Consul with that address.
2. Shut the node down, connect it through a different Ethernet port so it gets a new address B.B.B.B, and start it again.
3. Nomad registers itself in Consul with B.B.B.B, but the old A.A.A.A registration remains and still shows "All checks passing".

Nomad Server logs

The logs differ between the one-node and cluster setups. In the one-node case, Nomad's autopilot tries to remove the outdated data but cannot. In the multi-node setup, there are no errors at all and no information about autopilot activity.
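(Side note: the reaping autopilot attempts here is governed by its dead-server cleanup setting. A quick sketch for inspecting it with the stock operator commands, in case it was disabled; `-cleanup-dead-servers` defaults to true:)

```
# Show the current autopilot configuration, including CleanupDeadServers.
nomad operator autopilot get-config

# Re-enable dead-server cleanup if it was turned off.
nomad operator autopilot set-config -cleanup-dead-servers=true
```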

One-node Nomad logs from journalctl:

```
-- Logs begin at Thu 2023-12-21 13:49:18 UTC, end at Fri 2023-12-22 13:19:16 UTC. --
Dec 22 13:14:25 cbw-dx4-2450-beta08-0011 nomad[9318]: 2023-12-22T13:14:25.689Z [ERROR] nomad.autopilot: Failed to reconcile current state with the desired state
Dec 22 13:14:35 cbw-dx4-2450-beta08-0011 nomad[9318]: 2023-12-22T13:14:35.690Z [ERROR] nomad.autopilot: Failed to reconcile current state with the desired state
Dec 22 13:14:45 cbw-dx4-2450-beta08-0011 nomad[9318]: 2023-12-22T13:14:45.690Z [ERROR] nomad.autopilot: Failed to reconcile current state with the desired state
Dec 22 13:14:55 cbw-dx4-2450-beta08-0011 nomad[9318]: 2023-12-22T13:14:55.690Z [ERROR] nomad.autopilot: Failed to reconcile current state with the desired state
Dec 22 13:15:05 cbw-dx4-2450-beta08-0011 nomad[9318]: 2023-12-22T13:15:05.690Z [ERROR] nomad.autopilot: Failed to reconcile current state with the desired state
Dec 22 13:15:15 cbw-dx4-2450-beta08-0011 nomad[9318]: 2023-12-22T13:15:15.690Z [ERROR] nomad.autopilot: Failed to reconcile current state with the desired state
Dec 22 13:15:16 cbw-dx4-2450-beta08-0011 nomad[9318]: 2023-12-22T13:15:16.929Z [ERROR] nomad: failed to reconcile member: member="{cbw-dx4-2450-beta08-0011.global 192.168.88.245 4648 map[bootstrap:1 build:1.6.1 dc:cbw-dx4-2450-beta08-0011 expect:1 id:fa9c3934-0917-8bbf-3120-804de3ee560d port:4647 raft_vsn:3 region:global revision:515895c7690cdc72278018dc5dc58aca41204ccc role:nomad rpc_addr:192.168.88.245 vsn:1] alive 1 5 2 2 5 4}" error="error removing server with duplicate ID \"fa9c3934-0917-8bbf-3120-804de3ee560d\": need at least one voter in configuration: {[]}"
Dec 22 13:15:16 cbw-dx4-2450-beta08-0011 nomad[9318]: 2023-12-22T13:15:16.930Z [ERROR] nomad: failed to reconcile: error="error removing server with duplicate ID \"fa9c3934-0917-8bbf-3120-804de3ee560d\": need at least one voter in configuration: {[]}"
[... the same pattern repeats: the nomad.autopilot error every 10 seconds, and the two "failed to reconcile" errors once a minute, through the end of the log at 13:19:16 ...]
```
One-node peer list:

```
# nomad operator raft list-peers
Node       ID                                    Address              State     Voter  RaftProtocol
(unknown)  fa9c3934-0917-8bbf-3120-804de3ee560d  192.168.99.243:4647  follower  true   unknown
```

Cluster peer list:

```
# nomad operator raft list-peers
Node                   ID                                    Address             State     Voter  RaftProtocol
cb-dx2-gamma-1.global  51e7d5f1-9e03-c244-5cd8-ac62b3472273  192.168.0.106:4647  follower  true   3
cb-dx2-gamma-2.global  83c6046d-cf19-2beb-72be-d802d4f8245d  192.168.0.245:4647  leader    true   3
cb-dx2-gamma-3.global  7d935dc2-f422-9994-bc12-9881657a3569  192.168.0.213:4647  follower  true   3
```

One-node server members:

```
# nomad server members
Name                             Address         Port  Status  Leader  Raft Version  Build  Datacenter                Region
cbw-dx4-2450-beta08-0011.global  192.168.88.245  4648  alive   true    3             1.6.1  cbw-dx4-2450-beta08-0011  global
```

Cluster server members:

```
# nomad server members
Name                   Address        Port  Status  Leader  Raft Version  Build  Datacenter     Region
cb-dx2-gamma-1.global  192.168.0.106  4648  alive   false   3             1.6.1  gamma_cluster  global
cb-dx2-gamma-2.global  192.168.0.245  4648  alive   true    3             1.6.1  gamma_cluster  global
cb-dx2-gamma-3.global  192.168.0.213  4648  alive   false   3             1.6.1  gamma_cluster  global
```
Consul screenshot from Cluster setup ![Screenshot from 2023-12-22 15-28-44](https://github.com/hashicorp/nomad/assets/10085736/f3fdeeb5-abee-47b3-b33d-cc341cbb9e29)
Consul screenshot from One-node setup ![Screenshot from 2023-12-22 15-32-38](https://github.com/hashicorp/nomad/assets/10085736/8e5b33ac-6557-41e4-b381-f93fd163c507)

Probably the same issue: https://discuss.hashicorp.com/t/how-to-recover-from-error-removing-server-with-duplicate-id/43649/5
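For reference, that thread ends up at Nomad's documented outage-recovery procedure: stop the server and drop a peers.json into the Raft directory so it restarts with the correct address. A minimal sketch for the one-node setup here, reusing the server ID and current address from the outputs above and the data_dir from the config later in this thread (verify all three before trying it):

```
# Stop the Nomad server first.
sudo systemctl stop nomad

# Write a peers.json naming the server's *current* address; Nomad ingests it
# on startup and rebuilds the Raft configuration from it.
sudo tee /var/lib/nomad/server/raft/peers.json <<'EOF'
[
  {
    "id": "fa9c3934-0917-8bbf-3120-804de3ee560d",
    "address": "192.168.88.245:4647",
    "non_voter": false
  }
]
EOF

sudo systemctl start nomad
```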

jrasell commented 9 months ago

Hi @eshcheglov and thanks for raising this issue. Could you explain a little about any steps you take when you change the IP related to Nomad, and whether you're referring to Nomad agents running in client or server mode?

In both screenshots you included of the Consul UI, the services are passing and healthy, which indicates the processes are still responding to check requests. Do these health checks ever fail?

eshcheglov commented 9 months ago

> Hi @eshcheglov and thanks for raising this issue. Could you explain a little about any steps you take when you change the IP related to Nomad, and whether you're referring to Nomad agents running in client or server mode?

Hi! I turn on the node; it gets the address A.A.A.A on the eth0 port. After starting, Nomad registers itself in Consul with this (A.A.A.A) address. I turn off the computer, change the Ethernet port, and turn it on again; it now gets the address B.B.B.B on the eth1 port. After starting, Nomad again registers itself in Consul with the new address. However, the old A.A.A.A registration remains in Consul marked 'All checks passing' even though it is no longer accessible. As a result, I see several Nomads in Consul: one with the address A.A.A.A, another with the address B.B.B.B.

> Do these health checks ever fail?

No. Address A.A.A.A is inaccessible and I don't know why the check is green, but I have kept the node running for 15-30 minutes and the check is still green.
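In case it helps triage, here is a sketch for dumping which Consul node owns each registered `nomad` instance and the status of its checks (assumes the default agent address 127.0.0.1:8500 and `jq`):

```
# Print the owning Consul node, service ID, advertised address:port, and the
# status of every check for each registered "nomad" instance.
curl -s http://127.0.0.1:8500/v1/health/service/nomad \
  | jq -r '.[] | "\(.Node.Node) \(.Service.ID) \(.Service.Address):\(.Service.Port) \([.Checks[].Status] | join(","))"'
```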

All Nomad nodes run in both client and server mode. The Nomad config file (for the one-node setup) is attached to this message.

nomad.hcl:

```
data_dir  = "/var/lib/nomad"
bind_addr = "0.0.0.0"
log_level = "INFO"

server {
  enabled          = true
  bootstrap_expect = 1
  job_gc_threshold = "730h"
}

client {
  enabled = true
}

consul {
  address          = "127.0.0.1:8500"
  server_auto_join = true
}

ui {
  enabled = true
  consul {
    ui_url = "http://:8500/ui"
  }
}

plugin "docker" {
  config {
    allow_privileged = true
    volumes {
      enabled = true
    }
    allow_caps = ["all", "NET_RAW"]
    auth {
      # Nomad will prepend "docker-credential-" to the helper value
      helper = "ecr-login"
    }
    gc {
      image       = "true"
      image_delay = "744h"
    }
  }
}
```
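One mitigation worth experimenting with (my assumption, not something confirmed in this thread): with `bind_addr = "0.0.0.0"`, Nomad advertises the first private IP it detects, so pinning the advertise addresses to a specific interface may keep the registration stable across reboots. Nomad's advertise addresses accept go-sockaddr template syntax; a sketch that appends such a block to the config (the `/etc/nomad.d/nomad.hcl` path and `eth1` interface are illustrative):

```
# Hypothetical addition: always advertise the IP of a fixed interface
# instead of whichever address is picked up at boot.
sudo tee -a /etc/nomad.d/nomad.hcl <<'EOF'

advertise {
  http = "{{ GetInterfaceIP \"eth1\" }}"
  rpc  = "{{ GetInterfaceIP \"eth1\" }}"
  serf = "{{ GetInterfaceIP \"eth1\" }}"
}
EOF
```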