hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Unable to remove/leave Agent with unreachable Node-Address #3261

Open iceman91176 opened 7 years ago

iceman91176 commented 7 years ago

consul version for both Client and Server

Client: 0.8.3 Server: 0.8.3

consul info for both Client and Server

Client:

agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease =
        revision = ea2a82b
        version = 0.8.3
consul:
        known_servers = 3
        server = false
runtime:
        arch = amd64
        cpu_count = 32
        goroutines = 117
        max_procs = 32
        os = linux
        version = go1.8.1
serf_lan:
        encrypted = true
        event_queue = 0
        event_time = 219
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 223
        members = 11
        query_queue = 0
        query_time = 18

Server:

agent:
        check_monitors = 0
        check_ttls = 0
        checks = 1
        services = 2
build:
        prerelease =
        revision = ea2a82b
        version = 0.8.3
consul:
        bootstrap = false
        known_datacenters = 1
        leader = true
        leader_addr = [2a00:1f08:900:1::6]:8300
        server = true
raft:
        applied_index = 464656
        commit_index = 464656
        fsm_pending = 0
        last_contact = 0
        last_log_index = 464656
        last_log_term = 4967
        last_snapshot_index = 464324
        last_snapshot_term = 4966
        latest_configuration = [{Suffrage:Voter ID:[2a00:1f08:0900:3:0:0:0:1]:8300 Address:[2a00:1f08:0900:3:0:0:0:1]:8300} {Suffrage:Voter ID:[2a00:1f08:0900:3:0:0:0:2]:8300 Address:[2a00:1f08:0900:3:0:0:0:2]:8300} {Suffrage:Voter ID:[2a00:1f08:900:1::6]:8300 Address:[2a00:1f08:900:1::6]:8300} {Suffrage:Voter ID:[2a00:1f08:900:3::1]:8300 Address:[2a00:1f08:900:3::1]:8300} {Suffrage:Voter ID:[2a00:1f08:900:3::2]:8300 Address:[2a00:1f08:900:3::2]:8300}]
        latest_configuration_index = 193962
        num_peers = 4
        protocol_version = 2
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Leader
        term = 4967
runtime:
        arch = amd64
        cpu_count = 8
        goroutines = 275
        max_procs = 8
        os = linux
        version = go1.8.1
serf_lan:
        encrypted = true
        event_queue = 0
        event_time = 219
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 223
        members = 11
        query_queue = 0
        query_time = 18
serf_wan:
        encrypted = true
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 46
        members = 3
        query_queue = 0
        query_time = 1

Operating system and Environment details

RedHat EL 7.3

Description of the Issue (and unexpected/desired result)

We had to reinstall one of the servers, which was running as a Consul client. We did not leave/force-leave the client first. During installation we configured a wrong (IPv4) bind address for the re-installed agent. Somehow the agent was able to connect to the IPv6 Consul servers anyway. Since then, all of the cluster agents show this wrong cluster member with the state alive. But it isn't alive; all agents complain about not being able to connect to the IPv4 address. We are not able to leave/force-leave that wrong host (the bind address has been corrected in the meantime). When we try a force-leave on some other agent B, the faulty agent is stuck in state leaving on agent B (but not on agent C, for example, where it is still alive) until agent B is restarted. Then the faulty agent goes back to alive on agent B.

As a workaround we assigned a different name/id to the faulty node. It was able to join the cluster without problems.

Since the state of the orphaned agent is 'alive', I don't think it will be reaped automatically. Is there any chance to remove it manually?
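
For reference, the kind of manual removal we are after would be something like the following (a sketch only, assuming the standard Consul CLI and HTTP catalog API; the node and datacenter names are the ones from the memberlist below):

        # tell the serf pool to drop the stale member, addressed by node name
        consul force-leave ceph-dc-1-01-osd-05

        # if the node still lingers in the catalog, deregister it there as well
        curl -X PUT \
            -d '{"Datacenter": "dc-witcom-cloud", "Node": "ceph-dc-1-01-osd-05"}' \
            http://127.0.0.1:8500/v1/catalog/deregister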

Reproduction steps

Log Fragments or Link to gist

Memberlist (the first entry is wrong and definitely not alive):

ceph-dc-1-01-osd-05          10.60.0.36:8301               alive  client  0.8.3  2  dc-witcom-cloud
ceph-dc-1-01-osd-05-temp-02  [2aXY:1f08:900:1::ZZZ]:8301   alive  client  0.8.3  2  dc-witcom-cloud

Log from the same host:

2017/07/11 23:04:54 [ERR] memberlist: Failed to send ping: write udp [2aXY:1f08:900:1::ZZY]:8301->10.60.0.36:8301: sendto: network is unreachable
2017/07/11 23:04:58 [ERR] memberlist: Push/Pull with ceph-dc-1-01-osd-05 failed: dial tcp 10.60.0.36:8301: getsockopt: connection refused
2017/07/11 23:05:03 [ERR] memberlist: Failed to send ping: write udp [2aXY:1f08:900:1::ZZY]:8301->10.60.0.36:8301: sendto: network is unreachable

slackpad commented 7 years ago

Hi @iceman91176, when you do the force-leave, are you doing it by name? A common mistake is to use the address there, but it requires the node name, ceph-dc-1-01-osd-05. That should work to kick it. I suspect that the errors are happening so high up that memberlist isn't registering this as a failed node, so since no node in the cluster can probe it, it never gets marked as suspect/failed, which would be a bug.
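
A minimal sketch of the distinction being described, assuming the standard force-leave command and agent HTTP endpoint (shown only as an illustration):

        # wrong: force-leave does not accept an address
        consul force-leave 10.60.0.36

        # right: pass the node name exactly as it appears in consul members
        consul force-leave ceph-dc-1-01-osd-05

        # the same operation through the agent HTTP API on any agent
        curl -X PUT http://127.0.0.1:8500/v1/agent/force-leave/ceph-dc-1-01-osd-05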