hashicorp/consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul healthcheck surviving long after node destruction #2239

Closed: scarby closed this issue 7 years ago

scarby commented 8 years ago

consul version for both Client and Server

Consul v0.6.3

consul info for both Client and Server

Server:

```
agent:
    check_monitors = 1
    check_ttls = 0
    checks = 3
    services = 3
build:
    prerelease =
    revision = c933efde
    version = 0.6.3
consul:
    bootstrap = false
    known_datacenters = 1
    leader = false
    server = true
raft:
    applied_index = 2800517
    commit_index = 2800517
    fsm_pending = 0
    last_contact = 69.352111ms
    last_log_index = 2800517
    last_log_term = 7862
    last_snapshot_index = 2792744
    last_snapshot_term = 7862
    num_peers = 2
    state = Follower
    term = 7862
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 76
    max_procs = 2
    os = linux
    version = go1.5.3
serf_lan:
    encrypted = false
    event_queue = 0
    event_time = 147
    failed = 0
    intent_queue = 0
    left = 0
    member_time = 4365
    members = 26
    query_queue = 0
    query_time = 1
serf_wan:
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1
```

Operating system and Environment details

Red Hat Enterprise Linux Server release 6.7 (Santiago)

Description of the Issue (and unexpected/desired result)

We have a node that joined the Consul cluster on the 19th of July and was terminated more than a week ago. It disappeared at some point but never appeared to cleanly leave the cluster: it remains in the catalog with one service registered against it and a critical health check against that service (the serf health check, for example, has disappeared).

In this case, the dead node shows as:

```json
{
    "Node": {
        "Node": "openif-i-5c98cbd0.ci.mot.aws.dvsa",
        "Address": "10.80.30.91",
        "CreateIndex": 2793570,
        "ModifyIndex": 2793570
    },
    "Services": {
        "haproxy_exporter": {
            "ID": "haproxy_exporter",
            "Service": "haproxy_exporter",
            "Tags": [
                "nodetype:openif"
            ],
            "Address": "",
            "Port": 9101,
            "EnableTagOverride": false,
            "CreateIndex": 2793570,
            "ModifyIndex": 2793570
        }
    }
}
```

A healthy node that has not disappeared, by comparison, shows as:

```json
{
    "Node": {
        "Node": "openif-i-71200dfd.ci.mot.aws.dvsa",
        "Address": "10.80.30.203",
        "CreateIndex": 2798909,
        "ModifyIndex": 2800917
    },
    "Services": {
        "haproxy_exporter": {
            "ID": "haproxy_exporter",
            "Service": "haproxy_exporter",
            "Tags": [
                "nodetype:openif"
            ],
            "Address": "",
            "Port": 9101,
            "EnableTagOverride": false,
            "CreateIndex": 2798910,
            "ModifyIndex": 2800691
        },
        "node_exporter": {
            "ID": "node_exporter",
            "Service": "node_exporter",
            "Tags": [
                "nodetype:openif"
            ],
            "Address": "",
            "Port": 9100,
            "EnableTagOverride": false,
            "CreateIndex": 2798911,
            "ModifyIndex": 2800917
        },
        "open-interface": {
            "ID": "open-interface",
            "Service": "open-interface",
            "Tags": [],
            "Address": "",
            "Port": 8090,
            "EnableTagOverride": false,
            "CreateIndex": 2798912,
            "ModifyIndex": 2800883
        }
    }
}
```
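
For reference, output of this shape can be pulled from the node catalog and node health HTTP endpoints; the sketch below assumes a local agent on the default port 8500 and uses the dead node's name from above (the health query is where the lone critical check would show up):

```sh
# Catalog entry for the dead node (same shape as the JSON above).
curl -s http://localhost:8500/v1/catalog/node/openif-i-5c98cbd0.ci.mot.aws.dvsa

# Health checks still registered against it; the critical check for
# haproxy_exporter should be listed here, with no serf health check.
curl -s http://localhost:8500/v1/health/node/openif-i-5c98cbd0.ci.mot.aws.dvsa
```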

I'm not entirely certain what caused this. The only log message that references the affected node is:

```
2016/07/19 06:26:14 [INFO] serf: EventMemberJoin: openif-i-5c98cbd0.ci.mot.aws.dvsa 10.80.30.91
```

slackpad commented 8 years ago

Hi @scarby, this one is odd. Does that node show up in `consul members` output? If so, you can probably do a `consul force-leave` to kick it, which should clean up these registrations as well.
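
A minimal sketch of that cleanup path, assuming the commands are run against a local agent and using the node name from this issue (the datacenter name in the last step is an assumption):

```sh
# Check whether the dead node is still known to the serf member list.
consul members | grep openif-i-5c98cbd0

# If it is listed (for example as "failed"), force it out; this should
# also clean up the stale catalog registrations once it propagates.
consul force-leave openif-i-5c98cbd0.ci.mot.aws.dvsa

# If the node no longer appears in the member list, the stale catalog
# entry can be dropped directly (datacenter "dc1" is an assumption here).
curl -s -X PUT -d '{"Datacenter": "dc1", "Node": "openif-i-5c98cbd0.ci.mot.aws.dvsa"}' \
    http://localhost:8500/v1/catalog/deregister
```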

slackpad commented 7 years ago

Never heard back, so closing this out. Please let us know if you are still having issues.