
[bug] Health checks extremely delayed on service registration #2970


sheldonkwok commented 7 years ago

consul version for both Client and Server

Client: 0.7.5 Server: 0.8.1

consul info for both Client and Server

Client:

agent:
        check_monitors = 0
        check_ttls = 0
        checks = 13
        services = 15
build:
        prerelease =
        revision = '21f2d5a
        version = 0.7.5
consul:
        known_servers = 5
        server = false
runtime:
        arch = amd64
        cpu_count = 8
        goroutines = 66
        max_procs = 8
        os = linux
        version = go1.7.5
serf_lan:
        encrypted = false
        event_queue = 0
        event_time = 340
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 228
        member_time = 17267
        members = 244
        query_queue = 0
        query_time = 719

Server:

agent:
        check_monitors = 0
        check_ttls = 0
        checks = 3
        services = 4
build:
        prerelease =
        revision = 'e9ca44d
        version = 0.8.1
consul:
        bootstrap = false
        known_datacenters = 1
        leader = false
        leader_addr = 10.202.4.6:8300
        server = true
raft:
        applied_index = 15218042
        commit_index = 15218042
        fsm_pending = 0
        last_contact = 54.379989ms
        last_log_index = 15218042
        last_log_term = 12093
        last_snapshot_index = 15211474
        last_snapshot_term = 12093
        latest_configuration = [{Suffrage:Voter ID:10.0.0.7:8300 Address:10.0.0.7:8300} {Suffrage:Voter ID:10.0.4.6:8300 Address:10.0.4.6:8300} {Suffrage:Voter ID:10.0.5.6:8300 Address:10.0.5.6:8300} {Suffrage:Voter ID:10.0.0.8:8300 Address:10.0.0.8:8300} {Suffrage:Voter ID:10.0.0.6:8300 Address:10.0.0.6:8300}]
        latest_configuration_index = 15178657
        num_peers = 4
        protocol_version = 2
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 12093
runtime:
        arch = amd64
        cpu_count = 4
        goroutines = 210
        max_procs = 4
        os = linux
        version = go1.8.1
serf_lan:
        encrypted = false
        event_queue = 0
        event_time = 340
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 101
        member_time = 17267
        members = 117
        query_queue = 0
        query_time = 719
serf_wan:
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 6
        members = 5
        query_queue = 0
        query_time = 1

Operating system and Environment details

Linux 4.4.0-75-generic #96-Ubuntu SMP Thu Apr 20 09:56:33 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Description of the Issue (and unexpected/desired result)

We are currently migrating to the new Consul version (0.7.x to 0.8.x) and are experiencing issues registering health checks with Nomad. The jobs initially register with only the Serf health check, not the HTTP check that is specified. About five minutes later, the HTTP health checks finally register. Normally the additional health checks associated with a service register immediately.
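For reference, these registrations ultimately go through Consul's HTTP agent API. A minimal sketch of registering a service together with an HTTP check against a local agent (the service name, port, and check URL are hypothetical placeholders, not taken from this report):

    # Register a service plus an HTTP health check with the local agent
    # (hypothetical name/port/URL, for illustration only).
    curl -X PUT http://127.0.0.1:8500/v1/agent/service/register -d '{
      "Name": "my-service",
      "Port": 8080,
      "Check": {
        "HTTP": "http://127.0.0.1:8080/health",
        "Interval": "10s"
      }
    }'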

The corresponding issue on the Nomad repo: https://github.com/hashicorp/nomad/issues/2595#issuecomment-297826414

Log Fragments or Link to gist

Client logs are filled with:

    2017/04/27 21:13:51 [ERR] consul: RPC failed to server 10.0.0.6:8300: rpc error: rpc error: Unknown check 'd336b21cd221b366f66baea0e29dbb5782b3e060'
    2017/04/27 21:13:51 [ERR] agent: failed to sync changes: rpc error: rpc error: Unknown check 'd336b21cd221b366f66baea0e29dbb5782b3e060'
slackpad commented 7 years ago

Hi @sheldonkwok, do you have ACLs enabled? This looks like it might be a bug where it's trying to verify ACL rights on a check that has already been deleted (https://github.com/hashicorp/consul/blob/v0.8.1/consul/acl.go#L774-L776); that path should probably allow the deregister since the check is already missing. I'm not sure how you'd get into this state, though. While you are in the "stuck" state, can you post a gist with the /v1/agent/checks output from the agent and the /v1/catalog/health/service/ output for the service?
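Concretely, that diagnostic output can be pulled from the agent's HTTP API along these lines (assuming a local agent on the default port; <service-name> is a placeholder, and the health query below uses the standard /v1/health/service endpoint):

    # Checks as the local agent sees them
    curl http://127.0.0.1:8500/v1/agent/checks

    # Health entries for the service as the servers see them
    # (substitute the actual service name)
    curl http://127.0.0.1:8500/v1/health/service/<service-name>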

As a workaround, you could set https://www.consul.io/docs/agent/options.html#acl_enforce_version_8 to false to bypass this particular check while we diagnose and fix the issue.
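In agent-configuration terms that workaround is a single setting; a minimal sketch, assuming the agent loads JSON config files from a directory such as /etc/consul.d (the file path is hypothetical):

    # Disable version-8 ACL enforcement as suggested above, then
    # restart or reload the agent (hypothetical config path).
    cat > /etc/consul.d/acl-workaround.json <<'EOF'
    {
      "acl_enforce_version_8": false
    }
    EOF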

sheldonkwok commented 7 years ago

Hi @slackpad, we do have ACLs enabled, but the default policy was set to allow while we migrated. We will add the workaround for the old servers before we start the migration again, and will post more info as the bug comes up again. Thanks!