hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul 1.1.0: Retry join LAN fails to find cluster until agent is fully restarted #4189

Closed harakiri406 closed 6 years ago

harakiri406 commented 6 years ago

Overview of the Issue

retry_join appears to be broken since version 1.1.0: the agent never finds the cluster until it is fully restarted.

Reproduction Steps

Steps to reproduce this issue, eg:

The instance is deployed in AWS EC2/Elastic Beanstalk using a self-packaged RPM of Consul, with the config "retry_join": [ "consul.core.domain" ]. consul.core.domain is a round-robin DNS alias for all three Consul servers. All records resolve to a working Consul server.
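To double-check the claim that every record behind the alias points at a server, the addresses can be listed from the affected host. This is a minimal sketch, not part of the original report: the helper name `resolve_join_addresses` is hypothetical, and the default port 8301 is taken from the "LAN: 8301" line in the agent output. It only lists IPv4 addresses and does not verify that a Consul server is actually listening.

```python
import socket


def resolve_join_addresses(host, port=8301):
    """Return the sorted, de-duplicated IPv4 addresses that `host` resolves to.

    `port` defaults to 8301, the Serf LAN port shown in the agent output.
    """
    infos = socket.getaddrinfo(
        host, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is an (ip, port) tuple.
    return sorted({info[4][0] for info in infos})
```

Calling `resolve_join_addresses("consul.core.domain")` on the client should return the three server addresses if the alias resolves as described; a `socket.gaierror` would instead point at name resolution.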

Consul info for both the non-working client (1.1.0) and the working client (1.0.2)

Consul 1.1.0

Output on a non-working node:

[root@ip-172-22-54-171 ~]# /usr/local/bin/consul info
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 3
    services = 3
build:
    prerelease = 
    revision = 5174058f
    version = 1.1.0
consul:
    known_servers = 0
    server = false
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 38
    max_procs = 1
    os = linux
    version = go1.10.1
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1

[root@ip-172-22-54-171 ~]# /usr/local/bin/consul members
Node                                             Address             Status  Type    Build  Protocol  DC   Segment
i-0e7dd7c7fc2acdf57.x.y.z.com  172.22.54.171:8301  alive   client  1.1.0  2         dc1  <default>
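The symptom in the `consul info` output above is `known_servers = 0` with `members = 1`. A small sketch for detecting that state automatically, assuming the indented `key = value` layout shown in this report; the function names `parse_consul_info` and `has_known_servers` are hypothetical helpers, not part of Consul.

```python
def parse_consul_info(text):
    """Parse `consul info` output into {section: {key: value}} (string values)."""
    sections = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        # Section headers are unindented and end with a colon, e.g. "consul:".
        if not line[0].isspace() and line.rstrip().endswith(":"):
            current = line.strip()[:-1]
            sections[current] = {}
        # Entries are indented "key = value" lines; the value may be empty.
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            sections[current][key.strip()] = value.strip()
    return sections


def has_known_servers(info):
    """True when the agent reports at least one known Consul server."""
    return int(info.get("consul", {}).get("known_servers", "0")) > 0
```

Feeding it the captured output of `/usr/local/bin/consul info` gives `has_known_servers(...) == False` on the broken node; lines such as a pasted shell prompt are ignored because they are neither section headers nor indented entries.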

Output of log:

==> Starting Consul agent...
==> Consul agent running!
           Version: 'v1.1.0'
           Node ID: 'ec98a7a9-50ac-8c82-43ba-7887ceed4d81'
         Node name: 'i-0e7dd7c7fc2acdf57.x.y.z.com'
        Datacenter: 'dc1' (Segment: '')
            Server: false (Bootstrap: false)
       Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, DNS: 8600)
      Cluster Addr: 172.22.54.171 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2018/06/04 10:37:10 [WARN] agent: Node name "i-0e7dd7c7fc2acdf57.x.y.z.com" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
    2018/06/04 10:37:10 [INFO] serf: EventMemberJoin: i-0e7dd7c7fc2acdf57.x.y.z.com 172.22.54.171
    2018/06/04 10:37:10 [INFO] agent: Started DNS server 127.0.0.1:8600 (udp)
    2018/06/04 10:37:10 [INFO] agent: Started DNS server 127.0.0.1:8600 (tcp)
    2018/06/04 10:37:10 [INFO] agent: Started HTTP server on 127.0.0.1:8500 (tcp)
    2018/06/04 10:37:10 [INFO] agent: started state syncer
    2018/06/04 10:37:10 [WARN] manager: No servers available
    2018/06/04 10:37:10 [ERR] agent: failed to sync remote state: No known Consul servers
    2018/06/04 10:37:10 [INFO] agent: Caught signal:  hangup
    2018/06/04 10:37:10 [INFO] agent: Reloading configuration...
    2018/06/04 10:37:10 [WARN] agent: Service name "node_exporter" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
    2018/06/04 10:37:10 [WARN] agent: Service name "node_exporter" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
    2018/06/04 10:37:12 [INFO] agent: Caught signal:  hangup
    2018/06/04 10:37:12 [INFO] agent: Reloading configuration...
...

    2018/06/04 10:37:28 [WARN] manager: No servers available
    2018/06/04 10:37:28 [ERR] agent: failed to sync remote state: No known Consul servers
    2018/06/04 10:37:58 [WARN] manager: No servers available
    2018/06/04 10:37:58 [ERR] agent: failed to sync remote state: No known Consul servers
    2018/06/04 10:38:22 [WARN] manager: No servers available
    2018/06/04 10:38:22 [ERR] agent: failed to sync remote state: No known Consul servers
    2018/06/04 10:38:47 [WARN] manager: No servers available
    2018/06/04 10:38:47 [ERR] agent: failed to sync remote state: No known Consul servers
    2018/06/04 10:39:16 [WARN] manager: No servers available
    2018/06/04 10:39:16 [ERR] agent: failed to sync remote state: No known Consul servers
    2018/06/04 10:39:41 [WARN] manager: No servers available
    2018/06/04 10:39:41 [ERR] agent: failed to sync remote state: No known Consul servers
    2018/06/04 10:39:59 [WARN] manager: No servers available
    2018/06/04 10:39:59 [ERR] agent: failed to sync remote state: No known Consul servers
    2018/06/04 10:40:16 [WARN] manager: No servers available
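One visible difference in the log excerpts in this report: the working startups print `agent: (LAN) joining: [consul.core.domain]`, while the excerpt above (which is elided in places) shows no join attempt at all, only repeated "No known Consul servers" errors. The sketch below scans a log for both patterns. It assumes the log line formats exactly as they appear in this issue; `summarize_join` is a hypothetical helper name.

```python
import re

# Matches e.g. "[INFO] agent: (LAN) joining: [consul.core.domain]"
JOIN_ATTEMPT = re.compile(r"\[INFO\] agent: \(LAN\) joining: \[(?P<targets>[^\]]*)\]")
NO_SERVERS = "No known Consul servers"


def summarize_join(log_text):
    """Report whether a LAN join was attempted and count "no servers" errors."""
    attempted = False
    targets = []
    no_server_errors = 0
    for line in log_text.splitlines():
        match = JOIN_ATTEMPT.search(line)
        if match:
            attempted = True
            # Multiple join targets are printed separated by spaces.
            targets.extend(match.group("targets").split())
        if NO_SERVERS in line:
            no_server_errors += 1
    return {
        "join_attempted": attempted,
        "targets": targets,
        "no_server_errors": no_server_errors,
    }
```

A result with `join_attempted` false and a growing `no_server_errors` count matches the failing case shown here; a single early error followed by a join attempt matches the working case.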

After stopping and starting the agent it works fine:

==> Starting Consul agent...
==> Consul agent running!
           Version: 'v1.1.0'
           Node ID: 'ec98a7a9-50ac-8c82-43ba-7887ceed4d81'
         Node name: 'i-0e7dd7c7fc2acdf57.x.y.z.com'
        Datacenter: 'core' (Segment: '')
            Server: false (Bootstrap: false)
       Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, DNS: 8600)
      Cluster Addr: 172.22.54.171 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: true, TLS-Outgoing: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2018/06/04 11:34:09 [WARN] agent: Node name "i-0e7dd7c7fc2acdf57.x.y.z.com" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
    2018/06/04 11:34:09 [INFO] serf: EventMemberJoin: i-0e7dd7c7fc2acdf57.x.y.z.com 172.22.54.171
    2018/06/04 11:34:09 [WARN] agent: Service name "apache_exporter" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
    2018/06/04 11:34:09 [WARN] agent: Service name "mysqld_exporter" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
    2018/06/04 11:34:09 [WARN] agent: Service name "node_exporter" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
    2018/06/04 11:34:09 [INFO] agent: Started DNS server 127.0.0.1:8600 (udp)
    2018/06/04 11:34:09 [INFO] agent: Started DNS server 127.0.0.1:8600 (tcp)
    2018/06/04 11:34:09 [INFO] agent: Started HTTP server on 127.0.0.1:8500 (tcp)
    2018/06/04 11:34:09 [INFO] agent: started state syncer
    2018/06/04 11:34:09 [INFO] agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce os scaleway softlayer triton
    2018/06/04 11:34:09 [INFO] agent: Joining LAN cluster...
    2018/06/04 11:34:09 [INFO] agent: (LAN) joining: [consul.core.domain]
    2018/06/04 11:34:09 [WARN] manager: No servers available
    2018/06/04 11:34:09 [ERR] agent: failed to sync remote state: No known Consul servers
    2018/06/04 11:34:09 [INFO] serf: EventMemberJoin: xxxxx.x.y.z.com 172.22.121.180
    2018/06/04 11:34:09 [INFO] serf: EventMemberJoin: yyyyy.x.y.z.com 172.22.52.19

Consul info:

[root@ip-172-22-54-171 ~]# /usr/local/bin/consul info
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 3
    services = 3
build:
    prerelease = 
    revision = 5174058f
    version = 1.1.0
consul:
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 42
    max_procs = 1
    os = linux
    version = go1.10.1
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 32
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 63896
    members = 188
    query_queue = 0
    query_time = 4

Consul 1.0.2

The Consul agent runs fine in version 1.0.2, so the problem does not seem to be on the server side. Log output from a 1.0.2 agent:

==> Starting Consul agent...
==> Consul agent running!
           Version: 'v1.0.2'
           Node ID: '8c4e849a-44ad-bc92-b5bf-09befdda4522'
         Node name: 'i-0d41ada3a164d6327.x.y.z.com'
        Datacenter: 'core' (Segment: '')
            Server: false (Bootstrap: false)
       Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, DNS: 8600)
      Cluster Addr: 172.22.54.103 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: true, TLS-Outgoing: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2018/06/04 11:40:22 [INFO] serf: EventMemberJoin: i-0d41ada3a164d6327.x.y.z.com 172.22.54.103
    2018/06/04 11:40:22 [WARN] Service name "apache_exporter" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
    2018/06/04 11:40:22 [WARN] agent: check "service:apache_exporter" has the 'script' field, which has been deprecated and replaced with the 'args' field. See https://www.consul.io/docs/agent/checks.html
    2018/06/04 11:40:22 [WARN] Service name "mysqld_exporter" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
    2018/06/04 11:40:22 [WARN] agent: check "service:mysqld_exporter" has the 'script' field, which has been deprecated and replaced with the 'args' field. See https://www.consul.io/docs/agent/checks.html
    2018/06/04 11:40:22 [WARN] Service name "node_exporter" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
    2018/06/04 11:40:22 [WARN] agent: check "service:node_exporter" has the 'script' field, which has been deprecated and replaced with the 'args' field. See https://www.consul.io/docs/agent/checks.html
    2018/06/04 11:40:22 [INFO] agent: Started DNS server 127.0.0.1:8600 (udp)
    2018/06/04 11:40:22 [INFO] agent: Started DNS server 127.0.0.1:8600 (tcp)
    2018/06/04 11:40:22 [INFO] agent: Started HTTP server on 127.0.0.1:8500 (tcp)
    2018/06/04 11:40:22 [INFO] agent: started state syncer
    2018/06/04 11:40:22 [INFO] agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce os scaleway softlayer
    2018/06/04 11:40:22 [INFO] agent: Joining LAN cluster...
    2018/06/04 11:40:22 [INFO] agent: (LAN) joining: [consul.core.domain]
    2018/06/04 11:40:22 [WARN] manager: No servers available
    2018/06/04 11:40:22 [ERR] agent: failed to sync remote state: No known Consul servers
    2018/06/04 11:40:22 [INFO] serf: EventMemberJoin: aaaaa.x.y.z.com 172.22.80.205
    2018/06/04 11:40:22 [INFO] serf: EventMemberJoin: bbbbb.x.y.z.com 172.22.4.143

    ...

    2018/06/04 11:40:22 [INFO] serf: EventMemberJoin: consulserver3.consul--server.infra.core.domain 172.22.4.199
    2018/06/04 11:40:22 [INFO] consul: adding server consulserver3.consul--server.infra.core.domain (Addr: tcp/172.22.4.199:8300) (DC: core)

It does not seem to be a name resolution problem at initial startup, since version 1.1.0 joins fine after breaking and then fixing resolv.conf (with the Consul data dir emptied).

harakiri406 commented 6 years ago

Update: the issue does not seem to be related to version 1.1.0 but rather to the early startup of Consul, so version 1.0.2 has the same problem. Closing this issue for now.