hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Datacenter members are unable to join the WAN pool #7477

Closed mohito83 closed 3 years ago

mohito83 commented 4 years ago

consul version: 0.8.4

Overview of the Issue

We have 3 datacenters, but members from only 2 of them form the WAN gossip pool during the bootstrap stage.

Reproduction Steps

To reproduce this issue, start Consul as a server with the following commands.

DC1: consul agent -server -bootstrap-expect 3 --data-dir /opt/data/drconsul --config-dir /opt/web-app/etc/drconsul -client 0.0.0.0 -bind 172.31.7.136 -retry-join 172.31.7.138 172.31.7.137 172.31.7.136 -retry-join-wan 172.31.7.112 172.31.7.191 172.31.7.190 172.31.7.189

DC2: consul agent -server -bootstrap-expect 3 --data-dir /opt/data/drconsul --config-dir /opt/web-app/etc/drconsul -client 0.0.0.0 -bind 172.31.7.189 -retry-join 172.31.7.191 172.31.7.190 172.31.7.189 -retry-join-wan 172.31.7.137 172.31.7.138 172.31.7.136 172.31.7.112

DC3: consul agent -server -bootstrap-expect 1 --data-dir /opt/data/drconsul --config-dir /opt/web-app/etc/drconsul -client 0.0.0.0 -bind 172.31.7.112 -retry-join 172.31.7.112 -retry-join-wan 172.31.7.189 172.31.7.190 172.31.7.191 172.31.7.136 172.31.7.137 172.31.7.138
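
Consul's documentation describes -retry-join and -retry-join-wan as flags that can be repeated, each taking one address, and the logs below show each agent attempting its LAN join against only the first listed address (172.31.7.138 in DC1, 172.31.7.191 in DC2). The following is a minimal sketch of the repeated-flag form of the DC1 command. repeat_flag is a hypothetical helper, and whether this form changes the behavior reported here has not been verified in this thread. The script only prints the command rather than running it.

```shell
# Expand a list of addresses into repeated "<flag> <addr>" pairs.
# repeat_flag is a hypothetical helper, not part of Consul.
repeat_flag() {
  flag=$1
  shift
  out=""
  for addr in "$@"; do
    out="$out $flag $addr"
  done
  # Drop the leading space added by the first iteration.
  printf '%s\n' "${out# }"
}

# Addresses copied from the DC1 command above.
lan_flags=$(repeat_flag -retry-join 172.31.7.138 172.31.7.137 172.31.7.136)
wan_flags=$(repeat_flag -retry-join-wan 172.31.7.112 172.31.7.191 172.31.7.190 172.31.7.189)

# Print the resulting command for review; the unquoted variables are
# intentional so that each flag and address becomes a separate word.
echo consul agent -server -bootstrap-expect 3 \
  --data-dir /opt/data/drconsul --config-dir /opt/web-app/etc/drconsul \
  -client 0.0.0.0 -bind 172.31.7.136 $lan_flags $wan_flags
```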

Sample configs.json file from one of the member nodes:

{
    "addresses": {
        "https": "172.31.7.136"
    },
    "bind_addr": "172.31.7.136",
    "data_dir": "/opt/data/drconsul",
    "datacenter": "DC1",
    "log_level": "INFO",
    "node_name": "vManage-Viptela-DC1-1",
    "ports": {
        "dns": 18600,
        "http": 18500,
        "https": 18501,
        "serf_lan": 18301,
        "serf_wan": 18302,
        "server": 18300
    },
    "server": true
}

Consul info for both Client and Server

Client info
```
output from client 'consul info' command here
```
Server info
```
agent:
    check_monitors = 5
    check_ttls = 0
    checks = 10
    services = 6
build:
    prerelease =
    revision = f436077
    version = 0.8.4
consul:
    bootstrap = false
    known_datacenters = 3
    leader = false
    leader_addr = 172.31.7.138:18300
    server = true
raft:
    applied_index = 68208
    commit_index = 68208
    fsm_pending = 0
    last_contact = 37.953293ms
    last_log_index = 68208
    last_log_term = 57
    last_snapshot_index = 65537
    last_snapshot_term = 57
    latest_configuration = [{Suffrage:Voter ID:172.31.7.138:18300 Address:172.31.7.138:18300} {Suffrage:Voter ID:172.31.7.137:18300 Address:172.31.7.137:18300} {Suffrage:Voter ID:172.31.7.136:18300 Address:172.31.7.136:18300}]
    latest_configuration_index = 1
    num_peers = 2
    protocol_version = 2
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 57
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 96
    max_procs = 2
    os = linux
    version = go1.8.3
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 7
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 23
    members = 3
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 23
    members = 7
    query_queue = 0
    query_time = 1
```

Operating system and Environment details

Linux 3.10.62-ltsi

Log Fragments

DC2:

==> Starting Consul agent...
==> Consul agent running!
           Version: 'v0.8.4'
           Node ID: 'a389aabf-d63a-148e-4962-cfd51b8e4bba'
         Node name: 'vManage-Viptela-DC2-2'
        Datacenter: 'dc2'
            Server: true (bootstrap: false)
       Client Addr: 0.0.0.0 (HTTP: 18500, HTTPS: 18501, DNS: 18600)
      Cluster Addr: 172.31.7.190 (LAN: 18301, WAN: 18302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2020/03/19 17:40:07 [INFO] raft: Restored from snapshot 21159-8192-1584632625973
    2020/03/19 17:40:07 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:172.31.7.190:18300 Address:172.31.7.190:18300} {Suffrage:Voter ID:172.31.7.189:18300 Address:172.31.7.189:18300} {Suffrage:Voter ID:172.31.7.191:18300 Address:172.31.7.191:18300}]
    2020/03/19 17:40:07 [INFO] raft: Node at 172.31.7.190:18300 [Follower] entering Follower state (Leader: "")
    2020/03/19 17:40:07 [INFO] serf: EventMemberJoin: vManage-Viptela-DC2-2 172.31.7.190
    2020/03/19 17:40:07 [INFO] serf: Attempting re-join to previously known node: vManage-Viptela-DC2-1: 172.31.7.189:18301
    2020/03/19 17:40:07 [INFO] consul: Adding LAN server vManage-Viptela-DC2-2 (Addr: tcp/172.31.7.190:18300) (DC: dc2)
    2020/03/19 17:40:07 [INFO] consul: Raft data found, disabling bootstrap mode
    2020/03/19 17:40:07 [INFO] serf: Attempting re-join to previously known node: vManage-Viptela-DC2-3: 172.31.7.191:18301
    2020/03/19 17:40:07 [WARN] serf: Failed to re-join any previously known node
    2020/03/19 17:40:07 [INFO] serf: EventMemberJoin: vManage-Viptela-DC2-2.dc2 172.31.7.190
    2020/03/19 17:40:07 [INFO] serf: Attempting re-join to previously known node: vManage-Viptela-DC2-1.dc2: 172.31.7.189:18302
    2020/03/19 17:40:07 [INFO] consul: Handled member-join event for server "vManage-Viptela-DC2-2.dc2" in area "wan"
    2020/03/19 17:40:07 [INFO] agent: Started DNS server 0.0.0.0:18600 (udp)
    2020/03/19 17:40:07 [INFO] agent: Started DNS server 0.0.0.0:18600 (tcp)
    2020/03/19 17:40:07 [INFO] agent: Started HTTP server on [::]:18500
    2020/03/19 17:40:07 [INFO] agent: Joining cluster...
    2020/03/19 17:40:07 [INFO] agent: (LAN) joining: [172.31.7.191]
    2020/03/19 17:40:07 [INFO] serf: Attempting re-join to previously known node: vManage-Viptela-Arbitrator-1.dc3: 172.31.7.112:18302
    2020/03/19 17:40:07 [INFO] agent: (LAN) joined: 0 Err: 1 error(s) occurred:

* Failed to join 172.31.7.191: dial tcp 172.31.7.191:18301: getsockopt: connection refused
    2020/03/19 17:40:07 [WARN] agent: Join failed: <nil>, retrying in 30s
    2020/03/19 17:40:07 [WARN] memberlist: Refuting an alive message
    2020/03/19 17:40:07 [INFO] serf: EventMemberJoin: vManage-Viptela-Arbitrator-1.dc3 172.31.7.112
    2020/03/19 17:40:07 [INFO] serf: EventMemberJoin: vManage-Viptela-DC2-3.dc2 172.31.7.191
    2020/03/19 17:40:07 [INFO] serf: Re-joined to previously known node: vManage-Viptela-Arbitrator-1.dc3: 172.31.7.112:18302
    2020/03/19 17:40:07 [INFO] consul: Handled member-join event for server "vManage-Viptela-Arbitrator-1.dc3" in area "wan"
    2020/03/19 17:40:07 [INFO] consul: Handled member-join event for server "vManage-Viptela-DC2-3.dc2" in area "wan"
    2020/03/19 17:40:07 [INFO] serf: EventMemberJoin: vManage-Viptela-DC2-1 172.31.7.189
    2020/03/19 17:40:07 [INFO] consul: Adding LAN server vManage-Viptela-DC2-1 (Addr: tcp/172.31.7.189:18300) (DC: dc2)
    2020/03/19 17:40:07 [INFO] serf: EventMemberJoin: vManage-Viptela-DC2-3 172.31.7.191
    2020/03/19 17:40:07 [INFO] consul: Adding LAN server vManage-Viptela-DC2-3 (Addr: tcp/172.31.7.191:18300) (DC: dc2)
    2020/03/19 17:40:07 [INFO] serf: EventMemberJoin: vManage-Viptela-DC2-1.dc2 172.31.7.189
    2020/03/19 17:40:07 [INFO] consul: Handled member-join event for server "vManage-Viptela-DC2-1.dc2" in area "wan"
    2020/03/19 17:40:07 [INFO] serf: EventMemberUpdate: vManage-Viptela-DC2-3.dc2
    2020/03/19 17:40:13 [INFO] consul: New leader elected: vManage-Viptela-DC2-1
"/var/log/nms/vmanage-server-drconsul.log" 5477 lines, 449560 characters

DC1:

==> Starting Consul agent...
==> Consul agent running!
           Version: 'v0.8.4'
           Node ID: '5f7b1167-beed-7214-6749-13b440ee1edf'
         Node name: 'vmanage-Viptela-DC1-2'
        Datacenter: 'dc1'
            Server: true (bootstrap: false)
       Client Addr: 0.0.0.0 (HTTP: 18500, HTTPS: 18501, DNS: 18600)
      Cluster Addr: 172.31.7.137 (LAN: 18301, WAN: 18302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2020/03/19 17:38:24 [INFO] raft: Restored from snapshot 52-24576-1584606695595
    2020/03/19 17:38:24 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:172.31.7.138:18300 Address:172.31.7.138:18300} {Suffrage:Voter ID:172.31.7.137:18300 Address:172.31.7.137:18300} {Suffrage:Voter ID:172.31.7.136:18300 Address:172.31.7.136:18300}]
    2020/03/19 17:38:24 [INFO] raft: Node at 172.31.7.137:18300 [Follower] entering Follower state (Leader: "")
    2020/03/19 17:38:24 [INFO] serf: EventMemberJoin: vmanage-Viptela-DC1-2 172.31.7.137
    2020/03/19 17:38:24 [INFO] serf: Attempting re-join to previously known node: vManage-Viptela-DC1-1: 172.31.7.136:18301
    2020/03/19 17:38:24 [INFO] consul: Adding LAN server vmanage-Viptela-DC1-2 (Addr: tcp/172.31.7.137:18300) (DC: dc1)
    2020/03/19 17:38:24 [INFO] consul: Raft data found, disabling bootstrap mode
    2020/03/19 17:38:24 [INFO] serf: EventMemberJoin: vManage-Viptela-DC1-1 172.31.7.136
    2020/03/19 17:38:24 [INFO] serf: Re-joined to previously known node: vManage-Viptela-DC1-1: 172.31.7.136:18301
    2020/03/19 17:38:24 [INFO] consul: Adding LAN server vManage-Viptela-DC1-1 (Addr: tcp/172.31.7.136:18300) (DC: dc1)
    2020/03/19 17:38:24 [INFO] serf: EventMemberJoin: vmanage-Viptela-DC1-2.dc1 172.31.7.137
    2020/03/19 17:38:24 [INFO] serf: Attempting re-join to previously known node: vmanage-Viptela-DC1-3.dc1: 172.31.7.138:18302
    2020/03/19 17:38:24 [INFO] consul: Handled member-join event for server "vmanage-Viptela-DC1-2.dc1" in area "wan"
    2020/03/19 17:38:24 [INFO] agent: Started DNS server 0.0.0.0:18600 (udp)
    2020/03/19 17:38:24 [INFO] serf: Attempting re-join to previously known node: vManage-Viptela-DC1-1.dc1: 172.31.7.136:18302
    2020/03/19 17:38:24 [INFO] agent: Started DNS server 0.0.0.0:18600 (tcp)
    2020/03/19 17:38:24 [INFO] agent: Started HTTP server on [::]:18500
    2020/03/19 17:38:24 [INFO] agent: Joining cluster...
    2020/03/19 17:38:24 [INFO] agent: (LAN) joining: [172.31.7.138]
    2020/03/19 17:38:24 [INFO] serf: Re-joined to previously known node: vManage-Viptela-DC1-1.dc1: 172.31.7.136:18302
    2020/03/19 17:38:24 [INFO] consul: Handled member-join event for server "vManage-Viptela-DC1-1.dc1" in area "wan"
    2020/03/19 17:38:24 [INFO] agent: (LAN) joined: 0 Err: 1 error(s) occurred:

* Failed to join 172.31.7.138: dial tcp 172.31.7.138:18301: getsockopt: connection refused
    2020/03/19 17:38:24 [WARN] agent: Join failed: <nil>, retrying in 30s
    2020/03/19 17:38:24 [INFO] serf: EventMemberJoin: vmanage-Viptela-DC1-3 172.31.7.138
    2020/03/19 17:38:24 [INFO] consul: Adding LAN server vmanage-Viptela-DC1-3 (Addr: tcp/172.31.7.138:18300) (DC: dc1)
    2020/03/19 17:38:24 [INFO] serf: EventMemberJoin: vmanage-Viptela-DC1-3.dc1 172.31.7.138
    2020/03/19 17:38:24 [INFO] consul: Handled member-join event for server "vmanage-Viptela-DC1-3.dc1" in area "wan"
    2020/03/19 17:38:31 [WARN] raft: Failed to get previous log: 28592 log not found (last: 28591)
    2020/03/19 17:38:31 [INFO] consul: New leader elected: vManage-Viptela-DC1-1

mohito83 commented 4 years ago

The only time DC1 members were added to the WAN gossip pool was when I manually executed the following command: curl http://127.0.0.1:18500/v1/agent/join/172.31.7.189?wan=1
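
The manual workaround above can be repeated for every remote server and followed by a membership check. This is a rough sketch rather than a fix: wan_join_url and wan_join_all are hypothetical helper names, the addresses are the ones from this issue, and the request is the same plain curl call used above. Newer Consul releases document this endpoint as a PUT, so the curl invocation would need adjusting there.

```shell
# Build the agent join URL. wan_join_url is a hypothetical helper.
# $1 = local agent HTTP address, $2 = remote server address.
wan_join_url() {
  printf 'http://%s/v1/agent/join/%s?wan=1\n' "$1" "$2"
}

# Ask the local agent to WAN-join each remote server in turn and
# report per-address success or failure based on curl's exit status.
wan_join_all() {
  agent=$1
  shift
  for remote in "$@"; do
    if curl -fsS "$(wan_join_url "$agent" "$remote")" >/dev/null; then
      echo "joined $remote"
    else
      echo "failed $remote"
    fi
  done
}

# Example, run on a DC1 server, followed by a WAN membership check:
# wan_join_all 127.0.0.1:18500 172.31.7.189 172.31.7.190 172.31.7.191 172.31.7.112
# consul members -wan -http-addr=127.0.0.1:18500
```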

ChipV223 commented 4 years ago

Hi @mohito83!

Generally, log messages such as Failed to join 172.31.7.191: dial tcp 172.31.7.191:18301: getsockopt: connection refused indicate that something in the network could be blocking communication. Since you have custom ports for DNS, HTTP, HTTPS, Serf LAN, and Serf WAN, can you confirm that those ports are open in the firewalls in both DCs?
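
One way to answer the question above is to probe the ports directly from a server in each datacenter. The sketch below assumes a netcat (nc) that supports -z and -w. targets, probe_tcp, and check_all are hypothetical helper names. It tests TCP only, while Serf gossip also uses UDP on the same ports, so a passing result does not rule out a UDP block.

```shell
# Print one "host port" pair per line for every host/port combination.
# $1 is a space-separated port list; the remaining arguments are hosts.
targets() {
  ports=$1
  shift
  for host in "$@"; do
    for port in $ports; do
      printf '%s %s\n' "$host" "$port"
    done
  done
}

# Attempt a TCP connection with a 2-second timeout.
# Exit status 0 means something accepted the connection.
probe_tcp() {
  nc -z -w 2 "$1" "$2" >/dev/null 2>&1
}

# Report each target as open or closed/unreachable.
check_all() {
  targets "$@" | while read -r host port; do
    if probe_tcp "$host" "$port"; then
      echo "$host:$port open"
    else
      echo "$host:$port closed or unreachable"
    fi
  done
}

# Example, run from a DC1 server (ports and addresses from this issue):
# check_all "18301" 172.31.7.138
# check_all "18300 18302" 172.31.7.189 172.31.7.190 172.31.7.191 172.31.7.112
```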

mohito83 commented 4 years ago

Hi @ChipV223

I checked the iptables rules and other firewalls; nothing is blocking ports 18301 and 18302 across the datacenters. There could be some momentary packet drops, but that shouldn't stop Consul in DC1 from joining the WAN gossip pool.

jkirschner-hashicorp commented 3 years ago

Hi @mohito83,

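My apologies for the delayed response. Based on the information available above, my understanding is that you have servers at:

* DC1: 172.31.7.136, 172.31.7.137, 172.31.7.138
* DC2: 172.31.7.189, 172.31.7.190, 172.31.7.191
* DC3: 172.31.7.112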

The two join failures I can see are:

# DC1
* Failed to join 172.31.7.138: dial tcp 172.31.7.138:18301: getsockopt: connection refused
# DC2
* Failed to join 172.31.7.191: dial tcp 172.31.7.191:18301: getsockopt: connection refused

Is that the set of error messages / failed behaviors you were asking about?

Are you sure that the Consul server agent at 172.31.7.138 is running and reachable from DC1? And same for 172.31.7.191 from DC2?

If you are still experiencing this issue, especially with a more recent version of Consul, let us know. Until then, I'm going to mark this as closed because it's been inactive for so long and we can't take further action without more information.