hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

retry-join cannot discover other servers on AWS ECS #3797

Closed cs-mahmoud-khateeb closed 6 years ago

cs-mahmoud-khateeb commented 6 years ago

I am trying to start three Consul servers in an ECS cluster that consists of three EC2 instances, each in a different availability zone within one AWS region.

the command I am using in my task definition: agent,-server,-client=0.0.0.0,-bootstrap-expect=3,-datacenter=eu-west-1,-retry-join="provider=aws tag_name=aws:autoscaling:groupName tag_value=ecs_cluster",-log-level=TRACE

and I am getting the following log on all three instances (it is worth mentioning that the task definition has an IAM role with the ability to DescribeInstances):

==> Found address 'x.x.x.x' for interface 'eth0', setting bind option...
bootstrap_expect > 0: expecting 3 servers
==> Starting Consul agent...
==> Consul agent running!
Version: 'v1.0.2'
Node ID: 'b5062427-37cd-ffcd-8019-803ba5454c83'
Node name: 'ip-x-x-x-x'
Datacenter: 'eu-west-1' (Segment: '<all>')
Server: true (Bootstrap: false)
Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, DNS: 8600)
Cluster Addr: x.x.x.x (LAN: 8301, WAN: 8302)
Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false
==> Log data will now stream in as it occurs:
2018/01/12 09:44:59 [DEBUG] Using random ID "b5062427-37cd-ffcd-8019-803ba5454c83" as node ID
2018/01/12 09:44:59 [INFO] raft: Initial configuration (index=0): []
2018/01/12 09:44:59 [INFO] raft: Node at x.x.x.x:8300 [Follower] entering Follower state (Leader: "")
2018/01/12 09:44:59 [INFO] serf: EventMemberJoin: ip-x-x-x-x.eu-west-1 x.x.x.x
2018/01/12 09:44:59 [INFO] serf: EventMemberJoin: ip-x-x-x-x x.x.x.x
2018/01/12 09:44:59 [INFO] consul: Adding LAN server ip-x-x-x-x (Addr: tcp/x.x.x.x:8300) (DC: eu-west-1)
2018/01/12 09:44:59 [INFO] consul: Handled member-join event for server "ip-x-x-x-x.eu-west-1" in area "wan"
2018/01/12 09:44:59 [INFO] agent: Started DNS server 0.0.0.0:8600 (tcp)
2018/01/12 09:44:59 [INFO] agent: Started DNS server 0.0.0.0:8600 (udp)
2018/01/12 09:44:59 [INFO] agent: Started HTTP server on [::]:8500 (tcp)
2018/01/12 09:44:59 [INFO] agent: started state syncer
2018/01/12 09:44:59 [INFO] agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce os scaleway softlayer
2018/01/12 09:44:59 [INFO] agent: Joining LAN cluster...
2018/01/12 09:44:59 [ERR] agent: Join LAN: discover: provider=aws tag_name=aws:autoscaling:groupName tag_value=ecs_cluster: missing '='
2018/01/12 09:44:59 [WARN] agent: Join LAN failed: No servers to join, retrying in 30s
2018/01/12 09:45:06 [WARN] raft: no known peers, aborting election
2018/01/12 09:45:07 [ERR] agent: failed to sync remote state: No cluster leader
2018/01/12 09:45:29 [ERR] agent: Join LAN: discover: provider=aws tag_name=aws:autoscaling:groupName tag_value=ecs_cluster: missing '='

I have tried the following to make sure everything is fine:

/ # ps -ef
PID   USER     TIME   COMMAND
    1 root       0:00 {docker-entrypoi} /usr/bin/dumb-init /bin/sh /usr/local/bin/docker-entrypoint.sh agent -server -client=0.0.0.0 -bootstrap-expect=3 -datacenter=eu-west-1 -retry-join="provider=aws tag_name=aws:autoscaling:groupName tag_value=ecs_cluster"
    6 consul     0:00 consul agent -data-dir=/consul/data -config-dir=/consul/config -bind=x.x.x.x -server -client=0.0.0.0 -bootstrap-expect=3 -datacenter=eu-west-1 -retry-join="provider=aws tag_name=aws:autoscaling:groupName tag_value=ecs_cluster"
   25 root       0:00 sh
   29 root       0:00 ps -ef

I have also tried to SSH directly into the container instances and run the following command (with an AWS key): docker run --net=host -e 'CONSUL_BIND_INTERFACE=eth0' consul agent -server -client=0.0.0.0 -bootstrap-expect=3 -datacenter=eu-west-1 -retry-join="provider=aws tag_name=Name tag_value=ecs_cluster access_key_id=x secret_access_key=x" -log-level=TRACE

and I got slightly different output:

==> Found address 'x.x.x.x' for interface 'eth0', setting bind option...
bootstrap_expect > 0: expecting 3 servers
==> Starting Consul agent...
==> Consul agent running!
           Version: 'v1.0.2'
           Node ID: '86d75175-f8ac-badc-19d3-30e9942f09d0'
         Node name: 'ip-x-x-x-x'
        Datacenter: 'eu-west-1' (Segment: '<all>')
            Server: true (Bootstrap: false)
       Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, DNS: 8600)
      Cluster Addr: x.x.x.x (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2018/01/12 09:43:27 [DEBUG] Using random ID "86d75175-f8ac-badc-19d3-30e9942f09d0" as node ID
    2018/01/12 09:43:27 [INFO] raft: Initial configuration (index=0): []
    2018/01/12 09:43:27 [INFO] raft: Node at x.x.x.x:8300 [Follower] entering Follower state (Leader: "")
    2018/01/12 09:43:27 [INFO] serf: EventMemberJoin: ip-x-x-x-x.eu-west-1 x.x.x.x
    2018/01/12 09:43:27 [INFO] serf: EventMemberJoin: ip-x-x-x-x x.x.x.x
    2018/01/12 09:43:27 [INFO] consul: Adding LAN server ip-x-x-x-x (Addr: tcp/x.x.x.x:8300) (DC: eu-west-1)
    2018/01/12 09:43:27 [INFO] consul: Handled member-join event for server "ip-x-x-x-x.eu-west-1" in area "wan"
    2018/01/12 09:43:27 [INFO] agent: Started DNS server 0.0.0.0:8600 (tcp)
    2018/01/12 09:43:27 [INFO] agent: Started DNS server 0.0.0.0:8600 (udp)
    2018/01/12 09:43:27 [INFO] agent: Started HTTP server on [::]:8500 (tcp)
    2018/01/12 09:43:27 [INFO] agent: started state syncer
    2018/01/12 09:43:27 [INFO] agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce os scaleway softlayer
    2018/01/12 09:43:27 [INFO] agent: Joining LAN cluster...
    2018/01/12 09:43:27 [DEBUG] discover: Using provider "aws"
    2018/01/12 09:43:27 [INFO] discover-aws: Address type  is not supported. Valid values are {private_v4,public_v4,public_v6}. Falling back to 'private_v4'
    2018/01/12 09:43:27 [INFO] discover-aws: Region not provided. Looking up region in metadata...
    2018/01/12 09:43:27 [INFO] discover-aws: Region is eu-west-1
    2018/01/12 09:43:27 [DEBUG] discover-aws: Creating session...
    2018/01/12 09:43:27 [INFO] discover-aws: Filter instances with =ecs_cluster
    2018/01/12 09:43:28 [DEBUG] discover-aws: Found 0 reservations
    2018/01/12 09:43:28 [DEBUG] discover-aws: Found ip addresses: []
    2018/01/12 09:43:28 [INFO] agent: Discovered LAN servers:
    2018/01/12 09:43:28 [WARN] agent: Join LAN failed: No servers to join, retrying in 30s
    2018/01/12 09:43:34 [ERR] agent: failed to sync remote state: No cluster leader
    2018/01/12 09:43:37 [WARN] raft: no known peers, aborting election

It is worth mentioning that, according to IAM, the AWS key passed in the command was never used.

I was able to get around this by running: consul join x.x.x.x inside the containers.

aaronhurt commented 6 years ago

Documenting some of the conversation from gitter for others that may see this issue...

In the first case, it appears to be an option passing/parsing problem on AWS: consul/go-discover is seeing the entire retry-join value as one string and attempting to parse it all as the provider.

Additionally, it should be tag_key and tag_value instead of tag_name and tag_value in the discovery string.
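
To make both points concrete, here is a minimal Python sketch of a space-separated key=value parser. It is not go-discover's actual code: `parse_discover_string` and `tag_filter` are hypothetical helpers, and the only thing taken from the issue is the shape of the log line "Filter instances with =ecs_cluster". It shows that literal double quotes hide the `provider` key, and that `tag_name` leaves the tag key empty.

```python
def parse_discover_string(s):
    """Split 'k1=v1 k2=v2' into a dict (hypothetical helper, not go-discover)."""
    config = {}
    for field in s.split():
        key, sep, value = field.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in {field!r}")
        config[key] = value
    return config


def tag_filter(config):
    """Build 'key=value' the way the discover-aws log line prints it (assumption)."""
    return f"{config.get('tag_key', '')}={config.get('tag_value', '')}"


# Literal quotes become part of the first key and the last value.
quoted = parse_discover_string('"provider=aws tag_key=foo tag_value=bar"')
# tag_name is not a recognized key, so the tag key stays empty.
wrong_key = parse_discover_string("provider=aws tag_name=Name tag_value=ecs_cluster")
correct = parse_discover_string("provider=aws tag_key=Name tag_value=ecs_cluster")

print("provider" in quoted)   # False
print(tag_filter(wrong_key))  # =ecs_cluster
print(tag_filter(correct))    # Name=ecs_cluster
```

The second printed line matches the empty-key filter seen in the log from the manual docker run, which is consistent with `tag_name` being ignored.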

cs-mahmoud-khateeb commented 6 years ago

closing as per above.

cs-mahmoud-khateeb commented 6 years ago

For the AWS ECS issue: there is no need to wrap the -retry-join value in double quotes.

aaronhurt commented 6 years ago

awesome, simple enough :)

ebarault commented 6 years ago

for posterity

When used in a command block in an AWS ECS task definition, it should look like this (although it looks weird):

# task-definition.json
{
  "containerDefinitions": [
     {
        "command": [
          "agent",
          "-server",
          "-retry-join=provider=aws tag_key=foo tag_value=bar"
        ],
...
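
The difference between the shell form and the task-definition form can be sketched in Python. This is only an illustration under two assumptions: the standard-library `shlex` module stands in for a POSIX shell, and the comma-separated command field is split on commas with no quote handling (which is what the `ps -ef` output earlier in the issue suggests). The tag values `foo` and `bar` are the placeholders from the JSON above.

```python
import json
import shlex

# What you would type in a shell: the shell consumes the double quotes.
shell_line = 'agent -server -retry-join="provider=aws tag_key=foo tag_value=bar"'
shell_args = shlex.split(shell_line)

# Comma-separated command field, split naively: the quotes stay in the argument.
csv_line = 'agent,-server,-retry-join="provider=aws tag_key=foo tag_value=bar"'
literal_args = csv_line.split(",")

# The shell-style result is the array to put in "command".
print(json.dumps(shell_args, indent=2))
# The naive split still carries the literal quotes that break discovery.
print(literal_args[-1])
```

Running a known-good shell command through `shlex.split` and pasting the resulting list into the `command` array is one way to avoid stray quotes; each list element reaches the container as exactly one argument.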
matelang commented 6 years ago

Thanks @ebarault . Your comment saved me hours of debugging.