hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul client on Windows cannot find server nodes on AWS #8866

Closed marcofiocco closed 3 years ago

marcofiocco commented 4 years ago

Overview of the Issue

I have created a cluster on AWS using https://github.com/hashicorp/nomad-autoscaler. The Ubuntu server and client nodes work fine and can find each other. I now have a Windows 2016 instance on AWS (in the same subnet as one of the Linux clients) on which I have installed Nomad and Consul. Nomad should join the servers through Consul auto-join, as the Linux clients do, but this does not work on the Windows instance. Note that I have already tagged the AWS instance with ConsulAutoJoin = auto-join.

Reproduction Steps

The Consul HCL is (using the IP of the Windows instance):

datacenter = "dc1"
data_dir = "C:\\Consul\\data"
advertise_addr = "10.241.238.196"
bind_addr = "0.0.0.0"
client_addr = "0.0.0.0"
log_level = "INFO"
retry_join = ["provider=aws tag_key=ConsulAutoJoin tag_value=auto-join"]
ui = true
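
The retry_join value above sets neither region nor addr_type, which matches the two discover-aws INFO lines in the agent log further down ("Region not provided" and "Address type  is not supported"). The sketch below is a simplified stand-in for the real parser, not the go-discover implementation: it splits the string into key=value pairs, reports which of the two keys are missing, and fills them in. The region "eu-west-1" is a placeholder assumption. Setting region should let the agent skip the metadata region lookup, but the agent may still need the metadata service for instance-role credentials, so this is not offered as a fix for the timeout.

```python
import shlex

# Keys whose absence triggers the two discover-aws INFO lines in the log.
OPTIONAL_KEYS = ("region", "addr_type")


def parse_retry_join(value):
    """Split a 'key=value key=value' discovery string into a dict."""
    pairs = {}
    for token in shlex.split(value):
        key, sep, val = token.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {token!r}")
        pairs[key] = val
    return pairs


def with_defaults(value, region, addr_type="private_v4"):
    """Return the string with region and addr_type added if they are missing."""
    pairs = parse_retry_join(value)
    pairs.setdefault("region", region)
    pairs.setdefault("addr_type", addr_type)
    # Values containing spaces would need re-quoting; none occur here.
    return " ".join(f"{k}={v}" for k, v in pairs.items())


original = "provider=aws tag_key=ConsulAutoJoin tag_value=auto-join"
missing = [k for k in OPTIONAL_KEYS if k not in parse_retry_join(original)]
print("not set:", missing)
# "eu-west-1" is a placeholder; use the region the servers actually run in.
print(with_defaults(original, region="eu-west-1"))
```

The second printed line is the string that would go inside the retry_join list in the HCL file.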

Consul info for both Client and Server

Client info

```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 1
    services = 1
build:
    prerelease =
    revision = 12b16df3
    version = 1.8.4
consul:
    acl = disabled
    known_servers = 0
    server = false
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 51
    max_procs = 4
    os = windows
    version = go1.14.6
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 2
    members = 1
    query_queue = 0
    query_time = 1
```

Server info

```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 3
    services = 4
build:
    prerelease =
    revision = 12b16df3
    version = 1.8.4
consul:
    acl = disabled
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr = 10.241.238.204:8300
    server = true
raft:
    applied_index = 173
    commit_index = 173
    fsm_pending = 0
    last_contact = 76.814564ms
    last_log_index = 173
    last_log_term = 2
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Voter ID:8e9862e8-54ac-7595-84a9-08de035cbbca Address:10.241.238.204:8300} {Suffrage:Voter ID:aac8650d-1d5f-d4e0-3fb5-ce0ee19469f9 Address:10.241.238.236:8300} {Suffrage:Voter ID:557417e6-d3c3-8cd7-3b79-2dcd50067feb Address:10.241.239.15:8300}]
    latest_configuration_index = 0
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 2
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 96
    max_procs = 2
    os = linux
    version = go1.14.6
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 2
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 4
    members = 4
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 4
    members = 3
    query_queue = 0
    query_time = 1
```

Operating system and Environment details

Windows Server 2016 (amd64)

Log Fragments

PS C:\Users\Administrator> consul.exe agent -config-dir=C:\Consul\config\
==> Starting Consul agent...
           Version: '1.8.4'
           Node ID: 'cfc5d53c-1747-8945-ce84-50cdef8d40cd'
         Node name: 'EC2AMAZ-IMF3L1P'
        Datacenter: 'dc1' (Segment: '')
            Server: false (Bootstrap: false)
       Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
      Cluster Addr: 10.241.238.196 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false

==> Log data will now stream in as it occurs:

    2020-10-01T10:38:14.559Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: EC2AMAZ-IMF3L1P 10.241.238.196
    2020-10-01T10:38:14.657Z [INFO]  agent.router: Initializing LAN area manager
    2020-10-01T10:38:14.658Z [INFO]  agent: Started DNS server: address=0.0.0.0:8600 network=udp
    2020-10-01T10:38:14.659Z [INFO]  agent: Started DNS server: address=0.0.0.0:8600 network=tcp
    2020-10-01T10:38:14.659Z [INFO]  agent: Started HTTP server: address=[::]:8500 network=tcp
    2020-10-01T10:38:14.659Z [INFO]  agent: started state syncer
==> Consul agent running!
    2020-10-01T10:38:14.659Z [WARN]  agent.router.manager: No servers available
    2020-10-01T10:38:14.660Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
    2020-10-01T10:38:14.659Z [INFO]  agent: Retry join is supported for the following discovery methods: cluster=LAN discovery_methods="aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere"
    2020-10-01T10:38:14.660Z [INFO]  agent: Joining cluster...: cluster=LAN
    2020-10-01T10:38:14.660Z [INFO]  agent: discover-aws: Address type  is not supported. Valid values are {private_v4,public_v4,public_v6}. Falling back to 'private_v4': cluster=LAN
    2020-10-01T10:38:14.660Z [INFO]  agent: discover-aws: Region not provided. Looking up region in metadata...: cluster=LAN
    2020-10-01T10:38:43.098Z [WARN]  agent.router.manager: No servers available
    2020-10-01T10:38:43.098Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
    2020-10-01T10:38:55.439Z [ERROR] agent: Cannot discover address: cluster=LAN address="provider=aws tag_key=ConsulAutoJoin tag_value=auto-join" error="discover-aws: GetInstanceIdentityDocument failed: EC2MetadataRequestError: failed to get EC2 instance identity document
caused by: RequestError: send request failed
caused by: Get "http://169.254.169.254/latest/dynamic/instance-identity/document": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
    2020-10-01T10:38:55.440Z [WARN]  agent: Join cluster failed, will retry: cluster=LAN retry_interval=30s error="No servers to join"
    2020-10-01T10:38:58.905Z [WARN]  agent.router.manager: No servers available
    2020-10-01T10:38:58.905Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
    2020-10-01T10:39:16.803Z [WARN]  agent.router.manager: No servers available
    2020-10-01T10:39:16.803Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
    2020-10-01T10:39:25.450Z [INFO]  agent: discover-aws: Address type  is not supported. Valid values are {private_v4,public_v4,public_v6}. Falling back to 'private_v4': cluster=LAN
    2020-10-01T10:39:25.450Z [INFO]  agent: discover-aws: Region not provided. Looking up region in metadata...: cluster=LAN
    2020-10-01T10:39:33.389Z [WARN]  agent.router.manager: No servers available
    2020-10-01T10:39:33.389Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
    2020-10-01T10:39:55.108Z [WARN]  agent.router.manager: No servers available
    2020-10-01T10:39:55.108Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
    2020-10-01T10:40:06.213Z [ERROR] agent: Cannot discover address: cluster=LAN address="provider=aws tag_key=ConsulAutoJoin tag_value=auto-join" error="discover-aws: GetInstanceIdentityDocument failed: EC2MetadataRequestError: failed to get EC2 instance identity document
caused by: RequestError: send request failed
caused by: Get "http://169.254.169.254/latest/dynamic/instance-identity/document": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
    2020-10-01T10:40:06.216Z [WARN]  agent: Join cluster failed, will retry: cluster=LAN retry_interval=30s error="No servers to join"
    2020-10-01T10:40:10.507Z [WARN]  agent.router.manager: No servers available
    2020-10-01T10:40:10.507Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
    2020-10-01T10:40:10.597Z [INFO]  agent: Caught: signal=interrupt
    2020-10-01T10:40:10.597Z [INFO]  agent: Gracefully shutting down agent...
    2020-10-01T10:40:10.598Z [INFO]  agent.client: client starting leave
    2020-10-01T10:40:10.599Z [INFO]  agent.client.serf.lan: serf: EventMemberLeave: EC2AMAZ-IMF3L1P 10.241.238.196
    2020-10-01T10:40:13.601Z [INFO]  agent: Graceful exit completed
    2020-10-01T10:40:13.601Z [INFO]  agent: Requesting shutdown
    2020-10-01T10:40:13.602Z [INFO]  agent.client: shutting down client
    2020-10-01T10:40:13.605Z [INFO]  agent: consul client down
    2020-10-01T10:40:13.605Z [INFO]  agent: shutdown complete
    2020-10-01T10:40:13.606Z [INFO]  agent: Stopping server: protocol=DNS address=0.0.0.0:8600 network=tcp
    2020-10-01T10:40:13.609Z [INFO]  agent: Stopping server: protocol=DNS address=0.0.0.0:8600 network=udp
    2020-10-01T10:40:13.609Z [INFO]  agent: Stopping server: protocol=HTTP address=[::]:8500 network=tcp
    2020-10-01T10:40:13.612Z [INFO]  agent: Waiting for endpoints to shut down
    2020-10-01T10:40:13.613Z [INFO]  agent: Endpoints down
    2020-10-01T10:40:13.613Z [INFO]  agent: Exit code: code=0
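
Both join attempts in the log fail on the same request: a GET to http://169.254.169.254/latest/dynamic/instance-identity/document that ends in "context deadline exceeded". A timeout means no response arrived at all, which usually suggests the instance cannot reach the metadata service. A minimal sketch for checking that from the instance, independent of Consul, is shown here. The function name `probe` and the 5 second timeout are illustrative assumptions, not anything Consul uses.

```python
import urllib.error
import urllib.request

# Link-local URL copied from the error in the agent log.
METADATA_URL = "http://169.254.169.254/latest/dynamic/instance-identity/document"


def probe(url=METADATA_URL, timeout=5.0):
    """Send one GET to url and return (reachable, detail)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The service answered, so the network path works even though
        # the request itself was rejected.
        return True, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # Timeouts, refused connections and missing routes end up here.
        return False, str(exc)
```

Calling `probe()` with no arguments on the affected instance returns `(False, ...)` if the request times out the same way the agent's did. Any HTTP status, including an error status, shows that the metadata service is reachable and the cause lies elsewhere.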

igordcsouza commented 4 years ago

@marcofiocco The error `agent: Cannot discover address: cluster=LAN address="provider=aws tag_key=ConsulAutoJoin tag_value=auto-join" error="discover-aws: GetInstanceIdentityDocument failed: EC2MetadataRequestError: failed to get EC2 instance identity document` is probably caused by a missing IAM role policy. Try adding the policy below to the IAM role attached to the instance.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeTags",
                "ec2:DescribeInstances",
                "autoscaling:DescribeAutoScalingGroups"
            ],
            "Resource": "*"
        }
    ]
}

marcofiocco commented 4 years ago

The problem might have been with my AMI. I had created the AMI from an instance without following the correct procedure (I shut the instance down without running Sysprep). As a result, some subnets were wrong, which also prevented instances from booting with user_data. I then followed the correct procedure (with Sysprep), ran the same Consul configuration on a new instance launched from the new AMI, and now, magically, it works.

jkirschner-hashicorp commented 3 years ago

Hi @marcofiocco,

I'm closing this issue as it seems like you resolved it and that it wasn't caused by Consul. Please reply if I misunderstood. Thanks!