hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

discover-aws will use by default the first private IP of the EC2 instance #7268

Open ltagliamonte-dd opened 4 years ago

ltagliamonte-dd commented 4 years ago

Based on the official Helm chart for Consul, I'm writing a Terraform module to deploy Consul in my AWS environment on EC2.

The Helm chart runs Consul as a StatefulSet, which retains the identity of each node (IP and data disk).

I've automated re-attaching an EBS volume and an ENI to my instance, which runs in an ASG.

The problem I'm having now is that cloud-join no longer works, because the DescribeInstances lookup uses the first private IP of the server by default. That IP belongs to eth0, which I've turned down, instead of eth1, which is the ENI.

I'm aware I can add the IPs to the retry-join config, but that would complicate how the agent instances discover the cluster.
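For reference, the static fallback would look roughly like the fragment generated below. This is a minimal sketch, assuming the ENI addresses are known ahead of time: the three IPs are hypothetical placeholders, the helper name is made up for illustration, and `retry_join` is the config-file form of the `-retry-join` flag.

```python
import json

# Hypothetical ENI addresses, placeholders only (not taken from a real cluster).
ENI_IPS = ["10.99.0.50", "10.99.1.50", "10.99.2.50"]


def static_retry_join_config(ips):
    """Return a Consul agent config fragment that joins a fixed list of IPs."""
    if not ips:
        raise ValueError("at least one server address is required")
    return {"retry_join": list(ips)}


# Serialize the fragment so it can be dropped into the agent's config directory.
config_json = json.dumps(static_retry_join_config(ENI_IPS), indent=2)
print(config_json)
```

Every agent would need this list, and it would have to change whenever an ENI address changes, which is the complication described above.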

Is there any way to tell cloud-join to use a secondary ENI? And what is the recommendation around retaining the data disk and the IP address: one, the other, or both? At my previous job I ran Consul on ephemeral instances without retaining anything and never had issues, but I'd really like to hear what the community and the HashiCorp team recommend.

ltagliamonte-dd commented 4 years ago

adding logs:

Feb 12 01:32:46 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:     2020/02/12 01:32:46 [INFO] discover-aws: Region is us-west-2
Feb 12 01:32:46 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:     2020/02/12 01:32:46 [INFO] discover-aws: Filter instances with Type=consul-server
Feb 12 01:32:46 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:     2020/02/12 01:32:46 [INFO] discover-aws: Instance i-0986f7648619efb2a has private ip 10.99.0.15
Feb 12 01:32:46 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:     2020/02/12 01:32:46 [INFO] discover-aws: Instance i-08f274580df100e48 has private ip 10.99.0.25
Feb 12 01:32:46 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:     2020/02/12 01:32:46 [INFO] discover-aws: Instance i-0510a01a548f44a3b has private ip 10.99.1.79
Feb 12 01:32:46 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:     2020/02/12 01:32:46 [INFO] discover-aws: Instance i-0eaa033f9cbbdc355 has private ip 10.99.1.34
Feb 12 01:32:46 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:     2020/02/12 01:32:46 [INFO] discover-aws: Instance i-0c53c5f809dd89000 has private ip 10.99.2.105
Feb 12 01:32:46 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:     2020/02/12 01:32:46 [INFO] agent: Discovered LAN servers: 10.99.0.15 10.99.0.25 10.99.1.79 10.99.1.34 10.99.2.105
Feb 12 01:32:46 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:     2020/02/12 01:32:46 [INFO] agent: (LAN) joining: [10.99.0.15 10.99.0.25 10.99.1.79 10.99.1.34 10.99.2.105]
Feb 12 01:32:59 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:     2020/02/12 01:32:59 [ERR] agent: failed to sync remote state: No cluster leader
Feb 12 01:33:00 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:     2020/02/12 01:33:00 [ERR] http: Request GET /v1/agent/metrics?format=prometheus, error: ACL not found from=127.0.0.1:39274
Feb 12 01:33:00 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:     2020/02/12 01:33:00 [ERR] http: Request GET /v1/agent/members, error: ACL not found from=172.17.0.2:33798
Feb 12 01:33:03 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:     2020/02/12 01:33:03 [ERR] agent: Coordinate update error: No cluster leader
Feb 12 01:33:07 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:     2020/02/12 01:33:07 [ERR] http: Request GET /v1/catalog/nodes?stale=, error: No cluster leader from=172.17.0.2:33796
Feb 12 01:33:07 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:     2020/02/12 01:33:07 [ERR] http: Request GET /v1/catalog/services?stale=, error: No cluster leader from=172.17.0.2:33798
Feb 12 01:33:08 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:     2020/02/12 01:33:08 [ERR] http: Request GET /v1/health/state/any?stale=, error: No cluster leader from=172.17.0.2:33800
Feb 12 01:33:25 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:     2020/02/12 01:33:25 [ERR] agent: Coordinate update error: No cluster leader
Feb 12 01:33:26 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:     2020/02/12 01:33:26 [ERR] agent: failed to sync remote state: No cluster leader
Feb 12 01:33:29 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:     2020/02/12 01:33:29 [WARN] agent: (LAN) couldn't join: 0 Err: 5 errors occurred:
Feb 12 01:33:29 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:         * Failed to join 10.99.0.15: dial tcp 10.99.0.15:8301: i/o timeout
Feb 12 01:33:29 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:         * Failed to join 10.99.0.25: dial tcp 10.99.0.25:8301: connect: connection refused
Feb 12 01:33:29 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:         * Failed to join 10.99.1.79: dial tcp 10.99.1.79:8301: i/o timeout
Feb 12 01:33:29 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:         * Failed to join 10.99.1.34: dial tcp 10.99.1.34:8301: i/o timeout
Feb 12 01:33:29 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:         * Failed to join 10.99.2.105: dial tcp 10.99.2.105:8301: i/o timeout
Feb 12 01:33:29 ip-10-99-0-198.us-west-2.compute.internal docker[2640]:     2020/02/12 01:33:29 [WARN] agent: Join LAN failed: <nil>, retrying in 10s

mkeeler commented 4 years ago

@ltagliamonte-dd Cloud auto-join, and specifically the selection of instance IP addresses, is all done within go-discover. The AWS provider config currently allows you to select individual instances and then choose from the first private IPv4, public IPv4, or public IPv6 address.
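As a rough illustration of what that config can express, the sketch below composes an AWS auto-join string. The keys `provider`, `region`, `tag_key`, `tag_value`, and `addr_type` are the go-discover AWS provider options; the region and tag come from the log excerpt in this thread, and the helper name `aws_retry_join` is hypothetical.

```python
# addr_type values accepted by the go-discover AWS provider.
ADDR_TYPES = ("private_v4", "public_v4", "public_v6")


def aws_retry_join(region, tag_key, tag_value, addr_type="private_v4"):
    """Compose a cloud auto-join string for the go-discover AWS provider."""
    if addr_type not in ADDR_TYPES:
        raise ValueError(f"addr_type must be one of {ADDR_TYPES}, got {addr_type!r}")
    parts = {
        "provider": "aws",
        "region": region,
        "tag_key": tag_key,
        "tag_value": tag_value,
        "addr_type": addr_type,
    }
    # go-discover expects space-separated key=value pairs.
    return " ".join(f"{key}={value}" for key, value in parts.items())


# Region and tag taken from the log excerpt in this thread.
join_string = aws_retry_join("us-west-2", "Type", "consul-server")
print(join_string)
```

None of these options names a network interface, so the string can only ever resolve to the instance's first address of the chosen type.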

The instance metadata coming back from the AWS APIs does contain more information about individual network interfaces, so this is currently a limitation of go-discover. We certainly could make it work; it would just require updates to that library and then pulling those dependencies into Consul.
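To make the idea concrete, here is a minimal sketch of picking an address by interface device index from data shaped like one instance in a DescribeInstances response. It is not what go-discover does today: the helper `private_ip_for_device` is hypothetical, and the eth1 address in the sample is a made-up placeholder.

```python
def private_ip_for_device(instance, device_index):
    """Return the primary private IPv4 of the interface at device_index, or None."""
    for eni in instance.get("NetworkInterfaces", []):
        if eni.get("Attachment", {}).get("DeviceIndex") == device_index:
            return eni.get("PrivateIpAddress")
    return None


# Hand-written sample shaped like one instance from a DescribeInstances
# response; the eth1 address (10.99.0.50) is a hypothetical placeholder.
sample_instance = {
    "InstanceId": "i-0986f7648619efb2a",
    "PrivateIpAddress": "10.99.0.15",
    "NetworkInterfaces": [
        {"Attachment": {"DeviceIndex": 0}, "PrivateIpAddress": "10.99.0.15"},
        {"Attachment": {"DeviceIndex": 1}, "PrivateIpAddress": "10.99.0.50"},
    ],
}

eth1_ip = private_ip_for_device(sample_instance, 1)
print(eth1_ip)
```

Device index 0 corresponds to the primary interface whose address is reported as the instance's private IP, so a selector like this would have to be exposed as a new provider option.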

ltagliamonte-dd commented 4 years ago

Thank you @mkeeler. Would it be possible to describe ENIs based on a tag instead of EC2 instances? As for my last question about what HashiCorp recommends retaining, what do you suggest? I've also looked at the official CloudFormation template for installing Consul on AWS, and nothing (IP, EBS) is retained: https://github.com/aws-quickstart/quickstart-hashicorp-consul/blob/develop/templates/quickstart-hashicorp-consul.template
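A tag-based ENI lookup could be sketched like this. It assumes a boto3-style EC2 client is passed in and calls its `describe_network_interfaces` method with a `tag:<key>` filter; the helper name and the `FakeEC2` stand-in are hypothetical, the returned address is a placeholder, and pagination is ignored for brevity. go-discover does not offer this today.

```python
def eni_ips_by_tag(ec2_client, tag_key, tag_value):
    """List the primary private IPs of ENIs carrying the given tag."""
    response = ec2_client.describe_network_interfaces(
        Filters=[{"Name": f"tag:{tag_key}", "Values": [tag_value]}]
    )
    return [eni["PrivateIpAddress"] for eni in response["NetworkInterfaces"]]


class FakeEC2:
    """Stand-in for a boto3 EC2 client so the sketch runs without AWS access."""

    def describe_network_interfaces(self, Filters):
        # Record the filters so the caller can inspect what was requested.
        self.filters = Filters
        return {"NetworkInterfaces": [{"PrivateIpAddress": "10.99.0.50"}]}


fake = FakeEC2()
ips = eni_ips_by_tag(fake, "Type", "consul-server")
print(ips)
```

With a real client, the same function would be called as `eni_ips_by_tag(boto3.client("ec2"), "Type", "consul-server")`, provided the ENIs themselves carry the tag.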

ltagliamonte-dd commented 4 years ago

@mkeeler any updates on this issue?