hashicorp / terraform-aws-consul

A Terraform Module for how to run Consul on AWS using Terraform and Packer
Apache License 2.0
401 stars 484 forks source link

Added Support for EC2 Spot Instances #45

Closed clovis818 closed 6 years ago

brikis98 commented 6 years ago

Thanks for the PR! Code changes look good. How did you test this?

clovis818 commented 6 years ago

I haven't tested this out yet, but based on Terraform aws_launch_configuration its supported. https://www.terraform.io/docs/providers/aws/r/launch_configuration.html#spot_price

brikis98 commented 6 years ago

Yea, it looks reasonable, but we can't accept any PRs without tests, and we can't run tests for external PRs automatically for security reasons. If you could fire up a cluster with/without spot pricing and use the helper script to check everything still runs, I'd be grateful!

clovis818 commented 6 years ago

Fixed a few bugs and Tested both consul with and without spot pricing.

  1. Added a 30 second sleep to user-data-client.sh and user-data-server.sh The aws_autoscaling_group did not propagate tags until after the servers launched, which caused run-consul to generate multiple configs with "bootstrap_expect":"1".

  2. The launch_configuration tenancy has to be be set to null for spot instances. AWS will throw an error if its set to "default". The VPC Tenancy will be used when its null. https://docs.aws.amazon.com/autoscaling/ec2/userguide/asg-in-vpc.html#as-vpc-tenancy

brikis98 commented 6 years ago

Added a 30 second sleep to user-data-client.sh and user-data-server.sh The aws_autoscaling_group did not propagate tags until after the servers launched, which caused run-consul to generate multiple configs with "bootstrap_expect":"1".

Hmm, I have not seen this issue before. Is this something you hit specifically with spot instances?

Otherwise, I'm not sure why you would get "bootstrap_expect" : 1, as we are looking up the ASG's DesiredCapacity, not the actual number of servers running in it.

The launch_configuration tenancy has to be be set to null for spot instances. AWS will throw an error if its set to "default". The VPC Tenancy will be used when its null.

Ah, right, good catch. This is why we ask to test :)

clovis818 commented 6 years ago

When the servers launch, tags are not applied for a second or two, which can cause a race condition with run-consul. Its has happened to me several times.

get_instance_tags: https://github.com/hashicorp/terraform-aws-consul/blob/master/modules/run-consul/run-consul#L108

get_asg_size: https://github.com/hashicorp/terraform-aws-consul/blob/master/modules/run-consul/run-consul#L111

get_cluster_size: https://github.com/hashicorp/terraform-aws-consul/blob/master/modules/run-consul/run-consul#L126

brikis98 commented 6 years ago

Ah, I see. Are you seeing this warning in your logs? https://github.com/hashicorp/terraform-aws-consul/blob/master/modules/run-consul/run-consul#L128

Also, if this is indeed the race condition, where the instance looks up its own tags before they have propagated, then the fix is not a random sleep, but to fetch the tags in a retry loop until the ASG tag is found (or some max timeout is reached).

clovis818 commented 6 years ago

How about instead of the 30 second sleep using: aws ec2 wait instance-status-ok https://docs.aws.amazon.com/cli/latest/reference/ec2/wait/instance-status-ok.html aws ec2 wait system-status-ok https://docs.aws.amazon.com/cli/latest/reference/ec2/wait/system-status-ok.html

brikis98 commented 6 years ago

I'm not sure either of those implies tags have propagated, so a for-loop to fetch the tags until they are present, with a sleep in between, is probably safer.

clovis818 commented 6 years ago

Okay added the for-loop to fetch tags.

Server Logs: https://pastebin.com/LY5irrU3

Client Logs: https://pastebin.com/RfMw8Gcz

brikis98 commented 6 years ago

https://github.com/hashicorp/terraform-aws-consul/releases/tag/v0.2.1