Closed clovis818 closed 6 years ago
I haven't tested this out yet, but based on Terraform aws_launch_configuration its supported. https://www.terraform.io/docs/providers/aws/r/launch_configuration.html#spot_price
Yea, it looks reasonable, but we can't accept any PRs without tests, and we can't run tests for external PRs automatically for security reasons. If you could fire up a cluster with/without spot pricing and use the helper script to check everything still runs, I'd be grateful!
Fixed a few bugs and Tested both consul with and without spot pricing.
Added a 30 second sleep to user-data-client.sh and user-data-server.sh The aws_autoscaling_group did not propagate tags until after the servers launched, which caused run-consul to generate multiple configs with "bootstrap_expect":"1".
The launch_configuration tenancy has to be be set to null for spot instances. AWS will throw an error if its set to "default". The VPC Tenancy will be used when its null. https://docs.aws.amazon.com/autoscaling/ec2/userguide/asg-in-vpc.html#as-vpc-tenancy
Added a 30 second sleep to user-data-client.sh and user-data-server.sh The aws_autoscaling_group did not propagate tags until after the servers launched, which caused run-consul to generate multiple configs with "bootstrap_expect":"1".
Hmm, I have not seen this issue before. Is this something you hit specifically with spot instances?
Otherwise, I'm not sure why you would get "bootstrap_expect" : 1
, as we are looking up the ASG's DesiredCapacity, not the actual number of servers running in it.
The launch_configuration tenancy has to be be set to null for spot instances. AWS will throw an error if its set to "default". The VPC Tenancy will be used when its null.
Ah, right, good catch. This is why we ask to test :)
When the servers launch, tags are not applied for a second or two, which can cause a race condition with run-consul. Its has happened to me several times.
get_instance_tags: https://github.com/hashicorp/terraform-aws-consul/blob/master/modules/run-consul/run-consul#L108
get_asg_size: https://github.com/hashicorp/terraform-aws-consul/blob/master/modules/run-consul/run-consul#L111
get_cluster_size: https://github.com/hashicorp/terraform-aws-consul/blob/master/modules/run-consul/run-consul#L126
Ah, I see. Are you seeing this warning in your logs? https://github.com/hashicorp/terraform-aws-consul/blob/master/modules/run-consul/run-consul#L128
Also, if this is indeed the race condition, where the instance looks up its own tags before they have propagated, then the fix is not a random sleep, but to fetch the tags in a retry loop until the ASG tag is found (or some max timeout is reached).
How about instead of the 30 second sleep using:
aws ec2 wait instance-status-ok
https://docs.aws.amazon.com/cli/latest/reference/ec2/wait/instance-status-ok.html
aws ec2 wait system-status-ok
https://docs.aws.amazon.com/cli/latest/reference/ec2/wait/system-status-ok.html
I'm not sure either of those implies tags have propagated, so a for-loop to fetch the tags until they are present, with a sleep in between, is probably safer.
Okay added the for-loop to fetch tags.
Server Logs: https://pastebin.com/LY5irrU3
Client Logs: https://pastebin.com/RfMw8Gcz
Thanks for the PR! Code changes look good. How did you test this?