hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/

aws: Allow rolling updates for ASGs #1552

Closed · radeksimko closed this issue 7 years ago

radeksimko commented 9 years ago

Once #1109 is fixed, I'd like to be able to use Terraform to actually roll out the updated launch configuration and do it carefully.

Anyone who decides not to roll out the update, and instead only change the LC association, should still be able to do so.

Here's an example from CloudFormation: http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatepolicy.html

How about something like this?

resource "aws_autoscaling_group" "test" {
  rolling_update_policy {
    max_batch_size = 1
    min_instances_in_service = 2
    pause_time = "PT0S"
    suspend_processes = ["launch", "terminate"]
    wait_on_resource_signals = true
  }
}

Then, if such a policy is defined, Terraform can use the Auto Scaling API to shut down each EC2 instance separately and let the ASG spin up a replacement with the updated LC.

rvangundy commented 8 years ago

@brikis98 Did you manage to get around the issue related to #1109? I'm getting Cannot delete launch configuration <my-lc-name> because it is attached to <my-asg>. The LC update still manages to apply; it just can't delete the previous one, so I encounter an error each time...

brikis98 commented 8 years ago

@rvangundy Do you have create_before_destroy = true on your LC?
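
A minimal sketch of that setting, with illustrative names and AMI; using name_prefix instead of a fixed name lets Terraform create the replacement LC before destroying the old one without a naming collision:

resource "aws_launch_configuration" "lc" {
  name_prefix   = "my-lc-"        # avoids name clashes during create-before-destroy
  image_id      = "ami-12345678"  # illustrative AMI ID
  instance_type = "t2.micro"

  lifecycle {
    create_before_destroy = true
  }
}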

rvangundy commented 8 years ago

@brikis98 Yeah, I sure do. What's weird is that while the CloudFormation stack is being modified, Terraform is actively trying to destroy the previous LC at the same time:

module.my_service.aws_launch_configuration.lc: Creation complete
module.my_service.aws_launch_configuration.lc: Destroying...
module.my_service.aws_launch_configuration.lc: Still destroying... (10s elapsed)
module.my_service.rolling_autoscaling_group.aws_cloudformation_stack.autoscaling_group: Still modifying... (10s elapsed)
module.my_service.aws_launch_configuration.lc: Still destroying... (18s elapsed)
module.my_service.aws_launch_configuration.lc: Still destroying... (20s elapsed)
module.my_service.rolling_autoscaling_group.aws_cloudformation_stack.autoscaling_group: Still modifying... (20s elapsed)
module.my_service.aws_launch_configuration.lc: Still destroying... (28s elapsed)
module.my_service.aws_launch_configuration.lc: Still destroying... (30s elapsed)

Note also that I've encapsulated my CloudFormation+ASG configuration in a module, so I may try to pull that out next...

brikis98 commented 8 years ago

@rvangundy Hmm, not sure. If you post your code, I can see if there are any obvious differences, but for the most part, the ASG setup with CloudFormation has worked reasonably well for me.

rvangundy commented 8 years ago

@brikis98 Fixed it! It was because I had externalized my CloudFormation template into a template_file resource, which decoupled the LC from the CloudFormation stack in the dependency graph. I inlined it using <<EOF...EOF and that did the trick.
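
A sketch of the difference, with illustrative names: interpolating the LC directly inside an inline heredoc gives Terraform an implicit dependency edge between the LC and the stack, which an externalized template_file hides:

resource "aws_cloudformation_stack" "autoscaling_group" {
  name = "my-asg-stack"

  # Referencing aws_launch_configuration.lc.name here ties the stack to
  # the LC in the dependency graph, so create/destroy ordering works out.
  template_body = <<EOF
{
  "Resources": {
    "MyAsg": {
      "Type": "AWS::AutoScaling::AutoScalingGroup",
      "Properties": {
        "LaunchConfigurationName": "${aws_launch_configuration.lc.name}",
        "AvailabilityZones": ["us-east-1a"],
        "MinSize": "2",
        "MaxSize": "4"
      }
    }
  }
}
EOF
}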

brikis98 commented 8 years ago

@rvangundy Ah, I have mine inlined with a heredoc, which is probably why I didn't hit that issue. Good to know, thanks!

shuoy commented 8 years ago

Regarding "Immagine a 10 nodes db cluster with 10TB of data, spinning up a complete new cluster will cause the full resync of the 10TB of data from the old cluster to the new cluster all at once, this might saturate the network link and cause denial of service, where maybe all you wanted was to increase the number of connections."

@BrunoBonacci, we are facing the same situation with rolling updates. Imagine we want to bump the version of the software running on a data node; we need that kind of "in-place" rolling update. It looks like rolling updates with TF are not going to get us there. Maybe we should consider something like Ansible to deal with that?

BrunoBonacci commented 8 years ago

@shuoy Certainly Ansible is a way to "script" this behaviour, but the idea of using Terraform (at least for me) is to replace previous scripting tools such as Chef, Puppet, Ansible and so on.

I've seen different approaches to rolling updates around. Kubernetes lets you set a grace period to wait between the update of one machine and the next. This could certainly solve some of the issues; however, it would only work for quite short grace periods. If you have 10TB to replicate in a cloud environment it could take a while, especially if you throttle the speed to avoid network saturation. Having a grace period of 10-20 hours wouldn't be reasonable.

I think the right approach would be to provide an attribute that chooses whether the rolling update is performed automatically (based on a grace period) or in stages.

something like:

## just a suggestion
rolling_update = "auto"
grace_period = "3m"

This would wait 3 minutes between one successful update and the next.

The staged rolling update could work as follows:

## just a suggestion
rolling_update = "staged"
instances_updated = 1

In this case Terraform would look at an ASG of, for example, 10 machines, and update just one. The operator would then wait for the new instance to be fully in service before bumping instances_updated = 2 and repeating terraform apply. In this way the rolling update could be performed in stages, giving the app enough time to replicate the necessary data and become fully operational. Once all 10 instances have been updated, the update can be considered successful.

Again, this is just a suggestion to allow a declarative (non-scripted) approach to rolling updates, one which would work for stateful clusters (such as databases).

shuoy commented 8 years ago

@BrunoBonacci "If you have 10TB to replicate in a cloud environment it could take a while, specially if you throttle the speed to avoid network saturation. Having a grace time of 10-20 hours wouldn't be reasonable." -- My point is this. For a node that following into this category, Chef/Ansible is exactly the tool you will be using to have an "in-place update way"; the destroying the old node and bringing up a new node (immutable way) is exactly what you want to avoid. -- In the in-place-update way (chef/ansible), you basically apt-get update your mysqld (which take 1-2 minutes); yet the "destroy old and bring a new one way" (immutable way) takes hours unless you are using EBS volume. But I think you meant to say non-EBS way, or maybeI missed your point above.

Is there a scenario where a large volume of data resides on ephemeral disk? Yes; e.g., for people using EC2 as their Hadoop cluster capacity, data is saved on ephemeral disk from a cost-saving perspective (EBS adds extra cost).

So, in short, I think Terraform is great in certain scenarios, but it's not accurate to say that Chef/Ansible can be totally replaced, particularly in the update use case for stateful nodes.

jdoss commented 8 years ago

@brikis98 and others who are using the output example posted here: it seems not to work if you upgrade to 0.7.0-rc2. Here is the error it kicks out:

* Resource 'aws_cloudformation_stack.web_autoscaling_group' does not have attribute 'outputs.AsgName' for variable 'aws_cloudformation_stack.web_autoscaling_group.outputs.AsgName'

I am still trying to get outputs working, but if anyone has any advice on how to get this working again with 0.7.x, that would be awesome.

brikis98 commented 8 years ago

@jdoss You may want to file a separate bug against the 0.7 release for that. The changelog mentions that terraform_remote_state no longer uses the .output map. Perhaps they changed something similar on aws_cloudformation_stack but didn't document it?

phinze commented 8 years ago

@jdoss At first glance, it looks like that reference is using the pre-0.7 map dot-index notation. In 0.7, maps are indexed like aws_instance.tag["sometag"].

So try changing this line to use square bracket indexing and see if that fixes it for you!

  autoscaling_group_name = "${aws_cloudformation_stack.autoscaling_group.outputs["AsgName"]}"

If that doesn't work I'd welcome a fresh bug report and we can dig in. 👍
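
For completeness, a sketch (with illustrative names) of where that outputs map comes from: an Outputs block in the stack's template_body is what populates it.

  "Outputs": {
    "AsgName": {
      "Description": "Name of the auto scaling group",
      "Value": { "Ref": "MyAsg" }
    }
  }

With that in the template, the square-bracket reference above resolves the value under 0.7.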

jdoss commented 8 years ago

@phinze You da best! That was the issue. :smile:

moofish32 commented 8 years ago

@rvangundy -- did you keep the lifecycle hook create_before_destroy = true or remove it? I think I am thrashing between this error and https://github.com/hashicorp/terraform/issues/3916

rvangundy commented 8 years ago

@moofish32 I have create_before_destroy=true on my aws_launch_configuration

antonbabenko commented 8 years ago

@brikis98 - Thank you for the piece of CloudFormation code which does rolling deployments. It does not work for me because I use spot instances in my launch configuration, and when I specify MinInstancesInService greater than zero (because I really don't want to terminate all instances before rolling), I get this error from AWS:

Autoscaling rolling updates cannot be performed because the current launch configuration is using spot instances and MinInstancesInService is greater than zero.

I wonder if anyone has ideas on how to implement an ASG rolling update for spot instances with fewer moving parts?

My initial idea was to make a shell script which would work similarly to aws-ha-release.

I prefer to use just Terraform and CloudFormation and avoid implementing orchestration magic.

UPD: Having MinInstancesInService=0 and MaxBatchSize=1 (which is smaller than the number of instances in the ASG) helped. I am happy.
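
For anyone else hitting this with spot instances, a sketch of the UpdatePolicy matching that UPD; it sits on the ASG resource inside the stack's template_body, as a sibling of Properties (the PauseTime value here is illustrative; the other two values are the ones reported to work above):

  "UpdatePolicy": {
    "AutoScalingRollingUpdate": {
      "MinInstancesInService": "0",
      "MaxBatchSize": "1",
      "PauseTime": "PT1M"
    }
  }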

iadknet commented 7 years ago

+1

shaharmor commented 7 years ago

Is there any plan to implement this in the near/far future?

james-gonzalez commented 7 years ago

+1

w32-blaster commented 7 years ago

+1

kenoir commented 7 years ago

+1

sthulb commented 7 years ago

~@phinze How about implementing this as a data source? It doesn't seem like the worst way of implementing it given the current code structure.~

I immediately regret this suggestion.

I don't mind implementing this feature, but I'm kind of at a loss on how to represent it in HCL. As the comment above would suggest, I thought about using the data source format and linking it back to the ASG resource (something like UpdatePolicy = "${data.aws_asg_policy.foo}"); this concept doesn't seem horrid, but it leaves room for interpretation.

Please advise on the most idiomatic way this could be represented in HCL.

prog893 commented 7 years ago

+1

andrleite commented 7 years ago

+1

sudochop commented 7 years ago

+1

addisonbair commented 7 years ago

+1

radeksimko commented 7 years ago

Hi folks, we do appreciate the +1's, but only if they don't generate notifications. 😅

Therefore it's more helpful for everyone to use reactions, as we can then sort issues by the number of 👍: https://github.com/hashicorp/terraform/issues?q=is%3Aissue+is%3Aopen+sort%3Areactions-%2B1-desc+label%3Aprovider%2Faws

The 👍 reactions do count, and we're more than happy for people to use them; we prefer them over "+1" comments for the reasons mentioned.

Thanks.

w32-blaster commented 7 years ago

👍 :trollface:

apparentlymart commented 7 years ago

Hi everyone,

While this sort of sophisticated behavior would definitely be useful, we (the Terraform team) don't have any short-term plans to implement this ourselves since we're generally recommending that people with these needs consider other solutions such as EC2 Container Service and Nomad, which either have or are more likely to grow sophisticated mechanisms for safe rollout and are in a better position to do so due to the ability to manage such multi-step state transitions.

We're trying to prune stale feature requests (that aren't likely to be addressed soon) by closing them. In this case we're currently leaning towards not implementing significant additional behaviors on top of what the EC2 API natively supports, so I'm going to close this.

radeksimko commented 7 years ago

After chatting with @apparentlymart privately I just want to add a few more things.

We do not suggest everyone should use containers (nor that containers solve the problem entirely), and for those who prefer not to, there's a workaround: you can use aws_cloudformation_stack in a similar way to what the folks at Travis did.
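
For readers who want that workaround in one place, here is a hedged sketch combining the pieces discussed earlier in this thread (the names, sizes, AMI, and PauseTime are all illustrative; if you run spot instances, MinInstancesInService must be 0, per the comment above):

resource "aws_launch_configuration" "lc" {
  name_prefix   = "my-lc-"
  image_id      = "ami-12345678"
  instance_type = "t2.micro"

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_cloudformation_stack" "autoscaling_group" {
  name = "my-asg-stack"

  # Inline heredoc (not template_file) so the LC reference keeps the
  # dependency graph intact, as noted earlier in this thread.
  template_body = <<EOF
{
  "Resources": {
    "MyAsg": {
      "Type": "AWS::AutoScaling::AutoScalingGroup",
      "Properties": {
        "LaunchConfigurationName": "${aws_launch_configuration.lc.name}",
        "AvailabilityZones": ["us-east-1a"],
        "MinSize": "2",
        "MaxSize": "4"
      },
      "UpdatePolicy": {
        "AutoScalingRollingUpdate": {
          "MinInstancesInService": "1",
          "MaxBatchSize": "1",
          "PauseTime": "PT5M"
        }
      }
    }
  },
  "Outputs": {
    "AsgName": { "Value": { "Ref": "MyAsg" } }
  }
}
EOF
}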

Also, I'm tracking this issue in my own TODO list so I don't forget about it. I personally want to get this done, but it's currently pretty low on my list of priorities. PRs are certainly welcome.