apparentlymart commented 8 years ago

For a while now I've been wringing my hands over the issue of using computed resource properties in parts of the Terraform config that are needed during the refresh and apply phases, where the values are likely to not be known yet.

The two primary situations that I and others have run into are:

Interpolating into provider configuration blocks, as I described in #2976. This is allowed by Terraform but fails in unintuitive ways when a chicken-and-egg problem arises.
Interpolating into the count modifier on resource blocks, as described in #1497. Currently this permits only variables, but having it configurable from resource attributes would be desirable.

After a number of false-starts trying to find a way to make this work better in Terraform, I believe I've found a design that builds on concepts already present in Terraform, and that makes only small changes to the Terraform workflow. I arrived at this solution by "paving the cowpaths" after watching my coworkers and I work around the issue in various ways.

The crux of the proposal is to alter Terraform's workflow to support the idea of partial application, allowing Terraform to apply a complicated configuration over several passes and converging on the desired configuration. So from the user's perspective, it would look something like this:

$ terraform plan -out=tfplan
... (yada yada yada) ...

Terraform is not able to apply this configuration in a single step. The plan above
will partially apply the configuration, after which you should run "terraform plan"
again to plan the next set of changes to converge on the given configuration.

$ terraform apply tfplan
... (yada yada yada) ...

Terraform has only partially-applied the given configuration. To converge on
the final result, run "terraform plan" again to plan the next set of changes.

$ terraform plan -out=tfplan
... (yada yada yada) ...

$ terraform apply
... (yada yada yada) ...

Success! ....

For a particularly-complicated configuration there may be three or more apply/plan cycles, but eventually the configuration should converge.

terraform apply would also exit with a predictable exit status in the "partial success" case, so that Atlas can implement a smooth workflow where e.g. it could immediately plan the next step and repeat the sign-off/apply process as many times as necessary.

This workflow is intended to embrace the existing workaround of using the -target argument to force Terraform to apply only a subset of the config, but improve it by having Terraform itself detect the situation. Terraform can then calculate itself which resources to target to plan for the maximal subset of the graph that can be applied in a single action, rather than requiring the operator to figure this out.

By teaching Terraform to identify the problem and propose a solution itself, Terraform can guide new users through the application of trickier configurations, rather than requiring users to either have deep understanding of the configurations they are applying (so that they can target the appropriate resources to resolve the chicken-and-egg situation), or requiring infrastructures to be accompanied with elaborate documentation describing which resources to target in which order.

Implementation Details

The proposed implementation builds on the existing concept of "computed" values within interpolations, and introduces the new idea of a graph nodes being "deferred" during the plan phase.

Deferred Providers and Resources

A graph node is flagged as deferred if any value it needs for refresh or plan is flagged as "computed" after interpolation. For example:

A provider is deferred if any of its configuration block arguments are computed.
A resource is deferred if its count value is computed.

Most importantly though, a graph node is always deferred if any of its dependencies are deferred. "Deferred-ness" propagates transitively so that, for example, any resource that belongs to a deferred provider is itself deferred.

After the graph walk for planning, the set of all deferred nodes is included in the plan. A partial plan is therefore signaled by the deferred node set being non-empty.

Partial Application

When terraform apply is given a partial plan, it applies all of the diffs that are included in the plan and then prints a message to inform the user that it was partial before exiting with a non-successful status.

Aside from the different rendering in the UI, applying a partial plan proceeds and terminates just like an error occured on one of the resource operations: the state is updated to reflect what was applied, and then Terraform exits with a nonzero status.

Progressive Runs

No additional state is required to keep track of partial application between runs. Since the state is already resource-oriented, a subsequent refresh will apply to the subset of resources that have already been created, and then plan will find that several "new" resources are present in the configuration, which can be planned as normal. The new resources created by the partial application will cause the set of deferred nodes to shrink -- possibly to empty -- on the follow-up run.

Building on this Idea

The write-up above considers the specific use-cases of computed provider configurations and computed "count". In addition to these, this new concept enables or interacts with some other ideas:

3310 proposed one design for supporting "iteration" -- or, more accurately, "fan out" -- to generate a set of resource instances based on data obtained elsewhere. This proposal enables a simpler model where foreach could iterate over arbitrary resource globs or collections within resource attributes, without introducing a new "generator" concept, by deferring the planning of the multiple resource instances until the collection has been computed.
2976 proposed the idea of allowing certain resources to be refreshed immediately, before they've been created, to allow them to exist during the initial plan. Partial planning reduces the need for this, but supporting pre-refreshed resources would still be valuable to skip an iteration just to, for example, look up a Consul key to configure a provider.
2896 talks about rolling updates to sets of resources. This is not directly supported by the above, since it requires human intervention to describe the updates that are required, but the UX of running multiple plan/apply cycles to converge could be used for rolling updates too.
The cycles that result when mixing create_before_destroy with not, as documented in #2944, could get a better UX by adding some more cases where nodes are "deferred" such that the "destroy" node for the deposed resource can be deferred to a separate run from the "create" that deposed it.
1819 considers allowing the provider attribute on resources to be interpolated. It's mainly concerned with interpolating from variables rather than resource attributes, but the partial plan idea allows interpolation to be supported more broadly without special exceptions like "only variables are allowed here", and so it may become easier to implement interpolation of provider.
4084 requests "intermediate variables", where computed values can be given a symbolic name that can then be used in multiple places within the configuration. One way to support this would be to allow variable defaults to be interpolated and mark the variables themselves as "deferred" when their values are computed, though certainly other implementations are possible.

bryantbiggs commented 3 years ago

@sharkymcdongles do you contribute under a different github handle, because your profile doesn't show any contributions to Terraform, providers, nor modules?

sharkymcdongles commented 3 years ago

@sharkymcdongles do you contribute under a different github handle, because your profile doesn't show any contributions to Terraform, providers, nor modules?

This is my anonymous personal GitHub account not linked to my real name. I won't be sharing my real name sorry.

aequitas commented 3 years ago

@sharkymcdongles from what I can tell from the contributors (https://github.com/hashicorp/terraform/graphs/contributors) it's mostly Hashicorp employees who maintain Terraform. And providers are mostly maintained by other companies (for their own platform) with their own financial interest or people that need to scratch their own itch.

That Hashicorp is making money on this makes no difference. If you where the one paying the money to Hashicorp, you could argue they should prioritise a feature that benefits you. But then don't go 'nagging' on a technical forum like Github Issues and bother all people who are following this thread for the next update unless you have something useful to contribute. Just contact your sales person to sort it out.

I want this feature just as much as anyone, but adding a new feature might not be as easy as it looks in the first place. All kinds of complexities could pop up and make the product worse as a whole (new bugs, weird undocumented behaviour, steep learning curve for coworkers, etc). Just a technical implementation is often not enough.

blaggacao commented 3 years ago

Nix, a lazy, functional config language, can solve this use case quite well. If things are deterministic, no problem since its lazily evaluated all the way down. If things are pseudo-deterministic (config data fetched from existing exogenous source), one can relatively cheaply implement a config fetcher that runs at evaluation time using any cli or any other mean of communications.

If things are truely undeterministic (state, usually a host identity, itself is ad-hoc generated by an external system), that is indeed a bad problem, and maybe there are ways to make e.g. host identity deterministic (eg. using yggdrasil, which derives ipv6 addresses off of cryptographic identities, that can be made deterministically known). If making it deterministic is not an option, because of unwise upstream platfrom design (probably most cases), one can use import from derivation which is an escape hatch for those bad, bad, bad non-deterministic cases that also this issue tries to solve: it basically shifts all but the final deployment to the evaluation phase and keeps working with whatever is the result. There is terranix which wraps terraform. Folks, have a look!

Disclaimer: I don't work for anybody, I just find nix a very helpful addition to the problem.

danieldreier commented 3 years ago

Hi! I’m the Engineering Manager for the Terraform Core team. There are a number of areas where Terraform needs to evolve, and this issue is one of them. We recently tried to figure out a straightforward way to improve this, discovered it is likely to be a significant project, and concluded that we should prioritize other improvements to Terraform that we believe will help more users. While we do intend to make improvements in this area, it is not our immediate next focus area.

stevehipwell commented 3 years ago

@danieldreier this is a major blocker for the "correct" use of Terraform for infrastructure, specifically in the kubernetes-alpha provider where it's almost more common not to have the credentials needed for an apply than to have them. Until this is supported the null_resource will be the only way to handle these common cases.

ketzacoatl commented 3 years ago

@danieldreier thanks for the info/update, it's always nice to hear about the progress from an official source.

We recently tried to figure out a straightforward way to improve this, discovered it is likely to be a significant project, and concluded that we should prioritize other improvements to Terraform that we believe will help more users.

Can you clarify... if today I can relatively easily use -target to assemble/create "portions" of the graph to plan/apply less than all of a project, why does it have to be a significant effor to improve upon that filtering? Is there some other refactoring the team would prefer to do first to make this easier? Is there really no way to help improve in the meantime? Maybe we could make it easier to assemble a list of targets that could be targeted?

The impact of deferring this is significant.. Sizeable projects of a few hundred resources are SLOW to refresh/plan and to work with in general. This pushes us to create smaller and smaller projects. Then we have dozens/hundreds of projects, which adds significant management overhead for the group of projects.

Example: you have a hashistack on AWS. So that's a Network layer, the hashistack layer, the app layer, and a whole bunch of support resources. Do you have them all in one project? You would lose your mind. Even with that infra broken up into a dozen projects, you still lose your mind (but for other reasons).

Terraform should assume that our biggest deployments need support/assistance due to the scale of those projects and real-world effects like latency reading remote state, etc. The community is asking for help from the core team, we need some innovative design here!

travis-crowder-kr commented 3 years ago

It definitely hurts credibility when you say that Terraform is meant to be able to apply and destroy all in one go (never saying that again) and then you have to say, well, because of this chicken and egg problem, you have to apply this and then add more code and apply again and this problem is persistent so each time you add a new X, you have to add a little bit, apply and then add more...

ketzacoatl commented 3 years ago

@CrowderKroger Just setup multiple projects..

travis-crowder-kr commented 3 years ago

@ketzacoatl -- then you run into needing to run one project before another than then even maybe the original project again.

maetthu commented 3 years ago

@ketzacoatl -- then you run into needing to run one project before another than then even maybe the original project again.

Running a dependent workspace can be automated in Terraform Cloud and I'm pretty sure this could be scripted pretty easily with local runs as well. If you need to run workspace A and then B and then A again sounds like a circular dependency, which could be solved with a different project structure like adding a workspace C. It sounds to me like you are trying to force your way onto the tool rather than designing it the way the tool is built for, or maybe terraform just doesn't suit your use case then. But having multiple smaller workspaces rather than one big monolith is best practice after all and works very well.

ketzacoatl commented 3 years ago

But having multiple smaller workspaces rather than one big monolith is best practice after all and works very well.

I've run both ideas out at scale for very large projects. With and without TF Cloud. All combinationso have failing points (TF Cloud included).

If you have one bit monolith that you started back with an older version of Terraform, it's a lot of work to break it into smaller projects. While I agree with you on ideals, you should also keep in mind that in real world projects, there can be many things at play (such as the age of the project and number of resources, or versions of terraform), and it's not always as easy as waving a wand and going from a big monolith to cleanly broken up small projects. Lastly, if I had to choose, I would want something in between, and I'd lean more towards fewer small projects than lots and lots of really small projects ... lots and lots of small projects is less fun than big monoliths and their issues.

sergei-ivanov commented 3 years ago

I don't have a problem with decomposing larger projects into smaller ones. I have a problem with the fact that the dreaded "count/for_each cannot be calculated until apply" error often forces splitting the projects along unnatural lines, only to appease Terraform. It forces you to learn all sorts of avoidance tactics, which make the configuration uglier and which would be unnecessary with progressive apply operation.

I would count this "feature" as a number one personal bugbear when working with Terraform. To be honest, I could easily live without any other improvements in Terraform for a year if it'd take the whole Terraform team a year to finally fix this issue. Please fix it, please.

travis-crowder-kr commented 3 years ago

@sergei-ivanov -- thank you, I couldn't have said it better without being down voted.

DanyC97 commented 3 years ago

Hi! I’m the Engineering Manager for the Terraform Core team. There are a number of areas where Terraform needs to evolve, and this issue is one of them. We recently tried to figure out a straightforward way to improve this, discovered it is likely to be a significant project, and concluded that we should prioritize other improvements to Terraform that we believe will help more users. While we do intend to make improvements in this area, it is not our immediate next focus area.

@danieldreier @apparentlymart @jbardin would appreciate if you could consider my below suggestion. (Note i'm not going to expand more than what has been already said in this long / old thread, hopefully the tech debt/ design gap vs Hashi business value will align in the near future)

For the love of GOD, could you please at least accept the idea of adding a note to the TF docs to make folks aware of the scenarios/ limitations. Doing so will save so many hours for folks like me who still read the docs ...

My issue started with a not so useful error caused by

Error: Could not connect to server: dial tcp 127.0.0.1:3306: connect: connection refused 99

I spent 4 days going down to so many dark rabbit holes, until i've bumped into the below issues:

https://github.com/hashicorp/terraform-provider-mysql/issues/2#issuecomment-365314762 which took me to
https://github.com/hashicorp/terraform/issues/18720 which took me here and it was only until i saw
https://github.com/hashicorp/terraform/issues/4149#issuecomment-733687379 when i realized what was causing it... (yes i've read every single comment from begining of this thread)

I only tried to implement a v basic feature of making the code more generic so i can deploy to various envs by having

module "mainvpc" {
  count = (var.env == "prod") ? 1 : 0
...

---Update---

Others suggested same thing few y back

crw commented 2 years ago

Should this issue be resolved, please check #2253 to see if it is resolved by the resolution to this issue.

aequitas commented 2 years ago

I think this issue is broader, especially regarding the behaviour of Terraform to automate away what needs to be included/excluded.

For OP:

This workflow is intended to embrace the existing workaround of using the -target argument to force Terraform to apply only a subset of the config, but improve it by having Terraform itself detect the situation. Terraform can then calculate itself which resources to target to plan for the maximal subset of the graph that can be applied in a single action, rather than requiring the operator to figure this out.

apparentlymart commented 2 years ago

Hi all! It's been a long time.

I originally opened this issue quite some time before I joined HashiCorp to work on Terraform full-time, and although the underlying problem statement of this issue remains valid, the exact details I described here have become less relevant to modern Terraform over time and so it's been clear to us that we will need to take a fresh start at designing it, taking into account more recent changes to the way providers are developed, the better handling of unknown values in Terraform v0.12 and later, the introduction of data sources in the meantime, and various other situations that are clearer to the Terraform team today than they were to me as an external contributor back in 2015.

With that in mind, I've decided to close this issue and replace it with one that represents just the problem to be solved and not yet any specific solution to it. My hope is that we'll use that new issue to discuss the relevant constraints and challenges and eventually reach a new proposal that makes sense for Terraform as it exists today, which may or may not be similar to what I mocked up in this older issue.

The new issue is #30937. If you're interested in following along with or participating in that discussion, please move your issue subscription over to that issue instead. I'm going to lock this one just to avoid continued additions to this issue and thus a fragmented discussion.

Thanks for the discussion here so far! The history of this issue isn't going anywhere, so we'll still be able to take into account the existing feedback as we consider possible approaches to solve this problem.

hashicorp / terraform

Partial/Progressive Configuration Changes #4149

Implementation Details

Deferred Providers and Resources

Partial Application

Progressive Runs

Building on this Idea

2896 talks about rolling updates to sets of resources. This is not directly supported by the above, since it requires human intervention to describe the updates that are required, but the UX of running multiple `plan`/`apply` cycles to converge could be used for rolling updates too.

hashicorp / terraform

Partial/Progressive Configuration Changes #4149

Implementation Details

Deferred Providers and Resources

Partial Application

Progressive Runs

Building on this Idea

2896 talks about rolling updates to sets of resources. This is not directly supported by the above, since it requires human intervention to describe the updates that are required, but the UX of running multiple plan/apply cycles to converge could be used for rolling updates too.

2896 talks about rolling updates to sets of resources. This is not directly supported by the above, since it requires human intervention to describe the updates that are required, but the UX of running multiple `plan`/`apply` cycles to converge could be used for rolling updates too.