hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/

Track dependencies on a per-instance basis and so allow applying a subset of instances when an upstream fails #32148

Open archoversight opened 1 year ago

archoversight commented 1 year ago

Terraform Version

latest - feature doesn't exist

Use Cases

Continue applying resources even if some other non-dependent resource failed.

Use case:

I was making some changes to some EBS volumes and ran terraform apply. I then made additional changes, but ran into the 6 hour cool-down period for modifying EBS volumes. In the same terraform apply I was also attempting to add some instances to a load balancer target group, but those changes never got executed because the EBS failure caused terraform apply to quit early.

I would love for Terraform to keep making changes and bring the state as close as possible to what is defined in the configuration. The EBS volume changes are completely unrelated to adding instances to a load balancer target group.

Attempted Solutions

None; instead I judiciously used -target to get the changes I needed applied.
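For example (the resource addresses below are made up, just to show the shape of the command):

# Apply only the target group attachments and skip the EBS work entirely;
# instance addresses are quoted so the shell leaves the brackets alone.
terraform apply \
  -target='aws_lb_target_group_attachment.app["web-1"]' \
  -target='aws_lb_target_group_attachment.app["web-2"]'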

Proposal

Add a flag, or some other way, to tell Terraform to continue deploying resources and making changes that are not direct downstream dependents of the failed changes, e.g. -proceed-on-failure or -exit-early=false.

Additions to a load balancer target group should still go through even if modifying some EBS volumes in an unrelated part of the configuration fails.

References

No response

jbardin commented 1 year ago

Hi @archoversight,

Thanks for filing the issue. Terraform already does what you describe, or at least that is the intended behavior. Can you show an example where non-dependent resources are skipped after an unrelated apply failure?

Thanks!

apparentlymart commented 1 year ago

I think what you're getting at here is that once one node in the graph generates an error, Terraform will not begin processing any other node that wasn't already executing. So in that sense Terraform does exit early, but can also potentially behave as you described if you're lucky enough that the unrelated resources happened to get started applying before Terraform encountered an error.

The way I understand this request, then, is to have Terraform not abort the walk early, and instead behave as if all actions downstream of the failure had been excluded by -target. Specifically, if there are any actions that don't directly or indirectly depend on the one that failed then they would still run to completion.

The main reason we don't do that already is that we assume that once there's any kind of error the operator would rather Terraform halt its work and give them a chance to investigate and hopefully remedy the error before trying again. Correctness-wise there isn't any reason Terraform shouldn't continue processing other actions that aren't downstream of the error -- that is already a possible outcome if the concurrent graph walk coincidentally happens in just the right order anyway -- but doing it by default would probably encourage people to send Terraform SIGINT and, if done too carelessly, possibly cause Terraform to terminate ungracefully and make the situation even worse.

However, we don't typically like to add lots of different knobs and settings to Terraform because Terraform is already a very complicated machine to document, maintain, and test. It makes me nervous to have this behavior be configurable and then have a small minority of people using Terraform in an unusual way that we may forget to consider under future changes.

Instead, I'd prefer to use this issue to consider whether there's some way we can tweak the default behavior to achieve a better compromise. I'm not sure yet exactly what to propose in that vein, but hopefully we can use this issue to collect various examples of this need and then see if they inspire an answer that can be implemented by somehow changing what Terraform does by default, rather than by adding yet another possible variant execution mode.

In the meantime, I think -target is the best option for dealing with this sort of exceptional case. If we eventually do something like what's described in #2253, then perhaps that would be a sufficient answer here since it would allow just naming the one resource instance that's currently causing a problem, rather than having to do the opposite and name all of the resource instances that aren't causing a problem. (Though that request still has the same concern as above: it's another weirdo extra execution mode that can impact the design and implementation of many other future features in Terraform.)
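Purely to illustrate that idea, and not as existing syntax, such an exclusion flag might be used something like this, naming only the instance that is currently stuck:

# Hypothetical, not a real flag today: skip just the failing instance and
# let everything else in the plan proceed as normal.
terraform apply -exclude='aws_ebs_volume.data["db-1"]'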

archoversight commented 1 year ago

The downside is that -target does not allow me to use .* or [*] syntax, so trying to make that work with large for_each blocks that call modules is a giant pain. (See https://github.com/hashicorp/terraform/issues/30350 for a feature request someone else made related to this)
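Concretely, today each instance has to be spelled out by hand (module and key names below are made up), and the wildcard form that would make this bearable is exactly what isn't supported:

# What -target requires today: one flag per instance key.
terraform apply \
  -target='module.vpc_attachment["vpc-a"]' \
  -target='module.vpc_attachment["vpc-b"]' \
  -target='module.vpc_attachment["vpc-c"]'

# What #30350 asks for, which does not work:
# terraform apply -target='module.vpc_attachment[*]'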

Second, while this quit-early behavior may be what hands-on-keyboard operators want, it is frustrating in CI/CD pipelines, where there is no chance to go back, hit the up arrow, try again, or add -target.

CI/CD might retry and might get lucky that the next set of changes applies, but the graph seems to be walked deterministically, so a CI/CD pipeline that runs terraform apply in a loop (say, 10 times) never reaches a steady state where, for example, only the EBS changes remain unapplied while everything else in the graph is correct. I use EBS as an example because it is easy to have 10 VMs whose root volume type you just changed from standard to gp2 before realizing you meant gp3, and now you have to wait out the 6 hour cool-down...

I would expect Terraform to apply as many changes as it can and, as you said, continue walking the graph.

This is something we run into fairly often because we have to integrate with other teams that do not use Terraform (so there is no remote state for us to pull), which leaves us with static lists of VPCs to connect to using transit gateways and/or to push routes to. If the other team tears down and re-creates a VPC, our static list is now out of date. A rename of resources led to Terraform destroying a lot of required connectivity, and because the static list of VPCs was outdated, the attempt to re-establish connectivity to that stale VPC failed, so Terraform never completed the rest of the changes to bring back the resources it had destroyed, along with a bunch of other resources.

I hear you when you say you don't want to add another mode of execution, but I would argue that the current apply behavior runs counter to my intuition of "terraform makes sure my cloud matches what is in my config", as opposed to "terraform makes sure my cloud matches until the first failure, and maybe on the next run it will apply two or three more things before failing again".

(Side note: this is especially painful for some projects I work on, where a terraform apply takes roughly 30 minutes just to calculate the graph before it even starts a refresh. I opened an issue related to this here: https://github.com/hashicorp/terraform/issues/30968)

@jbardin it seems that @apparentlymart got what I was going for. The issue here is that I have 30 changes; Terraform successfully applies 10 of them, then when it fires off the next 10 requests one of those fails, and it stops executing the last 10 changes.

If out of the 30 changes, the first 10 are all attempting to modify EBS volumes that can't be modified due to the cooldown period, terraform will never attempt to make the last 20 changes, unless you start using -target.

archoversight commented 1 year ago

Another example: I want Terraform to enforce state, but a machine that is meant to be destroyed has termination protection turned on.

Terraform may now leave the rest of my state half applied in the cloud just because it can't tear down that one EC2 instance. Re-running terraform apply may or may not end up reaching a steady state... but it's non-deterministic whether that will be the case.
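A sketch of that scenario (the AMI and names are placeholders; disable_api_termination is the AWS provider's termination protection setting):

resource "aws_instance" "protected" {
  ami           = "ami-00000000" # placeholder
  instance_type = "t3.micro"

  # With termination protection on, the EC2 API rejects the delete that
  # Terraform issues once this resource is removed from configuration, and
  # everything downstream of that failure is then skipped.
  disable_api_termination = true
}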

jbardin commented 1 year ago

I'm still not certain we're talking about the same issue here.

Take this excessively contrived example:

resource "null_resource" "fail" {
  provisioner "local-exec" {
    command = "exit 1"
  }
}

resource "time_sleep" "wait" {
  count = 100
  create_duration = "1s"
}

resource "null_resource" "a" {
  count = length(time_sleep.wait)
}

resource "null_resource" "b" {
  count = length(null_resource.a)
}

We have a single resource instance which will fail, and a lot of resource instances which will be queued up (in batches of 10 by default) and applied long after the first resource failed. Executing this config will apply everything, and the following plan will only show the single tainted null_resource.fail being replaced.

It's only in the case where there are dependencies between these resources that Terraform will abort the execution. That is because there is no way for Terraform to determine if the dependency declared in the configuration actually matters for the resources downstream of the failure.
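For contrast, a variant of the example above in which the downstream resource does declare a dependency on the failing one; in that case none of its instances are attempted:

resource "null_resource" "fail" {
  provisioner "local-exec" {
    command = "exit 1"
  }
}

# The reference to null_resource.fail makes every instance of this resource
# downstream of the failure, so Terraform skips all of them after the error.
resource "null_resource" "dependent" {
  count = 10

  triggers = {
    upstream = null_resource.fail.id
  }
}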

archoversight commented 1 year ago

Alright, here's the replication using null resources:

locals {
  mappitymap = {
    "1" : 0,
    "2" : 2,
    "3" : 1,
    "4" : 0,
    "5" : 0,
    "6" : 1,
    "7" : 0,
    "8" : 0,
    "9" : 2,
    "10" : 1,
    "11" : 0,
    "12" : 0,
    "13" : 1,
    "14" : 0,
    "15" : 0,
    "16" : 2,
    "17" : 1,
    "18" : 0,
    "19" : 0,
    "20" : 1,
  }
}
resource "null_resource" "maybe_fail" {
  for_each = local.mappitymap

  provisioner "local-exec" {
    command = "echo ${each.key}; exit ${each.value}"
  }
}

resource "null_resource" "a" {
  for_each = local.mappitymap

  provisioner "local-exec" {
    command = "echo ${each.key}: ${null_resource.maybe_fail[each.key].id}; exit 0"
  }

}

Please note that no null_resource.a ever actually gets created, even though there are keys in null_resource.maybe_fail that have succeeded.

Terraform does know that null_resource.a[key] depends on null_resource.maybe_fail[key], because otherwise it wouldn't be able to plan this properly. There's no reason why null_resource.maybe_fail["2"] should be allowed to influence whether terraform applies the result for null_resource.a["1"].

Now I do see what you are saying about "depends on", but this is still surprising to me (and I probably do have a dependency somewhere, due to using for_each somewhere else, that leads to the whole thing failing). If I instead manually create a resource for each of the entries in the map, duplicating that code 20 times for both null_resource.maybe_fail and null_resource.a, then terraform does work as I would expect it to.

I am specifically using maps and for_each so that we have stable state file entries, because using count means that adding or removing an entry could mean destroying + re-creating an awful lot of resources.
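A simplified sketch of that difference (key names made up; null_resource keeps it neutral):

variable "vpc_keys" {
  type    = set(string)
  default = ["vpc-a", "vpc-b", "vpc-c"]
}

# count: instances are addressed by position. Removing "vpc-b" shifts the
# element at index 1 to "vpc-c", its triggers change, and Terraform plans a
# destroy/re-create for null_resource.by_count[1].
resource "null_resource" "by_count" {
  count = length(var.vpc_keys)

  triggers = {
    vpc = tolist(var.vpc_keys)[count.index]
  }
}

# for_each: instances are addressed by key. Removing "vpc-b" only destroys
# null_resource.by_key["vpc-b"] and leaves the other instances untouched.
resource "null_resource" "by_key" {
  for_each = var.vpc_keys

  triggers = {
    vpc = each.value
  }
}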

I would also argue that if this change is made, then given a module that outputs some results and contains a resource that fails, as long as the outputs and downstream resources don't depend on that failed resource, I would expect all downstream resources to continue applying correctly.

jbardin commented 1 year ago

Thanks @archoversight, I see what you're getting at now. Terraform manages dependencies at the resource level, not between individual instances. This means that while some instances of null_resource.maybe_fail may have succeeded, the upstream dependencies for null_resource.a resulted in errors, and therefore execution of null_resource.a is not going to be started at all.

There are a number of use cases which could benefit from tracking dependencies between individual instances, but because that is a major change to the underlying architecture of Terraform, it's not something which is likely to happen soon. That type of change would also have to be weighed against the existing benefits of the current system, especially in regard to other possible features like the batching of requests for multiple instances.

hvaidya14 commented 7 months ago

Can I re-apply a terraform plan that failed halfway through, without manually destroying the resources that were already created?

g13013 commented 5 months ago

@apparentlymart @jbardin

The behavior of letting the deployment continue should be the default unless resources rely on the failed resource, for various reasons.

But at least for now, introduce a --continue-on-error kind of flag!

AlissonRS commented 3 months ago

This would be very helpful.

I'm trying to remove an EKS cluster, a VPC, and some other resources I created for testing, by running terraform apply after removing their configuration files.

However, one single kubernetes manifest is blocking me:

Plan: 0 to add, 0 to change, 78 to destroy.
╷
│ Error: external-secrets/aws-secrets-manager failed to get resource from kubernetes: conversion webhook for external-secrets.io/v1beta1, Kind=ClusterSecretStore failed: Post "https://external-secrets-mycluster-webhook.external-secrets-mycluster.svc:443/convert?timeout=30s": service "external-secrets-mycluster-webhook" not found
│
╵
error Command failed with exit code 1.

My solution was manually removing this resource from state, which is fine since deleting the EKS cluster will get rid of this resource anyway (since it's a kubernetes resource within EKS itself).
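For reference, the workaround was roughly this (the state address below is made up for the example):

# Drop the unreachable kubernetes object from state so the destroy of the
# remaining resources can proceed; EKS takes the object with it anyway.
terraform state rm 'kubernetes_manifest.cluster_secret_store'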

apparentlymart commented 3 months ago

I created some confusion earlier in this issue by making an incorrect (outdated) statement about Terraform's current behavior, and I'm sorry for that. I want to reset by stating the current situation and what that implies for this issue:

Terraform currently already behaves as this issue originally suggested: if one resource encounters a failure during apply, Terraform will skip everything that depends on that resource, but it will process anything that doesn't depend on that resource as normal.

Downthread we refined the request to be something related in concept but totally separate in implementation: some participants want Terraform to track dependencies on a per-resource-instance basis rather than on a per-resource basis, so that the rule I described above can still allow individual instances of a downstream resource to complete even though a subset of the upstream instances failed.

That is a valid request that is indeed different than Terraform's current behavior. However, there is also no clear design for how to achieve that.

This issue is blocked not because we don't understand why this would be useful, but because we don't know how to achieve the behavior being requested. Terraform discovers dependencies by static analysis rather than dynamic analysis, but the instance keys for a resource are dynamic, and references to instances are dynamic, so that information is not visible to Terraform's dependency analysis.
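A minimal sketch of what that looks like in configuration (names made up): the reference below tells static analysis only that "downstream" depends on "upstream" as a whole, because the instance key is an expression evaluated per instance at run time:

locals {
  keys = toset(["1", "2", "3"])
}

resource "null_resource" "upstream" {
  for_each = local.keys
}

resource "null_resource" "downstream" {
  for_each = local.keys

  triggers = {
    # Which upstream instance this refers to is only known when each.key is
    # evaluated, so the dependency graph records an edge between the two
    # resources rather than between matching instance keys.
    upstream_id = null_resource.upstream[each.key].id
  }
}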

This issue cannot move forward until there's some idea for how to implement what it asks for. We revisit it periodically, but so far we have not found any feasible design. (Rewriting Terraform's runtime to work in an entirely different way is not "feasible" for the sake of this paragraph, because the risk of breaking existing working configurations is too high.)

(A provider returning an error when asked to delete something that already doesn't exist is a provider bug, so if you have encountered that situation please report the issue to the provider developer instead. Fixing the bug in the provider will typically be far more tractable than whatever solution might help with this feature request, for most providers.)