hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/
Other
42.4k stars 9.5k forks source link

Add support for update -> delete -> create -> update ordering #35763

Open natedogith1 opened 3 days ago

natedogith1 commented 3 days ago

Terraform Version

Terraform v1.9.6
on windows_386

Use Cases

Some resources need to be removed from their dependents before they can be deleted.

I've encountered this issue with google_monitoring_alert_policy referencing google_monitoring_notification_channel and google_compute_region_url_map referencing google_compute_region_backend_service. There are also several github issues from encountering this problem.

Attempted Solutions

create_before_destroy has frequently been offered as a solution. This can work if there's not name conflicts. The problem is that, by design, this change propagates to the entire dependency graph, which means it can't be used if any dependency doesn't support create_before_destroy.

Proposal

Add support for removing resources from dependents before destroying. The create_before_destroy documentation should probably also be updated to mention how it affects ordering with other resources being updated.

References

jbardin commented 3 days ago

Hi @natedogith1,

There's going to be some problems we first need to overcome to do something like this, which would probably require major changes to not just Terraform, but providers and the protocol they use. The biggest hurdle is that since there's no value to use in the first update call, it would have to be some sort of "incomplete update" operation, which the provider would opt into with a planned partial set of changes that they could accomplish without the final planned values. This also encounters a "prior knowledge" problem, where the initial plan needs to be made differently in anticipation of a problem which hasn't happened yet. Provider's generally can't tell when an update is going to fail because of conditions on another resource, even if you asked a provider if the plan will be OK it cannot tell you because the failure won't happen until the final API call is made. Because it's not going to be possible to predict the failure, providers would have to defensively rewrite all actions which might result in this type of change to always assume it is required.

Given the amount of work that it would take to implement a new resource lifecycle in core, and create a new protocol to implement across providers and the frameworks they use, it's generally far less friction to update providers to work best with the current protocol. With few exceptions, resources should all be designed with the ability to handle create_before_destroy, and most of the time that represents a bug in the resource if it can't be used.

natedogith1 commented 2 days ago

With few exceptions, resources should all be designed with the ability to handle create_before_destroy, and most of the time that represents a bug in the resource if it can't be used.

I would expect most resources have a user-specified unique id (e.g. google_compute_instance's name) that would cause create_before_destroy to fail.

jbardin commented 2 days ago

Many resources do require a unique name, but have an input like name_prefix instructing the provider to generate a unique name with a given prefix, which avoids a conflict when replacing the instance.

natedogith1 commented 2 days ago

From my relatively limited experience, it seems only a few resources have something like name_prefix. And switching over to a randomly generated name because some dependent might need create_before_destroy isn't ideal.

I don't know Terraform's internals, but something like create_before_destroy that instead only applies on deletion, and thus shouldn't need to propagate, might be easier to implement and would likely handle most of the use cases.

Actually, update -> destroy instead of the current default of destroy -> update seems logically sounds at first glance, so it might be possible to have that as a default. Though I don't know if there'd be issues with re-creation causing the order of update/destroy to be swapped.

jbardin commented 2 days ago

The orderings we have for dependency operations in Terraform are not arbitrary, they are the result of requiring a deterministic, consistent ordering with no dependency cycles. If you make the destroy operation depend on the update, then you have a cycle because the create also depends on the destroy. This is why you need create_before_destroy to get the inverse ordering around updates.

I made some notes here with simplified examples to explain the basics of destroy ordering: https://github.com/hashicorp/terraform/blob/main/docs/destroying.md

natedogith1 commented 1 day ago

I don't think there would be a cycle if destroy depending on update was conditional on if it was a re-deploy. Though it would mean the ordering would be dependent on if it's a redeploy, which might not be desirable as a default.

Could there be an option that behaves like create_before_destroy except it also removes the dependency between the creation and deletion of the replaced resource? That seems like it would allow consistent edges of the dependency graph without introducing cycles and without requiring propagating the change.

jbardin commented 1 day ago

Having an order of operations which is conditional is not desirable specifically because it is not consistent. The ordering is important for correct resource behavior, and if that was different based on combinations of changes it would not be possible to write a configuration with predictable results. You can't just rely on example situations with only a couple operations, you have to imagine these combined with hundreds of other arbitrary operations on resources, locals, variables, outputs, modules, providers, data sources, etc, all interconnected and all working in all valid permutations.

Removing the dependency between the creation and deletion isn't possible, both because we guarantee that order, and removing the order means we don't know which one will happen first. That leaves things in a nondeterministic state, where what you want may work, but only some of the time while possibly breaking something else.

natedogith1 commented 1 day ago

Terraform currently guarantees an ordering between creation and deletion, but lets the user specify what the ordering is between them. I don't see what the issue is with not knowing the order, if the resource supports create_before_destroy, I would expect any order would work, similar to how I assume resources without dependencies between them are already handled.

jbardin commented 1 day ago

All related actions need a defined order, because they do not happen in isolation. Not being ordered might work in small examples, but causes nondeterministic behavior in complex ones. As I mentioned before, you have to incorporate this within arbitrarily large configurations, where there are other things dependent on both the destroy and the create operations, and using any generic resource type designed with whatever constraints were possible within the ordering we have guaranteed.

natedogith1 commented 1 day ago

If all actions need a defined order, how does Terraform order things that have no detected dependencies between each other? I would have assumed they have no defined order and Terraform would do them in parallel.

jbardin commented 1 day ago

Yes, if there is no dependency, direct or transitive, then the operations are concurrent.

natedogith1 commented 1 day ago

If Terraform supports running actions concurrently, I'm not sure what the problem would be with letting the user tell Terraform that a specific resource can be created and destroyed at the same time (similar to how the user can already tell Terraform that a resource can be created before it's destroyed).

natedogith1 commented 1 day ago

Working through an a <- b <- c dependency graph where b has "create_before_destroy except unlink b's create and destroy so it doesn't need to propagate), I see how that could cause a loop (c -> b -> a -> a_destroy -> b_destroy -> c). I think lifeycle.replace_triggered_by could be used to work around this issue without propagating the changes.

I still think splitting update into optional pre-delete and post-create would be a good idea, rather than requiring a redeploy (the provider could either assume all removals are deletions and need to happen in pre-delete, or it could be informed what terraform is deleting (identified using something like the import id) so it could handle removal of non-destroyed items in post-create). create_before_destroy isn't usable because common resources don't support it (e.g. google_compute_instance, google_compute_address, azurerm_linux_virtual_machine, azurerm_network_interface).

natedogith1 commented 1 day ago

For google_compute_backend_service, it looks like there's an in-progress issue in the provider to work around it by updating the url map from the backend service resource: https://github.com/hashicorp/terraform-provider-google/issues/4543.

jbardin commented 8 hours ago

If you make the destroy operation depend on the update, then you have an impossible ordering, the cycle is kind of the side effect of the undefined order and most of the time just "breaking" that cycle is usually not the correct resolution. If you then remove the edge between the create and destroy, while that may be unordered in isolation, you have implicitly recreated the same create_before_destroy ordering when connected to the update, bringing us back to the same problem you started with. replace_triggered_by cannot change the ordering, it only forces a replace action from the plan in the same order it would otherwise happen.

Changing the order also has ramifications on all upstream resource operations too, because it's reversing an edge between destroy and create subgraphs in the same way as create_before_destroy, cascading the change throughout the graph in order to resolve the ordering, and again returning to the same problem you were trying to originally solve.

There may be other valid patterns to use for certain resources, but Terraform needs to work with any resource which follows the lifecycle we've defined. Use cases outside of the defined lifecycle may also be possible, but that requires redefining what Terraform can do and the contract it has with providers.