Open apparentlymart opened 5 months ago
IIUC we have two disparate requirements here, both of which can be satisfied with configuration that is possible today, though it would definitely be more user-friendly if these were not up to the user to declare correctly.
The quote from the proposal is a combination of these two requirements. The first aspect is that we would prefer not to require the user to add replace_triggered_by when the provider already knows that this type of relationship will require replacement. The only way we have to do this right now is what exists in the example, where an attribute change is used to trigger the replacement. If there is no natural key for this, then it's an awkward choice between a synthetic attribute to force the change (which is often not hard to derive), or documenting that replace_triggered_by is required for this type of resource combination.
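For concreteness, a minimal sketch of the replace_triggered_by form of that workaround, using the VPC and security group example from the issue below (arguments are illustrative):

```hcl
resource "aws_security_group" "example" {
  name   = "example"
  vpc_id = aws_vpc.example.id

  lifecycle {
    # Force this security group to be replaced whenever the referenced VPC's
    # id changes, instead of relying on some other attribute change to
    # trigger the replacement.
    replace_triggered_by = [aws_vpc.example.id]
  }
}
```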
The other aspect to the problem is the ordering of the destroy operation with respect to the resource which will continue to exist in the configuration. It is not true that these are un-ordered, however: all destroy operations are ordered by the relationships stored in the state, so if both resources existed simultaneously in the configuration at some point, the destroy actions will still be strictly ordered relative to any updates or replacements of configured resources. The default ordering which is not working in this example fails not because it is un-ordered, but because it is the incorrect order. The order required is create_before_destroy (i.e. "create-and-or-update-or-replace-dependencies-before-destroy"), which unfortunately currently requires applying that lifecycle option in the configuration before removing the old resource.
In the proposed depends_on = [aws_vpc.example], the default ordering of operations should already be correct. Actually creating this example and inspecting the graph shows that the ordering is fully defined, with the final aws_security_group.example waiting on the destroy for aws_vpc.example, which in turn is waiting on the destroy for the old aws_security_group.example.
So global-relationships aside (automatic replace_triggered_by), maybe what this boils down to is actually defining a way to change the destroy and update order at the time of removal rather than at the time of create (or an intermediate update).
That specific request has 2 issues that I can see, both of which may not be deal breakers themselves, but need careful consideration. In graph construction, nothing can directly reference a destroy node, and destroy nodes can only depend directly on other destroy nodes (obviously other implementation details connect the destroy and create/update sides of the graph). This is a key premise that ensures we are building an order of operations which is valid (and part of why destroy-time provisioners cannot reference anything). Building this feature would not necessarily need to break that rule, but it needs to be carefully accounted for since we are dealing with a destroy-only operation.
The other technical issue is that because users don't have direct control over the destroy ordering, we ensure that the destroy can proceed by using the dependencies recorded during the create and update operations. In essence, the entire destroy order is already recorded in the state, and this would add a way for users to alter that destroy order. If this is adding order where there wasn't any (possibly for 2 things which did not exist simultaneously in the config), then there's little chance of conflict, but if there is already a defined order, changing that is likely to conflict with other planned operations.
This is an idea for a new capability based on the Stack Overflow question Make Terraform update parent resource before deleting child resource.
That question is actually about how a provider could represent that relationship so that the correct behavior would emerge by default, and that's one of the use-cases discussed in https://github.com/hashicorp/terraform-proposals/pull/87, but this idea is for a configuration-driven approach that would be complementary with the provider-driven approach for situations where a provider can't describe the relevant relationship itself. It would also serve as an interim workaround for the lack of provider-declared relationships which could be implemented without modifying any providers, and therefore with considerably less cross-team coordination.
Problem
Today Terraform cannot model the correct order of operations for some combinations of deleting and updating an object. In the proposal I linked above I'd described that as follows (bracketed parts are additions I've added here for those who cannot access the linked proposal):
(I also noted in the other proposal that this particular example is not realistic because the change to vpc_id in aws_security_group would already force replacement of the security group today anyway, but let's pretend that isn't true for the sake of this discussion, because there are cases where there's no changing argument to cause this effect.)

Although the quoted proposal doesn't discuss it, there is another variation of this problem where resource "aws_vpc" "example" has been removed from the configuration and aws_security_group.example updated to refer to a different VPC object. In that case Terraform would propose to destroy the VPC object and replace the security group.

The configuration contains no record of the former dependency edge from aws_security_group.example to aws_vpc.example, because Terraform uses the dependencies of aws_security_group.example that are currently declared in the configuration. Therefore, as far as Terraform is concerned, the action of destroying aws_vpc.example is independent of the action of replacing aws_security_group.example and so the two can happen in either order.

However, if Terraform tries to delete the VPC first then the EC2 API will reject that request, because it's invalid to delete a VPC that has a security group in it. The same would be true if the VPC contained a subnet, or a network interface. There is a hidden dependency here that Terraform has lost track of.
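To make that variation concrete, a sketch of what the updated configuration might look like (aws_vpc.other is an assumed second VPC declared elsewhere in the module; arguments are illustrative):

```hcl
# Previously this module also contained:
#
#   resource "aws_vpc" "example" {
#     cidr_block = "10.0.0.0/16"
#   }
#
# That block has now been deleted, and the security group refers to a
# different VPC, so the configuration no longer declares the old dependency
# edge from aws_security_group.example to aws_vpc.example.
resource "aws_security_group" "example" {
  name   = "example"
  vpc_id = aws_vpc.other.id
}
```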
Proposal
As noted earlier, the most ideal case here would be for us to somehow extend the provider API to allow the provider to tell Terraform Core that aws_security_group.example blocks the deletion of aws_vpc.example, which would then allow Terraform Core to know that a special dependency ordering is required. Specifically for this case, the provider would need to have previously told Terraform Core that the current aws_security_group.example is blocking, but that the planned new aws_security_group.example is not blocking that VPC, because it no longer refers to it. Terraform Core could then in principle notice that it must complete at least the destroy leg of the security group replacement before it begins deleting the VPC.

However, for the sake of this issue let's assume that there's no way for the provider to describe that relationship. That could arise if the underlying API design doesn't allow that inference to be made, or if the reason for the blockage is something additional the module author has decided rather than an inherent part of the remote API. In that case what we want is a way for the module author to describe this relationship.
Therefore I propose the compromise of making it valid for a depends_on argument to refer to a resource that's mentioned in the from argument of a removed block in the same module.
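A sketch of what such a configuration might look like, reusing the VPC and security group example (aws_vpc.other is an assumed replacement VPC, and the removed block follows the Terraform 1.7+ syntax):

```hcl
removed {
  from = aws_vpc.example

  lifecycle {
    # Terraform should still plan to destroy the old VPC object, rather than
    # just forgetting it from state.
    destroy = true
  }
}

resource "aws_security_group" "example" {
  name   = "example"
  vpc_id = aws_vpc.other.id

  # Proposed: valid because aws_vpc.example appears in the removed block
  # above, recording that the old security group object must be destroyed
  # before the old VPC object is destroyed.
  depends_on = [aws_vpc.example]
}
```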
Today's Terraform would treat depends_on = [aws_vpc.example] as invalid because there is no resource "aws_vpc" "example" block in this module, and it's helpful to have Terraform report that as an error so that authors can quickly notice if they've made a typo or other similar referencing mistake.

However, the removed block acts as an explicit record of the former existence of a resource "aws_vpc" "example" block, implying the possibility that a resource instance aws_vpc.example might exist in the prior state. My assertion is that the agreement between these two statements is enough to assume that the author is intentionally referring to something that isn't currently declared, and so this depends_on entry is unlikely to be a typo.
Terraform Core can then understand this as "if there's an aws_vpc.example tracked in the prior state then it must outlive any pre-existing aws_security_group.example object". Or, in other words, the required order of operations is:

1. Destroy the existing aws_security_group.example object.
2. Destroy the existing aws_vpc.example object.
3. Create the new aws_security_group.example object.

(Steps 2 and 3 are technically independent and could happen in either order, because the new object cannot possibly depend on something that isn't declared in configuration at the time it's created; the relative ordering of steps 1 and 2 is the important thing here, and 3 could potentially happen concurrently with 2.)
If these objects were instead in a create_before_destroy chain, the ordering would be:

1. Create the new aws_security_group.example object.
2. Update anything that refers to the old security group to use the new aws_security_group.example object.
3. Destroy the old aws_security_group.example object.
4. Destroy the existing aws_vpc.example object.

(Again, the ordering of steps 3 and 4 is the main point for this proposal; steps 1-3 are just the normal create_before_destroy behavior.)

Disadvantages
Although this model seems like a reasonable extension of some existing Terraform concepts, this treatment is admittedly quite subtle and might be hard to explain in the documentation. If we did this then I expect we'd document it primarily as part of the depends_on documentation, describing it as a way to declare a dependency on an object that's planned for deletion.

However, it's likely true that someone would only discover that they needed this when they try to apply a plan created without the hidden dependency declared, at which point it might be too late to learn about it and add it. For shared modules it can at least then benefit other users of the module that haven't upgraded yet, but for a single-instance module I expect the experience would be to encounter an error during apply, somehow learn that it was caused by a missing dependency, and then to resolve it by manually deleting the object that Terraform misordered and retrying, because that'll often be faster than making another configuration change to fix it, and it's often necessary to respond to a failed apply with some urgency.
Since this situation seems to arise pretty rarely, it may be justified for it to be a somewhat-hidden feature. I expect its implementation is relatively straightforward -- just loosening a depends_on validation rule and then letting the graph builder add the dependency edges it would already add in the presence of such a reference anyway -- in which case the maintenance cost is relatively low as long as we make sure this situation is covered by a regression test.