Unknown values should not block successful planning

apparentlymart commented 2 years ago

The idea of "unknown values" is a crucial part of how Terraform implements planning as a separate step from applying.

An unknown value is a placeholder for a value that Terraform (most often, a Terraform provider) cannot know until the apply step. Unknown values allow Terraform to still keep track of type information where possible, even if the exact values aren't known, and allow Terraform to be explicit in its proposed plan output about which values it can predict and which values it cannot.

Internally, Terraform performs checks to ensure that the final arguments for a resource instance at the apply step conform to the arguments previously shown in the plan: known values must remain exactly equal, while unknown values must be replaced by known values matching the unknown value's type constraint. Through this mechanism, Terraform aims to promise that the apply phase will use the same settings as were used during planning, or Terraform will return an error explaining that it could not.

(For a longer and deeper overview of what unknown values represent and how Terraform treats them, see my blog post Unknown Values: The Secret to Terraform Plan.)

The design goal for unknown values is that Terraform should always be able to produce some sort of plan, even if parts of it are not yet known, and then it's up to the user to review the plan and decide either to accept the risk that the unknown values might not be what's expected, or to apply changes from a smaller part of the configuration (e.g. using -target) in order to learn more final values and thus produce a plan with fewer unknowns.

However, Terraform currently falls short of that goal in a couple different situations:

The Terraform language runtime does not allow an unknown value to be assigned to either of the two resource repetition meta-arguments, count and for_each.

In that situation, Terraform cannot even predict how many instances of a resource are being declared, and it isn't clear how exactly Terraform should explain that degenenerate situation in a plan and so currently Terraform gives up and returns an error:

│ Error: Invalid for_each argument
│
│ ...
│
│ The "for_each" value depends on resource attributes that cannot
│ be determined until apply, so Terraform cannot predict how many
│ instances will be created. To work around this, use the -target
│ argument to first apply only the resources that the for_each
│ depends on.

│ Error: Invalid count argument
│
│ ...
│
│ The "count" value depends on resource attributes that cannot be
│ determined until apply, so Terraform cannot predict how many
│ instances will be created. To work around this, use the -target
│ argument to first apply only the resources that the count depends
│ on.

If any unknown values appear in a provider block for configuring a provider, Terraform will pass those unknown values to the provider's "Configure" function.

Although Terraform Core handles this in an arguably-reasonable way, we've never defined how exactly a provider ought to react to crucial arguments being unknown, and so existing providers tend to fail or behave strangely in that situation.

For example, some providers (due to quirks of the old Terraform SDK) end up treating an unknown value the same as an unset value, causing the provider to try to connect to somewhere weird like a port on localhost.

Providers built using the modern Provider Framework don't run into that particular malfunction, but it still isn't really clear what a provider ought to do when a crucial argument is unknown and so e.g. the AWS Cloud Control provider -- a flagship use of the new framework -- reacts to unknown provider arguments by returning an error, causing a similar effect as we see for count and for_each above.

Although the underlying causes for the errors in these two cases are different, they both lead to a similar problem: planning is blocked entirely by the resulting error and the user has to manually puzzle out how to either change the configuration to avoid the unknown values appearing in "the wrong places", or alternatively puzzle out what exactly to pass to -target to select a suitable subset of the configuration to cause the problematic values to be known in a subsequent untargeted plan.

Terraform should ideally treat unknown values in these locations in a similar way as it does elsewhere: it should successfully produce a plan which describes what's certain and is explicit about what isn't known yet. The user can then review that plan and decide whether to proceed.

Ideally in each situation where an unknown value appears there should be some clear feedback on what unknown value source it was originally derived from, so that in situations where the user doesn't feel comfortable proceeding without further information they can more easily determine how to use -target (or some other similar capabililty yet to be designed) to deal with only a subset of resources at first and thus create a more complete subsequent plan.

This issue is intended as a statement of a problem to be solved and not as a particular proposed solution to that problem. However, there are some specific questions for us to consider on the path to designing a solution:

Is it acceptable for Terraform to produce a plan which can't even say how many instances of a particular resource will be created?

That's a line we've been loathe to cross so far because the difference between a couple instances and tens of instances can be quite an expensive bill, but the same could be said for other values that Terraform is okay with leaving unknown in the plan output, such as the "desired count" of an EC2 autoscaling group. Maybe it's okay as long as Terraform is explicit about it in the plan output?

A particularly "interesting" case to consider here is if some instances of a resource already exist and then subsequent changes to the configuration cause the count or for_each to become retroactively unknown. In that case, the final result of count or for_each could mean that there should be more instances of the resource (create), fewer instances of the resource (destroy), or no change to the number of instances (no-op). I personally would feel uncomfortable applying a plan that can't say for certain whether it will destroy existing objects.
Conversely, is it acceptable for Terraform to automatically produce a plan which explicitly covers only a subset of the configuration, leaving the user to run terraform apply again to pick up where it left off?

This was essence of the earlier proposal #4149, which is now closed due to its age and decreasing relevance to modern Terraform. That proposal made the observation that, since we currently suggest folks work around unknown value errors by using -target, Terraform could effectively synthesize its own -target settings to carve out the maximum possible set of actions that can be taken without tripping over the two problematic situations above.
Should providers (probably with some help from the Plugin Framework) be permitted to return an entirely-unknown response to the UpgradeResourceState, ReadResource, ReadDataSource, and PlanResourceChange operations for situations where the provider isn't configured completely enough to even attempt these operations?

These are the four operations that Terraform needs to be able to ask a partially-configured provider to perform. If we allow a provider to signal that it isn't configured enough to even try at those, what should Terraform Core do in order to proceed with that incomplete or stale information?
We most frequently encounter large numbers of unknown values when planning the initial creation of a configuration, when nothing at all exists yet. That is definitely the most common scenario where these problems arise, but a provider can potentially return unknown values even as part of an in-place update if that is the best representation of the remote API's behavior -- for example, perhaps one of the output attributes is derived from an updated argument in a way that the provider cannot predict or simulate.

Do we need to take any extra care to deal with the situation where an unknown value cascades downstream from an updated or replaced resource instance?

For example, if I've used an attribute from a vendor-specific Kubernetes cluster resource to provide an API URL to the hashicorp/kubernetes provider and the user changes the configuration of the cluster itself in a way that causes the API URL to change, how should Terraform and the Kubernetes provider react to the cluster URL being unknown even though there are existing objects bound to resources belonging to that provider which we will need to refresh and plan?
What sort of analysis would we need to implement in order to answer questions like "why is this value unknown?" and "what subset of actions could I take in order to make this value be known?"?.

pdecat commented 2 years ago

Isn't that a typo?

but a provider can potentially return known values even as part of an in-place update

s/known/unknown

BzSpi commented 2 years ago

Same typo here I think

If any known values appear in a provider block for configuring a provider, Terraform will pass those unknown values to the provider's "Configure" function.

s/known/unknown

apparentlymart commented 2 years ago

Thanks for the corrections! I've updated the original comment to fix them.

pdecat commented 2 years ago

Hi @apparentlymart, I believe you made a similar typo in your blog post :

Terraform deals with these by just making them return known values of the appropriate type during planning.

apparentlymart commented 2 years ago

Apparently it's easy for my fingers to confuse those! I'll update my blog at some point, too. Thanks!

Nuru commented 2 years ago

@apparentlymart Is this why sort() causes plans to fail?

Oversimplified example. With var.a_list of type list(string)

count = length(var.a_list)

works, even when the values in the a_list are unknown, I understand why

count = length(compact(var.a_list))

fails, because compact causes the length of the list to depend on the values in it. Fair enough. What I don't understand is why

count = length(sort(var.a_list))

fails. The cardinality of the list is not changed by sorting it, so why should it break count?

apparentlymart commented 2 years ago

I expect that the sort function is forced to return a wholly-unknown list because it cannot predict which of the input elements will be at which indices in order to return a known list with a mixture of known and unknown elements. In the language today there is no such thing as an unknown list with a known length -- it's all bundled together into one concept -- and so the length is lost by the time the value reaches the length function.

I imagine we might be able to change sort so that it returns a known list of a particular number of values that are all unknown instead, thus creating the effect of an unknown list with a known length, but I'm not sure about the compatibility consequences of doing so. If you'd like to discuss that more, please open a new enhancement request issue for it and we can discuss the details of how it might work over in that other issue.

KamalAman commented 2 years ago

It would be great to have providers lazy load if a value it is waiting on to be initialized. This way you can create a cluster and provisioning the kuberentes provider can work in the same apply in the same workspace. #1102

I think this issue also effects for key, value loops, but surprisingly does not effect for value loops.

For example first apply when creating a many tfe_workspace.workspace using a for_count results in a bunch of partial values in this trivial example

locals {
   ids = [
      for key, workspace in tfe_workspace.workspace : "${key}-${workspace.id}"
    ]
}

where as works just fine

locals {
   ids = [
      for workspace in tfe_workspace.workspace : workspace.id
    ]
}

jbardin commented 2 years ago

I wanted to add a note here which may or may not be related depending on what solutions we find.

There is one place where unknown values are allowed, but can drastically alter the behavior of the plan -- resource attributes which force replacement. Feeding an unknown into an attribute of this type always forces replacement, which itself may cause more values to become unknown, cascading replacements through the config even when the end result may have little to no differences from the initial state.

Technically a solution to the problem posed here would generally have no negative effect on this type of resource replacement, but we need to keep this in mind if we start allowing users to more haphazardly throw around unknown values. (A solution here may even help with this problem, but we can brainstorm about that and other possibilities in other threads)

apparentlymart commented 2 years ago

Thanks for adding that, @jbardin. I think you're describing the situation where if a value of an existing argument becomes unknown later then we'll conservatively assume it will have a different value than it had before, which will at least cause the provider to plan an update but may also sometimes cause the provider to plan a replace instead, if it's in an argument that can't update in place.

The specific case where folks find that confusing / frustrating is where the new value happens to end up being the same as the old value, and so Terraform ends up needlessly updating the object to the values it already had, or replacing the object with one configured in the same way as the previous one.

This tends to be exacerbated today by the fact that our old SDK design made it awkward for providers to offer predictions of what values attributes would have during planning -- for example, to use the documented syntax for a particular kind of ARN to predict an AWS resource type's arn attribute during planning rather than applying -- and so it might also be good to try to make that more ergonomic in the new framework and encourage provider developers to make use of that functionality so that we aren't propagating unknown values that don't really need to be unknown, and instead reserving that mechanism only for truly unpredictable values like a dynamically-selected IP address, or similar.

stevehipwell commented 2 years ago

My two biggest requests in this area are as follows, I could have listed many more but these two and the count & for_each issues already mentioned cause us the majority of our pain.

Changing a tag on an EKS cluster shouldn't cause any resource using an output to have unknown data unless it's actually using the tags. This is a general pattern we see far too often, illustrated via the aws_eks_cluster resource which tends to be the resource we see it happen most on.
The Kubernetes provider kubernetes_manifest resource shouldn't need API access at plan time as it should be fine for it to be unknown which is to be expected if the cluster isn't created yet. If somehow the cluster exists and is connected during apply this should result in an error as it'd be an update and not an apply happening.

apparentlymart commented 2 years ago

Hi @stevehipwell! Thanks for sharing those.

I'm familiar with the behavior you're describing for kubernetes_manifest, but I'm not familiar with the situation you're describing about aws_eks_cluster having data become unknown as a result of changing tags. Do you have a reference to an AWS provider issue describing that?

I'm asking because I'd like to see more details on what's going on so we can understand whether this is a situation where the provider must return unknown for some reason -- in which case this issue is relevant as a way to make Terraform handle that better -- or if the AWS provider should already have enough information to avoid producing an unknown value in that case, and so we could potentially fix that inside the AWS provider with Terraform as it exists today.

For all of the AWS provider resource types I'm familiar with, changing a tag can be achieved using an in-place update which should therefore avoid anything else becoming unknown, but I don't have any direct experience with the EKS resource types in particular, and there's possibly something different about the EKS API that means the AWS provider needs to treat it differently than tagging of other resource types.

stevehipwell commented 2 years ago

@apparentlymart I'll add something here next time I see this, but there is nothing special with an EKS cluster to make it need custom behaviour. We see it elsewhere but for us the EKS cluster is the resource at the top of the biggest graphs. We spent a significant engineering effort to change how we use Terraform to limit the number of incorrect "known at apply" fields showing up in the plan. Our code is much harder to understand and develop but our end users get a cleaner plan, that said there are still a significant number of cases where this will happen.

This has also reminded me of a number of other very obvious issues in this area.

Using a data aws_iam_policy_document with dynamic values coming from a deep enough graph will result in a lot of known at apply (we can no longer use these even with careful code structuring)
Some resources (such as ACL rules or SG rules) don't have a consistent representation and always show changes

apparentlymart commented 2 years ago

IAM policy documents are a particularly tricky case because it ends up folding a bunch of separate identifiers into a single string, which must therefore end up being unknown if any part of it is unknown. The AWS provider could potentially help with this by generating ARNs during the planning step based on the syntax specified in the AWS documentation, but I'm aware that most of the time arn attributes end up being unknown today even though in principle the provider could predict what the final ARN string would be.

This issue is focused only on the situations where unknown values cause Terraform to return an error, and so the problem of "limiting the number of known after apply" in the plan is not in scope for this issue, but if you see situations where a provider is returning an unknown value even though it should have all of the information to return a known value I'd suggest opening an enhancement request with the relevant provider about that, because providers are already empowered to return final values during planning if they include the logic necessary to do so.

For the sake of this issue, we'd benefit from there being fewer situations where providers return unknown values, but it will never be possible to eliminate them entirely -- some values really are determined by the remote system only during the apply step -- and so this issue is about making sure that those situations cannot cause Terraform to fail planning altogether, so that it's always possible to make progress towards a converged result.

stevehipwell commented 2 years ago

@apparentlymart whenever we've opened issues about these issues they get closed, ignored, or both.

If you want another scenario where this error happens; if you create a new resource in the plan and compare an input value with the resource identifier to create a list of exclusive identifiers it will error even though it should be able to determine that the input value isn't equal. This usually happens in a multi module scenario but the following convoluted example should have the same behaviour.

variable "identifier" {
  type = string
}
resource "resource_type_a" "my_resource" {
}
resource "resource_type_b" "my_other_resource" {
  count = length(concat([resource_type_a.my_resource.id], var.identifier != "" && var.identifier != resource_type_a.my_resource.id ? [var.identifier] : []))
}

apparentlymart commented 2 years ago

Thanks @stevehipwell.

For the purpose of the framing of this issue the answer in that case would be for count to accept the value being unknown rather than returning an error, rather than to make the result be known.

Making the Terraform language's analysis of unknown values more precise is also possible, but is something separate from what this issue is about. What you have there is essentially a Terraform language version of the situation we were originally describing, where something is unknown even though it could technically be known. Feel free to open a new issue in this repository about that case if you like, and we could see about addressing it separately from what this issue is covering, although I will say that at first glance it seems correct to me that this would return unknown if resource_type_a.my_resource.id isn't known, because the list could have either one or two elements depending on that result. If I've misunderstood what you meant, let's talk about that in a separate issue.

The intended scope of this issue is that there should be no situation where an unknown value leads to an error saying that the value must be known, regardless of why the unknown value is present. Situations where a value turns out unknown but needn't be are also good separate enhancement requests (either in this repository for the core language, or in provider repositories for providers), but are not in scope for this issue.

stevehipwell commented 2 years ago

Making the Terraform language's analysis of unknown values more precise is also possible, but is something separate from what this issue is about. What you have there is essentially a Terraform language version of the situation we were originally describing, where something is unknown even though it could technically be known. Feel free to open a new issue in this repository about that case if you like, and we could see about addressing it separately from what this issue is covering, although I will say that at first glance it seems correct to me that this would return unknown if resource_type_a.my_resource.id isn't known, because the list could have either one or two elements depending on that result. If I've misunderstood what you meant, let's talk about that in a separate issue.

I updated the example slightly to check that the var is set for completeness, but that doesn't change the behaviour. The point here is that if data about the types isn't lost (the identifier of a resource which doesn't exist can't be passed in) this should evaluate correctly, but if the type system falls back to the lowest common denominator (known or unknown) it's a tricky problem. Shouldn't the type system be evaluated and "fixed" if possible before changing the behaviour of the system as a whole? A potential outcome might be the lack of need to change the system, but a more likely outcome would be to change the scope or shift the context.

apparentlymart commented 2 years ago

To be clear, I'm not saying that both aren't valid improvements to make, just that I don't want this issue to grow into an overly generalized "everything that's wrong with unknown values" situation which could never be resolved.

More than happy to discuss improvements to the precision of Terraform's handling of unknown values in other issues, but this one is intended to have a very specific "acceptance criteria" after which it'd be time to close it, which is that Terraform doesn't block planning when count or for_each are unknown, and that there is a clear story for how providers ought to handle unknown values in provider configuration (though actually making providers follow that recommendation would be a change to make elsewhere, of course, since providers are separate codebases).

jtv8 commented 2 years ago

It would be great to have providers lazy load if a value it is waiting on to be initialized. This way you can create a cluster and provisioning the kuberentes provider can work in the same apply in the same workspace. #1102

+1 for this - and/or to add an equivalent of depends_on for provider configuration blocks. This is my biggest pain point with Terraform so far - because dynamically configuring providers works perfectly well for creating resources but leaves you stuck with an unstable configuration if you have to replace, say, a Kubernetes cluster or PostgreSQL database later on.

In the case of Kubernetes, the behavior I'm looking for is pretty simple: if the cluster that the provider depends on needs to be created or replaced, we can safely assume that all of the resources that are defined using that provider will also need to be created from scratch, and the plan can accurately reflect this. There should be a way of passing a provider the known information that a planned operation will destroy all of its existing resources.

In the more complex case where we can't be certain if a resource will still exist or not if some of its provider's values are unknown, I'd be fine with Terraform expressing that uncertainty in the plan as "may be replaced", etc.

Having some uncertainty in the plan, and letting the end user decide whether to accept that uncertainty, is infinitely preferable over a perfectly valid configuration being impossible to apply!

If some users want to always have a 100% deterministic plan, what about letting them choose the following configurable behaviors?

Deterministic: as at present, terraform plan and terraform apply will both fail if it's not unknown exactly how many resources will be created. More complex configurations will require selective apply or separate states.
Lazy apply, with human-in-loop: terraform apply will lazily load providers as their configurations become known, and prompt the user to accept or refuse changes at each stage.
Lazy apply, fully automated: as above, but presuming the user's approval for each stage.

apparentlymart commented 2 years ago

Something came up recently that caused me to do a bit more "thought experimenting" specifically about the provider configuration part of this problem, and so I'm leaving this comment here just as some notes for future reference by future me or someone else working on this problem. I'm not actually doing any active design or implementation work in this area right now, aside from these notes.

An interesting thing about the provider-configuration-related flavor of this problem is that it is architecturally split into two parts:

Changing the provider protocol to allow providers to present a more accurate model of situations where certain actions are blocked by an incomplete provider configuration.
Deciding how Terraform Core should then respond to that more accurate model.

Problem 1 seems to have a plausible answer in extending our existing mechanism for providers to describe that certain information isn't known yet: allow all of the operations that can potentially occur before the apply step to return wholly- or partially-unknown results, where unknown values serve as placeholders for data the provider will be able to learn about later once fully configured. The challenges seem to mainly live in part 2: what exactly should Terraform Core do when it knows it has incomplete information at various points?

I listed four such operations in my writeup above, but if we refer to Resource Instance Change Lifecycle we can see that I missed the importing case, which originates "outside the circle", so the full set is actually the following:

ReadDataSource
PlanResourceChange
ImportResourceState
ReadResource
UpgradeResourceState

(Note that ValidateResourceConfig isn't included here because although it does happen before apply it's also intentionally done with unconfigured providers, so that we can always validate even if we don't have credentials yet. Any "online validation", requiring access to the remote system, happens during PlanResourceChange instead.)

The main point of this note, then, is to think about what the consequences would be of allowing all of the above to return both partially-unknown and wholly-unknown objects as their results.

Before getting into the specifics, there is a general observation that applies to all of these: our current serialization formats for plan and state don't have a way to accommodate an entirely-unknown resource instance object, but for the sake of this discussion I'm assuming that would be relatively easy to retrofit and that a wholly-unknown object wouldn't behave significantly differently than a partially-unknown one because the Terraform language has built-in behavior to handle operations that traverse through unknown values.

`ReadDataSource`

This operation is a pretty interesting one because we designed it in a world where all reading happened during a separate "refresh" phase, prior to any other work, and so one of the assumptions going in was that providers needed to be protected from dealing with unknown values altogether. Consequently, Terraform Core currently avoids calling ReadDataSource in any case where part of the configuration is unknown, and automatically (without any provider involvement) forces the read to happen during the apply step.

However, that design tradeoff didn't consider the possibility that it might be the provider configuration that was unknown, rather than the direct data resource configuration. It therefore didn't actually succeed in avoiding providers having to deal with unknown values entirely; it only dealt with the most simple case, leaving the more complex case (that is harder for providers to deal with!) unsolved.

If we were designing this anew today I expect we would build it such that Terraform Core would call ReadDataSource during the plan phase even if part of the configuration is unknown, and it would be up to provider-side logic to decide whether the configuration is sufficiently complete to return a partial result. If not, the provider could in principle return a wholly-unknown object to create the effect that's currently hard-coded into Terraform Core: defer the entire operation until the apply step.

Since this situation just slightly tweaks the process of getting to an equivalent result that we have today (a partially-unknown object in the plan), I don't think this particular operation poses any significant challenges to Terraform Core's existing behaviors.

I would expect Terraform Core to transform a wholly-unknown ReadDataSource response into a partially-unknown object which includes all of the values that originated in the configuration; the contract for provider behavior with data resources is that they are not allowed to override values the user provided directly in the configuration, because otherwise Terraform would already be forced to treat a not-yet-resolved data resource instance as entirely unknown, and that would lead to relatively useless plans.

`PlanResourceChange`

Of all of these, PlanResourceChange seems the most straightforward since it is the one that already allows providers to return partially unknown values. Since a wholly-unknown value behaves mostly the same as a known value whose attribute values are all unknown, I think it's safe to extrapolate that Terraform Core would already behave reasonably if given a wholly-unknown object during planning.

I would expect Terraform Core to transform a wholly-unknown PlanResourceChange response into a partially-unknown object which includes all of the values that originated in the configuration; the contract for provider behavior with managed resources is that they are not allowed to override values the user provided directly in the configuration, because otherwise Terraform would already be forced to treat a not-yet-resolved managed resource instance as entirely unknown, and that would lead to relatively useless plans.

`ImportResourceState`

This operation is quite easy to forget because today it happens as an entirely-separate command terraform import, outside of the normal plan/apply flow. We do want to improve on that in later releases though, so here I will treat it as if it were something happening along with other operations in the same plan.

The import operation already runs into challenges when interacting with incomplete provider configurations, and there's at least one issue already open here about that. Doing import as part of plan and apply creates the opportunity to treat it potentially more like ReadDataSource, where Terraform Core could ask the provider to attempt an import, but allow it to return a wholly-unknown object if the provider configuration isn't complete enough to actually support it.

Terraform Core could then potentially defer the actual import until the apply step, but that would come at the expense of a very ambiguous plan which would not be able to distinguish between the following situations that have significantly different consequences:

Object gets imported, Terraform sees that it matches the desired state given in the configuration, and so takes no further action against it.
Object gets imported, Terraform sees that it differs from the desired state given in the configuration, and so updates or replaces the object to match per the normal provider rules.

In the latter case, the user would seem to have no opportunity to review the changes proposed by the provider, because we'd need to defer calling ReadResource and PlanResourceChange until the apply step. To frame it in Resource Instance Change Lifecycle terms: the Initial Planned State is effectively a wholly-unknown object, and so the "Final Planned State" would be the first meaningful artifact describing the changes that need to be made, and that artifact arrives too late for review during the plan step.

`ReadResource`

This operation is how Terraform resynchronizes its snapshot of how an object looked at the end of the previous run with any changes made outside of Terraform in the meantime.

Currently we don't allow ReadResource to return unknown values at all, because there is a deep assumption in Terraform that a state snapshot is always wholly-known. Allowing providers to answer "I don't know" to ReadResource (either in whole or in part) would introduce for the first time an interesting situation where the prior state can potentially be unknown, and so we'd presumably need some equivalent phrase to (known after apply) to describe that the old value of a particular attribute isn't known:

  example = (unknown) -> "Hello"

With that said, it might be defensible to optimistically treat unknown values from ReadResource as being unchanged from the previous run. This would have similar consequences to the current workaround of using -refresh=false to prevent Terraform from calling ReadResource at all in situations where a provider can't be configured enough to execute it. Perhaps it would still be worthy of a warning: "Terraform was unable to check for changes to example.foo because there are changes pending to the associated provider configuration", or something along those lines.

`UpgradeResourceState`

This one is a particularly interesting case because we've historically always called it against a configured provider, but it's also potentially problematic for a provider to rely on the remote system when performing an upgrade: the intent is for the result to be functionally-equivalent to the input but reshaped to match the latest schema, but calling out to the remote system risks accidentally changing the data to match the remote system instead, making it behave like a funny sort of ReadResource.

I don't remember the motivations for giving this operation access to provider configuration, but my guess would be that we assumed that sometimes the provider would need to fetch certain metadata about the remote object type in order to perform the upgrade, such as when the provider is dynamically reflecting the schema of a configurable-schema system like Kubernetes. In that case we'd be using the API to fetch data about the data, rather than the data itself, which is a very reasonable thing to be doing during an upgrade.

If a provider returns "unknown" from the upgrade step then that causes all of the same concerns that unknown values during ReadResource would cause, just one step sooner. UpgradeResourceState has the additional concern that we cannot just optimistically assume that the original data is unchanged in that case, because the original data presumably doesn't conform to the current schema and therefore Terraform Core would be unable to make any use of it whatsoever.

bleachbyte commented 2 years ago

@apparentlymart: appreciate all of the updates and aforementioned "thought experiments" -- it's good to know that things like this are still in active discussion.

I have a particular feature I've been following in two issues -- the latter was closed, but re-linked to this one -- so I wanted to throw a comment in to ask about its viability, as I didn't see it pointed out above:

We have a few modules that contain resources which we'd like to safeguard against destruction (i.e. with prevent_destroy = true). However, with our implementation, we only want to add that protection in specific scopes/cases. The desired functionality would thus be something like this, where var.protect_resources is a bool that defaults to true:

lifecycle {
  prevent_destroy = var.protect_resources
}

As this is in AWS, our current workaround uses that var.protect_resources bool to conditionally create IAM policies that prevent these resources' destruction (if desired), and use prevent_destroy = true on said IAM policies. We'll probably leave the lifecycle rules for the policies in place either way, but it would be nice to be able to directly block -- or not block -- the destruction of the underlying resources.

The only current alternative solution we've found would be to have the same resource declared twice -- one with prevent_destroy = true and one without -- with each resource's existence tied to a count, e.g.:

resource "aws_kms_key" "protected_key" {
  count = var.protect_resources ? 1 : 0
  lifecycle {
    prevent_destroy = true
  }
  # ...resource attributes here...
}

resource "aws_kms_key" "unprotected_key" {
  count = var.protect_resources ? 0 : 1
  lifecycle {
    prevent_destroy = false   # or just don't use the lifecycle block
  }
  # ...resource attributes here...
}

Given that this results in a lot of duplicate code -- especially for complex resources -- the original request for supporting dynamic lifecycle blocks still stands.

apparentlymart commented 2 years ago

Hi @bleachbyte!

I'm afraid I'm not really sure how what you are describing is related to what this issue is about or to what #4149 was proposing. I've not recently done any detailed thinking about what it would take to make the lifecycle rules be dynamic rather than static, but #3116 is indeed the issue representing that.

It looks like I locked it at some point in the past because people were creating notification noise for the subscribers to the issue. We sometimes do that in cases where we already have sufficient understanding of the shape of the problem and its impact, and so continued restatements of the problem become annoying for those who are waiting for more substantial updates. I think your comment here seems to just amount to a restatement of that problem too, so if you haven't already I'd suggest subscribing to that one to hear any updates from me or others on our team about it. However, as far as I know there is nobody actively working on it right now, so I would not expect anything new to appear there in the short term.

bleachbyte commented 2 years ago

Definitely already subscribed to that one -- I will keep an eye on it. Thank you for clarifying! I'm also happy to stay subscribed to this as I think it's an excellent topic to follow and am keen to see more updates/brainstorms that evolve as a result.

aidanmelen commented 2 years ago

2. e Kubernetes provider kubernetes_manifest

I too agree that it is most unfortunate that this resource cannot be used with unknown values. However, I believe this is a challenge unique to the kubernetes provider and not Terraform at large.

aidanmelen commented 2 years ago

It would be great if we didn't need a second set of resources with count to handle unknown values when resources with for_each fail. for_each is much more powerful and cleaner but this is a huge downside.

related to:

sftim commented 1 year ago

I personally would feel uncomfortable applying a plan that can't say for certain whether it will destroy existing objects.

If there were a command line argument for opting in to that level of uncertainty, maybe that would help.

boillodmanuel commented 10 months ago

Hi, I have created a related issue: https://github.com/hashicorp/terraform/issues/34391. But my concern is only about how to change the ordering of resource to be created/destroyed, and I get issues with "unknown" value. I'm not sure on how this is already described in this issue, so I just post a link to my issue as a reminder. Regards

daviewales commented 4 months ago

I have a module which optionally assigns a Key Vault role to a group, depending on whether a Key Vault ID is specified when the module is called:

resource "azurerm_role_assignment" "pipeline_key_vault" {
  # Only create the role assignment if the variable is not null
  count                = var.pipeline_key_vault_id != null ? 1 : 0
  scope                = var.pipeline_key_vault_id
  role_definition_name = "Key Vault Secrets User"
  principal_id         = azuread_group.pipeline.object_id
}

However, because the Key Vault does not yet exist, I get the error described above. I'm not sure if there is an elegant solution to this. The only workaround I can think of is to add an additional flag variable to pass to count, instead of using the non-existent Key Vault ID.

I guess the other option would be to do what the error message says, and -target the creation of the Key Vault, then re-run the plan. But this is not ideal, as it introduces manual steps into the plan if it ever needs to be re-deployed.

EDIT:

Using an object type variable is a good workaround, because you can check for the existence of the object instead of the Key Vault ID.

variable "pipeline_key_vault" {
  type = object({
    name = string
    id   = string
  })
  description = "Pipeline Key Vault object"
  default     = null
}

MoussaBangre commented 3 months ago

Any progress on that issue ?

apparentlymart commented 3 months ago

Yes, there has been considerable progress on this issue. It's under active development.

colans commented 1 month ago

Is https://github.com/hashicorp/terraform-provider-kubernetes/issues/1775 a side effect of this issue? Would it make sense to mark that one as a duplicate?

crw commented 1 month ago

Yes, although they may have work to do to close that issue after the core functionality is finished here, and that issue is already marked upstream, so I think it is OK to leave both open for now.

andrew-bulford-form3 commented 1 week ago

Yes, there has been considerable progress on this issue. It's under active development.

@apparentlymart This is great to hear! Are you able to provide a sneak peak into what changes are being introduced and how this might affect what options we have for unknown values being used in providers? Are you implementing a mechanism to allow Terraform to converge on the desired state over multiple applies, or are you solving this another way?

I'm at a fork in the road for some refactoring I'm working on - either provision an EKS cluster in one root module and bootstrap it with the kubernetes provider in another, or provision and bootstrap the cluster in the same root module with some hacky workarounds to the provider issues described in this ticket. If there are changes coming soon which will mean that provisioning and bootstrapping a cluster can be done cleanly under the same root module then I'm inclined to go for the hacky, single module option now and clean it up once these improvements are available, to avoid needing the separate root modules.

Also, is there any expectation that this work will make it possible for child resources to be deleted using the old provider configuration and created using the new one? So taking your happycloud example, if happycloud_compute_cluster.example needs recreating, the compute_deployment.example would first be destroyed using the old provider configuration (pointing at the existing cluster), the cluster recreated, then the deployment created again with the new provider configuration (pointing at the newly created cluster). Is that something being addressed by this ticket or do you see this as a separate issue?

hashicorp / terraform