hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/
Other
42.1k stars 9.47k forks source link

ForceNew doesn't trigger resource recreation when defined on TypeSet, which leads to Update error #34691

Closed annakhm closed 6 months ago

annakhm commented 6 months ago

Terraform Version

Terraform v1.3.7
on linux_amd64

(happens with terraform v1.6.6 as well)

Terraform Configuration Files

resource "nsxt_principal_identity" "test_pi" {
  name            = "ci_principal_identity"
  node_id         = "ci_node"
  certificate_pem = <<EOF
-----BEGIN CERTIFICATE-----
certificate PEM
-----END CERTIFICATE-----
EOF

  roles_for_path {
    path = "/orgs/default"

    roles = ["test-value-1"]
  }
}

Debug Output

https://gist.github.com/annakhm/60101ee4667f879d650a017b00418121

Expected Behavior

The resource should be recreated since roles_for_path is defined as ForceNew

Actual Behavior

The resource is not recreated. Update fails with Error: doesn't support update (since Update function is not defined) The new value is recorded in the state although the update didn't happen:

root@10:~/go/bin# terraform show
# nsxt_principal_identity.test_pi:
resource "nsxt_principal_identity" "test_pi" {
    certificate_id  = "b6c29f38-b760-4b13-b549-96b918b02e88"
    certificate_pem = <<-EOT
        -----BEGIN CERTIFICATE-----
certificate PEM  
        -----END CERTIFICATE-----
    EOT
    id              = "9fac8f28-66df-4ee5-8314-b9bd7b0c0fb9"
    is_protected    = true
    name            = "ci_principal_identity"
    node_id         = "ci_node"

    roles_for_path {
        path  = "/orgs/default"
        roles = [
            "test-value-2", <=================================
        ]
    }
}
root@10:~/go/bin# terraform plan
╷
│ Warning: Provider development overrides are in effect
│
│ The following provider development overrides are set in the CLI configuration:
│  - vmware/nsxt in /root/go/bin
│
│ The behavior may therefore not match any released version of the provider and applying changes may cause the state to
│ become incompatible with published releases.
╵
nsxt_principal_identity.test_pi: Refreshing state... [id=9fac8f28-66df-4ee5-8314-b9bd7b0c0fb9]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the
following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # nsxt_principal_identity.test_pi will be updated in-place
  ~ resource "nsxt_principal_identity" "test_pi" {
        id              = "9fac8f28-66df-4ee5-8314-b9bd7b0c0fb9"
        name            = "ci_principal_identity"
        # (4 unchanged attributes hidden)

      - roles_for_path {
          - path  = "/orgs/default" -> null
          - roles = [
              - "test-value-1",
            ] -> null
        }
      + roles_for_path {
          + path  = "/orgs/default"
          + roles = [
              + "test-value-2",
            ]
        }
    }

Plan: 0 to add, 1 to change, 0 to destroy.

Steps to Reproduce

  1. compile terraform-provider-nsxt based on commit https://github.com/vmware/terraform-provider-nsxt/commit/730330d397141745e9de20e2cd75a3f0412b18ae. Alternatively, define a resource with schema definition similar to https://github.com/vmware/terraform-provider-nsxt/blob/master/nsxt/resource_nsxt_principal_identity.go#L61 and no Update function.
  2. Apply config pasted above
  3. Change roles = ["test-value-1"] to roles = ["test-value-2"]
  4. Apply

Additional Context

My understanding is that any change in aggregate attribute (TypeList or TypeSet) should trigger resource recreation. Looks like its not the case. Defining ForceNew on roles attribute is fixing the issue.

If ForceNew on TypeSet is meaningless and was designed to only define behavior of basic types, then we have a problem with validation of Update function availability. Terraform would complain that we can omit Update function only if every attribute is ForceNew.

Even if the behavior explained above is expected, it feels like updating the intent value in state without success in Update is not intended behavior.

References

No response

apparentlymart commented 6 months ago

Thanks for reporting this, @annakhm!

I think the first thing we'll need to do here is figure out whether the behavior you've described belongs to Terraform Core or to the plugin SDK.

I suspect the latter, because ForceNew as a declarative schema flag is an SDK-level mechanism, with the SDK using it to generate the lower-level representation that Terraform Core expects to see, which is really just an array of paths that the provider thinks should have "(forces replacement)" written next to them in the UI. The presence of at least one of those in the response to PlanResourceChange is what tells Terraform Core to treat the change as a replace instead of just an update.

There must presumably therefore be some logic in the SDK which checks the ForceNew schema field against the differences detected during planning and populates the relevant field in the provider protocol response. My hunch -- not yet proven -- is that the logic currently only checks attributes (those where Elem is a schema.Schema) and not intermediate nested blocks where Elem is a nested schema.Resource object.

It's been a long time since I personally worked on the SDK so I don't really remember my way around that codebase anymore, but I did find at least one location where it seems to be doing a ForceNew-related thing only in the schema.Schema case: https://github.com/hashicorp/terraform-plugin-sdk/blob/43cfd3282307f68ea77eb4c15548100386f3a317/helper/schema/schema.go#L1258-L1287

If we can determine (with more certainty than I can currently muster) that this is a limitation of the SDK logic then we should probably move this issue over to the SDK repository, but I don't yet feel confident enough to do that based on my initial analysis here. Unfortunately if this is an SDK limitation then the outcome is likely to be only to document the limitation more clearly, since the development priority for the SDK is now only to keep existing providers working the same way they always did, and not to introduce any new capabilities.

annakhm commented 6 months ago

Thank you for looking at this so quickly @apparentlymart! I was indeed having my doubts on whether to post this under sdk or terraform core, since both seem to implement a portion the ForceNew logic. Thanks for your explanation here! I'm curious what is your take on the second issue described above - the one that results in the updated attribute being recorded in the state? This seems like terraform core issue to me. Would you suggest to open two separate issues?

annakhm commented 6 months ago

There must presumably therefore be some logic in the SDK which checks the ForceNew schema field against the differences detected during planning and populates the relevant field in the provider protocol response. My hunch -- not yet proven -- is that the logic currently only checks attributes (those where Elem is a schema.Schema) and not intermediate nested blocks where Elem is a nested schema.Resource object.

Yes, I see what you mean, I would expect ForceNew to propagate to Resource.Schema for the Resource case in the same way it is done here https://github.com/hashicorp/terraform-plugin-sdk/blob/43cfd3282307f68ea77eb4c15548100386f3a317/helper/schema/schema.go#L1274 But its possible current behavior is expected, although it seems counter-intuitive to me..

apparentlymart commented 6 months ago

The rule about requiring everything to have ForceNew when there's no Update function definitely is an SDK-enforced constraint, since Terraform Core doesn't actually distinguish each of the verbs in the same way the SDK does... there's just one PlanResourceChange function and one ApplyResourceChange function, and then the SDK decides which of the four callbacks to call into based on conditions such as whether the "old value" is null (suggesting a "create").

As a general rule, Terraform Core will always save the new state returned by the provider -- even if the provider also indicates an error -- because some providers are capable of a sort of "partial success", where perhaps the Update function is performing more than one API call and at least one already succeeded before the error was detected. In that case, the provider should return a description of the partially-updated state along with an error and then Terraform will save it to make the saved state as accurate as possible.

IIRC providers are allowed to leave the "new state" totally unset when they return an error, in which case Terraform Core would just retain the prior state on the assumption that the failure was so severe that the provider can't even describe what the situation is. However, I don't think the SDK offers a direct way to do that, because the SDK uses a design where the provider is directly mutating the planned new state (using d.Set, etc) rather than constructing a new object to return during the apply phase, and so if the provider does nothing at all then the ResourceData object already reflects the planned new state and then the SDK returns it as if it were the new state. (I believe the new plugin framework avoids this quirk by requiring the provider to build the "new state" itself from scratch as a new value, rather than starting with an already-populated object and making surgical modifications to it before returning.)

All of that said then: I think the behavior of updating the state is Terraform Core working as designed -- it's saving what the provide returned -- but it's debatable whether the provider (i.e. the SDK) should've returned an object at all if the provider-side situation was so invalid that it couldn't even call the Update function.

annakhm commented 6 months ago

Thanks @apparentlymart. Sounds like I should move this issue to SDK then :)

github-actions[bot] commented 5 months ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.