rancher / terraform-provider-rancher2

Terraform Rancher2 provider
https://www.terraform.io/docs/providers/rancher2/
Mozilla Public License 2.0

Provider produced inconsistent final plan, produced an invalid new value for .rke_config[0].machine_pools[0].cloud_credential_secret_name: was cty.StringVal(""), but now cty.StringVal("cattle-global-data:cc-694ng"). #835

Closed andersjohansson2021 closed 1 year ago

andersjohansson2021 commented 2 years ago

When terraforming an RKE2 cluster I receive the following:

│ Error: Provider produced inconsistent final plan
│
│ When expanding the plan for rancher2_cluster_v2.test-cluster to include new values learned so far during apply, provider
│ "registry.terraform.io/rancher/rancher2" produced an invalid new value for
│ .rke_config[0].machine_pools[1].cloud_credential_secret_name: was cty.StringVal(""), but now
│ cty.StringVal("cattle-global-data:cc-trrz8").
│
│ This is a bug in the provider, which should be reported in the provider's own issue tracker.

This happens in version 1.22.1 of the provider, but not in 1.21.0. /Anders.

SURE-5412 SURE-4866

frouzbeh commented 2 years ago

Any update on this? I'm seeing the same issue.

│ When expanding the plan for module.cluster.rancher2_cluster_v2.cluster to include new values learned so far
│ during apply, provider "registry.terraform.io/rancher/rancher2" produced an invalid new value for
│ .rke_config[0].machine_pools[0].cloud_credential_secret_name: was cty.StringVal(""), but now
│ cty.StringVal("cattle-global-data:cc-f9hbf").
│
│ This is a bug in the provider, which should be reported in the provider's own issue tracker.
andersjohansson2021 commented 2 years ago

@frouzbeh I had to use an older version, as stated above. That solved the issue for me. But having said that, this bug needs to be addressed in the provider.

frouzbeh commented 2 years ago

@andersjohansson2021 Thank you, yes, I tested with an older version and it works, but as I remember the older version had another issue with the kube-config generation function, which is fixed in 1.22.2. I hope they fix this one soon.

frouzbeh commented 2 years ago

@rawmind0 Would you please take a look at this issue? We would really like to make this work.

git-ival commented 2 years ago

+1 here as well

PrakashFromBunnings commented 2 years ago

Hello, is there any workaround or fix for this issue? I am stuck on it.

PrakashFromBunnings commented 2 years ago

> @frouzbeh I had to use an older version, as stated above. That solved the issue for me. But having said that, this bug needs to be addressed in the provider.

The older version brings other issues, like missing or unsupported arguments, etc.

nfsouzaj commented 2 years ago

Hello, this issue is also causing problems in my deployments. Is there a commitment to fix it?

anupama2501 commented 2 years ago

Reproduced the issue on v2.6-head (c54b655), cloud provider: Linode.

Error: Provider produced inconsistent final plan
│ 
│ When expanding the plan for rancher2_cluster_v2.rke2-cluster-tf to include new values learned so far during apply, provider "registry.terraform.io/rancher/rancher2" produced an invalid new
│ value for .rke_config[0].machine_pools[0].cloud_credential_secret_name: was cty.StringVal(""), but now cty.StringVal("cattle-global-data:<redacted>").
│ 
│ This is a bug in the provider, which should be reported in the provider's own issue tracker.
╵
izaac commented 2 years ago

As a workaround, I tried this and it worked: create a cloud credential and then grab the cloud credential ID using a data block.

Example:

resource "rancher2_cloud_credential" "rancher2_cloud_credential" {
  name = var.cloud_credential_name
  amazonec2_credential_config {
    access_key = var.aws_access_key
    secret_key = var.aws_secret_key
    default_region = var.aws_region
  }
}
data "rancher2_cloud_credential" "rancher2_cloud_credential" {
  name = var.cloud_credential_name
}

Then use data.rancher2_cloud_credential.rancher2_cloud_credential.id in rancher2_cluster_v2 machine configs.
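
A minimal sketch of that reference (the cluster name, Kubernetes version, and the rancher2_machine_config_v2.foo machine config below are placeholders, not taken from this thread):

```
resource "rancher2_cluster_v2" "example" {
  name               = "example-rke2"
  kubernetes_version = "v1.24.4+rke2r1"   # placeholder version

  rke_config {
    machine_pools {
      name = "pool1"
      # Reference the data source rather than the resource, so the secret
      # name is already known at plan time once the credential exists.
      cloud_credential_secret_name = data.rancher2_cloud_credential.rancher2_cloud_credential.id
      control_plane_role           = true
      etcd_role                    = true
      worker_role                  = true
      quantity                     = 1
      machine_config {
        kind = rancher2_machine_config_v2.foo.kind   # placeholder machine config
        name = rancher2_machine_config_v2.foo.name
      }
    }
  }
}
```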

Note: it seems this only works if the cloud credential has been created beforehand.

chfrank-cgn commented 2 years ago

Still present in version 1.24.2 when creating RKE2 downstream clusters on Azure:

Error: Provider produced inconsistent final plan

When expanding the plan for rancher2_cluster_v2.cluster_az to include new values learned so far during apply, provider "registry.terraform.io/rancher/rancher2" produced an invalid new value for .rke_config[0].machine_pools[0].cloud_credential_secret_name: was cty.StringVal(""), but now cty.StringVal("cattle-global-data:cc-ffs8c").

This is a bug in the provider, which should be reported in the provider's own issue tracker.

Error: Provider produced inconsistent final plan

When expanding the plan for rancher2_cluster_v2.cluster_az to include new values learned so far during apply, provider "registry.terraform.io/rancher/rancher2" produced an invalid new value for .rke_config[0].machine_pools[0].name: was cty.StringVal(""), but now cty.StringVal("pool-b94345").

This is a bug in the provider, which should be reported in the provider's own issue tracker.

Looking at #878, I don't believe it will fix both plan inconsistencies.

matttrach commented 1 year ago

Still seeing this error on v1.25.0:

│ 
│ When expanding the plan for rancher2_cluster_v2.utility to include new values learned so far during apply, provider "registry.terraform.io/rancher/rancher2"
│ produced an invalid new value for .rke_config[0].machine_pools[0].cloud_credential_secret_name: was cty.StringVal(""), but now
│ cty.StringVal("cattle-global-data:cc-pmzs7").
│ 
│ This is a bug in the provider, which should be reported in the provider's own issue tracker.

It usually works the second time.

sebracs commented 1 year ago

Running terraform apply a second time also consistently works for me.

On terraform destroy I also had a dependency issue, but that could be fixed by adding:

depends_on = [rancher2_cloud_credential.my_cloud_credential]

to the rancher2_cluster_v2 resource.

Is there another depends_on-like or sleep-like thing you could do to get the apply working on the first try?
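
A minimal sketch of both ideas, with placeholder names throughout (my_cloud_credential, my_cluster, my_config, the version string); the time_sleep resource comes from the separate hashicorp/time provider and is only a guess at a "sleep-like" mitigation, not something confirmed to help here:

```
# Optional pause after the credential exists (hashicorp/time provider).
resource "time_sleep" "wait_for_credential" {
  depends_on      = [rancher2_cloud_credential.my_cloud_credential]
  create_duration = "30s"
}

resource "rancher2_cluster_v2" "my_cluster" {
  name               = "my-cluster"          # placeholder
  kubernetes_version = "v1.24.4+rke2r1"      # placeholder

  rke_config {
    machine_pools {
      name                         = "pool1"
      cloud_credential_secret_name = rancher2_cloud_credential.my_cloud_credential.id
      control_plane_role           = true
      etcd_role                    = true
      worker_role                  = true
      quantity                     = 1
      machine_config {
        kind = rancher2_machine_config_v2.my_config.kind   # placeholder machine config
        name = rancher2_machine_config_v2.my_config.name
      }
    }
  }

  # The explicit dependency that fixed the destroy ordering, with the sleep
  # chained in as a possible first-apply mitigation.
  depends_on = [
    rancher2_cloud_credential.my_cloud_credential,
    time_sleep.wait_for_credential,
  ]
}
```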

chfrank-cgn commented 1 year ago

Yes - I can confirm that it works the second time, most likely because the credential is already there from the first try. Thanks for the hint about the dependency!

moshiaiz commented 1 year ago

Facing this as well. Any plan to fix this bug?

As noted here, the only workaround is to create the cloud credential before running terraform apply, either in a separate apply or manually via the UI. Otherwise my automation for creating clusters fails.

a-blender commented 1 year ago

Hello @moshiaiz,

I am working on this. Thank you all for your patience.

Terraform rancher2 provider with rancher 2.7 builds are currently blocked for us due to https://github.com/rancher/terraform-provider-rancher2/issues/1052. We need to branch and fix our build before I can reproduce this issue.

From my investigation, there is indeed a bug in the way the provider processes the value for .rke_config[0].machine_pools[0].cloud_credential_secret_name. From the Terraform docs, this field exists in both the cluster_v2 resource and its machine pools, but offhand I will need to find out why it's present in the machine pool. When connecting to a Rancher instance, only one cloud credential is needed, so it may be a duplicate field.

This old PR is a potential fix: https://github.com/rancher/terraform-provider-rancher2/pull/878. It should also fix https://github.com/rancher/terraform-provider-rancher2/issues/915, since RKE is being installed on vSphere there and that appears to be a bug in the Terraform RKE config.

a-blender commented 1 year ago

The main PR for https://github.com/rancher/terraform-provider-rancher2/issues/1052 has been merged and the TF build is fixed. Testing is unblocked. Trying to reproduce this for an RKE2 cluster on provider version 1.25.0.

a-blender commented 1 year ago

Reproduced this issue on an Amazon EC2 RKE2 cluster with TF provider 1.25.0.

main.tf

```
terraform {
  required_providers {
    rancher2 = {
      source  = "rancher/rancher2"
      version = "1.25.0"
    }
  }
}

provider "rancher2" {
  api_url   = var.rancher_api_url
  token_key = var.rancher_admin_bearer_token
  insecure  = true
}

# Create amazonec2 cloud credential
resource "rancher2_cloud_credential" "foo" {
  name = "foo"
  amazonec2_credential_config {
    access_key = var.aws_access_key
    secret_key = var.aws_secret_key
  }
}

# Create amazonec2 machine config v2
resource "rancher2_machine_config_v2" "foo" {
  generate_name = "ablender-machine"
  amazonec2_config {
    ami            = var.aws_ami
    region         = var.aws_region
    security_group = [var.aws_security_group_name]
    subnet_id      = var.aws_subnet_id
    vpc_id         = var.aws_vpc_id
    zone           = var.aws_zone_letter
  }
}

# Create a new rancher v2 amazonec2 RKE2 Cluster v2
resource "rancher2_cluster_v2" "ablender-rke2" {
  name                                      = var.rke2_cluster_name
  kubernetes_version                        = "v1.25.6-rancher1-1"
  enable_network_policy                     = false
  default_cluster_role_for_project_members  = "user"
  cloud_credential_secret_name              = rancher2_cloud_credential.foo.id
  rke_config {
    machine_pools {
      name                         = "pool1"
      cloud_credential_secret_name = rancher2_cloud_credential.foo.id
      control_plane_role           = true
      etcd_role                    = true
      worker_role                  = true
      quantity                     = 1
      machine_config {
        kind = rancher2_machine_config_v2.foo.kind
        name = rancher2_machine_config_v2.foo.name
      }
    }
  }
}
```

The error showed up on the first terraform apply.


It worked when running terraform apply a second time, as posted above, so that is a valid workaround.

a-blender commented 1 year ago

Investigation

After more digging, I've discovered that this error is the same as, or similar to, a very common error in the Terraform AWS provider (https://github.com/hashicorp/terraform-provider-aws/issues/19583) that has been very active over the past two years and that Hashicorp refuses to acknowledge or fix.

I discovered this error in the TF debug logs

2023-02-10T13:43:54.804-0500 [WARN]  Provider "terraform.example.com/local/rancher2" produced an invalid plan for rancher2_cluster_v2.ablender-rke2, but we are tolerating it because it is using the legacy plugin SDK.
    The following problems may be the cause of any confusing errors from downstream operations:
      - .fleet_namespace: planned value cty.StringVal("fleet-default") for a non-computed attribute
      - .rke_config[0].machine_selector_config: attribute representing nested block must not be unknown itself; set nested attribute values to unknown instead
      - .rke_config[0].etcd: attribute representing nested block must not be unknown itself; set nested attribute values to unknown instead
      - .rke_config[0].machine_pools[0].cloud_credential_secret_name: planned value cty.StringVal("") does not match config value cty.UnknownVal(cty.String)

From poking around, and according to Hashicorp (https://discuss.hashicorp.com/t/context-around-the-log-entry-tolerating-it-because-it-is-using-the-legacy-plugin-sdk/1630), most of these warnings are due to an expected SDK compatibility quirk, but the entry for cloud_credential_secret_name is what causes the apply to fail.

The full error ends with this:

2023-02-10T13:26:42.997-0500 [ERROR] vertex "rancher2_cluster_v2.ablender-rke2" error: Provider produced inconsistent final plan
╷
│ Error: Provider produced inconsistent final plan
│ 
│ When expanding the plan for rancher2_cluster_v2.ablender-rke2 to include new values learned so far during apply,
│ provider "terraform.example.com/local/rancher2" produced an invalid new value for
│ .rke_config[0].machine_pools[0].cloud_credential_secret_name: was cty.StringVal(""), but now
│ cty.StringVal("cattle-global-data:cc-mzjcm").
│ 
│ This is a bug in the provider, which should be reported in the provider's own issue tracker.

Root cause

Something in the terraform-plugin-sdk backend is computing a planned value of "" for cloud_credential_secret_name when the attribute is set to Required and given as a string in the config file. This cannot be fixed in the Terraform provider itself; it appears to be a bug in the SDK the provider is using.

Fix

I tried updating the Terraform plugin SDK and that did not work, but setting the machine pool cloud_credential_secret_name as Optional does fix it. This patch allows us to retain parity between Rancher and the Terraform provider and may be the most viable option for the scores of customers who have been running into this issue every few weeks. I will update my draft PR shortly.
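
For illustration only, here is what that schema change would look like from the configuration side, using placeholder names: with the machine-pool attribute Optional, a config could set the credential once at the cluster level and omit it in the pool (whether provisioning works that way still depends on the final fix):

```
resource "rancher2_cluster_v2" "example" {
  name                         = "example-rke2"                    # placeholder
  kubernetes_version           = "v1.25.6+rke2r1"                  # placeholder
  cloud_credential_secret_name = rancher2_cloud_credential.foo.id  # cluster-level credential

  rke_config {
    machine_pools {
      name = "pool1"
      # With the attribute Optional, this pool-level line could be left out:
      # cloud_credential_secret_name = rancher2_cloud_credential.foo.id
      control_plane_role = true
      etcd_role          = true
      worker_role        = true
      quantity           = 1
      machine_config {
        kind = rancher2_machine_config_v2.foo.kind                 # placeholder
        name = rancher2_machine_config_v2.foo.name
      }
    }
  }
}
```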

a-blender commented 1 year ago

Testing template

Root cause

When creating an RKE2 cluster via Terraform on any hosted provider (Amazon EC2, Azure, Linode driver so far), Terraform computes a new value for a duplicate field cloud_credential_secret_name in the machine pool and then throws an error on a terraform apply pertaining to that value.

What was fixed, or what changes have occurred

This PR has the following fix

Areas or cases that should be tested

Test steps

main.tf

```
terraform {
  required_providers {
    rancher2 = {
      source  = "rancher/rancher2"
      version = "3.0.0"
    }
  }
}

provider "rancher2" {
  api_url   = var.rancher_api_url
  token_key = var.rancher_admin_bearer_token
  insecure  = true
}

# Create amazonec2 cloud credential
resource "rancher2_cloud_credential" "foo" {
  name = "foo"
  amazonec2_credential_config {
    access_key = var.aws_access_key
    secret_key = var.aws_secret_key
  }
}

# Create amazonec2 machine config v2
resource "rancher2_machine_config_v2" "foo" {
  generate_name = "ablender-machine"
  amazonec2_config {
    ami            = var.aws_ami
    region         = var.aws_region
    security_group = [var.aws_security_group_name]
    subnet_id      = var.aws_subnet_id
    vpc_id         = var.aws_vpc_id
    zone           = var.aws_zone_letter
    root_size      = var.aws_root_size
  }
}

# Create a new rancher v2 amazonec2 RKE2 Cluster v2
resource "rancher2_cluster_v2" "ablender-rke2" {
  name                                      = var.rke2_cluster_name
  cloud_credential_secret_name              = rancher2_cloud_credential.foo.id // test case
  kubernetes_version                        = "v1.25.6+rke2r1"
  enable_network_policy                     = false
  default_cluster_role_for_project_members  = "user"
  rke_config {
    machine_pools {
      name                         = "pool1"
      cloud_credential_secret_name = rancher2_cloud_credential.foo.id // test case
      control_plane_role           = true
      etcd_role                    = true
      worker_role                  = true
      quantity                     = 1
      machine_config {
        kind = rancher2_machine_config_v2.foo.kind
        name = rancher2_machine_config_v2.foo.name
      }
    }
  }
}
```

What areas could experience regressions?

Terraform rancher2 provider, rke1 prov

Are the repro steps accurate/minimal?

Yes.

a-blender commented 1 year ago

Blocked -- waiting on Terraform 3.0.0 for Rancher v2.7.x.

lazyfrosch commented 1 year ago

Thank you for the investigation so far. I tested 3.0.0-rc1 since I have the same problem, where the first apply fails and the second apply works.

In my case I could track it down to machine_global_config being built up with a "known after apply" value.

  rke_config {
    machine_global_config = yamlencode({
      cni = "calico"
      profile = "cis-1.6"
      tls-san = [
        module.vip_control_plane.fqdn,
      ]
    })
  }

If I remove the tls-san value, the problem doesn't happen on the first try.
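
A sketch of one possible mitigation along those lines, assuming the FQDN can be passed in as a plain variable that is already known at plan time instead of a module output resolved during apply (var.control_plane_fqdn and the cluster name/version are placeholders); this only follows from the observation above and is not a verified fix:

```
variable "control_plane_fqdn" {
  type        = string
  description = "Control-plane FQDN that is already known at plan time"
}

resource "rancher2_cluster_v2" "cluster" {
  name               = "my-cluster"          # placeholder
  kubernetes_version = "v1.24.4+rke2r1"      # placeholder

  rke_config {
    machine_global_config = yamlencode({
      cni     = "calico"
      profile = "cis-1.6"
      # A plan-time-known value keeps the yamlencode result from being
      # "known after apply", which is what triggered the error above.
      tls-san = [var.control_plane_fqdn]
    })
  }
}
```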

Anything I could test or investigate?

```
2023-02-28T16:43:00.253+0100 [WARN]  Provider "local/rancher/rancher2" produced an invalid plan for module.cluster.rancher2_cluster_v2.cluster, but we are tolerating it because it is using the legacy plugin SDK.
    The following problems may be the cause of any confusing errors from downstream operations:
      - .fleet_namespace: planned value cty.StringVal("fleet-default") for a non-computed attribute
      - .rke_config[0].machine_global_config: planned value cty.StringVal("cni: calico\nprofile: cis-1.6\ntls-san:\n- generated-fqdn.int.example.com\n") does not match config value cty.StringVal("\"cni\": \"calico\"\n\"profile\": \"cis-1.6\"\n\"tls-san\":\n- \"generated-fqdn.int.example.com\"\n")
      - .rke_config[0].machine_pools: attribute representing nested block must not be unknown itself; set nested attribute values to unknown instead
      - .rke_config[0].etcd: attribute representing nested block must not be unknown itself; set nested attribute values to unknown instead
      - .rke_config[0].machine_selector_config: attribute representing nested block must not be unknown itself; set nested attribute values to unknown instead
2023-02-28T16:43:00.253+0100 [ERROR] vertex "module.cluster.rancher2_cluster_v2.cluster" error: Provider produced inconsistent final plan
╷
│ Error: Provider produced inconsistent final plan
│
│ When expanding the plan for module.cluster.rancher2_cluster_v2.cluster to
│ include new values learned so far during apply, provider
│ "local/rancher/rancher2" produced an invalid new value for .rke_config:
│ block count changed from 0 to 1.
│
│ This is a bug in the provider, which should be reported in the provider's
│ own issue tracker.
```
a-blender commented 1 year ago

@sowmyav27 This is ready to test using Terraform rancher2 v3.0.0-rc1. Please set up local testing on the RC version of the provider with this command:

./setup-provider.sh rancher2 3.0.0-rc1
lazyfrosch commented 1 year ago

@a-blender shall I open a dedicated issue? But I assume this is a general problem for "known during apply" values.

Josh-Diamond commented 1 year ago

Ticket #835 - Test Results - ✅

With Docker on a single-node instance using Rancher v2.7-64c5188a5394f7ef7858ebb6807072ad5abe0e80-head:

Verified with rancher2 provider v3.0.0-rc2:

  1. Fresh install of rancher v2.7-head
  2. Configure a resource block for cloud credentials + provision a downstream RKE2 EC2 cluster, referencing the cloud credential resource in the nodepool configuration
  3. Verified - no errors seen; the cluster successfully provisions and destroys, as expected

Screenshots: [two screenshots from 2023-04-11 attached]