terraform-google-modules / terraform-google-kubernetes-engine

Configures opinionated GKE clusters
https://registry.terraform.io/modules/terraform-google-modules/kubernetes-engine/google
Apache License 2.0

Previously null gcfs_config forcing replacement with false #2085

Closed: JerkyTreats closed this issue 2 months ago

JerkyTreats commented 2 months ago

TL;DR

Where we previously had no gcfs_config (which I assume was null), Terraform plans now force node_pool replacement with gcfs_config enabled = false.

The docs say this is optional, so shouldn't null be equivalent to false?

Image streaming for the node_pool is confirmed Disabled in the Google console.

          + gcfs_config { # forces replacement
              + enabled = false # forces replacement
            }

I'm assuming this is due to some change in 33.0.0?

Expected behavior

No destructive node pool replacement where gcfs_config was previously null and is now explicitly false

Observed behavior

Destructive node pool replacement.

Terraform Configuration

module "gke" {
  source     = "terraform-google-modules/kubernetes-engine/google//modules/private-cluster"
  version    = "33.0.0"
}
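
For context, the report's module block omits its required inputs. A fuller configuration in which no node pool sets enable_gcfs might look like the following sketch (the project/network values and node_pools shape are illustrative, not from the original report):

```hcl
module "gke" {
  source  = "terraform-google-modules/kubernetes-engine/google//modules/private-cluster"
  version = "33.0.0"

  # Illustrative values; the required inputs were elided in the report above.
  project_id        = "my-project"
  name              = "example-cluster"
  region            = "us-central1"
  network           = "vpc-01"
  subnetwork        = "subnet-01"
  ip_range_pods     = "pods"
  ip_range_services = "services"

  node_pools = [
    {
      name         = "default-node-pool"
      machine_type = "e2-standard-4"
      # enable_gcfs is intentionally not set here; this is the "previously
      # null" case for which the plan nonetheless proposed a replacement
      # with gcfs_config { enabled = false }.
    },
  ]
}
```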

Terraform Version

1.9.5

Additional information

No response

apeabody commented 2 months ago

Hi @JerkyTreats - Which version of the Google Terraform Provider are you using?

Context: Looks like the gcfs_config replacement should have been resolved in v6.2.0: https://github.com/hashicorp/terraform-provider-google/pull/19365. Otherwise I suspect we will also see permadrift in the reverse direction.

JerkyTreats commented 2 months ago

Ah, we are running provider 5.44.

Google provider 6.2.0 should fix the issue.

Closing this, thanks for linking to the upstream issue.

JerkyTreats commented 2 months ago

@apeabody - I updated to v6.2.0 with 33.0.0, and the force replacement is gone, hurray.

So I applied the change and went on my way... but I checked again today and tf plan shows:

          + gcfs_config {
              + enabled = false
            }

It appears that regardless of apply, plan continuously wants to update this.

Not sure if this is on the hashicorp/terraform-provider-google side or the terraform-google-modules/terraform-google-kubernetes-engine side.

Can you confirm if this is something reproducible on your end?

apeabody commented 2 months ago

Hi @JerkyTreats - So the force replacement was fixed by the updated provider, but I suspect your remaining issue might be resolved by (upcoming) https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/pull/2093?

JerkyTreats commented 2 months ago

I see, yes, that probably fixes it.

I had expected a workaround would be to add an explicit enable_gcfs = false in the node_pool declaration, but it continues to trigger the update-in-place even with that added.

2093 should definitely fix it though.
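
For clarity, the attempted workaround described above amounted to something like this sketch (the name and machine type are illustrative):

```hcl
node_pools = [
  {
    name         = "default-node-pool"
    machine_type = "e2-standard-4"
    # Explicitly disabling image streaming was expected to match the
    # provider's view and silence the diff, but the update-in-place on
    # gcfs_config was still proposed even with this set.
    enable_gcfs  = false
  },
]
```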

wyardley commented 2 months ago

@apeabody I'm seeing new behavior with 33.0.2:

with

  enable_gcfs               = false

(which I think is the correct way, based on the module docs, and what I already had working), I'm getting:

│ Error: Incorrect attribute value type
│ 
│   on .terraform/modules/gke/modules/private-cluster/cluster.tf line 398, in resource "google_container_cluster" "primary":
│  398:           enabled = gcfs_config.value
│ 
│ Inappropriate value for attribute "enabled": bool required.

Maybe this is a result of #2093?

If I remove that line temporarily, I'm back to the plan wanting to do this again:

      ~ node_config {
          + gcfs_config {
              + enabled = false
            }
        }

Let me know if you want me to file a new issue or need more information.
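
One plausible reading of the "bool required" error above: the module builds gcfs_config with a dynamic block, and each element of for_each becomes gcfs_config.value, which is assigned directly to enabled. A simplified sketch of that pattern (not the module's exact code):

```hcl
dynamic "gcfs_config" {
  # Every element placed in for_each must be a bool, because it is
  # assigned directly to `enabled` below. If a non-bool value (e.g. an
  # object or string) ends up here, Terraform reports:
  #   Inappropriate value for attribute "enabled": bool required.
  for_each = local.enable_gcfs != null ? [local.enable_gcfs] : []
  content {
    enabled = gcfs_config.value
  }
}
```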

wyardley commented 2 months ago

@apeabody I think I have a potential fix incoming

diff --git a/autogen/main/cluster.tf.tmpl b/autogen/main/cluster.tf.tmpl
index 901de66e..0f39c57c 100644
--- a/autogen/main/cluster.tf.tmpl
+++ b/autogen/main/cluster.tf.tmpl
@@ -516,7 +516,7 @@ resource "google_container_cluster" "primary" {
       min_cpu_platform            = lookup(var.node_pools[0], "min_cpu_platform", "")
       enable_confidential_storage = lookup(var.node_pools[0], "enable_confidential_storage", false)
       dynamic "gcfs_config" {
-        for_each = lookup(var.node_pools[0], "enable_gcfs", false) ? [true] : [false]
+        for_each = lookup(var.node_pools[0], "enable_gcfs", null) != null ? [var.node_pools[0].enable_gcfs] : [false]
         content {
           enabled = gcfs_config.value
         }
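
My reading of how the proposed ternary resolves for each input (an untested interpretation):

```hcl
# enable_gcfs = true   -> for_each = [true]  -> gcfs_config { enabled = true }
# enable_gcfs = false  -> for_each = [false] -> gcfs_config { enabled = false }
# enable_gcfs omitted  -> lookup(...) returns null -> for_each = [false],
#   so the block is still emitted with enabled = false rather than being
#   skipped entirely in the null/unset case.
```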
apeabody commented 2 months ago

Thanks @wyardley - I just opened this PR (https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/pull/2095), but happy to use your PR. We should also add enable_gcfs = false to an example node pool to provide test coverage for the dynamic block.

wyardley commented 2 months ago

Yeah, mine would have done the wrong thing in the null case, I think. Yours looks good, and seems to fix the problem in a quick test.

wyardley commented 2 months ago

FWIW, I'm still getting a permadiff on this after updating to 33.0.3 with TPG 6.0.2, either with it explicitly set to false or not. Not sure if you'd like to reopen this or have me open a new issue? There's no force replacement (at least for me, maybe because of https://github.com/GoogleCloudPlatform/magic-modules/pull/11553/), but I'm still seeing a permadiff.

I'd already had enable_gcfs explicitly set to false as mentioned, but confirmed that removing it still has the plan trying to set node_config.gcfs_config.enabled to false.

I did see some potential problems with https://github.com/GoogleCloudPlatform/magic-modules/pull/11553/ in the case where the google_container_cluster default node-pool is used, but I don't think they should be at play here.

apeabody commented 2 months ago


Thanks @wyardley - When you get a chance, please open a new issue (and tag me) specific to the current permadiff, including in particular a snippet of your config that reproduces it and the proposed plan. Cheers!