terraform-google-modules / terraform-google-kubernetes-engine

Configures opinionated GKE clusters
https://registry.terraform.io/modules/terraform-google-modules/kubernetes-engine/google
Apache License 2.0

Upgrading from one module version to another causes node pools to be recreated #2127

Open koushikgongireddy opened 1 month ago

koushikgongireddy commented 1 month ago

TL;DR

Upgrading from one module version to another causes node pools to be recreated.

Expected behavior

When we upgrade GKE module versions, we are seeing breaking changes where GKE node pools are forced to be recreated.

We are currently on the old version 17.0.0 and are planning to upgrade to 30.0.0, and I see there are changes in the keepers, which cause the random_id to change, which in turn forces the node pool to be recreated.

17.0.0 - https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/v17.0.0/modules/beta-private-cluster-update-variant/cluster.tf#L313
30.0.0 - https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/v30.0.0/modules/beta-private-cluster-update-variant/cluster.tf#L500

We also tried going from 30.0.0 to 32.0.0 and the same thing happened again, as new keepers were added:
32.0.0 - https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/v32.0.0/modules/beta-private-cluster-update-variant/cluster.tf#L591

We need help on how to upgrade to higher versions without forcing the node pools to be recreated.

Observed behavior

When we run terraform plan after upgrading to 30.0.0, we see that the following resources will be replaced:

  # module.gke.module.gke.random_id.name["dev-gusc1-ee-vpc-green"] must be replaced
+/- resource "random_id" "name" {

  # module.gke.module.gke.random_id.name["dev-gusc1-ee-vpc-defnp01"] must be replaced
+/- resource "random_id" "name" {

  # module.gke.module.gke.google_container_node_pool.pools["dev-gusc1-ee-vpc-green"] must be replaced
+/- resource "google_container_node_pool" "pools" {

  # module.gke.module.gke.google_container_node_pool.pools["dev-gusc1-ee-vpc-defnp01"] must be replaced
+/- resource "google_container_node_pool" "pools" {

17.0.0 to 30.0.0

      ~ keepers     = { # forces replacement
          + "boot_disk_kms_key"           = ""
          + "enable_gcfs"                 = ""
          + "enable_gvnic"                = ""
          + "enable_integrity_monitoring" = ""
          + "enable_secure_boot"          = "true"
          + "gpu_partition_size"          = ""
          - "labels"                      = "xxxx-vpc-green,true" -> null
          + "max_pods_per_node"           = ""
          + "min_cpu_platform"            = ""
          + "placement_policy"            = ""
          + "pod_range"                   = ""
          + "spot"                        = ""
          - "tags"                        = "xxxx-vpc-green" -> null
            # (11 unchanged elements hidden)
        }

30.0.0 to 32.0.0

      ~ keepers     = { # forces replacement
          + "enable_confidential_storage" = ""
          + "gpu_driver_version"          = ""
          + "gpu_sharing_strategy"        = ""
          + "max_shared_clients_per_gpu"  = ""
          + "queued_provisioning"         = ""
            # (22 unchanged elements hidden)
        }

Terraform Configuration

module "gke" {
  source = "terraform-google-modules/kubernetes-engine/google//modules/beta-private-cluster-update-variant"

  version = "30.0.0"
  project_id                  = var.project_id
  kubernetes_version          = var.kubernetes_version
  name                        = "${var.environment}-${var.location_acronym}-${var.cluster_name_suffix}"
  region                      = var.location
  zones                       = var.zones
.
.
.
.
.
.
 node_pools = [
    {
      name               = "${var.network_name}-defnp01"
      machine_type       = var.machine_type
      enable_secure_boot = true
      node_locations     = var.node_pools_locations
.
.
.

    },
    {
      name               = "${var.network_name}-green"
      machine_type       = var.machine_type
      enable_secure_boot = true
      node_locations     = var.node_pools_locations
.
.
.
.

    }
]
}

Terraform Version

We are currently on:

terraform {
  required_version = "1.5.5"

  required_providers {
    google = "3.90.1"
  }
}

Planning to move to:

terraform {
  required_version = "1.5.5"

  required_providers {
    google = "5.12.0"
  }
}

Also, the module version we are using is 17.0.0, and we are planning to move to 30.0.0 or 32.0.0.

Additional information

No response

koushikgongireddy commented 1 month ago

@apeabody @aaron-lane @bharathkkb Can you please take a look at this issue and let me know how to work around it?

I see a similar issue was previously opened and closed without resolution: https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/issues/1773

aaron-lane commented 1 month ago

Sorry, I have not maintained these modules for years.

apeabody commented 1 month ago

Hi @koushikgongireddy

Any change to the keepers will normally result in nodepool replacement. To avoid this, in some cases, it may be possible to edit your remote state with the new keeper values. Here is an example of updating a keeper value in the v24.0 upgrade guide: https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/master/docs/upgrading_to_v24.0.md#update-variant-random-id-keepers-updated
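
For illustration, here is a rough sketch of that kind of state edit (the keeper keys and values below are placeholders based on the plan output in this issue; take the exact keys and values from your own plan and the relevant upgrade guide, and keep a backup of the state):

# Back up and pull the current state.
terraform state pull > state.json
cp state.json state.json.backup

# Add the new keeper keys to each random_id instance so they already match
# what the new module version will compute. Copy the keys/values from the
# "+" lines in your plan output; keepers shown with "-" would need to be
# deleted (e.g. with jq's del()) as well.
jq '(.resources[]
     | select(.type == "random_id" and .name == "name")
     | .instances[].attributes.keepers) += {
       "boot_disk_kms_key": "",
       "enable_gcfs": "",
       "enable_gvnic": "",
       "enable_integrity_monitoring": "",
       "enable_secure_boot": "true"
     }' state.json > state.updated.json

# terraform state push refuses a state whose serial is not newer than the
# remote copy, so bump the serial before pushing.
jq '.serial += 1' state.updated.json > state.final.json
terraform state push state.final.json

# The plan should no longer want to replace the random_id / node pools.
terraform plan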

v17 to v33 is a substantial jump, so be sure to review the upgrade guides for each major version in between (the docs/upgrading_to_vXX.0.md files in the repository).

koushikgongireddy commented 1 month ago

Thanks for the update @apeabody

Yeah, we reviewed the changes from 17.0.0 to 30.0.0 and they don't affect our workloads too much, so we are good!

So for the keepers, is there any alternative other than updating the state file? If it's a one-off we can do it, but if it keeps coming up from version to version it becomes a tedious task when managing tens of clusters. Currently we have 25+ GKE clusters!

So I want to check whether there is any solution other than updating the state file!

Thanks Koushik

apeabody commented 1 month ago

Updating the state file should only be required when keepers are modified AND you want to avoid replacing nodepools. While keepers don't change in every major release, there have been a number of changes since v17 was released 3 years ago.
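
If it helps to see where a given cluster stands before upgrading, the keepers currently recorded in state for those random_id resources can be listed and compared against the keepers in the new module version (a small illustrative query, assuming jq is available and the random_id resources are named "name" as in the plan output above):

terraform state pull | jq '
  .resources[]
  | select(.type == "random_id" and .name == "name")
  | .instances[]
  | {node_pool: .index_key, keepers: .attributes.keepers}'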

koushikgongireddy commented 1 month ago

@apeabody - But I do see keeper changes from 30.0.0 to 32.0.0. I added them to the description above as well!

We tried moving from 30.0.0 to 32.0.0 on one dev cluster and we are again seeing keeper changes, so if this keeps repeating it is a pain point for us, since we manage 25+ clusters.

apeabody commented 1 month ago

Hi @koushikgongireddy - Curious, is there a reason the nodepool can't be recreated, especially on a dev cluster?

Part of the challenge is that these new node pool arguments can force re-creation at the provider level (for example, enable_confidential_storage has ForceNew: true; see https://github.com/hashicorp/terraform-provider-google/blob/main/google/services/container/node_config.go#L753).
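
For reference, a quick (illustrative) way to audit which node_config fields the provider marks as ForceNew, using the file linked above:

# Clone the provider and list the ForceNew markers in node_config.go; the
# surrounding lines show which schema field each one belongs to.
git clone --depth 1 https://github.com/hashicorp/terraform-provider-google.git
grep -n -B 6 "ForceNew" terraform-provider-google/google/services/container/node_config.go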

koushikgongireddy commented 1 month ago

@apeabody - We are fine with dev clusters, but for PROD clusters it means downtime, because the new node pool is created and the old one is deleted right away, which causes downtime for us.

So updating the state file is an option, but doing it for 20 PROD clusters is difficult for us, so we want to see if there is any other alternative, since we are using the external module provided by GCP.

KRASSUSS commented 1 month ago

Also experiencing the issue. I'm upgrading from module v32.0.0 to v33.0.4. From the TF plan I can see the following change on the node pools: it wants to add the gcfs_config block even when it's set to false:

      + gcfs_config {
          + enabled = false
        }

This is the only change at the node pool level, which makes me think this is what is causing the nodes to be recreated?

wyardley commented 1 month ago

@KRASSUSS w/r/t the gcfs_config diff specifically, I think updating to provider version >= 6.4.0 or 5.44.1 should resolve it, as long as you're also on the latest module version. If you're still seeing it with the latest provider and latest module version, maybe file a separate issue, but I think you should be good.

If it's a provider version where it's showing that diff but is not forcing recreation on a node pool, it should also be safe to apply, but you'll probably see a permadiff until the provider is upgraded.

wyardley commented 1 month ago

@koushikgongireddy you could maybe try removing the item from state (after backing up state, of course) and see if reimporting the cluster and nodepools works in your lower envs? Or make sure there aren't any config changes you have to add to the settings (e.g., formerly unsupported values that are now supported). I'm not super familiar with keepers, but it's possible that if you match the existing values in the configuration properly, you won't see a diff?

For example, the labels going from set to null makes me think there are some values that might need to be reflected in your Terraform configs?
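
To make that concrete, a rough sketch of the remove-and-reimport experiment (addresses are the ones from the plan output above; the project/location/cluster values are placeholders, and this is only something to try in a lower environment after backing up state):

# Back up state first.
terraform state pull > backup.tfstate

# Remove the node pool and its random_id from state without touching the
# real infrastructure.
terraform state rm 'module.gke.module.gke.random_id.name["dev-gusc1-ee-vpc-green"]'
terraform state rm 'module.gke.module.gke.google_container_node_pool.pools["dev-gusc1-ee-vpc-green"]'

# Re-import the existing node pool. The google provider accepts node pool IDs
# of the form {project}/{location}/{cluster}/{node_pool_name}; use the real
# (suffixed) node pool name as shown in the console or gcloud.
terraform import 'module.gke.module.gke.google_container_node_pool.pools["dev-gusc1-ee-vpc-green"]' \
  my-project/us-central1/my-cluster/dev-gusc1-ee-vpc-green-abcd

# Whether the subsequent plan still wants to replace the pool because of the
# random_id / keepers is exactly what this lower-environment test would show.
terraform plan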

Part of the challenge is these new node pool arguments can force re-creation at the provider level (for example enable_confidential_storage has ForceNew: true)

I'm hoping to eventually get more of the items that either don't support in-place updates, or don't work with updates at all, fixed, at least for the default nodepool case. I would imagine that enable_confidential_storage does need to recreate the nodepool if it is actually changing settings, but if the OP doesn't have it enabled, and doesn't want to enable it, maybe they can look at explicitly setting the value to false or null?

koushikgongireddy commented 1 month ago

@wyardley The issue here is definitely the keepers.

If you look at the description above, the keepers are the major change in my tf plan; that causes the random_id to change, the random_id change causes the node pool name to change, and if the node pool name changes the node pool has to be recreated!

The issue is not that enable_confidential_storage has ForceNew: true.

As the lines below show, the keepers are changing, and that change is ultimately what forces node pool recreation:

17.0.0 - https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/v17.0.0/modules/beta-private-cluster-update-variant/cluster.tf#L313
30.0.0 - https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/v30.0.0/modules/beta-private-cluster-update-variant/cluster.tf#L500

We tried updating the state file with the new values and it works fine, but we want to know whether that is the only option, or whether there are alternative options to avoid the keeper changes!

apeabody commented 1 month ago

Only the *-update-variant modules include keepers, specifically for the node pool create-before-destroy behavior: https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/tree/master/modules/beta-public-cluster-update-variant#node-pool-update-variant.

For those not making use of the node pool create-before-destroy behavior, the other modules such as beta-private-cluster do not include the keepers and would be suggested. Otherwise, to avoid the node pool replacement with the *-update-variant modules when new keepers are introduced, the state could be modified.

koushikgongireddy commented 1 month ago

@apeabody Thanks for the update, we will definitely test with beta-private-cluster and update you!

Also, can you describe the differences between these modules in more detail? I mean, what benefits do we get with update-variant that we don't get with private-cluster?

beta-autopilot-private-cluster
beta-private-cluster-update-variant
beta-private-cluster

Thanks Koushik

apeabody commented 1 month ago

Hi @koushikgongireddy

All three of these are similar in that they enable beta features and can be configured as private clusters.

The big differences are:

beta-autopilot-private-cluster: creates an Autopilot cluster
beta-private-cluster-update-variant: creates a Standard cluster, with the Node Pool Update Variant
beta-private-cluster: creates a Standard cluster