terraform-google-modules / terraform-google-kubernetes-engine

Configures opinionated GKE clusters
https://registry.terraform.io/modules/terraform-google-modules/kubernetes-engine/google
Apache License 2.0

Private Cluster Config In modules/beta-private-cluster-update-variant Causes Cluster Recreation #1876

Closed TitanRob16 closed 4 months ago

TitanRob16 commented 7 months ago

TL;DR

The private_cluster_config.master_ipv4_cidr_block and private_cluster_config.private_endpoint_subnetwork values aren't recognised by Terraform after deploying a GKE cluster.

Expected behavior

I'd expect Terraform to be aware of all the resources created with the cluster resource so that when I re-run terraform apply, it doesn't see any changes.

Observed behavior

      ~ private_cluster_config {
          - enable_private_endpoint     = false -> null
          + master_ipv4_cidr_block      = "10.0.0.0/28" # forces replacement
          + peering_name                = (known after apply)
          ~ private_endpoint            = "10.0.0.2" -> (known after apply)
          - private_endpoint_subnetwork = "projects/<project>/regions/europe-west1/subnetworks/gke-<cluster_name>-<random_id>-pe-subnet" -> null # forces replacement
          ~ public_endpoint             = "xx.xxx.xxx.xxx" -> (known after apply)
            # (1 unchanged attribute hidden)

            # (1 unchanged block hidden)
        }

Terraform Configuration

module "shared_env_project_gke_clusters" {
  for_each = local.shp.gke_clusters
  source   = "terraform-google-modules/kubernetes-engine/google//modules/beta-private-cluster-update-variant"
  version  = "~> 30.0"

  project_id                 = local.shp.project_id
  name                       = each.key
  region                     = each.value.region
  zones                      = data.google_compute_zones.shared_project_gke[each.key].names
  network                    = each.value.vpc.name
  subnetwork                 = each.value.vpc.subnet
  master_authorized_networks = each.value.master_authorized_networks
  master_ipv4_cidr_block     = each.value.vpc.master_ipv4_cidr_block
  ip_range_pods              = "pods"
  ip_range_services          = "services"
  release_channel            = each.value.auto_upgrade == true ? "REGULAR" : "UNSPECIFIED"
  add_cluster_firewall_rules = true
  disable_default_snat       = true
  enable_private_endpoint    = false
  enable_private_nodes       = true
  create_service_account     = false
  service_account            = each.value.service_account != null ? each.value.service_account == "shared_project_name-gke" ? replace(each.value.service_account, "shared_project_name", "${local.shp.project_name}") : each.value.service_account : "${local.shp.project_number}-compute@developer.gserviceaccount.com"
  remove_default_node_pool   = true

  # Addons
  gce_pd_csi_driver      = true
  config_connector       = true
  enable_cost_allocation = true

  cluster_resource_labels = {
    component = "gke"
  }

  node_pools = [for k, v in each.value.nodepools :
    {
      name               = k
      machine_type       = v.machine_type
      node_locations     = join(",", data.google_compute_zones.shared_project_gke[each.key].names)
      initial_node_count = v.initial_node_count
      min_count          = null
      max_count          = null
      total_min_count    = v.total_min_count
      total_max_count    = v.total_max_count
      local_ssd_count    = v.local_ssd_count
      disk_size_gb       = v.disk_size_gb
      disk_type          = v.disk_type
      image_type         = "COS_CONTAINERD"
      auto_repair        = true
      auto_upgrade       = true
      preemptible        = v.preemptible
    }
  ]

  node_pools_oauth_scopes = {
    all = [
      "https://www.googleapis.com/auth/cloud-platform",
    ]
  }

  node_pools_labels = merge({
    for k, v in each.value.nodepools :
    k => {
      (k) = true
    }
    },
    {
      all = {}
  })

  node_pools_taints = {
    for k, v in each.value.nodepools :
    k => [{
      key    = k
      value  = true
      effect = "PREFER_NO_SCHEDULE"
    }]
  }
}

Terraform Version

Terraform v1.6.6

Additional information

I suspect this is happening because GKE automatically creates the subnet _projects//regions/europe-west1/subnetworks/gke---pe-subnet_ in the background, which Terraform doesn't know about.

The only way I've been able to get around this is to fork the repo and add private_cluster_config to the ignore_changes list in the lifecycle block of the cluster resource:

  lifecycle {
    ignore_changes = [
      node_pool,
      initial_node_count,
      resource_labels["asmv"],
      private_cluster_config
    ]
  }

I'm also using Terragrunt to manage the provider versions:

terraform {
  required_version = "~> 1.6.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.16.0"
    }
    google-beta = {
      source  = "hashicorp/google-beta"
      version = "~> 5.16.0"
    }
  }
}
rwkarg commented 7 months ago

It appears that the appropriate fix on the Terraform side is to disallow specifying master_ipv4_cidr_block and force using a pre-existing subnet in private_endpoint_subnetwork instead. Otherwise there doesn't seem to be any way to make Terraform behave appropriately, given the new behavior of the backend provisioning.
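
For illustration, a minimal sketch of that shape against the raw google_container_cluster resource rather than this module's inputs; every project, network, subnet name, and CIDR below is a placeholder, and the beta provider is assumed only to mirror the module:

# Sketch only: pre-create the control-plane endpoint subnet and pass it via
# private_cluster_config.private_endpoint_subnetwork instead of setting
# master_ipv4_cidr_block. All names, projects, and CIDRs are placeholders.
resource "google_compute_subnetwork" "gke_control_plane" {
  name          = "gke-control-plane-endpoint"
  project       = "my-project"
  region        = "europe-west1"
  network       = "my-vpc"
  ip_cidr_range = "10.0.0.0/28"
}

resource "google_container_cluster" "example" {
  provider           = google-beta
  name               = "example"
  project            = "my-project"
  location           = "europe-west1"
  network            = "my-vpc"
  subnetwork         = "my-nodes-subnet"
  initial_node_count = 1

  ip_allocation_policy {
    cluster_secondary_range_name  = "pods"
    services_secondary_range_name = "services"
  }

  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false

    # Terraform owns this value, so the API echoing it back is not a diff.
    private_endpoint_subnetwork = google_compute_subnetwork.gke_control_plane.id
  }
}

Because the subnet is declared in configuration, the private_endpoint_subnetwork value the API reports back should match what Terraform already expects, so it shouldn't show up as a forced replacement.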

agaudreault commented 6 months ago

I have the same issue and, instead of forking the module, I added a sed command (macOS syntax) after the init command until this is fixed.

# Patches the vendored module copy, so it must be re-run after every terraform init.
sed -i '' 's|ignore_changes = \[node_pool, initial_node_count, resource_labels\["asmv"\]\]|ignore_changes = [node_pool, initial_node_count, resource_labels["asmv"], private_cluster_config]|' ./.terraform/modules/<YOUR_MODULE_NAME>/modules/beta-private-cluster-update-variant/cluster.tf
Shaked commented 6 months ago

Hi @braveokafor, @agaudreault, @TitanRob16

I have experienced the same issue. Is it enough to use lifecycle/ignore_changes, though? When I describe my clusters, I see that they have different privateClusterConfig blocks.

First private cluster:

privateClusterConfig:
  enablePrivateNodes: true
  masterGlobalAccessConfig:
    enabled: true
  masterIpv4CidrBlock: 172.16.0.0/28
  peeringName: gke-<REDACTED>-dfcf-2ee3-peer
  privateEndpoint: 172.16.0.2
  publicEndpoint: <REDACTED>

Second cluster:

privateClusterConfig:
  enablePrivateNodes: true
  masterGlobalAccessConfig:
    enabled: true
  privateEndpoint: 172.16.0.34
  privateEndpointSubnetwork: projects/<REDACTED>/regions/europe-west1/subnetworks/gke-<REDACTED>-pe-subnet
  publicEndpoint: <REDACTED>

You can see that the second one doesn't have a peeringName nor a masterIpv4CidrBlock.

braveokafor commented 6 months ago

Hi @Shaked,

I haven't experienced any issues thus far.

The masterIpv4CidrBlock from version 29.0 of the module is now the ipCidrRange in the auto-created privateEndpointSubnetwork in version 30.0.

# Cluster 1
$ gcloud container clusters describe <REDACTED> --location europe-west2 | yq '.privateClusterConfig'
enablePrivateEndpoint: true
enablePrivateNodes: true
masterGlobalAccessConfig:
  enabled: true
masterIpv4CidrBlock: 10.1.0.0/28
peeringName: gke-<REDACTED>-peer
privateEndpoint: 10.1.0.2
publicEndpoint: <REDACTED>
# Cluster 2
$ gcloud container clusters describe <REDACTED> --location europe-west2 | yq -r '.privateClusterConfig'
enablePrivateNodes: true
masterGlobalAccessConfig:
  enabled: true
privateEndpoint: 10.1.0.2
privateEndpointSubnetwork: projects/<REDACTED>/regions/europe-west2/subnetworks/gke-<REDACTED>-pe-subnet
publicEndpoint: <REDACTED>
# Subnet
$ gcloud compute networks subnets describe gke-<REDACTED>-pe-subnet --region europe-west2 | yq '.ipCidrRange'
10.1.0.0/28
github-actions[bot] commented 4 months ago

This issue is stale because it has been open 60 days with no activity. Remove the stale label or comment, or this will be closed in 7 days.

MaesterZ commented 1 month ago

Might be related to https://cloud.google.com/kubernetes-engine/docs/concepts/network-overview#control-plane

Control plane

In Kubernetes, the control plane manages the control plane processes, including the Kubernetes API server. How you access the control plane depends on the version of your GKE Autopilot or Standard cluster.

Clusters with Private Service Connect

Private or public clusters that meet any of the following conditions use Private Service Connect to privately connect nodes and the control plane:

- New public clusters in version 1.23 on or after March 15, 2022.
- New private clusters in version 1.29 after January 28, 2024.

Existing public clusters that don't meet the preceding conditions are being migrated to Private Service Connect. Therefore, these clusters might already use Private Service Connect. To check if your cluster uses Private Service Connect, run the gcloud container clusters describe command. If your public cluster uses Private Service Connect, the privateClusterConfig resource has the following values:

- The peeringName field is empty or doesn't exist.
- The privateEndpoint field has a value assigned.

However, existing private clusters that don't meet the preceding conditions are not migrated yet.

You can create clusters that use Private Service Connect and change the cluster isolation.

Use authorized networks to restrict the access to your cluster's control plane by defining the origins that can reach the control plane.

Private Service Connect resources that are used for GKE clusters are hidden.

⚠️ Warning: Public clusters with Private Service Connect created before January 30, 2022 use a Private Service Connect endpoint and forwarding rule. Both resources are named gke-[cluster-name]-[cluster-hash:8]-[uuid:8]-pe and permit the control plane and nodes to privately connect. GKE creates these resources automatically with no cost. If you remove these resources, cluster network issues including downtime will occur.
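
For reference, a quick way to check a cluster against the conditions quoted above (CLUSTER_NAME and LOCATION are placeholders, and the field projection is just one way to trim the output):

gcloud container clusters describe CLUSTER_NAME --location LOCATION \
  --format="yaml(privateClusterConfig.peeringName,privateClusterConfig.privateEndpoint,privateClusterConfig.privateEndpointSubnetwork)"

An empty or missing peeringName together with a populated privateEndpoint (and, on newer clusters, a privateEndpointSubnetwork) indicates the control plane is reached over Private Service Connect, while a populated peeringName indicates the older VPC-peering-based control plane.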