terraform-google-modules / terraform-google-kubernetes-engine

Configures opinionated GKE clusters
https://registry.terraform.io/modules/terraform-google-modules/kubernetes-engine/google
Apache License 2.0

Error / issue applying kubelet config #2014

Open wyardley opened 1 month ago

wyardley commented 1 month ago

TL;DR

See also #2013

I'm seeing a permadrift which may or may not be related to having manually (outside of tf) enabled a kubelet config setting. I am somewhat confident that before this change, I did not have a permadiff or error applying this state.

Expected behavior

The config to apply

Observed behavior

  ~ resource "google_container_node_pool" "pools" {
        id                          = "projects/xxx/locations/us-central1/clusters/yyy/nodePools/primary"
        name                        = "primary"
        # (10 unchanged attributes hidden)

      ~ node_config {
            tags                        = [
                "gke-prod-cluster-01",
                "gke-prod-cluster-01-primary",
            ]
            # (17 unchanged attributes hidden)

          - kubelet_config {
              - cpu_cfs_quota  = false -> null
              - pod_pids_limit = 0 -> null
            }

            # (2 unchanged blocks hidden)
        }

This diff shows up on plan, and then apply fails with this error:

module.gke.google_container_node_pool.pools["primary"]: Modifying... [id=projects/xxx/locations/us-central1/clusters/yyy/nodePools/primary]
╷
│ Error: googleapi: Error 400: At least one of ['node_version', 'image_type', 'updated_node_pool', 'locations', 'workload_metadata_config', 'upgrade_settings', 'kubelet_config', 'linux_node_config', 'tags', 'taints', 'labels', 'node_network_config', 'gcfs_config', 'gvnic', 'confidential_nodes', 'logging_config', 'fast_socket', 'resource_labels', 'accelerators', 'windows_node_config', 'machine_type', 'disk_type', 'disk_size_gb', 'storage_pools', 'containerd_config', 'resource_manager_tags', 'performance_monitoring_unit', 'queued_provisioning'] must be specified.
│ Details:
│ [
│   {
│     "@type": "type.googleapis.com/google.rpc.RequestInfo",
│     "requestId": "0xaf5070f5462ddf7d"
│   }
│ ]
│ , badRequest
│ 
│   with module.gke.google_container_node_pool.pools["primary"],
│   on .terraform/modules/gke/modules/private-cluster/cluster.tf line 491, in resource "google_container_node_pool" "pools":
│  491: resource "google_container_node_pool" "pools" {
│ 
╵

See further debug output below

Terraform Configuration

module "gke" {
  source                = "terraform-google-modules/kubernetes-engine/google//modules/private-cluster"
  version               = "31.1.0"
  project_id            = var.project
  name                  = "foo-cluster-01"
  service_account_name  = "foo-cluster-01"
  grant_registry_access = true
  kubernetes_version    = "1.29.6-gke.1326000"
  release_channel       = "UNSPECIFIED"
  region                = "us-central1"
  zones = [
    data.google_compute_zones.available.names[1],
    data.google_compute_zones.available.names[2],
  ]
  network = data.terraform_remote_state.network.outputs.network_name

  subnetwork = data.terraform_remote_state.network.outputs.subnets_names[0]
  ip_range_pods     = data.terraform_remote_state.network.outputs.subnets_secondary_ranges[0][0].range_name
  ip_range_services = data.terraform_remote_state.network.outputs.subnets_secondary_ranges[0][1].range_name

  horizontal_pod_autoscaling = true
  enable_private_nodes       = true

  master_authorized_networks = local.all_allowlist_ranges
  dns_cache                  = true

  remove_default_node_pool = true
  node_pools = [
    # Note: this is intentionally different from the actual default,
    # "default-pool"
    {
      name                      = "primary"
      machine_type              = var.instance_type
      total_min_count           = var.node_pool_total_min_count
      total_max_count           = var.node_pool_total_max_count
      local_ssd_count           = 0
      spot                      = false
      local_ssd_ephemeral_count = 0
      disk_size_gb              = 100
      disk_type                 = "pd-balanced"
      image_type                = "COS_CONTAINERD"
      enable_gcfs               = false
      enable_gvnic              = false
      logging_variant           = "DEFAULT"
      auto_upgrade              = false
      preemptible               = false
      # Note: this was an attempt to resolve the permadiff; fails without it too
      pod_pids_limit            = 0
    },
  ]

  node_pools_oauth_scopes = {
    # Note: use cloud platform only, and manage monitoring etc. permissions via
    # IAM
    all = [
      "https://www.googleapis.com/auth/cloud-platform",
    ]
  }
}

Terraform Version

OpenTofu v1.7.2
on darwin_arm64
+ provider registry.opentofu.org/hashicorp/external v2.3.3
+ provider registry.opentofu.org/hashicorp/google v5.37.0
+ provider registry.opentofu.org/hashicorp/kubernetes v2.31.0
+ provider registry.opentofu.org/hashicorp/null v3.2.2
+ provider registry.opentofu.org/hashicorp/random v3.6.2

Additional information

2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google: PUT /v1/projects/xxx/locations/us-central1/clusters/yyyy/nodePools/primary?alt=json&prettyPrint=false HTTP/1.1
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google: Host: container.googleapis.com
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google: User-Agent: google-api-go-client/0.5 Terraform/1.7.2 (+https://www.terraform.io) Terraform-Plugin-SDK/2.33.0 terraform-provider-google/dev blueprints/terraform/terraform-google-kubernetes-engine:private-cluster/v31.1.0
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google: Content-Length: 25
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google: Content-Type: application/json
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google: X-Goog-Api-Client: gl-go/1.21.11 gdcl/0.185.0
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google: Accept-Encoding: gzip
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google: {
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google: "nodePoolId": "primary"
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google: }
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google
2024-07-26T11:49:18.375-0700 [DEBUG] provider.terraform-provider-google: -----------------------------------------------------
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: 2024/07/26 11:49:18 [DEBUG] Google API Response Details:
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: ---[ RESPONSE ]--------------------------------------
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: HTTP/2.0 400 Bad Request
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: Cache-Control: private
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: Content-Type: application/json; charset=UTF-8
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: Date: Fri, 26 Jul 2024 18:49:18 GMT
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: Server: ESF
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: Vary: Origin
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: Vary: X-Origin
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: Vary: Referer
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: X-Content-Type-Options: nosniff
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: X-Frame-Options: SAMEORIGIN
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: X-Xss-Protection: 0
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: {
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: "error": {
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: "code": 400,
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: "message": "At least one of ['node_version', 'image_type', 'updated_node_pool', 'locations', 'workload_metadata_config', 'upgrade_settings', 'kubelet_config', 'linux_node_config', 'tags', 'taints', 'labels', 'node_network_config', 'gcfs_config', 'gvnic', 'confidential_nodes', 'logging_config', 'fast_socket', 'resource_labels', 'accelerators', 'windows_node_config', 'machine_type', 'disk_type', 'disk_size_gb', 'storage_pools', 'containerd_config', 'resource_manager_tags', 'performance_monitoring_unit', 'queued_provisioning'] must be specified.",
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: "errors": [
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: {
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: "message": "At least one of ['node_version', 'image_type', 'updated_node_pool', 'locations', 'workload_metadata_config', 'upgrade_settings', 'kubelet_config', 'linux_node_config', 'tags', 'taints', 'labels', 'node_network_config', 'gcfs_config', 'gvnic', 'confidential_nodes', 'logging_config', 'fast_socket', 'resource_labels', 'accelerators', 'windows_node_config', 'machine_type', 'disk_type', 'disk_size_gb', 'storage_pools', 'containerd_config', 'resource_manager_tags', 'performance_monitoring_unit', 'queued_provisioning'] must be specified.",
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: "domain": "global",
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: "reason": "badRequest"
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: }
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: ],
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: "status": "INVALID_ARGUMENT",
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: "details": [
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: {
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: "@type": "type.googleapis.com/google.rpc.RequestInfo",
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: "requestId": "0xa4a8369efaf57da0"
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: }
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: ]
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: }
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: }
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: -----------------------------------------------------
2024-07-26T11:49:18.780-0700 [DEBUG] provider.terraform-provider-google: 2024/07/26 11:49:18 [DEBUG] Retry Transport: Stopping retries, last request failed with non-retryable error: googleapi: got HTTP response code 400 with body: HTTP/2.0 400 Bad Request

Nickmman commented 1 month ago

I also am encountering this issue. I have added (in node_pools) the following values, as per the documentation:

    cpu_cfs_quota      = false
    pod_pids_limit     = 0

However, on each plan, it is ignored and Terraform wants to revert back to the default values of null:

# module.gke.google_container_node_pool.pools["default"] will be updated in-place
~ resource "google_container_node_pool" "pools" {
      id                          = "projects/redacted/nodePools/default-5a32"
      name                        = "default-5a32"
      # (11 unchanged attributes hidden)

    ~ node_config {
          tags                        = [
              "redacted",
              "redacted-default",
              "default",
          ]
          # (20 unchanged attributes hidden)

        - kubelet_config {
            - cpu_cfs_quota        = false -> null
            - pod_pids_limit       = 0 -> null
              # (2 unchanged attributes hidden)
          }

          # (3 unchanged blocks hidden)
      }

      # (5 unchanged blocks hidden)
  }

Plan: 0 to add, 1 to change, 0 to destroy.

I'm using version 31.1.0 of the private-cluster-update-variant module.

hernan82arg commented 1 month ago

Same issue here when using the private cluster module.

module.gke.google_container_node_pool.pools["default-node-pool"] will be updated in-place
  ~ resource "google_container_node_pool" "pools" {
        id                          = "projects/xxx/locations/us-east4/clusters/yyy/nodePools/default-node-pool"
        name                        = "default-node-pool"
        # (11 unchanged attributes hidden)

      ~ node_config {
            tags                        = [
                "gke-staging",
                "gke-staging-default-node-pool",
                "default-node-pool",
            ]
            # (20 unchanged attributes hidden)

          - kubelet_config {
              - cpu_cfs_quota        = false -> null
              - pod_pids_limit       = 0 -> null
                # (2 unchanged attributes hidden)
            }

            # (2 unchanged blocks hidden)
        }

        # (5 unchanged blocks hidden)
    }

I've checked the code and node_config doesn't support kubelet_config as a dynamic block. I'm using version 31.1.0 of the private-cluster module.

Edit: master works. I just replaced source with:

source = "git::https://github.com/terraform-google-modules/terraform-google-kubernetes-engine//modules/private-cluster?ref=master"

trenslow commented 1 month ago

also happening for me:

Terraform v1.9.4
on darwin_amd64
+ provider registry.terraform.io/hashicorp/google v5.27.0
+ provider registry.terraform.io/hashicorp/kubernetes v2.31.0
+ provider registry.terraform.io/hashicorp/random v3.6.2
+ provider registry.terraform.io/hashicorp/tfe v0.54.0
+ provider registry.terraform.io/hashicorp/time v0.12.0

rekiemfaxaf commented 3 weeks ago

Same here. After updating the Google Cloud provider from 5.30 to 5.42, I started seeing this error with module version 31.0. I updated the module to version 32, but it was still failing. Adding the values mentioned in the comment quoted below solved the issue:

I also am encountering this issue. I have added (in node_pools) the following values, as per the documentation:

    cpu_cfs_quota      = false
    pod_pids_limit     = 0

However, on each plan, it is ignored and Terraform wants to revert back to the default values of null:

# module.gke.google_container_node_pool.pools["default"] will be updated in-place
~ resource "google_container_node_pool" "pools" {
      id                          = "projects/redacted/nodePools/default-5a32"
      name                        = "default-5a32"
      # (11 unchanged attributes hidden)

    ~ node_config {
          tags                        = [
              "redacted",
              "redacted-default",
              "default",
          ]
          # (20 unchanged attributes hidden)

        - kubelet_config {
            - cpu_cfs_quota        = false -> null
            - pod_pids_limit       = 0 -> null
              # (2 unchanged attributes hidden)
          }

          # (3 unchanged blocks hidden)
      }

      # (5 unchanged blocks hidden)
  }

Plan: 0 to add, 1 to change, 0 to destroy.

I'm using version 31.1.0 of the private-cluster-update-variant module.
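
For readers trying the same fix, here is a minimal sketch of what pinning those values might look like. It assumes a 32.x release of the module (the `~> 32.0` constraint is an illustrative choice, not a confirmed minimum) and reuses the pool name `primary` from the original report; the other required inputs are omitted for brevity.

```hcl
module "gke" {
  source  = "terraform-google-modules/kubernetes-engine/google//modules/private-cluster"
  version = "~> 32.0" # assumption: any 32.x release; the thread reports 31.1.0 ignores these keys

  # ... remaining inputs (project_id, name, region, network, etc.) omitted ...

  node_pools = [
    {
      name = "primary"

      # Mirror what the API already reports in the plan diff, so that
      # Terraform has no kubelet_config block to remove.
      cpu_cfs_quota  = false
      pod_pids_limit = 0
    },
  ]
}
```

Whether this clears the diff appears to depend on the module and provider versions in use; other commenters in this thread report different results on 31.1.0 and 33.x.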

wyardley commented 3 weeks ago

FWIW, for me, with v32.x, the permadiff eventually shifted to a diff on cpu_manager_policy, which was easier to solve by setting it to the valid, but undocumented, value of "". See this comment:

https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/issues/2013#issuecomment-2305452939
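
A rough sketch of that workaround, assuming the module reads `cpu_manager_policy` from each `node_pools` entry and that the empty string is accepted as described above (the pool name and version constraint are illustrative, and the other required inputs are omitted):

```hcl
module "gke" {
  source  = "terraform-google-modules/kubernetes-engine/google//modules/private-cluster"
  version = "~> 32.0" # assumption: a 32.x release, as in the comment above

  # ... remaining inputs (project_id, name, region, network, etc.) omitted ...

  node_pools = [
    {
      name = "primary"

      # Empty string: reported in this thread as valid but undocumented.
      # The intent is to match the value the API returns and stop the diff.
      cpu_manager_policy = ""
    },
  ]
}
```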

derhally commented 2 days ago

Also ran into this with v33.02 of private-cluster-update-variant module and TPG v5.44.0

It fails to update the cluster

Error: googleapi: Error 400: At least one of ['node_version', 'image_type', 'updated_node_pool', 'locations', 'workload_metadata_config', 'upgrade_settings', 'kubelet_config', 'linux_node_config', 'tags', 'taints', 'labels', 'node_network_config', 'gcfs_config', 'gvnic', 'confidential_nodes', 'logging_config', 'fast_socket', 'resource_labels', 'accelerators', 'windows_node_config', 'machine_type', 'disk_type', 'disk_size_gb', 'storage_pools', 'containerd_config', 'resource_manager_tags', 'performance_monitoring_unit', 'queued_provisioning', 'max_run_duration'] must be specified. 

Details: [ { "@type": "type.googleapis.com/google.rpc.RequestInfo", "requestId": "0x32be3a3a868d29d7" } ] , badRequest