Unable to create k8s service account for Workload Identity Federation on a GKE private cluster

diguida commented 3 months ago

Terraform version, Kubernetes provider version and Kubernetes version

Terraform version: 1.8.5
Kubernetes Provider version: 2.31.0
Google Cloud Provider version: 5.34.0
Kubernetes version: 1.28.9-gke.1209000

Terraform configuration

resource "google_container_cluster" "my-cluster" {
  project  = var.GCP_PROJECT_ID
  name     = "my-cluster"
  location = "europe-west8-a"
  # We can't create a cluster with no node pool defined, but we want to only use
  # separately managed node pools. So we create the smallest possible default
  # node pool and immediately delete it.
  remove_default_node_pool = true
  initial_node_count       = 1
  network                  = google_compute_network.network.name
  subnetwork               = google_compute_subnetwork.network_subnet.name
  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = true
    master_ipv4_cidr_block  = "172.16.0.32/28"
  }
  ip_allocation_policy {
  }
  master_authorized_networks_config {
  }
  workload_identity_config {
    workload_pool = "${var.GCP_PROJECT_ID}.svc.id.goog"
  }
  logging_config {
    enable_components = [
      "SYSTEM_COMPONENTS",
      "APISERVER",
      "WORKLOADS"
    ]
  }
}

resource "google_container_node_pool" "my-nodes" {
  name       = "my-node-pool"
  location   = "europe-west8-a"
  cluster    = google_container_cluster.my-cluster.name
  node_count = 1

  node_config {
    preemptible  = true
    machine_type = "e2-standard-4"

    service_account = google_service_account.gke-service-account.email
    oauth_scopes = [
      "cloud-platform"
    ]

    shielded_instance_config {
      enable_secure_boot = true
    }

  }

}

module "my-workload-identity" {
  source     = "terraform-google-modules/kubernetes-engine/google//modules/workload-identity"
  name       = "my-identity"
  namespace  = "default"
  project_id = var.GCP_PROJECT_ID
  roles      = [
    "roles/logging.logWriter",
    "roles/cloudsql.client",
    "roles/artifactregistry.reader"
  ]
}

data "google_client_config" "current" {}

provider "kubernetes" {
  host                   = "https://${google_container_cluster.my-cluster.endpoint}"
  token                  = data.google_client_config.current.access_token
  cluster_ca_certificate = base64decode(google_container_cluster.my-cluster.master_auth[0].cluster_ca_certificate)
}

Question

Apologies if this is a duplicate post. I am trying to configure Workload Identity Federation on a private GKE cluster using the snippet above, which follows the documentation and the guidelines in https://registry.terraform.io/providers/hashicorp/google/latest/docs/guides/using_gke_with_terraform

The resources are deployed by a pipeline running on a GitLab k8s runner hosted in GCP, but in a different project.

image:
  name: hashicorp/terraform:1.8.5
  entrypoint:
    - "/usr/bin/env"
    - "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

before_script:
  - pwd
  - mkdir .gcp
  - echo $GCP_SERVICE_ACCOUNT > .gcp/credentials.json
  - export GOOGLE_APPLICATION_CREDENTIALS=".gcp/credentials.json"
  - rm -rf .terraform
  - terraform --version
  - terraform init

# ...

apply:
  stage: apply
  script:
    - export TF_LOG=DEBUG
    - terraform apply -input=false -auto-approve "planfile"
  dependencies:
    - plan
  only:
    - main
  needs:  
    - plan
  when: manual

after_script:
  - rm .gcp/credentials.json

The GKE cluster was created without problems. However, as soon as I add the workload identity module, the apply fails with this error:

module.my-workload-identity.kubernetes_service_account.main[0]: Still creating... [10s elapsed]
module.my-workload-identity.kubernetes_service_account.main[0]: Still creating... [20s elapsed]
module.my-workload-identity.kubernetes_service_account.main[0]: Still creating... [30s elapsed]
2024-06-22T12:27:37.333Z [ERROR] provider.terraform-provider-kubernetes_v2.31.0_x5: Response contains error diagnostic: @caller=github.com/hashicorp/terraform-plugin-go@v0.23.0/tfprotov5/internal/diag/diagnostics.go:58 tf_proto_version=5.6 tf_provider_addr=registry.terraform.io/hashicorp/kubernetes tf_rpc=ApplyResourceChange @module=sdk.proto diagnostic_detail="" diagnostic_severity=ERROR diagnostic_summary="Post \"https://172.16.0.34/api/v1/namespaces/default/serviceaccounts\": context deadline exceeded" tf_req_id=9312e024-3cff-3a97-8799-9a54659b9c57 tf_resource_type=kubernetes_service_account timestamp=2024-06-22T12:27:37.333Z
2024-06-22T12:27:37.335Z [DEBUG] states/remote: state read serial is: 94; serial is: 94
2024-06-22T12:27:37.335Z [DEBUG] states/remote: state read lineage is: 1ee3af85-9da7-164a-413f-1b485a9fbda7; lineage is: 1ee3af85-9da7-164a-413f-1b485a9fbda7
2024-06-22T12:27:37.583Z [ERROR] vertex "module.my-workload-identity.kubernetes_service_account.main[0]" error: Post "https://172.16.0.34/api/v1/namespaces/default/serviceaccounts": context deadline exceeded
2024-06-22T12:27:37.584Z [DEBUG] states/remote: state read serial is: 95; serial is: 95
2024-06-22T12:27:37.584Z [DEBUG] states/remote: state read lineage is: 1ee3af85-9da7-164a-413f-1b485a9fbda7; lineage is: 1ee3af85-9da7-164a-413f-1b485a9fbda7
╷
│ Error: Post "https://172.16.0.34/api/v1/namespaces/default/serviceaccounts": context deadline exceeded
│ 
│   with module.my-workload-identity.kubernetes_service_account.main[0],
│   on .terraform/modules/my-workload-identity/modules/workload-identity/main.tf line 51, in resource "kubernetes_service_account" "main":
│   51: resource "kubernetes_service_account" "main" {
│ 
╵
2024-06-22T12:27:37.787Z [DEBUG] provider.terraform-provider-google_v5.34.0_x5: 2024/06/22 12:27:37 [DEBUG] [transport] [server-transport 0xc0003fdc80] Closing: Server.Stop called 
2024-06-22T12:27:37.788Z [DEBUG] provider.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = error reading from server: EOF"
2024-06-22T12:27:37.794Z [DEBUG] provider.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = error reading from server: EOF"

The cluster endpoint in the error (172.16.0.34) looks correct: it falls within the configured master_ipv4_cidr_block (172.16.0.32/28).

In the k8s API server logs, I cannot see any request coming from the Terraform process.

Could you please help me understand the issue, or point me to a more appropriate channel? I have been stuck on this for a few days.

Thanks in advance.

sheneska commented 3 months ago

Hi @diguida, thanks for opening this issue. Could you try to apply this separately please?

diguida commented 3 months ago

Hi @sheneska, thanks for looking into this. It is not clear to me what you mean by "try to apply this separately".

Should I run the apply command in a Compute Engine instance or on my laptop instead of the runner?

Thanks.

bwburch1023 commented 1 month ago

@diguida I just ran into the exact same issue. I was able to get it to work by adding 0.0.0.0/0 to the master authorized networks as a test, though I wouldn't recommend leaving it that way. You can check the k8s API server log to see which IP is being used in the request. I'm trying to get the CIDR block from HashiCorp, since we are using Terraform Cloud.
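
For reference, a narrower version of that test could look like the sketch below. The `master_authorized_networks_config` block of `google_container_cluster` accepts `cidr_blocks` entries, so the range the Terraform runner connects from can be authorized explicitly instead of 0.0.0.0/0. The CIDR and display name here are hypothetical placeholders, not values from this thread. It also assumes the runner already has a network route to the private endpoint, which an authorized-network entry alone does not provide.

```hcl
resource "google_container_cluster" "my-cluster" {
  # ... other arguments as in the configuration from the original post ...

  master_authorized_networks_config {
    # Hypothetical placeholder range: replace it with the range the runner
    # actually connects from (the API server log shows the source IP).
    cidr_blocks {
      cidr_block   = "10.10.0.0/24"
      display_name = "terraform-runner"
    }
  }
}
```

If the request still times out with the runner's range authorized, the problem is more likely routing between the two projects than the authorized-networks list.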
