hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform
https://registry.terraform.io/providers/hashicorp/google/latest/docs
Mozilla Public License 2.0

Google plans broken after we cycled the GKE cluster certificates using gcloud #13950

Open varunthakur2480 opened 1 year ago

varunthakur2480 commented 1 year ago

Community Note

Terraform Version

Terraform version: v0.14.11
Kubernetes Provider version: 2.15.0
Kubernetes version: 1.23
google-beta/4.15.0
google/4.15.0

Affected Resource(s)

Terraform Configuration Files

locals {
  env_region           = "e2"
  iac_flux_secret_name = "iac-flux-secret"
  iac_gitlab_repo      = data.terraform_remote_state.project.outputs.gke_iac_repo_clone_url
  iac_repo_name        = "gke-iac"
  region               = "europe-west2"
}

terraform {
  backend "remote" {
    hostname     = "xxxx-enterprise.platform..net"
    organization = "xxxx"

    workspaces {
      name = "xxx"
    }
  }
}

# Secrets and credentials come from Vault
provider "vault" {

  auth_login {
    path = "auth/approle/login"

    parameters = {
      role_id   = var.vault_approle
      secret_id = var.vault_approle_secret
    }
  }
}

data "vault_generic_secret" "gke_cluster_viewer" {
  path = "xxxx"
}

provider "google" {
  access_token = data.xxxxx.gke_cluster_viewer.data["token"]
  region       = "europe-west2"
}

data "google_client_config" "default" {}

data "terraform_remote_state" "project" {
  backend = "remote"

  config = {
    hostname     = "terraform-xxxxx.platform.xxxx.net"
    organization = "nwm-non-prod-v2"
    workspaces = {
      name = "dev-asdfa-sdd"
    }
  }
}

##
## GKE IaC
##
resource "kubernetes_manifest" "appplication_source" {

  manifest = {
    apiVersion = "source.toolkit.fluxcd.io/v1beta2"
    kind       = "GitRepository"
    metadata = {
      name      = var.iac_repo_name
      namespace = var.namespace

      finalizers = ["finalizers.fluxcd.io"]
    }
    spec = {
      gitImplementation = "go-git"
      interval          = "1m0s"
      url               = var.iac_gitlab_repo
      ref = {
        branch = var.iac_git_branch
        tag    = var.iac_git_tag
      }
      secretRef = {
        name = var.iac_flux_secret_name
      }
      timeout = "20s"
    }
  }
  field_manager {
    force_conflicts = true
  }

}

resource "kubernetes_manifest" "application_kustomize" {
  manifest = {
    apiVersion = "kustomize.toolkit.fluxcd.io/v1beta2"
    kind       = "Kustomization"
    metadata = {
      name      = var.iac_repo_name
      namespace = var.namespace
    }
    spec = {
      force              = var.flux_force
      interval           = "1m0s"
      path               = var.git_path
      suspend            = var.flux_suspend
      prune              = true
      serviceAccountName = "flux"
      sourceRef = {
        kind      = "GitRepository"
        name      = var.iac_repo_name
        namespace = var.namespace
      }
      targetNamespace = var.namespace
      validation      = "server"
    }
  }
  field_manager {
    force_conflicts = true
  }
}

Debug Output

:31:15]
[2023-02-27 08:31:15] Error: Invalid configuration for API client
[2023-02-27 08:31:15]
[2023-02-27 08:31:15]   on ../../../modules/flux-setup/flux.tf line 37, in resource "kubernetes_manifest" "application_kustomize":
[2023-02-27 08:31:15]   37: resource "kubernetes_manifest" "application_kustomize" {
[2023-02-27 08:31:15]
[2023-02-27 08:31:15] Get "https://10.124.239.189/apis": Service Unavailable
[2023-02-27 08:31:15]

Panic Output

data "google_client_config" "default" {} seems to generate an OAuth token on the fly, but after the cluster certificate rotation the OAuth token expired and the data source was not able to regenerate it. I even tried deleting the data source from the tfstate to force a refresh, but that did not help.

Expected Behavior

Plan should have succeeded

Actual Behavior

[2023-02-27 08:31:15] Get "https://10.124.239.189/apis": Service Unavailable

Steps to Reproduce

1. Cycle the cluster certificates using the gcloud command
2. Run terraform plan

References

edwardmedia commented 1 year ago

@varunthakur2480 help me to understand how we can help here. I don't see any specific google provider resource. Can you provide details and be specific?

varunthakur2480 commented 1 year ago

Sorry, I forgot to mention some details.

The data "google_client_config" "default" {} data source is responsible for fetching the cluster data along with the temporary O-auth token, which is then used to run Terraform operations. After the cluster certificates are rotated using the gcloud command, terraform plan shows that the data returned by the google_client_config data source contains the correct API endpoint of the cluster, but the O-auth token does not get refreshed, and hence everything starts to fail.

We had to rebuild the cluster to fix it

edwardmedia commented 1 year ago

@varunthakur2480 I noticed you said "the O-auth token does not get refreshed". Can you share the debug log? I want to see how google_client_config is called.

https://github.com/hashicorp/terraform-provider-google/blob/main/google/data_source_google_client_config.go#L55

varunthakur2480 commented 1 year ago

I don't have the debug log now, as the cluster was rebuilt. I will try to recreate the issue in dev and share it next week.

slevenick commented 1 year ago

Are you running plan with or without refresh? If refresh is disabled I would expect this to happen

rileykarson commented 1 year ago

FWIW, terraform plan can unexpectedly preserve values when we'd expect them to change, and terraform refresh will change them. I've never figured out the exact mechanics.

varunthakur2480 commented 1 year ago

Terraform refresh is not supported for remote backends, and it is also worth mentioning that it is deprecated in the latest versions: https://developer.hashicorp.com/terraform/cli/commands/refresh



Error: error starting operation:

The "remote" backend does not support the "OperationTypeRefresh" operation.

edwardmedia commented 1 year ago

@varunthakur2480 waiting for your debug log and steps that I can use to repro the issue

slevenick commented 1 year ago

I'm fairly certain that the google_client_config data source pulls the token from your local authentication source of the Terraform provider. What authentication method are you using for Terraform, and are you updating it after you cycle the GKE cluster certificate?

varunthakur2480 commented 1 year ago

I have debug logs available now. Is there a way to share them privately, as they might contain some classified info?

edwardmedia commented 1 year ago

@varunthakur2480 https://gist.github.com/ is the place you can use. It is public, and you need to redact any secrets you don't want to share.

varunthakur2480 commented 1 year ago

run-5RhoAr4n8HZMZKSX-plan-log.txt

varunthakur2480 commented 1 year ago

Did you get a chance to look at the logs?

varunthakur2480 commented 1 year ago

I have a broken cluster just to provide more information if required, so I am wondering if you would get time this week to look at the debug logs?

edwardmedia commented 1 year ago

@varunthakur2480 your terraform version (v0.14.10) is pretty old. Is it possible to try with the latest version?

I do see the line below in your log. I notice module.default.module.default.data.google_client_config.current, which appears to be nested two modules deep (module - module). I'm not sure what impact that nesting has. Are you able to put data.google_client_config.current at the root level to see if you can repro?

2023/03/20 07:31:46 [WARN] Provider "registry.terraform.io/hashicorp/google" produced an unexpected new value for module.default.module.default.data.google_client_config.current.
      - .access_token: inconsistent values for sensitive attribute
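
A minimal sketch of what such a root-level check might look like, assuming the google provider is already configured in the root module. The output name is hypothetical and not part of the original configuration; it only surfaces a non-sensitive attribute so that the data source is actually read during plan.

```hcl
# Root module: the data source is declared directly here rather than
# inside module.default.module.default.
data "google_client_config" "current" {}

# Hypothetical output, added only so the data source is evaluated.
# It exposes the region rather than access_token, which is sensitive.
output "client_config_region" {
  value = data.google_client_config.current.region
}
```

If the warning about access_token still appears with this layout, the module nesting is unlikely to be the cause.
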
varunthakur2480 commented 1 year ago

I have attached another debug log for a smaller component; let me know if that helps: run-ExyaP5VkmrSMykwQ-plan-log.txt

Upgrade to latest tf is not possible due to sentinel limitations

edwardmedia commented 1 year ago

@trodge besides the terraform version, what else can you think of that could cause the issue?

varunthakur2480 commented 1 year ago

The issue is with the google provider; I'm not sure how a Terraform version upgrade can fix it. We are already on a relatively new version of the provider.

SarahFrench commented 1 year ago

After the cluster certificates are rotated using gcloud command , terraform plan shows that the data rendered by google_client_config data resource pulls correct API endpoint of the cluster but the O-auth token does not get refreshed and hence everything starts to fail

I'm also a bit unclear with what the issue is, but I wanted to explore the idea of the access token not being refreshed.

From your config I can see that you configure the provider with access_token = data.xxxxx.gke_cluster_viewer.data["token"]. This means that when the provider is configured in the early stages of a plan/apply step this code is hit (note that it's in a block handling the scenario where the user configures the provider with an access token). I'll return to this info in the next paragraph.

When data "google_client_config" "default" {} uses the provider's client to get an access token it uses the token source set within the provider. That means it uses the token source created in the code I linked above, which looks like:

        return googleoauth.Credentials{
            TokenSource: StaticTokenSource{oauth2.StaticTokenSource(token)},
        }, nil

The token source made there, oauth2.StaticTokenSource, returns the same token without refreshing it. The oauth2 documentation for this method (https://pkg.go.dev/golang.org/x/oauth2#StaticTokenSource) says "Because the provided token t is never refreshed, StaticTokenSource is only useful for tokens that never expire.".


So it sounds like you expected the token to be refreshed, but this method doesn't allow the token to be refreshed and instead returns the same token assuming that it doesn't expire. Could you please confirm whether the access_token your configuration uses from Vault data.xxxxx.gke_cluster_viewer.data["token"] token expires or not?

Additionally: I quickly checked what token source is used when the provider is configured with credentials instead of access_token, and I see it's oauth2.reuseTokenSource, which looks like it means that tokens returned by google_client_config when the provider is configured with credentials will refresh?

If you can't change the method of how you configure the google provider in your Terraform project then (if I'm not completely wrong) I think you may need to open a feature request? I'll ask internally
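
For illustration, a minimal sketch of configuring the google provider with credentials instead of access_token, which is the case described above as using a refreshing token source. The variable name google_credentials_json is hypothetical and assumed to hold the contents of a service account key file; nothing in this thread confirms that this option is available to the reporter.

```hcl
# Hypothetical variable holding service account key JSON; it is not
# part of the original configuration in this issue.
variable "google_credentials_json" {
  type      = string
  sensitive = true
}

provider "google" {
  # credentials replaces access_token here, so the provider builds its
  # own token source instead of reusing a fixed, pre-issued token.
  credentials = var.google_credentials_json
  region      = "europe-west2"
}

data "google_client_config" "default" {}
```

The trade-off is that the key material itself is handed to Terraform, which may not be acceptable where short-lived tokens are issued specifically to avoid distributing keys.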

varunthakur2480 commented 1 year ago

Thanks for the detailed response. For clarification I am adding some more context, but it seems that the assumption in the code that the token remains static is incorrect.

So the problem I am facing is that Kubernetes/GKE good practices expect us to regenerate the GKE cluster certificates manually every few months or years. Basically, when you roll the certificates as described here https://cloud.google.com/kubernetes-engine/docs/how-to/credential-rotation you need to reauthenticate to the cluster, and the old credentials will not work. This works fine for normal authentication, but it fails for us because, as you pointed out, the tokens are expected to remain static. I guess this needs to change. In the meantime I will explore whether we can use oauth2.reuseTokenSource.

varunthakur2480 commented 1 year ago

I checked the Vault token bit and unfortunately it needs to be access_token and can't be of type credentials, in order to prevent leakage of keys. Also, the Vault token is set to expire every 2 hours.

SarahFrench commented 1 year ago

Thanks for checking!

So to summarise so far, you're configuring the google provider with an access token coming from Vault. That same access token is then being retrieved by data "google_client_config" "default" {}. Due to how that data source works, the token is not refreshed at any point.

A possible concern is that unrefreshed tokens expire, but I don't think your issue is the access token expiring. The plan you've shared here starts at 2023-04-03T04:28:24 and the request that fails is at 2023-04-03T04:28:30. Tokens last 1 hour by default and your error occurs very quickly.

Is the output from data "google_client_config" "default" {} used to configure the kubernetes provider? Could you please post some details about how the google_client_config data source is used?
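
For reference, a commonly used pattern for wiring google_client_config into the kubernetes provider looks roughly like the sketch below. Whether the reporter's configuration matches it is not shown in this thread; the cluster name is a placeholder and the google_container_cluster data source is an assumption.

```hcl
data "google_client_config" "default" {}

# Assumed lookup of the GKE cluster; "my-cluster" is a placeholder name.
data "google_container_cluster" "cluster" {
  name     = "my-cluster"
  location = "europe-west2"
}

provider "kubernetes" {
  host = "https://${data.google_container_cluster.cluster.endpoint}"

  # Bearer token taken from the google provider's token source.
  token = data.google_client_config.default.access_token

  # CA certificate reported by the cluster, which changes after a
  # credential rotation.
  cluster_ca_certificate = base64decode(
    data.google_container_cluster.cluster.master_auth[0].cluster_ca_certificate
  )
}
```

With this layout both the endpoint and the CA certificate are read from the cluster on each plan, while the token is whatever the google provider was configured with.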

In the plan you shared here I see the failing request is a GET /api/v1/namespaces/prv1-e2-prv-exampletf/configmaps/my-config done by the kubernetes provider. The auth is a bearer token that looks like it's a kubernetes service account token (versus GCP service account) after I looked at its contents:

{
  "iss": "kubernetes/serviceaccount",
  "kubernetes.io/serviceaccount/namespace": "value changed",
  "kubernetes.io/serviceaccount/secret.name": "value changed",
  "kubernetes.io/serviceaccount/service-account.name": "value changed",
  "kubernetes.io/serviceaccount/service-account.uid": "value changed",
  "sub": "value changed"
}

It's starting to feel like something is incorrect with tokens made in your cluster perhaps? Though I admit it's been a while since I've done k8s work.

I was reading this blog post here for some ideas... could you inspect the Secret that the kubernetes service account token comes from (e.g. terraform-token-2kdzg) and see if the data.ca.crt value is correct for the new certificates?

If that doesn't help in any way I can ask someone from the kubernetes provider team if they have any ideas.

SarahFrench commented 1 year ago

Also, is the access token supplied from Vault set up with the scope https://www.googleapis.com/auth/userinfo.email?

varunthakur2480 commented 1 year ago

There is also a similar issue here, though I am not sure if it has any relation to ours: https://github.com/hashicorp/terraform/issues/27741

varunthakur2480 commented 1 year ago

Based on your comment I forced recreation of the Vault token, but I'm still getting the same issue.