hashicorp / terraform-provider-kubernetes

Terraform Kubernetes provider
https://www.terraform.io/docs/providers/kubernetes/
Mozilla Public License 2.0

Plan with k8s/helm provider doesn't wait for upstream K8s cluster. #2512

Closed. chrisbecke closed this issue 3 weeks ago.

chrisbecke commented 3 weeks ago

Resources that use the kubernetes provider (and, by implication, the helm provider) do not pick up the upstream cluster's connection details when the upstream cluster is referenced indirectly through data sources.

Which is to say:

  1. The initial deployment always succeeds.
  2. Subsequent applies work for as long as "endpoint", "cluster_ca_certificate", and "token" remain known.
  3. Using "data" resources to look up cluster details does NOT work, even with depends_on directives.
  4. However, using the upstream resources directly does work (contrasted in the sketch below).
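
For clarity, a condensed contrast of the two provider configurations (taken from the full reproduction and the working configuration shown later in this issue):

# Fails: indirect references via data sources. When the upstream cluster has
# pending changes, these data source values are not known during plan, and the
# provider falls back to its localhost defaults.
provider "kubernetes" {
  host                   = data.aws_eks_cluster.cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.eks_auth.token
}

# Works: direct references to the managed resource, with exec for the token.
provider "kubernetes" {
  host                   = aws_eks_cluster.cluster.endpoint
  cluster_ca_certificate = base64decode(aws_eks_cluster.cluster.certificate_authority[0].data)
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", local.name]
  }
}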

Terraform Version, Provider Version and Kubernetes Version

Terraform v1.8.4
on darwin_amd64
+ provider registry.terraform.io/hashicorp/aws v5.52.0
+ provider registry.terraform.io/hashicorp/kubernetes v2.30.0

Terraform Configuration Files

provider "aws" {
}

locals {
  name = replace(basename(path.cwd), "_", "-")
}

////////////////////////////////////////////////////////////////////////////////
// A minimal EKS cluster to reproduce the issue
////////////////////////////////////////////////////////////////////////////////

data "aws_iam_policy_document" "assume_role" {
  statement {
    effect = "Allow"
    principals {
      type        = "Service"
      identifiers = ["eks.amazonaws.com"]
    }
    actions = ["sts:AssumeRole"]
  }
}

data "aws_vpc" "default" {
  default = true
}

data "aws_subnets" "default" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.default.id]
  }
  filter {
    name   = "default-for-az"
    values = ["true"]
  }
}

resource "aws_iam_role" "cluster" {
  name               = local.name
  assume_role_policy = data.aws_iam_policy_document.assume_role.json
}

resource "aws_iam_role_policy_attachment" "amazon-eks-cluster-policy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
  role       = aws_iam_role.cluster.name
}

resource "aws_iam_role_policy_attachment" "amazon-eks-vpc-resource-controller" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSVPCResourceController"
  role       = aws_iam_role.cluster.name
}

variable "authentication_mode" {
  default = null
}

resource "aws_eks_cluster" "cluster" {
  name     = local.name
  role_arn = aws_iam_role.cluster.arn
  vpc_config {
    subnet_ids = data.aws_subnets.default.ids
  }
  access_config {
    authentication_mode = var.authentication_mode
  }

  depends_on = [
    aws_iam_role_policy_attachment.amazon-eks-cluster-policy,
    aws_iam_role_policy_attachment.amazon-eks-vpc-resource-controller
  ]
}

///////////////////////////////////////////////////////////////////////////////
// A random kubernetes resource to trigger the provider
///////////////////////////////////////////////////////////////////////////////

data "aws_eks_cluster" "cluster" {
  name       = local.name
  depends_on = [aws_eks_cluster.cluster]
}

data "aws_eks_cluster_auth" "eks_auth" {
  name       = local.name
  depends_on = [aws_eks_cluster.cluster]
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.eks_auth.token
}

resource "kubernetes_namespace" "ns" {
  metadata {
    name = "test-ns"
  }
}

Steps to Reproduce

Assuming an AWS account:

  1. terraform init
  2. terraform apply
  3. Set authentication_mode = "API_AND_CONFIG_MAP" in terraform.tfvars to trigger a change to the cluster.
  4. terraform plan and observe the error.
  5. Remove the "data." prefix from the host argument (reference aws_eks_cluster.cluster.endpoint directly).
  6. terraform plan and observe the error.
  7. Remove the "data." prefix from the cluster_ca_certificate argument.
  8. terraform plan and observe the error.
  9. Delete the token argument and use an exec block to generate the cluster authentication token.
  10. terraform plan and observe success.

Expected Behavior

With the following provider configuration, which uses the upstream resource directly for the endpoint and certificate and an exec block to retrieve the token rather than relying on either of the data objects, the plan succeeds:

*provider "kubernetes" {
  host                   = aws_eks_cluster.cluster.endpoint
  cluster_ca_certificate = base64decode(aws_eks_cluster.cluster.certificate_authority[0].data)
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", local.name]
  }
}
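
The exec approach sidesteps the token problem because the AWS CLI is invoked when the provider actually connects to the cluster, so the credential does not need to be a value that is already known during plan.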

Actual Behavior

With the configuration as given, terraform plan fails with:

│ Error: Get "http://localhost/api/v1/namespaces/test-ns": dial tcp [::1]:80: connect: connection refused
│ 
│   with kubernetes_namespace.ns,
│   on main.tf line 109, in resource "kubernetes_namespace" "ns":
│  109: resource "kubernetes_namespace" "ns" {
│ 

With aws_eks_cluster.cluster.endpoint in place of data.aws_eks_cluster.cluster.endpoint, the provider now finds the correct endpoint:

provider "kubernetes" {
  host                   = aws_eks_cluster.cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.eks_auth.token
}

The following error results:

│ Error: Get "https://5FE1EB3BB63BA7C813D2DDA68E593F88.gr7.eu-west-1.eks.amazonaws.com/api/v1/namespaces/test-ns": tls: failed to verify certificate: x509: “kube-apiserver” certificate is not trusted
│ 
│   with kubernetes_namespace.ns,
│   on main.tf line 110, in resource "kubernetes_namespace" "ns":
│  110: resource "kubernetes_namespace" "ns" {

With aws_eks_cluster.cluster.certificate_authority in place of data.aws_eks_cluster.cluster.certificate_authority, it uses the correct endpoint and certificate:

provider "kubernetes" {
  host                   = data.aws_eks_cluster.cluster.endpoint
  cluster_ca_certificate = base64decode(aws_eks_cluster.cluster.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.eks_auth.token
}

This error results:

│ Error: namespaces "test-ns" is forbidden: User "system:anonymous" cannot get resource "namespaces" in API group "" in the namespace "test-ns"
│ 
│   with kubernetes_namespace.ns,
│   on main.tf line 111, in resource "kubernetes_namespace" "ns":
│  111: resource "kubernetes_namespace" "ns" {

With an exec block in place of data.aws_eks_cluster_auth.eks_auth.token, it resolves the correct token and the plan succeeds.

References

The following tickets reference the "localhost" fallback but don't mention how to fix the certificate or token errors.

appilon commented 3 weeks ago

Hello @chrisbecke,

This is a common problem: the output of one apply is needed to configure another provider. Unfortunately, at this time the prescribed advice is to break your workspace into separate configurations and apply them "progressively". It is something we are trying to address in the future, but it's a complex problem.
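
For anyone unfamiliar with that workflow, here is a minimal sketch of the split: the EKS cluster lives in one configuration that is applied first, and the Kubernetes resources live in a second configuration that reads the first one's outputs, so the provider is only ever configured against a cluster that already exists. The backend bucket, key, and region below are hypothetical placeholders.

# Configuration 1 (applied first): creates the cluster and exports its name.
output "cluster_name" {
  value = aws_eks_cluster.cluster.name
}

# Configuration 2 (applied afterwards): looks up the already-existing cluster.
data "terraform_remote_state" "eks" {
  backend = "s3"
  config = {
    bucket = "example-state-bucket"  # hypothetical
    key    = "eks/terraform.tfstate" # hypothetical
    region = "eu-west-1"             # hypothetical
  }
}

data "aws_eks_cluster" "cluster" {
  name = data.terraform_remote_state.eks.outputs.cluster_name
}

data "aws_eks_cluster_auth" "eks_auth" {
  name = data.terraform_remote_state.eks.outputs.cluster_name
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.eks_auth.token
}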

chrisbecke commented 3 weeks ago

The weird thing is that it actually works, as long as you use the upstream objects directly to initialise the provider. If you use data objects that merely depends_on the upstream resource, it fails. The fact that it works at all, but fails with data objects (which should honour depends_on), seemingly indicates this is a bug, not a feature.