oracle-terraform-modules / terraform-oci-oke

The Terraform OKE Module Installer for Oracle Cloud Infrastructure provides a Terraform module that provisions the necessary resources for Oracle Container Engine.
https://oracle-terraform-modules.github.io/terraform-oci-oke/
Universal Permissive License v1.0

Race Condition with installed extras (metrics-server, custom role bindings) #946

Open tcrowder-koerber opened 3 weeks ago

tcrowder-koerber commented 3 weeks ago

Community Note

Terraform Version and Provider Version

Terraform 1.7.5
OCI provider 6.9.0

Affected Resource(s)

metrics server, autoscaler and custom role bindings

Terraform Configuration Files

OKE Module 5.1.8

Attributes:

    cluster_type           = "enhanced"
    create_service_account = true
    service_accounts = merge({
      kubeconfig-sa = {
        sa_namespace            = "kube-system"
        sa_name                 = "kubeconfig-sa"
        sa_cluster_role         = "cluster-admin"
        sa_cluster_role_binding = "sa-crb"
      }
    }, var.cluster_service_accounts)
    create_iam_resources       = true
    create_operator            = true
    operator_install_helm      = true
    cluster_autoscaler_install = true
    metrics_server_install     = true

... nodepool-1 = { allow_autoscaler = true }

Debug Output

No related debug output. Generally the autoscaler and metrics-server install OK, but the CRB creation fails while still being reported as OK. Before I added the CRB, it was the metrics-server that had issues. The clusters are created fine and kubectl works, but either the metrics-server or the CRB is missing. I now run post-installation validation checks on these resources because of how often this fails.
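A post-installation check of the kind described here could look like the following minimal sketch. It assumes kubectl is already configured against the new cluster. The function name verify_extras, the METRICS_NAMESPACE variable with its default of "metrics", and the deployment name metrics-server are hypothetical and should be adjusted to the actual installation; sa-crb is the binding name taken from the configuration above.

```shell
# Sketch only: verify_extras and METRICS_NAMESPACE are illustrative names,
# not part of the OKE module. Adjust resource names to your installation.
verify_extras() {
  local missing=0
  local ns="${METRICS_NAMESPACE:-metrics}"  # assumed namespace, override as needed

  # Check that the metrics-server deployment exists (deployment name is an assumption).
  kubectl get deployment metrics-server --namespace "$ns" >/dev/null 2>&1 \
    || { echo "missing: deployment ${ns}/metrics-server" >&2; missing=1; }

  # Check that the custom cluster role binding from the configuration exists.
  kubectl get clusterrolebinding sa-crb >/dev/null 2>&1 \
    || { echo "missing: clusterrolebinding sa-crb" >&2; missing=1; }

  # Non-zero status if anything was missing, so a pipeline can fail on it.
  return "$missing"
}
```

Calling verify_extras right after terraform apply and failing the run on a non-zero status surfaces the missing resources that the apply itself does not report.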

My workaround is to set the flags for the affected resources to false and apply, then set them back to true and apply again.

Panic Output

Expected Behavior

The apply should either create the resources or fail with an error.

Actual Behavior

The apply succeeds but some resources such as metrics-server or custom role bindings are not present.

Steps to Reproduce

  1. terraform apply

Important Factoids

The workaround for me is to set the flags for the affected resources to false and apply, then set them back to true and apply again.

References

robo-cap commented 2 weeks ago

Regarding the creation of the RB/CRB, some of the commands may be failing silently. Please check the note in the Terraform documentation for the remote-exec provisioner.

Please add set -o errexit as the first entry in the list of inline commands executed to create the CRB, here, and attach any logs that may explain why the resource creation fails.
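The silent failure can be reproduced outside Terraform. The remote-exec provisioner concatenates the inline commands into one script, so without errexit only the exit status of the last command decides whether the step is reported as failed. The sketch below uses a hypothetical two-command script (a failing command followed by a succeeding one) as a stand-in for the real CRB commands, which are not reproduced here.

```shell
# Hypothetical stand-in for the concatenated inline commands:
# the first command fails, the last one succeeds.
inline_commands='false
echo "a later command still runs"'

# Without errexit the script's status is that of the last command,
# so the earlier failure goes unnoticed.
status_without=0
sh -c "$inline_commands" || status_without=$?

# With errexit as the first command the script stops at the first failure
# and returns a non-zero status.
status_with=0
sh -c "set -o errexit
$inline_commands" || status_with=$?

echo "without errexit: exit status $status_without"
echo "with errexit:    exit status $status_with"
```

The first run reports status 0 even though a command failed, which matches an apply that succeeds while a resource is missing. The second run stops at the failing command, so the failure becomes visible to the caller.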