siderolabs / terraform-provider-talos

Mozilla Public License 2.0

Upgrading provider from 0.4.0 to 0.5.0 fails to apply #168

NigelVanHattum closed 1 month ago

NigelVanHattum commented 3 months ago

I deployed my Talos OS cluster with the 0.4.0 provider. I am currently running 1 control plane node and 2 workers, hosted on Proxmox VMs.

Upgrading the provider to 0.5.0 causes the following error:

│ Error: Error applying configuration
│ 
│   with talos_machine_configuration_apply.controlplane["minisforum-master"],
│   on talos.tf line 29, in resource "talos_machine_configuration_apply" "controlplane":
│   29: resource "talos_machine_configuration_apply" "controlplane" {
│ 
│ rpc error: code = InvalidArgument desc = unknown keys found during decoding:
│ machine:
│     features:
│         hostDNS:
│             enabled: true

My setup

Terraform code

locals {
  master_ips       = [for master in var.master_vms : master.ip_address]
  cluster_name     = "talos-homelab"
  cluster_endpoint = "https://${local.master_ips[0]}:6443"
}

resource "talos_machine_secrets" "this" {}

data "talos_machine_configuration" "controlplane" {
  cluster_name     = local.cluster_name
  cluster_endpoint = local.cluster_endpoint
  machine_type     = "controlplane"
  machine_secrets  = talos_machine_secrets.this.machine_secrets
}

data "talos_machine_configuration" "worker" {
  cluster_name     = local.cluster_name
  cluster_endpoint = local.cluster_endpoint
  machine_type     = "worker"
  machine_secrets  = talos_machine_secrets.this.machine_secrets
}

data "talos_client_configuration" "this" {
  cluster_name         = local.cluster_name
  client_configuration = talos_machine_secrets.this.client_configuration
  nodes                = local.master_ips
}

resource "talos_machine_configuration_apply" "controlplane" {
  depends_on = [
    proxmox_vm_qemu.talos-controlpane
  ]

  for_each = var.master_vms

  client_configuration        = talos_machine_secrets.this.client_configuration
  machine_configuration_input = data.talos_machine_configuration.controlplane.machine_configuration
  node                        = each.value.ip_address
  config_patches = [
    file("${path.module}/talos_config_patches/cert-rotate-patch.yaml"),
    templatefile("${path.module}/talos_config_patches/hostname-config.yaml", {
      hostname = each.key
    })
  ]
}

resource "talos_machine_configuration_apply" "worker" {
  depends_on = [
    proxmox_vm_qemu.talos-worker
  ]

  for_each = var.worker_vms

  client_configuration        = talos_machine_secrets.this.client_configuration
  machine_configuration_input = data.talos_machine_configuration.worker.machine_configuration
  node                        = each.value.ip_address
  config_patches = [
    file("${path.module}/talos_config_patches/cert-rotate-patch.yaml"),
    templatefile("${path.module}/talos_config_patches/hostname-config.yaml", {
      hostname = each.key
    })
  ]
}

resource "talos_machine_bootstrap" "this" {
  depends_on = [
    talos_machine_configuration_apply.controlplane,
    talos_machine_configuration_apply.worker
  ]

  client_configuration = talos_machine_secrets.this.client_configuration
  node                 = local.master_ips[0]
}

data "talos_cluster_kubeconfig" "this" {
  depends_on = [
    talos_machine_bootstrap.this
  ]

  client_configuration = talos_machine_secrets.this.client_configuration
  node                 = local.master_ips[0]
}

# Add cert signing and enable the metric server
data "http" "cert_approver_download" {
  url = "https://raw.githubusercontent.com/alex1989hu/kubelet-serving-cert-approver/main/deploy/standalone-install.yaml"
}

data "http" "metric_server_download" {
  url = "https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml"
}

data "kubectl_file_documents" "cert_approver" {
  content = data.http.cert_approver_download.response_body
}

data "kubectl_file_documents" "metric_server" {
  content = data.http.metric_server_download.response_body
}

resource "time_sleep" "wait_for_talos_boot" {
  depends_on = [talos_machine_bootstrap.this]

  ### Talos needs some time to boot after bootstrap.
  create_duration = "2m"
}

resource "kubectl_manifest" "cert_approver" {
  for_each  = data.kubectl_file_documents.cert_approver.manifests
  yaml_body = each.value

  depends_on = [
    time_sleep.wait_for_talos_boot
  ]
}

resource "kubectl_manifest" "metric_server" {
  for_each  = data.kubectl_file_documents.metric_server.manifests
  yaml_body = each.value

  depends_on = [
    time_sleep.wait_for_talos_boot
  ]
}

Patches:

cert-rotate-patch.yaml

# server certificate rotation is needed for the metric server to work
machine:
  kubelet:
    extraArgs:
      rotate-server-certificates: true

hostname-config.yaml

machine:
  network:
    hostname: ${hostname}
studioph commented 2 months ago

I'm on a fresh 0.5.0 install and facing the same issue (same error message, same key in the config), but it only appears to affect controlplane nodes...

studioph commented 2 months ago

My guess is it's likely this note from the docs:

It is recommended to set the optional talos_version attribute. Otherwise when using a new version of the provider with a new major version of the Talos SDK, new machineconfig features will be enabled by default which could cause unexpected behavior.
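
Applied to the configuration from the original post, a minimal sketch might look like the following. The `talos_version` local and its value are assumptions for illustration; it should be set to the Talos release the nodes are actually running. The other names are taken from the Terraform code above.

```hcl
locals {
  # Assumption: replace with the Talos version installed on the nodes.
  talos_version = "v1.6.6"
}

data "talos_machine_configuration" "controlplane" {
  cluster_name     = local.cluster_name
  cluster_endpoint = local.cluster_endpoint
  machine_type     = "controlplane"
  machine_secrets  = talos_machine_secrets.this.machine_secrets

  # Pin the generated machine config to the features of this Talos version,
  # so a newer provider does not enable options the nodes cannot decode.
  talos_version = local.talos_version
}

data "talos_machine_configuration" "worker" {
  cluster_name     = local.cluster_name
  cluster_endpoint = local.cluster_endpoint
  machine_type     = "worker"
  machine_secrets  = talos_machine_secrets.this.machine_secrets
  talos_version    = local.talos_version
}
```

Keeping the version in one local means both data sources change together when the nodes are upgraded.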

NigelVanHattum commented 1 month ago

This does indeed seem to work as expected, though only for my current setup: Talos 1.6.6 with provider v0.5.0.

Can I find a compatibility matrix somewhere? Trying to upgrade Talos to 1.7 gives me the same error again.

smira commented 1 month ago

See https://github.com/siderolabs/terraform-provider-talos/issues/168#issuecomment-2285268168 - set talos_version properly, otherwise you'd get a machine config that is incompatible with older versions of Talos.