vmware / terraform-provider-tanzu-mission-control

Terraform provider to manage resources of Tanzu Mission Control
Mozilla Public License 2.0

Upgrading workload cluster failures #363

Closed ysineil closed 6 months ago

ysineil commented 7 months ago

Describe the bug

Upgrading workload clusters from v1.23.8+vmware.2-tkg.2-zshippable to v1.24.9+vmware.1-tkg.4 by adjusting spec.version does not complete; the apply eventually times out with the error below.

Terraform will perform the following actions:

 # tanzu-mission-control_tanzu_kubernetes_cluster.tkgs_cluster will be updated in-place
  ~ resource "tanzu-mission-control_tanzu_kubernetes_cluster" "tkgs_cluster" {
        id                      = "xxxx"
        name                    = "xxxx"
        # (2 unchanged attributes hidden)

      ~ spec {
            # (3 unchanged attributes hidden)

          ~ topology {
              ~ version           = "v1.23.8+vmware.2-tkg.2-zshippable" -> "v1.24.9+vmware.1-tkg.4"
                # (2 unchanged attributes hidden)

                # (5 unchanged blocks hidden)
            }
        }

        # (2 unchanged blocks hidden)
    }

╷
│ Error: Couldn't read TKG cluster.
│ Management Cluster Name: xxx, Provisioner: xxxx, Cluster Name: xxxx: Timeout exceeded while waiting for the cluster to be ready. Cluster Status: UPGRADING, Cluster Health: HEALTHY: context deadline exceeded
│ 
│   with tanzu-mission-control_tanzu_kubernetes_cluster.tkgs_cluster,
│   on main.tf line 98, in resource "tanzu-mission-control_tanzu_kubernetes_cluster" "tkgs_cluster":
│   98: resource "tanzu-mission-control_tanzu_kubernetes_cluster" "tkgs_cluster" {
│ 
╵
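
The error shown is a client-side wait timeout: the provider stopped waiting while the cluster was still UPGRADING (and HEALTHY). If an upgrade is simply slow, the wait can be lengthened through the resource's timeout_policy block (already present in the config below); a minimal sketch, assuming the timeout value is in minutes as documented for this resource:

  timeout_policy {
    timeout             = 120  # wait up to 120 minutes instead of the 60 used below
    wait_for_kubeconfig = true
    fail_on_timeout     = true
  }

(Later comments in this thread point to a backend patching error rather than a slow upgrade, so a longer timeout alone would not have resolved it.)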

Relevant main.tf

resource "tanzu-mission-control_tanzu_kubernetes_cluster" "tkgs_cluster" {
  management_cluster_name = var.management_cluster
  provisioner_name = var.provisioner
  name = var.name

  spec {
    cluster_group_name = "${var.location}"

    topology {
      version = replace("${var.k8s_version}", "+", "+")
      cluster_class = "tanzukubernetescluster"
      cluster_variables = jsonencode(local.tkgs_cluster_variables)

      control_plane {
        replicas = 3

        os_image {
          name = "${var.os_image}"
          version = "3"
          arch = "amd64"
        }
      }

      nodepool {
        name = "default-nodepool-a"
        description = "tkgs workload nodepool"

        spec {
          worker_class = "node-pool"
          replicas = 3
          overrides = jsonencode(local.tkgs_cluster_variables)
          failure_domain = "${var.location}-zone-a"

          os_image {
            name = "${var.os_image}"
            version = "3"
            arch = "amd64"
          }
        }
      }

      nodepool {
        name = "default-nodepool-b"
        description = "tkgs workload nodepool"

        spec {
          worker_class = "node-pool"
          replicas = 3
          overrides = jsonencode(local.tkgs_cluster_variables)
          failure_domain = "${var.location}-zone-b"

          os_image {
            name = "${var.os_image}"
            version = "3"
            arch    = "amd64"
          }
        }
      }

      nodepool {
        name        = "default-nodepool-c"
        description = "tkgs workload nodepool"

        spec {
          worker_class   = "node-pool"
          replicas       = 3
          overrides      = jsonencode(local.tkgs_cluster_variables)
          failure_domain = "${var.location}-zone-c"

          os_image {
            name    = "${var.os_image}"
            version = "3"
            arch    = "amd64"
          }
        }
      }   

      network {
        pod_cidr_blocks = [
          "172.20.0.0/16",
        ]
        service_cidr_blocks = [
          "10.96.0.0/16",
        ]
        service_domain = "cluster.local"
      }
    }
  }

  timeout_policy {
    timeout             = 60
    wait_for_kubeconfig = true
    fail_on_timeout     = true
  }
}
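
The configuration above references local.tkgs_cluster_variables, which is not included in the report. A plausible shape, inferred from the new-cluster cluster_variables state shown later in this thread (all values are placeholders taken from that example, not a verified definition):

  locals {
    tkgs_cluster_variables = {
      defaultStorageClass = "pcb-ha-zone-1-pv"
      storageClass        = "pcb-ha-zone-1-pv"
      storageClasses      = ["pcb-ha-zone-1-pv"]
      vmClass             = "best-effort-medium"
      nodePoolVolumes = [
        {
          name         = "containerd"
          capacity     = { storage = "200G" }
          mountPath    = "/var/lib/containerd"
          storageClass = "pcb-ha-zone-1-pv"
        },
      ]
    }
  }

Note that the same local is passed both to topology.cluster_variables and to each nodepool's overrides via jsonencode.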

vars:

kind: vmware/tanzu
scope: cluster
name: xxxx
location: xxxx
management_cluster: xxxx
provisioner: xxxx
k8s_version: "v1.24.9+vmware.1-tkg.4"
os_image: photon
cp_nodecount: 3
cp_size: medium
nodepool_nodecount: 2
nodepool_size: medium
nodepool_disksize: 200G

Reproduction steps

  1. Deploy a new 1.23 cluster
  2. Upgrade it to 1.24 by changing spec.version (see the sketch below)
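
In Terraform terms, the only difference between the two applies is the version value fed into spec.topology.version; a sketch of the variable change, with everything else unchanged:

  # First apply: deploy the 1.23 cluster
  k8s_version = "v1.23.8+vmware.2-tkg.2-zshippable"

  # Second apply: bump only the version to trigger the in-place upgrade
  k8s_version = "v1.24.9+vmware.1-tkg.4"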

Expected behavior

Upgrade to complete successfully.

Additional context

No response

ysineil commented 7 months ago

It generally seems to boil down to changes not being reflected correctly or not completing. I deployed a new cluster via TF (using the same config above) and triggered the upgrade via TMC, and it upgraded successfully.

ramya-bangera commented 7 months ago

@Axpz - Can you take up this issue?

Axpz commented 7 months ago

It generally seems to boil down to changes not being reflected correctly or not completing. Cluster Status: UPGRADING, Cluster Health: HEALTHY

Normally, this is supported. From the log, it seems like this is a backend issue, not a terraform issue.

ysineil commented 7 months ago

I'm not sure it's a TMC/backend issue - if I create a cluster with TF and manually upgrade it via TMC (clicking the upgrade button) it works. It's only when the upgrade is triggered via TF that it goes into an odd state.

Axpz commented 7 months ago

Hi @ysineil @ramya-bangera, I can finally reproduce this issue. After running terraform apply for the version update from "v1.23.8+vmware.2-tkg.2-zshippable" -> "v1.24.9+vmware.1-tkg.4", I found the error message below via kubectl describe cluster tfu-100-1 -n testns:

Message: error computing the desired state of the Cluster topology: failed to apply patches: failed to generate patches for patch "nodeLabels": failed to generate JSON patches for item with uid "34a77e17-4945-4e10-9dd8-b1f4a93177bc": failed to calculate value for template: failed to render template: "run.tanzu.vmware.com/tkr={{ index (index .TKR_DATA .builtin.machineDeployment.version).labels \"run.tanzu.vmware.com/tkr\" }},run.tanzu.vmware.com/kubernetesDistributionVersion={{ index (index .TKR_DATA .builtin.machineDeployment.version).labels \"run.tanzu.vmware.com/tkr\" }},{{- range .nodePoolLabels }}{{ .key }}={{ .value }},{{- end }}\n": template: tpl:1:28: executing "tpl" at <index (index .TKR_DATA .builtin.machineDeployment.version).labels "run.tanzu.vmware.com/tkr">: error calling index: index of untyped nil

I think that's the root cause, but I still need some time to investigate where the fix belongs: in the Terraform provider or upstream (TMC, or even TKG).

ysineil commented 7 months ago

Thanks @Axpz. Unsure if this is related, but when doing a terraform import on a workload cluster we do get TKR_DATA as part of the tfstate, whereas a new cluster deployed via TF does not have it.

Example import: "cluster_variables": "{\"TKR_DATA\":{\"v1.23.8+vmware.2\":{\"kubernetesSpec\":{\"coredns\":{\"imageTag\":\"v1.8.6_vmware.7\"},\"etcd\":{\"imageTag\":\"v3.5.4_vmware.6\"},\"imageRepository\":\"localhost:5000/vmware.io\",\"version\":\"v1.23.8+vmware.2\"},\"labels\":{\"image-type\":\"vmi\",\"os-arch\":\"amd64\",\"os-name\":\"photon\",\"os-type\":\"linux\",\"os-version\":\"3\",\"run.tanzu.vmware.com/os-image\":\"ob-20611023-photon-3-amd64-vmi-k8s-v1.23.8---vmware.2-tkg.1-zshippable\",\"run.tanzu.vmware.com/tkr\":\"v1.23.8---vmware.2-tkg.2-zshippable\",\"vmi-name\":\"ob-20611023-photon-3-amd64-vmi-k8s-v1.23.8---vmware.2-tkg.1-zsh\"},\"osImageRef\":{\"name\":\"ob-20611023-photon-3-amd64-vmi-k8s-v1.23.8---vmware.2-tkg.1-zshippable\"}}},\"clusterEncryptionConfigYaml\":\"removed",\"extensionCert\":{\"contentSecret\":{\"key\":\"tls.crt\",\"name\":\"pcb-prod-nsxintel-extensions-ca\"}},\"nodePoolVolumes\":[{\"capacity\":{\"storage\":\"200G\"},\"mountPath\":\"/var/lib/containerd\",\"name\":\"containerd-0\",\"storageClass\":\"pcb-ha-zone-1-pv\"}],\"ntp\":\"0.0.0.0\",\"storageClass\":\"pcb-ha-zone-1-pv\",\"user\":{\"passwordSecret\":{\"key\":\"ssh-passwordkey\",\"name\":\"removed"},\"sshAuthorizedKey\":\"removed"},\"vmClass\":\"best-effort-medium\"}",

example new cluster: "cluster_variables": "{\"defaultStorageClass\":\"pcb-ha-zone-1-pv\",\"nodePoolVolumes\":[{\"capacity\":{\"storage\":\"200G\"},\"mountPath\":\"/var/lib/containerd\",\"name\":\"containerd\",\"storageClass\":\"pcb-ha-zone-1-pv\"}],\"storageClass\":\"pcb-ha-zone-1-pv\",\"storageClasses\":[\"pcb-ha-zone-1-pv\"],\"vmClass\":\"best-effort-medium\"}",

Axpz commented 7 months ago

Hi @ysineil, this issue will be fixed together with #364; hopefully this can be done next week, allowing for testing and promotion.

Axpz commented 6 months ago

Hi @ysineil, this issue has been fixed and promoted to the PROD stage on the TMC side; sorry for the delay, which was due to a technical issue. Please retry, and if you have any questions feel free to let us know, thanks!

ramya-bangera commented 6 months ago

Closing this issue as per https://github.com/vmware/terraform-provider-tanzu-mission-control/issues/364#issuecomment-1928190241