wandb / terraform-google-wandb

A Terraform module for deploying Weights & Biases on GCP.
Apache License 2.0

fix: Update GKE version #51

Closed by flamarion 1 year ago

flamarion commented 1 year ago

This PR upgrades the Kubernetes cluster to a newer stable version, since the current version (1.22.12-gke.500) is no longer supported:

module.wandb.module.database.google_sql_user.wandb: Creation complete after 1s [id=wandb//tf-perms-gcp-star-sheep]
╷
│ Error: googleapi: Error 400: Master version "1.22.12-gke.500" is unsupported., badRequest
│
│   with module.wandb.module.app_gke.google_container_cluster.default,
│   on ../terraform-google-wandb/modules/app_gke/main.tf line 1, in resource "google_container_cluster" "default":
│    1: resource "google_container_cluster" "default" {
│
╵

The currently supported versions in the STABLE channel:

gcloud container get-server-config --flatten="channels" --filter="channels.channel=STABLE" \
    --format="yaml(channels.channel,channels.validVersions)"
Fetching server config for europe-west2-a
---
channels:
  channel: STABLE
  validVersions:
  - 1.24.9-gke.1500
  - 1.23.14-gke.1800
  - 1.22.16-gke.2000
  - 1.21.14-gke.14600
  - 1.21.14-gke.14100

It also adds an input for specifying the Kubernetes version, along with validation of the accepted versions.
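
The version input and its validation could look roughly like the sketch below. The variable name `gke_min_master_version`, the default, and the allowed list are assumptions for illustration (the list is taken from the STABLE channel output above); the names used in the actual diff may differ.

```hcl
# Hypothetical variable; not necessarily the name used in this PR.
variable "gke_min_master_version" {
  description = "Minimum Kubernetes version for the GKE control plane."
  type        = string
  default     = "1.23.14-gke.1800"

  validation {
    # Accept only versions that the STABLE channel listed as valid
    # at the time of writing (assumed list, copied from the gcloud output).
    condition = contains(
      ["1.24.9-gke.1500", "1.23.14-gke.1800", "1.22.16-gke.2000"],
      var.gke_min_master_version
    )
    error_message = "The version must be one of the versions currently valid in the STABLE channel."
  }
}
```

A hard-coded list like this goes stale as GKE retires versions, so it would need to be refreshed whenever the channel's valid versions change.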

flamarion commented 1 year ago

The upgrade takes quite some time: roughly 7 minutes for the cluster and over 21 minutes for the node pool.

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the
following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # module.wandb.module.app_gke.google_container_cluster.default will be updated in-place
  ~ resource "google_container_cluster" "default" {
        id                          = "projects/tmp-terraform-permissions/locations/europe-west2-a/clusters/tf-perms-gcp-cluster"
      ~ min_master_version          = "1.22.16-gke.2000" -> "1.23.14-gke.1800"
        name                        = "tf-perms-gcp-cluster"
        # (27 unchanged attributes hidden)

        # (16 unchanged blocks hidden)
    }

  # module.wandb.module.app_gke.google_container_node_pool.default will be updated in-place
  ~ resource "google_container_node_pool" "default" {
        id                          = "projects/tmp-terraform-permissions/locations/europe-west2-a/clusters/tf-perms-gcp-cluster/nodePools/default-pool-relieved-dinosaur"
        name                        = "default-pool-relieved-dinosaur"
      ~ version                     = "1.22.16-gke.2000" -> "1.23.14-gke.1800"
        # (9 unchanged attributes hidden)

        # (4 unchanged blocks hidden)
    }

Cluster:

module.wandb.module.app_gke.google_container_cluster.default: Still modifying... [id=projects/tmp-terraform-permissions/loca...-west2-a/clusters/tf-perms-gcp-cluster, 6m30s elapsed]
module.wandb.module.app_gke.google_container_cluster.default: Still modifying... [id=projects/tmp-terraform-permissions/loca...-west2-a/clusters/tf-perms-gcp-cluster, 6m40s elapsed]
module.wandb.module.app_gke.google_container_cluster.default: Still modifying... [id=projects/tmp-terraform-permissions/loca...-west2-a/clusters/tf-perms-gcp-cluster, 6m50s elapsed]
module.wandb.module.app_gke.google_container_cluster.default: Modifications complete after 6m54s [id=projects/tmp-terraform-permissions/locations/europe-west2-a/clusters/tf-perms-gcp-cluster]

Node Pool:

module.wandb.module.app_gke.google_container_node_pool.default: Still modifying... [id=projects/tmp-terraform-permissions/loca...dePools/default-pool-relieved-dinosaur, 20m50s elapsed]
module.wandb.module.app_gke.google_container_node_pool.default: Still modifying... [id=projects/tmp-terraform-permissions/loca...dePools/default-pool-relieved-dinosaur, 21m0s elapsed]
module.wandb.module.app_gke.google_container_node_pool.default: Still modifying... [id=projects/tmp-terraform-permissions/loca...dePools/default-pool-relieved-dinosaur, 21m10s elapsed]
module.wandb.module.app_gke.google_container_node_pool.default: Still modifying... [id=projects/tmp-terraform-permissions/loca...dePools/default-pool-relieved-dinosaur, 21m20s elapsed]
module.wandb.module.app_gke.google_container_node_pool.default: Modifications complete after 21m28s [id=projects/tmp-terraform-permissions/locations/europe-west2-a/clusters/tf-perms-gcp-cluster/nodePools/default-pool-relieved-dinosaur]
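
Since the node pool upgrade ran for more than 21 minutes, it may be worth setting an explicit operation timeout in case a larger pool takes longer than the provider allows by default. This is a minimal sketch, not part of this PR: the resource name, pool name, cluster name, and the 45 minute value are all assumptions.

```hcl
# Illustrative only: names and the timeout value are placeholders.
resource "google_container_node_pool" "example" {
  name       = "example-pool"
  cluster    = "example-cluster"
  location   = "europe-west2-a"
  node_count = 1
  version    = "1.23.14-gke.1800"

  # Give in-place version upgrades more headroom than the observed ~21 minutes.
  timeouts {
    update = "45m"
  }
}
```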

jsbroks commented 1 year ago

Wait, do we need to set a min master version? Should Terraform use the latest when we deploy the instance?

flamarion commented 1 year ago

> Wait, do we need to set a min master version? Should Terraform use the latest when we deploy the instance?

The default version for the STABLE channel is not the latest; it is 1.23:

gcloud container get-server-config --zone=europe-west2 --flatten=channels --filter="channels.channel=STABLE" --format="value(channels.defaultVersion)"
Fetching server config for europe-west2
1.23.14-gke.1800

1.24 is available, but it's not the default in the STABLE channel. If you think we need to go to 1.25, we need to switch the channel from STABLE to RAPID.
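
For reference, here is a rough sketch of how the channel default could be looked up from Terraform instead of gcloud, and where the channel would be switched. The resource and output names are placeholders, and the cluster block is a stripped-down example rather than the module's actual `google_container_cluster.default`.

```hcl
# Look up the versions GKE offers in the region (same data as the gcloud call).
data "google_container_engine_versions" "example" {
  location = "europe-west2"
}

# Default version of the STABLE channel, e.g. 1.23.14-gke.1800 at the time of this PR.
output "stable_default_version" {
  value = data.google_container_engine_versions.example.release_channel_default_version["STABLE"]
}

# Placeholder cluster showing only the settings relevant to this discussion.
resource "google_container_cluster" "example" {
  name               = "example-cluster"
  location           = "europe-west2-a"
  initial_node_count = 1

  # Switching this to "RAPID" is what would make newer minor versions available.
  release_channel {
    channel = "STABLE"
  }
}
```

Using the data source as the default for the minimum master version would keep the module on whatever the channel currently recommends, at the cost of plans changing whenever GKE moves the default.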