vexxhost / magnum-cluster-api

Cluster API driver for OpenStack Magnum
Apache License 2.0

Magnum cluster upgrade still references old template #317

Closed: jessica-hofmeister closed this issue 6 months ago

jessica-hofmeister commented 7 months ago

Create 2 cluster templates, one with the ubuntu-2204-kube-v1.27.3 image and one with the ubuntu-2204-kube-v1.27.4 image. Create a cluster from the 1.27.3 template. After it comes up healthy, upgrade it to the 1.27.4 template. The cluster reaches UPGRADE_COMPLETE status, but still references the old template.

Relevant images:

openstack image list        
+--------------------------------------+-------------------------------------+--------+
| ID                                   | Name                                | Status |
+--------------------------------------+-------------------------------------+--------+
| 3097dcc8-8c26-473c-a04b-55fe88a65e13 | ubuntu-2204-kube-v1.27.3            | active |
| 4baeb5c2-42cb-457e-81cf-885b100c19a2 | ubuntu-2204-kube-v1.27.4            | active |
+--------------------------------------+-------------------------------------+--------+ 

Create a cluster template for 1.27.3:

openstack coe cluster template create \
  --image 3097dcc8-8c26-473c-a04b-55fe88a65e13 \
  --coe kubernetes \
  --flavor m1.medium \
  --master-flavor m1.medium \
  --external-network public \
  --master-lb-enabled \
  --floating-ip-disabled \
  --network-driver calico \
  --docker-storage-driver overlay2 \
  --label kube_tag=v1.27.3 \
  --label boot_volume_size=40 \
  --label boot_volume_type=rbd1 \
  --label master_lb_floating_ip_enabled=false \
  --label audit_log_enabled=true \
  --label os_distro=ubuntu \
  test-v1.27.3

Create a cluster template for 1.27.4:

openstack coe cluster template create \
  --image 4baeb5c2-42cb-457e-81cf-885b100c19a2 \
  --coe kubernetes \
  --flavor m1.medium \
  --master-flavor m1.medium \
  --external-network public \
  --master-lb-enabled \
  --floating-ip-disabled \
  --network-driver calico \
  --docker-storage-driver overlay2 \
  --label kube_tag=v1.27.4 \
  --label boot_volume_size=40 \
  --label boot_volume_type=rbd1 \
  --label master_lb_floating_ip_enabled=false \
  --label audit_log_enabled=true \
  --label os_distro=ubuntu \
  test-v1.27.4 

Create a cluster using the test-v1.27.3 template:

openstack coe cluster create \
 --cluster-template test-v1.27.3 \
 --master-count 1 \
 --node-count 1 \
 --fixed-network dev-k8s \
 --keypair svc-account \
 --floating-ip-disabled \
 test-cluster 

Wait for the cluster to come up healthy:

kubectl get nodes
NAME                                          STATUS   ROLES                  AGE     VERSION
kube-db673-control-plane-ndpfj-fqf56          Ready    control-plane,master   4m47s   v1.27.3
kube-db673-default-worker-infra-g28k4-5psxz   Ready    worker                 3m43s   v1.27.3 

List the openstack templates:

openstack coe cluster template list                 
+--------------------------------------+--------------+------+
| uuid                                 | name         | tags |
+--------------------------------------+--------------+------+
| 68813151-763b-4fb5-b2e0-c254f1ad4b42 | test-v1.27.3 | None |
| 063f2a66-a994-4d81-aa00-0442359a333e | test-v1.27.4 | None |
+--------------------------------------+--------------+------+

See what template the cluster has currently:

openstack coe cluster show test-cluster -f value -c cluster_template_id            
68813151-763b-4fb5-b2e0-c254f1ad4b42 

Upgrade the cluster to the test-v1.27.4 template:

openstack coe cluster upgrade test-cluster test-v1.27.4
Request to upgrade cluster test-cluster has been accepted. 

Wait for the cluster to finish upgrading:

kubectl get nodes
NAME                                          STATUS   ROLES                  AGE   VERSION
kube-db673-control-plane-7bkxd-55gnv          Ready    control-plane,master   28m   v1.27.4
kube-db673-default-worker-infra-zq68k-jtl8z   Ready    worker                 24m   v1.27.4 

The actual instances show that they are using the test-v1.27.4 image. Now see what template the cluster currently references:

openstack coe cluster show test-cluster -f value -c cluster_template_id
68813151-763b-4fb5-b2e0-c254f1ad4b42 

Notice that it still reports the current template as test-v1.27.3 even though the upgrade to test-v1.27.4 completed.

And now, the test that shows where this breaks: attempt to scale the cluster up to 2 worker nodes:

openstack coe cluster resize test-cluster 2
Request to resize cluster test-cluster has been accepted. 

The cluster resize actually fails. In the details below, note the mismatch between coe_version (v1.27.4) and the kube_tag label (v1.27.3):

openstack coe cluster show test-cluster -f yaml                        
status: UPDATE_FAILED
health_status: HEALTHY
cluster_template_id: 68813151-763b-4fb5-b2e0-c254f1ad4b42
node_addresses: []
uuid: 9f5066cf-3985-4a75-b350-f731370e3d7b
stack_id: kube-db673
status_reason: 'admission webhook "validation.cluster.cluster.x-k8s.io" denied the
  request: Cluster.cluster.x-k8s.io "kube-db673" is invalid: spec.topology.version:
  Invalid value: "v1.27.3": version cannot be decreased from "1.27.4" to "1.27.3"'
created_at: '2024-03-04T17:21:28+00:00'
updated_at: '2024-03-04T18:29:17+00:00'
coe_version: v1.27.4
labels:
  audit_log_enabled: 'true'
  boot_volume_size: '40'
  boot_volume_type: rbd1
  kube_tag: v1.27.3
  master_lb_floating_ip_enabled: 'false'
  os_distro: ubuntu
labels_overridden: {}
labels_skipped: {}
labels_added: {}
fixed_network: dev-k8s
fixed_subnet: null
floating_ip_enabled: false
faults:
  default-worker: 'admission webhook "validation.cluster.cluster.x-k8s.io" denied
    the request: Cluster.cluster.x-k8s.io "kube-db673" is invalid: spec.topology.version:
    Invalid value: "v1.27.3": version cannot be decreased from "1.27.4" to "1.27.3"'
keypair: svc-account
api_address: https://172.22.4.228:6443
master_addresses: []
master_lb_enabled: true
create_timeout: 60
node_count: 1
discovery_url: null
docker_volume_size: null
master_count: 1
container_version: null
name: test-cluster
master_flavor_id: m1.medium
flavor_id: m1.medium
health_status_reason:
  kube-db673-default-worker-v58w9-vphx2-59nqs.Ready: 'True'
  kube-db673-ppj9w-fztwp.Ready: 'True'
project_id: 402f35ab1fa340d5834c55e6a2d4c32f 
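The webhook denial above follows directly from the stale kube_tag: on resize, the desired Cluster spec is regenerated from the template Magnum still has on record, and the Cluster API topology webhook refuses any decrease of spec.topology.version. A minimal sketch of that check (illustrative only; the function names are mine, not from either codebase):

```python
# Illustrative sketch, NOT the actual Cluster API webhook code: the
# "validation.cluster.cluster.x-k8s.io" webhook rejects any update that
# would lower spec.topology.version. Because Magnum rebuilds the desired
# spec from the stale template's kube_tag (v1.27.3), the resize is denied
# even though the resize itself asks for no version change.

def parse_version(tag: str) -> tuple:
    """Turn 'v1.27.4' into (1, 27, 4) for numeric comparison."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))

def validate_topology_version(current: str, desired: str) -> None:
    """Reject updates that decrease the topology version."""
    if parse_version(desired) < parse_version(current):
        raise ValueError(
            f'spec.topology.version: Invalid value: "{desired}": '
            f'version cannot be decreased from "{current.lstrip("v")}" '
            f'to "{desired.lstrip("v")}"'
        )

# The resize recomputes the spec from the old template's kube_tag:
try:
    validate_topology_version(current="v1.27.4", desired="v1.27.3")
except ValueError as exc:
    print(exc)
```

Note that tuple comparison (rather than string comparison) matters here: "v1.27.10" must sort after "v1.27.4".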
jessica-hofmeister commented 7 months ago

One other interesting thing: running a server show on one of the instances in the cluster reports image_id: None (the instances boot from volume), while the UI shows the 1.27.4 image:

| id            | 48a735e5-f3ea-4e4c-8e00-fc33525582f3 |
| image         | N/A (booted from volume)             |
| imageRef      | None                                 |
| image_id      | None                                 |
| instance_name | None                                 |


mnaser commented 6 months ago

We need to save the cluster template ID here:

https://github.com/vexxhost/magnum-cluster-api/blob/main/magnum_cluster_api/driver.py#L231-L256
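A minimal sketch of what "saving the cluster template ID" would look like. All class and function names here are stand-ins, not the actual driver code: the point is only that the upgrade path must write the new template's UUID back to the Magnum cluster record alongside the labels, or cluster_template_id keeps pointing at the old template.

```python
# Hypothetical sketch of the fix described above (names are assumptions,
# not the real magnum-cluster-api driver code).

class Cluster:
    """Stand-in for magnum.objects.Cluster."""
    def __init__(self, cluster_template_id, labels):
        self.cluster_template_id = cluster_template_id
        self.labels = labels
        self.saved = False

    def save(self):
        self.saved = True  # the real object persists to the Magnum DB

def upgrade_cluster(cluster, new_template):
    # The missing piece: point the cluster at the new template, not just
    # inherit its labels, before persisting.
    cluster.cluster_template_id = new_template["uuid"]
    cluster.labels = dict(new_template["labels"])
    cluster.save()

cluster = Cluster("68813151-763b-4fb5-b2e0-c254f1ad4b42",
                  {"kube_tag": "v1.27.3"})
upgrade_cluster(cluster, {"uuid": "063f2a66-a994-4d81-aa00-0442359a333e",
                          "labels": {"kube_tag": "v1.27.4"}})
print(cluster.cluster_template_id)
# prints 063f2a66-a994-4d81-aa00-0442359a333e
```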

mnaser commented 6 months ago

Oh, this might be far more complicated. I think that during an upgrade, update_cluster_status may be running somewhere else and overriding the save that happens here:

https://github.com/openstack/magnum/blob/35374b4380db673f9b61cb18da0f9382dcc00fce/magnum/conductor/handlers/cluster_conductor.py#L368-L369

We either need to set up some lock, or have the cluster update sync code pull the cluster template from the Magnum resource.
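The suspected race is a classic lost update, which can be sketched in miniature (everything here is a hypothetical simulation, not Magnum code): a periodic status sync loads a stale copy of the cluster, the upgrade handler saves the new template ID, and then the sync writes its stale snapshot back over it.

```python
# Hypothetical simulation of the race between the upgrade handler's save
# and a concurrent update_cluster_status pass that holds a stale copy.

db = {"cluster_template_id": "old-template"}

def load():
    return dict(db)

def save(record):
    db.update(record)

# 1. The periodic sync reads the cluster before the upgrade lands.
stale_copy = load()

# 2. The upgrade handler persists the new template ID.
upgraded = load()
upgraded["cluster_template_id"] = "new-template"
save(upgraded)

# 3. The sync finishes and writes back its stale snapshot: lost update.
save(stale_copy)

print(db["cluster_template_id"])
# prints old-template: the upgrade's save was silently overridden
```

Either serializing the two writers with a lock, or having the sync path re-read cluster_template_id from the authoritative Magnum record just before saving, would prevent the stale value from winning.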

jessica-hofmeister commented 6 months ago

After scaling the Magnum conductor down to a single replica, we retested and the results are exactly the same: each node upgrades to the new Kubernetes version, but the cluster itself still references the old template ID.