Upgrading a cluster does not honor all template variables #387

jessica-hofmeister commented 3 months ago

The upgrade process for Magnum clusters does not apply every configuration option from the cluster template that the cluster is being upgraded to. We noticed this with node sizing, as shown below, but the fix should cover every configuration item in a template.

Steps to reproduce: first, look at the original cluster template, which has the medium flavor and a 40 GB boot volume size:

openstack coe cluster template show test-template-medium-flavor
| Field                 | Value                                                                                                                                                                                                                     |
| insecure_registry     | -                                                                                                                                                                                                                         |
| labels                | {'kube_tag': 'v1.27.4', 'boot_volume_size': '40', 'boot_volume_type': 'rbd1', 'master_lb_floating_ip_enabled': 'false', 'audit_log_enabled': 'true', 'os_distro': 'ubuntu', 'min_node_count': '1', 'max_node_count': '5'} |
| updated_at            | -                                                                                                                                                                                                                         |
| floating_ip_enabled   | True                                                                                                                                                                                                                      |
| fixed_subnet          | -                                                                                                                                                                                                                         |
| master_flavor_id      | m1.medium                                                                                                                                                                                                                 |
| uuid                  | e290ab0b-3ab5-4fd2-86d6-8da380e478a4                                                                                                                                                                                      |
| no_proxy              | -                                                                                                                                                                                                                         |
| https_proxy           | -                                                                                                                                                                                                                         |
| tls_disabled          | False                                                                                                                                                                                                                     |
| keypair_id            | -                                                                                                                                                                                                                         |
| public                | False                                                                                                                                                                                                                     |
| http_proxy            | -                                                                                                                                                                                                                         |
| docker_volume_size    | -                                                                                                                                                                                                                         |
| server_type           | vm                                                                                                                                                                                                                        |
| external_network_id   | public                                                                                                                                                                                                                    |
| cluster_distro        | ubuntu                                                                                                                                                                                                                    |
| image_id              | ubuntu2204-tenant-k8s-1.27.11-20240506                                                                                                                                                                                    |
| volume_driver         | -                                                                                                                                                                                                                         |
| registry_enabled      | False                                                                                                                                                                                                                     |
| docker_storage_driver | overlay2                                                                                                                                                                                                                  |
| apiserver_port        | -                                                                                                                                                                                                                         |
| name                  | test-template-medium-flavor                                                                                                                                                  |
| created_at            | 2024-06-03T19:42:24+00:00                                                                                                                                                                                                 |
| network_driver        | calico                                                                                                                                                                                                                    |
| fixed_network         | -                                                                                                                                                                                                                         |
| coe                   | kubernetes                                                                                                                                                                                                                |
| flavor_id             | m1.medium                                                                                                                                                                                                                 |
| master_lb_enabled     | True                                                                                                                                                                                                                      |
| dns_nameserver        |                                                                                                                                                                                                                   |
| hidden                | False                                                                                                                                                                                                                     |
| tags                  | -                                                                                                                                                                                                                         |

Second, look at the cluster template with the large flavor and a 60 GB boot volume size:

openstack coe cluster template show test-template-large-flavor
| Field                 | Value                                                                                                                                                                                                                     |
| insecure_registry     | -                                                                                                                                                                                                                         |
| labels                | {'kube_tag': 'v1.27.4', 'boot_volume_size': '60', 'boot_volume_type': 'rbd1', 'master_lb_floating_ip_enabled': 'false', 'audit_log_enabled': 'true', 'os_distro': 'ubuntu', 'min_node_count': '1', 'max_node_count': '5'} |
| updated_at            | -                                                                                                                                                                                                                         |
| floating_ip_enabled   | True                                                                                                                                                                                                                      |
| fixed_subnet          | -                                                                                                                                                                                                                         |
| master_flavor_id      | m1.large                                                                                                                                                                                                                  |
| uuid                  | 0f52c805-7cf8-43b4-bf38-4e2e59274c1a                                                                                                                                                                                      |
| no_proxy              | -                                                                                                                                                                                                                         |
| https_proxy           | -                                                                                                                                                                                                                         |
| tls_disabled          | False                                                                                                                                                                                                                     |
| keypair_id            | -                                                                                                                                                                                                                         |
| public                | False                                                                                                                                                                                                                     |
| http_proxy            | -                                                                                                                                                                                                                         |
| docker_volume_size    | -                                                                                                                                                                                                                         |
| server_type           | vm                                                                                                                                                                                                                        |
| external_network_id   | public                                                                                                                                                                                                                    |
| cluster_distro        | ubuntu                                                                                                                                                                                                                    |
| image_id              | ubuntu-2204-kube-v1.27.4                                                                                                                                                                                                  |
| volume_driver         | -                                                                                                                                                                                                                         |
| registry_enabled      | False                                                                                                                                                                                                                     |
| docker_storage_driver | overlay2                                                                                                                                                                                                                  |
| apiserver_port        | -                                                                                                                                                                                                                         |
| name                  | test-template-large-flavor                                                                                                                                                       |
| created_at            | 2024-06-04T19:39:32+00:00                                                                                                                                                                                                 |
| network_driver        | calico                                                                                                                                                                                                                    |
| fixed_network         | -                                                                                                                                                                                                                         |
| coe                   | kubernetes                                                                                                                                                                                                                |
| flavor_id             | m1.large                                                                                                                                                                                                                  |
| master_lb_enabled     | True                                                                                                                                                                                                                      |
| dns_nameserver        |                                                                                                                                                                                                                   |
| hidden                | False                                                                                                                                                                                                                     |
| tags                  | -                                                                                                                                                                                                                         |

Finally, upgrade the cluster and look at the values the cluster has adopted. Notice that the cluster references the new template ID, but flavor_id and master_flavor_id are still m1.medium and the boot_volume_size label is still 40:

openstack coe cluster show test-cluster
| Field                | Value                                                                                                                                                                                                                                                                                                                                                            |
| status               | UPDATE_COMPLETE                                                                                                                                                                                                                                                                                                                                                  |
| health_status        | HEALTHY                                                                                                                                                                                                                                                                                                                                                          |
| cluster_template_id  | 0f52c805-7cf8-43b4-bf38-4e2e59274c1a                                                                                                                                                                                                                                                                                                                             |
| node_addresses       | []                                                                                                                                                                                                                                                                                                                                                               |
| uuid                 | eca19ace-e4b8-4f0e-bb73-36039965e850                                                                                                                                                                                                                                                                                                                             |
| stack_id             | kube-cwpqf                                                                                                                                                                                                                                                                                                                                                       |
| status_reason        | None                                                                                                                                                                                                                                                                                                                                                             |
| created_at           | 2024-06-04T13:37:15+00:00                                                                                                                                                                                                                                                                                                                                        |
| updated_at           | 2024-06-05T17:47:42+00:00                                                                                                                                                                                                                                                                                                                                        |
| coe_version          | v1.27.4                                                                                                                                                                                                                                                                                                                                                          |
| labels               | {'kube_tag': 'v1.27.4', 'boot_volume_size': '40', 'boot_volume_type': 'rbd1', 'master_lb_floating_ip_enabled': 'false', 'audit_log_enabled': 'true', 'os_distro': 'ubuntu', 'min_node_count': '1', 'max_node_count': '5', 'manila_csi_share_network_id': '94d53598-2241-4b12-b46d-056fa090a7a4', 'auto_healing_enabled': 'True', 'auto_scaling_enabled': 'True'} |
| labels_overridden    | {'boot_volume_size': '60'}                                                                                                                                                                                                                                                                                                                                       |
| labels_skipped       | {}                                                                                                                                                                                                                                                                                                                                                               |
| labels_added         | {'manila_csi_share_network_id': '94d53598-2241-4b12-b46d-056fa090a7a4', 'auto_healing_enabled': 'True', 'auto_scaling_enabled': 'True'}                                                                                                                                                                                                                          |
| fixed_network        | test-network                                                                                                                                                                                                                                                                                                                                 |
| fixed_subnet         | None                                                                                                                                                                                                                                                                                                                                                             |
| floating_ip_enabled  | False                                                                                                                                                                                                                                                                                                                                                            |
| faults               |                                                                                                                                                                                                                                                                                                                                                                  |
| keypair              | svc_account                                                                                                                                                                                                                                                                                                                                                    |
| api_address          |                                                                                                                                                                                                                                                                                                                                        |
| master_addresses     | []                                                                                                                                                                                                                                                                                                                                                               |
| master_lb_enabled    | True                                                                                                                                                                                                                                                                                                                                                             |
| create_timeout       | 60                                                                                                                                                                                                                                                                                                                                                               |
| node_count           | 1                                                                                                                                                                                                                                                                                                                                                                |
| discovery_url        | None                                                                                                                                                                                                                                                                                                                                                             |
| docker_volume_size   | None                                                                                                                                                                                                                                                                                                                                                             |
| master_count         | 3                                                                                                                                                                                                                                                                                                                                                                |
| container_version    | None                                                                                                                                                                                                                                                                                                                                                             |
| name                 | test-cluster                                                                                                                                                                                                                                                                                                                                        |
| master_flavor_id     | m1.medium                                                                                                                                                                                                                                                                                                                                                        |
| flavor_id            | m1.medium                                                                                                                                                                                                                                                                                                                                                        |
| health_status_reason | {'kube-cwpqf-default-worker-mbrwk-ndshj-hpdfc.Ready': 'True', 'kube-cwpqf-tn6xc-czzmp.Ready': 'True', 'kube-cwpqf-tn6xc-ft52j.Ready': 'True', 'kube-cwpqf-tn6xc-wr7rr.Ready': 'True'}                                                                                                                                                                            |
| project_id           | 8cdcda55818b40c681b03132bbf3a6bc                                                                                                                                                                                                                                                                                                                                 |

Additionally, the root filesystem on the node is still 40 GB after the upgrade, not the 60 GB configured in the new template:

ubuntu@kube-cwpqf-control-plane-sjx56-lv777:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           392M  3.5M  388M   1% /run
/dev/vda1        40G  7.9G   30G  22% /
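
The mismatch above can be summarized with a small comparison. This is only a sketch: the dictionaries are hand-copied from the `template show` and `cluster show` output in this report, and `find_drift` is a hypothetical helper, not part of Magnum or the OpenStack client.

```python
# Values hand-copied from the CLI output above (only the fields under discussion).
new_template = {
    "flavor_id": "m1.large",
    "master_flavor_id": "m1.large",
    "labels": {"boot_volume_size": "60"},
}
cluster_after_upgrade = {
    "flavor_id": "m1.medium",
    "master_flavor_id": "m1.medium",
    "labels": {"boot_volume_size": "40"},
}


def find_drift(template, cluster):
    """Return {field: (template_value, cluster_value)} for every mismatch.

    Hypothetical helper for illustration; labels are compared key by key.
    """
    drift = {}
    for key, wanted in template.items():
        if key == "labels":
            continue
        if cluster.get(key) != wanted:
            drift[key] = (wanted, cluster.get(key))
    for label, wanted in template.get("labels", {}).items():
        actual = cluster.get("labels", {}).get(label)
        if actual != wanted:
            drift[f"labels.{label}"] = (wanted, actual)
    return drift


drift = find_drift(new_template, cluster_after_upgrade)
for field, (wanted, actual) in drift.items():
    print(f"{field}: template={wanted} cluster={actual}")
```

All three fields compared here still carry the old template's values even though cluster_template_id points at the new template.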

mnaser commented 3 months ago

Unfortunately, I think there is more to this issue. One concern that occurred to me while writing this fix is that when you create a cluster, you can optionally specify a master_flavor_id and/or flavor_id (if not set, they are copied from the cluster template).

If we force the cluster to always take the values from the cluster template, someone who created a cluster with a specific flavor_id or master_flavor_id could see their cluster resized down or up without expecting or wanting that change.

It seems that for us to be able to do this, we need to make those two attributes updatable in the Magnum API, and then handle the change inside the driver's update request (and not in upgrade).
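
To make the concern concrete, here is a minimal sketch of the behavior described above. It is not Magnum's code: the `Template` and `Cluster` classes, `create_cluster`, and `forced_upgrade` are hypothetical stand-ins, and the `m1.xlarge` flavor is an invented example of a user-chosen override. The template UUIDs and flavors are the ones from this issue.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Template:
    uuid: str
    flavor_id: str


@dataclass(frozen=True)
class Cluster:
    name: str
    cluster_template_id: str
    flavor_id: str


def create_cluster(name: str, template: Template,
                   flavor_id: Optional[str] = None) -> Cluster:
    """flavor_id is optional; if not set, it is copied from the template."""
    chosen = flavor_id if flavor_id is not None else template.flavor_id
    return Cluster(name, template.uuid, chosen)


def forced_upgrade(cluster: Cluster, new_template: Template) -> Cluster:
    """Hypothetical upgrade that always applies the template's flavor."""
    return replace(
        cluster,
        cluster_template_id=new_template.uuid,
        flavor_id=new_template.flavor_id,
    )


medium = Template("e290ab0b-3ab5-4fd2-86d6-8da380e478a4", "m1.medium")
large = Template("0f52c805-7cf8-43b4-bf38-4e2e59274c1a", "m1.large")

# One cluster inherits the template flavor; the other pins its own
# ("m1.xlarge" is an assumed flavor name, used only for illustration).
inherited = create_cluster("inherits-template", medium)
pinned = create_cluster("pinned-by-user", medium, flavor_id="m1.xlarge")

inherited_after = forced_upgrade(inherited, large)
pinned_after = forced_upgrade(pinned, large)

print(inherited.flavor_id, "->", inherited_after.flavor_id)
print(pinned.flavor_id, "->", pinned_after.flavor_id)
```

In this sketch the first cluster gets the resize the reporter expects, while the second is silently moved off the flavor its owner chose. That is the case an explicit update of flavor_id and master_flavor_id through the API would avoid.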

I hope that this explanation makes sense as to why we can't handle this right now.