vexxhost / magnum-cluster-api

Cluster API driver for OpenStack Magnum
Apache License 2.0

[Question] Manage Kubernetes cluster with terraform and magnum-cluster-api #438

Open 0x00ace opened 1 month ago

0x00ace commented 1 month ago

Does anyone have a working example of using Terraform with magnum-cluster-api?

So far, I have been managing clusters with the OpenStack client: creating clusters/node groups and resizing works fine that way. With Terraform, however, I run into some issues.

I have tried the following Terraform manifest:

resource "openstack_containerinfra_clustertemplate_v1" "kube_v1_30_2_cilium_ovn_tf" {
  name                  = "k8s-v1.30.2-cilium-ovn-tf"
  image                 = openstack_images_image_v2.rockylinux_9_kube_v1_30_2.name
  coe                   = "kubernetes"
  flavor                = "2cpu4ram20hdd"
  master_flavor         = "2cpu4ram20hdd"
  dns_nameserver        = "8.8.8.8,8.8.4.4"
  docker_storage_driver = "overlay2"
  network_driver        = "cilium"
  master_lb_enabled     = "true"
  server_type           = "vm"
  floating_ip_enabled   = "false"
  external_network_id   = openstack_networking_network_v2.vlan77.id
  keypair_id            = openstack_compute_keypair_v2.ace_keypair_1.name

  labels = {
    kube_tag          = "v1.30.2"
    octavia_provider  = "ovn"
    boot_volume_size  = 20
    boot_volume_type  = "rbd-1"
    fixed_subnet_cidr = "192.168.0.0/24"
  }
}

resource "openstack_containerinfra_cluster_v1" "cluster_1_tf" {
  name                = "cluster_1_tf"
  cluster_template_id = openstack_containerinfra_clustertemplate_v1.kube_v1_30_2_cilium_ovn_tf.id
  master_count        = 1
  node_count          = 2
  keypair             = openstack_compute_keypair_v2.ace_keypair_1.name
}

With this manifest, the Kubernetes cluster is created successfully.

But when I try to change the node count from 2 to 3, the cluster transitions to the UPDATE_FAILED state.

+--------------------------------------+--------------+---------+------------+--------------+---------------+---------------+
| uuid                                 | name         | keypair | node_count | master_count | status        | health_status |
+--------------------------------------+--------------+---------+------------+--------------+---------------+---------------+
| c7dcec5b-67bd-439a-a3bb-22a57b12b27c | cluster_1_tf | ace     |          2 |            1 | UPDATE_FAILED | HEALTHY       |
+--------------------------------------+--------------+---------+------------+--------------+---------------+---------------+

In the Magnum logs I see the following:

2024-10-12 13:23:29.330 432 DEBUG magnum.api.controllers.v1 [-] Processing request: url: https://cloud.example.org:9511/v1/clusters/c7dcec5b-67bd-439a-a3bb-22a57b12b27c, PATCH, body: b'[{"op":"replace","path":"/node_count","value":3}]' _route /var/lib/kolla/venv/lib64/python3.9/site-packages/magnum/api/controllers/v1/__init__.py:234

After some investigation, I found that resizing with Terraform uses a different API call than the openstack coe cluster resize mycluster 2 command:

2024-10-12 13:40:06.923 348 DEBUG magnum.api.controllers.v1 [-] Processing request: url: https://cloud.example.org:9511/v1/clusters/382970d5-ae39-4480-83f3-fc4723ee465b/actions/resize, POST, body: b'{"node_count": 2}' _route /var/lib/kolla/venv/lib64/python3.9/site-packages/magnum/api/controllers/v1/__init__.py:234
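
In the meantime, the resize action that the CLI uses can also be triggered outside of Terraform. Below is a minimal sketch with keystoneauth1 that issues the same POST /v1/clusters/{uuid}/actions/resize request shown in the log above; the auth_url and credentials are placeholders, and the Magnum endpoint is resolved from the service catalog as container-infra:

# Sketch: issue the same resize action that "openstack coe cluster resize" uses.
# The auth_url and credentials below are placeholders.
from keystoneauth1.identity import v3
from keystoneauth1 import session

auth = v3.Password(
    auth_url="https://cloud.example.org:5000/v3",
    username="myuser",
    password="mypassword",
    project_name="myproject",
    user_domain_name="Default",
    project_domain_name="Default",
)
sess = session.Session(auth=auth)

cluster_uuid = "c7dcec5b-67bd-439a-a3bb-22a57b12b27c"
resp = sess.post(
    f"/v1/clusters/{cluster_uuid}/actions/resize",
    endpoint_filter={"service_type": "container-infra"},  # look up Magnum in the catalog
    headers={"OpenStack-API-Version": "container-infra latest"},
    json={"node_count": 3},
)
print(resp.status_code)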

I have also tried to leverage the node group functionality, but without success.

resource "openstack_containerinfra_nodegroup_v1" "nodegroup_1" {
  name                = "nodegroup_1"
  cluster_id          = openstack_containerinfra_cluster_v1.cluster_1_tf.id
  node_count          = 1
  merge_labels        = "true"
  labels = {
    kube_tag          = "v1.30.2"
    octavia_provider  = "ovn"
    boot_volume_size  = 20
    boot_volume_type  = "rbd-1"
    fixed_subnet_cidr = "192.168.0.0/24"
  }
}

The node group shows up as CREATE_IN_PROGRESS, but nothing happens on the CAPI/CAPO side:

+--------------------------------------+----------------+---------------+---------------------------+------------+--------------------+--------+
| uuid                                 | name           | flavor_id     | image_id                  | node_count | status             | role   |
+--------------------------------------+----------------+---------------+---------------------------+------------+--------------------+--------+
| dfbf7799-0072-4bd7-a92a-963657400822 | default-master | 2cpu4ram20hdd | rockylinux-9-kube-v1.30.2 |          1 | UPDATE_COMPLETE    | master |
| 60920ecf-fbe7-495a-bc3f-2de4cb637c60 | default-worker | 2cpu4ram20hdd | rockylinux-9-kube-v1.30.2 |          1 | CREATE_COMPLETE    | worker |
| 046ef671-ff8e-4c52-a8b5-ce38d8f54c6d | nodegroup_1    | 2cpu4ram20hdd | rockylinux-9-kube-v1.30.2 |          1 | CREATE_IN_PROGRESS | worker |
+--------------------------------------+----------------+---------------+---------------------------+------------+--------------------+--------+
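
One way to double-check that nothing was created on the CAPI/CAPO side is to list the Cluster API MachineDeployments on the management cluster. A rough sketch with the Kubernetes Python client, assuming a kubeconfig that points at the CAPI management cluster and that magnum-cluster-api keeps its resources in the magnum-system namespace (adjust if your deployment differs):

# Sketch: list the MachineDeployments that Cluster API holds on the management cluster.
# Assumes kubeconfig access and the magnum-system namespace used by magnum-cluster-api.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

mds = api.list_namespaced_custom_object(
    group="cluster.x-k8s.io",
    version="v1beta1",
    namespace="magnum-system",
    plural="machinedeployments",
)
for md in mds.get("items", []):
    name = md["metadata"]["name"]
    replicas = md.get("spec", {}).get("replicas")
    phase = md.get("status", {}).get("phase")
    print(name, replicas, phase)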

Configuration:

I tried to find existing issues describing this case, but with no luck.

So, my question: is this a bug or a feature? Or could someone point me to a related issue on GitHub?

mnaser commented 1 month ago

This is strange. What happens if you create a node group manually (with the CLI) instead of with Terraform?

piotr-lodej commented 4 weeks ago

I think this is a problem with the API calls used by Terraform. To update the node count of the default pool, Terraform uses PATCH /v1/clusters/{cluster_ident}, which hits the NotImplementedError() raised by magnum-cluster-api in https://github.com/vexxhost/magnum-cluster-api/blob/21fc4a8b365e93a708e57da89d65d2f10b620610/magnum_cluster_api/driver.py#L272

A quick workaround would be to implement this part as something like this:

        worker_ng = cluster.default_ng_worker
        if worker_ng is not None:
            self._update_nodegroup(context, cluster, worker_ng)
        else:
            raise NotImplementedError()
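
For context, a rough sketch of where this could sit, assuming the NotImplementedError linked above is raised from the driver's update_cluster() (the entry point Magnum calls for PATCH /v1/clusters/{cluster_ident}); the helper names follow the snippet above and may not match the actual code:

# Sketch only: not the actual magnum-cluster-api code, just where the workaround
# above would plug in if update_cluster() is the method raising NotImplementedError.
def update_cluster(self, context, cluster, scale_manager=None, rollback=False):
    worker_ng = cluster.default_ng_worker
    if worker_ng is None:
        # No default worker node group to resize; keep the current behaviour.
        raise NotImplementedError()
    # Reuse the node-group update path so a PATCH of /node_count behaves like
    # the resize action.
    self._update_nodegroup(context, cluster, worker_ng)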