juju / terraform-provider-juju

A Terraform provider for Juju
Apache License 2.0

Scale-out/in via an update to an application placement attribute forces a replacement #512

Open baycarbone opened 2 months ago

baycarbone commented 2 months ago

Description

When trying to scale-out or scale-in a juju application with static placement, I was expecting to be able to add or remove units in place by updating the placement attribute (and the unit count).

However, this results in the application resource being replaced, taking down all units and re-creating them, which is not ideal (e.g. if this were the ceph-osd charm, I would lose the complete ceph cluster when trying to do this).

# juju_application.ubuntu-bm must be replaced
-/+ resource "juju_application" "ubuntu-bm" {
      ~ constraints = "arch=amd64" -> (known after apply)
      ~ id          = "mymodel:ubuntu-bm" -> (known after apply)
        name        = "ubuntu-bm"
      ~ placement   = "0,1,3" -> "0,1,3,4" # forces replacement
      + principal   = (known after apply)
      ~ units       = 3 -> 4
        # (2 unchanged attributes hidden)

      ~ charm {
            name     = "ubuntu"
          ~ revision = 24 -> (known after apply)
          ~ series   = "jammy" -> (known after apply)
            # (2 unchanged attributes hidden)
        }
    }

Plan: 1 to add, 0 to change, 1 to destroy.

It would be good to have a way to mimic the juju add-unit/juju remove-unit functionality with the terraform provider.

Urgency

Blocker for our release

Terraform Juju Provider version

0.12.0

Terraform version

1.9.0

Juju version

3.5.1

Terraform Configuration(s)

terraform {
  required_providers {
    juju = {
      version = "~> 0.12.0"
      source  = "juju/juju"
    }
  }
}

variable "machines" {
  type = list(object({
    machine_id = number
    constraints = string
  }))
}

variable "units" {
  type    = number
  default = 3
}

variable "controller_ids" {
  type    = list(string)
  default = ["100", "101", "102"]
}

resource "juju_model" "test-ubuntu" {
  name = "test"

  cloud {
    name = "maas"
  }
}

resource "juju_machine" "all_machines" {
  for_each    = {
    for index, machine in var.machines :
    machine.machine_id => machine
  }
  model       = juju_model.test-ubuntu.name
  name        = each.value.machine_id
  constraints = each.value.constraints
}

resource "juju_application" "ubuntu-bm" {
  name = "ubuntu-bm"

  model = juju_model.test-ubuntu.name

  charm {
    name    = "ubuntu"
    channel = "latest/stable"
    base    = "ubuntu@22.04"
  }

  units = var.units

  placement = "${join(",", sort([
    for index in var.controller_ids :
      juju_machine.all_machines[index].machine_id
  ]))}"
}

Reproduce / Test

  1. Apply the above plan with the following tfvars:

     machines = [
       { machine_id = 100, constraints = "tags=controller" },
       { machine_id = 101, constraints = "tags=controller" },
       { machine_id = 102, constraints = "tags=controller" },
     ]

  2. Update the deployment with the following tfvars:

     machines = [
       { machine_id = 100, constraints = "tags=controller" },
       { machine_id = 101, constraints = "tags=controller" },
       { machine_id = 102, constraints = "tags=controller" },
       { machine_id = 400, constraints = "tags=compute-storage" },
     ]

     units = 4

     controller_ids = ["100", "101", "102", "400"]


Debug/Panic Output

```shell
# juju_application.ubuntu-bm must be replaced
-/+ resource "juju_application" "ubuntu-bm" {
      ~ constraints = "arch=amd64" -> (known after apply)
      ~ id          = "mymodel:ubuntu-bm" -> (known after apply)
        name        = "ubuntu-bm"
      ~ placement   = "0,1,3" -> "0,1,3,4" # forces replacement
      + principal   = (known after apply)
      ~ units       = 3 -> 4
        # (2 unchanged attributes hidden)

      ~ charm {
            name     = "ubuntu"
          ~ revision = 24 -> (known after apply)
          ~ series   = "jammy" -> (known after apply)
            # (2 unchanged attributes hidden)
        }
    }

Plan: 1 to add, 0 to change, 1 to destroy.
```

Notes & References

I would like to mimic the juju add-unit / juju remove-unit behavior through terraform.

hmlanigan commented 1 month ago

This is a request for a feature change, not a bug report. The application resource schema defines that any change to placement requires replacement of the application. Changes in this space could break previously written plans.

The placement attribute of an application resource is problematic as it allows a user to define what they'd like all or part of the machine placement of units to look like. It mimics placement on the juju command line. This is not easily compatible with how terraform works.

In the example above it appears clear cut: add a unit to the newly specified machine 4. More complex placement directives cause issues. Consider a placement directive of lxd, lxd and a unit count of 3. In a clean model, this could produce machines 0/lxd/0, 1/lxd/0, 2. If you remove one lxd from the placement directive, which container gets removed?
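
For reference, a minimal sketch of the kind of plan being described (the values and resource names are illustrative, reusing the model and charm from the configuration above):

```hcl
# Two container directives plus one bare unit: in a clean model Juju could
# satisfy this as 0/lxd/0, 1/lxd/0 and machine 2.
resource "juju_application" "example" {
  name  = "example"
  model = juju_model.test-ubuntu.name

  charm {
    name = "ubuntu"
  }

  units     = 3
  placement = "lxd,lxd"

  # Dropping one "lxd" from the directive gives the provider no way to know
  # which of the two containers should be removed.
}
```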

Placement is something we need to reconsider as it causes considerable angst and a few bug reports. A caveat: by definition, the terraform provider will never behave exactly as the juju cli in all situations. Terraform is IaC, whereas juju is not.

One potential solution is to deprecate placement and add an optional machine attribute. Placement and machine would be mutually exclusive in a plan. The values in the set of machines would have to be existing machine ids, and if the number of machines specified did not equal the number of units specified, that would cause an error. We would also have to fix #506 for containers to work, and this could resolve #443 at the same time.
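
If that direction were taken, a plan might look roughly like the sketch below. The machines attribute shown here is the proposed, hypothetical attribute and does not exist in the provider today:

```hcl
# Hypothetical sketch only: "machines" is the proposed attribute, not part of
# the current schema. It would be mutually exclusive with "placement" and
# would have to reference existing machines, one per unit.
resource "juju_application" "ubuntu-bm" {
  name  = "ubuntu-bm"
  model = juju_model.test-ubuntu.name

  charm {
    name    = "ubuntu"
    channel = "latest/stable"
    base    = "ubuntu@22.04"
  }

  units    = 4
  machines = [for m in juju_machine.all_machines : m.machine_id]
}
```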

A potential workaround for remove-unit comes to mind, but not one for add-unit to a specific machine at this time.

baycarbone commented 1 month ago

Thanks for the feedback @hmlanigan. Do you see an alternative to using static placement in the scenario where we have different types of machines and applications that need to land on the right machine (either on the machine itself or in an LXD container)?

Would application constraints be applicable for this (e.g. a tag constraint and a container constraint)? Would application constraints guarantee that multiple units of the same application are spread over different machines, e.g. not have all 3 LXD-based units hosted on the same machine?
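
For context, the constraints-based approach being asked about might look like the sketch below (the tag name is illustrative); whether it can also express "in an LXD container" and spread units across machines is the open question:

```hcl
# Sketch: target MAAS-tagged machines via application constraints instead of
# static placement. The tag name is only an example.
resource "juju_application" "ubuntu-bm" {
  name  = "ubuntu-bm"
  model = juju_model.test-ubuntu.name

  charm {
    name    = "ubuntu"
    channel = "latest/stable"
    base    = "ubuntu@22.04"
  }

  units       = 3
  constraints = "tags=controller"
}
```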

Regarding the placement directive, I'm not sure I follow the example you gave. When the provider needs to assess if a change needs to happen regarding placement and what that change is, could you not compare sorted lists between the current state and the desired state? (This is probably too naive and I'm probably missing something here ;)).

hmlanigan commented 1 month ago

The way the schema is written today, I do not see a way to add a unit to a specific machine. If you wish to deploy to existing machines and containers, then you can write a plan to do that. It's the scaling of an existing application that causes issues with explicit placement today.

Tags work with MAAS only, and don't work with lxd containers on MAAS machines with juju.

Static placement is better when deploying; however, it'll cause issues when adding a unit, as a change to placement will destroy and recreate the application. If you added a unit just by changing the number of units, juju would put the unit on the first available machine without a unit; if one wasn't available, it'd create a new machine for the unit.
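
In other words, with no placement set, a units-only change is an in-place scale and Juju chooses (or creates) the machines itself; a minimal sketch:

```hcl
# Sketch: omit "placement" entirely and express scale purely through "units".
# Changing units is then an in-place update; Juju decides where the new
# unit lands.
resource "juju_application" "ubuntu-bm" {
  name  = "ubuntu-bm"
  model = juju_model.test-ubuntu.name

  charm {
    name = "ubuntu"
  }

  units = 4 # was 3
}
```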

There isn't a constraint to say "place units in a container", only one to define what type of container to use.

By default, units will be spread over different machines in different availability zones. Specifying lxd says to create a new machine (or use an existing machine without units) and put a new container on it. Specifying lxd:X says to create a new container on machine X.
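
As a concrete illustration of those two directive forms (machine numbers are examples only, and note the earlier caveat about #506 for containers through the provider):

```hcl
# "lxd"   -> new container on a machine Juju selects or creates.
# "lxd:0" -> new container on existing machine 0.
resource "juju_application" "containerised" {
  name  = "containerised"
  model = juju_model.test-ubuntu.name

  charm {
    name = "ubuntu"
  }

  units     = 2
  placement = "lxd:0,lxd:1"
}
```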

> Regarding the placement directive, I'm not sure I follow the example you gave. When the provider needs to assess if a change needs to happen regarding placement and what that change is, could you not compare sorted lists between the current state and the desired state? (This is probably too naive and I'm probably missing something here ;)).

It depends on what is in the placement list. If the placement list required actual existing machines, then yes, you are correct. However, placement directives do not have to be actual existing machines. That's where it becomes a problem, as the provider can't be sure which one you're referring to. If you have a placement directive of lxd, lxd, lxd and change it to lxd, lxd to remove a unit, how can the provider determine which actual unit you wish to remove?

We can chat in the juju/terraform Matrix room if you have more questions on that topic.

Per the placement doc, you can also specify system-id=<system ID> for MAAS; however, I do not believe that will work for a container on a MAAS machine.