terraform-lxd / terraform-provider-lxd

LXD Resource provider for Terraform
https://registry.terraform.io/providers/terraform-lxd/lxd/latest/docs
Mozilla Public License 2.0

Failed VM creation is left behind but also not tracked #452

Open lathiat opened 3 months ago

lathiat commented 3 months ago

If you create a VM and the creation fails, the VM is left behind in LXD but Terraform does not know about it. This means re-running Terraform with a fixed configuration fails, because a VM with that name already exists.

A simple example is trying to create a VM with a bad limits.memory size like "8G" or "8192". The exact expected format wasn't documented, which made this mistake easier to make :)

If you make a similar mistake with the OpenStack provider, either the VM wouldn't exist or the provider would remember it and know it needed to destroy it first.

Steps to reproduce

  1. Create the terraform config

    terraform {
      required_providers {
        lxd = {
          source = "terraform-lxd/lxd"
        }
      }
    }

    provider "lxd" { }

    resource "lxd_instance" "vm1" {
      name  = "vm1"
      type  = "virtual-machine"
      image = "ubuntu:22.04"

      limits = {
        cpu    = 2
        memory = "8G"
      }

      device {
        name = "root"
        type = "disk"
        properties = {
          path = "/"
          pool = "optane"
        }
      }
    }


2. Run `terraform apply`

    lxd_instance.vm1: Creating...
    ╷
    │ Error: Failed to start instance "vm1"
    │
    │   with lxd_instance.vm1,
    │   on lvs.tf line 24, in resource "lxd_instance" "vm1":
    │   24: resource "lxd_instance" "vm1" {
    │
    │ Failed to run: forklimits limit=memlock:unlimited:unlimited fd=3 fd=4 -- /snap/lxd/27049/bin/qemu-system-x86_64 -S -name vm1 -uuid abee7dd2-f8d4-43e4-be7d-69a9082d9cbc -daemonize -cpu host,hv_passthrough -nographic -serial chardev:console -nodefaults -no-user-config
    │ -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=allow,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/c00381473_vm1/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/c00381473_vm1/qemu.spice -pidfile
    │ /var/snap/lxd/common/lxd/logs/c00381473_vm1/qemu.pid -D /var/snap/lxd/common/lxd/logs/c00381473_vm1/qemu.log -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd: : exit status 1
    ╵


3. Modify the terraform config to correctly specify "8GiB" instead of "8G"

4. Run `terraform apply` again

    lxd_instance.vm1: Creating...
    ╷
    │ Error: Failed to create instance "vm1"
    │
    │   with lxd_instance.vm1,
    │   on lvs.tf line 24, in resource "lxd_instance" "vm1":
    │   24: resource "lxd_instance" "vm1" {
    │
    │ Failed instance creation: Failed creating instance record: Add instance info to the database: This "instances" entry already exists
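
For reference, the change described in step 3 can be sketched as follows. Only the `memory` value in the `limits` map differs from step 1; the "8GiB" value is taken from step 3, and everything else is copied from the original config.

```hcl
resource "lxd_instance" "vm1" {
  name  = "vm1"
  type  = "virtual-machine"
  image = "ubuntu:22.04"

  limits = {
    cpu    = 2
    # Binary-unit suffix as given in step 3; "8G" and "8192" were the
    # values reported as bad in this issue.
    memory = "8GiB"
  }

  # Root disk device, unchanged from step 1.
  device {
    name = "root"
    type = "disk"
    properties = {
      path = "/"
      pool = "optane"
    }
  }
}
```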

MusicDin commented 3 months ago

Thanks for reporting this issue.

Currently, the state is partially updated only once the instance has been successfully created. This should be done sooner, so that Terraform is aware of a failed instance and will replace it on the next apply.

MusicDin commented 2 months ago

Hi @lathiat, we could not reproduce the issue with LXD 5.20 (revision 27049 - from your log), 5.21 and 5.0.3.

The error we get is:

╷
│ Error: Failed to create instance "vm1"
│ 
│   with lxd_instance.vm1,
│   on main.tf line 12, in resource "lxd_instance" "vm1":
│   12: resource "lxd_instance" "vm1" {
│ 
│ Failed instance creation: Failed creating instance record: Invalid value: 2G

While we could not reproduce this specific error, it is true that if Terraform fails to create the instance (and the failed instance remains in LXD), the instance is not tracked in Terraform state.

MusicDin commented 2 months ago

We were able to partially reproduce the issue by setting the CPU count to a number higher than the number of CPUs available on the host.

resource "lxd_instance" "instance" {
  name      = "inst"
  image     = "ubuntu-daily:22.04"
  type      = "virtual-machine"

  limits = {
    "cpu" = 512
  }
}

$ lxc ls
+------+---------+------+------+-----------------+-----------+
| NAME |  STATE  | IPV4 | IPV6 |      TYPE       | SNAPSHOTS |
+------+---------+------+------+-----------------+-----------+
| inst | STOPPED |      |      | VIRTUAL-MACHINE | 0         |
+------+---------+------+------+-----------------+-----------+

Reapplying the configuration shows that the instance will be recreated. The partially inserted instance can also be observed in terraform.tfstate.

$ terraform apply

...
  # lxd_instance.instance is tainted, so must be replaced
-/+ resource "lxd_instance" "instance" {
      + image            = "ubuntu-daily:22.04"
      + ipv4_address     = (known after apply)
      + ipv6_address     = (known after apply)
      + mac_address      = (known after apply)
        name             = "inst"
      ~ running          = false -> true
      ~ status           = "Stopped" -> (known after apply)
      + target           = (known after apply)
        # (6 unchanged attributes hidden)
    }

The instance's state is partially updated only once the instance has been successfully created. If the instance is created but then fails to start (which is the case with an invalid CPU count), the state has already been updated, and the instance is recreated on the next run.

If the instance fails to be created, its state would indeed not be updated and Terraform would not be aware of it. However, in such a case, LXD should take care of cleaning up the instance.

lathiat commented 1 month ago

I agree I can no longer reproduce the original issue, although I have updated terraform-lxd since then. However, I can reproduce the issue with CPU cores. I guess that highlights the issue generally.

So I guess the problem here is that currently a VM can be created, fail to start, and not be recorded as created. I guess we can solve that as a first pass.

As a note, I have a similar problem with MAAS. We haven't yet determined whether the issue was with the way LXD did or didn't clean it up, or with the way MAAS was handling it, but it may be of interest: https://bugs.launchpad.net/maas/+bug/2055252

MusicDin commented 1 month ago

In the CPU case the instance gets inserted into the Terraform state, so Terraform recreates it on reapply (or removes it on destroy).

If I understand correctly, in your case the instance remains hanging in LXD and Terraform is not aware of it (causing an "instance already exists" error on reapply)? Unfortunately, I cannot reproduce this issue.

Could you please share the version of LXD Terraform provider that is being used?
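
For the situation asked about above, where an instance is left in LXD but is missing from Terraform state, one possible workaround is to adopt it into state before reapplying. This is only a sketch under assumptions not confirmed in this thread: it needs Terraform 1.5 or later for `import` blocks, and it assumes the provider's `lxd_instance` import accepts the bare instance name as the ID, which should be checked against the provider documentation for the version in use.

```hcl
# Assumption: Terraform >= 1.5 (import blocks), and an lxd_instance import ID
# that is just the instance name on the default remote and project.
import {
  to = lxd_instance.vm1
  id = "vm1"
}
```

If the import succeeds, the next `terraform apply` should be able to plan an update or replacement of the existing instance rather than failing on the name clash.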