thomasklein94 / packer-plugin-libvirt

Packer Plugin for Libvirt

Building golden images for terraform + libvirt consumption #61

Open MattSnow-amd opened 10 months ago

MattSnow-amd commented 10 months ago

I have what I believe is a simple use case but am struggling to figure out a solution. I will try to explain.

context: I have an Ubuntu machine set up as a hypervisor. I use this ansible playbook as the basis.

I have this repo of terraform code that provisions several instances of Ubuntu 22.04 from a cloud-init image source (see osimage.tf) and customizes them with a cloud-init config.

intended use: I want to create a CI pipeline for building golden images that have some basic configuration like packages, CA certs, and users. I would like to make this process an intermediate step that generates "golden" images that are then consumed by the terraform code.

Gist of packer build HCL: https://gist.github.com/MattSnow-amd/3b36f82364fe6105ac52cc7a68dc3812

I have tried a variety of combinations of manually deleting files generated by the packer build process and running cloud-init clean commands from the documentation.

Problem: The terraform-created VM/domain boots up, but none of the cloud-init configurations are applied and the network is not configured. I am able to communicate between virsh and the VM's qemu-guest-agent via virsh domifaddr --domain mymachinename.example.com --source agent. Sample output from virsh domifaddr:

 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 lo         00:00:00:00:00:00    ipv4         127.0.0.1/8
 -          -                    ipv6         ::1/128
 enp1s0     52:54:00:04:1d:01    N/A          N/A

I can also virsh console into the running domain and confirm that the cdrom at /dev/sr0 is presented in the domain, and the cidata image can be mounted and contains all of the terraform templated values in the user-data file.
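
For reference, the check inside the guest looks roughly like this (/dev/sr0 is simply where the cidata ISO shows up in my case):

sudo mount -o ro /dev/sr0 /mnt   # mount the cloud-init seed (cidata) ISO read-only
cat /mnt/user-data               # shows the terraform templated values
cat /mnt/meta-data
sudo umount /mnt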

Any guidance or pointers are much appreciated. Thank you for your effort in writing such a useful tool!

thomasklein94 commented 10 months ago

Hi,

If I understand you right, your problem is not with this packer plugin, but with the image and the infrastructure where that image is intended to be used later. You were able to run the linked packer build, and it ran the provisioning steps and produced an image. Then later, you tried to use that image with (what I guess was) dmacvicar's libvirt terraform provider, only to see that it failed to run the provided cloud-init script.

I would suggest closely monitoring the booting process of the image (boot with nosplash, without quiet, etc.) to see if you can spot any log/message related to cloud-init. You should at least be able to see the cloud-init related systemd units starting. Also, cloud-init should create log files like /var/log/cloud-init-output.log. The logging configuration should reside in /etc/cloud/cloud.cfg.d/05_logging.cfg (or something similar).

You can also try to run cloud-init manually (with cloud-init init -d or something similar, check the docs).
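
Something along these lines is usually enough to see whether cloud-init ran at all (the unit names and log paths below are the Ubuntu defaults, adjust as needed):

cloud-init status --long                                      # did it run, and which datasource did it pick up?
systemctl status cloud-init-local.service cloud-init.service  # the early and network stages
sudo less /var/log/cloud-init.log                             # detailed per-module log
sudo less /var/log/cloud-init-output.log                      # stdout/stderr of the config modules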

Also, to narrow things down, can you run your cloud-init script on the original cloud image from Canonical? If that image also fails to run your cloud-init script, then the issue is probably with your infra environment / cloud-init script, and not with this packer plugin and config.
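
One quick way to do that outside of terraform, assuming cloud-image-utils and virt-install are available (file and domain names below are just placeholders):

wget https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img
cloud-localds seed.iso user-data meta-data   # build a NoCloud seed ISO from your cloud-init files
virt-install --name cloudinit-test --memory 2048 --vcpus 2 \
  --disk jammy-server-cloudimg-amd64.img,device=disk,bus=virtio \
  --disk seed.iso,device=cdrom \
  --os-variant ubuntu22.04 --import --network network=default --noautoconsole

If cloud-init runs fine there, the image itself is probably good and the difference is in how the domain is defined.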

Another idea would be to make sure that you are building the image with a VM config similar to the environment it is intended for later. For example, make sure that you are running on the same chipset and that both systems run either BIOS or UEFI.

I should probably also mention that I have a very vague memory of something similar happening to me previously, when for some mysterious reason the image I created wouldn't boot when the VM was created with terraform. If I recall correctly, it had something to do with the libvirt terraform plugin messing up bus types and addresses, which for some unexplainable reason prevented the system from booting. I don't remember much, except that it was really annoying to debug.

I can share some snippets from my terraform and packer configs that might inspire you on your debugging journey.

This is the last provisioning step for my builds:

build {
  // ...
  provisioner "shell" {
    inline = [ 
      "echo 'Cleaning up cloudinit'",
      "sudo cloud-init clean --logs",
      "", 
      "truncate -s 0 ~/.ssh/authorized_keys",
    ]   
  }
}

I have a separate terraform module for managing "compute" nodes and another for IPAM. Here is the cloud-init part.

resource "libvirt_cloudinit_disk" "this" {
  name = "${var.id}-${var.name}-cloudinit"
  pool = local.root_storage_pool

  meta_data      = local.meta_data
  network_config = local.network_config
  user_data      = var.user_data
}

locals {
  default_network_config = {
    version = 2
    ethernets = {
      eth = {
        match = {
          macaddress = macaddress.this.address
        }
        "set-name" = "eth"
        addresses = [
          "${module.ipam.ip_address}/${module.ipam.cidr}"
        ]
        gateway4 = module.ipam.gateway
        nameservers = {
          search = module.ipam.search_domains
          addresses = [
            module.ipam.nameserver
          ]
        }
      }
    }
  }

  default_meta_data = <<EOM
instance-id: ${var.id}-${var.name}
local-hostname: ${var.name}
EOM

  meta_data      = var.meta_data != null ? var.meta_data : local.default_meta_data
  network_config = var.network_config != null ? var.network_config : jsonencode(local.default_network_config)
}

To generate the cloud-init file, I use data "template_cloudinit_config".

data "template_cloudinit_config" "vm" {
  gzip          = false
  base64_encode = false

  part {
    filename     = "init.cfg"
    content_type = "text/cloud-config"
    content = yamlencode({
      ssh_authorized_keys = local.ssh_authorized_keys
      users = [ 
        {
          name                = "terraform"
          groups              = ["sudo"]
          shell               = "/bin/bash"
          hashed_passwd       = random_password.vm.bcrypt_hash
          lock_passwd         = false
          ssh_authorized_keys = concat(local.ssh_authorized_keys, [
            tls_private_key.mgmt.public_key_openssh,
            tls_private_key.terraform.public_key_openssh
          ])
        }
      ]   
      packages = [ 
        "python3",
        "python3-pip",
        "python3-wheel",
        "python3-virtualenv",
        "python3-netaddr",

        "git",
        "ipvsadm",
      ]
    })
  }
}

module "vm" {
  // ...
  user_data = data.template_cloudinit_config.vm.rendered
}

And here is my XSLT causing perpetual diffs but making my nodes as I wanted them to be:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes"/>
    <xsl:strip-space elements="*" />

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- Making the worker headless -->
    <xsl:template match="/domain/devices/graphics" />
    <xsl:template match="/domain/devices/video" />
    <xsl:template match="/domain/devices/audio" />
    <xsl:template match="/domain/devices/input[@type='mouse' or @type='keyboard']" />

    <!-- SEE https://github.com/dmacvicar/terraform-provider-libvirt/issues/667 -->
    <!-- Thanks dariush, https://gist.github.com/dariush/7405cbf62835e03d0b5c953d798a87cd -->
    <!-- replace <target dev='hdd'...> with <target dev='sdd'...>  -->
    <xsl:template match="/domain/devices/disk[@device='cdrom']/target/@dev">
        <xsl:attribute name="dev">
            <xsl:value-of select="'sdd'"/>
        </xsl:attribute>
    </xsl:template>

    <!-- replace <target bus='ide'...> with <target bus='sata'...>  -->
    <xsl:template match="/domain/devices/disk[@device='cdrom']/target/@bus">
        <xsl:attribute name="bus">
            <xsl:value-of select="'sata'"/>
        </xsl:attribute>
    </xsl:template>

    <!-- replace <target bus='ide'...> with <target bus='sata'...>  -->
    <xsl:template match="/domain/devices/disk[@device='disk' and target/@bus='scsi']">
        <xsl:copy>
            <xsl:apply-templates select="@*|*[not(self::wwn) and not(self::target)]"/>
            <target bus="sata">
                <xsl:attribute name="dev"><xsl:value-of select="target/@dev" /></xsl:attribute>
            </target> 
        </xsl:copy>
    </xsl:template>

    <!-- replace <alias...> with nothing ie delete the <alias...> element  -->
    <xsl:template match="/domain/devices/disk[@device='cdrom']/alias" />
</xsl:stylesheet>

I'm using the 0.7.1 version of dmacvicar's libvirt plugin. Also, all my machines are now provisioned on q35 machine type and UEFI.

Hope this helps. Let us know if and how you managed to figure out your issue.

MattSnow-amd commented 10 months ago

Hi,

If I understand you right, your problem is not with this packer plugin, but with the image and the infrastructure where that image is intended to be used later. You were able to run the linked packer build, and it ran the provisioning steps and produced an image. Then later, you tried to use that image with (what I guess was) dmacvicar's libvirt terraform provider, only to see that it failed to run the provided cloud-init script.

Correct. I am able to successfully build an image using packer with the libvirt builder and a cloud-init image. I am also able to successfully build domains from 'fresh' cloud-init images using dmacvicar's libvirt terraform provider.

In both of these cases, starting with an unmodified cloud image (so far I have only tried Ubuntu-22.04) I am able to successfully apply cloud-init configurations.

I would suggest closely monitoring the booting process of the image (boot with nosplash, without quiet, etc.) to see if you can spot any log/message related to cloud-init. You should at least be able to see the cloud-init related systemd units starting. Also, cloud-init should create log files like /var/log/cloud-init-output.log. The logging configuration should reside in /etc/cloud/cloud.cfg.d/05_logging.cfg (or something similar).

I have not modified the grub boot options as suggested, but I am able to monitor the console of both the terraform-apply-built domain and the packer build by running virsh console <domain>. Again, starting from an unbooted cloud-init image, I am able to see cloud-init start and run to completion in both packer- and terraform-built domains. As soon as I try to pass the packer-built image as a source to terraform, I no longer see cloud-init starting and running, even with the various cloud-init reset commands (sudo cloud-init clean [--logs|--seed|--machine-id], DI_LOG=stderr /usr/lib/cloud-init/ds-identify --force, and systemctl enable cloud-init[-local|-config|-final]).

You can also try to run cloud-init manually (with cloud-init init -d or something similar, check the docs).

Also, to narrow things down, can you run your cloud-init script on the original cloud image from Canonical? If that image also fails to run your cloud-init script, then the issue is probably with your infra environment / cloud-init script, and not with this packer plugin and config.

Great point! I had tried this already but was trying to keep my problem statement a bit too condensed. As mentioned above, I can successfully apply cloud-init configs in either packer or terraform, but not when the packer-built image is passed into terraform.

Another idea would be to make sure that you are building the image with a VM config similar to the environment it is intended for later. For example, make sure that you are running on the same chipset and that both systems run either BIOS or UEFI.

I should probably also mention that I have a very vague memory of something similar happening to me previously, when for some mysterious reason the image I created wouldn't boot when the VM was created with terraform. If I recall correctly, it had something to do with the libvirt terraform plugin messing up bus types and addresses, which for some unexplainable reason prevented the system from booting. I don't remember much, except that it was really annoying to debug.

I can share some snippets from my terraform and packer configs that might inspire you on your debugging journey.

snip

And here is my XSLT causing perpetual diffs but making my nodes as I wanted them to be:


<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes"/>
    <xsl:strip-space elements="*" />

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- Making the worker headless -->
    <xsl:template match="/domain/devices/graphics" />
    <xsl:template match="/domain/devices/video" />
    <xsl:template match="/domain/devices/audio" />
    <xsl:template match="/domain/devices/input[@type='mouse' or @type='keyboard']" />

    <!-- SEE https://github.com/dmacvicar/terraform-provider-libvirt/issues/667 -->
    <!-- Thanks dariush, https://gist.github.com/dariush/7405cbf62835e03d0b5c953d798a87cd -->
    <!-- replace <target dev='hdd'...> with <target dev='sdd'...>  -->
    <xsl:template match="/domain/devices/disk[@device='cdrom']/target/@dev">
        <xsl:attribute name="dev">
            <xsl:value-of select="'sdd'"/>
        </xsl:attribute>
    </xsl:template>

This section right here caught my attention.

I compared optical drive sections of both packer+packer-plugin-libvirt and terraform+terraform-libvirt-provider domains by running virsh dumpxml --domain [domain] from both libvirt instances.
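
A quick way to pull out just the cdrom disk element, assuming xmllint (libxml2-utils) is installed:

virsh dumpxml --domain <domain> | xmllint --xpath "//devices/disk[@device='cdrom']" -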

packer+packer-plugin-libvirt generated XML

    <disk type='volume' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source pool='packer' volume='ubuntu-2204-focal_PUBLIC-cloudinit' index='1'/>
      <backingStore/>
      <target dev='sdb' bus='sata'/>
      <readonly/>
      <alias name='sata0-0-1'/>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>

terraform+terraform-libvirt-provider generated XML

I am guessing you may already be aware, but I feel it is worth pointing out: the terraform-provider-libvirt code hard-codes the device type (cdrom), target bus (ide), and dev (hdd).

    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/scratch/libvirt/terraform/pool/commoninit.my-tf-hostname01.example.com.iso' index='1'/>
      <backingStore/>
      <target dev='hdd' bus='sata'/>
      <readonly/>
      <serial>cloudinit</serial>
      <alias name='sata0-0-3'/>
      <address type='drive' controller='0' bus='0' target='0' unit='3'/>
    </disk>

As a test, I did the following:

1) Built the packer image and ran the cloud-init clean commands to reset the cloud-init state.
2) Exported the packer image to the staging directory for terraform to pick up.
3) Ran terraform apply using the packer-built image.
4) Quickly destroyed the domain (virsh destroy --domain my-tf-hostname01.example.com) while the kernel was still loading, after terraform saw it as successful but before systemd had started running.
5) Modified the terraform-built domain (virsh edit --domain my-tf-hostname01.example.com) and changed the target attributes dev and bus to match the packer domain; specifically, the dev value is changed to sdb and nothing else is changed.
6) Started the domain.
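
In command form, steps 4 through 6 were roughly:

virsh destroy --domain my-tf-hostname01.example.com
virsh edit --domain my-tf-hostname01.example.com    # change the cdrom <target dev='hdd' .../> to dev='sdb'
virsh start --domain my-tf-hostname01.example.com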

The result: the terraform-deployed domain starts up and runs cloud-init successfully, as expected. I have run through this a couple of times now and can confirm this process produces the desired result. However, if I let the terraform-created domain continue with startup and systemd runs, any future modification to the domain XML definition will not enable cloud-init to run without further intervention.
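
I assume this is because cloud-init keeps its per-instance state under /var/lib/cloud once it has run; in that case, something along these lines run inside the guest re-arms it for the next boot:

sudo cloud-init clean --logs   # remove the per-instance state and logs so cloud-init runs again on next boot
sudo reboot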

<!-- replace <target bus='ide'...> with <target bus='sata'...>  -->
<xsl:template match="/domain/devices/disk[@device='cdrom']/target/@bus">
    <xsl:attribute name="bus">
        <xsl:value-of select="'sata'"/>
    </xsl:attribute>
</xsl:template>

<!-- replace <target bus='ide'...> with <target bus='sata'...>  -->
<xsl:template match="/domain/devices/disk[@device='disk' and target/@bus='scsi']">
    <xsl:copy>
        <xsl:apply-templates select="@*|*[not(self::wwn) and not(self::target)]"/>
        <target bus="sata">
            <xsl:attribute name="dev"><xsl:value-of select="target/@dev" /></xsl:attribute>
        </target> 
    </xsl:copy>
</xsl:template>

<!-- replace <alias...> with nothing ie delete the <alias...> element  -->
<xsl:template match="/domain/devices/disk[@device='cdrom']/alias" />

</xsl:stylesheet>

I'm using the `0.7.1` version of dmacvicar's libvirt plugin. Also, all my machines are now provisioned on `q35` machine type and UEFI.

I am using the same version as well. Same machine type on the terraform end. It seems my packer domain is starting with the `pc-i440fx-focal` machine type. I don't believe UEFI is enabled anywhere in my environment yet.
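
For anyone checking the same thing, the machine type and firmware can be read straight from the domain XML on the host, e.g.:

# the <type ...> element carries machine='...'; a <loader ...> element indicates UEFI firmware
virsh dumpxml --domain my-tf-hostname01.example.com | grep -E "machine=|loader"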

Hope this helps. Let us know if and how you managed to figure out your issue.

This was extremely helpful and I appreciate all the support very much!