hashicorp / packer-plugin-vsphere

Packer plugin for VMware vSphere Builder
https://www.packer.io/docs/builders/vsphere
Mozilla Public License 2.0

IP address gets changed after reboot (not always) when building using vsphere - no route to host ssh #425

Closed LeifErik1995 closed 1 month ago

LeifErik1995 commented 2 months ago

Hi folks,

I am experiencing an issue where the IP address assigned during the build is no longer recognized after the reboot when building Ubuntu 22.04 on an ESXi 7.0 host. After the reboot, the IP is no longer assigned to the VM.

source "vsphere-iso" "ubuntu" {

  vcenter_server        = var.vsphere_server
  host                  = var.vsphere_host
  username              = var.vsphere_username
  password              = var.vsphere_password
  insecure_connection   = "true"
  datacenter            = var.vsphere_datacenter
  datastore             = var.vsphere_datastore

  CPUs                  = var.cpu_num
  RAM                   = var.mem_size
  RAM_reserve_all       = true
  disk_controller_type  = ["pvscsi"]
  guest_os_type         = "ubuntu64Guest"
  iso_checksum          = "45f873de9f8cb637345d6e66a583762730bbea30277ef7b32c9c3bd6700a32b2"
  iso_url               = "[datastore1 (38)] packer_cache/67a67612e798509328c9ee5490f7388971ed878b.iso"
  cd_content            = {
    "/meta-data" = file("${var.cloudinit_metadata}")
    "/user-data" = file("${var.cloudinit_userdata}")
  }
  cd_label              = "cidata"

  network_adapters {
    network             = var.vsphere_network
    network_card        = "vmxnet3"
  }

  storage {
    disk_size = var.disk_size
    disk_thin_provisioned = true
  }
  storage {
    disk_size = var.disk_size
    disk_thin_provisioned = true
  }

  vm_name               = var.vsphere_vm_name
  convert_to_template   = "true"
  communicator          = "ssh"
  ssh_username          = var.ssh_username
  ssh_password          = var.ssh_password
  ssh_timeout           = "30m"
  ssh_handshake_attempts = "100000"
  export {
    force = true
    output_directory = "./output-artifacts"
  }

  boot_wait             = "3s"
  boot_command          = var.boot_command
  shutdown_command      = "echo '${var.ssh_password}' | sudo -S -E shutdown -P now"
  shutdown_timeout      = "15m"

  configuration_parameters = {
    "disk.EnableUUID" = "true"
  }
}

My cloud-init configuration is as follows:

#cloud-config
autoinstall:
    version: 1
    early-commands:
        # Stop ssh for packer
        - sudo systemctl stop ssh
    locale: en_US
    keyboard:
        layout: us
        #variant: us
    identity:
        hostname: ubuntu-server
        username: ubuntu
        password: '$6$rounds=4096$.yfET/Pf2chKB5IS$vUcqTvUzup6eYSx.wajZD0zHTobzapEKKjw5GvXbRonLJqB08eRhB1W6Wkhq6Kp4WzC19F3uWNoocqlPgZLrv.'
    ssh:
        install-server: yes
        allow-pw: yes
    storage:
        layout:
            name: direct
    packages: [open-vm-tools, cloud-init, openssh-server, python3, sshpass, net-tools, perl, open-iscsi, ntp, curl, vim, ifupdown, zip, ufw, unzip, gnupg2, gnupg-agent, software-properties-common, apt-transport-https, ca-certificates, lsb-release, python3-pip, jq]
    user-data:
        disable_root: false
    ssh_deletekeys: false
    late-commands:
        # Prevent DHCP release message from being sent on reboot
        - iptables -I OUTPUT -p udp --dport 67 -j DROP
        - sed -i -e 's/^#\?PasswordAuthentication.*/PasswordAuthentication yes/g' /target/etc/ssh/sshd_config
        - sed -i -e 's/^#\?PermitRootLogin.*/PermitRootLogin yes/g' /target/etc/ssh/sshd_config
        - echo 'ubuntu ALL=(ALL) NOPASSWD:ALL' > /target/etc/sudoers.d/ubuntu
        - curtin in-target --target=/target -- ssh-keygen -A
        - curtin in-target --target=/target -- systemctl enable ssh
        - curtin in-target --target=/target -- systemctl start ssh
        - curtin in-target --target=/target -- chmod 440 /etc/sudoers.d/ubuntu
        - curtin in-target --target=/target -- apt-get update
        - curtin in-target --target=/target -- apt-get upgrade --yes

The boot command is as follows:

boot_command = [
    "c<wait>",
    "linux /casper/vmlinuz --- autoinstall ds=\"nocloud-net\"",
    "<enter><wait>",
    "initrd /casper/initrd",
    "<enter><wait>",
    "boot",
    "<enter>",
    "<wait30>"
    ]

I scoured through the related issues to make some headway on this, but there seems to be some issue with my config. (BTW, I am very new to Packer, so any advice is appreciated, however trivial it might be.) I am also attaching screenshots of the VM log after the reboot and of the Packer debug log at the time of the error. After I get the no route to host error, the VM ends up in the state of the final image.

Thanks

[Screenshots: after-reboot VM log and Packer debug log at the time of the error]

patschi commented 2 months ago

Interestingly, I'm running into the same issue. I'm still trying to figure out more, however, and I don't know if it is a similar issue. I seem to face this behavior only when running two Packer jobs at the same time; running one at a time seems to work. Your mention of "DHCP release" gave me a good idea to look towards DHCP...

What I noticed watching my DHCP leases: when the VM starts and boots the first time, I see this: [screenshot]

Here, the Unique ID is a very long ID.

Just shortly after the first reboot (the autoinstall/cloud-init setup), I can see: [screenshot]

I seem to only have this issue with Ubuntu 22.04, not with 24.04, when both are building at pretty much the same time...
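
For anyone who wants to check the same thing, here is a minimal way to inspect the client identifiers in the leases file. This assumes an ISC dhcpd server; the leases path may differ on other setups:

  # List recent leases with their client identifier (uid). A long uid is
  # typically a DHCP DUID (e.g. from systemd-networkd), while a short one
  # is usually just the MAC address.
  grep -E 'lease |uid |client-hostname' /var/lib/dhcp/dhcpd.leases | tail -n 30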

If I get another clever idea or find out something, I will update accordingly.

LeifErik1995 commented 2 months ago

In my case, I am hitting the issue randomly even when running a single Packer job, so it might not be similar to what you are facing, @patschi.

patschi commented 2 months ago

Is your Packer VM the only VM your DHCP server is handling in that network? I have a dedicated network for Packer to build VM templates in, so there are no other DHCP requests going on.

LeifErik1995 commented 2 months ago

Hmm, interesting. No, there is no dedicated network for Packer (the network I use in my Packer build is shared; around 500 VMs are associated with it). Maybe I will ask my build infra team for a dedicated network and try that.

patschi commented 2 months ago

Interesting. I was tailing the DHCP server logs during one template build process: [screenshot]

I think it is related to the multiple releases... When an IP is released and another client requests an IP, it might get reallocated to that different client.

There are three releases:

  1. During the first boot process before cloud-init is started
  2. When cloud-init was started and is performing the initial configuration
  3. When the VM is being shut down before being converted to the final template

Still trying to figure out more...
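
To watch the releases live, a minimal sketch — assuming the DHCP server is ISC dhcpd logging to the systemd journal (adjust the unit name to your setup):

  # Follow DHCPRELEASE / DHCPREQUEST / DHCPACK messages while the build runs.
  journalctl -u isc-dhcp-server -f | grep -E 'DHCP(RELEASE|REQUEST|ACK)'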

patschi commented 1 month ago

So, I found a workaround. As per my previous guess, the issue was indeed DHCP.

The tricky part was that after autoinstall starts, it immediately gets an IP address assigned by the DHCP server, and Packer picks up this specific address - even before any autoinstall configuration has been parsed. This happens before the network: configuration in the autoinstall definition takes effect.

In other words:

  1. When autoinstall starts, it gets an IP, e.g. 192.168.35.100.
  2. When autoinstall processes further, it reads the network: configuration, releases the prior IP, and requests a new one (possibly via a restart of dhclient). It looks like during this timeframe the same IP can get reassigned to a different client, which breaks Packer.

For context and as an example, I use the following network: configuration:

  network:
    network:
      version: 2
      ethernets:
        ens192:
          dhcp4: true
          dhcp-identifier: mac
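
(A likely explanation, not verified: dhcp-identifier: mac makes netplan/systemd-networkd send the MAC address as the DHCP client identifier instead of the default machine-specific DUID. The installer environment still identifies itself with a DUID, so the DHCP server sees two different clients across the install and the first reboot - which would fit the very long Unique ID observed in the leases above.)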

My workaround was stopping open-vm-tools in the autoinstall context so that Packer gets no chance to read the current IP (which could change, and Packer doesn't need the IP at this stage anyway):

  early-commands:
    # Ensures that Packer does not connect too soon.
    - sudo systemctl stop ssh.socket # Needed for Ubuntu 24.04, as any SSH connection restarts the SSH daemon.
    - sudo systemctl stop ssh # Stop the SSH daemon itself (for good measure).
    - sudo systemctl stop open-vm-tools # Stop the VM from reporting the current IP to vCenter, and thus to Packer. Does not affect the package in the target build.
    - sudo systemctl disable open-vm-tools # For good measure, and to prevent an auto-restart during an open-vm-tools update.

And on the Packer side, I tell Packer to wait at least 2 minutes before checking the current IP. This prevents Packer from receiving the IP address via open-vm-tools before it is stopped in the early stage:

  ip_settle_timeout = "2m" # wait to get final IP after autoinstall complete
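
For anyone wondering where this goes: ip_settle_timeout is an option of the vsphere-iso builder, so it sits in the source block next to the other connection settings. A minimal sketch based on the config from the original post:

  source "vsphere-iso" "ubuntu" {
    # ... existing builder settings from the original post ...
    ip_settle_timeout = "2m" # wait for the reported guest IP to stop changing before connecting
  }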

So ideally, Packer will only get the IP address from the VM once it has rebooted and has its final IP. This took a friend and me hours to figure out...

For me, it works now as I expected it to.

LeifErik1995 commented 1 month ago

Thanks @patschi, I will check it out. It is my first time doing such in-depth tasks on the DevOps and infra side; usually I work on the application side of things. Once I try it out, I will update the thread.

patschi commented 1 month ago

Yeah, let me know if it helps and if it is the same/similar issue. I'm very curious to know.

I have been building this for my private lab now - for the two to four times per year I need a new, fresh Ubuntu VM... Maybe the time invested will pay off in a few decades...

tenthirtyam commented 1 month ago

Try modifying the boot command for Ubuntu to include the following:

boot_command = [
  ... other commands ...
  "autoinstall network-config=disabled"
  ... other commands ...
]

This should skip the initial IP address assignment.
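
Merged into the boot command from the original post, that would look something like this (a sketch, untested; the triple dash is dropped deliberately - see the last comment in this thread for why parameters after --- get persisted into the installed system):

  boot_command = [
    "c<wait>",
    # "---" removed so these parameters apply only during installation:
    "linux /casper/vmlinuz autoinstall ds=\"nocloud-net\" network-config=disabled",
    "<enter><wait>",
    "initrd /casper/initrd",
    "<enter><wait>",
    "boot",
    "<enter>",
    "<wait30>"
  ]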

tenthirtyam commented 1 month ago

Any luck @LeifErik1995 @patschi?

LeifErik1995 commented 1 month ago

@tenthirtyam I will try this in some time; I'm currently busy with another task. Once I am somewhat light on my workload, I will check back on all the suggestions here.

patschi commented 1 month ago

Try modifying the boot command for Ubuntu to include the following:

That's a valid workaround, thanks for suggesting!

While I think this will help, it does have the following disadvantages:

  1. The Ubuntu installer won't be automatically updated over the internet during installation.
  2. The installer won't automatically update and upgrade all packages during the installation.

But obviously, the latter can be done with a provisioning script.

tenthirtyam commented 1 month ago

Thanks for confirming, @patschi.

Closing based on the workaround provided for Ubuntu's installer in https://github.com/hashicorp/packer-plugin-vsphere/issues/425#issuecomment-2106511430.

patschi commented 2 weeks ago

Well, I stumbled upon another weird issue when using roughly the same approach as suggested above: the network was also disabled on the final, created VM after the installer finished.

Interestingly, this behavior only shows up reproducibly with Ubuntu 24.04. With Ubuntu 22.04 the kernel parameters are copied too, but the network still works. It looks like there were some changes in the latest 24.04.

So in case anyone has the same behavior...

The boot commands were:

  boot_command = [
    "c<wait3s>",
    "set gfxpayload=keep<enter>",
    "linux /casper/vmlinuz --- autoinstall network-config=disabled",
    "<enter><wait>",
    "initrd /casper/initrd",
    "<enter><wait>",
    "boot",
    "<enter>"
  ]

After almost 8 hours of research and pulling my hair out, I found out that the triple dash --- essentially tells the installer to persist the kernel parameters following it in the bootloader configuration (such as /etc/default/grub) even after the install completes: https://serverfault.com/questions/1055649/how-can-i-add-a-kernel-argument-to-a-debian-preseed-file/1055838#1055838:~:text=that%20triple%20dash%20is%20the%20key%20to%20telling%20the%20installer%20that%20the%20kernel%20command%20line%20parameters%20following%20it%20are%20not%20just%20meant%20to%20work%20around%20issues%20during%20installation%2C%20but%20should%20also%20be%20persisted%20in%20bootloader%20configuration%20-%20such%20as%20%2Fetc%2Fdefault%2Fgrub.
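
Concretely, that means the installed system ends up with something like this (illustrative, not copied from an actual build):

  # /etc/default/grub on the installed system (illustrative)
  GRUB_CMDLINE_LINUX_DEFAULT="autoinstall network-config=disabled"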

I removed the --- and now all is good: the Ubuntu 24.04 deployment works offline, plus a final update when finished.
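
For clarity, the fixed kernel line (same boot_command as above, just without the triple dash) looks like:

  # Kernel line with "---" removed:
  "linux /casper/vmlinuz autoinstall network-config=disabled",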