lavabit / robox

The tools needed to robotically create/configure/provision a large number of operating systems, for a variety of hypervisors, using packer.

Booting failure on libvirt/SCSI combo (RHEL-based and others) #125

Closed: timschumi closed this issue 4 years ago

timschumi commented 4 years ago

The failing boxes are: centos7, fedora25, fedora26, fedora27, fedora28, fedora29, fedora30, fedora31, opensuse15, opensuse42, oracle7, rhel7.

As mentioned in older tickets (#8, #17, #45, #94, #96, #115), this might indeed be caused by distros discontinuing (baked-in) support for the sym53c8xx SCSI controller that libvirt usually emulates.

One possibility for solving this would be to change the default controller to SATA or VirtIO.

The other would be to amend the installation scripts and manually bake the needed SCSI drivers into the initramfs.
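
A minimal sketch of the second approach on a dracut-based guest (RHEL/CentOS/Fedora); the config file name here is illustrative:

echo 'add_drivers+=" sym53c8xx "' > /etc/dracut.conf.d/scsi-drivers.conf
dracut --force    # regenerate the initramfs with the controller driver included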

ladar commented 4 years ago

This is a pending/ongoing issue. It isn't clear what combination of packer/vagrant options will yield the best compatibility across all the different host/guest combos. Right now packer is set up to build using virtio-scsi, and most of the Vagrantfiles are set up to use scsi ... but it seems that doesn't work with certain versions of libvirt.
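
A quick way to confirm the two settings on a checkout or a downloaded box (the paths here are assumptions):

grep -R '"disk_interface"' packer/                             # builder side, expected: virtio-scsi
grep -R 'disk_bus' ~/.vagrant.d/boxes/ --include=Vagrantfile   # runtime side, expected: scsi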

I don't recall the details, so I could be mistaken, but I fear that setting packer and vagrant to virtio will cause a different set of problems... because Red Hat uses a different definition of virtio. I also fear that some of the guest OSes (BSDs, etc) will also have issues with virtio devices.

If anyone knows what's best, please let me know.

abbbi commented 4 years ago

If anyone knows what's best, please let me know.

I don't have a good answer on this, but note that it also depends on the qemu version used with packer during the build, not only on the target host system that runs the vagrant instance. Qemu dropped support for the SCSI disk type emulation some time ago:

https://github.com/qemu/qemu/commit/f778a82f0c179634892ea716970d4d35264dc841

so using packer with disk_interface='scsi' only works on quite old qemu instances. I think the best bet for systems which don't support virtio might be sata, but that's not supported by packer. Both SATA and IDE/the old SCSI types are emulated anyway, so from a performance point of view it wouldn't make a big difference to use IDE for systems that don't support the other types.

ladar commented 4 years ago

The packer config uses virtio-scsi ... and that seems to work fine during the box build process.

Most of the bundled Vagrantfiles are set to scsi, which was a change made at some point to ensure compatibility. Unfortunately, certain host/libvirt/qemu combos have emerged which map the scsi bus type to a different piece of virtual hardware, and the driver it requires isn't included in the kernel image. Hence, during the boot process, the Linux kernel doesn't find any devices it can boot from.

On these systems, changing the Vagrantfile to virtio causes the broken host/libvirt/qemu combo to be configured with a virtual device that is supported, and thus fixes the issue.

Unfortunately, I believe some guests don't support the virtio device type. And/or I'm worried that using a different bus type causes the partition map to change, and thus breaks UUID-based boot configs.
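
A quick way to check which case a guest falls into is to see how it refers to its root device; this is a generic sketch, not specific to these boxes:

grep -o 'root=[^ ]*' /proc/cmdline   # root=UUID=... survives a bus change; root=/dev/sdX does not
blkid                                # filesystem UUIDs stay stable across scsi <-> virtio renames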

The solution might be to use virtio for distros with newer kernels, while leaving the older releases/BSD systems on the current paradigm. Then the question arises whether the packer config should be switched to virtio as well, which I think is handled differently on Red Hat systems. And finally, all of the boxes would need to be rebuilt, and the results tested on different host/libvirt/qemu versions to ensure everything works.

electrofelix commented 4 years ago

@ladar I recall virtio first appearing in RHEL 4, and https://access.redhat.com/articles/2488201 appears to confirm this, so perhaps for the RHEL/CentOS/Fedora distros it would be better to default to this from the versions listed.

I've spotted that there is something that can be done around setting the controller options correctly, which looks to be what is missing if the machine was built using virtio-scsi: vagrant-libvirt/vagrant-libvirt#692. Possibly, though, all of the distro versions referenced should work using virtio by default; certainly the centos/7 libvirt box built by the distro works without issue, so it might be worth looking into what options they use, as I get the feeling this is considered an exception with newer versions of libvirt: https://libvirt.org/formatdomain.html#elementsVirtioTransitional

ladar commented 4 years ago

@electrofelix I think the issue is a feature mismatch, namely how virtio, scsi, and virtio-scsi are defined differently, and/or are unavailable, across various host systems and packer/vagrant versions.

Long story short, I think the solution is to try and migrate the configs to using virtio during the packer box build ... and assuming that works, then update the bundled Vagrantfile to use virtio. At least in theory it should largely work, but it will need a fair bit of testing. My concerns are twofold. Some of the guests, particularly the BSDs and/or older Linux distros, won't build properly using virtio (which is why I think it uses virtio-scsi in the first place), and then subsequently users might run into problems because the virtio hardware ends up being different, causing the guest to fail during boot. But I'll need to build everything to find out what is what... unless someone else can tackle the task.

Once I push the 3.0.2 build, I might try rebuilding all of the libvirt boxes using the new config, and push them as 3.0.1 so they can be tested. I'll keep you posted.

ladar commented 4 years ago

So far, switching the packer configs to virtio has caused every box to fail while waiting for an SSH connection... from A to FreeBSD. I haven't run it on a machine with a console to find out why yet.

timschumi commented 4 years ago

@ladar Doing a quick build of the Arch Linux box, it simply doesn't find the HDD, which probably means that it didn't find the correct driver.

EDIT: Should we try keeping the installation media as-is (i.e. "virtio-scsi") and only changing the disk type in the resulting Vagrantfile?

ladar commented 4 years ago

@timschumi can you keep looking into why virtio doesn't work? A likely issue is the device path may have changed because of the new driver. Like /dev/hda instead of /dev/sda ... etc.

I'm currently building all of the boxes using virtio-scsi, with a Vagrantfile that says virtio ... as you suggested. They should be ready for testing tomorrow. (Currently made it to Debian.)

timschumi commented 4 years ago

@timschumi can you keep looking into why virtio doesn't work? A likely issue is the device path may have changed because of the new driver. Like /dev/hda instead of /dev/sda ... etc.

On the Arch Linux box I built yesterday (with "virtio"/"virtio"), there was simply no block device present, neither sda nor vda (the scripts do have compatibility for both). I can only assume that the installer ships a limited subset of drivers, and the full set is installed afterwards.

I'll look further into it though.
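
For reference, the checks described above amount to roughly the following, run from the installer environment (a paraphrase, not the exact commands used):

lsmod | grep virtio               # see whether virtio_pci/virtio_blk loaded automatically
modprobe virtio_blk               # the manual load attempt mentioned above
ls /dev/vd* /dev/sd* 2>/dev/null  # in this case, no block devices ever appeared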

I'm currently building all of the boxes using virtio-scsi, with a Vagrantfile that says virtio ... as you suggested. They should be ready for testing tomorrow. (Currently made it to Debian.)

Sounds good.

I quickly built the Fedora 32 box before suggesting that. Building the box went fine, and it boots as well. Maybe virtio-scsi is close enough to the full virtio stack that it causes mkinitcpio/update-initramfs to include all the necessary drivers?

abbbi commented 4 years ago

@timschumi can you keep looking into why virtio doesn't work? A likely issue is the device path may have changed because of the new driver. Like /dev/hda instead of /dev/sda ... etc.

It's likely the cause of the issue. "virtio" creates /dev/vdX-named devices, while virtio-scsi uses the SCSI naming scheme, /dev/sdX.

ladar commented 4 years ago

@abbbi @timschumi I imagine with some of the guests, it's simply a matter of not handling the renamed device, while with others, it is probably the virtio kernel module not being included. The former are easy to fix once identified, while the latter are not. That's why it's taken so long to tackle this problem, and why help is needed.

A collection of libvirt boxes carrying version 3.0.1 is being uploaded now, with bundled Vagrantfiles specifying virtio instead of scsi. Once that's done, I'll download them and test vagrant up on all of them using CentOS 7, but I could use help testing them on other hosts like Ubuntu/Arch, etc., which have far newer libvirt/qemu versions.

ladar commented 4 years ago

In case it wasn't obvious, the 3.0.1 boxes aren't being marked as released, so in order to download and test them, you'll need to manually specify the box version when running vagrant.
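
For example, to pin the unreleased build (the box name here is illustrative):

vagrant init generic/centos7 --box-version 3.0.1
vagrant up --provider libvirt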

timschumi commented 4 years ago

I can tackle Arch (or rather, Manjaro, but same thing) starting tomorrow. From the ones I built locally it already looks promising though.

ladar commented 4 years ago

I had several failures. Mostly BSD, which was partially expected. I haven't investigated any of them yet.

The generic alpine35 libvirt test run failed! Exiting. { !! }
The generic alpine36 libvirt test run passed!
The generic alpine37 libvirt test run passed!
The generic alpine38 libvirt test run passed!
The generic alpine39 libvirt test run passed!
The generic alpine310 libvirt test run passed!
The generic alpine311 libvirt test run passed!
The generic arch libvirt test run passed!
The generic centos6 libvirt test run passed!
The generic centos7 libvirt test run passed!
The generic centos8 libvirt test run passed!
The generic debian8 libvirt test run passed!
The generic debian9 libvirt test run failed! Exiting. { !! }
The generic debian10 libvirt test run passed!
The generic dragonflybsd5 libvirt test run failed! Exiting. { !! }
The generic fedora25 libvirt test run passed!
The generic fedora26 libvirt test run passed!
The generic fedora27 libvirt test run passed!
The generic fedora28 libvirt test run passed!
The generic fedora29 libvirt test run passed!
The generic fedora30 libvirt test run passed!
The generic fedora31 libvirt test run passed!
The generic fedora32 libvirt test run passed!
The generic freebsd11 libvirt test run failed! Exiting. { !! }
The generic freebsd12 libvirt test run failed! Exiting. { !! }
The generic gentoo libvirt test run failed! Exiting. { !! }
The generic hardenedbsd11 libvirt test run failed! Exiting. { !! }
The generic hardenedbsd12 libvirt test run failed! Exiting. { !! }
The generic netbsd8 libvirt test run failed! Exiting. { !! }
The generic openbsd6 libvirt test run passed!
The generic opensuse15 libvirt test run failed! Exiting. { !! }
The generic opensuse42 libvirt test run failed! Exiting. { !! }
The generic oracle7 libvirt test run passed!
The generic oracle8 libvirt test run passed!
The generic rhel6 libvirt test run passed!
The generic rhel7 libvirt test run passed!
The generic rhel8 libvirt test run passed!
The generic ubuntu1604 libvirt test run passed!
The generic ubuntu1610 libvirt test run passed!
The generic ubuntu1704 libvirt test run passed!
The generic ubuntu1710 libvirt test run passed!
The generic ubuntu1804 libvirt test run passed!
The generic ubuntu1810 libvirt test run passed!
The generic ubuntu1904 libvirt test run passed!
The generic ubuntu1910 libvirt test run passed!
The generic ubuntu2004 libvirt test run passed!

@timschumi I can post my test script, if interested.
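
In the meantime, a minimal sketch of what such a loop might look like; this is not the actual script, and the box list/version are illustrative:

#!/bin/bash
for BOX in alpine35 centos7 debian9 freebsd12; do
  mkdir -p "test-$BOX" && cd "test-$BOX"
  vagrant init -m "generic/$BOX" --box-version 3.0.1 > /dev/null
  if vagrant up --provider libvirt > /dev/null 2>&1; then
    echo "The generic $BOX libvirt test run passed!"
  else
    echo "The generic $BOX libvirt test run failed! Exiting. { !! }"
  fi
  vagrant destroy -f > /dev/null 2>&1
  cd ..
done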

timschumi commented 4 years ago

I'm getting the same results as you, except for alpine35:

Test for alpine310 successful
Test for alpine311 successful
Test for alpine35 successful
Test for alpine36 successful
Test for alpine37 successful
Test for alpine38 successful
Test for alpine39 successful
Test for arch successful
Test for centos6 successful
Test for centos7 successful
Test for centos8 successful
Test for debian10 successful
Test for debian8 successful
Test for debian9 failed (/dev/vda is present; cmdline lists root=/dev/sda)
Test for dragonflybsd5 failed (searches for something else; only vbd0s1d is present)
Test for fedora25 successful
Test for fedora26 successful
Test for fedora27 successful
Test for fedora28 successful
Test for fedora29 successful
Test for fedora30 successful
Test for fedora31 successful
Test for fedora32 successful
Test for freebsd11 failed (searches for something else; would need vtbd0s1a instead)
Test for freebsd12 failed
Test for gentoo failed
Test for hardenedbsd11 failed
Test for hardenedbsd12 failed
Test for netbsd8 failed
Test for openbsd6 successful
Test for opensuse15 failed
Test for opensuse42 failed
Test for oracle7 successful
Test for oracle8 successful
Test for rhel6 successful
Test for rhel7 successful
Test for rhel8 successful
Test for ubuntu1604 successful
Test for ubuntu1610 successful
Test for ubuntu1704 successful
Test for ubuntu1710 successful
Test for ubuntu1804 successful
Test for ubuntu1810 successful
Test for ubuntu1904 successful
Test for ubuntu1910 successful
Test for ubuntu2004 successful

In a few cases I already checked what the issue is (noted in parentheses after the tests). Mostly, it appears to be cases where the installer hardcoded the root device path (without using labels or UUIDs).

ladar commented 4 years ago

I don't see an obvious choice. We could migrate the distros that work, but that feels like we're leaving the rest broken.

@timschumi if you update the packer config to virtio, do the boxes compile on a non-RHEL system?

timschumi commented 4 years ago

I don't see an obvious choice. We could migrate the distros that work, but that feels like we're leaving the rest broken.

The scope of this issue was (and still is) fixing the boxes that don't boot in their current in-production configuration. I (and probably most users) couldn't care less whether a box is using VirtIO or SCSI, as long as it works. The fact that other distributions might abandon their SCSI drivers as well is of course a point of concern, but they still work as of now.

Only a few boxes are broken right now; maybe we should try finding working solutions for those boxes first (even if they don't all end up using the same interface). The other boxes can follow at another time, as we see fit (or as deemed necessary).

Mainly the CentOS/Fedora/Oracle/RHEL boxes, which are currently broken, worked fine with the virtio-scsi/virtio combination. That would already take care of more than half of the broken boxes, at least for now. I'll file a PR for extensive testing shortly (if you are ok with that), so that we are working with the same scripts and can track the specific subset of results there.

@timschumi if you update the packer config to virtio, do the boxes compile on a non-RHEL system?

I ran a few quick builds of the Alpine and Debian 9 boxes (on a Manjaro host, but as far as I can tell, this isn't the problem here), using the "virtio"/"virtio" combination (i.e. tip at commit 879edb1). All of those failed, stating that they can't find their root devices.

While I wasn't able to get out of the installer on Debian 9 to debug further, I got some more information on Alpine: it appears that, while the virtio_blk kernel module is included in the image, it doesn't load even when a VirtIO disk is attached (no dice with manually loading it either; the disks simply won't show up).

EDIT: Also, we might have some more luck with a virtio-scsi/virtio-scsi combination for the boxes which can't use virtio-scsi/virtio, since virtio-scsi apparently uses parts of the VirtIO stack (but keeps SCSI compatibility). Additionally, that would make the installer disk type and final disk type match, which hopefully removes any chance of having mismatches between root devices. I'll try that as soon as we can get the actually broken boxes out of the way.

EDIT2: virtio-scsi only exists for packer apparently, not Vagrant. Huh.

electrofelix commented 4 years ago

I think the simplest fix might be for me to finish off https://github.com/vagrant-libvirt/vagrant-libvirt/pull/692 and try to work out a sane set of rules for when to automatically add the SCSI controller; that would most likely result in all of the boxes using scsi working immediately.

So far my thought is: if disk_controller_model is unset and disk_bus is set to scsi, then default it to virtio-scsi, which will enable the controller. Otherwise, rely on what is set explicitly. Assuming this doesn't break other boxes. If it does, would it be possible for that setting to be added to the Vagrantfile packaged with these boxes?
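
Expressed as pseudo-shell for clarity (the plugin itself is Ruby; the variable names simply mirror the vagrant-libvirt option names):

if [ -z "$disk_controller_model" ] && [ "$disk_bus" = "scsi" ]; then
  disk_controller_model="virtio-scsi"   # setting the model implies adding the controller
fi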

timschumi commented 4 years ago

I'm unsure whether this would actually help with our problem at hand, though.

One set of boxes does not support SCSI anymore, while the other doesn't support going full VirtIO (in addition to there being mismatch issues in some cases). This means that we have to go through the boxes and find working combinations anyway. virtio-scsi seems to be a good middle ground for most of them.

If I understood the conversation in the linked PR correctly, you want to decouple virtio-scsi from the disk_type setting, so that it can even be used when scsi is specified instead? We can surely update the template files for the new behaviour, but I'm unsure whether you would want to roll out such a change at all, since it might potentially break stuff outside of roboxes, especially for running systems and boxes that aren't updated anymore (but that's just my opinion).

EDIT: I just realized that virtio-scsi is only available inside packer, not in Vagrant. Whoops. You can probably disregard my whole message now. What you are proposing does indeed make sense. I'm pretty sure that the configs can be updated as needed; however, we would have to keep compatibility for people who won't have that feature yet.

timschumi commented 4 years ago

@ladar I filed #149 so that we can do some separate extensive testing. It takes care of all the boxes mentioned in the original issue, except for the OpenSUSE ones. They seem to be fine with being built on virtio-scsi but ultimately using pure virtio. The only remaining thing I can see is to test those on different (non-Manjaro) host OSes.

ladar commented 4 years ago

virtio-scsi only exists for packer apparently, not Vagrant. Huh.

I believe there is a pull request which adds virtio-scsi to the Vagrant plugin, but it hasn't been fully vetted/merged. I think this is the issue.

ladar commented 4 years ago

So far my thought is: if disk_controller_model is unset and disk_bus is set to scsi, then default it to virtio-scsi, which will enable the controller. Otherwise, rely on what is set explicitly. Assuming this doesn't break other boxes. If it does, would it be possible for that setting to be added to the Vagrantfile packaged with these boxes?

@electrofelix if I recall correctly, the issue isn't a lack of scsi support, but the lack of virtio-scsi specifically, which is why the bundled Vagrantfile used scsi ... which worked for a while. Until differing distros started defining scsi differently ... which is what broke things to begin with. If vagrant follows packer, it should treat scsi, virtio, and virtio-scsi as distinct options.

I use the bundled Vagrantfiles to try and define options similar to those used when the box image is built, to avoid issues with mismatched device names, etc.

ladar commented 4 years ago

I think the mid-term strategy is to use the virtio-scsi / virtio combination with specific boxes, but keep the BSD and other boxes which failed during our test on the existing virtio-scsi / scsi option.

The risk is that the boxes which stick with the existing model may start to break as users move to newer hosts, while the guests we switch to the new pattern might stop working on older hosts.

Like I said, I don't see an obvious pair of options that are safe. I initially chose scsi because I figured it would be well supported... and it worked well for a while.

ladar commented 4 years ago

@electrofelix @timschumi would either of you care to propose a list of boxes for migration to virtio-scsi / virtio ?

If I look at the list of boxes which passed during my test run, the only failure which concerns me is Debian 9. Updating Debian 8 and 10 but not 9 doesn't seem right.

The OpenSUSE failures are also noteworthy, since those boxes should be using recent Linux kernels, but leaving them as-is for now is an option.

If we solved the mystery of those boxes, that would just leave the BSD guests, and Alpine 3.5, all of which I'm fine leaving set to the current virtio-scsi / scsi paradigm. At least until support for virtio-scsi can be set via the Vagrantfile.

timschumi commented 4 years ago

@electrofelix @timschumi would either of you care to propose a list of boxes for migration to virtio-scsi / virtio ?

I already put together a list of boxes (and tested them on a Manjaro host) in #149. That would already take care of a lot of the boxes that are currently absolutely broken (basically everything except for OpenSUSE will work with this PR included).

If I look at the list of boxes which passed during my test run, the only failure which concerns me is Debian 9. Updating Debian 8 and 10 but not 9 doesn't seem right.

The issue with Debian 8/9/10 isn't a driver issue in this case (at least not fully). The way that Debian finds its root device is by simply specifying it in the kernel cmdline. Debian 8 and 10 use lookups by UUID, which work independent of the driver as long as the drive is present and accessible. For whatever reason, Debian 9 decides to simply store a static root=/dev/sda3 during the installation process instead, which obviously breaks when the switch to virtio (and therefore vda) is made.

For comparison:

vagrant@debian8:~$ cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-3.16.0-10-amd64 root=UUID=1b0f4e37-3f1d-40bb-ba1c-9daa0198393c ro biosdevname=0 net.ifnames=0 biosdevname=0 quiet

vagrant@debian9:~$ cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-4.9.0-12-amd64 root=/dev/sda3 ro ipv6.disable_ipv6=1 net.ifnames=0 biosdevname=0 net.ifnames=0 biosdevname=0 quiet

vagrant@debian10:~$ cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-4.19.0-9-amd64 root=UUID=6d5ac749-e496-4ce6-b68b-e88d9e28573a ro ipv6.disable_ipv6=1 net.ifnames=0 biosdevname=0 net.ifnames=0 biosdevname=0 quiet

The only reason I can think of is that there was an installer or GRUB change that was reverted for later releases.

Maybe-relevant bug report.

The OpenSUSE failures are also noteworthy, since those boxes should be using recent Linux kernels, but leaving them as-is for now is an option.

How new a kernel is doesn't really seem to be relevant in most of those cases. Rather, what matters is which of the kernel modules are =y and which are =m, how update-initramfs (or its equivalent) is configured, and what the installer does to store which root device it should look for (e.g. Debian 9).
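
One quick way to see how a given guest kernel was built (the config path varies by distro; the symbol list here is illustrative):

grep -E 'VIRTIO_BLK|SCSI_VIRTIO|SYM53C8XX' /boot/config-$(uname -r)   # =y means built in, =m means module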

If we solved the mystery of those boxes, that would just leave the BSD guests, and Alpine 3.5, all of which I'm fine leaving set to the current virtio-scsi / scsi paradigm. At least until support for virtio-scsi can be set via the Vagrantfile.

I'm unsure whether your Alpine 3.5 failure was just a fluke. I wasn't able to replicate that failure on current Manjaro (or on Ubuntu 16.04), and the remaining failed boxes seemed to be pretty consistent (independent of the host OS).

ladar commented 4 years ago

I already put together a list of boxes (and tested them on a Manjaro host) in #149. That would already take care of a lot of the boxes that are currently absolutely broken (basically everything except for OpenSUSE will work with this PR included).

I haven't looked at #148 yet, but moving the RHEL images to virtio seems pretty safe. I'm more concerned with the rest, and trying to pick a consistent strategy.

The issue with Debian 8/9/10 isn't a driver issue in this case (at least not fully). The way that Debian finds its root device is by simply specifying it in the kernel cmdline. Debian 8 and 10 use lookups by UUID, which work independent of the driver as long as the drive is present and accessible. For whatever reason, Debian 9 decides to simply store a static root=/dev/sda3 during the installation process instead, which obviously breaks when the switch to virtio (and therefore vda) is made.

Good catch, finding the UUID distinction. I figured it was something like that, I just wasn't sure what. I looked over the preseed file documentation, and I'm currently building a batch of 3.0.5 boxes with d-i partman/mount_style select uuid ... as a test. It isn't clear from the docs which option controls this parameter, but the partman and grub-installer options looked the most promising. Switching to lvm may also be an option.
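
For context, the directive being tested, as it sits in the preseed file; per the Debian docs this controls how mounts are written to /etc/fstab, so whether it also influences the GRUB root= parameter is exactly what's being tested:

d-i partman/mount_style select uuid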

How new a kernel is doesn't really seem to be relevant in most of those cases. Rather, what matters is which of the kernel modules are =y and which are =m, how update-initramfs (or its equivalent) is configured, and what the installer does to store which root device it should look for (e.g. Debian 9).

I say newer kernel version because it feels like newer kernels tend to include virtio along with other virtualization drivers by default. I suspect that's because the defaults changed, and the maintainers simply accepted the new default.

I'm unsure whether your Alpine 3.5 failure was just a fluke. I wasn't able to replicate that failure on current Manjaro (or on Ubuntu 16.04), and the remaining failed boxes seemed to be pretty consistent (independent of the host OS).

I don't think Alpine 3.5 is a fluke. It failed on 2 different CentOS systems. I think the failure is a by-product of the virtio variations that have me worried. Alpine 3.5 is also a good example of an "older" kernel which doesn't include as many drivers by default. Alpine 3.5 won't even boot on Hyper-V without using the larger, vanilla kernel, because it lacks support for the virtual hardware.

ladar commented 4 years ago

@timschumi uploading the 3.0.5 test images now.

I also looked over that list thread you linked to. If this doesn't work, perhaps I can regenerate the GRUB config as part of the post-install script, as a workaround.

timschumi commented 4 years ago

@timschumi uploading the 3.0.5 test images now.

No dice, the cmdline still lists root=/dev/sda3 on Debian 9.

EDIT: Apparently, simply running update-grub does indeed replace the static path with the UUID. Maybe it really is a race condition, but we'd need to dive further into the GRUB configuration to find out where exactly it fails (due to a lack of alternatives, though, it's probably test -e "/dev/disk/by-uuid/${GRUB_DEVICE_UUID}", since the others are static configuration settings or LVM-specific only).

ladar commented 4 years ago

@timschumi running /bin/update-dev ; /usr/bin/grub-installer /target from the console during install fixes the problem, but I haven't gotten it to work as part of the late command string.

I also noticed that if I run update-grub after installation, it fixes the configuration. So if I can't get the late command string to work soon, I'll go with that fix and build new 3.0.5 test boxes. Stay tuned.
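
A hedged sketch of the late command approach (preseed syntax; whether this exact invocation works during the install is the open question):

d-i preseed/late_command string in-target update-grub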

ladar commented 4 years ago

The 3.0.5 boxes are live, with the fix, and the GRUB config appears correct, with the root parameter set to the UUID, but it doesn't seem to be working. I've pasted the GRUB config below.

#
# DO NOT EDIT THIS FILE
#
# It is automatically generated by grub-mkconfig using templates
# from /etc/grub.d and settings from /etc/default/grub
#

### BEGIN /etc/grub.d/00_header ###
if [ -s $prefix/grubenv ]; then
  set have_grubenv=true
  load_env
fi
if [ "${next_entry}" ] ; then
   set default="${next_entry}"
   set next_entry=
   save_env next_entry
   set boot_once=true
else
   set default="0"
fi

if [ x"${feature_menuentry_id}" = xy ]; then
  menuentry_id_option="--id"
else
  menuentry_id_option=""
fi

export menuentry_id_option

if [ "${prev_saved_entry}" ]; then
  set saved_entry="${prev_saved_entry}"
  save_env saved_entry
  set prev_saved_entry=
  save_env prev_saved_entry
  set boot_once=true
fi

function savedefault {
  if [ -z "${boot_once}" ]; then
    saved_entry="${chosen}"
    save_env saved_entry
  fi
}
function load_video {
  if [ x$feature_all_video_module = xy ]; then
    insmod all_video
  else
    insmod efi_gop
    insmod efi_uga
    insmod ieee1275_fb
    insmod vbe
    insmod vga
    insmod video_bochs
    insmod video_cirrus
  fi
}

if [ x$feature_default_font_path = xy ] ; then
   font=unicode
else
insmod part_msdos
insmod ext2
set root='hd0,msdos3'
if [ x$feature_platform_search_hint = xy ]; then
  search --no-floppy --fs-uuid --set=root --hint-bios=hd0,msdos3 --hint-efi=hd0,msdos3 --hint-baremetal=ahci0,msdos3  85028ce1-57c5-402d-9979-7969b310748e
else
  search --no-floppy --fs-uuid --set=root 85028ce1-57c5-402d-9979-7969b310748e
fi
    font="/usr/share/grub/unicode.pf2"
fi

if loadfont $font ; then
  set gfxmode=auto
  load_video
  insmod gfxterm
  set locale_dir=$prefix/locale
  set lang=en_US
  insmod gettext
fi
terminal_output gfxterm
if [ "${recordfail}" = 1 ] ; then
  set timeout=30
else
  if [ x$feature_timeout_style = xy ] ; then
    set timeout_style=menu
    set timeout=5
  # Fallback normal timeout code in case the timeout_style feature is
  # unavailable.
  else
    set timeout=5
  fi
fi
### END /etc/grub.d/00_header ###

### BEGIN /etc/grub.d/05_debian_theme ###
set menu_color_normal=cyan/blue
set menu_color_highlight=white/blue
### END /etc/grub.d/05_debian_theme ###

### BEGIN /etc/grub.d/10_linux ###
function gfxmode {
    set gfxpayload="${1}"
}
set linux_gfx_mode=
export linux_gfx_mode
menuentry 'Debian GNU/Linux' --class debian --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-85028ce1-57c5-402d-9979-7969b310748e' {
    load_video
    insmod gzio
    if [ x$grub_platform = xxen ]; then insmod xzio; insmod lzopio; fi
    insmod part_msdos
    insmod ext2
    set root='hd0,msdos1'
    if [ x$feature_platform_search_hint = xy ]; then
      search --no-floppy --fs-uuid --set=root --hint-bios=hd0,msdos1 --hint-efi=hd0,msdos1 --hint-baremetal=ahci0,msdos1  0e296f10-d593-4378-a898-63bcf4be4fa5
    else
      search --no-floppy --fs-uuid --set=root 0e296f10-d593-4378-a898-63bcf4be4fa5
    fi
    echo    'Loading Linux 4.9.0-12-amd64 ...'
    linux   /vmlinuz-4.9.0-12-amd64 root=UUID=85028ce1-57c5-402d-9979-7969b310748e ro ipv6.disable_ipv6=1 net.ifnames=0 biosdevname=0 net.ifnames=0 biosdevname=0 quiet
    echo    'Loading initial ramdisk ...'
    initrd  /initrd.img-4.9.0-12-amd64
}
submenu 'Advanced options for Debian GNU/Linux' $menuentry_id_option 'gnulinux-advanced-85028ce1-57c5-402d-9979-7969b310748e' {
    menuentry 'Debian GNU/Linux, with Linux 4.9.0-12-amd64' --class debian --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-4.9.0-12-amd64-advanced-85028ce1-57c5-402d-9979-7969b310748e' {
        load_video
        insmod gzio
        if [ x$grub_platform = xxen ]; then insmod xzio; insmod lzopio; fi
        insmod part_msdos
        insmod ext2
        set root='hd0,msdos1'
        if [ x$feature_platform_search_hint = xy ]; then
          search --no-floppy --fs-uuid --set=root --hint-bios=hd0,msdos1 --hint-efi=hd0,msdos1 --hint-baremetal=ahci0,msdos1  0e296f10-d593-4378-a898-63bcf4be4fa5
        else
          search --no-floppy --fs-uuid --set=root 0e296f10-d593-4378-a898-63bcf4be4fa5
        fi
        echo    'Loading Linux 4.9.0-12-amd64 ...'
        linux   /vmlinuz-4.9.0-12-amd64 root=UUID=85028ce1-57c5-402d-9979-7969b310748e ro ipv6.disable_ipv6=1 net.ifnames=0 biosdevname=0 net.ifnames=0 biosdevname=0 quiet
        echo    'Loading initial ramdisk ...'
        initrd  /initrd.img-4.9.0-12-amd64
    }
    menuentry 'Debian GNU/Linux, with Linux 4.9.0-12-amd64 (recovery mode)' --class debian --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-4.9.0-12-amd64-recovery-85028ce1-57c5-402d-9979-7969b310748e' {
        load_video
        insmod gzio
        if [ x$grub_platform = xxen ]; then insmod xzio; insmod lzopio; fi
        insmod part_msdos
        insmod ext2
        set root='hd0,msdos1'
        if [ x$feature_platform_search_hint = xy ]; then
          search --no-floppy --fs-uuid --set=root --hint-bios=hd0,msdos1 --hint-efi=hd0,msdos1 --hint-baremetal=ahci0,msdos1  0e296f10-d593-4378-a898-63bcf4be4fa5
        else
          search --no-floppy --fs-uuid --set=root 0e296f10-d593-4378-a898-63bcf4be4fa5
        fi
        echo    'Loading Linux 4.9.0-12-amd64 ...'
        linux   /vmlinuz-4.9.0-12-amd64 root=UUID=85028ce1-57c5-402d-9979-7969b310748e ro single ipv6.disable_ipv6=1 net.ifnames=0 biosdevname=0 net.ifnames=0 biosdevname=0
        echo    'Loading initial ramdisk ...'
        initrd  /initrd.img-4.9.0-12-amd64
    }
}

### END /etc/grub.d/10_linux ###

### BEGIN /etc/grub.d/20_linux_xen ###

### END /etc/grub.d/20_linux_xen ###

### BEGIN /etc/grub.d/30_os-prober ###
### END /etc/grub.d/30_os-prober ###

### BEGIN /etc/grub.d/30_uefi-firmware ###
### END /etc/grub.d/30_uefi-firmware ###

### BEGIN /etc/grub.d/40_custom ###
# This file provides an easy way to add custom menu entries.  Simply type the
# menu entries you want to add after this comment.  Be careful not to change
# the 'exec tail' line above.
### END /etc/grub.d/40_custom ###

### BEGIN /etc/grub.d/41_custom ###
if [ -f  ${config_directory}/custom.cfg ]; then
  source ${config_directory}/custom.cfg
elif [ -z "${config_directory}" -a -f  $prefix/custom.cfg ]; then
  source $prefix/custom.cfg;
fi
### END /etc/grub.d/41_custom ###

ladar commented 4 years ago

I may have had an outdated copy of the volume file in my libvirt storage, which was being cloned instead of my most recent attempt. Confirming now...
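
For anyone hitting the same thing, clearing the cached base image forces a fresh clone; the pool and volume names below are illustrative, since vagrant-libvirt derives them from the box name and version:

virsh vol-list default
virsh vol-delete --pool default generic-VAGRANTSLASH-debian9_vagrant_box_image_3.0.5.img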

timschumi commented 4 years ago

After almost reporting that the root parameter didn't actually change, I remembered this as well.

After clearing out the storage and cloning the box again, debian9 now boots successfully. The root block device is now successfully identified as /dev/vda3 (after resolving by UUID) and everything seems to work fine.

ladar commented 4 years ago

That should clear the way for us to make a similar change to Fedora/RHEL/Ubuntu/Arch and Alpine 3.6+.

Alpine 3.5, Gentoo, OpenSUSE, and the BSD variants will continue using SCSI... for now.

ladar commented 4 years ago

@timschumi do you concur?

timschumi commented 4 years ago

That should clear the way for us to make a similar change to Fedora/RHEL/Ubuntu/Arch and Alpine 3.6+.

Alpine 3.5, Gentoo, OpenSUSE, and the BSD variants will continue using SCSI... for now.

Agreed. Although we should probably focus on the ones which are actually broken first, the others would just be "nice to have" at this point.

EDIT: And just as I wrote this, the merge notification for #149 comes in. :P

I'll probably take a look at OpenSUSE next, since they are the last two that are still having acute issues.

ladar commented 4 years ago

OpenSUSE and Gentoo... although the latter is probably the device name, or a failure to use UUIDs.

ladar commented 4 years ago

While you look into those, I've kicked off the 3.0.8 build, so we'll be able to release a fresh batch of boxes with this change in 2-4 days.

electrofelix commented 2 years ago

I think I've determined how to fix up vagrant-libvirt/vagrant-libvirt#692 in a way that will generally solve this problem a bit better by defaulting to adding a virtio-scsi controller for the box disks any time the box disk bus is set to scsi or the box disk device is sd[a-z].

Sorry for the delays; I left that PR hanging for a while due to some issues with it. I've only recently been working enough on sorting out disk devices to determine how best to finish it off.