lavabit / robox

The tools needed to robotically create/configure/provision a large number of operating systems, for a variety of hypervisors, using packer.

debian10 is missing the vboxsf kernel module #196

Closed timschumi closed 3 years ago

timschumi commented 3 years ago

Apparently, debian10 is using a VBoxGuestAdditions ISO old enough (it was explicitly bumped down in 0f9b2b676c19c3cdfdcc341cec16d0911138ce62?) that it doesn't yet have compatibility for Linux 4.19 (or rather, Linux 4.18+). Since Debian 10 shipped with Linux 4.19 from the beginning, I'm going to assume that this has been broken since day one.

Somehow, the installation script doesn't feel that not being able to compile vboxsf is an error important enough to return an exit code for.

I'm currently trying to figure out SVN to check when the necessary compatibility code was introduced; I'll upload a "bump guest additions" patch for testing once I find out. Other boxes might be affected as well, depending on their kernel and VBox versions.
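
A rough sketch of that kind of history search, assuming the public VirtualBox SVN repository and an SVN 1.8+ client (the URL, source path, and search term are assumptions, not something from this thread):

# Hedged sketch: look for guest additions commits that mention 4.19 support.
# Repository URL, source path, and search term are assumptions.
svn log --search "4.19" --limit 50 https://www.virtualbox.org/svn/vbox/trunk/src/VBox/Additions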

ladar commented 3 years ago

@timschumi you might recall that I use the add.sh script to "clone" existing box configs when adding a new variant. My guess is that change was needed by Debian 9 and ended up being carried over to Debian 10 because of the clone process.

When I add a variant, I try to look for things that might need to be updated manually, but I don't usually think to check the VBox guest additions ISO version, since there are only a couple of boxes which use old versions. You can probably switch the config to the current default target (which is 5.2.44) since that is what I have installed on the build robots, and what most of the configs currently use.

Unfortunately the newest packer versions don't work with this VBox version (they blindly pass the nested virtualization flag to VBox, and that flag is invalid on version 5.2.44), so I've been stuck using packer v1.6.6. I tried upgrading the robots to VBox 6.1 but that caused the NetBSD install process to fail. Probably because of a timing issue (I think that is the config that mimics user input because there is no automated install method, but I'm not sure, as I didn't have time to investigate).

Anyways, when I fix that issue, I'll update the ISO URL for all the boxes using 5.2.44 up to 6.1.x (and then downgrade any that break).
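
For illustration, a bump like that could be scripted as something like the following (only a sketch; the file glob and the exact version strings in the box configs are assumptions, and the matching guest_additions_sha256 values would need updating too):

# Hedged sketch: bump the guest additions ISO version across the VirtualBox
# configs. File glob and version strings are assumptions; verify the
# guest_additions_url keys and checksums in the JSON files afterwards.
sed -i 's/VBoxGuestAdditions_5\.2\.44/VBoxGuestAdditions_6.1.22/g' *-virtualbox.json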

timschumi commented 3 years ago

@timschumi you might recall that I use the add.sh script to "clone" existing box configs when adding a new variant. My guess is that change was needed by Debian 9 and ended up being carried over to Debian 10 because of the clone process.

The downgrade commit I linked above happened after a generic "Updated VirtualBox ISO" and only affected debian10, so debian9 was still on the newer version during that release. Is it possible that something was wrong with the guest additions and they had to be downgraded due to debian10 being in alpha at that time?

When I add a variant, I try to look for things that might need to be updated manually, but I don't usually think to check the VBox guest additions ISO version, since there are only a couple of boxes which use old versions. You can probably switch the config to the current default target (which is 5.2.44) since that is what I have installed on the build robots, and what most of the configs currently use.

I've been trying to get 5.2.44 running, but I hit a few roadblocks along the way. It seems like the installer does something weird when loading vboxguest (basically a common dependency of the various VirtualBox kernel modules) so that vboxsf can't find its required functions and therefore fails to load (which of course is an error that is important enough to return an exit code over). Force-unloading the vboxguest module or rebooting fixes the issue, but the installer failing kills the build before that, and I don't really want to remove the exit code check there. This means fixing it properly seems to be the only good option.
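
The manual workaround described above amounts to roughly the following, as a sketch for debugging inside the guest (it assumes the kernel permits forced module unloads and that nothing else pins vboxguest; a reboot achieves the same thing):

# Hedged sketch of the manual workaround: force the stale vboxguest module
# out and retry loading vboxsf. For debugging only; assumptions as above.
rmmod -f vboxguest
modprobe vboxguest
modprobe vboxsf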

I also added a check for the presence of vboxsf so that I can maybe spot a few other broken boxes (if they exist), but I haven't built the whole virtualbox roster of boxes yet.
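
For reference, a minimal sketch of what such a presence check can look like inside the guest (the error message and exit code here are just illustrative):

# Minimal sketch of a vboxsf presence check: modinfo exits non-zero if no
# vboxsf module exists for the running kernel.
if ! modinfo vboxsf >/dev/null 2>&1; then
  echo "vboxsf kernel module is missing for $(uname -r)" >&2
  exit 1
fi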

ladar commented 3 years ago

@timschumi what if we skip ahead, and use the 6.1.22 ISO?

The 6.1.22 modules should work on 5.1 and 5.2 VBox systems, right? At least in theory?

ladar commented 3 years ago

@timschumi I think the reason I don't skip ahead in general is because using a newer ISO will generate a warning during install that it's running on an older version of VirtualBox... but I could be mixing up my providers. If possible, let's see if there is any easy way to silence that warning, otherwise I might find myself thinking there is a problem, should I happen to notice that warning in the log files.

ladar commented 3 years ago

@timschumi finally, if you decide to build all of the VirtualBox configs, you can control how many it does in parallel:

export GOMAXPROCS="4"
export PACKERMAXPROCS="2"
./robox.sh generic-virtualbox

The first param controls how many threads Go uses for parallel processing tasks. This mostly affects the box file generation stage, as it determines how many CPUs will be used to compress the output into a box file. You'll want to limit that, otherwise it can use up all your CPUs, and starve other box builds, which can sometimes cause a failure, depending on what that other build might be doing. The second param controls how many boxes are built in parallel... so the above values should work well on an 8 core system... with a single disk, or slower SSD. If you have a fast network, and multiple SSDs (and/or fast NVMe drives), you could probably go with 2 for Go and 4 for packer on that same 8 core system. Anyways, you get the idea. If you have a super fast server with 64 cores, and an SSD RAID array, then increase accordingly.

Anyways, I usually set those params in the .credentialsrc file OR via the command line. If you set them in both places, the .credentialsrc will override the command line settings... so be sure to check your file first.

I say this because, if you're using the default .credentialsrc, it might have set those values to low default levels to avoid causing problems.
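
For example, the overrides in .credentialsrc are just shell exports (assuming the file is sourced as a shell script by robox.sh, which is how the override behavior above suggests it works); a sketch of the relevant lines:

# Hedged sketch of the tuning knobs set in .credentialsrc instead of the
# command line. The values are only examples; they win over CLI settings.
export GOMAXPROCS="4"
export PACKERMAXPROCS="2"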

timschumi commented 3 years ago

@timschumi you might recall that I use the add.sh script to "clone" existing box configs when adding a new variant. My guess is that change was needed by Debian 9 and ended up being carried over to Debian 10 because of the clone process.

The downgrade commit I linked above happened after a generic "Updated VirtualBox ISO" and only affected debian10, so debian9 was still on the newer version during that release. Is it possible that something was wrong with the guest additions and they had to be downgraded due to debian10 being in alpha at that time?

I just tested the 5.2.26 ISO and I'm getting the same issues as with 5.2.44, so I'm starting to piece together what happened: back when there was the big upgrade to 5.2.26, the debian10 box didn't build (since the installer detects kernel module loading issues), so it was restored to the previous version, 5.1.38. However, 5.1.38 was already broken as well, which went unnoticed since the installer doesn't detect build failures of the kernel modules.

@timschumi what if we skip ahead, and use the 6.1.22 ISO?

The 6.1.22 modules should work on 5.1 and 5.2 VBox systems, right? At least in theory?

I'll try that one and see if it works better.

To my knowledge, the host/guest modules offer quite a bit of compatibility, but I'm not sure about major versions if the guest is the newer one. A newer host obviously works fine, since most users are probably using 6.x at this point. On the other hand, none of the boxes that rely on the distribution's guest modules seem to have any issues. And while we're already on that topic, is there a particular reason why we prefer the ISO over the distribution's own package for guest modules?

@timschumi I think the reason I don't skip ahead in general is because using a newer ISO will generate a warning during install that it's running on an older version of VirtualBox... but I could be mixing up my providers. If possible, let's see if there is any easy way to silence that warning, otherwise I might find myself thinking there is a problem, should I happen to notice that warning in the log files.

Would VirtualBox even recognize the guest modules at this point during the installation? The installation happens pretty late during the whole box build...

I know Vagrant does have a setting to enable presence-/out-of-date-checking of the guest modules, but I'm not sure if that applies to packer (or if packer even does any checks at all).

finally, if you decide to build all of the VirtualBox configs, you can control how many it does in parallel:

Already found the maxprocs flags yesterday and started the build (which worked fine for the most part), but I ended up running into issues with a few of my ISO downloads stalling. I'll start another round in a few minutes where the remaining ISOs will hopefully finish downloading.

ladar commented 3 years ago

is there a particular reason why we prefer the ISO over the distribution's own package for guest modules?

All of the configs have the ISO configured... but some of the box variants use the system package, and then simply delete the ISO. There isn't a clear methodology on which to use... it usually comes down to which one works better. If the system package is reasonably up to date, and complete, then I don't mind using it instead.

In all likelihood, the reason the Debian boxes use the ISOs is because when I first created the box config, there was no system package, or it was out of date. Traditionally VirtualBox had trouble getting into F/OSS project repos because of licensing issues.

You'll recall we went through a similar process with the Alpine boxes. As I recall, the oldest Alpine boxes still use the ISOs, but the newer Alpine variants now use the system package.
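
For comparison, switching a Debian-family box to the distribution package is roughly a one-liner in the provisioning script (a sketch only; the package names are an assumption and vary by release, and on Debian they may live in contrib or backports):

# Hedged sketch: use the distribution's own guest additions packages instead
# of the Oracle ISO. Package names/availability differ between releases.
apt-get install -y virtualbox-guest-utils virtualbox-guest-dkms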

I know Vagrant does have a setting to enable presence-/out-of-date-checking of the guest modules, but I'm not sure if that applies to packer (or if packer even does any checks at all).

Actually, as I recall the warning has nothing to do with vagrant, but with the guest module version, and the host VirtualBox version. And I think the warning is triggered during setup/install, not when it's created via Vagrant.

Already found the maxprocs flags yesterday and started the build (which worked fine for the most part), but I ended up running into issues with a few of my ISO downloads stalling. I'll start another round in a few minutes where the remaining ISOs will hopefully finish downloading.

What I do on my build robots is use screen -R robox and create two virtual consoles. On the first, I run:

git pull ; ./robox.sh cleanup ; nice -n +19 packer build -parallel-builds=1 packer-cache.json  

And I give it a couple of minutes to get a head start on the ISO downloads. Then I run:

./robox.sh virtualbox

On the second console. Without that logic, you are correct, the ISO downloads will overwhelm the system. Note that because Go spawns each component for each job into its own process, you can easily max out the default system limits. That's why the provider.sh setup script increases those limits. To do that manually, run something like the following as root, assuming you run packer as the user tim:

cat <<-EOF > /etc/security/limits.d/50-tim.conf
tim      soft    memlock    16467224
tim      hard    memlock    16467224
tim      soft    nproc      65536
tim      hard    nproc      65536
tim      soft    nofile     1048576
tim      hard    nofile     1048576
tim      soft    stack      unlimited
tim      hard    stack      unlimited
EOF
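
Those limits only apply to new sessions, so after logging in again as that user, they can be sanity-checked with something like:

# Print the memlock, nproc, nofile, and stack limits the fresh session
# actually picked up, to confirm the limits.d drop-in took effect.
ulimit -l -u -n -s
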
timschumi commented 3 years ago

Found the culprit. There is actually an in-kernel vboxguest module that couldn't be unloaded while the system is running (since drm depends on it via vboxvideo). Fortunately, the scripts decide to return error code 2 in case they can't (re)load the module, so we can just ignore that one. It also seems like version 5.2.44 returns the same error code with a slightly more verbose message, so in the end it installs just fine. The resulting guest additions are properly recognized and shared folders appear to work just fine.

I'll push a work-in-progress Pull Request in a few minutes and add more fixes for broken boxes once they are built (or rather, failed to build).

I know Vagrant does have a setting to enable presence-/out-of-date-checking of the guest modules, but I'm not sure if that applies to packer (or if packer even does any checks at all).

Actually, as I recall the warning has nothing to do with vagrant, but with the guest module version, and the host VirtualBox version. And I think the warning is triggered during setup/install, not when it's created via Vagrant.

I meant that Vagrant runs a check on the host and on the box to see if and which version of VirtualBox they each use and warns the user if they are missing or mismatched. The appropriate option is documented here.

That information then shows up in the output of vagrant up when a user starts the box:

$ vagrant up
[...]
==> x64: Checking for guest additions in VM...
    x64: The guest additions on this VM do not match the installed version of
    x64: VirtualBox! In most cases this is fine, but in rare cases it can
    x64: prevent things such as shared folders from working properly. If you see
    x64: shared folder errors, please make sure the guest additions within the
    x64: virtual machine match the version of VirtualBox you have installed on
    x64: your host and reload your VM.
    x64: 
    x64: Guest Additions Version: 5.2.44
    x64: VirtualBox Version: 6.1
==> x64: Mounting shared folders...
[...]

The question is whether packer implemented something similar or nothing at all. Since I haven't noticed any output regarding the version of the guest additions when building boxes, and I can't find any mention of GuestAdd (which according to the Vagrant source is the property name that VirtualBox uses for guest additions data) in the packer source code, I believe it's the latter.
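
For what it's worth, that property can also be read directly from the host with VBoxManage, which is a quick way to see the same value Vagrant checks (the VM name below is only a placeholder):

# Sketch: print the guest additions version the guest reported to the host.
# "generic-debian10-virtualbox" is a placeholder VM name.
VBoxManage guestproperty get "generic-debian10-virtualbox" "/VirtualBox/GuestAdd/Version"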

git pull ; ./robox.sh cleanup ; nice -n +19 packer build -parallel-builds=1 packer-cache.json  

Oh, so that's what packer-cache.json is for. I always thought it was weird for packer to have a cache file that had to be updated manually.

On the second console. Without that logic, you are correct, the ISO downloads will overwhelm the system. Note that because Go spawns each component for each job into its own process, you can easily max out the default system limits.

I think it was connection-related in my case. One of the ISOs downloaded at the speed of a few KB/s and others eventually stopped clearing from the queue although they were at 100% and their respective build already started in the background.

That's why the provider.sh setup script increases those limits. To do that manually, run something like the following as root, assuming you run packer as the user tim:

cat <<-EOF > /etc/security/limits.d/50-tim.conf
tim      soft    memlock    16467224
tim      hard    memlock    16467224
tim      soft    nproc      65536
tim      hard    nproc      65536
tim      soft    nofile     1048576
tim      hard    nofile     1048576
tim      soft    stack      unlimited
tim      hard    stack      unlimited
EOF

Noted, I'll apply those in case I run into issues again (one or two of those I already applied anyways back when I tried to run validate on all the JSONs).

ladar commented 3 years ago

I think it was connection-related in my case. One of the ISOs downloaded at the speed of a few KB/s and others eventually stopped clearing from the queue although they were at 100% and their respective build already started in the background.

I think that is actually a packer bug. There is a lock contention issue, which is why I like to give the cache build a head start. It grabs the Alpine ISO pretty quickly; how big a head start you need depends on how many parallel jobs you plan to run.

As for why the log hangs... if it's not lock contention, when the download is complete (or if the ISO is already in the cache), it still needs to generate a hash. That can take a while depending on how fast your disk is, how many tasks are fighting to read data, and how big the ISO file is. A few of the ISOs are in the gigabyte range, so they can appear to hang if you have a lot of box builds/downloads running.

Finally, even with all that, I think there is still a bug that can cause the output to hang at 100% with a lot of jobs running. It doesn't appear to keep the process from finishing, or matter, so I haven't looked into it all that much.

Note the cache config uses fake VMware box configs. If you don't actually have VMWare installed, you can easily trick packer into thinking you have it installed by creating a couple of empty files on your system. Let me know if you need those.
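
For illustration only, such stubs would presumably look something like the following (the exact binaries packer probes for are an assumption here, not something confirmed in this thread, so it may be safer to ask for the real list):

# Illustrative guess: empty executable stubs so packer's VMware driver
# detection passes. The file names are assumptions and may be incomplete.
touch /usr/bin/vmrun /usr/bin/vmware /usr/bin/vmware-vdiskmanager
chmod +x /usr/bin/vmrun /usr/bin/vmware /usr/bin/vmware-vdiskmanager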

ladar commented 3 years ago

Found the culprit. There is actually an in-kernel vboxguest module that couldn't be unloaded while the system is running (since drm depends on it via vboxvideo). Fortunately, the scripts decide to return error code 2 in case they can't (re)load the module, so we can just ignore that one. It also seems like version 5.2.44 returns the same error code with a slightly more verbose message, so in the end it installs just fine. The resulting guest additions are properly recognized and shared folders appear to work just fine. I'll push a work-in-progress Pull Request in a few minutes and add more fixes for broken boxes once they are built (or rather, failed to build).

I don't recall if it was VirtualBox, but I ran into a similar problem once before... where running the install process a second time causes the install to succeed. You might be able to bypass this bug/error by adding something like that here. Just add a || to the right command and run it again. Not sure if that will work here or not.

It's an ugly fix, but if it works, it's a lot better than simply ignoring the return code, etc. Because if a major problem ever does surface, at least in theory, running the installer a second time shouldn't make a difference. So it reduces the risk of accidentally releasing a broken box.
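
Concretely, the retry-once idea is just something like this (a sketch; the mount point and installer path inside the guest are assumptions):

# Sketch of the retry-once idea: if the first guest additions install fails,
# run it one more time before giving up. The installer path is an assumption.
sh /mnt/VBoxLinuxAdditions.run || sh /mnt/VBoxLinuxAdditions.run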

timschumi commented 3 years ago

I don't recall if it was VirtualBox, but I ran into a similar problem once before... where running the install process a second time causes the install to succeed. You might be able to bypass this bug/error by adding something like that here. Just add a || to the right command and run it again. Not sure if that will work here or not.

It's an ugly fix, but if it works, it's a lot better than simply ignoring the return code, etc. Because if a major problem ever does surface, at least in theory, running the installer a second time shouldn't make a difference. So it reduces the risk of accidentally releasing a broken box.

Running the installer twice is what the script did previously (so yes, it was VirtualBox), but that didn't work in this case.

Also, return code 2 is specifically noted as "already running modules prevented the new ones from loading", so I think that one is fine (and all the other error codes get caught as usual and abort the build).
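
In script form, that handling boils down to something like the following (a sketch; the installer path is an assumption, and 2 is the "already running modules prevented the new ones from loading" code mentioned above):

# Sketch: treat exit code 2 (running modules prevented loading) as non-fatal,
# and abort the build on any other failure. Installer path is an assumption.
sh /mnt/VBoxLinuxAdditions.run
RESULT=$?
if [ $RESULT -ne 0 ] && [ $RESULT -ne 2 ]; then
  echo "VirtualBox guest additions install failed ($RESULT)." >&2
  exit $RESULT
fi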