lavabit / robox

The tools needed to robotically create/configure/provision a large number of operating systems, for a variety of hypervisors, using packer.

Public Cloud Donation #215

Closed VincentSaelzler closed 1 year ago

VincentSaelzler commented 2 years ago

From the readme: "If you represent a public cloud, and would like to provide infrastructure support, please contact us directly, or open a ticket."

I have two spare servers and a symmetric gigabit connection available. Is there anything I can do to help?

Servers are Dell R720XD and Dell R620.

ladar commented 2 years ago

@VincentSaelzler sorry for taking so long to get back to you. I think you just missed the last merge window, and I generally can't keep up with my GitHub issue notifications, so they pile up. I then try to read/reply to the various issues/pull requests every 2-3 months, aka what I call the merge window. In the interim all my time, plus a lot of time I don't have, is simply spent trying to keep the packer configs working, and the boxes building, so I can upload new versions every 1-3 weeks.

To answer your question, yes I still need help. Given what you offered, can you take the lead on QA/QC for the box images? Once upon a time, I tried to use Travis for this role, but I could never get nested virtualization to work, so my Travis configs could only check whether a box downloaded properly. The exception being the Docker images, which did go through a simple testing process.

Unfortunately the project got way too big, and was using too much bandwidth, so the effort stalled, and I never bothered moving the Docker test script to the new Travis service.

Long story short, I have bash scripts which will download and test every box for a given provider, in a private repo. I've been meaning to merge those scripts into this public repo for a while, but wanted to preserve the commit history of the previous repo, which is tricky. And I was also worried there might have been a VMware Workstation license key and/or Vagrant VMware plugin license in an old commit, so it never happened. That said, I could grab the latest and add it to get you started.

Sadly I don't run those tests very often these days. In part because a full test for a single provider can occupy a server for 12 to 24 hours (I haven't timed it in a while), and the project has gotten so large that running them in sequence eventually starts failing because they hit the Vagrant bandwidth quota. That said, I last ran them 4-6 weeks ago, and all the libvirt/virtualbox images worked. A couple of the vmware boxes were failing but I never had time to investigate. And I don't recall the last time the Parallels/Hyper-V images were all tested.

So can you take on this role, and use your servers to QA/QC the boxes? In the short term it will mean running the test scripts, and most importantly investigating why a box failed to pass. Usually this means pulling up the console, figuring out why vagrant up and/or vagrant ssh couldn't connect, devising the required change, and adding it to the build scripts.
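As a rough illustration (these are standard vagrant commands, nothing project-specific), the manual digging usually starts with something like:

    vagrant up --provider=libvirt --debug 2>&1 | tee up.log   # capture the full provider/boot logging
    vagrant ssh-config                                        # confirm the host/port/key vagrant is trying to use
    vagrant ssh -- -vvv                                       # verbose ssh to see where the connection stalls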

Long term it will mean moving from my test-all-the-boxes-in-sequence methodology to a strategy which looks for new robox releases, tests the most popular images shortly after I upload them, and tests the less popular or EOL boxes less frequently. I always thought Jenkins would be perfect for this, since it could coordinate different test nodes to avoid provider conflicts, but I never had the time to pursue it. We have a Lavabit Jenkins server, but it currently only uses/tests the Magma boxes, and it doesn't have the capacity to take on what's needed. It would also have issues with provider conflicts, since there is only a single test server node.

All that said, can you use those servers to help with QA/QC? Specifically, can you take on the job of testing the VirtualBox/VMware/libvirt and Docker images? They can all be tested inside a Linux environment, which makes it easier, and would be a good place to start.

Finally, to follow up on the text you found above. We also desperately could use a hardware upgrade. Specifically, the most pressing need is for an ARM server, for example one with an Ampere CPU, to create libvirt ARM images, and/or a Mac mini with an M1 so we can create ARM variants for Parallels.

We will also, eventually need to upgrade the blade servers we use to build the images. They've done a good job, but the project keeps growing, and it's starting to take 36-48 hours for a server to build all the boxes for a given provider. Eventually we'll hit a point where we need better hardware, or we'll need to stop adding new box variants.

From a bandwidth perspective, the Lavabit environment has 20 Gbps, so it's good there. And we have physical space, but we're at the 80% mark on our power circuit, hence the need to replace the old blade servers with newer, faster versions at some point.

VincentSaelzler commented 2 years ago

@ladar no worries on the delayed response!

I have the two servers available, and I have the time and knowledge to get the CI/CD environment in place.

In terms of time commitment, I can do a bit of work to get things kicked off, etc., but I don't have the capacity to dig into why things aren't building.

If that's helpful, let me know the best way we can get in touch to cover details.

ladar commented 2 years ago

@VincentSaelzler I added the existing test and check scripts to the project, in the aptly named check directory. Inside you'll find the check.sh script, which is similar to robox.sh in how it can be used. Normally the script only tells you whether a box is working or not. In contrast, the simple.sh script will print out all of the normal output from the vagrant commands, but is designed to run on a single box, to troubleshoot those that failed when the check.sh script ran. The connect.sh script is likewise designed to spin up a box, and establish an ssh connection to it for testing/troubleshooting purposes.

All of the scripts will create a vagrant.d underneath check, and configure vagrant to store all of its boxes, configs, and other related data in that directory. Essentially it's designed to provide a sandbox, which can be wiped, and rebuilt between runs, ensuring a consistently clean starting state, and allowing you to experiment with plugins/boxes without affecting what you have in the standard ~/.vagrant.d profile. Run ./check.sh cleanup to destroy any test boxes, VMs, etc., and remove the vagrant.d sandbox when you want to reset.
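Conceptually, the sandboxing boils down to pointing Vagrant at the local directory. Roughly speaking (this is just an illustration, the scripts handle it for you, and I'm assuming it via Vagrant's standard VAGRANT_HOME variable):

    export VAGRANT_HOME="$(pwd)/vagrant.d"   # keep boxes/plugins out of ~/.vagrant.d
    mkdir -p "$VAGRANT_HOME"
    vagrant box list                          # now reads/writes the sandbox instead of the normal profile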

These scripts already have code that will check to ensure the vagrant-libvirt plugin has been installed, and install it if necessary, whenever you try to use a libvirt box. Likewise, the same applies whenever it's asked to use a vmware box. That said, the latter logic needs updating. It was written a while ago, and doesn't know to look for the vagrant-vmware-workstation service that is now required. It also still thinks a license file is needed to use the vmware plugin. That said, if you run ./check.sh plugin-vmware ; ./check.sh plugin-vmware it will say it has failed both times, but all of the setup needed will be completed, and future checks will pass, allowing you to test/check/use the vmware boxes.
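The guard itself is nothing fancy; it amounts to something along these lines (a sketch of the idea, not the exact code in the scripts):

    if ! vagrant plugin list | grep -q '^vagrant-libvirt '; then
      vagrant plugin install vagrant-libvirt
    fi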

It's also worth noting that if you try to test/check/use a generic box, then the Vagrantfile template in the check directory is used. This allows you to test possible variations in how the box is configured, and/or test a proposed workaround that can then be merged into the embedded Vagrantfile shipped with each box. If you test/check/use the roboxes version of a box, it will set up the box using just the vagrant init command, starting with what is an essentially empty Vagrantfile (and probably closer to what a real user would be starting with).
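In other words, the roboxes path is roughly equivalent to what a user would type by hand, something like this (the box name and provider here are just for illustration):

    mkdir roboxes-test && cd roboxes-test
    vagrant init roboxes/debian11
    vagrant up --provider=libvirt
    vagrant ssh -c 'uname -a'
    vagrant destroy -f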

To test a group of boxes just run:

./check.sh generic-libvirt

Where generic is one of the org groups, and libvirt is the targeted provider. If you use the robox.sh script, it should be familiar. Run the script without any params to see the combinations.

As for keeping in touch, I'd like to keep as much of our discussion on GitHub as possible, so I can point to it in the future, and not retype my answers! That said, I tracked down your Gmail address, so I'll send you an email with my contact info, so we can also talk privately, when necessary.

Since my fingers feel like they are about to fall off, I'll end it here, and let you look through the code. Obviously feel free to ask questions once you're done.

VincentSaelzler commented 2 years ago

@ladar I got your email, and added you on XMPP.

I will look over the code, and see if I can figure out what's going on. In terms of the hardware requirements for the QA/QC environment: what are they?

For example:

HDDs OK or SSDs required?
How much disk space? 100s of GBs? Multiple TBs?
RAM requirements?
CPU requirements?

Not looking for exact numbers here, but it would be helpful to have a ballpark figure! I don't want to invest a ton of time understanding the code just to find out that I can't run it on my hardware!

ladar commented 2 years ago

@VincentSaelzler the hardware requirements are relatively low. All you need is enough to run a single VM at a time. That said, the more resources you have, the more boxes you can test in parallel. At least until you hit the Vagrant Cloud quota. I think I mentioned this above, but you should be able to test all of the boxes for a given org/provider combination without a problem. In the past, I hit the limit somewhere in the middle of the 3rd org/provider combination, at which point you won't be able to download any more boxes from the same IP for 24/48 hours. Hence we'll need to eventually break up the current batch methodology, and move to a more efficient approach.

The software requirements are probably the most important. You'll need the system set up with VirtualBox, VMware, and libvirt (ignoring the other 3 providers for now). The build robots run CentOS 7, so assuming you have it already installed, you can run sudo res/providers/providers.sh all inside the repo to set those providers up. Note, you'll want to change the "HUMAN" variable inside the script to your username (or the user who will be running the unit tests). I'd also suggest running the script using sudo bash -x res/providers/providers.sh all so you can see what it's doing. You can also use a more granular approach and install the providers individually using that script. The help for that script looks like:

 Configuration
  providers.sh {setup|limits} or

 Installers
  providers.sh {lxc|vmware|docker|packer|vagrant|libvirt|virtualbox} or

 Global
  providers.sh {all}

 Please select a target and run this command again.
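For example, an incremental setup run might look something like the following (the -x just echoes each command as it runs; adjust the HUMAN variable in the script first):

    sudo bash -x res/providers/providers.sh vagrant
    sudo bash -x res/providers/providers.sh libvirt
    sudo bash -x res/providers/providers.sh virtualbox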

Please note that the providers.sh script hasn't been updated to install/set up the relatively new Vagrant VMware plugin that is required on CentOS and other systems. Also note that you'll need to add the VMware license key to the .credentialsrc file in the robox project root to install VMware Workstation using the providers.sh script.
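The entry is just a shell variable export, along these lines (the variable name below is illustrative only; check the sample .credentialsrc in the repo for the exact name it expects):

    # illustrative placeholder, not the real variable name or key
    export VMWARE_WORKSTATION="XXXXX-XXXXX-XXXXX-XXXXX-XXXXX"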

To answer your resource questions more specifically:

HDDs OK or SSDs required?

As for disk type, a single SSD will be 2x, 3x faster than a single HDD. If you're using an R620 or R720, I'd suggest using a RAID-0 with 2 or 3 SSDs, so you max out the disk performance. In my experience, the bottlenecks, in order of importance, tend to be disk IO, network bandwidth, RAM and then finally CPU. If the process is starved for disk IO, you'll see timeouts, and other ephemeral failures. That is why I tend to limit the build robots to building/testing only 2 boxes in parallel (they only have a single SSD to work with).
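If you end up striping in software rather than with the hardware RAID controller, a minimal mdadm sketch would look like the following (device names and mount point are placeholders, adjust to your layout):

    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
    mkfs.xfs /dev/md0
    mount /dev/md0 /var/lib/libvirt/images   # or wherever the VM/box storage will live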

How much disk space? 100s of GBs? Multiple TBs?

The script will only keep the files for a given box image on disk if a test fails, so in theory 50GB of free space should be enough. That said, I'd recommend 200GB of free space, to ensure multiple box failures don't fill up the disk, causing more failures. For reference, all of the box files for the 3.5.4 release took up 654GB. That number will grow with the forthcoming 3.6.0 release, which will be adding 3 new distro variants. I expect the total to grow to around 740GB. So, with all that info, I'd say 1TB is a good target to shoot for, since that will ensure you can keep multiple, and/or all of the box files around, with a little room for future proofing. Also, it's worth noting that depending on how we configure things long term, you probably won't need to store Hyper-V/Parallels images on the Linux test server, which will reduce the above numbers some.

RAM requirements?

I suggest a minimum of 8GB, with an additional 6GB for each additional box you plan to test in parallel. So to test 4 boxes in parallel, I'd suggest a minimum of 26GB. That said, more RAM is also good, since the excess will be used by the OS to cache disk data.

CPU requirements?

This gets tricky. Right now the Vagrantfile embedded with the box files is configured to set up 2 CPUs for each box, which is what all of the roboxes tests will use. The generic tests will use the template files in the check directory, which are also currently configured to allocate 2 CPUs per box. So I'd suggest at least 3 available cores per parallel test case.

To get started, I'd suggest getting VirtualBox 5.2.44 and vagrant installed, and then running ./check.sh generic-virtualbox ... as a starting point. Assuming those all work, then you can try ./check.sh generic-libvirt ... both of those should work just fine. Note I recently discovered the cleanup function doesn't properly remove the libvirt disk templates using virsh. It's a bug in the regex, so you'll need to watch that or it could eat up all your available disk space. The ./check.sh generic-vmware run might be called the graduate course, since the plugin checks need updating, and 2-4 of those images may have problems.
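Put differently, the first session would look something like the following, with a manual volume listing afterwards to catch anything the buggy cleanup regex missed (this assumes the images land in the default libvirt storage pool):

    ./check.sh generic-virtualbox
    ./check.sh generic-libvirt
    ./check.sh cleanup
    virsh --connect qemu:///system vol-list default   # look for leftover box/disk volumes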

P.S. The parallel command embedded in the check.sh script is currently set to run 2 tests in parallel. Adjust the number of parallel jobs accordingly.

P.P.S. I've already added the forthcoming Alpine 3.15, Fedora 35, and Devuan 4 variants to the check script, but those repos/boxes won't be available until the 3.6.0 release is uploaded, which will happen in the next 3 to 7 days.

VincentSaelzler commented 2 years ago

@ladar great! I definitely have the hardware capacity to run that.

From a CPU and RAM perspective, we should be all good.

When it comes to disks: it will either be a RAID 0 of ~3 SSDs or ~8 HDDs.

When it comes to software installation, I rely heavily on Ansible. My main goal in terms of involvement with the project is to reliably set up a server that can run the tests! Just as an FYI, the config will be added to my existing homelab repo.

You'll need the system set up with VirtualBox, VMware, and libvirt (ignoring the other 3 providers for now). The build robots run CentOS 7

I'm confused on this point. What's the bare-metal OS that needs to be installed?

ladar commented 2 years ago

I believe they are Intel Xeon E5 2620 v2

Hi @VincentSaelzler those processors should be just fine. Most of my rackmount servers use the same, although the blade servers I actually use to build the images are the generation prior to those. One technical note: you'll need to make sure you have the appropriate virtualization extensions enabled in the BIOS. There were a couple of settings in the OM manager as well which control cooling, power and performance priorities, which will impact performance. But the virtualization extensions are the most important.
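A quick way to confirm the extensions are actually enabled once the OS is up (standard Linux checks, nothing robox-specific):

    grep -E -c '(vmx|svm)' /proc/cpuinfo   # non-zero means VT-x/AMD-V is exposed to the OS
    lscpu | grep -i virtualization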

It will either be a RAID 0 of ~3 SSDs or ~8 HDDs.

Virtual machines and database servers both generate workloads made up almost entirely of random access reads. Traditional spinning disks require the platter and head to move, which means there is a seek time. SSDs generally have a seek time of ~0ms. So for this type of workload, having the virtual disk on an SSD will make a huge difference. The tradeoff is that building and testing the box images will generate an enormous amount of ephemeral data. This can wear out an SSD, but the good news is that a failure shouldn't slow you down much, provided you can rebuild the config. That's where the name "Robot Boxes" or "Roboxes" comes from. A cluster of disposable, interchangeable servers that can function as robotic build bots.

I rely heavily on Ansible. My main goal in terms of involvement with the project is to reliably set up a server that can run the tests! Just as an FYI, the config will be added to my existing homelab repo.

I've run across Ansible, mostly reverse engineering configs so I could replicate the important pieces elsewhere without the dependency. But whatever works best for you. I think the short term goal is to begin testing the boxes for each consecutive release, so we can figure out whether any of the failures are consistent, and need to be investigated, or whether it's just a release-specific fault that occurred randomly during the build process, and wasn't caught by the config modules (scripts). Those are less pressing in my view.

Long term, the key will be managing the workflow, more than the server configurations. And expanding the unit tests beyond just making sure vagrant ssh works. I don't know what qualifies a Vagrant box image as being ideal, but part of the process will be figuring that out, and testing for it.

I'm confused on this point. What's the bare-metal OS that needs to be installed?

The bare metal shouldn't matter. What matters is whether the provider (VirtualBox, VMware, etc.) runs on that OS, and whether vagrant supports it. The reason I emphasized that the robots I'm using are CentOS 7 is because that's what is used to build the boxes (for 4 of the 6 platforms), which means the box worked, at least during the build phase. It's also the OS I've used the most to test the resulting images. There have been sporadic reports of issues the farther you go from that baseline. For example Gentoo/Arch tend to have bleeding edge versions of QEMU/libvirt which can cause problems. With macOS the vagrant-libvirt plugin isn't fully supported yet, not to mention the variety of other issues. Windows is, well, Windows. You can go back and read some of the open/closed issues to get an idea of what people have reported on GH. But know it's only a small fraction of what I've been told about elsewhere, or run into myself.

I think eventually, the ability to test the box files across a variety of different OS platforms would be great. But we have to start somewhere before we look at that issue.

ladar commented 1 year ago

@VincentSaelzler haven't heard anything on this in a while, so closing the issue.