lavabit / robox

The tools needed to robotically create/configure/provision a large number of operating systems, for a variety of hypervisors, using packer.

Ideas for speeding up builds #20

Closed mcandre closed 5 years ago

mcandre commented 5 years ago

robox is a wonderfully comprehensive collection of base boxes covering a wide variety of different operating systems. I gather that time to pack all these boxes is a constraint on further development, so I wonder if we can improve packing time somehow.

For example, is there a significant restriction on build resources, RAM, CPU cores, and so on? Perhaps we could provide a donation link specifically for funding cloud resources used to build our images.

That could help with the hardware constraints on build time, by scaling vertically. Could we also scale horizontally? Perhaps building boxes from a pool of hosts.

Finally, what steps can we take to improve the build time of each particular box? I've done some (premature) optimizations on my own box templates, such as minifying boot_command contents and compressing provisioning media, as keyboard input delivery and FTP can run slowly on certain virtual guests.

Honestly, the bottlenecks of OS installation tend to be the long, uncontrolled process of running the install wizards. But what other things can we shave off, even just a few minutes from each build? Reducing boot splash timeouts, selecting faster virtual hardware (when guest compatible), and ensuring that installation media (ISOs, IMGs) are sourced from fast online caches. Any other ideas for accelerating builds?

ladar commented 5 years ago

> For example, is there a significant restriction on build resources, RAM, CPU cores, and so on? Perhaps we could provide a donation link specifically for funding cloud resources used to build our images.

I've been meaning to set up a donation link, but haven't had a chance yet. Will soon.

> That could help with the hardware constraints on build time, by scaling vertically. Could we also scale horizontally? Perhaps building boxes from a pool of hosts.

I am already scaling horizontally. I usually task anywhere from 4 to 8 machines with building the boxes (depending on what I can spare). My biggest bottleneck is the Win and Mac build bots. I only have 1 of each, and they are both 7+ year-old notebooks. The HyperV and Parallels boxes already take 48 to 96 hours to finish. The HyperV/Parallels Gentoo boxes take 2+ hours. Unless I get better/more hardware, I will probably have to selectively drop support for HyperV and Parallels in the future. On the Linux side, I'm slightly better equipped, as I have several bots.

Ideally I'd like to get 1U servers for each platform (Win, Mac, Linux) that are optimized for virtualization, with lots of RAM, cores, and SSD arrays. The servers I currently use all have old-fashioned magnetic disks, so they end up bottlenecking on disk IO, while the notebooks I task all get bottlenecked by their limited number of cores/RAM. Basically I'm limited to building 1 or 2 images at a time. Any more and things start breaking (boot command timing, getting updates, etc).

Unfortunately I can't justify spending 5k-10k for dedicated hardware. I'm trying to find a sponsor, but haven't yet.

I've also tried renting machines on packet.net. Their hourly rates were reasonable enough. With their servers, I was able to build 10 - 20 images in parallel, so each provider only took 1 or 2 hours to finish, which is only $1 or $2. But the cost of bandwidth didn't work. A full build generates roughly 1 terabyte of network traffic, which means each build would probably cost $100 to $200 at this point.
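To make the arithmetic above concrete, here is a small sketch that back-derives the per-GB transfer rate implied by those figures. The rates it prints are inferred from the "$100 to $200 per roughly 1 terabyte" estimate in this comment; they are an assumption for illustration, not packet.net's published pricing.

```python
# Back-of-the-envelope bandwidth cost for one full build.
# ASSUMPTION: the per-GB rates are inferred from the "$100 to $200 for
# roughly 1 terabyte" estimate in the thread, not from any price list.

TRAFFIC_GB = 1000  # roughly 1 terabyte per full build (decimal units)


def bandwidth_cost(traffic_gb: float, usd_per_gb: float) -> float:
    """Return the transfer cost in USD for the given traffic volume."""
    return traffic_gb * usd_per_gb


def implied_rate(total_usd: float, traffic_gb: float) -> float:
    """Return the per-GB rate implied by a total transfer bill."""
    return total_usd / traffic_gb


if __name__ == "__main__":
    for total in (100, 200):
        rate = implied_rate(total, TRAFFIC_GB)
        print(f"${total} per build implies ${rate:.2f}/GB")
```

Set against the $1 or $2 of compute per provider mentioned above, the transfer bill is what dominates the cost of a rented build.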

> Finally, what steps can we take to improve the build time of each particular box? I've done some (premature) optimizations on my own box templates, such as minifying boot_command contents and compressing provisioning media, as keyboard input delivery and FTP can run slowly on certain virtual guests.

I've spent a ton of time optimizing my current process. I created the robox.sh validate and robox.sh links commands so I can validate templates and verify links are still good before a build begins. And I created the robox.sh cache command so I can download ISOs ahead of time. I recently removed the upload step from my packer configs as well. It seems that when I have 2+ robots building and uploading, I start running into problems with Vagrant Cloud, which caused the process to stall as packer retried the upload.
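For readers who want to apply the same pre-caching idea to their own templates, the check can be sketched roughly as follows. This is not how robox.sh cache is implemented; the function names, the cache layout (one file per ISO, named after the last path segment of the URL), and the use of SHA-256 are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large ISOs are never loaded into memory whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_path(cache_dir: Path, url: str) -> Path:
    """Hypothetical cache layout: the file name is the URL's last path segment."""
    return cache_dir / Path(urlparse(url).path).name


def needs_download(cache_dir: Path, url: str, expected_sha256: str) -> bool:
    """Return True when the ISO is missing from the cache or fails its checksum."""
    target = cached_path(cache_dir, url)
    if not target.is_file():
        return True
    return sha256_of(target) != expected_sha256.lower()
```

Running a check like this before a build starts means a missing or corrupted ISO is caught up front rather than partway through a long build.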

> Honestly, the bottlenecks of OS installation tend to be the long, uncontrolled process of running the install wizards. But what other things can we shave off, even just a few minutes from each build? Reducing boot splash timeouts, selecting faster virtual hardware (when guest compatible), and ensuring that installation media (ISOs, IMGs) are sourced from fast online caches. Any other ideas for accelerating builds?

With 3 gigabit links, network access isn't much of a bottleneck. It's the number of CPUs and disk IO with the hardware I have that limits my robots to building 1 or 2 images at a time.
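A cap like the "1 or 2 images at a time" limit can be enforced with a small worker pool. The sketch below is a generic illustration, not the robox build scripts: build_all, run_packer, and the max_parallel default are hypothetical names and values, and the only external command assumed is the standard packer build invocation.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def run_packer(template: str) -> int:
    """Run `packer build` on one template and return its exit code."""
    return subprocess.run(["packer", "build", template]).returncode


def build_all(
    templates: Iterable[str],
    max_parallel: int = 2,  # assumed cap, matching the 1-2 builds per host above
    runner: Callable[[str], int] = run_packer,
) -> Dict[str, int]:
    """Build every template, never running more than max_parallel at once."""
    names = list(templates)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map preserves input order, so codes line up with names.
        codes = list(pool.map(runner, names))
    return dict(zip(names, codes))
```

Threads are sufficient here because each worker just waits on an external process; the pool size, not the Python interpreter, is what bounds the load on CPU and disk.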

There are certainly places where you can shave time, but shaving a minute or two off a 3 day process seems pointless, especially if it causes a build to break. If that happens, packer ends up stalling for 1-2 hours before it times out and continues on.

Hence, I'm somewhat limited until I get better hardware. If I do get better hardware (either fast notebooks or 1U servers), I plan to add more distros. But I also want to start generating Docker and OVA variants, which people could download off roboxes.org. Right now I'm only building a few Docker images and one OVA. I'd also like to build the images weekly, instead of every 2-4 weeks.

mcandre commented 5 years ago

Wow, we're stretching our resources to the limit!

Curious if the Gentoo build could at least be sped up. I'm not sure if you're using a hardcore stage 3 install method? Perhaps the "Minimal Install CD" could provide a faster nothing-to-SSH packer build.

https://gentoo.org/downloads/

ladar commented 5 years ago

@mcandre ...

> Curious if the Gentoo build could at least be sped up. I'm not sure if you're using a hardcore stage 3 install method? Perhaps the "Minimal Install CD" could provide a faster nothing-to-SSH packer build.

I believe the Gentoo image is in fact using a stage 3 install disc, but I don't consider myself enough of a Gentoo expert to overhaul that build config. And while the Gentoo box itself takes 2x to 6x longer than average, the two Gentoo images for each provider are, combined, still less than 10% of the overall time.

I was able to get my hands on a blade chassis with four blades, each with dual 6-core Xeons, but they need memory and SSDs before I can start using them. (And I might need a Windows Server license as well.) I'm hoping to get those parts donated, but it hasn't happened yet.

I did add the donation links you suggested to the GitHub readme/website, but so far no takers.