Callisto13 opened 2 years ago
FINE i will do this now
Here are some notes and thoughts about switching our Microvm OS images. See the end for my wrap-up.
Control plane node Microvms with the CAPI images become ready 19.36 seconds (39.84%) faster than those based on our own. Worker node Microvms with the CAPI images become ready 4.02 seconds (30.90%) slower than our own.
CAPI based images win on speed for control plane node readiness, but lose on worker node readiness.
Control plane node Microvms with the CAPI images start with 2.4GB (82.76%) more disk already claimed at boot. Worker node Microvms with the CAPI images start with 3.2GB (177.78%) more disk already claimed at boot.
CAPI based images lose across the board for disk hogging.
Our OG Microvm OS is a container image. The `base-ubuntu` is built from Ubuntu 20.04 and adds a few useful packages. The OS image is built from this, adding binaries (`kubelet`, `kubeadm`, `kubectl` and `containerd`), setting some sys config, enabling services etc. We do not pre-pull any images required by kubeadm to start the kubelet.

This image is 1.27GB.
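For reference, the build is shaped roughly like this (a sketch only; the real Dockerfile lives in the image repo, and the base tag, package sources and versions here are placeholders):

```bash
# Illustrative only: the rough shape of the original image layering.
# The real Dockerfile lives in the image repo; the base tag and the
# assumption that the kubernetes apt repo is pre-configured are placeholders.
cat <<'EOF' > Dockerfile
FROM ghcr.io/weaveworks-liquidmetal/base-ubuntu:20.04

# kubernetes binaries + runtime (assumes the k8s apt repo is in the base)
RUN apt-get update && \
    apt-get install -y --no-install-recommends kubelet kubeadm kubectl containerd && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

# sys config and services
COPY sysctl.d/ /etc/sysctl.d/
RUN systemctl enable kubelet containerd
EOF

docker build -t capmvm-kubernetes:dev .
```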
This proposed microvm OS image uses the kubernetes image-builder, more specifically the `raw` CAPI option which is provided for bare-metal systems. The image-builder creates machine images, NOT container images. To create a microvm OS image from this, we create the machine image, mount it, then import everything in that filesystem (including the kitchen sink) into a container image.

The resulting image is ready to use: all the binaries installed and ready to go, any sys config set, and the kubeadm images pre-pulled. Reading the image-builder ansible (fun times) and inspecting the image shows that the builder does not install anything else, but because it is based off a machine image, which is based off a full-on ubuntu with a kernel and everything, the image is huge.

4.24GB to be exact.
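The conversion step itself is mechanically simple. Roughly this, assuming the rootfs sits on the first partition of the raw image (device names and file names are illustrative):

```bash
# Sketch: turn an image-builder raw machine image into a container image.
# Partition layout and file names are assumptions for illustration.

# Attach the raw image and expose its partitions
LOOP=$(sudo losetup --find --show --partscan ubuntu-2004.raw)

# Mount the root filesystem (assuming it is the first partition)
sudo mount "${LOOP}p1" /mnt

# Stream the whole filesystem, kitchen sink included, into a container image
sudo tar -C /mnt -c . | docker import - capmvm-kubernetes:raw

sudo umount /mnt
sudo losetup -d "$LOOP"
```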
Some of that can be attributed to the pre-pulled images. Some can be attributed to the kernel modules, which we can remove. With a bit of `apt clean`ing we can strip off a few more bytes and get just below 4GB, but it is still a chonky boi.
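The trimming that gets us just under 4GB is along these lines (a sketch; exactly which paths are safe to drop depends on what the guest really needs, though the kernel bits are safe since flintlock supplies the kernel separately):

```bash
# Sketch: trim what a flintlock microvm never uses. flintlockd supplies the
# kernel separately, so the machine image's kernel artifacts can go.
rm -rf /lib/modules/*                     # kernel modules
rm -rf /boot/*                            # kernel + initrd from image-builder
apt-get clean                             # apt package cache
rm -rf /var/lib/apt/lists/*               # downloaded package indexes
rm -rf /usr/share/doc/* /usr/share/man/*  # docs, if feeling ruthless
```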
This size has an impact in 3 ways:

- more disk claimed inside every Microvm guest
- more storage taken up by the image on the host
- longer image download times
On the other hand... having the `kubeadm` images pre-pulled will likely save on boot time. But this can be achieved without such a huge footprint.
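For example (a sketch, assuming containerd is running during the image build step, which needs real scaffolding in practice), kubeadm can pre-pull its own images into our existing slim image:

```bash
# Sketch: pre-pull the images kubeadm needs into our existing slim image,
# instead of inheriting a whole machine image for the privilege.
# Assumes containerd is running during the build; this one-liner is
# illustrative scaffolding only.
containerd &

# List, then pull, the control plane images for the target version
kubeadm config images list --kubernetes-version v1.23.10
kubeadm config images pull --kubernetes-version v1.23.10
```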
To illustrate these notes, I got some data.
I created a single `c3.medium.x86` device in Equinix running flintlockd `v0.4.0`.
The device was deployed to Amsterdam and had the following spec:
I created several clusters (not at the same time) with just 1 control plane and 1 worker node each. They used the same kernel image: `ghcr.io/weaveworks-liquidmetal/flintlock-kernel:5.10.77`.
I tested the difference in boot times and initial disk usage for Microvm k8s nodes created with two OS images:

- `ghcr.io/weaveworks-liquidmetal/capmvm-kubernetes:1.23.5`
- `ghcr.io/weaveworks-liquidmetal/capmvm-kubernetes:1.23.10`

The `1.23.5` image was created with our original `base-ubuntu` image. The `1.23.10` image was created with the CAPI image-builder raw ubuntu 20.04 image.
I used the "Ready" message which we supply in user-data as a somewhat lazy but easily found way to determine the overall "readiness" time of a Microvm+Node.
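The measurement was roughly this shape (a sketch of the method, not the exact script; the console log path and message text are illustrative):

```bash
# Sketch: time from microvm creation to the "Ready" message we put in
# user-data. The console log path here is illustrative.
START=$(date +%s)

# ...create the cluster here...

until grep -q "Ready" /var/log/flintlock/vm-console.log 2>/dev/null; do
  sleep 1
done
echo "ready after $(( $(date +%s) - START ))s"
```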
Control plane node Microvms created with our original base images were on average in a "ready" state within 48.60 seconds. Worker node Microvms created with our original base images were on average in a "ready" state within 13.01 seconds.
Control plane node Microvms created with the CAPI-based images were on average in a "ready" state within 29.24 seconds. Worker node Microvms created with the CAPI-based images were on average in a "ready" state within 17.03 seconds.
Control plane node Microvms with the CAPI images become ready 19.36 seconds (39.84%) faster than those based on our own.
Worker node Microvms with our original base images become ready 4.02 seconds (30.90%) faster than the CAPI ones.
The Microvms created with CAPI-based images boot slower in general, but start k8s faster.
Note that the default Microvm disk size (as controlled by containerd) is 10GB (9.8 in real terms).

Note also that `du` does not report accurately across mounts. Or rather it does, but we don't want to count them here. The `df` number is the one we are taking, as that reflects actual usage on disk.
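For the curious, `du -x` keeps the walk on one filesystem and lands much closer to `df`, but `df` is still the honest number for allocated blocks:

```bash
# -x stops du crossing mount points (e.g. the tmpfs under /run)
du -xhd1 /

# actual allocated blocks on the root device; the number used below
df -h /
```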
Control plane node Microvms created with the CAPI-based images report 5.3GB of used disk space at boot. This leaves just 4.1GB available for applications.
```
root@cluster-new-control-plane-k2jgz:~# du -hd1 /
0 /proc
98M /opt
4.4G /var
0 /sys
141M /boot
5.0K /mnt
0 /dev
48K /tmp
12K /media
4.7M /etc
4.0K /srv
24K /home
2.6G /usr
801M /run
36K /root
6.0G /
root@cluster-new-control-plane-k2jgz:~# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/root 9.8G 5.3G 4.1G 57% /
root@cluster-new-control-plane-k2jgz:~# dd if=/dev/zero of=foo count=5 bs=1073741824
dd: error writing 'foo': No space left on device
5+0 records in
4+0 records out
4937887744 bytes (4.9 GB, 4.6 GiB) copied, 38.3963 s, 129 MB/s
```
Worker node Microvms created with the CAPI-based images report 5GB of used disk space at boot. This leaves just 4.4GB available for applications.
```
root@cluster-new-md-0-qwmst:~# du -hd1 /
0 /proc
98M /opt
2.1G /var
0 /sys
141M /boot
4.0K /mnt
0 /dev
48K /tmp
12K /media
4.5M /etc
4.0K /srv
24K /home
2.6G /usr
116M /run
36K /root
6.0G /
root@cluster-new-md-0-qwmst:~# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/root 9.8G 5.0G 4.4G 54% /
```
Control plane node Microvms created with our original base images report 2.9GB of used disk space at boot. This leaves 6.5GB available for applications.
```
root@cluster-old-control-plane-pbssj:~# du -hd1 /
0 /proc
907M /usr
4.0K /mnt
4.0K /srv
4.0K /home
4.2M /etc
1.3G /run
48K /tmp
38M /boot
1.8G /var
28K /root
5.0K /media
0 /sys
0 /dev
87M /opt
4.1G /
root@cluster-old-control-plane-pbssj:~# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/root 9.8G 2.9G 6.5G 31% /
```
Worker node Microvms created with our original base images report 1.8GB of used disk space at boot. This leaves 7.6GB available for applications.
```
root@cluster-old-md-0-rk5dp:~# du -hd1 /
0 /proc
907M /usr
4.0K /mnt
4.0K /srv
4.0K /home
3.0M /etc
522M /run
48K /tmp
38M /boot
727M /var
28K /root
4.0K /media
0 /sys
0 /dev
87M /opt
2.3G /
root@cluster-old-md-0-rk5dp:~# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/root 9.8G 1.8G 7.6G 19% /
```
Control plane node Microvms with the CAPI images start with 2.4GB (82.76%) more disk already claimed at boot.
Worker node Microvms with the CAPI images start with 3.2GB (177.78%) more disk already claimed at boot.
We went into this wanting to "stay aligned with the ecosystem", which is fair, but it is more important to consider the requirements for our usecase.

So what does a good OS image for Microvms in a high-performance, resource-constrained environment look like? It is:

- Small
- Fast
- Customisable
We need to decide on numbers for the first 2, and agree to keep on top of any excess. Like 2GB size and <30s control plane start, <15s worker node start, or something.
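Whatever numbers we land on, a cheap CI gate would keep us honest about creep. A sketch (the image name and the 2GB budget are placeholders for whatever we decide):

```bash
#!/usr/bin/env bash
# Sketch: fail CI when the OS image creeps past an agreed size budget.
# The image name and 2GB budget are placeholders.
set -euo pipefail

IMAGE="ghcr.io/weaveworks-liquidmetal/capmvm-kubernetes:dev"
BUDGET=$((2 * 1024 * 1024 * 1024))  # 2GB in bytes

SIZE=$(docker image inspect --format '{{.Size}}' "$IMAGE")
if (( SIZE > BUDGET )); then
  echo "image is ${SIZE} bytes, over the ${BUDGET} byte budget" >&2
  exit 1
fi
```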
For the last one, keeping in line with the ecosystem is all well and good, but if someone comes to us saying "why is the OS taking nearly half my disk here?" do we really want to say "afraid it has to be like this because that is how all the other CAPI images look"?
If we stick with the CAPI image plan here, we kind of end up with a "you can only have 2 of the 3 options at any one time" situation.
@Callisto13 - what fantastic work, bravo. This kind of analysis with real, quantifiable numbers is invaluable 🙇
It's strange that the control plane nodes ready up quicker with the CAPI image but the worker nodes don't. I would've expected them both to behave the same (either faster or slower). I'd love to know why this was the case.
When we did the original image builder work (and flintlock/capmvm) we deliberately didn't do early optimization, so even if we stick with our own image building process, we could think about optimizing it.
I agree with the 3 requirements of size, speed and customisability. There is another aspect to using the CAPI images beyond "stay aligned with the ecosystem" as a principle: the CAPI images are designed for use with CABPK (the kubeadm bootstrap provider), so if changes are made to the provider that require something from the base image, then we need to make sure we reflect that in our own images.
The size requirement is also interesting. Having less space available to the guest isn't such a problem (unless you are on really resource-constrained machines), as we can just use bigger snapshot sizes, which will give more space to the apps in the microvms... but agreed that the initial image size on the host takes up more space (and takes more time to download).
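For reference, that knob is the devmapper snapshotter's base image size in containerd's config on the host. A sketch (the pool name is illustrative; the 9.8G root seen in the `df` output above presumably comes from the 10GB default):

```bash
# Sketch: raise the devmapper snapshotter's base image size on the host so
# guests get a bigger root volume. Pool name is illustrative.
cat <<'EOF' >> /etc/containerd/config.toml
[plugins."io.containerd.snapshotter.v1.devmapper"]
  pool_name       = "flintlock-thinpool"
  base_image_size = "20GB"
EOF
systemctl restart containerd
```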
My thoughts on the options:
I would like to add another option:
For me, I favour options 2, 5 and 7 (not in that order).
Ha yes forgot to add option 7.
I think that is my preference, but I am going to step away from this for now and complete the "mounting the kernel mod and bins separately" part of this image work. Then I'll circle back and think about this piece some more.
Is https://github.com/weaveworks-liquidmetal/flintlock/pull/610 ready to be moved out of WIP @richardcase ?
@Callisto13 - i need to do another pass through it. Will try and get that done today 🤞
🙏 I can get pretty far without it on the CAPMVM side with some guesswork and placeholders, but would be good to have by end of next week. I can pick up if you don't have time.
Basically replace the method we have now and use the more standard one.
@richardcase to add any spike stuff, here or on a branch, even if untidy and unfinished.