Callisto13 opened 2 years ago
FINE i will do this now
Here are some notes and thoughts about switching our Microvm OS images. See the end for my wrap-up.
Control plane node Microvms with the CAPI images become ready 19.36 seconds (39.84%) faster than those based on our own. Worker node Microvms with the CAPI images become ready 4.02 seconds (30.90%) slower than our own.
CAPI based images win on speed for control plane node readiness, but lose on worker node readiness.
Control plane node Microvms with the CAPI images start with 2.4GB (82.76%) more disk already claimed at boot. Worker node Microvms with the CAPI images start with 3.2GB (177.78%) more disk already claimed at boot.
CAPI based images lose across the board for disk hogging.
Our OG Microvm OS is a container image. The `base-ubuntu` is built from Ubuntu 20.04 and adds a few useful packages. The OS image is built from this, adding binaries (`kubelet`, `kubeadm`, `kubectl` and `containerd`), setting some sys config, enabling services etc. We do not pre-pull any images required by kubeadm to start the kubelet.

This image is 1.27GB.
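For reference, the build is shaped roughly like this (a sketch only; the real Dockerfile lives in the image repo, and the base tag, package sources and versions here are placeholders):

```bash
# Illustrative only: the rough shape of the original image layering.
# The real Dockerfile lives in the image repo; the base tag and the
# assumption that the kubernetes apt repo is pre-configured are placeholders.
cat <<'EOF' > Dockerfile
FROM ghcr.io/weaveworks-liquidmetal/base-ubuntu:20.04

# kubernetes binaries + runtime (assumes the k8s apt repo is in the base)
RUN apt-get update && \
    apt-get install -y --no-install-recommends kubelet kubeadm kubectl containerd && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

# sys config and services
COPY sysctl.d/ /etc/sysctl.d/
RUN systemctl enable kubelet containerd
EOF

docker build -t capmvm-kubernetes:dev .
```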
This proposed microvm OS image uses the kubernetes image-builder, more specifically the `raw` CAPI option which is provided for bare-metal systems. The image-builder creates machine images, NOT container images. To create a microvm OS image from this, we create the machine image, mount it, then import everything in that filesystem (including the kitchen sink) into a container image.

The resulting image is ready to use: all the binaries installed and ready to go, any sys config set, and the kubeadm images pre-pulled. Reading the image-builder ansible (fun times) and inspecting the image shows that the builder does not install anything else, but because it is based off a machine image, which is based off a full-on ubuntu with a kernel and everything, the image is huge.

4.24GB to be exact.
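The conversion step itself is mechanically simple. Roughly this, assuming the rootfs sits on the first partition of the raw image (device names and file names are illustrative):

```bash
# Sketch: turn an image-builder raw machine image into a container image.
# Partition layout and file names are assumptions for illustration.

# Attach the raw image and expose its partitions
LOOP=$(sudo losetup --find --show --partscan ubuntu-2004.raw)

# Mount the root filesystem (assuming it is the first partition)
sudo mount "${LOOP}p1" /mnt

# Stream the whole filesystem, kitchen sink included, into a container image
sudo tar -C /mnt -c . | docker import - capmvm-kubernetes:raw

sudo umount /mnt
sudo losetup -d "$LOOP"
```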
Some of that can be attributed to the pre-pulled images. Some can be attributed to the kernel modules, which we can remove. With a bit of `apt clean`ing we can strip off a few more bytes and get just below 4GB, but it is still a chonky boi.
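The trimming that gets us just under 4GB is along these lines (a sketch; exactly which paths are safe to drop depends on what the guest really needs, though the kernel bits are safe since flintlock supplies the kernel separately):

```bash
# Sketch: trim what a flintlock microvm never uses. flintlockd supplies the
# kernel separately, so the machine image's kernel artifacts can go.
rm -rf /lib/modules/*                     # kernel modules
rm -rf /boot/*                            # kernel + initrd from image-builder
apt-get clean                             # apt package cache
rm -rf /var/lib/apt/lists/*               # downloaded package indexes
rm -rf /usr/share/doc/* /usr/share/man/*  # docs, if feeling ruthless
```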
This size has an impact in 3 ways:

- more disk claimed inside every Microvm guest
- more storage taken up by the image on the host
- longer image download times
On the other hand... having the `kubeadm` images pre-pulled will likely save on boot time. But this can be achieved without such a huge footprint.
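For example (a sketch, assuming containerd is running during the image build step, which needs real scaffolding in practice), kubeadm can pre-pull its own images into our existing slim image:

```bash
# Sketch: pre-pull the images kubeadm needs into our existing slim image,
# instead of inheriting a whole machine image for the privilege.
# Assumes containerd is running during the build; this one-liner is
# illustrative scaffolding only.
containerd &

# List, then pull, the control plane images for the target version
kubeadm config images list --kubernetes-version v1.23.10
kubeadm config images pull --kubernetes-version v1.23.10
```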
To illustrate these notes, I got some data.
I created a single `c3.medium.x86` device in Equinix running flintlockd `v0.4.0`.
The device was deployed to Amsterdam and had the following spec:
I created several clusters (not at the same time) with just 1 control plane and 1 worker node each. They used the same kernel image: `ghcr.io/weaveworks-liquidmetal/flintlock-kernel:5.10.77`.
I tested the difference in boot times and initial disk usage for Microvm k8s nodes created with two OS images:

- `ghcr.io/weaveworks-liquidmetal/capmvm-kubernetes:1.23.5`
- `ghcr.io/weaveworks-liquidmetal/capmvm-kubernetes:1.23.10`

The `1.23.5` image was created with our original `base-ubuntu` image. The `1.23.10` image was created with the CAPI image-builder raw ubuntu 20.04 image.
I used the "Ready" message which we supply in user-data as a somewhat lazy but easily found way to determine the overall "readiness" time of a Microvm+Node.
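The measurement was roughly this shape (a sketch of the method, not the exact script; the console log path and message text are illustrative):

```bash
# Sketch: time from microvm creation to the "Ready" message we put in
# user-data. The console log path here is illustrative.
START=$(date +%s)

# ...create the cluster here...

until grep -q "Ready" /var/log/flintlock/vm-console.log 2>/dev/null; do
  sleep 1
done
echo "ready after $(( $(date +%s) - START ))s"
```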
Control plane node Microvms created with our original base images were on average in a "ready" state within 48.60 seconds. Worker node Microvms created with our original base images were on average in a "ready" state within 13.01 seconds.
Control plane node Microvms created with the CAPI-based images were on average in a "ready" state within 29.24 seconds. Worker node Microvms created with the CAPI-based images were on average in a "ready" state within 17.03 seconds.
Control plane node Microvms with the CAPI images become ready 19.36 seconds (39.84%) faster than those based on our own.
Worker node Microvms with our original base images become ready 4.02 seconds (30.90%) faster than the CAPI ones.
The Microvms created with CAPI-based images boot slower in general, but start k8s faster.
Note that the default Microvm disk size (as controlled by containerd) is 10GB (9.8 in real terms).

Note also that `du` does not report accurately across mounts. Or rather it does, but we don't want to count them here. The `df` number is the one we are taking, as that reflects actual usage on disk.
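For the curious, `du -x` keeps the walk on one filesystem and lands much closer to `df`, but `df` is still the honest number for allocated blocks:

```bash
# -x stops du crossing mount points (e.g. the tmpfs under /run)
du -xhd1 /

# actual allocated blocks on the root device; the number used below
df -h /
```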
Control plane node Microvms created with the CAPI-based images report 5.3GB of used disk space at boot. This leaves just 4.1GB available for applications.
```
root@cluster-new-control-plane-k2jgz:~# du -hd1 /
0 /proc
98M /opt
4.4G /var
0 /sys
141M /boot
5.0K /mnt
0 /dev
48K /tmp
12K /media
4.7M /etc
4.0K /srv
24K /home
2.6G /usr
801M /run
36K /root
6.0G /
root@cluster-new-control-plane-k2jgz:~# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/root 9.8G 5.3G 4.1G 57% /
root@cluster-new-control-plane-k2jgz:~# dd if=/dev/zero of=foo count=5 bs=1073741824
dd: error writing 'foo': No space left on device
5+0 records in
4+0 records out
4937887744 bytes (4.9 GB, 4.6 GiB) copied, 38.3963 s, 129 MB/s
```
Worker node Microvms created with the CAPI-based images report 5GB of used disk space at boot. This leaves just 4.4GB available for applications.
```
root@cluster-new-md-0-qwmst:~# du -hd1 /
0 /proc
98M /opt
2.1G /var
0 /sys
141M /boot
4.0K /mnt
0 /dev
48K /tmp
12K /media
4.5M /etc
4.0K /srv
24K /home
2.6G /usr
116M /run
36K /root
6.0G /
root@cluster-new-md-0-qwmst:~# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/root 9.8G 5.0G 4.4G 54% /
```
Control plane node Microvms created with our original base images report 2.9GB of used disk space at boot. This leaves 6.5GB available for applications.
```
root@cluster-old-control-plane-pbssj:~# du -hd1 /
0 /proc
907M /usr
4.0K /mnt
4.0K /srv
4.0K /home
4.2M /etc
1.3G /run
48K /tmp
38M /boot
1.8G /var
28K /root
5.0K /media
0 /sys
0 /dev
87M /opt
4.1G /
root@cluster-old-control-plane-pbssj:~# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/root 9.8G 2.9G 6.5G 31% /
```
Worker node Microvms created with our original base images report 1.8GB of used disk space at boot. This leaves 7.6GB available for applications.
```
root@cluster-old-md-0-rk5dp:~# du -hd1 /
0 /proc
907M /usr
4.0K /mnt
4.0K /srv
4.0K /home
3.0M /etc
522M /run
48K /tmp
38M /boot
727M /var
28K /root
4.0K /media
0 /sys
0 /dev
87M /opt
2.3G /
root@cluster-old-md-0-rk5dp:~# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/root 9.8G 1.8G 7.6G 19% /
```
Control plane node Microvms with the CAPI images start with 2.4GB (82.76%) more disk already claimed at boot.
Worker node Microvms with the CAPI images start with 3.2GB (177.78%) more disk already claimed at boot.
We went into this wanting to "stay aligned with the ecosystem", which is fair, but it is more important to consider the requirements for our usecase.

So what does a good OS image for Microvms in a high-performance, resource-constrained environment look like? It is:

- Small
- Fast
- Customisable
We need to decide on numbers for the first 2, and agree to keep on top of any excess. Like 2GB size and <30s control plane start, <15s worker node start, or something.
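Whatever numbers we land on, a cheap CI gate would keep us honest about creep. A sketch (the image name and the 2GB budget are placeholders for whatever we decide):

```bash
#!/usr/bin/env bash
# Sketch: fail CI when the OS image creeps past an agreed size budget.
# The image name and 2GB budget are placeholders.
set -euo pipefail

IMAGE="ghcr.io/weaveworks-liquidmetal/capmvm-kubernetes:dev"
BUDGET=$((2 * 1024 * 1024 * 1024))  # 2GB in bytes

SIZE=$(docker image inspect --format '{{.Size}}' "$IMAGE")
if (( SIZE > BUDGET )); then
  echo "image is ${SIZE} bytes, over the ${BUDGET} byte budget" >&2
  exit 1
fi
```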
For the last one, keeping in line with the ecosystem is all well and good, but if someone comes to us saying "why is the OS taking nearly half my disk here?" do we really want to say "afraid it has to be like this because that is how all the other CAPI images look"?
If we stick with the CAPI image plan here, we kind of end up with a "you can only have 2 of the 3 options at any one time" situation.
@Callisto13 - what fantastic work, bravo. This kind of analysis with real, quantifiable numbers is invaluable 🙇
It's strange that the control plane nodes ready up quicker with the CAPI image but the worker nodes don't. I would've expected them both to behave the same (either faster or slower). I'd love to know why this was the case.
When we did the original image builder work (and flintlock/capmvm) we deliberately didn't do early optimization, so even if we stick with our own image building process, we could think about optimizing it.
I agree with the 3 requirements of size, speed and customisability. There is another aspect to using the CAPI images beyond "stay aligned with the ecosystem" as a principle: the CAPI images are designed for use with CABPK (the kubeadm bootstrap provider), so if changes are made to the provider that require something from the base image, then we need to make sure we reflect that in our own images.
The size requirement is also interesting. Having less space available to the guest isn't such a problem (unless you are on really resource-constrained machines), as we can just use bigger snapshot sizes, which will give more space to the apps in the microvms... but agreed that the initial image size on the host takes up more space (and takes more time to download).
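For reference, that knob is the devmapper snapshotter's base image size in containerd's config on the host. A sketch (the pool name is illustrative; the 9.8G root seen in the `df` output above presumably comes from the 10GB default):

```bash
# Sketch: raise the devmapper snapshotter's base image size on the host so
# guests get a bigger root volume. Pool name is illustrative.
cat <<'EOF' >> /etc/containerd/config.toml
[plugins."io.containerd.snapshotter.v1.devmapper"]
  pool_name       = "flintlock-thinpool"
  base_image_size = "20GB"
EOF
systemctl restart containerd
```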
My thoughts on the options:
I would like to add another option:
For me, I favour options 2, 5 and 7 (not in that order).
Ha yes forgot to add option 7.
I think that is my preference, but I am going to step away from this for now and complete the "mounting the kernel mod and bins separately" part of this image work. Then I'll circle back and think about this piece some more.
Is https://github.com/weaveworks-liquidmetal/flintlock/pull/610 ready to be moved out of WIP @richardcase ?
@Callisto13 - i need to do another pass through it. Will try and get that done today 🤞
🙏 I can get pretty far without it on the CAPMVM side with some guesswork and placeholders, but would be good to have by end of next week. I can pick up if you don't have time.
Basically replace the method we have now and use the more standard one.
@richardcase to add any spike stuff, here or on a branch, even if untidy and unfinished.