nprokopic opened 3 months ago
Edited the issue to add a summary and stakeholders and implementors sections.
I've had a little time to think about this but haven't yet dived into the code to verify this approach will work so keep that in mind for now...
My thought is as follows:
Within `capi-image-builder` we add a config file that allows us to define "tooling" such as Teleport. This should contain metadata such as the name of the tool, the desired version and the supported providers. The built images would then be named using a hash of this config and a `-gs` ending. This has the following benefits:
This does have one slight downside though - the hash isn't guessable so would need to be looked up before it can be used in the next Release. As things stand today this is a bit of a hassle but we've got an open issue to improve the visibility and discoverability of new available images that should make this problem trivial once done.
Edit:
This could also be done in small stages to make it a more manageable change.
Can we generate this hash together with the config? Like having it somewhere in the config itself or as a separate file? Because then people would always know which hash to use for which specific tooling version just by going through the Git history?
`capi-image-builder` could then also just pick up this hash. Oooooor: why not use the Git commit hash? The first few digits are already unique and you could find the exact point in time. Something like this...
The hash isn't for a specific tooling version though. It's for the total combination of all tooling.
Yeah, sure, I was also thinking of the combination.
But who generates that hash and how? (Also why? When we can just automate it at buildtime)
I like your approach of having Renovate bumping versions. With a GitHub Action / Tekton run triggering whenever something changes this config, you could trigger a `capi-image-builder` run. And this then builds images with the exact commit hash or just a few digits of it. No need to generate a hash on your own, it's already there.
Sure, but from a technical perspective that actually might be more work as I don't think we currently pass through the commit sha to the builder (we might do, would need to check) but generating a hash from a string is easy. I don't really care how we get the value, just that it should be automated and the same process regardless of automated or manual trigger.
Of course, I just had the transparency in mind. This way you would not need a dashboard or something like this. Just the commit hash from GitHub, that is already publicly visible for everyone. It's basically the same approach we already use for apps in test catalogs and makes the exact image being used predictable.
> publicly visible for everyone.

The repo is actually private.

"Public" for Giant Swarm staff...
Could you imagine a need to (re)build images with the same Flatcar version, same Kubernetes version, same tooling version but maybe a fix/change in `capi-image-builder` itself? Because when its behavior changes but Flatcar, Kubernetes and tooling are still the same, that change would not be reflected in the image name with your approach, right?
Good point. That is totally a possibility but would more be bumping the version of image-builder (upstream) I imagine.
Ok. Yeah, a short commit sha from `capi-image-builder` would be the way to go though. But everything else the same, yeah?
Sure, sure! I really love your proposal! That commit hash was the only thing that came into my mind to make it a bit more transparent and easy to use. 🙂
As just discussed on Slack:
I guess we currently have something in place that checks if an image already exists and then does not rebuild and upload it, right? To save time, money, resources...
With adding the commit hash to the name, that would probably break. For that we could maybe also try to just use the tags/releases we already have in the `capi-image-builder` repo.
With using tags/releases, engineers would have full control of images just via the workflow they are already used to (change something, forge a release, whatnot).
So, I hadn't thought about that and that does introduce a bit of a problem as the Tekton pipelines/tasks aren't managed via tagged releases (instead gitops points to `main`).
I'm now thinking of going back to a hash generated from the tooling config versions (with the version of `capi-image-builder` itself added, as this would handle the upstream image-builder version).
There is still a chance where changes to the pipelines/tasks cause a change to the produced image but generally speaking from past experience these are usually fixes performed by Tinkerers and in such cases we can manually overwrite images as required on a case-by-case basis. As we move towards more automation and abstraction over this process those instances should become less and less frequent anyway (not that we have many already).
I like almost everything in @AverageMarcus's proposal above, except one part: using the hash in the image name, because that will bubble up all the way to how the releases work and it would make OS tooling versioning special in more than a few places and cases.
> This does have one slight downside though - the hash isn't guessable so would need to be looked up before it can be used in the next Release.
This part :point_up: specifically. It would mean that we would have to include a tooling hash in the Release resource. It would be the only place where we use the hash, as we use a regular version number for everything. This would make tooling version (or here hash) changes special, compared to everything else.
From some releases experience point of view:
We would also like to continue improving the automation and testing around releases. A few examples, for which we don't have issues yet, but which were mentioned/brainstormed in chats and meetings:
From the technical point of view, the work I can think of that we would have to do:
@AverageMarcus could you do almost everything like proposed, and just have an actual version number instead of a hash? Maybe even make it a bit more manual in the beginning, e.g. you change the Teleport version in the config file and you also change the OS tooling version wherever it would be written? And then we improve and automate it incrementally to get it as automated as possible.
Alternatively, could this config file live outside of `capi-image-builder` and be in its own `os-tooling` repo? So we version and release that repo, then both `capi-image-builder` and the Release resource use the repo version. Releases would use it like any other version of any app/component, and `capi-image-builder` (if possible) would get the config file from the repo for the specific version. Then the workflow would be:

- change the Teleport version in the `os-tooling` repo and make a new `os-tooling` release
- bump the `os-tooling` version in `capi-image-builder`
- use the new `os-tooling` version in the Release

We can also start with everything inside of `capi-image-builder` and then incrementally change it towards having a separate repo, or whatever is the best and most convenient.
I see how using a hash can make `capi-image-builder` itself nice and simpler, but a hash in the OS image name would add friction and more manual work on other levels of the product (like I tried to describe above in different examples), as hash changes do not tell you much and you cannot test and automate stuff around hashes like you can around regular semantic versions.
If there would be a way to map a variant / OS tooling version from a regular version into a hash (or whatever we have in the image name), then I don't care what we have in the image name :smile: The challenge here is that we assemble the image name in cluster-\<provider> charts, so we're limited by what Helm can do there. So far we have basically been concatenating OS and Kubernetes version strings (with some provider-specific string pattern).
- Modify Release CRD to allow hashes for versions, as currently it allows only semver version strings.
- Alternatively, expand Release CRD to include something new just for tooling hash.
- As we're still delivering the same Release CRD to both vintage and CAPI, make sure that the Release CRD change does not break something somewhere 🙈
This is also partly why we didn't want to modify (improve) the Release CRD for CAPI, because we didn't want to create more unnecessary and potentially risky vintage-related work.
This will not always be the case, as vintage will be gone at some point, soon I hope, but it's still there today.
Personally, out of everything I mentioned above, the main concern for me is the automation, testing, communication and in general the whole experience around releases, since we want to keep improving all of that as much as we can. (I don't mind dealing with some added technical work, as long as the priorities allow, but here many others have their say as well, like POs / SIG Product.)
How is the hash in the name any different from what we have right now? The hash is just a string that we use with the suffix as we have been doing since we introduced the variant.
You have to look up the hash somewhere, which is not the same experience as using a new release.
You're not going to get away from this regardless. How else will you know what images we have available to use? You need to look up that we have an image with Kubernetes version X, Flatcar version Y and tooling version (or hash) Z. That's not automated anywhere, someone needs to manually get those values and put them into a new Release PR.
An `os-tooling` repo won't work, as you also need to consider the version of `image-builder` and `capi-image-builder` in that version, as those (can) introduce changes to the resulting image.
How do you envision the semver version being decided on? A minor change to the Teleport binary is not the same as a minor change to the resulting image. What about image-builder that uses pre-v1 release versions? How are they taken into account?
Also keep in mind, right now I'm only talking about the name of the VM image that is used. There's absolutely no reason why the Release CR itself couldn't contain the actual versions of components. We just can't store all those in the image name as we're length limited.
Maybe to comment the last paragraph first:
> right now I'm only talking about the name of the VM image that is used. There's absolutely no reason why the Release CR itself couldn't contain the actual versions of components.
We need a way to go from what we have in the Release CR to what we have in the image name, and that has to be implemented in Helm, as Helm chart is the place where we read Release CR and build the image name that we use.
If there is a way to map OS tooling version "x.y.z" into some `25383e1` hash and do that automatically in Helm (or have that somehow outside of Helm), then all good, it doesn't matter what we have in the image name, as I wrote above, I don't care TBH :slightly_smiling_face: Not sure if this would be possible without some larger architectural changes.
> We just can't store all those in the image name as we're length limited.
What do you mean by "all those"? We now have OS and k8s version, and here the proposal is to just add one more version as an umbrella version for all the tooling. OS image name string-length wise, adding one more version string should not be much different than adding a hash string. If we're already getting close to OS image name length limits for some provider, then we could also shorten "flatcar-stable" and "kube" parts of the name.
Now other points :slightly_smiling_face:
> How is the hash in the name any different from what we have right now? The hash is just a string that we use with the suffix as we have been doing since we introduced the variant.
While the automation part in image building is definitely a big and needed step forward, the hash not being different and the similarities with what we have now are exactly the concern here, as the whole issue is about improving the whole experience and implementation, not keeping them similar, and about being able to use a regular version number like for everything else.
> You need to look up that we have an image with Kubernetes version X, Flatcar version Y and tooling version (or hash) Z. That's not automated anywhere, someone needs to manually get those values and put them into a new Release PR.
We don't always have to look up images, I would say we rarely have to. We know that upstream Kubernetes and OS releases are triggering new automated image builds, so unless somebody is making a new release within minutes/hours (or maybe a few days for Azure) after a Kubernetes/Flatcar release, it's usually safe to assume that we have an image and just bump the Kubernetes and Flatcar version in the Release and test it, and I think everyone is mostly doing that.
Therefore it should be the same experience for OS tooling - new OS tooling version (wherever it's stored) should trigger new automated builds, and we should be able to just use that OS tooling version in the new release, like for OS and Kubernetes.
> An os-tooling repo wont work as you also need to consider the version of image-builder and capi-image-builder in that version as those (can) introduce changes to the resulting image.
I get that having this config in a separate repo would add more work on the `capi-image-builder` side, which is why I am not pushing here for a separate repo for this. If it is possible though, then it would make the whole releases experience better, so IMO it's worth implementing later if we can.
You mentioned having a config file with metadata such as the name of the tool, the desired version and the supported providers. So I just meant having that config file in its own repo, so it's more easily versioned and we can more easily consume it from multiple places.
So in the end, like somebody would today change the Teleport version in `capi-image-builder` and make a new release with a new image, we would instead:

- change the Teleport version in the `os-tooling` repo and release the `os-tooling` repo
- bump the `os-tooling` version in `capi-image-builder` (alternatively, `capi-image-builder` can always use the latest available `os-tooling` version, and build an image with that latest version in the name)
- use the new `os-tooling` version in the Release

> How do you envision the semver version being decided on? A minor change to the Teleport binary is not the same as a minor change to the resulting image. What about image-builder that uses pre-v1 release versions? How are they taken into account?
Not all components affect the releases in the same way. Patch of one component can have a higher impact than an app minor, e.g. OS/k8s/tooling patch will roll nodes, but app patch won't.
So we would have different test logic, depending on what is being checked, but in any case all checks would be based on semver (unless we somehow automatically dig into changelogs and try to analyse them, but I don't think that is gonna happen, at least not any time in the foreseeable future, as that would be great effort that requires a lot of time).
We talked about this in Turtles before, have some notes in Miro boards, but not yet specific plans and issues, as other stuff got higher priorities.
Today for Teleport it might not be super important to check if it was a patch or a minor, but we maybe/probably don't want to release a new Teleport binary major in a patch or minor release. We also have teleport-kube-agent as a default app with its config being in the public customer-facing API, so Teleport is in a way also exposed to customers here. We could generalize this check and check at least the OS tooling version, which would be an umbrella/aggregate for all binaries inside (so a new Teleport patch means a new OS tooling patch).
Ideally, in the future, we should be able to check version changes to all included binaries (use OS tooling version to fetch the exact version of the contained binary). Not that long ago we talked about potentially baking in binaries into images in general, for both new and overriding existing OS binaries, and that was prompted by runc CVE which compromised containerd, so it would be good to be able to override containerd. If we do that some day, then we will have to make sure that we correctly bump containerd versions there, as we expose containerd config in the public API, so you cannot bump the containerd version however you want.
The general point of the above is that we can easily have a relationship between a binary on the node and a default app in a Release, with the default app's public API (Helm values) exposed in cluster-\<provider>. Checking semver versions across all of those can tell you something, but checking a hash does not tell you anything, unless you can somehow get the exact binary/tool version based on that hash.
All in all, until now we had 2 variables that affect which image is used: the Flatcar version and the Kubernetes version.
The variant thing was a manual hack to abstract a change in a component version, but it should be just the 3rd variable alongside Flatcar and Kubernetes above, with the same interface and same (or at least very similar) experience around it.
Having to use a hash for something in the releases would make that thing a special flake, and that would fragment various parts of implementation, testing, automation and experience around releases.
There's more than what you've listed that affects the image that is built though. As I said in my comment, the version of upstream `image-builder` makes a difference as most of the versions of OS components are actually set in there, the version of `capi-image-builder` makes a difference as we consume and configure upstream image-builder, and the "versions" (git commit) of the Tekton pipelines make a difference as we change the configuration passed through to image-builder.
These would all need to be reflected in the Release CR with what you're suggesting. This also means a separate `os-tooling` repo WON'T WORK, as I said, because it won't be able to control the version of image-builder or capi-image-builder.
From what you're saying, it sounds like you need all versions defined on the Release CR. If that's the case, then the Release should be used to generate the image and not the other way around. But that breaks a lot of the automation we already have and would mean we need to rebuild most of the stuff we currently have.
What is the benefit of having all this listed in the Release CR and used to LOOKUP the image to use? Why is it such a bad thing that the Release CR defines the actual image name?
> There's more than what you're listed that effect the image that is built though. As i said in my comment, the version of upstream image-builder makes a difference as most of the versions of OS components are actually set in there, the version of capi-image-builder makes a difference as we consume and configure upstream image-builder, the "versions" (git commit) of the Tekton pipelines makes a difference as we change the configuration passed through to image-builder.
> These would all need to be reflected in the Release CR with what you're suggesting.
Regardless of if there is a separate `os-tooling` repo or not - `image-builder` and `capi-image-builder` are not exposed in our product and are not exposed to customers in any way. Also, the OS image name does not include anything about `image-builder` and `capi-image-builder`, and it never did. On the other hand, for the tooling that we bake into nodes, almost like Flatcar and Kubernetes:
> From what your saying, it sounds like you need all versions defined on the Release CR. If that's the case, then the Release should be used to generate the image and not the other way around. But that breaks a lot of the automation we already have and would mean we need to rebuild most of the stuff we currently have.
I am suggesting to add one more very specific version, not "all versions" - I'm not even sure what you mean by "all versions". One more version that would be an aggregate version for all binaries that Giant Swarm additionally bakes into the nodes, alongside everything we get with Flatcar and Kubernetes.
> What is the benefit of having all this listed in the Release CR and used to LOOKUP the image to use? Why is it such a bad thing that the Release CR defines the actual image name?
I tried to write above in comments why I think that having a hash in the Release is not a good idea. Not sure what else to add to everything I wrote, at least can't think of anything else today.
To get more feedback and details from the capi-image-builder side of things:
For the described config with binaries metadata (binary names, versions, etc.) in capi-image-builder, why can't we have a version string to describe that config and then have that version string in the OS image name?
I see how a hash would work in capi-image-builder for this, but why can't a version string work? It would work differently than a hash, but I don't think it's impossible to make it work with a version string.
Maybe there is a middle ground here :thinking:
I mentioned few times that I don't care if we have the hash in the OS image name. It does not have any downside on its own, if it helps in capi-image-builder then awesome. The part that I see as a problem is hash being an identifier of something in the release, and here hash would actually hide version strings, and we basically use Release CRs for version strings.
Now why do I repeat this again...
Given how it works today, with a very plain 1 on 1 relationship between the variant number in the Release CR and the variant number in the OS image name, and simple Helm templating that just takes the variant from the Release CR and templates the OS image name with it (and OS and k8s info), it seemed to me that a hash in the OS image name requires that same hash in the Release CR as well - it's not like you can guess the hash.
Then also this comment from Marcus :point_down: (unconsciously) solidified that thinking and I didn't question it anymore:
> This does have one slight downside though - the hash isn't guessable so would need to be looked up before it can be used in the next Release.
But maybe we can guess the hash, or more precisely calculate the same hash in both places where we need it.
IIUC the config file in `capi-image-builder` would have some tooling metadata info with binary names and their versions, e.g. something like this (exact format doesn't matter here much):
```yaml
- name: teleport
  version: 15.5.5
- name: foo
  version: 1.0.0
- name: bar
  version: 2.0.0
```
And then `capi-image-builder` would calculate the hash from the above info.
We could maybe calculate the same hash in the cluster chart in Helm, as Helm has a few hashing functions, like `sha256sum` and `sha1sum`, and we're even already using some of these for some other hashing scenarios.
So if in both capi-image-builder and the cluster chart we have the same info that we want to hash, and we use the same hash function, then Helm in the cluster chart can render the correct OS image name with the hash, and it can do it with the Release CR with regular versions being the input for hashing.
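The "same hash on both sides" idea can be sketched as follows - a minimal example, assuming a canonical `name=version` line format and a 7-character truncation; the function name and canonical format are hypothetical, not what capi-image-builder or the cluster chart actually do. The key point is that both sides must hash an identical canonical string for the results to match.

```python
# Hypothetical sketch: both capi-image-builder and the cluster chart could
# derive the same short hash, provided they hash an identical canonical string.
import hashlib

def tooling_hash(tools, length=7):
    # Canonicalize: sort by name so ordering differences don't change the hash,
    # then join "name=version" lines - the same string Helm's sha256sum would
    # need to be fed on the chart side.
    canonical = "\n".join(
        f"{t['name']}={t['version']}"
        for t in sorted(tools, key=lambda t: t["name"])
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]

tools = [
    {"name": "teleport", "version": "15.5.5"},
    {"name": "foo", "version": "1.0.0"},
    {"name": "bar", "version": "2.0.0"},
]
print(tooling_hash(tools))
```

Any change to a single tool version yields a different hash, while reordering the list does not - which is exactly the property needed when the Release CR and the image builder compute the hash independently.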
This would mean that in the Release CR we have names and versions of actual binaries that we bake into the node, and not some abstract variant / OS tooling version. IMO this would be even better, as it makes it even more transparent and easier to work with on the releases side.
This would also require that we maintain hashing to work in the same way on both sides, which should not be a problem I think, as this part should not change very often.
In which scenario would we need new images with the same version of OS, k8s and OS tooling, but built with the new version of capi-image-builder/image-builder? Did that ever happen already?
If this is the case, then OS image change/overwrite like that should trigger a rolling update in order to replace all the nodes, so we would need a new OS image name (as rewriting OS image in AWS/Azure/etc would not trigger anything on its own) and therefore we would need a new release. Meaning that capi-image-builder itself should then be a part of the Release CR.
> In which scenario would we need new images with the same version of OS, k8s and OS tooling, but built with the new version of capi-image-builder/image-builder? Did that ever happen already?
The `runc` vulnerability not long back required us to fix the problem in upstream image-builder then update our `capi-image-builder` to use that version.
So, after being away from my keyboard for a while last night I had time to think over this and realised that we might be overcomplicating things. I think we could make use of the `capi-image-builder` version as the "tooling" version in both the image name and Release CR and that should be enough.
My reasoning is...
As far as I can think, the things that decide if an image is different or not are as follows:

- the Kubernetes version
- the Flatcar version
- the tooling config - this lives in `capi-image-builder` and is only included when we publish a new release of the container image
- `image-builder` - this is bundled into `capi-image-builder` and only included when we publish a new release of the container image
- our own build configuration in `capi-image-builder` - also only included when we publish a new release of the container image

Assuming I'm correct with the above, that means we have 3 versions we care about - Kubernetes, Flatcar and capi-image-builder. So a combination of these three in the image name gives us a distinct image.
The Release CR could contain these three (semver) versions to then build the image name in the cluster chart. I suggest we call the `capi-image-builder` component something along the lines of `build-tooling`.
Unfortunately this would mean that this...
> This would mean that in the Release CR we have names and versions of actual binaries that we bake into the node, and not some abstract variant / OS tooling version. IMO this would be even better, as it makes it even more transparent and easier to work with on the releases side.
... is not possible. And while I think it'd be nice to have everything defined in the Release I don't think it'd actually work in practice. It'd be hard to know for sure if a specific combination of all the versions has been built as an image or not, and someone could come along and attempt to bump one of the components (like we do with the apps) and assume that the Release handles the installation of that specific component.
We can (and should) still include specific versions in release notes for customers. Just not in the Release CR itself. I think this is a good enough compromise.
One thing I think would be needed is clear documentation on how to appropriately version `capi-image-builder` going forward and what sort of changes equal what semver level.
The extra benefit of this is it's very easy for me to build this. 🤣 It's practically a 1 line change (almost) in the Tekton Pipeline.
Edit:
I just did a little test run (with this WiP PR) to see what this would look like and ended up with the following image:
`flatcar-stable-3815.2.5-kube-v1.29.1-tooling-1.14.4-mn-gs`

(Note: I used a suffix of `mn-gs` here so I could easily clean up after, normally the `mn-` wouldn't be there)
To make use of this new image format in the Releases / cluster app, the Release CR would include the `1.14.4` as the build-tooling component version and then the cluster chart would use `tooling-[VERSION FROM CR]-` in the place where we currently handle the variant.
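That templating step can be sketched as follows - a minimal example assuming the name pattern from the test run above; the helper name and suffix handling are illustrative, not the actual cluster chart code:

```python
# Hypothetical sketch of assembling the OS image name from Release CR
# component versions, mirroring the test-run image name from this thread.
def image_name(flatcar: str, kubernetes: str, tooling: str, suffix: str = "gs") -> str:
    # Pattern: flatcar-stable-<os>-kube-v<k8s>-tooling-<build-tooling>-<suffix>
    return f"flatcar-stable-{flatcar}-kube-v{kubernetes}-tooling-{tooling}-{suffix}"

print(image_name("3815.2.5", "1.29.1", "1.14.4"))
# flatcar-stable-3815.2.5-kube-v1.29.1-tooling-1.14.4-gs
```

In the real chart this concatenation happens in Helm templates, with the three versions read from the Release CR components.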
All things considered, including the `capi-image-builder` version in the image name and the Release CR sounds like the best option so far, as it covers all present and known future use cases (like adding new binaries), so AFAIC let's do that.
> The extra benefit of this is it's very easy for me to build this. 🤣 It's practically a 1 line change (almost) in the Tekton Pipeline.
It's similarly easy on the releases / charts side! :)
It would be great if we can get this done for CAPA v29.0.0 :grimacing:
I have one very unimportant question :smile:
`flatcar-stable-3815.2.5-kube-v1.29.1-tooling-1.14.4-mn-gs`

Why do we have a `v` prefix just before the k8s version, but not before the Flatcar (and now also OS tooling) version?
> I have one very unimportant question 😄
>
> `flatcar-stable-3815.2.5-kube-v1.29.1-tooling-1.14.4-mn-gs`
>
> Why do we have a `v` prefix just before the k8s version, but not before the Flatcar (and now also OS tooling) version?
Because those are the tagged versions used by upstream. Flatcar and our GitHub releases don't use a `v` prefix but Kubernetes does. If needed, we can remove it just from the image name. It's just that we're using the same variables we're using to look up the packages while building.
Do we want to move ahead with https://github.com/giantswarm/capi-image-builder/pull/328 ?
I've updated the PR to remove the `v` prefix. Looks like I was actually mistaken and we were adding it into the name as a hardcoded prefix. 🤷
> Do we want to move ahead with giantswarm/capi-image-builder#328 ?
Yes! :rocket:
🎉 I'll get it merged now then.
How do you want to handle changes to the cluster apps? Are y'all going to handle that for a new release?
Are there any version combinations you want me to re-build with the new name format?
> Are there any version combinations you want me to re-build with the new name format?
For now we will need new images at least for CAPA v29, so k8s 1.29.7, Flatcar 3815.2.5 and whatever is the capi-image-builder version after these changes. Not sure about CAPZ, will check with Turtles.
> How do you want to handle changes to the cluster apps? Are y'all going to handle that for a new release?
I am adding a new helper to cluster chart (pushing PR now), then I can try it out in cluster-aws once there is a rebuilt image for CAPA v29.
Since we will have a change in cluster-aws, and newer cluster-aws version can theoretically end up being used also in older CAPA majors (in case some cluster-aws change/fix is needed again for the migration), that means that we might need to rebuild the images for older CAPA majors as well (so v25, v26, v27, v28).
P.S. Will update here for CAPZ. And not sure what k8s/OS versions CAPV and CAPVCD are using, but I guess we will also need to rebuild the images for the latest k8s/OS there.
Ok, I'll rebuild the CAPA 1.29.7 one for now and will let you know when it's available (and confirm the name).
Once you know what else is needed let me know and I'll get those too :)
New image name: `flatcar-stable-3815.2.5-kube-1.29.7-tooling-1.15.0-gs`
Oh, it's gonna be challenging to test this, at least in cluster-aws repo PR, as the cluster-aws change also requires a new component in the Release CR, and the e2e tests are using the latest v28 release (and we don't have OS tooling in the existing Release CRs at all).
OTOH it should be possible to test in the releases repo draft PR (with cluster-aws test version) :thinking:
Testing cluster chart (and cluster-aws) changes here in the releases repo PR https://github.com/giantswarm/releases/pull/1337 (with a Release CR that is using cluster-aws dev version) :crossed_fingers:
Just in case, and to get ahead of things, I've kicked off builds for the latest patch release for CAPA for v1.25 up to v1.30 so even for backport Releases we should (once built) have images using the new format available.
Edit: Just done the same for CAPZ too. Hopefully I haven't triggered too much at once 😅
cluster-aws has been updated to work with the new image names. New cluster-aws version v2.0.0 has the change and it will be included in the CAPA v29 release :tada:
@Gacko will also work on the same change in cluster-azure.
@giantswarm/team-rocket we will need the same changes in cluster-vsphere and cluster-cloud-director whenever the new OS image is needed and built for those (e.g. due to new Flatcar version or new k8s version).
For reference:
I think everything from Team Tinkerers side needed for this is now complete, yeah? I'm going to move it from our board into done. Let me know if you think I've missed anything or if we need to manually build any more images. 😄
🙈 OMG! I can't move it to our done column without it closing the issue.
We're done from our side. So no worries, close it!
@Gacko But what about this latest comment? https://github.com/giantswarm/roadmap/issues/3592#issuecomment-2273328357
It still needs adding to the other providers, doesn't it?
Yes, sure. Then we should maybe re-open and just assign it to Rocket? I'll take care of cluster-azure today, so it's completed for Phoenix providers.
I moved it to Rocket's inbox and removed it from other boards.
@Gacko I did not read the full thread, but it looks like Turtles took care of the Phoenix providers, while for the Rocket providers it gets assigned to Rocket? I was wondering why that is and whether it's related to the use of the shared cluster chart. If so, we might just want to wait until Rocket also uses the shared cluster chart?
Summary
TL;DR We are baking Teleport binaries into our OS images (for now just Teleport, others might be added later). Final outcome of this issue should be that we have a regular component version in the Release resource, and that version should represent all binaries that we bake into the node. Then we can easily bump that component version when there is a new Teleport (or other binary) version that requires a new release.
Sometimes we need a new version of the Teleport binary while keeping the same OS and Kubernetes versions. Because the OS image name contains only the OS and Kubernetes versions, in order not to overwrite the existing image we manually build a new image with a suffix (the OS image "variant").
This OS image variant (a plain integer that basically abstracts/hides a binary version change) must then be specified somewhere in order to use the new manually built image. It used to be specified in the cluster-<provider> Helm values, and now it's specified in the Release CR.
The problem with this OS image variant is that it's sometimes added, sometimes bumped, and sometimes removed. While this was maybe an acceptable temporary fix while it lived in Helm values, doing this for a component version in the Release resource is a bad and quite unexpected user experience, as it would be the only component that behaves like that. This behavior is also probably impossible to automate, which would limit what we can do for automation around releases.
The OS image variant should be just another regular component version in the Release resources, next to the existing Kubernetes and OS versions, and it should be used/bumped in almost the same way as any other app/component in the release, because behind the OS image variant what we actually have is a version change for a binary that we bake into nodes.
The above means that we should automatically build OS images with the variant always included.
Last but not least: "variant" has been a confusing concept, and many people are still not aware of how it works or that it even exists. We could use a name like "OS tooling" and specify a version of it, which is perhaps clearer, because this is about changing the version of binaries in the OS image, and "variant" doesn't really imply that.
Stakeholders and implementors
`capi-image-builder` and how to have "variant" / "OS tooling" always included in the image name.

Motivation with current implementation and behavior details
Currently we are building our OS images with a name that follows this pattern:
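The pattern itself isn't rendered above; judging from the image name examples elsewhere in this thread (e.g. `flatcar-stable-3815.2.5-kube-1.29.7-tooling-1.15.0-gs` for the new format), the current pattern is presumably along these lines:

```text
flatcar-stable-<flatcar version>-kube-<kubernetes version>-gs
```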
This means that the OS image name includes the Flatcar and Kubernetes versions, so we get new images when there are new Flatcar or Kubernetes versions. However, these are not the only components that might get a newer version for which we need a new image.
We are also embedding the Teleport binary in the OS image, and we need new images when there is a new Teleport binary.
cluster-aws releases
Until now, when there was a new Teleport binary, we would manually build a new OS image with a custom image name suffix containing a "variant" number, and that new variant number would then be set in the cluster-aws Helm values. To show how this looked in practice:
If a variant Helm value was specified, then a custom image name suffix would be used like this (example from cluster-aws 0.78.0):
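The referenced example isn't rendered above. As a rough illustration only (the value and comment are hypothetical, not the actual cluster-aws 0.78.0 code), the mechanism described is roughly:

```yaml
# Hypothetical Helm values fragment (names made up for illustration):
flatcarImageVariant: "3"
# A template would then append the variant as an image name suffix only when
# this value is set, selecting the manually re-built AMI instead of the
# automatically built one.
```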
In that same cluster-aws version 0.78.0, the OS image variant was `3`, see here. This means that the AWS AMI was manually re-built 3 times in order to update the used Teleport binary, so we had been using manually built AMIs for quite a few releases of cluster-aws.

When we would get a newly built OS image with either a new OS or a new k8s version, the above OS image variant number would have to be removed from the cluster-aws Helm values, because the newly automatically built image would have the desired latest Teleport binary, but its name would be without the variant number suffix.
New CAPI releases with Release CRs
Since this variant number is effectively just another component version, we have also included it in the Release CR for new CAPI releases, e.g. see here for v28.0.0 for AWS (CAPA).
The image name pattern is now different than before (see it here), as it follows the new OS image naming pattern (a recent change from Team Tinkerers to include the OS version in the AMI), but everything works basically the same as before:
For CAPA v25 we have used a manually built OS image which uses a variant number from the Release CR. Due to various reasons, it was a very high priority to release the newer k8s versions ASAP, so we kept using manually built images. That has now changed for the v29 WIP: the variant number has been removed from the Release CR and cluster-aws is using the automatically built OS image.
There are more than a few issues with this variant number approach. Let's describe a few scenarios.
Scenario 1: we need a new Teleport binary, and the OS and k8s versions do not change:
- We manually build a new image with a variant suffix, e.g. `1`, as we have done in the past.
- We add a `flatcar-variant` component in the new Release CR for the new release and set this to the appropriate value (like we would (re)add the variant Helm value before in cluster-aws).
- We set the `flatcar-variant` component with version `1.0.0` in case of variant suffix `1` like in the example above, because Release components must be semantic versions, so we set `1.0.0` and just use the major number.

Scenario 2: there is a new OS or k8s version, and we previously had a custom image with `flatcar-variant` set:
- We remove `flatcar-variant` from the Release CR, in order to use the automatically built image. This is what we did for CAPA v29 ATM.

Scenario 3: there is a new Teleport version, and we previously already had a custom image with `flatcar-variant` set:
- We manually build a new image and bump the variant suffix, e.g. from `1` to `2`.
- We bump `flatcar-variant` in the Release CR, e.g. from `1.0.0` to `2.0.0`.

What is the problem here?
We have at least 3 different scenarios with 3 different ways to update the OS image name that we are using. While with the new releases it's a bit different and more complex, these are essentially the same scenarios as before with the cluster-aws Helm values, where we manually build images and the variant number is sometimes added, sometimes incremented, and sometimes removed, depending on whether we have a new k8s, a new OS, or a new Teleport version.
Also, some day we might have other binaries besides Teleport.
The variant number is basically abstracting the Teleport version, so it's just another component version. Therefore we should have a component version for this in the Release CR, and that component version should be used like all other component versions - by simply increasing the major/minor/patch version when needed, not by sometimes adding it, sometimes bumping it, and sometimes removing it. So now we come to the proposal for how to make all of this better.
Proposal
Interface to bumping OS tooling (like Teleport) version
Add a new component called `os-tooling` in the Release CR, starting with version `0.1.0`, so it would look like this:

Then whenever there is a new version of Teleport (or any other binary on the node), we just increase this version number like any other component version number. We don't mysteriously and unclearly remove it and re-add it like ATM.
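The snippet this refers to isn't rendered above. As a sketch of the intent (the exact Release CRD field names are an assumption on my part, and the versions are just the ones from the image name example earlier in the thread), the components list could look roughly like:

```yaml
# Sketch only - field names and versions are illustrative assumptions
spec:
  components:
    - name: flatcar
      version: 3815.2.5
    - name: kubernetes
      version: 1.29.7
    - name: os-tooling   # new, always present, bumped like any other component
      version: 0.1.0
```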
Automatically include os-tooling version in OS image name
With the new always-present `os-tooling` component version, the OS image name format would have to always include this new component version, so we would need the OS image name to have a pattern like this, for example: `flatcar-stable-<flatcar version>-kube-<k8s version>-tooling-<os-tooling version>-gs` (cf. the example image name `flatcar-stable-3815.2.5-kube-1.29.7-tooling-1.15.0-gs` mentioned earlier in this thread).

For the above to work, we would have to update image building in order to be able to specify the `os-tooling` version as an abstract version for all binaries that we include in our images (currently only Teleport). So we need a way to couple the `os-tooling` version with some binaries and their versions, so that `capi-image-builder` knows which binaries it should install for a specific `os-tooling` version.

A few ideas how to accomplish the above:
Idea 1: a dedicated `os-tooling` repo

- Create a new `os-tooling` repo.
- Release the `os-tooling` repo as any other app/component repo.
- Whenever a binary version changes in the `os-tooling` repo, we release a new version of `os-tooling`.
- Update `capi-image-builder` to consume `os-tooling` with a specified version (or simply check the latest and use the latest).
- Update `capi-image-builder` to append the `os-tooling` version in the image name, so we can automatically build new OS images without overwriting previous ones (when k8s and OS do not change, and just `os-tooling` changes).
- Use `os-tooling` releases to trigger new image builds (like k8s and OS versions).

Here the `os-tooling` repo could have a simple YAML with Teleport (and other binaries) version(s), if we could consume such "plain" YAML from Ansible in `capi-image-builder`, or we could even put Ansible tasks right in `os-tooling`, if we can then easily include them in `capi-image-builder`.

Idea 2: versions directly in `capi-image-builder`

- Directly put `os-tooling` and Teleport versions in `capi-image-builder`, and then directly in `capi-image-builder` have a map between the `os-tooling` version and the Teleport (and other binary) versions.

Actually not sure if this is simpler, or even more complex.

Ideas with any other way that would enable us to have a simple `os-tooling` component in the Release CR are more than welcome! :pray:

The end result should be that, whenever there is a new Teleport release, an image is built automatically, and we can simply bump a version number in the Release CR - so the behavior and user experience here should be effectively the same as for the OS and Kubernetes versions.
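To make the coupling concrete, here is a minimal sketch in Python (the mapping data is made up for illustration; the real source of truth would be a YAML file in the `os-tooling` repo or a map inside `capi-image-builder`) of resolving an `os-tooling` version to its pinned binaries and rendering the resulting image name:

```python
# Hypothetical mapping between os-tooling versions and the binaries they pin.
# The Teleport version here is a placeholder, not a real pinned version.
OS_TOOLING = {
    "1.15.0": {"teleport": "14.3.3"},
}


def image_name(flatcar: str, kubernetes: str, os_tooling: str) -> str:
    """Render the proposed image name with the os-tooling version always included."""
    if os_tooling not in OS_TOOLING:
        raise ValueError(f"unknown os-tooling version: {os_tooling}")
    return f"flatcar-stable-{flatcar}-kube-{kubernetes}-tooling-{os_tooling}-gs"


print(image_name("3815.2.5", "1.29.7", "1.15.0"))
# → flatcar-stable-3815.2.5-kube-1.29.7-tooling-1.15.0-gs (matches the example image name above)
```

The point of the lookup is that bumping the `os-tooling` component in a Release CR deterministically selects both the binary set and the image name, with no separately maintained variant number.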
TODO
Outcome
For Customer
Internal
`os-tooling` version.

Other
cc @tuladhar @vxav @AverageMarcus @Gacko @paurosello I chatted with you folks about this in Slack on a few different occasions. Also, the above idea was previously shared in Slack in the Team Phoenix, Team Turtles, Team Tinkerers, Team BigMac and Team Rocket channels. So far nobody has objected.