Improve and standardize OS variant suffix for new CAPI releases

nprokopic commented 3 months ago

Summary

[!NOTE] I tried to be concise in the summary, but the issue is very wide and touches on multiple places and teams, so it's hard to describe it correctly in just a few sentences, as just few sentences could easily lead to a misinterpretation of some parts of the problem.

TL;DR We are baking Teleport binaries into our OS images (for now just Teleport, others might be added later). Final outcome of this issue should be that we have a regular component version in the Release resource, and that version should represent all binaries that we bake into the node. Then we can easily bump that component version when there is a new Teleport (or other binary) version that requires a new release.

Sometimes we need a new version of the Teleport binary, while we keep the same OS and Kubernetes versions, so because OS image name has only OS and Kubernetes versions in it, in order not to overwrite the existing image, we are manually building a new image with a suffix (OS image "variant").

This OS image variant (which is a plain integer that basically abstracts/hides a binary version change) must be then specified somewhere, in order to use the new manually built image. It used to be specified in the cluster-\<provider> Helm values, and now it's specified in the Release CR.

The problem with this OS image variant is that it's sometimes added, sometimes is bumped, and sometimes it's removed. While this was maybe an acceptable temporary fix while this was done in Helm values, doing this for a component version in the Release resource is a bad and quite unexpected user experience, as it would be the only component that behaves like that. This behavior is also probably impossible to automate, which would limit us in what we can do for automation around releases.

OS image variant should be just another regular component version in the Release resources, next to existing Kubernetes and OS versions, and it should be used/bumped almost in the same way like any other app/component in the release, because behind the OS image variant what we actually have is a version change for a binary that we bake into nodes.

The above means that we should automatically build OS images with the variant always included.

The last, but not the least - "variant" has been a confusing concept and many people are still not aware how it works or that it even exists. We could use some name like "OS tooling" and specify a version of it, so it's perhaps more clear, because it's about changing a version of binaries in the OS image, and "variant" doesn't really imply that.

Stakeholders and implementors

@giantswarm/team-bigmac which owns and maintains Teleport implementation. You are the team that bumps "variant" / "OS tooling" version most often (at least for now).
@giantswarm/team-turtles which own releases. You would be the team that implements what is needed in the cluster chart and/or in cluster-\<provider> apps if needed (with coordination/help from @giantswarm/team-phoenix and @giantswarm/team-rocket), so all providers are aligned and all cluster-\<provider> charts are using the "variant" / "OS tooling" version from the Releases in the same way.
@giantswarm/team-tinkerers which owns image building pipelines. You would design and implement the change in capi-image-builder and how to have "variant" / "OS tooling" always included in the image name.

Motivation with current implementation and behavior details

Currently we are building our OS images with name that has the following pattern:

flatcar-stable-<flatcar version>-kube-v<k8s version>-gs

This means that OS image name includes Flatcar and Kubernetes versions, so we get new images when there are new Flatcar and Kubernetes versions. However, these are not the only components that might get a newer version for which we need a new image.

We are also embedding teleport binary in the OS image, and we need new images when there is a new teleport binary.

cluster-aws releases

Until now, when there would be a new teleport binary, we would manually build a new OS image with a custom image name suffix that would contain a "variant" number, and that new variant number would be then set in cluster-aws Helm values. To show how this looked like in practice:

If a variant Helm value would be specified, then a custom image name suffix would be used like this (example from cluster-aws 0.78.0):

{{- $suffix := .Values.global.providerSpecific.osImageVariant }}
{{- if $suffix }}
  {{- $suffix = printf "-%s-gs" $suffix }}
{{- else }}
  {{- $suffix = "-gs" }}
{{- end }}

In that same cluster-aws version 0.78.0 , OS image variant was 3, see here. This means that AWS AMI was manually re-built 3 times in order to update the used teleport binary, so we had been using manually-built AMIs for a quite a few releases of cluster-aws.

When we would get a newly built OS image with either new OS or new k8s, then the above OS image variant number would have to be removed from cluster-aws Helm values, because newly automatically built image would have the desired latest teleport binary, but the OS image name would be without variant number suffix.

New CAPI releases with Release CRs

Since this variant number is effectively just another component version, we have also included it in the Release CR for new CAPI releases, e.g. see here for v28.0.0 for AWS (CAPA).

Image name pattern is now different now than before (see it here), as it follows the new OS image naming patters (recent change from Team Tinkerers to include OS version in the AMI), but everything works basically the same like before:

If a variant is specified in the Release, a manually-built image with variant suffix is used.
If there is no variant in the Release CR, the automatically-built image is used (without any variant suffix).

For CAPA v25 we have used a manually built OS image which is using a variant number from the Release CR. Due to various reasons, it was a very high priority to release the newer k8s versions ASAP, so we kept using manually built images. Now that has changed for v29 WIP, so variant number has been removed from the Release CR and cluster-aws is using the automatically-built OS image.

There are more than few issues with this approach with variant number. Let's describe few scenarios.

Scenario 1: when we need a new Teleport binary, and OS and k8s versions do not change:

We have to manually build a new OS image (one or more, depending how many providers need this) where we will set a variant suffix in the custom image name, and this has been usually number 1 in the past.
Then we have to (re)add flatcar-variant component in the new Release CR for the new release and set this to the appropriate value (like we would (re)add variant Helm value before in cluster-aws).
- e.g. we would (re)add flatcar-variant component with version 1.0.0 in case of variant suffix 1 like in the example above, because Release components must be semantic versions, so we set 1.0.0 and just use major number.

Scenario 2: when there is a new OS or k8s version, and we previously had a custom image with flatcar-variant set:

Newly built OS image includes the latest teleport binary, so we now have to use automatically-built image without any custom suffix.
The above means that we now have to entirely remove flatcar-variant from the Release CR, in order to use automatically-build image. This is what we did for CAPA v29 ATM.

Scenario 3: when there is a new teleport version, and we previously already had a custom image with flatcar-variant set:

We have to manually build a new OS image (one or more, depending how many providers need this) where we will set the updated variant suffix in the custom image name. In the past for cluster-aws this would mean updating variant Helm value, e.g. from 1 to 2.
Then we would have to update flatcar-variant in the Release CR, e.g. from 1.0.0 to 2.0.0.

What is the problem here?

We have at least 3 different scenarios where we have 3 different ways to update the OS image name that we are using. While with new releases it's a bit different and more complex, it's essentially the same scenarios like before with cluster-aws Helm values, where we are manually building images and the variant number is sometimes added, sometimes incremented, sometimes removed, depending on if we have new k8s, new OS, or new Teleport version.

Also, some day we might also have other binaries, besides Teleport.

The variant number is basically abstracting Teleport version, so it's just another component version. Therefore for this we should have a component version in the Release CR, and that component version should be used like all other component versions - by simply increasing major/minor/patch version when needed, and not sometimes add it, sometimes bump it, sometimes remove it. So now we come to the proposal how to make all of this better.

Proposal

Interface to bumping OS tooling (like Teleport) version

Add new component called os-tooling in the Release CR, starting with version 0.1.0, so it would look like this:

spec:
  components:
  - name: os-tooling
    version: 0.1.0

Then whenever there is a new version of teleport (or any other binary on the node), we just increase this version number like any other component version number. We don't mysteriously and unclearly remove it and re-add it like ATM.

Automatically include os-tooling version in OS image name

With the new always-present os-tooling component version, OS image name format would have to always include this new component version, so we would need OS image name to have a pattern like this for example:

flatcar-stable-<flatcar version>-kube-v<k8s version>-gs-v<os tooling version>

For the above to work, we would have to update image building in order to be able to specify os-tooling version as an abstract version for all binaries that we include in our images (currently only Teleport). So we need a way to couple os-tooling version with some binaries and their versions, so that capi-image-builder knows which binaries it should install for a specific os-tooling version.

Few ideas how to accomplish the above.

os-tooling repo

Create os-tooling repo
- In this repo we would keep info about which binaries with which versions we are installing on the nodes. This info can be set in a simple YAML file.
- We version and release os-tooling repo as any other app/component repo.
- When we increase teleport binary version in the os-tooling repo, we release a new version of os-tooling.
Update capi-image-builder to consume os-tooling with specified version (or simply check the latest and use the latest)
- Update capi-image-builder to append os-tooling version in the image name, so we can automatically build new OS images without overwriting previous ones (when k8s and OS do not change, and just os-tooling changes).
- Configure os-tooling releases to trigger new image builds (like k8s and OS versions).

Here os-tooling repo could have simple simple YAML with Teleport (and other binaries) version(s) if we could consume such "plain" YAML from ansible in capi-image-builder, or we could even put Ansible tasks right in os-tooling, if we can then easily include them in capi-image-builder.

Maybe simpler (without os-tooling repo), not fully automated

Directly put os-tooling and Teleport versions in capi-image-builder, and then directly in capi-image-builder have a map between os-tooling version and Teleport (and other binary) version.

Actually not sure if this is simpler, or even more complex.

Some other way

Ideas with any other way that would enable us to have a simple os-tooling component in the Release CR are more then welcome! :pray:

The end result should be that, whenever there is a new Teleport release, image is built automatically, and we can simply bump a version number in the Release CR - so the behavior and user experience here should be effectively the same like for OS and Kubernetes versions.

TODO

[ ] Figure out which approach is the best for automating image building with "variant" / "OS tooling" version automatically included in the image name, and then implement it. @giantswarm/team-tinkerers would take care of this, with input/help from @giantswarm/team-bigmac, as BigMac is bumping Teleport version for now.
[ ] Update implementation in Releases and cluster-\<provider> apps to use the new image name pattern (with "variant" / "OS tooling" version in the name).
- [ ] This includes additions and changes in the cluster chart, e.g. to add new helpers to read "variant" / "OS tooling" version correctly. @giantswarm/team-turtles would handle this.
- [ ] This also includes changes to cluster-\<provider> apps, to correctly render the OS image name (with provider-independent helpers from cluster chart). @giantswarm/team-turtles, @giantswarm/team-phoenix and @giantswarm/team-rocket can coordinate here and agree who does which part.
[ ] Update relevant docs.

Outcome

For Customer

It's more transparent what we are changing in new releases when we have a new Teleport (or some other) binary.

Internal

We can very easily make new releases with new versions of Teleport.
We can very easily include new binaries in our nodes, and the existing workflow for bumping the versions stays the same because we have abstract os-tooling version.

Other

cc @tuladhar @vxav @AverageMarcus @Gacko @paurosello I chatted to you folks about this in Slack on few different occasions. Also the above idea was shared previously in Slack in Team Phoenix, Team Turtles, Team Tinkerers, Team BigMac and Team Rocket channels. So far nobody objected.

nprokopic commented 3 months ago

Edited the issue to add a summary and stakeholders and implementors sections.

AverageMarcus commented 3 months ago

I've had a little time to think about this but haven't yet dived into the code to verify this approach will work so keep that in mind for now...

My thought is as follows:

We introduce a new config file within capi-image-builder that allows us to define "tooling" such as Teleport. This should contain metadata such as the name of tool, the desired version and the supported providers.
This new config should then be used in the various places where the tool is installed to get the version from that config (single place to bump / see all these versions)
This config file is then used to generate a short hash (using the tool name/version combinations applicable to the provider) that will be used as the image suffix prior to the -gs ending.

This has the following benefits:

The config file can be handled by Renovate to open PRs bumping the tool versions when they become available
The suffix will always be generated in the same way, regardless of automated or manual build.
This can handle an unlimited number of possible tooling that we want to update without needing to make the image name super long.
The way that tools are installed can remain the same and be varied between tools. We just need to grab the version from the config file (something we can do in the pipeline tasks)

This does have one slight downside though - the hash isn't guessable so would need to be looked up before it can be used in the next Release. As things stand today this is a bit of a hassle but we've got an open issue to improve the visibility and discoverability of new available images that should make this problem trivial once done.

Edit:

This could also be done in small stages to make it a more manageable change.

Introduce the config and update the Teleport installation to consume it but leave everything else the same
Add Renovate support to the tooling config
Update the image name with a hash

Gacko commented 3 months ago

Can we generate this hash together with the config? Like having it somewhere in the config itself or as a separate file? Because then people would always know which hash to use for which specific tooling version just by going through the Git history?

capi-image-builder could than also just pick up this hash. Oooooor: Why not use the Git commit hash? First few digits are already unique and you could find the exact point in time. Something like this...

AverageMarcus commented 3 months ago

The hash isn't for a specific tooling version though. It's for the total combination of all tooling.

Gacko commented 3 months ago

Yeah, sure, I was also thinking of the combination.

AverageMarcus commented 3 months ago

But who generates that hash and how? (Also why? When we can just automate it at buildtime)

Gacko commented 3 months ago

I like your approach of having Renovate bumping versions. With a GitHub action / Tekton run triggering whenever something changes this config, you could trigger a capi-image-builder run. And this then builds images with the exact commit hash or just a few digits of it. No need to generate a hash your own, it's already there.

AverageMarcus commented 3 months ago

Sure, but from a technical perspective that actually might be more work as I don't think we currently pass through the commit sha to the builder (we might do, would need to check) but generating a hash from a string is easy. I don't really care how we get the value, just that it should be automated and the same process regardless of automated or manual trigger.

Gacko commented 3 months ago

Of course, I just had the transparency in mind. This way you would not need a dashboard or something like this. Just the commit hash from GitHub, that is already publicly visible for everyone. It's basically the same approach we already use for apps in test catalogs and makes the exact image being used predictable.

AverageMarcus commented 3 months ago

publicly visible for everyone.

The repo is actually private.

Gacko commented 3 months ago

"Public" for Giant Swarm staff...

Gacko commented 3 months ago

Could you imagine a need to (re)build images with same Flatcar version, same Kubernetes version, same tooling version but maybe a fix/change in capi-image-builder itself? Because when its behavior changes but Flatcar, Kubernetes and tooling is still the same, that change would not be reflected in the image name with your approach, right?

AverageMarcus commented 3 months ago

Good point. That is totally a possibility but would more be bumping the version of image-builder (upstream) I imagine.

Ok. Yeah, short commit sha from capi-image-builder would be the way to go though. But everything else the same, yeah?

Gacko commented 3 months ago

Sure, sure! I really love your proposal! That commit hash was the only thing that came into my mind to make it a bit more transparent and easy to use. 🙂

Gacko commented 3 months ago

As just discussed on Slack:

I guess we currently have something in place that checks if an image already exists and then does not rebuild and upload it, right? To save time, money, resources...

With adding the commit hash to the name, that would probably break. For that we could maybe also try to just use the tags/releases we already have in the capi-image-builder repo.

With using tags/releases, engineers would have full control of images just via the workflow they are already used to (change something, forge a release what not).

AverageMarcus commented 3 months ago

So, I hadn't thought about that and that does introduce a bit of a problem as the Tekton pipelines/tasks aren't managed via tagged releases (instead gitops points to main).

I'm not thinking of going back to a hash generated from the tooling config versions (with the addition of the version of capi-image-builder itself added as this would handle the upstream image-builder version).

There is still a chance where changes to the pipelines/tasks cause a change to the produced image but generally speaking from past experience these are usually fixes performed by Tinkerers and in such cases we can manually overwrite images as required on a case-by-case basis. As we move towards more automation and abstraction over this process those instances should become less and less frequent anyway (not that we have many already).

nprokopic commented 3 months ago

I like almost everything in the @AverageMarcus's proposal above, just except one part - using the hash in the image name, because then that will bubble up all the way to how the releases work and it would make OS tooling versioning special in more than few places and cases.

This does have one slight downside though - the hash isn't guessable so would need to be looked up before it can be used in the next Release.

This part :point_up: specifically. It would mean that we would have to include a tooling hash in the Release resource. It would be the only place where we use the hash, as we use a regular version number for everything. This would make tooling version (or here hash) changes special, compared to everything else.

From some releases experience point of view:

When we are creating a new release:
- You have to look up the hash somewhere, which is not the same experience as using a new release.
- Then for the person interested in the new release, seeing that the hash changed in some component does not convey the same message as when you see that there is a patch, minor or major change. It would be the only component for which we don't have the clear communication through the semantic versioning (we're trying to improve this experience for apps and components, to use semver more properly and be more mindful about what is a patch and what is a minor, so now using a hash for something would be basically the opposite).

We would also like to continue improving the automation and testing around releases. Few examples for which we don't have issues, but were mentioned/brainstormed in chats and meets:

When you create a new release, and you bump the app/component version in some way (patch vs minor vs major), then we should have a test that verifies if you have created an appropriate patch, minor or major release for that. With a hash we cannot do any kind of check, we're basically bind and don't know which type of change is made.
In order to improve the automation and experience around releases, we could try to use renovate for some use cases, e.g. open PR for the next major release where renovate bumps app and component versions, so we automatically pull in and test all changes from all teams for the next major release (e.g. now we have a v29 with 2 week-window open for changes, but adding v29 changes is still manual work). There is no issue for this or some official plan, but it's an easy possibility with how Releases have been working. Maybe renovate can change hashes as well, not sure, but seeing a bot (renovate) changing a hash does not tell much :see_no_evil:

Form the technical point of view, the work I can think that we would have to do:

Modify Release CRD to allow hashes for versions, as currently it allows only semver version strings.
- Alternatively, expand Release CRD to include something new just for tooling hash.
As we're still delivering the same Release CRD to both vintage and CAPI, make sure that the Release CRD change does not break something somewhere :see_no_evil:
Update Releases SDK to account for new Release CRD changes (in whatever way they're done)

@AverageMarcus coud you do almost everything like proposed, and just have an actual version number instead of a hash? Maybe even to make it a bit more manual in the beginning, e.g. you change teleport version in config file and you also change OS tooling version wherever it would be written? And then we improve it and automate it incrementally to get it as automated as possible.

Alternatively, could this config file live outside of the capi-image-builder and be in it's own os-tooling repo? So we version and release that repo, then both capi-image-builder and Release resource use the repo version. Releases would use it like any other version of any app/component, and capi-image-builder (if possible) would get the config file from the repo for the specific version. Then the workflow would be:

We bump Teleport (or other binary) version in os-tooling repo and make a new os-tooling release.
We bump os-tooling version in capi-image-builder.
We create a new release with new os-tooling version.

We can also start with everything inside of capi-image-builder and then incrementally change it towards having a separate repo, or whatever is the best and most convenient.

I see how using a hash can make it nice and simpler the capi-image-builder itself, but a hash in the OS image name would add friction and more manual work on other levels of the product (like I tried to describe above in different examples), as hash changes do not tell you much and you cannot test and automate stuff around hashes like around regular semantic versions.

nprokopic commented 3 months ago

If there would be a way to map a variant / OS tooling version from a regular version into a hash (or whatever we have in the image name), then I don't care what we have in the image name :smile: the challenge here is that we assemble the image name in cluster-\<provider> charts, so we're limited with what Helm can do there. So far there we have been basically concatenating OS and Kubernetes version strings (with some provider-specific string pattern).

nprokopic commented 3 months ago

Modify Release CRD to allow hashes for versions, as currently it allows only semver version strings.

Alternatively, expand Release CRD to include something new just for tooling hash.

As we're still delivering the same Release CRD to both vintage and CAPI, make sure that the Release CRD change does not break something somewhere 🙈

This is also partly why we didn't want to modify (improve) the Release CRD for CAPI, because we didn't want to create more unnecessary and potentially risky vintage-related work.

This will not always be the case, as vintage will be gone at some point, soon I hope, but it's still there today.

nprokopic commented 3 months ago

Personally, out of everything I mentioned above, the main concern for me is the automation, testing, communication and in general the whole experience around releases, since we want to keep improve all of that as much as we can. (I don't mind a lot dealing with some added technical work, as long as the priorities allow, but here many others have their saying as well, like POs / SIG Product.)

AverageMarcus commented 3 months ago

How is the hash in the name any different from what we have right now? The hash is just a string that we use with the suffix as we have been doing since we introduced the variant.

You have to look up the hash somewhere, which is not the same experience as using a new release.

You're not going to get away from this regardless. How else will you know what images we have available to use? You need to look up that we have an image with Kubernetes version X, Flatcar version Y and tooling version (or hash) Z. That's not automated anywhere, someone needs to manually get those values and put them into a new Release PR.

An os-tooling repo wont work as you also need to consider the version of image-builder and capi-image-builder in that version as those (can) introduce changes to the resulting image.

How do you envision the semver version being decided on? A minor change to the Teleport binary is not the same as a minor change to the resulting image. What about image-builder that uses pre-v1 release versions? How are they taken into account?

Also keep in mind, right now I'm only talking about the name of the VM image that is used. There's absolutely no reason why the Release CR itself couldn't contain the actual versions of components. We just can't store all those in the image name as we're length limited.

nprokopic commented 3 months ago

Maybe to comment the last paragraph first:

right now I'm only talking about the name of the VM image that is used. There's absolutely no reason why the Release CR itself couldn't contain the actual versions of components.

We need a way to go from what we have in the Release CR to what we have in the image name, and that has to be implemented in Helm, as Helm chart is the place where we read Release CR and build the image name that we use.

If there is a way to map OS tooling version "x.y.z" into some 25383e1 hash and do that in automatically Helm (or have that somehow outside of Helm), then all good, it doesn't matter what we have in the image name, as I wrote above I don't care TBH :slightly_smiling_face: Not sure if this would be possible without some larger architectural changes.

We just can't store all those in the image name as we're length limited.

What do you mean by "all those"? We now have OS and k8s version, and here the proposal is to just add one more version as an umbrella version for all the tooling. OS image name string-length wise, adding one more version string should not be much different than adding a hash string. If we're already getting close to OS image name length limits for some provider, then we could also shorten "flatcar-stable" and "kube" parts of the name.

Now other points :slightly_smiling_face:

How is the hash in the name any different from what we have right now? The hash is just a string that we use with the suffix as we have been doing since we introduced the variant.

While the automation part in image building is definitely a big and needed step forward, some other parts around hash not being different, or the similarities with what we have now are the concern here, as the whole issue is about improving the whole experience and implementation, not keeping them similar, and about being able to use a regular version number like for everything else.

You need to look up that we have an image with Kubernetes version X, Flatcar version Y and tooling version (or hash) Z. That's not automated anywhere, someone needs to manually get those values and put them into a new Release PR.

We don't always have to look up images, I would say we rarely have to. We know that upstream Kubernetes and OS releases are triggering new automated image builds, so unless somebody is making a new release within minutes/hours (or maybe few days for Azure) after a Kubernetes/Flatcar release, it's usually safe to assume that we have an image and just bump the Kubernetes and Flatcar version in the Release and test it, and I think everyone is mostly doing that.

Therefore it should be the same experience for OS tooling - new OS tooling version (wherever it's stored) should trigger new automated builds, and we should be able to just use that OS tooling version in the new release, like for OS and Kubernetes.

An os-tooling repo wont work as you also need to consider the version of image-builder and capi-image-builder in that version as those (can) introduce changes to the resulting image.

I get that having this config in a separate repo would add more work on the capi-image-builder side, which is why I am not pushing here for a separate repo for this. If it is possible though, then it would make the whole releases experience better, so IMO it's worth implementing it later if we can.

You mentioned having a config file with metadata such as the name of tool, the desired version and the supported providers. So I just meant having that config file in it's own repo, so it's more easily versioned and we can more easily consume it from multiple places.

So in the end, like somebody would today change the Teleport version in capi-image-builder and make a new release with new image, we would instead

change Teleport version in os-tooling repo and release os-tooling repo
change the os-tooling version in the capi-image-builder
- maybe we can automate this with renovate,
- or even better, if possible, capi-image-builder can always use the latest available os-tooling version, and build an image with that latest version in the name
make a new release with new os-tooling version

How do you envision the semver version being decided on? A minor change to the Teleport binary is not the same as a minor change to the resulting image. What about image-builder that uses pre-v1 release versions? How are they taken into account?

Not all components affect the releases in the same way. Patch of one component can have a higher impact than an app minor, e.g. OS/k8s/tooling patch will roll nodes, but app patch won't.

So we would have different test logic, depending on what is being checked, but in any case all checks would be based on semver (unless we somehow automatically dig into changelogs and try to analyse them, but I don't think that is gonna happen, at least not any time in the foreseeable future, as that would be great effort that requires a lot of time).

We talked about this in Turtles before, have some notes in Miro boards, but not yet specific plans and issues, as other stuff got higher priorities.

Today for Teleport it might not be super important to check if it was a patch or a minor, but we maybe/probably don't want to release new Teleport binary major in a patch or minor release. We also have teleport-kube-agent as a default app with its config being in public customer-facing API, so Teleport is in a way also exposed to customers here. We could generalize this check and check at least OS tooling version which would be an umbrella/aggregate for all binaries inside (so new Teleport patch means new OS Tooling patch).

Ideally, in the future, we should be able to check version changes to all included binaries (use OS tooling version to fetch the exact version of the contained binary). Not that long ago we talked about potentially baking in binaries into images in general, for both new and overriding existing OS binaries, and that was prompted by runc CVE which compromised containerd, so it would be good to be able to override containerd. If we do that some day, then we will have to make sure that we correctly bump containerd versions there, as we expose containerd config in the public API, so you cannot bump the containerd version however you want.

The general point of the above is that we can easily have a relationship between a binary on the node and a default app in a Release with default app's public API (Helm values) exposed in of cluster-\<provider>. Checking semver versions across all of those can tell you something, but checking a hash does not tell you anything, unless you can somehow get exact binary/tool version based on that hash.

All in all, until now we had 2 variables that affect which image is used:

Flatcar version, and
Kubernetes version

Variant thing was a manual hack to abstract a change in a component version, but it should be just the 3rd variable alongside Flatcar and Kubernetes above, with the same interface and same (or at least very similar) experience around it:

"From the outside", the end experience:
- We update variant / OS tooling version in capi-image-builder.
- We use that new version in the Release CR, like we use Flatcar and Kubernetes versions.
From the implementation in cluster-\<provider> point of view:
- We use OS tooling version in Helm to render the correct image name, like we use Flatcar and Kubernetes version.
From the automation point of view
- New OS tooling (variant) triggers new image builds and we get new images with new names, like new Kubernetes and Flatcar version trigger new image builds with new image names.
- As opposed to Flatcar and Kubernetes, where we just consume upstream releases, here it's Giant Swarm staff that also creates those OS tooling releases.

Having to use a hash for something in the releases would make that thing a special flake, and that would fragment various parts of implementation, testing, automation and experience around releases.

AverageMarcus commented 3 months ago

There's more than what you're listed that effect the image that is built though. As i said in my comment, the version of upstream image-builder makes a difference as most of the versions of OS components are actually set in there, the version of capi-image-builder makes a difference as we consume and configure upstream image-builder, the "versions" (git commit) of the Tekton pipelines makes a difference as we change the configuration passed through to image-builder.

These would all need to be reflected in the Release CR with what you're suggesting. This also means a separate os-tooling repo WONT WORK as I said because that wont be able to control the version of image-builder or capi-image-builder.

From what your saying, it sounds like you need all versions defined on the Release CR. If that's the case, then the Release should be used to generate the image and not the other way around. But that breaks a lot of the automation we already have and would mean we need to rebuild most of the stuff we currently have.

What is the benefit of having all this listed in the Release CR and used to LOOKUP the image to use? Why is it such a bad thing that the Release CR defines the actual image name?

nprokopic commented 3 months ago

There's more than what you're listed that effect the image that is built though. As i said in my comment, the version of upstream image-builder makes a difference as most of the versions of OS components are actually set in there, the version of capi-image-builder makes a difference as we consume and configure upstream image-builder, the "versions" (git commit) of the Tekton pipelines makes a difference as we change the configuration passed through to image-builder.

These would all need to be reflected in the Release CR with what you're suggesting.

Regardless of if there is a separate os-tooling repo or not - image-builder and capi-image-builder are not exposed in our product and are not exposed to customers in any way. Also OS image name does not include anything about image-builder and capi-image-builder and it never did. On the other hand, for the tooling that we bake into nodes, almost like Flatcar and Kubernetes:

It is a part of the image name.
It is exposed and visible to customers through the WC release.
It can also be exposed to the customers if those node binaries are related also to one of default apps, or to anything else that we expose in public API for customers (through Helm values).

From what your saying, it sounds like you need all versions defined on the Release CR. If that's the case, then the Release should be used to generate the image and not the other way around. But that breaks a lot of the automation we already have and would mean we need to rebuild most of the stuff we currently have.

I am suggesting to add one more very specific version, not "all versions", not even sure what you mean with "all versions". One more version that would be an aggregate version for all binaries that Giant Swarm additionally bakes into the nodes, alongside everything we get with Flatcar and Kubernetes.

nprokopic commented 3 months ago

What is the benefit of having all this listed in the Release CR and used to LOOKUP the image to use? Why is it such a bad thing that the Release CR defines the actual image name?

I tried to write above in comments why I think that having a hash in the Release is not a good idea. Not sure what else to add to everything I wrote, at least can't think of anything else today.

nprokopic commented 3 months ago

To get more feedback and details from the capi-image-builder side of things:

For the described config with binaries metadata (binary names, versions, etc) in capi-image-builder, why cannot we have a version string to describe that config and then have that version string in the OS image name?

I see how a hash would work in capi-image-builder for this, but why the version string cannot work? It would work differently than hash, but I don't think that it's impossible to make it work with a version string.

nprokopic commented 3 months ago

Maybe there is a middle ground here :thinking:

I mentioned few times that I don't care if we have the hash in the OS image name. It does not have any downside on its own, if it helps in capi-image-builder then awesome. The part that I see as a problem is hash being an identifier of something in the release, and here hash would actually hide version strings, and we basically use Release CRs for version strings.

Now why do I repeat this again...

Given how it works today, with a very plain 1 on 1 relationship between variant number in the Release CR and variant number in the OS image name, and simple Helm that just takes variant from the Release CR and templates OS image name with it (and OS and k8s info), it seemed to me that hash in the OS image name requires that same hash in the Release CR, as well, it's not like you can guess the hash.

Then also this comment from Marcus :point_down: (unconsciously) solidified that thinking and I didn't question it anymore:

This does have one slight downside though - the hash isn't guessable so would need to be looked up before it can be used in the next Release.

But maybe we can guess the hash, or more precisely calculate the same hash in both places where we need it.

IIUC the config file in capi-image-builder would have some tooling metadata info with binary names and their versions, e.g. something like this (exact format doesn't matter here much):

- name: teleport
  version: 15.5.5
- name: foo
  version: 1.0.0
- name: bar
  version: 2.0.0

And then capi-image-builder would calculate the hash from the above info.

We could maybe calculate the same hash in cluster chart in Helm, as Helm has few hashing functions, like sha256sum and sha1sum, and we're even already using some of these for some other hashing scenarios.

So if in both capi-image-builder and cluster chart we have the same info that we want to hash, and we use the same hash function, then Helm in cluster chart can render the correct OS image name with hash, and it can do it with Release CR with regular versions being the input for hashing.

This would mean that in the Release CR we have names and versions of actual binaries that we bake into the node, and not some abstract variant / OS tooling version. IMO this would be even better, as it makes it even more transparent easier to work with on the releases side.

This would also require that we maintain hashing to work in the same way on both sides, which should not be a problem I think, as this part should not change very often.

nprokopic commented 3 months ago

In which scenario would we need new images with the same version of OS, k8s and OS tooling, but built with the new version of capi-image-builder/image-builder? Did that ever happen already?

If this is the case, then OS image change/overwrite like that should trigger a rolling update in order to replace all the nodes, so we would need a new OS image name (as rewriting OS image in AWS/Azure/etc would not trigger anything on its own) and therefore we would need a new release. Meaning that capi-image-builder itself should then be a part of the Release CR.

AverageMarcus commented 3 months ago

In which scenario would we need new images with the same version of OS, k8s and OS tooling, but built with the new version of capi-image-builder/image-builder? Did that ever happen already?

The runc vulnerability not long back required us to fix the problem in upstream image-builder then update our capi-image-builder to use that version.

So, after being away from my keyboard for a while last night I had time to think over this and realised that we might be overcomplicating things. I think we could make use of the capi-image-builder version as the "tooling" version in both the image name and Release CR and that should be enough.

My reasoning is...

As far as I can think, the things that decide if an image is different or not are as follows:

Kubernetes version - this is passed in at build time and included in the image name already (nothing to do here)
Flatcar version - this is passed in at build time and included in the image name already (nothing to do here)
Teleport version - this is bundled into capi-image-builder and only included when we publish a new release of the container image
Upstream image-builder - this is bundled into capi-image-builder and only included when we publish a new release of the container image
Theoretical future OS tools - this is bundled into capi-image-builder and only included when we publish a new release of the container image
Tekton pipelines / configuration - this is the only thing not versions but generally speaking I don't think it needs to be

Assuming I'm correct with the above, that means we have 3 versions we care about - Kubernetes, Flatcar and capi-image-builder. So a combination of these three in the image name gives us a distinct image.

The release CR could contain these three (semver) versions to then build the image name in the cluster chart. I suggest we call the capi-imge-builder component something along the lines of build-tooling.

Unfortunately this would mean that this...

This would mean that in the Release CR we have names and versions of actual binaries that we bake into the node, and not some abstract variant / OS tooling version. IMO this would be even better, as it makes it even more transparent easier to work with on the releases side.

... is not possible. And while I think it'd be nice to have everything defined in the Release I don't think it'd actually work in practice. It'd be hard to know for sure if a specific combination of all the versions have been built as an image or not and someone could come along and attempt to bump one of the components (like we do with the apps) and assume that the Release handles the installation of that specific component.

We can (and should) still include specific versions in release notes for customers. Just not in the Release CR itself. I think this is a good enough compromise.

One thing I think would be needed is clear documentation on how to appropriately version capi-image-builder going forward and what sort of changes equal what semver level.

The extra benefit of this is its very easy for me to build this. 🤣 It's practically a 1 line change (almost) in the Tekton Pipeline.

Edit:

I just did a little test run (with this WiP PR) to see what this would look like and ended up with the following image:

flatcar-stable-3815.2.5-kube-v1.29.1-tooling-1.14.4-mn-gs

(Note: I used a suffix of mn-gs here so I could easily clean up after, normally the mn- wouldn't be there)

To make use of this new image format in the Releases / cluster app the Release CR would include the 1.14.4 as the build-tooling component version and then the cluster chart would use tooling-[VERSION FROM CR]- in the place where we currently handle the variant.

nprokopic commented 3 months ago

All things considering, including capi-image-builder version in the image name and the Release CR sounds like the best option so far, as it covers all present and known future use cases (like adding new binaries), so AFAIC let's do that.

The extra benefit of this is its very easy for me to build this. 🤣 It's practically a 1 line change (almost) in the Tekton Pipeline.

It's similarly easy on the releases / charts side! :)

It would be great if we can get this done for CAPA v29.0.0 :grimacing:

nprokopic commented 2 months ago

I have one very unimportant question :smile:

flatcar-stable-3815.2.5-kube-v1.29.1-tooling-1.14.4-mn-gs

Why do we have v prefix just before the k8s version, but not before the flatcar (and now also OS tooling) version?

AverageMarcus commented 2 months ago

I have one very unimportant question 😄

flatcar-stable-3815.2.5-kube-v1.29.1-tooling-1.14.4-mn-gs

Why do we have v prefix just before the k8s version, but not before the flatcar (and now also OS tooling) version?

Because that's the tagged versions used by upstream. Flatcar and our github releases don't use a v prefix but Kubernetes does. If needed, we can remove it just from the image name. It's just that we're using the same variables we're using to lookup the packages while building.

Do we want to move ahead with https://github.com/giantswarm/capi-image-builder/pull/328 ?

AverageMarcus commented 2 months ago

I've updated the PR to remove the v prefix. Looks like I was actually mistaken and we were adding it into the name as a hardcoded prefix. 🤷

nprokopic commented 2 months ago

Do we want to move ahead with giantswarm/capi-image-builder#328 ?

Yes! :rocket:

AverageMarcus commented 2 months ago

🎉 I'll get it merged now then.

How do you want to handle changes to the cluster apps? Are y'all going to handle that for a new release?

Are there any version combinations you want me to re-build with the new name format?

nprokopic commented 2 months ago

Are there any version combinations you want me to re-build with the new name format?

For now we will need new images at least for CAPA v29, so k8s 1.29.7, Flatcar 3815.2.5 and whatever is the capi-image-builder version after these changes. Not sure about CAPZ, will check with Turtles.

How do you want to handle changes to the cluster apps? Are y'all going to handle that for a new release?

I am adding a new helper to cluster chart (pushing PR now), then I can try it out in cluster-aws once there is a rebuilt image for CAPA v29.

Since we will have a change in cluster-aws, and newer cluster-aws version can theoretically end up being used also in older CAPA majors (in case some cluster-aws change/fix is needed again for the migration), that means that we might need to rebuild the images for older CAPA majors as well (so v25, v26, v27, v28).

P.S. Will update here for CAPZ. And not sure what k8s/OS versions are CAPV and CAPVCD using, but I guess we will also need the rebuild the images for the latest k8s/OS there.

AverageMarcus commented 2 months ago

Ok, I'll rebuild the CAPA 1.29.7 one for now and will let you know when it's available (and confirm the name).

Once you know what else is needed let me know and I'll get those too :)

AverageMarcus commented 2 months ago

New image name: flatcar-stable-3815.2.5-kube-1.29.7-tooling-1.15.0-gs

nprokopic commented 2 months ago

Oh, it's gonna be challenging to test this, at least in cluster-aws repo PR, as the cluster-aws change also requires a new component in the Release CR, and the e2e tests are using the latest v28 release (and we don't have OS tooling in the existing Release CRs at all).

OTOH it should be possible to test in the releases repo draft PR (with cluster-aws test version) :thinking:

nprokopic commented 2 months ago

Testing cluster chart (and cluster-aws) changes here in the releases repo PR https://github.com/giantswarm/releases/pull/1337 (with a Release CR that is using cluster-aws dev version) :crossed_fingers:

AverageMarcus commented 2 months ago

Just in case, and to get ahead of things, I've kicked off builds for the latest patch release for CAPA for v1.25 up to v1.30 so even for backport Releases we should (once built) have images using the new format available.

Edit: Just done the same for CAPZ too. Hopefully I haven't triggered too much at once 😅

nprokopic commented 2 months ago

cluster-aws have been updated to work with new image names. New cluster-aws version v2.0.0 has the change and it will be included in CAPA v29 release :tada:

@Gacko will also work on the same change in cluster-azure.

nprokopic commented 2 months ago

@giantswarm/team-rocket we will need the same changes in cluster-vsphere and cluster-cloud-director whenever the new OS image is needed and built for those (e.g. due to new Flatcar version or new k8s version).

For reference:

This is how the change is done in cluster-aws https://github.com/giantswarm/cluster-aws/pull/747. cluster-aws is using provider-independent named templates from cluster chart.
Here you can find the docs for the OS-related cluster chart templates that you can use in cluster-vsphere and cluster-cloud-director.
- These templates can be used also without releases in the releases repo, but cluster-\<provider> chart has to use cluster chart, at least to a certain extend where OS-related provider-integration Helm values are set.

AverageMarcus commented 2 months ago

I think everything from Team Tinkerers side needed for this is now complete, yeah? I'm going to move it from our board into done. Let me know if you think I've missed anything or if we need to manually build any more images. 😄

AverageMarcus commented 2 months ago

🙈 OMG! I can't move it to our done column without it closing the issue.

Gacko commented 2 months ago

We're done from our side. So no worries, close it!

AverageMarcus commented 2 months ago

@Gacko But what about this latest comment? https://github.com/giantswarm/roadmap/issues/3592#issuecomment-2273328357

It still needs adding to the other providers, doesn't it?

Gacko commented 2 months ago

Yes, sure. Then we should may re-open and just assign to Rocket? I'll take care about cluster-azure today, so it's completed for Phoenix providers.

Gacko commented 2 months ago

I moved it to Rocket's inbox and removed it from other boards.

gawertm commented 2 months ago

@Gacko I did not read the full thread, but it looks that turtles took care of phoenix providers but for rocket providers it gets assigned to Rocket? Was wondering why is that and if it's related to the use of the shared cluster chart. if so, we might just want to wait until rocket also uses the shared cluster chart?

giantswarm / roadmap