siderolabs / omni

SaaS-simple deployment of Kubernetes - on your own hardware.

[proposal] Support for installing Kubernetes apps using Omni #622

Open smira opened 6 days ago

smira commented 6 days ago

Rationale

Omni allows a cluster to be defined fully via cluster templates: installing machines, bringing them into the cluster, and asserting that they are ready and healthy. Cluster templates also allow configuring Talos Linux (and, transitively, Kubernetes).

Sometimes there's an additional requirement to get the cluster up and running, e.g.:

Today the only way to install Kubernetes apps is the Talos machine config "extra bootstrap manifests" feature. This feature is not based on Helm, so the installed manifests are not tracked as Helm releases and can't easily be managed by Helm later. It also adds unnecessary bloat to the Talos machine configuration.

Omni is well positioned to manage Kubernetes apps in the cluster: it is a single instance (vs. Talos controlplane machines, of which a cluster can have several), it already has information about cluster health (so it knows when it's safe to install), and it already has a language to describe the cluster (cluster templates).

Proposed Solution

As much as we are not happy with Helm, it is the de facto standard.

For the initial phase, to simplify things, let's limit ourselves to the initial installation of Helm charts (skipping upgrades, changing chart values, etc.), as this is simpler, less risky, and solves the immediate problem of fully bootstrapping the cluster. In future work, we might support updating charts as well.

As cluster templates are plain-text YAML files, we should try to preserve this simple approach, which is friendly to version control, expansion, templating, etc. The proposal is to use Helmfile as the language to describe what has to be installed.

We can add a strategy field and require it to be set to bootstrap-only, indicating that for now the charts are installed only once.
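To make the strategy requirement concrete, here is a minimal sketch of the validation Omni could run when a template is synced. The names (`ALLOWED_STRATEGIES`, `HelmSection`, `validate_helm_section`) and the shape of the input dict are hypothetical and not part of any existing Omni API; the only thing taken from the proposal is that strategy must be set and must be bootstrap-only.

```python
from dataclasses import dataclass

# Assumption: in the initial phase only one strategy value is accepted.
ALLOWED_STRATEGIES = {"bootstrap-only"}


@dataclass
class HelmSection:
    """Hypothetical parsed form of the Helm part of a cluster template."""
    strategy: str
    helmfile: str  # raw Helmfile text, passed through untouched


def validate_helm_section(section: dict) -> HelmSection:
    """Reject templates that omit strategy or use an unsupported value."""
    strategy = section.get("strategy")
    if strategy is None:
        raise ValueError("strategy is required")
    if strategy not in ALLOWED_STRATEGIES:
        raise ValueError(
            f"unsupported strategy {strategy!r}, expected one of {sorted(ALLOWED_STRATEGIES)}"
        )
    return HelmSection(strategy=strategy, helmfile=section.get("helmfile", ""))
```

Forcing the field to be present, even with a single allowed value, means new strategies can be added later without changing the meaning of existing templates.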

The initial scope is to support only charts available to Omni without auth or special setup; that is, Omni should be able to download the charts from public repositories.

Cluster templates should sync the Helm instructions to an Omni resource (per cluster) describing charts to be installed.

Omni should have a controller which watches cluster status and, as soon as the cluster is ready (Kubernetes API is available), performs the Helm installation. Omni keeps the status of the install; if the install is done and the strategy is bootstrap-only, it skips any further work on this cluster/Helm chart.
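The decision the controller makes on each reconcile can be sketched roughly as below. This is only an illustration of the rule described above, not the actual controller; `ClusterState`, `Action`, and `decide` are hypothetical names, and the real implementation would read these values from Omni resources.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    WAIT = "wait"        # cluster not ready yet, try again later
    INSTALL = "install"  # run the Helm installation now
    SKIP = "skip"        # nothing to do for this cluster/chart


@dataclass
class ClusterState:
    """Hypothetical snapshot of what the controller knows about a cluster."""
    kubernetes_api_available: bool
    install_done: bool
    strategy: str = "bootstrap-only"


def decide(state: ClusterState) -> Action:
    # A finished bootstrap-only install is never touched again.
    if state.install_done and state.strategy == "bootstrap-only":
        return Action.SKIP
    # Installing before the Kubernetes API is up cannot succeed.
    if not state.kubernetes_api_available:
        return Action.WAIT
    return Action.INSTALL
```

Checking the stored install status first means a cluster that later becomes unhealthy does not trigger a reinstall.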

Omni might keep a cache of downloaded Helm charts.

Future Work

utkuozdemir commented 6 days ago

I like it. My only concern would be Helmfile - I used it for a while on my homelab some years ago, but hit some issues and stopped using it. But I hope it's way better now, as it is actively developed.

My initial idea was to use Flux CD for it, but maybe we leave it to the cluster operators as it is way more complex and CRD based - Helmfile seems to give us the declarative language we need, without entering into the CRDs territory.

Another item in the future work could be, although it is loosely related, some sort of secrets management for these workloads.

smira commented 6 days ago

> Another item in the future work could be, although it is loosely related, some sort of secrets management for these workloads.

Yes, there's an issue: #572.

I think we should support sops, includes and templating in cluster templates (but that deserves a separate issue)

smira commented 6 days ago

Potential problems:

To avoid upgrades for each iteration of helm, the helmfile executable delegates to helm - as a result, helm must be installed.

rsmitty commented 6 days ago

One random thought I had earlier about longer term implementation here. We should take care to design how we'll sync all clusters if we decide to support ongoing rollouts. In the case of, say, 1000+ clusters using this feature, we should make sure that if we sync every 15m or so we should have some random splay or batching or some other mechanism so that Omni isn't trying to update all 1000+ at once.

Totally not for the initial work here, but just wanted to capture it somewhere.
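One possible shape for the splay, as a rough sketch: derive a stable per-cluster offset inside the sync interval from a hash of the cluster ID, so that clusters spread across the window instead of all syncing at the same moment. The function name and the 900-second default (the "15m or so" above) are assumptions for illustration only.

```python
import hashlib


def splay_offset_seconds(cluster_id: str, interval_seconds: int = 900) -> int:
    """Return a stable offset in [0, interval_seconds) for this cluster.

    Hashing the ID (instead of using a random number) keeps the offset
    the same across Omni restarts, so a cluster's sync slot does not move.
    """
    digest = hashlib.sha256(cluster_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % interval_seconds
```

A cluster would then sync at `interval_start + splay_offset_seconds(cluster_id)` rather than at `interval_start`.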

smira commented 5 days ago

> Totally not for the initial work here, but just wanted to capture it somewhere.

Good point. This should mostly work by design: the controller has a fixed set of worker slots, so the concurrency of the operation is bounded by the number of slots in the controller applying Helmfiles.
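To illustrate the idea of fixed worker slots (this is not the Omni controller runtime, just a sketch of the same bound using a thread pool), the number of Helmfile applications in flight never exceeds the slot count, no matter how many clusters are queued. `apply_helmfiles`, `apply_fn`, and the default of 4 slots are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def apply_helmfiles(cluster_ids, apply_fn, slots: int = 4):
    """Apply Helmfiles for all clusters using at most `slots` workers.

    `apply_fn` is a stand-in for whatever performs the install for one
    cluster. Results come back in the same order as `cluster_ids`.
    """
    with ThreadPoolExecutor(max_workers=slots) as pool:
        return list(pool.map(apply_fn, cluster_ids))
```

With 1000+ clusters queued, only `slots` of them are being processed at any time; the rest wait for a free worker.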

smira commented 4 days ago

I can't say that I like it, but another idea might be to run something like helmfile-controller inside the workload cluster, configured by a ConfigMap, for example. Omni simply pushes the ConfigMap and waits for the controller to do its job.

This might simplify some requirements (e.g. having different versions of the controller, or having access to private Helm charts), but it takes away some resources from the workload cluster, the controller has to run with host networking (to install CNI), etc.

utkuozdemir commented 4 days ago

> I can't say that I like it, but another idea might be to run something like helmfile-controller inside the workload cluster configured by a ConfigMap for example, and Omni simply pushes the ConfigMap, and waits for the controller to do its job.
>
> This might simplify some requirements (e.g. having different versions of the controller), or having access to private helm charts, but it takes away some resources from the workload cluster, the controller has to run with host networking (to install CNI), etc.

If we decide to go that route, we could use Flux instead. I'd rather leave those things to the cluster operator and do the Helmfile part completely from Omni, so the clusters would stay "vanilla".