Open fabriziopandini opened 4 years ago
Prior discussion from https://github.com/kubernetes/kubeadm/issues/1698
@timothysc
As a Kubernetes Operator I would like to be able to declaratively control configuration changes and upgrades in a systematic fashion.
@fabriziopandini
IMO the kubeadm operator should be responsible for two things
- In place mutations of kubeadm generated artifacts
- Orchestration of such mutations across nodes

Instead, I think we should consider out of scope everything that falls under the management of infrastructure or relates to the management of "immutable" nodes (where "immutable" = any operation performed by deploying a new node and removing the old one).
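To make the split between "in-place mutation" and "orchestration across nodes" concrete, here is a minimal sketch. All names (`Node`, `Operation`, `reconcile`) are illustrative, not the actual operator API; the point is that the orchestrator walks nodes one at a time and skips nodes already at the desired state, which is what makes the operation declarative and safely re-runnable:

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical model -- names are illustrative, not the real operator API.

@dataclass
class Node:
    name: str
    kubelet_version: str

@dataclass
class Operation:
    """A declarative request: bring every node to the desired kubelet version."""
    desired_kubelet_version: str
    applied_to: List[str] = field(default_factory=list)

def reconcile(op: Operation, nodes: List[Node],
              mutate: Callable[[Node], None]) -> None:
    """Orchestrate the in-place mutation across nodes, one node at a time,
    skipping nodes that already match the desired state (idempotency)."""
    for node in nodes:
        if node.kubelet_version == op.desired_kubelet_version:
            continue
        mutate(node)  # the in-place mutation of kubeadm-generated artifacts
        node.kubelet_version = op.desired_kubelet_version
        op.applied_to.append(node.name)
```

Running `reconcile` a second time is a no-op, which is the property a declarative controller needs when it is restarted mid-rollout.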
@neolit123
my other top question, this can end up being not-so-secure.
For the kubeadm operator, I think we should focus on the first use case, "declaratively control configuration changes", given that this is not supported by kubeadm now and it was a top priority in the recent survey.
Interesting proposal. What I've struggled with over the past year using kubeadm is specifically what I addressed with my own scripts that (procedurally) build clusters (on github: k8smaker). The best practices for configuring a bare-metal cluster are pretty complex; doing so with AWS is too. The preconditioning script depends strongly on the underlying OS involved. But having built it modularly, I can see how the cluster construction (init, join) can be made completely extensible while providing a fully declarative interface to the user.
It requires:
I offer an opinion: I realize this was specifically stated as out-of-scope for this proposal. I'm suggesting it should be the focus instead of day-2 operations. It seems like a lot of k8s admins have a procedure where upgrading an existing production cluster tends to be (much) more dangerous than building a new one. Automating more of the upgrades adds a "magicalness" to that process, which results in the inevitable breakage being more severe rather than less. Whereas automating the construction process drives towards a very desirable workflow for automating and simplifying the process: simply remove nodes from an existing production cluster description and add them to the new cluster.
Thanks for the consideration.
It seems like a lot of k8s admins have a procedure where upgrading an existing production cluster tends to be (much) more dangerous than building a new one.
i think no matter what we do with kubernetes upgrades we will not be able to fully guarantee zero failures to the users, unless this is fully managed by some high level tooling that understand everything that the user has and wants - including node host details, infrastructure availability and all caveats of the current and next k8s version.
kubeadm or the operator can encode some details about the next k8s version or the node host, but that's all.
the so-called "blue / green" cluster upgrades may seem like the better option in the eyes of the user, since the user has the control to scrap the old cluster only once the new cluster is fully working. but they also require infrastructure that some users on self-hosted bare metal simply don't have.
Whereas automating the construction process drives towards a very desirable workflow for automating and simplifying the process: simply remove nodes from an existing production cluster description and add them to the new cluster.
we call these node re-place upgrades and the Cluster API is doing them. your project may have recreated parts of Cluster API, kubespray or kops, which are tools that are higher level than kubeadm.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Mark this issue as rotten with /lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle stale
xref related discussion about cert rotation https://github.com/kubernetes/kubeadm/issues/2652
Not sure if this is the right place to discuss the kubeadm operator. There are some threads in https://github.com/kubernetes/enhancements/issues/2505.
I wrote a simple kubelet-reloader as a tool for the kubeadm operator:
- the reloader watches /usr/bin/kubelet-new;
- once a new kubelet is available there, it replaces /usr/bin/kubelet and restarts kubelet.

Currently kubeadm-operator v0.1.0 can support upgrades across versions, like v1.22 to v1.24:
- the operator upgrades kubectl/kubelet/kubeadm and runs the upgrade;
- it copies the new kubelet to /usr/bin/kubelet-new for the kubelet reloader.

See quick-start.
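The kubelet-reloader flow mentioned above can be sketched roughly as follows. This is a simplified illustration under my own assumptions (polling rather than inotify, paths parameterized for testing); the real tool's behavior may differ:

```python
import os
import subprocess
import time

def swap_kubelet(new_path: str, live_path: str) -> bool:
    """If a staged binary exists at new_path, atomically replace the live
    kubelet binary with it; return True when a restart is needed."""
    if not os.path.exists(new_path):
        return False
    os.replace(new_path, live_path)  # atomic rename on the same filesystem
    return True

def watch_loop(new_path: str = "/usr/bin/kubelet-new",
               live_path: str = "/usr/bin/kubelet",
               interval: float = 5.0) -> None:
    """Poll for a staged kubelet; after swapping, restart the service.
    (A real reloader might use inotify instead of polling.)"""
    while True:
        if swap_kubelet(new_path, live_path):
            subprocess.run(["systemctl", "restart", "kubelet"], check=True)
        time.sleep(interval)
```

The atomic `os.replace` matters here: kubelet must never observe a half-written binary at its path.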
Some thoughts on the next steps
Kubeadm, being a CLI, does not play well with declarative approaches/git-ops workflows.
Assuming that kubeadm is divided into two main parts:
This issue is about collecting ideas and defining a viable path for making 2 possible using declarative approaches, sometimes also referred to as in-place mutations.
For this first iteration, I consider 1 out of scope, mainly because bootstrapping nodes with a declarative approach is already covered by Cluster API, and it is clearly out of the scope of kubeadm.
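As a rough illustration of what a declarative, in-place mutation flow could look like (the flat-dict configuration and function name below are hypothetical, not kubeadm's actual API): a controller would diff the desired configuration against the observed one and derive only the changes that must be applied in place.

```python
from typing import Any, Dict, List, Tuple

def plan_mutations(current: Dict[str, Any],
                   desired: Dict[str, Any]) -> List[Tuple[str, Any, Any]]:
    """Compare current vs. desired (flat) configuration and return the
    list of (key, old_value, new_value) changes to apply in place."""
    actions = []
    for key, want in desired.items():
        if current.get(key) != want:
            actions.append((key, current.get(key), want))
    return actions
```

A reconciler built this way naturally converges: once the mutations are applied, the next diff is empty and nothing further happens.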