C seems like the simplest solution, but I'd love to hear more about A. I think we really have a couple of use cases: stacked control plane nodes scale out to some number n of nodes before etcd needs dedicated hosts, and at that point it would be great if we had a path to switch to external/dedicated hosts.
I'd rule out B and D for now unless there is a compelling reason to add that complexity.
@chuckha
I'd love to hear more about A (Stacked etcd should be a “transparent” evolution of current local etcd mode)
From what I understand, stacked etcd is an etcd instance like local etcd, with the difference that it listens on a public IP instead of 127.0.0.1 and it has a bunch of additional flags/certificate SANs. Why not change the local etcd static pod manifest to be equal to the stacked etcd manifest?
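For illustration, a rough sketch of what this could look like in the kubeadm config, assuming the etcd.local serverCertSANs/extraArgs fields (field names follow later config versions and are only indicative):

apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
etcd:
  local:
    serverCertSANs:
    - "10.10.10.11"
    peerCertSANs:
    - "10.10.10.11"
    extraArgs:
      # make the local member reachable from the other control plane nodes
      listen-client-urls: "https://127.0.0.1:2379,https://10.10.10.11:2379"
      advertise-client-urls: "https://10.10.10.11:2379"
      listen-peer-urls: "https://10.10.10.11:2380"
      initial-advertise-peer-urls: "https://10.10.10.11:2380"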
Does this sound reasonable to you?
it would be great if we had a path to switch to external/dedicated hosts
Great suggestions, let's keep this in mind as well
A) Stacked etcd should be a “transparent” evolution of current local etcd mode
If I'm understanding this option, it would basically just extend the existing local etcd mode to support the additional flags, SANs, etc that the stacked deployment currently uses and is mainly about providing an upgrade path for existing local etcd-based deployments rather than providing HA support itself. Is that correct?
That said, it would require config changes to make it work, since we would need to expand the per-node configuration to include etcd config/overrides for things such as which IP, hostname, or SANs to use (if the defaults are not sufficient).
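Purely as a strawman (none of these fields exist today), such a per-node override could look something like:

# hypothetical per-node override, for illustration only
apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration
etcdLocal:                       # hypothetical field
  advertiseAddress: 10.10.10.12  # hypothetical field
  serverCertSANs:                # hypothetical field
  - "master2.example.com"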
B) users will be requested to explicitly opt in to stacked etcd, e.g. by using a dedicated config type
I don't like this option as it requires users to make a decision for HA/non-HA support before starting.
C) The number of stacked etcd members should be “tied” to the number of control plane instances
+1 for this; if there is a need to have a different number of etcd hosts vs control plane instances, then external etcd should be used instead.
D) we would like etcd scaling to be separated from control plane scaling (e.g. kubeadm join --etcd)
While I could see some value in this, the ability to use it would be limited since we don't provide a way to init a single etcd instance. I would expect that workflow to look like the following:
Where the entire etcd cluster is bootstrapped prior to bootstrapping control plane instances. That workflow would require kubeadm to have access to the etcd client certificate in order to manipulate etcd, which is not currently the case. I'm not exactly sure how we are currently handling this for extending the control plane.
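Roughly, the sequence would be something like this (command names are purely illustrative; kubeadm has no etcd-only bootstrap today):

# hypothetical workflow, for illustration only
kubeadm init --etcd-only        # bootstrap the first etcd member (illustrative flag)
kubeadm join --etcd             # add the remaining etcd members (flag from option D)
kubeadm init                    # bootstrap the first control plane node against that etcd cluster
kubeadm join --control-plane    # add the remaining control plane nodes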
The nice thing about this approach is that it would simplify the external etcd story as well, but I think it should be in addition to C rather than in place of C if we support that workflow. I think we'd also probably want to break them out into separate high level commands, since we wouldn't necessarily be fully configuring the kubelets to join the overall cluster in that use case.
@detiber happy to see we are on that same page here!
it would basically just extend the existing local etcd mode ... rather than providing HA support itself
Yes, but with the addition that, before adding a new etcd member, we are going to call etcdctl member add on one of the existing members.
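For illustration, this corresponds to running something like the following against an existing member before the new etcd static pod starts (etcdctl v3 syntax; kubeadm would use the etcd client library, so this is just a sketch, and the choice of client certificate here is an assumption):

ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.10.10.11:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  member add master2 --peer-urls=https://10.10.10.12:2380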
This will increase the HA of the cluster, with the caveat that each API server uses only the etcd endpoint of its own local etcd (instead of the list of etcd endpoints). So if an etcd member fails, all the control plane components on the same node will fail and everything will be switched to another control plane node.
NB. This can be improved to a certain extent by passing the API server the list of etcd endpoints known at the moment of join.
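In kube-apiserver flag terms, the difference is roughly:

# what the approach above implies: each API server talks only to its local member
- --etcd-servers=https://10.10.10.11:2379
# what the NB suggests: pass the list of endpoints known at join time
- --etcd-servers=https://10.10.10.11:2379,https://10.10.10.12:2379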
it would require config changes to make it work
Yes, but I consider these changes less invasive than creating a whole new etcd type. On top of that, I think we can use the advertise address and hostname as reasonable defaults, so the user will be required to set additional config options only in a few cases.
I think we'd also probably want to break them out into separate high level commands
I think it should be in addition to C rather than in place of C
+1. If we want to have a sound story around etcd alone, this should be addressed properly. For the time being I will be more than happy to improve the story about the control plane and the etcd tied to it, which has been part of kubeadm since its inception.
@fabriziopandini For the issue with the control plane being fully dependent on the local etcd, there is an issue to track the lack of etcd auto sync support within Kubernetes itself: https://github.com/kubernetes/kubernetes/issues/64742
/lifecycle active
@detiber @chuckha @timothysc I have a working prototype of the approach discussed above 😃
kubeadm init
> creates a local etcd instance similar to the one described here. The main difference vs now is that it uses another IP address instead of 127.0.0.1
- etcd
- --advertise-client-urls=https://10.10.10.11:2379
- --initial-advertise-peer-urls=https://10.10.10.11:2380
- --initial-cluster=master1=https://10.10.10.11:2380
- --listen-client-urls=https://127.0.0.1:2379,https://10.10.10.11:2379
- --listen-peer-urls=https://10.10.10.11:2380
....
kubeadm join
--control-plane > adds a second etcd instance similar to the one described here. When joining, the etcd manifest is slightly different: --initial-cluster contains all the existing etcd members plus the joining one, and the --initial-cluster-state flag is set to existing
- etcd
- --initial-cluster=master1=https://10.10.10.11:2380,master2=https://10.10.10.12:2380
- --initial-cluster-state=existing
....
So far so good.
Now the tricky question. kubeadm upgrade
....
When kubeadm executes upgrades it will recreate the etcd manifest. Are there any settings I should take care of because I'm upgrading an etcd cluster instead of a single etcd instance? More specifically, are there any recommended values for --initial-cluster and --initial-cluster-state, or can I simply not care because my etcd cluster already exists and I'm basically only changing the etcd binary?
@detiber @chuckha @timothysc from the CoreOS etcd docs:
--initial prefix flags are used in bootstrapping (static bootstrap, discovery-service bootstrap or runtime reconfiguration) a new member, and ignored when restarting an existing member.
So it doesn't matter which values I assign to --initial-cluster and --initial-cluster-state. Considering this, my idea is to keep the upgrade workflow "simple" and generate the new etcd manifest without populating --initial-cluster with the full list of etcd members.
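Concretely, under this approach the regenerated manifest on master1 would keep only the local member, e.g. (a sketch, not the actual output):

- etcd
- --initial-cluster=master1=https://10.10.10.11:2380
- --initial-cluster-state=new
....

Both --initial-* flags would simply be ignored on restart because the member's data dir already exists.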
Opinions?
Last bit: what IP address should we use for etcd? If we are going to use the API server advertise address for etcd as well, this will simplify things a lot...
/close
@fabriziopandini: Closing this issue.
Stacked etcd is a manual procedure described in https://kubernetes.io/docs/setup/independent/high-availability/.
However, kubeadm could automate the stacked etcd procedure as a new step of the kubeadm join --control-plane workflow. Some design decisions should be taken before implementing.
Considering the goal of keeping kubeadm simple and maintainable, IMO the preferred options are A) and C)… wdyt?
cc @detiber @chuckha @timothysc