kubernetes / kubeadm

Aggregator for issues filed against kubeadm
Apache License 2.0

RFE: use learner mode for joining etcd members #1793

Open fabriziopandini opened 5 years ago

fabriziopandini commented 5 years ago

Growing a local etcd cluster is a complex operation, and in the past we have already hit issues such as https://github.com/kubernetes-sigs/kind/issues/588

Now that the implementation of the etcd learner mode is progressing, we should start considering whether to use it in kubeadm in order to make the join --control-plane implementation more robust.

at a high level what we would like to achieve is:

Ref docs:


(edit by neolit123)

1.26:

1.27(alpha):

1.29(beta):

1.32(GA):

1.33:

neolit123 commented 10 months ago

I suppose that we should graduate this feature later in v1.31+ and get more feedback before GA.

So no action item for v1.30.

i got contacted in slack by a person that had feedback about learner mode in kubeadm, but they never sent me the info. learner mode was broken for them in some way.

i will see if i can message them about this after NY.

pacoxu commented 10 months ago

One issue I can imagine is a timeout, either while waiting for the learner to become ready for promotion or during the promotion itself.

pacoxu commented 8 months ago

https://github.com/kubernetes/kubeadm/issues/2997#issuecomment-1899856003 We had a short discussion about whether we need to log the sync progress as a percentage.

https://github.com/kubernetes/kubeadm/issues/2997#issuecomment-1899805047 Another potential improvement is adding a configurable timeout for waiting until the etcd learner is ready for promotion. There are already a lot of timeout settings in the v1beta4 timeouts struct. (+0 for this, as 2 min should be enough for most scenarios.)

neolit123 commented 8 months ago

We had a short discussion about whether we need to log the sync progress as a percentage.

ok, i don't think it's GA blocking.

Another potential improvement is adding a configurable timeout for waiting until the etcd learner is ready for promotion. There are already a lot of timeout settings in the v1beta4 timeouts struct. (+0 for this, as 2 min should be enough for most scenarios.)

+0 as well from me. our 2 minute timeout will apply to all etcd client calls by default.

neolit123 commented 8 months ago

i got contacted in slack by a person that had feedback about learner mode in kubeadm, but they never sent me the info. learner mode was broken for them in some way.

they did not log an issue...

pacoxu commented 8 months ago

I updated the beta-related PRs in this issue's description.

I think we should wait at least another 1 or 2 release cycles for feedback before making this GA, as most users are not using v1.29 yet, which is the release that made it beta and enabled by default.

pacoxu commented 6 months ago

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kinder-upgrade-addons-before-controlplane-1-29-latest/1777803167627481088

A flake:

I0409 21:02:09.391263     303 local.go:165] Updated etcd member list: [{kinder-upgrade-addons-before-controlplane-control-plane-2 https://172.17.0.3:2380/} {kinder-upgrade-addons-before-controlplane-control-plane-1 https://172.17.0.2:2380/}]
I0409 21:02:09.428719     303 etcd.go:508] [etcd] Promoting a learner as a voting member: a8f2efe87cf50990
{"level":"warn","ts":"2024-04-09T21:02:09.438938Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007de1c0/172.17.0.2:2379","attempt":0,"error":"rpc error: code = FailedPrecondition desc = etcdserver: can only promote a learner member which is in sync with leader"}
I0409 21:02:09.439065     303 etcd.go:533] [etcd] Promoting the learner a8f2efe87cf50990 failed: etcdserver: can only promote a learner member which is in sync with leader
{"level":"warn","ts":"2024-04-09T21:02:09.547427Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007de1c0/172.17.0.2:2379","attempt":0,"error":"rpc error: code = FailedPrecondition desc = etcdserver: can only promote a learner member which is in sync with leader"}
I0409 21:02:09.547508     303 etcd.go:533] [etcd] Promoting the learner a8f2efe87cf50990 failed: etcdserver: can only promote a learner member which is in sync with leader
{"level":"warn","ts":"2024-04-09T21:02:09.700675Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007de1c0/172.17.0.2:2379","attempt":0,"error":"rpc error: code = FailedPrecondition desc = etcdserver: can only promote a learner member which is in sync with leader"}
I0409 21:02:09.700752     303 etcd.go:533] [etcd] Promoting the learner a8f2efe87cf50990 failed: etcdserver: can only promote a learner member which is in sync with leader
{"level":"warn","ts":"2024-04-09T21:02:09.942281Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007de1c0/172.17.0.2:2379","attempt":0,"error":"rpc error: code = FailedPrecondition desc = etcdserver: can only promote a learner member which is in sync with leader"}
I0409 21:02:09.942448     303 etcd.go:533] [etcd] Promoting the learner a8f2efe87cf50990 failed: etcdserver: can only promote a learner member which is in sync with leader
{"level":"warn","ts":"2024-04-09T21:02:10.307133Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007de1c0/172.17.0.2:2379","attempt":0,"error":"rpc error: code = FailedPrecondition desc = etcdserver: can only promote a learner member which is in sync with leader"}
I0409 21:02:10.307218     303 etcd.go:533] [etcd] Promoting the learner a8f2efe87cf50990 failed: etcdserver: can only promote a learner member which is in sync with leader
{"level":"warn","ts":"2024-04-09T21:02:10.832218Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007de1c0/172.17.0.2:2379","attempt":0,"error":"rpc error: code = FailedPrecondition desc = etcdserver: can only promote a learner member which is in sync with leader"}
I0409 21:02:10.832305     303 etcd.go:533] [etcd] Promoting the learner a8f2efe87cf50990 failed: etcdserver: can only promote a learner member which is in sync with leader
I0409 21:02:11.660333     303 etcd.go:530] [etcd] The learner was promoted as a voting member: a8f2efe87cf50990
[etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s

40s timeout for waiting to be ready to promote a learner.

EDITED: I misunderstood the log here.

I0409 21:02:10.832305     303 etcd.go:533] [etcd] Promoting the learner a8f2efe87cf50990 failed: etcdserver: can only promote a learner member which is in sync with leader
I0409 21:02:11.660333     303 etcd.go:530] [etcd] The learner was promoted as a voting member: a8f2efe87cf50990

The promotion succeeded later, and the job failed because a node was not ready. I will dig into it.

neolit123 commented 6 months ago

40s timeout for waiting to be ready to promote a learner.

should we increase this time to 2 minutes, or more by default?

pacoxu commented 6 months ago

https://github.com/kubernetes/kubernetes/blob/227c2e7c2b2c05a9c8b2885460e28e4da25cf558/cmd/kubeadm/app/util/etcd/etcd.go#L531-L557

It is already 2m.

I missed the log line showing that the learner was eventually promoted as a voting member. Sorry for the noise.

neolit123 commented 6 months ago

the flakes on https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kinder-upgrade-addons-before-controlplane-1-29-latest seem like slow infra problems, 5 minutes should be plenty of time for a few nodes to join and be ready :/

neolit123 commented 3 months ago

I think we should wait at least another 1 or 2 release cycles for feedback before making this GA, as most users are not using v1.29 yet, which is the release that made it beta and enabled by default.

@pacoxu should we GA this in 1.32?

pacoxu commented 3 months ago

Agree