Open fabriziopandini opened 5 years ago
I suppose that we should graduate this feature later in v1.31+ and get more feedback before GA.
So no action item for v1.30.
i got contacted in slack by a person that had feedback about learner mode in kubeadm, but they never sent me the info. learner mode was broken for them in some way.
i will see if i can message them about this after NY.
One issue I can imagine is that the timeout for waiting until the learner is ready for promotion, or for the promotion step itself, may be a problem.
https://github.com/kubernetes/kubeadm/issues/2997#issuecomment-1899856003 We had a short discussion about whether we need to log the sync progress as a percentage.
https://github.com/kubernetes/kubeadm/issues/2997#issuecomment-1899805047 Another potential improvement is adding a configurable timeout for waiting until the etcd learner is ready for promotion. There are already a lot of timeout settings in the v1beta4 timeouts struct. (+0 for this, as 2 min should be enough for most scenarios.)
> We had a short discussion about whether we need to log the sync progress as a percentage.
ok, i don't think it's GA blocking.
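For reference, a minimal sketch of what logging the sync progress could look like. This is not kubeadm or etcd code: the function names, the log wording, and the idea of comparing two raft log indexes are assumptions made for illustration only.

```go
package main

import "fmt"

// syncProgress returns how far a learner has caught up with the leader,
// as a percentage in [0, 100]. Both arguments are hypothetical raft log
// indexes; how they would be obtained is out of scope for this sketch.
func syncProgress(learnerIndex, leaderIndex uint64) float64 {
	// An empty leader log or a learner at or past the leader counts as fully synced.
	if leaderIndex == 0 || learnerIndex >= leaderIndex {
		return 100
	}
	return float64(learnerIndex) / float64(leaderIndex) * 100
}

// progressMessage formats a log line in a hypothetical format that
// imitates the "[etcd]" prefix used by the existing kubeadm messages.
func progressMessage(memberID, learnerIndex, leaderIndex uint64) string {
	return fmt.Sprintf("[etcd] Learner %x is %.1f%% in sync with the leader",
		memberID, syncProgress(learnerIndex, leaderIndex))
}
```

Calling `progressMessage` once per promotion retry would show whether the learner is still catching up or is stuck.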
> Another potential improvement is adding a configurable timeout for waiting until the etcd learner is ready for promotion. There are already a lot of timeout settings in the v1beta4 timeouts struct. (+0 for this, as 2 min should be enough for most scenarios.)
+0 as well from me. our 2 minutes timeout will apply to all etcd client calls by default.
> i got contacted in slack by a person that had feedback about learner mode in kubeadm, but they never sent me the info. learner mode was broken for them in some way.
they did not log an issue...
I updated beta related PRs in this issue description.
I think we may want to wait at least another 1 or 2 release cycles for feedback before making this GA, as most users are not using v1.29 yet, which is the release that made it beta and enabled by default.
A flake:
I0409 21:02:09.391263 303 local.go:165] Updated etcd member list: [{kinder-upgrade-addons-before-controlplane-control-plane-2 https://172.17.0.3:2380/} {kinder-upgrade-addons-before-controlplane-control-plane-1 https://172.17.0.2:2380/}]
I0409 21:02:09.428719 303 etcd.go:508] [etcd] Promoting a learner as a voting member: a8f2efe87cf50990
{"level":"warn","ts":"2024-04-09T21:02:09.438938Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007de1c0/172.17.0.2:2379","attempt":0,"error":"rpc error: code = FailedPrecondition desc = etcdserver: can only promote a learner member which is in sync with leader"}
I0409 21:02:09.439065 303 etcd.go:533] [etcd] Promoting the learner a8f2efe87cf50990 failed: etcdserver: can only promote a learner member which is in sync with leader
{"level":"warn","ts":"2024-04-09T21:02:09.547427Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007de1c0/172.17.0.2:2379","attempt":0,"error":"rpc error: code = FailedPrecondition desc = etcdserver: can only promote a learner member which is in sync with leader"}
I0409 21:02:09.547508 303 etcd.go:533] [etcd] Promoting the learner a8f2efe87cf50990 failed: etcdserver: can only promote a learner member which is in sync with leader
{"level":"warn","ts":"2024-04-09T21:02:09.700675Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007de1c0/172.17.0.2:2379","attempt":0,"error":"rpc error: code = FailedPrecondition desc = etcdserver: can only promote a learner member which is in sync with leader"}
I0409 21:02:09.700752 303 etcd.go:533] [etcd] Promoting the learner a8f2efe87cf50990 failed: etcdserver: can only promote a learner member which is in sync with leader
{"level":"warn","ts":"2024-04-09T21:02:09.942281Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007de1c0/172.17.0.2:2379","attempt":0,"error":"rpc error: code = FailedPrecondition desc = etcdserver: can only promote a learner member which is in sync with leader"}
I0409 21:02:09.942448 303 etcd.go:533] [etcd] Promoting the learner a8f2efe87cf50990 failed: etcdserver: can only promote a learner member which is in sync with leader
{"level":"warn","ts":"2024-04-09T21:02:10.307133Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007de1c0/172.17.0.2:2379","attempt":0,"error":"rpc error: code = FailedPrecondition desc = etcdserver: can only promote a learner member which is in sync with leader"}
I0409 21:02:10.307218 303 etcd.go:533] [etcd] Promoting the learner a8f2efe87cf50990 failed: etcdserver: can only promote a learner member which is in sync with leader
{"level":"warn","ts":"2024-04-09T21:02:10.832218Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007de1c0/172.17.0.2:2379","attempt":0,"error":"rpc error: code = FailedPrecondition desc = etcdserver: can only promote a learner member which is in sync with leader"}
I0409 21:02:10.832305 303 etcd.go:533] [etcd] Promoting the learner a8f2efe87cf50990 failed: etcdserver: can only promote a learner member which is in sync with leader
I0409 21:02:11.660333 303 etcd.go:530] [etcd] The learner was promoted as a voting member: a8f2efe87cf50990
[etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s
There is a 40s timeout for waiting until a learner is ready to be promoted.
EDITED: I misunderstood the log here.
I0409 21:02:10.832305 303 etcd.go:533] [etcd] Promoting the learner a8f2efe87cf50990 failed: etcdserver: can only promote a learner member which is in sync with leader
I0409 21:02:11.660333 303 etcd.go:530] [etcd] The learner was promoted as a voting member: a8f2efe87cf50990
The promotion succeeded later, and the job failed because a node was not ready. I will dig into it.
> There is a 40s timeout for waiting until a learner is ready to be promoted.
should we increase this time to 2 minutes, or more by default?
already 2m.
I missed the log line "The learner was promoted as a voting member".
It succeeded in the end. Sorry for the noise.
the flakes on https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kinder-upgrade-addons-before-controlplane-1-29-latest seem like slow infra problems, 5 minutes should be plenty of time for a few nodes to join and be ready :/
> I think we may want to wait at least another 1 or 2 release cycles for feedback before making this GA, as most users are not using v1.29 yet, which is the release that made it beta and enabled by default.
@pacoxu should we GA this in 1.32?
Agree
Growing a local etcd cluster is a complex operation, and in the past we have already faced issues such as https://github.com/kubernetes-sigs/kind/issues/588
Now that the implementation of the etcd learner mode is progressing, we should start considering whether to use it in kubeadm in order to make the join --control-plane implementation more robust.
at a high level what we would like to achieve is:
Ref docs:
(edit by neolit123)
1.26:
1.27(alpha):
1.29(beta):
1.32(GA):
1.33: