neolit123 opened this issue 4 years ago
first PR is here: https://github.com/kubernetes/kubernetes/pull/94398
we spoke about the kubelet.conf in the office hours today:
i'm going to experiment and see how it goes, but this cannot be backported to older releases as it is a breaking change to phase users.
This breaks the rules; the controlPlaneEndpoint may be a domain, and if it is a domain, it will not work correctly after your change.
can you clarify with examples?
@jdef added a note that some comments were left invalid after the recent change: https://github.com/kubernetes/kubernetes/pull/94398/files/d9441906c4155173ce1a75421d8fcd1d2f79c471#r486252360
this should be fixed in master.
someone else added a comment on https://github.com/kubernetes/kubernetes/pull/94398 but later deleted it:
when using the method CreateJoinControlPlaneKubeConfigFiles with a controlPlaneEndpoint like apiserver.cluster.local to generate the config files, and then running kubeadm init --config=/root/kubeadm-config.yaml --upload-certs -v 5, the following error occurs:
I0910 15:15:54.436430 52511 kubeconfig.go:84] creating kubeconfig file for controller-manager.conf
currentConfig.Clusters[currentCluster].Server: https://apiserver.cluster.local:6443
config.Clusters[expectedCluster].Server: https://192.168.160.243:6443
a kubeconfig file "/etc/kubernetes/controller-manager.conf" exists already but has got the wrong API Server URL
this validation should be turned into a warning instead of an error. then components would fail if they don't point to a valid API server, so the user would know.
This breaks the rules; the controlPlaneEndpoint may be a domain, and if it is a domain, it will not work correctly after your change.
can you clarify with examples?
you could see this doc https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/#steps-for-the-first-control-plane-node
--control-plane-endpoint "LOAD_BALANCER_DNS:LOAD_BALANCER_PORT"
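(For reference, a minimal kubeadm config for this kind of setup might look like the sketch below; the kubeadm.k8s.io/v1beta2 apiVersion is an assumption based on the v1.19 timeframe, while the endpoint and config path come from the report above.)
# minimal sketch of a kubeadm config using a DNS-based controlPlaneEndpoint
# (apiVersion is an assumption; endpoint and path are taken from the report above)
cat >/root/kubeadm-config.yaml <<'EOF'
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
controlPlaneEndpoint: "apiserver.cluster.local:6443"
EOF
kubeadm init --config=/root/kubeadm-config.yaml --upload-certs -v 5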
i do know about that doc. are you saying that using "DNS-name:port" is completely broken now for you? what error output are you seeing? i did test this during my work on the changes and it worked fine.
this validation should be turned into a warning instead of an error. then components would fail if they don't point to a valid API server, so the user would know.
yes, please. this just bit us when testing a workaround in a pre-1.19.1 cluster, whereby we tried manually updating clusters[].cluster.server in scheduler.conf and controller-manager.conf to point to localhost instead of the official control plane endpoint.
i do know about that doc. are you saying that using "DNS-name:port" is completely broken now for you?
yes, if you want to deploy an HA cluster, it is best to set controlPlaneEndpoint to the LOAD_BALANCER_DNS instead of the LOAD_BALANCER IP
what error are you getting?
I added some code for log printing; this is the error:
I0910 13:14:53.017570 21006 kubeconfig.go:84] creating kubeconfig file for controller-manager.conf
currentConfig.Clusters https://apiserver.cluster.local:6443
config.Clusters: https://192.168.160.243:6443
error execution phase kubeconfig/controller-manager: a kubeconfig file "/etc/kubernetes/controller-manager.conf" exists already but has got the wrong API Server URL
ok, so you have the same error as the user reporting above.
we can fix this for 1.19.2
one workaround is:
Both kube-scheduler and kube-controller-manager can use either localhost or the load balancer to connect to kube-apiserver, but users cannot be forced to use localhost, and a warning can be used instead of an error
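(A minimal sketch of that manual re-pointing, assuming the default kubeadm paths, the default cluster name "kubernetes" and a local API server on port 6443:)
# re-point the scheduler and controller-manager kubeconfigs at the local API server
# (default kubeadm paths, cluster name "kubernetes" and port 6443 are assumptions)
for f in /etc/kubernetes/scheduler.conf /etc/kubernetes/controller-manager.conf; do
  kubectl --kubeconfig "$f" config set-cluster kubernetes --server=https://127.0.0.1:6443
done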
@neolit123 I'm +1 to relax the checks on the address in the existing kubeconfig file. We can either remove the check or make it more flexible by checking if the address is either CPE or LAPI
@neolit123 here is the example. i just edited it to add a log print: https://github.com/neolit123/kubernetes/blob/d9441906c4155173ce1a75421d8fcd1d2f79c471/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go#L225
fmt.Println("currentConfig.Clusters[currentCluster].Server:", currentConfig.Clusters[currentCluster].Server, "\nconfig.Clusters[expectedCluster].Server: ", config.Clusters[expectedCluster].Server)
I use the method CreateJoinControlPlaneKubeConfigFiles with controlPlaneEndpoint to generate the kube-scheduler and kube-controller-manager kubeconfigs. In this situation controlPlaneEndpoint is set to LOAD_BALANCER_DNS:LOAD_BALANCER_PORT (it is best to set LOAD_BALANCER_DNS instead of an IP). Then I run kubeadm init with LOAD_BALANCER_DNS:LOAD_BALANCER_PORT. The result is:
./kubeadm init --control-plane-endpoint apiserver.cluster.local:6443
W0911 09:36:17.922135 63517 configset.go:348] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]
[init] Using Kubernetes version: v1.19.1
[preflight] Running pre-flight checks
[WARNING FileExisting-socat]: socat not found in system path
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Using existing ca certificate authority
[certs] Using existing apiserver certificate and key on disk
[certs] Using existing apiserver-kubelet-client certificate and key on disk
[certs] Using existing front-proxy-ca certificate authority
[certs] Using existing front-proxy-client certificate and key on disk
[certs] Using existing etcd/ca certificate authority
[certs] Using existing etcd/server certificate and key on disk
[certs] Using existing etcd/peer certificate and key on disk
[certs] Using existing etcd/healthcheck-client certificate and key on disk
[certs] Using existing apiserver-etcd-client certificate and key on disk
[certs] Using the existing "sa" key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Using existing kubeconfig file: "/etc/kubernetes/admin.conf"
[kubeconfig] Using existing kubeconfig file: "/etc/kubernetes/kubelet.conf"
currentConfig.Clusters[currentCluster].Server: https://apiserver.cluster.local:6443
config.Clusters[expectedCluster].Server: https://192.168.160.243:6443
error execution phase kubeconfig/controller-manager: a kubeconfig file "/etc/kubernetes/controller-manager.conf" exists already but has got the wrong API Server URL
To see the stack trace of this error execute with --v=5 or higher
i will send the PR in the next couple of days. edit: https://github.com/kubernetes/kubernetes/pull/94816
fix for 1.19.2 is here: https://github.com/kubernetes/kubernetes/pull/94890
to further summarize what is happening. after the changes above, kubeadm will no longer error out if the server URL in custom provided kubeconfig files does not match the expected one. it will only show a warning.
example: if a user-provided scheduler.conf points at e.g. foo:6443 instead of the expected URL, kubeadm will now only warn. this also allows editing e.g. scheduler.conf to point to e.g. 192.168.0.108:6443 (local api server endpoint).
fix for 1.19.2 is here: kubernetes/kubernetes#94890
1.19.2 is already out. So this fix will target 1.19.3, yes?
Indeed, they pushed it out 2 days ago. Should be out with 1.19.3 then.
@neolit123 This issue came up as I'm working on graduating the EndpointSlice API to GA (https://github.com/kubernetes/kubernetes/pull/96318). I'm trying to determine if it's safe to also upgrade consumers like kube-proxy or kube-controller-manager to also use the v1 API in the same release. If I'm understanding this issue correctly, making that change in upstream could potentially result in issues here when version skew exists. Do you think this will be resolved in time for the 1.20 release cycle?
@robscott i will comment on https://github.com/kubernetes/kubernetes/pull/96318
/remove-kind bug
/kind feature design
re:
optionally we should see if we can make the kubelet on control-plane Nodes bootstrap via the local API server instead of using the CPE. this might be a bit tricky and needs investigation. we could at least post-fix the kubelet.conf to point to the local API server after the bootstrap has finished.
i experimented with this and couldn't get it to work under normal conditions with a patched kubeadm binary.
procedure:
on the second CP node (note: phases are re-ordered here, compared to non-patched kubeadm):
TLS bootstrap fails and the kubelet reports a 400 and a:
Unexpected error when reading response body
kubelet client certs are never written in /var/lib/kubelet/pki/.
alternatively, if the second CP node signs its own kubelet client certificates (since it has the ca.key) with rotation disabled, the Node object ends up being created properly, but this sort of defeats the bootstrap token method for joining CP nodes and means one can just join using the "--certificate-key" that fetches the CA.
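(For illustration only, signing such a kubelet client certificate directly from the kubeadm CA could look roughly like the sketch below; the node name, validity and output paths are assumptions, and this is not a recommended flow for the reasons stated above.)
# rough sketch: manually sign a kubelet client cert from the kubeadm CA
# (node name, validity and output paths are assumptions; rotation assumed disabled)
NODE=cp-2
openssl genrsa -out /var/lib/kubelet/pki/kubelet-client.key 2048
openssl req -new -key /var/lib/kubelet/pki/kubelet-client.key \
  -subj "/O=system:nodes/CN=system:node:${NODE}" -out /tmp/kubelet-client.csr
openssl x509 -req -in /tmp/kubelet-client.csr \
  -CA /etc/kubernetes/pki/ca.crt -CAkey /etc/kubernetes/pki/ca.key \
  -CAcreateserial -days 365 -out /var/lib/kubelet/pki/kubelet-client.crt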
the static pods on the second node are running fine. etcd cluster looks healthy. i do not see anything interesting in the server and KCM logs, but i wonder if this is somehow due to leader election.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
from my POV the last item in the TODOs here is not easily doable. summary above. if someone wants to investigate this further, please go ahead.
We have been facing a hairpin issue in the capz private cluster when using the control plane endpoint for interactions between control plane components. To overcome this we map the control plane endpoint to localhost via preKubeadmCommands / postKubeadmCommands.
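(The exact pre/post commands are not shown in this thread; one illustrative way to express such a mapping is an /etc/hosts entry added around kubeadm execution, with the endpoint name taken from the logs below.)
# illustrative only: map the control plane endpoint to localhost on the node
# (the real capz preKubeadmCommands/postKubeadmCommands are not shown in this thread)
echo "127.0.0.1 apiserver.capz-e2e-xuqdh3-private.capz.io" >>/etc/hosts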
I was trying to remove this workaround to see if kubeadm is able to successfully initialize the node, and observed the following (kubeadm init is run with the config from /run/kubeadm/kubeadm.yaml):
when trying to get leases:
Feb 08 23:49:11 capz-e2e-xuqdh3-private-control-plane-fjmnz kubelet[6360]: E0208 23:49:11.596858 6360 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://apiserver.capz-e2e-xuqdh3-private.capz.io:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/capz-e2e-xuqdh3-private-control-plane-fjmnz?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
when trying to register the node:
Feb 08 23:49:25 capz-e2e-xuqdh3-private-control-plane-fjmnz kubelet[6360]: E0208 23:49:25.790815 6360 kubelet_node_status.go:93] "Unable to register node with API server" err="Post \"https://apiserver.capz-e2e-xuqdh3-private.capz.io:6443/api/v1/nodes\": dial tcp 10.255.0.100:6443: i/o timeout" node="capz-e2e-xuqdh3-private-control-plane-fjmnz"
This happens because kubelet.conf uses the control plane endpoint for the API server, and hence the kubelet is unable to contact the API server because of the hairpin issue mentioned above.
An interesting thing to note is that the workaround of mapping the CP endpoint to localhost is done in postKubeadmCommands for joining nodes. This means that the nodes were able to join the cluster successfully even with the CP endpoint in kubelet.conf. This could be because there is more than one node at that point, and the hairpin failure's probability drops below 100%.
Ideally we want a way for kubelet.conf to point to local ip when there is only one instance of control plane running and then switch to cp endpoint once other nodes join. Not sure if it's even possible, and even if it was, we'd still have packet losses when using dns.
wdyt? @neolit123 @fabriziopandini
Ideally we want a way for kubelet.conf to point to local ip when there is only one instance of control plane running and then switch to cp endpoint once other nodes join. Not sure if it's even possible, and even if it was, we'd still have packet losses when using dns.
does that mean having the kubelet.conf generated during kubeadm init point to localhost, and the one for kubeadm join point to the LB?
i think this is already supported with init phases. try the following (a rough end-to-end sketch follows below):
- kubeadm init phase certs ca (this will generate a CA)
- kubeadm init phase kubeconfig kubelet --config ... (this will generate the kubelet kubeconfig)
- sed the generated kubelet.conf to point to the local API server
- kubeadm init --skip-phases kubeconfig/kubelet,certs/ca --config ...
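(A sketch of that sequence, assuming the config lives at /root/kubeadm-config.yaml and the local API server listens on 127.0.0.1:6443; both are assumptions:)
# sketch of the suggested phase sequence (config path and local endpoint are assumptions)
kubeadm init phase certs ca --config /root/kubeadm-config.yaml
kubeadm init phase kubeconfig kubelet --config /root/kubeadm-config.yaml
# re-point the generated kubelet.conf at the local API server instead of the CPE
sed -i 's#server: https://.*#server: https://127.0.0.1:6443#' /etc/kubernetes/kubelet.conf
kubeadm init --skip-phases kubeconfig/kubelet,certs/ca --config /root/kubeadm-config.yaml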
yet.... if i'm understanding the problem correctly, i actually do not recommend doing that on the init node, because that kubelet might not be able to rotate its client certificate after ~8 months if the localhost API server / KCM are no longer leaders.
that can happen if the node restarted at some point and leaders switched.
just a speculation, but i'm not in favor of making this change in kubeadm init by default until we understand the CSR signing / leader election problem better.
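(For reference, one way to check whether a control plane kubelet's client certificate is still being rotated is to look at its expiry; the path below is the kubelet default and is an assumption here.)
# check the expiry of the kubelet's current client certificate (default kubelet path assumed)
openssl x509 -noout -enddate -in /var/lib/kubelet/pki/kubelet-client-current.pem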
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
if a 1.19 KCM takes leadership and tries to send a CSR to a 1.18 API server on an existing Node
can this still happen (a new KCM talking to an old API server)? if so, this is breaking the https://kubernetes.io/releases/version-skew-policy/#kube-controller-manager-kube-scheduler-and-cloud-controller-manager constraints and needs fixing
i am not aware if it can still happen in the future. the original trigger for logging the issue was the CSR v1 graduation, which was nearly 3 years ago.
the problem in kubeadm and CAPI where the kubelets on CP nodes talk to the LB remains and i could not find a sane solution.
yeah, sorry, I meant an 1.(n+1) KCM talking to a 1.n API server, not 1.19/1.18 specifically
KCM and Scheduler always talk with the API server on the same machine, which is of the same version (as far as I remember this decision was a trade-off between HA and user experience for upgrades).
Kubelet is the only component going through the load balancer; it is the last open point of this issue.
maybe https://github.com/kubernetes/kubernetes/pull/116570/files#r1179273639 was what I was thinking of, which was due to upgrade order rather than load-balancer communication
optionally we should see if we can make the kubelet on control-plane Nodes bootstrap via the local API server instead of using the CPE. this might be a bit tricky and needs investigation. we could at least post-fix the kubelet.conf to point to the local API server after the bootstrap has finished.
I think this is a good idea to fix this problem. @neolit123 :)
IMO, having the kubelet connect only to the local API server on control plane nodes is not an HA design. And for bootstrap (I mean /etc/kubernetes/bootstrap-kubelet.conf), the local apiserver is not ready yet.
As the kubelet must not be newer than kube-apiserver, we should upgrade all control planes first and then upgrade the kubelets on control plane nodes. This is enough for me.
would it be possible to also do this for the generated admin.conf on CP?
the point of admin.conf is to reach the lb sitting in front of the servers. in case of failure or during upgrade it is best to keep it that way IMO.
you could sign a custom kubeconfig that talks to localhost:port.
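(A minimal sketch of a simpler alternative that reuses the existing admin credentials instead of signing new ones; default kubeadm paths, the cluster name "kubernetes" and port 6443 are assumptions:)
# copy admin.conf and re-point its server at the local API server instead of the LB
# (default kubeadm paths, cluster name "kubernetes" and port 6443 are assumptions)
cp /etc/kubernetes/admin.conf /etc/kubernetes/admin-local.conf
kubectl --kubeconfig /etc/kubernetes/admin-local.conf config set-cluster kubernetes \
  --server=https://127.0.0.1:6443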
To tackle the last point of this issue:
- [ ] optionally we should see if we can make the kubelet on control-plane Nodes bootstrap via the local API server instead of using the CPE. this might be a bit tricky and needs investigation. we could at least post-fix the kubelet.conf to point to the local API server after the bootstrap has finished. see sig-cluster-lifecycle: best practices for immutable upgrades kubernetes#80774 for a related discussion
The TL/DR for this change is that we have to adjust the tlsBootstrapCfg which gets written for the kubelet to disk here: https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/cmd/phases/join/kubelet.go#L123 to point to localhost.
This change alone does not work, because it would create a chicken-egg problem: the kubelet would try to bootstrap via the local kube-apiserver, but the local kube-apiserver cannot become healthy until local etcd has been started and joined the cluster, which normally happens later in the join flow.
Because of that, this requires a change to kubeadm and its phases to fix the order of some actions so it can succeed.
Note: with that said, I think this change cannot simply be merged to the codebase. It may need to be activated over time by graduating it via a feature gate or similar, so the new default behaviour only arrives after some time, together with well-written release notes to make users aware of it.
To solve the above chicken-egg issue we have to reorder some subphases / add some extra phases to kubeadm.
To summarise the change:
- Split the KubeletStart phase into KubeletStart and KubeletWaitBootstrap. The KubeletStart phase ends after starting the kubelet, and the KubeletWaitBootstrap phase does the rest, which previously was embedded in KubeletStart.
- Split ControlPlaneJoinPhase into ControlPlaneJoinPhase and ControlplaneJoinEtcdPhase by extracting the etcd-relevant subphase EtcdLocalSubphase, which then becomes the ControlplaneJoinEtcdPhase.
- Reorder the join flow from
  ... -> (KubeletStart + KubeletWaitBootstrap) -> (ControlplaneJoinEtcdPhase *) -> ControlPlaneJoinPhase -> ...
  to
  ... -> KubeletStart -> ControlplaneJoinEtcdPhase -> KubeletWaitBootstrap -> ControlPlaneJoinPhase -> ...
  * Note: The EtcdLocalSubphase was directly at the beginning of ControlPlaneJoinPhase.
In addition to the phase changes: the kubelet's bootstrap kubeconfig / kubelet.conf on control plane nodes points to localhost / 127.0.0.1 instead of the load balanced API Server endpoint.
I have a dirty POC implementation here: https://github.com/kubernetes/kubernetes/compare/master...chrischdi:kubernetes:pr-experiment-kubeadm-kubelet-localhost which I used for testing the implementation.
I also stress-tested this implementation by using kinder:
- a kindest/node image where the only change is the updated kubeadm binary
- kinder create cluster --name kinder-test --image kindest/node:v1.28.0-test --control-plane-nodes 3
- the following script, which loops kubeadm-init / kubeadm-join / kubeadm-reset:
#!/bin/bash
set -o errexit
set -o nounset
set -o pipefail
I=0
while true; do
if [[ $(($I % 10)) -eq 0 ]]; then
echo ""
echo "Starting iteration $I"
fi
echo -n '.'
kinder do kubeadm-init --name kinder-test >stdout.txt 2>stderr.txt
kinder do kubeadm-join --name kinder-test >stdout.txt 2>stderr.txt
kinder do kubeadm-reset --name kinder-test >stdout.txt 2>stderr.txt
I=$((I+1))
done
If this sounds good, I would be happy to help drive this forward. I don't know if this requires a KEP first?! Happy to receive some feedback :-)
thanks for all the work on this @chrischdi. we could target 1.30 for it as it has been a long-standing task, however it's still not clear to me how exactly users are affected. it's the hairpin mode LB, correct?
we should probably talk more about it in the kubeadm office hours this week.
If this sounds good, I would be happy to help drive this forward. I don't know if this requires a KEP first?! Happy to receive some feedback :-)
given a FG was suggested and given it's a complex change, that is 1) breaking for users that anticipate a certain kubeadm phase order, and also 2) needs tests - i guess we need a KEP.
@pacoxu @SataQiu WDYT about this overall? we need agreement on it, obviously.
my vote is +1, but i hope we don't break users in ways that cannot be recoverable.
if we agree on a KEP and a way forward you can omit the PRR (prod readiness review) as it's a non-target for kubeadm. https://github.com/kubernetes/enhancements/blob/master/keps/README.md
During joining of a new control-plane node, in the step of the new EtcdLocalSubphase, is the kubelet running in standalone mode at first?
It sounds doable.
For the upgrade process, should we add logic for the kubelet config to point to localhost?
it's the hairpin mode LB, correct?
I think I lack context on what "hairpin mode LB" is :-)
During joining of a new control-plane node, in the step of the new EtcdLocalSubphase, is the kubelet running in standalone mode at first?
Yes, in the targeted implementation, kubelet starts already, but cannot yet join the cluster (because the referenced kube-apiserver will not get healthy unless etcd is started and joined the cluster). During EtcdLocalSubphase we then place the etcd static pod manifest and join etcd to the cluster. After it joined, kube-apiserver gets healthy and the kubelet bootstraps itself, while kubeadm starts to wait for bootstrap to complete.
it's the hairpin mode LB, correct?
I think I lack context on what "hairpin mode LB" is :-)
i think the CAPZ and the Azure LB were affected: https://github.com/microsoft/Azure-ILB-hairpin
if we agree that this needs a KEP, it can cover what problems we are trying to solve. it's a disruptive change, thus it needs to be warranted.
in CAPI immutable upgrades we saw a problem where a 1.19 joining node cannot bootstrap if a 1.19 KCM takes leadership and tries to send a CSR to a 1.18 API server on an existing Node. this happens because in 1.19 the CSR API graduated to v1 and a KCM is supposed to talk to an N or N+1 API server only.
a better explanation here: https://kubernetes.slack.com/archives/C8TSNPY4T/p1598907959059100?thread_ts=1598899864.038100&cid=C8TSNPY4T
optionally we should see if we can make the kubelet on control-plane Nodes bootstrap via the local API server instead of using the CPE. this might be a bit tricky and needs investigation. we could at least post-fix the kubelet.conf to point to the local API server after the bootstrap has finished. see https://github.com/kubernetes/kubernetes/issues/80774 for a related discussion
this change requires a more detailed plan, a feature gate and a KEP
1.31
KEP:
1.32
TODO. Move the FG to beta?