neolit123 opened this issue 4 years ago
first PR is here: https://github.com/kubernetes/kubernetes/pull/94398
we spoke about the kubelet.conf in the office hours today:
i'm going to experiment and see how it goes, but this cannot be backported to older releases as it is a breaking change to phase users.
This breaks the rules; the controlPlaneEndpoint may be a domain, and if it is a domain, it will not work correctly after your change.
can you clarify with examples?
@jdef added a note that some comments were left invalid after the recent change: https://github.com/kubernetes/kubernetes/pull/94398/files/d9441906c4155173ce1a75421d8fcd1d2f79c471#r486252360
this should be fixed in master.
someone else added a comment on https://github.com/kubernetes/kubernetes/pull/94398 but later deleted it:
when using the method CreateJoinControlPlaneKubeConfigFiles with a controlPlaneEndpoint like apiserver.cluster.local to generate the config files, and then running kubeadm init --config=/root/kubeadm-config.yaml --upload-certs -v 5, the following error occurs:
I0910 15:15:54.436430 52511 kubeconfig.go:84] creating kubeconfig file for controller-manager.conf
currentConfig.Clusters[currentCluster].Server: https://apiserver.cluster.local:6443
config.Clusters[expectedCluster].Server: https://192.168.160.243:6443
a kubeconfig file "/etc/kubernetes/controller-manager.conf" exists already but has got the wrong API Server URL
this validation should be turned into a warning instead of an error. then components would fail if they don't point to a valid API server, so the user would know.
This breaks the rules; the controlPlaneEndpoint may be a domain, and if it is a domain, it will not work correctly after your change.
can you clarify with examples?
you could see this doc https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/#steps-for-the-first-control-plane-node
--control-plane-endpoint "LOAD_BALANCER_DNS:LOAD_BALANCER_PORT"
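(For reference, a minimal kubeadm config for this kind of setup might look like the sketch below; the kubeadm.k8s.io/v1beta2 apiVersion is an assumption based on the v1.19 timeframe, while the endpoint and config path come from the report above.)
# minimal sketch of a kubeadm config using a DNS-based controlPlaneEndpoint
# (apiVersion is an assumption; endpoint and path are taken from the report above)
cat >/root/kubeadm-config.yaml <<'EOF'
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
controlPlaneEndpoint: "apiserver.cluster.local:6443"
EOF
kubeadm init --config=/root/kubeadm-config.yaml --upload-certs -v 5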
i do know about that doc. are you saying that using "DNS-name:port" is completely broken now for you? what error output are you seeing? i did test this during my work on the changes and it worked fine.
this validation should be turned into a warning instead of an error. then components would fail if they don't point to a valid API server, so the user would know.
yes, please. this just bit us when testing a workaround in a pre-1.19.1 cluster, whereby we tried manually updating clusters[].cluster.server in scheduler.conf and controller-manager.conf to point to localhost instead of the official control plane endpoint.
i do know about that doc. are you saying that using "DNS-name:port" is completely broken now for you?
yes, if you want to deploy an HA cluster, it is best to set controlPlaneEndpoint to the LOAD_BALANCER_DNS instead of the LOAD_BALANCER IP
what error are you getting?
I added some code for log printing; this is the error:
I0910 13:14:53.017570 21006 kubeconfig.go:84] creating kubeconfig file for controller-manager.conf
currentConfig.Clusters https://apiserver.cluster.local:6443
config.Clusters: https://192.168.160.243:6443
error execution phase kubeconfig/controller-manager: a kubeconfig file "/etc/kubernetes/controller-manager.conf" exists already but has got the wrong API Server URL
ok, so you have the same error as the user reporting above.
we can fix this for 1.19.2
one workaround is:
Both kube-scheduler and kube-controller-manager can use either localhost or the load balancer to connect to kube-apiserver, but users cannot be forced to use localhost, and a warning can be used instead of an error
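(A minimal sketch of that manual re-pointing, assuming the default kubeadm paths, the default cluster name "kubernetes" and a local API server on port 6443:)
# re-point the scheduler and controller-manager kubeconfigs at the local API server
# (default kubeadm paths, cluster name "kubernetes" and port 6443 are assumptions)
for f in /etc/kubernetes/scheduler.conf /etc/kubernetes/controller-manager.conf; do
  kubectl --kubeconfig "$f" config set-cluster kubernetes --server=https://127.0.0.1:6443
done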
@neolit123 I'm +1 to relax the checks on the address in the existing kubeconfig file. We can either remove the check or make it more flexible by checking if the address is either CPE or LAPI
@neolit123 here is the example. i just edited it to add a log print: https://github.com/neolit123/kubernetes/blob/d9441906c4155173ce1a75421d8fcd1d2f79c471/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go#L225
fmt.Println("currentConfig.Clusters[currentCluster].Server:", currentConfig.Clusters[currentCluster].Server, "\nconfig.Clusters[expectedCluster].Server: ", config.Clusters[expectedCluster].Server)
I use the method CreateJoinControlPlaneKubeConfigFiles with controlPlaneEndpoint to generate the kube-scheduler and kube-controller-manager kubeconfigs. In this situation controlPlaneEndpoint is set to LOAD_BALANCER_DNS:LOAD_BALANCER_PORT (it is best to set LOAD_BALANCER_DNS instead of an IP). Then I run kubeadm init with LOAD_BALANCER_DNS:LOAD_BALANCER_PORT. The result is:
./kubeadm init --control-plane-endpoint apiserver.cluster.local:6443
W0911 09:36:17.922135 63517 configset.go:348] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]
[init] Using Kubernetes version: v1.19.1
[preflight] Running pre-flight checks
[WARNING FileExisting-socat]: socat not found in system path
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Using existing ca certificate authority
[certs] Using existing apiserver certificate and key on disk
[certs] Using existing apiserver-kubelet-client certificate and key on disk
[certs] Using existing front-proxy-ca certificate authority
[certs] Using existing front-proxy-client certificate and key on disk
[certs] Using existing etcd/ca certificate authority
[certs] Using existing etcd/server certificate and key on disk
[certs] Using existing etcd/peer certificate and key on disk
[certs] Using existing etcd/healthcheck-client certificate and key on disk
[certs] Using existing apiserver-etcd-client certificate and key on disk
[certs] Using the existing "sa" key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Using existing kubeconfig file: "/etc/kubernetes/admin.conf"
[kubeconfig] Using existing kubeconfig file: "/etc/kubernetes/kubelet.conf"
currentConfig.Clusters[currentCluster].Server: https://apiserver.cluster.local:6443
config.Clusters[expectedCluster].Server: https://192.168.160.243:6443
error execution phase kubeconfig/controller-manager: a kubeconfig file "/etc/kubernetes/controller-manager.conf" exists already but has got the wrong API Server URL
To see the stack trace of this error execute with --v=5 or higher
i will send the PR in the next couple of days. edit: https://github.com/kubernetes/kubernetes/pull/94816
fix for 1.19.2 is here: https://github.com/kubernetes/kubernetes/pull/94890
to further summarize what is happening. after the changes above, kubeadm will no longer error out if the server URL in custom provided kubeconfig files does not match the expected one. it will only show a warning.
example: if a user-provided scheduler.conf points at e.g. foo:6443 instead of the expected URL, kubeadm will now only warn. this also allows editing e.g. scheduler.conf to point to e.g. 192.168.0.108:6443 (local api server endpoint).
fix for 1.19.2 is here: kubernetes/kubernetes#94890
1.19.2 is already out. So this fix will target 1.19.3, yes?
Indeed, they pushed it out 2 days ago. Should be out with 1.19.3 then.
@neolit123 This issue came up as I'm working on graduating the EndpointSlice API to GA (https://github.com/kubernetes/kubernetes/pull/96318). I'm trying to determine if it's safe to also upgrade consumers like kube-proxy or kube-controller-manager to also use the v1 API in the same release. If I'm understanding this issue correctly, making that change in upstream could potentially result in issues here when version skew exists. Do you think this will be resolved in time for the 1.20 release cycle?
@robscott i will comment on https://github.com/kubernetes/kubernetes/pull/96318
/remove-kind bug
/kind feature design
re:
optionally we should see if we can make the kubelet on control-plane Nodes bootstrap via the local API server instead of using the CPE. this might be a bit tricky and needs investigation. we could at least post-fix the kubelet.conf to point to the local API server after the bootstrap has finished.
i experimented with this and couldn't get it to work under normal conditions with a patched kubeadm binary.
procedure:
on the second CP node (note: phases are re-ordered here, compared to non-patched kubeadm):
TLS bootstrap fails and the kubelet reports a 400 and a:
Unexpected error when reading response body
kubelet client certs are never written in /var/lib/kubelet/pki/.
alternatively, if the second CP node signs its own kubelet client certificates (since it has the ca.key) with rotation disabled, the Node object ends up being created properly, but this sort of defeats the bootstrap token method for joining CP nodes and means one can just join using the "--certificate-key" that fetches the CA.
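(For illustration only, signing such a kubelet client certificate directly from the kubeadm CA could look roughly like the sketch below; the node name, validity and output paths are assumptions, and this is not a recommended flow for the reasons stated above.)
# rough sketch: manually sign a kubelet client cert from the kubeadm CA
# (node name, validity and output paths are assumptions; rotation assumed disabled)
NODE=cp-2
openssl genrsa -out /var/lib/kubelet/pki/kubelet-client.key 2048
openssl req -new -key /var/lib/kubelet/pki/kubelet-client.key \
  -subj "/O=system:nodes/CN=system:node:${NODE}" -out /tmp/kubelet-client.csr
openssl x509 -req -in /tmp/kubelet-client.csr \
  -CA /etc/kubernetes/pki/ca.crt -CAkey /etc/kubernetes/pki/ca.key \
  -CAcreateserial -days 365 -out /var/lib/kubelet/pki/kubelet-client.crt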
the static pods on the second node are running fine. etcd cluster looks healthy. i do not see anything interesting in the server and KCM logs, but i wonder if this is somehow due to leader election.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
from my POV the last item in the TODOs here is not easily doable. summary above. if someone wants to investigate this further, please go ahead.
We have been facing a hairpin issue in the capz private cluster when using the control plane endpoint for interactions between control plane components. To overcome this we map the control plane endpoint to localhost via preKubeadmCommands / postKubeadmCommands.
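(The exact pre/post commands are not shown in this thread; one illustrative way to express such a mapping is an /etc/hosts entry added around kubeadm execution, with the endpoint name taken from the logs below.)
# illustrative only: map the control plane endpoint to localhost on the node
# (the real capz preKubeadmCommands/postKubeadmCommands are not shown in this thread)
echo "127.0.0.1 apiserver.capz-e2e-xuqdh3-private.capz.io" >>/etc/hosts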
I was trying to remove this workaround to see if kubeadm is able to successfully initialize the node, and observed the following (kubeadm init is run with the config from /run/kubeadm/kubeadm.yaml):
when trying to get leases:
Feb 08 23:49:11 capz-e2e-xuqdh3-private-control-plane-fjmnz kubelet[6360]: E0208 23:49:11.596858 6360 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://apiserver.capz-e2e-xuqdh3-private.capz.io:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/capz-e2e-xuqdh3-private-control-plane-fjmnz?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
when trying to register the node:
Feb 08 23:49:25 capz-e2e-xuqdh3-private-control-plane-fjmnz kubelet[6360]: E0208 23:49:25.790815 6360 kubelet_node_status.go:93] "Unable to register node with API server" err="Post \"https://apiserver.capz-e2e-xuqdh3-private.capz.io:6443/api/v1/nodes\": dial tcp 10.255.0.100:6443: i/o timeout" node="capz-e2e-xuqdh3-private-control-plane-fjmnz"
This happens because kubelet.conf uses the control plane endpoint for the API server, and hence the kubelet is unable to contact the API server because of the hairpin issue mentioned above.
An interesting thing to note is that the workaround of mapping the CP endpoint to localhost is done in postKubeadmCommands for joining nodes. This means that the nodes were able to join the cluster successfully even with the CP endpoint in kubelet.conf. This could be because there is more than one node at that point, and the hairpin failure's probability drops below 100%.
Ideally we want a way for kubelet.conf to point to local ip when there is only one instance of control plane running and then switch to cp endpoint once other nodes join. Not sure if it's even possible, and even if it was, we'd still have packet losses when using dns.
wdyt? @neolit123 @fabriziopandini
Ideally we want a way for kubelet.conf to point to local ip when there is only one instance of control plane running and then switch to cp endpoint once other nodes join. Not sure if it's even possible, and even if it was, we'd still have packet losses when using dns.
does that mean having the kubelet.conf generated during kubeadm init point to localhost, and the one for kubeadm join point to the LB?
i think this is already supported with init phases. try the following (a rough end-to-end sketch follows below):
- kubeadm init phase certs ca (this will generate a CA)
- kubeadm init phase kubeconfig kubelet --config ... (this will generate the kubelet kubeconfig)
- sed the generated kubelet.conf to point to the local API server
- kubeadm init --skip-phases kubeconfig/kubelet,certs/ca --config ...
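(A sketch of that sequence, assuming the config lives at /root/kubeadm-config.yaml and the local API server listens on 127.0.0.1:6443; both are assumptions:)
# sketch of the suggested phase sequence (config path and local endpoint are assumptions)
kubeadm init phase certs ca --config /root/kubeadm-config.yaml
kubeadm init phase kubeconfig kubelet --config /root/kubeadm-config.yaml
# re-point the generated kubelet.conf at the local API server instead of the CPE
sed -i 's#server: https://.*#server: https://127.0.0.1:6443#' /etc/kubernetes/kubelet.conf
kubeadm init --skip-phases kubeconfig/kubelet,certs/ca --config /root/kubeadm-config.yaml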
yet.... if i'm understanding the problem correctly, i actually do not recommend doing that on the init node, because that kubelet might not be able to rotate its client certificate after ~8 months if the localhost API server / KCM are no longer leaders.
that can happen if the node restarted at some point and leaders switched.
just a speculation, but i'm not in favor of making this change in kubeadm init by default until we understand the CSR signing / leader election problem better.
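(For reference, one way to check whether a control plane kubelet's client certificate is still being rotated is to look at its expiry; the path below is the kubelet default and is an assumption here.)
# check the expiry of the kubelet's current client certificate (default kubelet path assumed)
openssl x509 -noout -enddate -in /var/lib/kubelet/pki/kubelet-client-current.pem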
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
if a 1.19 KCM takes leadership and tries to send a CSR to a 1.18 API server on an existing Node
can this still happen (a new KCM talking to an old API server)? if so, this is breaking the https://kubernetes.io/releases/version-skew-policy/#kube-controller-manager-kube-scheduler-and-cloud-controller-manager constraints and needs fixing
i am not aware if it can still happen in the future. the original trigger for logging the issue was the CSR v1 graduation, which was nearly 3 years ago.
the problem in kubeadm and CAPI where the kubelets on CP nodes talk to the LB remains and i could not find a sane solution.
yeah, sorry, I meant an 1.(n+1) KCM talking to a 1.n API server, not 1.19/1.18 specifically
KCM and Scheduler always talk with the API server on the same machine, which is of the same version (as far as I remember this decision was a trade-off between HA and user experience for upgrades).
Kubelet is the only component going through the load balancer; it is the last open point of this issue.
maybe https://github.com/kubernetes/kubernetes/pull/116570/files#r1179273639 was what I was thinking of, which was due to upgrade order rather than load-balancer communication
optionally we should see if we can make the kubelet on control-plane Nodes bootstrap via the local API server instead of using the CPE. this might be a bit tricky and needs investigation. we could at least post-fix the kubelet.conf to point to the local API server after the bootstrap has finished.
I think this is a good idea to fix this problem. @neolit123 :)
IMO, having the kubelet connect only to the local API server on control plane nodes is not an HA design. And for bootstrap (I mean /etc/kubernetes/bootstrap-kubelet.conf), the local apiserver is not ready yet.
As the kubelet must not be newer than kube-apiserver, we should upgrade all control planes first and then upgrade the kubelets on control plane nodes. This is enough for me.
would it be possible to also do this for the generated admin.conf on CP?
the point of admin.conf is to reach the lb sitting in front of the servers. in case of failure or during upgrade it is best to keep it that way IMO.
you could sign a custom kubeconfig that talks to localhost:port.
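(A minimal sketch of a simpler alternative that reuses the existing admin credentials instead of signing new ones; default kubeadm paths, the cluster name "kubernetes" and port 6443 are assumptions:)
# copy admin.conf and re-point its server at the local API server instead of the LB
# (default kubeadm paths, cluster name "kubernetes" and port 6443 are assumptions)
cp /etc/kubernetes/admin.conf /etc/kubernetes/admin-local.conf
kubectl --kubeconfig /etc/kubernetes/admin-local.conf config set-cluster kubernetes \
  --server=https://127.0.0.1:6443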
To tackle the last point of this issue:
- [ ] optionally we should see if we can make the kubelet on control-plane Nodes bootstrap via the local API server instead of using the CPE. this might be a bit tricky and needs investigation. we could at least post-fix the kubelet.conf to point to the local API server after the bootstrap has finished. see sig-cluster-lifecycle: best practices for immutable upgrades kubernetes#80774 for a related discussion
The TL/DR for this change is that we have to adjust the tlsBootstrapCfg which gets written for the kubelet to disk here: https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/cmd/phases/join/kubelet.go#L123 to point to localhost.
This change alone does not work, because it would create a chicken-egg problem: the kubelet would try to bootstrap via the local kube-apiserver, but the local kube-apiserver cannot become healthy until local etcd has been started and joined the cluster, which normally happens later in the join flow.
Because of that, this requires a change to kubeadm and its phases to fix the order of some actions so it can succeed.
Note: with that said, I think this change cannot simply be merged to the codebase. It may need to be activated over time by graduating it via a feature gate or similar, so the new default behaviour only arrives after some time, together with well-written release notes to make users aware of it.
To solve the above chicken-egg issue we have to reorder some subphases / add some extra phases to kubeadm.
To summarise the change:
- Split the KubeletStart phase into KubeletStart and KubeletWaitBootstrap. The KubeletStart phase ends after starting the kubelet, and the KubeletWaitBootstrap phase does the rest, which previously was embedded in KubeletStart.
- Split ControlPlaneJoinPhase into ControlPlaneJoinPhase and ControlplaneJoinEtcdPhase by extracting the etcd-relevant subphase EtcdLocalSubphase, which then becomes the ControlplaneJoinEtcdPhase.
- Reorder the join flow from
  ... -> (KubeletStart + KubeletWaitBootstrap) -> (ControlplaneJoinEtcdPhase *) -> ControlPlaneJoinPhase -> ...
  to
  ... -> KubeletStart -> ControlplaneJoinEtcdPhase -> KubeletWaitBootstrap -> ControlPlaneJoinPhase -> ...
  * Note: The EtcdLocalSubphase was directly at the beginning of ControlPlaneJoinPhase.
In addition to the phase changes: the kubelet's bootstrap kubeconfig / kubelet.conf on control plane nodes points to localhost / 127.0.0.1 instead of the load balanced API Server endpoint.
I have a dirty POC implementation here: https://github.com/kubernetes/kubernetes/compare/master...chrischdi:kubernetes:pr-experiment-kubeadm-kubelet-localhost which I used for testing the implementation.
I also stress-tested this implementation by using kinder:
- a kindest/node image where the only change is the updated kubeadm binary
- kinder create cluster --name kinder-test --image kindest/node:v1.28.0-test --control-plane-nodes 3
- the following script, which loops kubeadm-init / kubeadm-join / kubeadm-reset:
#!/bin/bash
set -o errexit
set -o nounset
set -o pipefail
I=0
while true; do
if [[ $(($I % 10)) -eq 0 ]]; then
echo ""
echo "Starting iteration $I"
fi
echo -n '.'
kinder do kubeadm-init --name kinder-test >stdout.txt 2>stderr.txt
kinder do kubeadm-join --name kinder-test >stdout.txt 2>stderr.txt
kinder do kubeadm-reset --name kinder-test >stdout.txt 2>stderr.txt
I=$((I+1))
done
If this sounds good, I would be happy to help drive this forward. I don't know if this requires a KEP first?! Happy to receive some feedback :-)
thanks for all the work on this @chrischdi. we could target 1.30 for it as it has been a long-standing task, however it's still not clear to me how exactly users are affected. it's the hairpin mode LB, correct?
we should probably talk more about it in the kubeadm office hours this week.
If this sounds good, I would be happy to help drive this forward. I don't know if this requires a KEP first?! Happy to receive some feedback :-)
given a FG was suggested and given it's a complex change, that is 1) breaking for users that anticipate a certain kubeadm phase order, and also 2) needs tests - i guess we need a KEP.
@pacoxu @SataQiu WDYT about this overall? we need agreement on it, obviously.
my vote is +1, but i hope we don't break users in ways that cannot be recoverable.
if we agree on a KEP and a way forward you can omit the PRR (prod readiness review) as it's a non-target for kubeadm. https://github.com/kubernetes/enhancements/blob/master/keps/README.md
During joining of a new control-plane node, in the step of the new EtcdLocalSubphase, is the kubelet running in standalone mode at first?
It sounds doable.
For the upgrade process, should we add logic for the kubelet config to point to localhost?
it's the hairpin mode LB, correct?
I think I lack context on what "hairpin mode LB" is :-)
During joining of a new control-plane node, in the step of the new EtcdLocalSubphase, is the kubelet running in standalone mode at first?
Yes, in the targeted implementation, kubelet starts already, but cannot yet join the cluster (because the referenced kube-apiserver will not get healthy unless etcd is started and joined the cluster). During EtcdLocalSubphase we then place the etcd static pod manifest and join etcd to the cluster. After it joined, kube-apiserver gets healthy and the kubelet bootstraps itself, while kubeadm starts to wait for bootstrap to complete.
it's the hairpin mode LB, correct?
I think I lack context on what "hairpin mode LB" is :-)
i think the CAPZ and the Azure LB were affected: https://github.com/microsoft/Azure-ILB-hairpin
if we agree that this needs a KEP, it can cover what problems we are trying to solve. it's a disruptive change, thus it needs to be warranted.
in CAPI immutable upgrades we saw a problem where a 1.19 joining node cannot bootstrap if a 1.19 KCM takes leadership and tries to send a CSR to a 1.18 API server on an existing Node. this happens because in 1.19 the CSR API graduated to v1 and a KCM is supposed to talk to an N or N+1 API server only.
a better explanation here: https://kubernetes.slack.com/archives/C8TSNPY4T/p1598907959059100?thread_ts=1598899864.038100&cid=C8TSNPY4T
optionally we should see if we can make the kubelet on control-plane Nodes bootstrap via the local API server instead of using the CPE. this might be a bit tricky and needs investigation. we could at least post-fix the kubelet.conf to point to the local API server after the bootstrap has finished. see https://github.com/kubernetes/kubernetes/issues/80774 for a related discussion
this change requires a more detailed plan, a feature gate and a KEP
1.31
KEP:
1.32
TODO. Move the FG to beta?