kubernetes-sigs / kubespray

Deploy a Production Ready Kubernetes Cluster
Apache License 2.0

Problem when changing control_plane node (upgrade from Debian bullseye to bookworm) #10560

Closed ccaillet1974 closed 1 month ago

ccaillet1974 commented 10 months ago

Environment:

Kubespray version (commit) (git rev-parse --short HEAD): 7dcc22fe8

Network plugin used: cilium

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"): inventory-variables.txt

Command used to invoke ansible: ansible-playbook -i inventory/test-l2-multi/hosts.yml --become --become-user=root -K --limit=kube_control_plane cluster.yml

Output of ansible run:

TASK [etcd : Configure | Ensure etcd is running] ***********************************************************************************************************************************************************************************************
ok: [lyo0-k8s-testm02]
ok: [lyo0-k8s-testm01]
fatal: [lyo0-k8s-testm00]: FAILED! => {"changed": false, "msg": "Unable to start service etcd: Job for etcd.service failed because a timeout was exceeded.\nSee \"systemctl status etcd.service\" and \"journalctl -xeu etcd.service\" for details.\n"}
Thursday 26 October 2023  10:40:33 +0200 (0:01:30.885)       0:09:46.376 ******
Thursday 26 October 2023  10:40:33 +0200 (0:00:00.071)       0:09:46.448 ******
FAILED - RETRYING: [lyo0-k8s-testm01]: Configure | Wait for etcd cluster to be healthy (4 retries left).
FAILED - RETRYING: [lyo0-k8s-testm01]: Configure | Wait for etcd cluster to be healthy (3 retries left).
FAILED - RETRYING: [lyo0-k8s-testm01]: Configure | Wait for etcd cluster to be healthy (2 retries left).
FAILED - RETRYING: [lyo0-k8s-testm01]: Configure | Wait for etcd cluster to be healthy (1 retries left).

TASK [etcd : Configure | Wait for etcd cluster to be healthy] **********************************************************************************************************************************************************************************
fatal: [lyo0-k8s-testm01]: FAILED! => {"attempts": 4, "changed": false, "cmd": "set -o pipefail && /usr/local/bin/etcdctl endpoint --cluster status && /usr/local/bin/etcdctl endpoint --cluster health 2>&1 | grep -v 'Error: unhealthy cluster' >/dev/null", "delta": "0:00:05.042724", "end": "2023-10-26 10:41:24.339102", "msg": "non-zero return code", "rc": 1, "start": "2023-10-26 10:41:19.296378", "stderr": "{\"level\":\"warn\",\"ts\":\"2023-10-26T10:41:24.329966+0200\",\"logger\":\"etcd-client\",\"caller\":\"v3@v3.5.9/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc000344fc0/10.141.10.65:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}\nError: failed to fetch endpoints from etcd cluster member list: context deadline exceeded", "stderr_lines": ["{\"level\":\"warn\",\"ts\":\"2023-10-26T10:41:24.329966+0200\",\"logger\":\"etcd-client\",\"caller\":\"v3@v3.5.9/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc000344fc0/10.141.10.65:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}", "Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT *****************************************************************************************************************************************************************************************************************************

PLAY RECAP *************************************************************************************************************************************************************************************************************************************
lyo0-k8s-testm00           : ok=496  changed=47   unreachable=0    failed=1    skipped=494  rescued=0    ignored=1
lyo0-k8s-testm01           : ok=508  changed=9    unreachable=0    failed=1    skipped=574  rescued=0    ignored=0
lyo0-k8s-testm02           : ok=484  changed=10   unreachable=0    failed=0    skipped=501  rescued=0    ignored=0

Thursday 26 October 2023  10:41:24 +0200 (0:00:51.257)       0:10:37.705 ******
===============================================================================
etcd : Configure | Ensure etcd is running ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 90.89s
etcd : Configure | Wait for etcd cluster to be healthy --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 51.26s
etcd : Gen_certs | Write etcd member/admin and kube_control_plane client certs to other etcd nodes ------------------------------------------------------------------------------------------------------------------------------------- 18.48s
download : Download_container | Download image if required ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 14.25s
etcd : Gen_certs | Write node certs to other etcd nodes -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 12.02s
container-engine/runc : Download_file | Download item ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.76s
download : Download_file | Download item ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 6.66s
container-engine/containerd : Download_file | Download item ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.66s
etcdctl_etcdutl : Download_file | Download item ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.57s
container-engine/crictl : Download_file | Download item --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.57s
container-engine/nerdctl : Download_file | Download item -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.50s
container-engine/crictl : Extract_file | Unpacking archive ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 6.28s
etcd : Gen_certs | Gather etcd member/admin and kube_control_plane client certs from first etcd node ------------------------------------------------------------------------------------------------------------------------------------ 6.18s
etcdctl_etcdutl : Extract_file | Unpacking archive -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.08s
download : Download_container | Download image if required ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 5.57s
container-engine/nerdctl : Extract_file | Unpacking archive ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.56s
container-engine/containerd : Containerd | Unpack containerd archive -------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.54s
container-engine/validate-container-engine : Populate service facts --------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.50s
etcd : Configure | Check if etcd cluster is healthy ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.45s
download : Download_container | Download image if required ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 5.15s

Description of the problem: I ran the following command to remove one control_plane node, which was still on bullseye (Debian 11): ansible-playbook -i inventory/test-l2-multi/hosts.yml --become --become-user=root -K remove-node.yml -e node=lyo0-k8s-testm00.

I had previously changed the host order in the inventory file as described in docs/nodes.md. I then reinstalled the node with bookworm and tried to add it back with the command ansible-playbook -i inventory/test-l2-multi/hosts.yml --become --become-user=root -K --limit=kube_control_plane cluster.yml

Here are the etcd log entries on the new control_plane node:

2023-10-26T11:03:32.800850+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"info","ts":"2023-10-26T11:03:32.800034+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b is starting a new election at term 1"}
2023-10-26T11:03:32.801193+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"info","ts":"2023-10-26T11:03:32.801124+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b became pre-candidate at term 1"}
2023-10-26T11:03:32.801382+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"info","ts":"2023-10-26T11:03:32.80132+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b received MsgPreVoteResp from 7d81e23d9d41da1b at term 1"}
2023-10-26T11:03:32.801584+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"info","ts":"2023-10-26T11:03:32.801524+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b [logterm: 1, index: 3] sent MsgPreVote request to c101cbbb43bf28a0 at term 1"}
2023-10-26T11:03:32.801780+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"info","ts":"2023-10-26T11:03:32.801721+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b [logterm: 1, index: 3] sent MsgPreVote request to eb484ede068d3a18 at term 1"}
2023-10-26T11:03:34.838611+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"warn","ts":"2023-10-26T11:03:34.838023+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"c101cbbb43bf28a0","rtt":"0s","error":"remote error: tls: bad certificate"}
2023-10-26T11:03:34.839025+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"warn","ts":"2023-10-26T11:03:34.838013+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"c101cbbb43bf28a0","rtt":"0s","error":"remote error: tls: bad certificate"}
2023-10-26T11:03:34.839093+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"warn","ts":"2023-10-26T11:03:34.838049+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"eb484ede068d3a18","rtt":"0s","error":"remote error: tls: bad certificate"}
2023-10-26T11:03:34.839234+02:00 lyo0-k8s-testm00 etcd[35620]: {"level":"warn","ts":"2023-10-26T11:03:34.838979+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"eb484ede068d3a18","rtt":"0s","error":"remote error: tls: bad certificate"}

The run failed at the etcd stage (etcd deployed with the host method) with the message shown above. I've already done this kind of operation, but only to change hardware for control-plane nodes, and with the kubespray release-2.22 branch everything worked well.

Any assistance resolving this etcd problem would be appreciated.
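
For reference, a minimal manual health check from one of the surviving etcd members looks roughly like this. The certificate file names are assumptions based on kubespray's usual /etc/ssl/etcd/ssl/ layout; adjust them to whatever is actually on the host:

# run on a healthy member, e.g. lyo0-k8s-testm01
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://10.141.10.65:2379
export ETCDCTL_CACERT=/etc/ssl/etcd/ssl/ca.pem
export ETCDCTL_CERT=/etc/ssl/etcd/ssl/admin-lyo0-k8s-testm01.pem      # assumed file name
export ETCDCTL_KEY=/etc/ssl/etcd/ssl/admin-lyo0-k8s-testm01-key.pem   # assumed file name

/usr/local/bin/etcdctl member list -w table        # is the re-added member listed, and started or still unstarted?
/usr/local/bin/etcdctl endpoint status --cluster -w table
/usr/local/bin/etcdctl endpoint health --cluster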

ccaillet1974 commented 10 months ago

ADDENDUM :

Same problem when trying to add the node back on bullseye: for testing I reinstalled the node with bullseye and hit the same issue when adding it back to the control plane with the command described above.

blackluck commented 10 months ago

Hello, it's not clear to me whether your master nodes are also etcd nodes. If they are, then maybe you should also include the etcd group in the limit, e.g. --limit=kube_control_plane,etcd

ccaillet1974 commented 10 months ago

OK ... I'll test the add with etcd in the limit, but as far as I'm concerned kube_control_plane includes etcd. I'll give more info in about 20 minutes.

EDIT: Same error with etcd included in the limit parameter.

blackluck commented 10 months ago

The docs also mention that for etcd nodes you need to set -e ignore_assert_errors=yes
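
Combining that with the earlier --limit suggestion, the invocation would look roughly like this (just the flags already mentioned in this thread put together):

ansible-playbook -i inventory/test-l2-multi/hosts.yml \
  --become --become-user=root -K \
  --limit=kube_control_plane,etcd \
  -e ignore_assert_errors=yes \
  cluster.yml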

ccaillet1974 commented 10 months ago

Same error with -e ignore_assert_errors=yes

As I said earlier, it seems there is a problem with cert generation, because I see "tls: bad certificate" in the logs on my new node.

EDIT: this process for replacing a control_plane node worked well with the release-2.22 branch ... maybe a regression?

EDIT 2 :

Logs on the other nodes:

Oct 26 16:17:10 lyo0-k8s-testm02 etcd[1375044]: {"level":"warn","ts":"2023-10-26T16:17:10.13056+0200","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.141.10.64:53502","server-name":"","error":"tls: failed to verify client certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"etcd-ca\")"}
Oct 26 16:19:08 lyo0-k8s-testm01 etcd[1374901]: {"level":"warn","ts":"2023-10-26T16:19:08.146019+0200","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"10.141.10.64:60964","server-name":"","error":"tls: failed to verify client certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"etcd-ca\")"}

It seems that there is a real problem with certificate generation for the new node (IP 10.141.10.64) :'(
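
A quick way to confirm a CA mismatch is to compare the etcd CA on the new node with the one on a surviving member, for example with openssl (the /etc/ssl/etcd/ssl/ paths are kubespray's usual defaults and an assumption here):

# run on the new node (10.141.10.64) and on a healthy member (e.g. 10.141.10.65), then compare
openssl x509 -in /etc/ssl/etcd/ssl/ca.pem -noout -subject -enddate -fingerprint -sha256

# also check which CA issued the member cert on the new node (file name is an assumption)
openssl x509 -in /etc/ssl/etcd/ssl/member-lyo0-k8s-testm00.pem -noout -issuer -fingerprint -sha256

# if the ca.pem fingerprints differ between nodes, the new node was issued certs from a
# freshly generated CA, which matches the "certificate signed by unknown authority
# (etcd-ca)" errors above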

blackluck commented 10 months ago

Is it possible that you also ran it before changing the host order in the inventory? That could generate new certs if the run starts on an empty master (because it is then the first node).

ccaillet1974 commented 10 months ago

Yes, I'll test it now.

EDIT: same result with the new node in first place

TASK [etcd : Configure | Wait for etcd cluster to be healthy] **********************************************************************************************************************************************************************************
fatal: [lyo0-k8s-testm00]: FAILED! => {"attempts": 4, "changed": false, "cmd": "set -o pipefail && /usr/local/bin/etcdctl endpoint --cluster status && /usr/local/bin/etcdctl endpoint --cluster health 2>&1 | grep -v 'Error: unhealthy cluster' >/dev/null", "delta": "0:00:05.033871", "end": "2023-10-27 08:56:01.179553", "msg": "non-zero return code", "rc": 1, "start": "2023-10-27 08:55:56.145682", "stderr": "{\"level\":\"warn\",\"ts\":\"2023-10-27T08:56:01.175361+0200\",\"logger\":\"etcd-client\",\"caller\":\"v3@v3.5.9/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc000388fc0/10.141.10.64:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection closed before server preface received\"}\nError: failed to fetch endpoints from etcd cluster member list: context deadline exceeded", "stderr_lines": ["{\"level\":\"warn\",\"ts\":\"2023-10-27T08:56:01.175361+0200\",\"logger\":\"etcd-client\",\"caller\":\"v3@v3.5.9/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc000388fc0/10.141.10.64:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection closed before server preface received\"}", "Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT *****************************************************************************************************************************************************************************************************************************

PLAY RECAP *************************************************************************************************************************************************************************************************************************************
lyo0-k8s-testm00           : ok=489  changed=13   unreachable=0    failed=1    skipped=593  rescued=0    ignored=0
lyo0-k8s-testm01           : ok=470  changed=12   unreachable=0    failed=0    skipped=515  rescued=0    ignored=0
lyo0-k8s-testm02           : ok=471  changed=13   unreachable=0    failed=0    skipped=514  rescued=0    ignored=0

Friday 27 October 2023  08:56:01 +0200 (0:00:42.656)       0:08:10.829 ********
===============================================================================
etcd : Configure | Wait for etcd cluster to be healthy --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 42.66s
etcd : Gen_certs | Write etcd member/admin and kube_control_plane client certs to other etcd nodes ------------------------------------------------------------------------------------------------------------------------------------- 16.84s
etcd : Gen_certs | Write node certs to other etcd nodes -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 11.85s
download : Download_container | Download image if required ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 7.45s
container-engine/containerd : Download_file | Download item ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 7.37s
download : Download_file | Download item ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 6.90s
container-engine/runc : Download_file | Download item ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.86s
container-engine/nerdctl : Download_file | Download item -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.76s
etcdctl_etcdutl : Download_file | Download item ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.73s
container-engine/crictl : Download_file | Download item --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.72s
etcdctl_etcdutl : Extract_file | Unpacking archive -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 6.39s
container-engine/crictl : Extract_file | Unpacking archive ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 6.35s
container-engine/validate-container-engine : Populate service facts --------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.90s
container-engine/nerdctl : Extract_file | Unpacking archive ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.73s
etcd : Gen_certs | Gather etcd member/admin and kube_control_plane client certs from first etcd node ------------------------------------------------------------------------------------------------------------------------------------ 5.61s
etcd : Configure | Check if etcd cluster is healthy ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.40s
container-engine/containerd : Containerd | Unpack containerd archive -------------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.70s
container-engine/runc : Download_file | Validate mirrors -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.61s
container-engine/containerd : Download_file | Validate mirrors -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.49s
etcdctl_etcdutl : Download_file | Validate mirrors -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.41s

Here is the syslog on the new node when etcd tries to start:

Oct 27 08:57:24 lyo0-k8s-testm00 etcd[32704]: {"level":"info","ts":"2023-10-27T08:57:24.032429+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b is starting a new election at term 1"}
Oct 27 08:57:24 lyo0-k8s-testm00 etcd[32704]: {"level":"info","ts":"2023-10-27T08:57:24.033375+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b became pre-candidate at term 1"}
Oct 27 08:57:24 lyo0-k8s-testm00 etcd[32704]: {"level":"info","ts":"2023-10-27T08:57:24.033601+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b received MsgPreVoteResp from 7d81e23d9d41da1b at term 1"}
Oct 27 08:57:24 lyo0-k8s-testm00 etcd[32704]: {"level":"info","ts":"2023-10-27T08:57:24.033838+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b [logterm: 1, index: 3] sent MsgPreVote request to c101cbbb43bf28a0 at term 1"}
Oct 27 08:57:24 lyo0-k8s-testm00 etcd[32704]: {"level":"info","ts":"2023-10-27T08:57:24.034049+0200","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"7d81e23d9d41da1b [logterm: 1, index: 3] sent MsgPreVote request to eb484ede068d3a18 at term 1"}
Oct 27 08:57:25 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:25.067894+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"c101cbbb43bf28a0","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:25 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:25.069697+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"c101cbbb43bf28a0","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:25 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:25.072727+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"eb484ede068d3a18","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:25 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:25.074858+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"eb484ede068d3a18","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:30 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:30.048835+0200","caller":"etcdserver/server.go:2083","msg":"failed to publish local member to cluster through raft","local-member-id":"7d81e23d9d41da1b","local-member-attributes":"{Name:etcd1 ClientURLs:[https://10.141.10.64:2379]}","request-path":"/0/members/7d81e23d9d41da1b/attributes","publish-timeout":"15s","error":"etcdserver: request timed out"}
Oct 27 08:57:30 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:30.073383+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"eb484ede068d3a18","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:30 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:30.074327+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"c101cbbb43bf28a0","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:30 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:30.074809+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"c101cbbb43bf28a0","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:30 lyo0-k8s-testm00 etcd[32704]: {"level":"warn","ts":"2023-10-27T08:57:30.075396+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"eb484ede068d3a18","rtt":"0s","error":"remote error: tls: bad certificate"}
Oct 27 08:57:30 lyo0-k8s-testm00 systemd[1]: etcd.service: start operation timed out. Terminating.
Oct 27 08:57:30 lyo0-k8s-testm00 systemd[1]: etcd.service: Failed with result 'timeout'.
Oct 27 08:57:30 lyo0-k8s-testm00 systemd[1]: Failed to start etcd.
Oct 27 08:57:30 lyo0-k8s-testm00 systemd[1]: etcd.service: Consumed 18.670s CPU time.

blackluck commented 10 months ago

Sorry, I wasn't asking you to do it this way. I was asking whether you had already run it that way earlier. If you try to add an empty new node while it is the first master, kubespray won't find any certs on it and will treat it more or less as a fresh cluster install: it generates new certs and copies them to the other masters, but it probably doesn't restart the components there. Until they restart, those components keep the old certs in memory while the new ones sit on the filesystem. When a new master then tries to join, it gets the new certs, which won't match the still-running components. I would check whether there is a backup of the certs and whether they have changed.
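
A rough way to check for that in-memory vs on-disk drift with plain openssl (the cert path is kubespray's usual default and an assumption; the handshake aborts because no client cert is sent, but the server certificate is printed before that):

# issuer of the certificate a running member actually presents on :2379
echo | openssl s_client -connect 10.141.10.65:2379 2>/dev/null \
  | openssl x509 -noout -issuer -dates

# subject of the CA currently on that member's disk
openssl x509 -in /etc/ssl/etcd/ssl/ca.pem -noout -subject -dates

# if the issuer served on the wire no longer corresponds to the ca.pem on disk, the running
# etcd is still holding the old certs in memory while a later run rewrote the files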

ccaillet1974 commented 10 months ago

As said earlier, the "new node" is produced by: removing the node (with remove-node.yml), upgrading it from bullseye to bookworm, then adding it back to the cluster with the appropriate command.

I have already done this process with kubespray (release-2.22 branch), when I moved the control_plane nodes of another cluster from VMs to bare-metal servers using the procedure described in docs/nodes.md, and everything worked perfectly.

And the node is NOT the first master, because nodes.md describes how to remove/add the first control_plane node and I followed that documentation.

For me there is some problem in the release-2.23 branch ... I haven't tried with the master branch.

Actually I was working on another solution for upgrading my nodes: 1) drain the node, 2) upgrade via apt full-upgrade, 3) reboot the node, 4) uncordon the node (see the sketch below).
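
Roughly, that per-node in-place upgrade would look like this (node name is an example; for a bullseye to bookworm jump the apt sources also need to be switched first):

NODE=lyo0-k8s-testm00

# 1. drain the node (keep DaemonSet pods, evict everything else)
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data

# 2. upgrade the OS on the node itself
ssh "$NODE" 'sudo apt update && sudo apt full-upgrade -y'

# 3. reboot; the ssh connection drops, hence the "|| true"
ssh "$NODE" 'sudo reboot' || true
kubectl get node "$NODE" -w    # Ctrl+C once it reports Ready again

# 4. put it back into rotation
kubectl uncordon "$NODE"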

FingerlessGlov3s commented 8 months ago

Actually I was working on another solution for upgrading my nodes: 1) drain the node, 2) upgrade via apt full-upgrade, 3) reboot the node, 4) uncordon the node

Any update on this? Or did you go about upgrading the OS itself differently?

ccaillet1974 commented 8 months ago

Hi,

Sorry for the delay :)

Everything is working with this method. I've also tested the case where all nodes are on the same distro version (all on Debian bookworm), and now deleting a control_plane node and re-adding it works.

So maybe the issue is due to the version mismatch between control_plane nodes, my two cents :)

Regards

FingerlessGlov3s commented 8 months ago

So what's your process now?

Remove the node, upgrade the OS, then add it back?

ccaillet1974 commented 8 months ago

Yes, it is.
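
Putting the pieces from this thread together, the per-node sequence is roughly the following (commands copied from earlier comments, node name as an example):

# 1. remove the control-plane node from the cluster
ansible-playbook -i inventory/test-l2-multi/hosts.yml --become --become-user=root -K \
  remove-node.yml -e node=lyo0-k8s-testm00

# 2. reinstall or upgrade the OS on that machine so it matches the other control-plane nodes

# 3. add it back
ansible-playbook -i inventory/test-l2-multi/hosts.yml --become --become-user=root -K \
  --limit=kube_control_plane,etcd cluster.yml   # or --limit=kube_control_plane as in the original report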


VannTen commented 6 months ago

Possibly related #upgrade #10808

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 month ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/kubespray/issues/10560#issuecomment-2212408238):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.