kubernetes-sigs / kubespray

Deploy a Production Ready Kubernetes Cluster
Apache License 2.0

Discussion: Etcd scaling up process #11645

Open ugur99 opened 2 weeks ago

ugur99 commented 2 weeks ago

What happened?

Currently, when scaling up etcd with Kubespray, all etcd nodes are restarted simultaneously after new certificates have been generated and a backup has been taken on each instance. This results in downtime for the entire cluster. Is it possible to restart the etcd instances one by one, so that the cluster stays available? If not, what is the motivation behind the current approach?

What did you expect to happen?

-

How can we reproduce it (as minimally and precisely as possible)?

-

OS

-

Version of Ansible

-

Version of Python

-

Version of Kubespray (commit)

-

Network plugin used

cilium

Full inventory with variables

-

Command used to invoke ansible

-

Output of ansible run

-

Anything else we need to know

No response

VannTen commented 2 weeks ago

Could you fill in the template? Which playbook? There is not much information to go on here ^

ugur99 commented 2 weeks ago

This is not a bug report but a discussion topic, which is why I did not add the other information. It is actually not a new topic; I assumed you were already aware of this issue, @VannTen. See here.

VannTen commented 2 weeks ago

I think the linked issue is about upgrading, not scaling up, though?

Which playbook has that behavior? This seems a bit weird to me, because about 2 years ago I migrated whole clusters to new machines (moving from RHEL 7 to 8), including external etcd and control planes, without downtime (unless we missed it). The docs/operations/nodes.md file has the relevant docs, if I remember correctly.

ugur99 commented 2 weeks ago

I think the same playbook is used for both upgrading and scaling up the cluster:

https://github.com/kubernetes-sigs/kubespray/blob/master/playbooks/cluster.yml#L19-L20

https://github.com/kubernetes-sigs/kubespray/blob/master/playbooks/upgrade_cluster.yml#L38-L39

Here is the output of the scale-up operation from 3 control plane nodes to 5:

```
TASK [etcd : Install etcd] ****************************************
included: /root/kubespray/roles/etcd/tasks/install_host.yml for node1, node2, node3, node4, node5
Friday 18 October 2024 09:37:05 +0200 (0:00:00.247) 0:07:26.116 ********

TASK [etcd : Get currently-deployed etcd version] ****************************************
fatal: [node4]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcd --version", "msg": "[Errno 2] No such file or directory: b'/usr/local/bin/etcd'", "rc": 2, "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [node5]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcd --version", "msg": "[Errno 2] No such file or directory: b'/usr/local/bin/etcd'", "rc": 2, "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
changed: [node2]
changed: [node1]
changed: [node3]
Friday 18 October 2024 09:37:05 +0200 (0:00:00.353) 0:07:26.470 ********

TASK [etcd : Restart etcd if necessary] ****************************************
changed: [node4]
changed: [node5]
Friday 18 October 2024 09:37:05 +0200 (0:00:00.243) 0:07:26.713 ********
Friday 18 October 2024 09:37:05 +0200 (0:00:00.092) 0:07:26.806 ********

TASK [etcd : Install | Copy etcd binary from download dir] ****************************************
ok: [node1] => (item=etcd)
ok: [node2] => (item=etcd)
changed: [node4] => (item=etcd)
changed: [node5] => (item=etcd)
ok: [node3] => (item=etcd)
Friday 18 October 2024 09:37:06 +0200 (0:00:00.632) 0:07:27.439 ********

TASK [etcd : Configure etcd] ****************************************
included: /root/kubespray/roles/etcd/tasks/configure.yml for node1, node2, node3, node4, node5
Friday 18 October 2024 09:37:06 +0200 (0:00:00.139) 0:07:27.578 ********

TASK [etcd : Configure | Check if etcd cluster is healthy] ****************************************
ok: [node1]
Friday 18 October 2024 09:37:07 +0200 (0:00:00.351) 0:07:27.929 ********
Friday 18 October 2024 09:37:07 +0200 (0:00:00.021) 0:07:27.950 ********

TASK [etcd : Configure | Refresh etcd config] ****************************************
included: /root/kubespray/roles/etcd/tasks/refresh_config.yml for node1, node2, node3, node4, node5
Friday 18 October 2024 09:37:07 +0200 (0:00:00.125) 0:07:28.076 ********

TASK [etcd : Refresh config | Create etcd config file] ****************************************
changed: [node1]
changed: [node3]
changed: [node5]
changed: [node4]
changed: [node2]
Friday 18 October 2024 09:37:08 +0200 (0:00:01.096) 0:07:29.172 ********
Friday 18 October 2024 09:37:08 +0200 (0:00:00.078) 0:07:29.251 ********

TASK [etcd : Configure | Copy etcd.service systemd file] ****************************************
ok: [node2]
ok: [node1]
changed: [node5]
changed: [node4]
ok: [node3]
Friday 18 October 2024 09:37:09 +0200 (0:00:00.578) 0:07:29.829 ********
Friday 18 October 2024 09:37:09 +0200 (0:00:00.085) 0:07:29.914 ********

TASK [etcd : Configure | reload systemd] ****************************************
ok: [node5]
ok: [node4]
ok: [node2]
ok: [node1]
ok: [node3]
Friday 18 October 2024 09:37:09 +0200 (0:00:00.871) 0:07:30.786 ********

TASK [etcd : Configure | Ensure etcd is running] ****************************************
ok: [node2]
ok: [node1]
ok: [node3]
fatal: [node5]: FAILED! => {"changed": false, "msg": "Unable to start service etcd: Job for etcd.service failed because the control process exited with error code.\nSee \"systemctl status etcd.service\" and \"journalctl -xeu etcd.service\" for details.\n"}
...ignoring
fatal: [node4]: FAILED! => {"changed": false, "msg": "Unable to start service etcd: Job for etcd.service failed because the control process exited with error code.\nSee \"systemctl status etcd.service\" and \"journalctl -xeu etcd.service\" for details.\n"}
...ignoring
Friday 18 October 2024 09:37:11 +0200 (0:00:01.037) 0:07:31.823 ********
Friday 18 October 2024 09:37:11 +0200 (0:00:00.121) 0:07:31.945 ********

TASK [etcd : Configure | Wait for etcd cluster to be healthy] ****************************************
ok: [node1]
Friday 18 October 2024 09:37:11 +0200 (0:00:00.391) 0:07:32.336 ********
Friday 18 October 2024 09:37:11 +0200 (0:00:00.024) 0:07:32.361 ********

TASK [etcd : Configure | Check if member is in etcd cluster] ****************************************
ok: [node1]
ok: [node3]
ok: [node2]
fatal: [node4]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl member list | grep -w -q 192.168.64.160", "delta": "0:00:00.047067", "end": "2024-10-18 09:37:12.064918", "msg": "non-zero return code", "rc": 1, "start": "2024-10-18 09:37:12.017851", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [node5]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl member list | grep -w -q 192.168.64.161", "delta": "0:00:00.054740", "end": "2024-10-18 09:37:12.063456", "msg": "non-zero return code", "rc": 1, "start": "2024-10-18 09:37:12.008716", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
Friday 18 October 2024 09:37:12 +0200 (0:00:00.583) 0:07:32.944 ********
Friday 18 October 2024 09:37:12 +0200 (0:00:00.078) 0:07:33.023 ********

TASK [etcd : Configure | Join member(s) to etcd cluster one at a time] ****************************************
included: /root/kubespray/roles/etcd/tasks/join_etcd_member.yml for node4 => (item=node4)
included: /root/kubespray/roles/etcd/tasks/join_etcd_member.yml for node5 => (item=node5)
Friday 18 October 2024 09:37:12 +0200 (0:00:00.324) 0:07:33.347 ********

TASK [etcd : Join Member | Add member to etcd cluster] ****************************************
changed: [node4]
Friday 18 October 2024 09:37:12 +0200 (0:00:00.355) 0:07:33.702 ********

TASK [etcd : Join Member | Refresh etcd config] ****************************************
included: /root/kubespray/roles/etcd/tasks/refresh_config.yml for node4
Friday 18 October 2024 09:37:12 +0200 (0:00:00.082) 0:07:33.784 ********

TASK [etcd : Refresh config | Create etcd config file] ****************************************
changed: [node4]
Friday 18 October 2024 09:37:13 +0200 (0:00:00.577) 0:07:34.362 ********
Friday 18 October 2024 09:37:13 +0200 (0:00:00.045) 0:07:34.407 ********

TASK [etcd : Join Member | Ensure member is in etcd cluster] ****************************************
ok: [node4]
Friday 18 October 2024 09:37:13 +0200 (0:00:00.282) 0:07:34.690 ********

TASK [etcd : Configure | Ensure etcd is running] ****************************************
ok: [node4]
Friday 18 October 2024 09:37:14 +0200 (0:00:00.263) 0:07:34.953 ********
FAILED - RETRYING: [node5]: Join Member | Add member to etcd cluster (4 retries left).
FAILED - RETRYING: [node5]: Join Member | Add member to etcd cluster (3 retries left).

TASK [etcd : Join Member | Add member to etcd cluster] ****************************************
changed: [node5]
Friday 18 October 2024 09:37:28 +0200 (0:00:14.799) 0:07:49.752 ********

TASK [etcd : Join Member | Refresh etcd config] ****************************************
included: /root/kubespray/roles/etcd/tasks/refresh_config.yml for node5
Friday 18 October 2024 09:37:29 +0200 (0:00:00.191) 0:07:49.943 ********

TASK [etcd : Refresh config | Create etcd config file] ****************************************
ok: [node5]
Friday 18 October 2024 09:37:29 +0200 (0:00:00.522) 0:07:50.465 ********
Friday 18 October 2024 09:37:29 +0200 (0:00:00.044) 0:07:50.510 ********

TASK [etcd : Join Member | Ensure member is in etcd cluster] ****************************************
ok: [node5]
Friday 18 October 2024 09:37:29 +0200 (0:00:00.267) 0:07:50.778 ********

TASK [etcd : Configure | Ensure etcd is running] ****************************************
ok: [node5]
Friday 18 October 2024 09:37:30 +0200 (0:00:00.256) 0:07:51.035 ********
Friday 18 October 2024 09:37:30 +0200 (0:00:00.128) 0:07:51.163 ********

TASK [etcd : Refresh etcd config] ****************************************
included: /root/kubespray/roles/etcd/tasks/refresh_config.yml for node1, node2, node3, node4, node5
Friday 18 October 2024 09:37:30 +0200 (0:00:00.215) 0:07:51.379 ********

TASK [etcd : Refresh config | Create etcd config file] ****************************************
ok: [node2]
ok: [node1]
changed: [node4]
ok: [node5]
ok: [node3]
Friday 18 October 2024 09:37:31 +0200 (0:00:01.190) 0:07:52.569 ********
Friday 18 October 2024 09:37:31 +0200 (0:00:00.073) 0:07:52.643 ********
Friday 18 October 2024 09:37:31 +0200 (0:00:00.073) 0:07:52.716 ********
Friday 18 October 2024 09:37:31 +0200 (0:00:00.083) 0:07:52.799 ********

TASK [etcd : Refresh etcd config again for idempotency] ****************************************
included: /root/kubespray/roles/etcd/tasks/refresh_config.yml for node1, node2, node3, node4, node5
Friday 18 October 2024 09:37:32 +0200 (0:00:00.216) 0:07:53.015 ********

TASK [etcd : Refresh config | Create etcd config file] ****************************************
ok: [node5]
ok: [node4]
ok: [node1]
ok: [node3]
ok: [node2]
Friday 18 October 2024 09:37:33 +0200 (0:00:00.876) 0:07:53.892 ********
Friday 18 October 2024 09:37:33 +0200 (0:00:00.236) 0:07:54.128 ********

RUNNING HANDLER [etcd : Refresh Time Fact] ****************************************
ok: [node5]
ok: [node4]
ok: [node2]
ok: [node1]
ok: [node3]
Friday 18 October 2024 09:37:34 +0200 (0:00:01.118) 0:07:55.246 ********

RUNNING HANDLER [etcd : Set Backup Directory] ****************************************
ok: [node1]
ok: [node2]
ok: [node3]
ok: [node4]
ok: [node5]
Friday 18 October 2024 09:37:34 +0200 (0:00:00.114) 0:07:55.360 ********

RUNNING HANDLER [etcd : Create Backup Directory] ****************************************
changed: [node2]
changed: [node1]
changed: [node3]
changed: [node4]
changed: [node5]
Friday 18 October 2024 09:37:34 +0200 (0:00:00.287) 0:07:55.648 ********

RUNNING HANDLER [etcd : Stat etcd v2 data directory] ****************************************
ok: [node2]
ok: [node3]
ok: [node1]
ok: [node4]
ok: [node5]
Friday 18 October 2024 09:37:35 +0200 (0:00:00.308) 0:07:55.957 ********

RUNNING HANDLER [etcd : Backup etcd v2 data] ****************************************
changed: [node4]
changed: [node5]
changed: [node2]
changed: [node1]
changed: [node3]
Friday 18 October 2024 09:37:35 +0200 (0:00:00.649) 0:07:56.606 ********

RUNNING HANDLER [etcd : Backup etcd v3 data] ****************************************
changed: [node4]
changed: [node2]
changed: [node5]
changed: [node3]
changed: [node1]
Friday 18 October 2024 09:37:37 +0200 (0:00:01.263) 0:07:57.869 ********

RUNNING HANDLER [etcd : Etcd | reload systemd] ****************************************
ok: [node4]
ok: [node5]
ok: [node2]
ok: [node1]
ok: [node3]
Friday 18 October 2024 09:37:37 +0200 (0:00:00.869) 0:07:58.739 ********

RUNNING HANDLER [etcd : Reload etcd] ****************************************
changed: [node5]
changed: [node4]
changed: [node3]
changed: [node1]
changed: [node2]
Friday 18 October 2024 09:38:10 +0200 (0:00:32.341) 0:08:31.081 ********

RUNNING HANDLER [etcd : Wait for etcd up] ****************************************
ok: [node4]
ok: [node2]
ok: [node5]
ok: [node3]
ok: [node1]
Friday 18 October 2024 09:38:11 +0200 (0:00:01.477) 0:08:32.559 ********
Friday 18 October 2024 09:38:11 +0200 (0:00:00.245) 0:08:32.804 ********
Friday 18 October 2024 09:38:12 +0200 (0:00:00.079) 0:08:32.883 ********

RUNNING HANDLER [etcd : Set etcd_secret_changed] ****************************************
ok: [node1]
```

VannTen commented 2 weeks ago
RUNNING HANDLER [etcd : Reload etcd] *********************************************************************************************************************************************************************************
changed: [node5]
changed: [node4]
changed: [node3]
changed: [node1]
changed: [node2]
Friday 18 October 2024 09:38:10 +0200 (0:00:32.341) 0:08:31.081 ********

AFAICT from that log, this would be the problem, right? (This is not, in fact, a reload: https://github.com/kubernetes-sigs/kubespray/blob/5aea2abc40f9a7cbee0c0ad6bf32ec97f1ef3acf/roles/etcd/handlers/main.yml#L12-L17)

(I don't think this is a huge problem, because most of the time the window where etcd is unavailable would be very small.)

I think this can be fixed with throttle in that case, with the throttle value being something like {{ groups['etcd'] | length // 2 }} (so we keep quorum, at least for the usual odd-sized clusters, but still go as fast as possible). Host mode uses Type=notify in the systemd service, so systemd only reports the restart as done once etcd signals readiness, and that's enough for throttling to work. Not sure about docker mode.
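
To make the quorum arithmetic behind that throttle value concrete, here is a minimal Python sketch. It assumes the usual Raft majority rule (quorum = n // 2 + 1); the function names are made up for illustration and are not part of Kubespray or Ansible. It suggests that `length // 2` equals the number of members that can safely be down only for odd-sized clusters, and is one too high for even sizes.

```python
def quorum(members: int) -> int:
    """Smallest number of members that still forms a majority."""
    return members // 2 + 1


def max_parallel_restarts(members: int) -> int:
    """Largest number of members that can be down while a majority stays up.

    Algebraically this equals (members - 1) // 2.
    """
    return members - quorum(members)


if __name__ == "__main__":
    # Compare the safe batch size with the `length // 2` expression
    # suggested in the comment above, for small cluster sizes.
    for n in range(1, 8):
        print(
            f"members={n} quorum={quorum(n)} "
            f"safe_parallel={max_parallel_restarts(n)} length//2={n // 2}"
        )
```

For the 5-member cluster in the log both expressions give 2, so restarting two members at a time would leave three up. A single-member cluster has no safe batch size at all under this rule.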

ugur99 commented 2 weeks ago

Unfortunately, for relatively large clusters it can take ~2 minutes for each etcd instance to recover, and for production clusters that is a serious problem :(

Throttling is an option, but maybe restarting the old replicas with the new certs before joining the new members to the etcd cluster would be the cleanest way. I also opened this discussion on the etcd side.
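
For a rough sense of the trade-off between throttled and simultaneous restarts, the sketch below splits the members into restart batches and estimates how long a rolling restart would take. It is only an illustration under stated assumptions: members in a batch recover in parallel, each batch takes the ~2 minutes mentioned above, and the helper names are hypothetical rather than anything Kubespray provides.

```python
from math import ceil


def restart_batches(members: list[str], batch_size: int) -> list[list[str]]:
    """Split members into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [members[i:i + batch_size] for i in range(0, len(members), batch_size)]


def keeps_quorum(total: int, batch_size: int) -> bool:
    """True if a majority of members stays up while one batch restarts."""
    return total - batch_size >= total // 2 + 1


def rolling_restart_minutes(total: int, batch_size: int, recovery_minutes: float) -> float:
    """Estimated wall-clock time, assuming each batch recovers in parallel."""
    return ceil(total / batch_size) * recovery_minutes


if __name__ == "__main__":
    members = ["node1", "node2", "node3", "node4", "node5"]
    for size in (1, 2, 5):
        print(
            size,
            restart_batches(members, size),
            keeps_quorum(len(members), size),
            rolling_restart_minutes(len(members), size, 2.0),
        )
```

With five members and a batch size of 2, this gives three batches and roughly 6 minutes of total restart time while a majority stays up throughout. Restarting all five at once is faster (about 2 minutes under the same assumption) but loses quorum for that whole window.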

VannTen commented 2 weeks ago

> Unfortunately, for relatively large clusters it can take ~2 minutes for each etcd instance to recover, and for production clusters that is a serious problem :(

Yeah, 2 minutes is bad. How large are your clusters? I don't have more than 200 nodes, so that might be why I've never seen this. (Although cluster size may not be the only factor; the quantity of objects might be more relevant for etcd.)

VannTen commented 2 weeks ago

> Throttling is an option, but maybe restarting the old replicas with the new certs before joining the new members to the etcd cluster would be the cleanest way.

Don't the etcd nodes trust the CA? I'm not sure what you mean by this exactly. In the scale-up case, if we do:

-> generate certs signed by the CA for the new members -> join the new members to the cluster

there should not be any problems?

(I'm not completely sure what the etcd role does exactly; it's been a while since I worked on it, it was not my primary concern, and that role is not very readable.)

ugur99 commented 2 weeks ago

Ah, you are so right; we don't need to restart the old etcd members 👍 Sorry for my confusion.