shikam closed this issue 2 years ago.
You could start by sharing some information according to the bug template so we know what version you are running, what OS, which Ansible and Python versions, and your Ansible inventory and inventory variables.
Also reproduced, same error. Python 3.8.10, Ansible 5.5.0, Ubuntu 20.04.4 LTS.
TASK [etcd : Configure | Wait for etcd cluster to be healthy] ***** fatal: [co-node-1-127.mtr.labs.mlnx]: FAILED! => {"attempts": 4, "changed": false, "cmd": "set -o pipefail && /usr/local/bin/etcdctl endpoint --cluster status && /usr/local/bin/etcdctl endpoint --cluster health 2>&1 | grep -v 'Error: unhealthy cluster' >/dev/null", "delta": "0:00:05.020973", "end": "2022-04-13 11:42:48.848592", "msg": "non-zero return code", "rc": 1, "start": "2022-04-13 11:42:43.827619", "stderr": "{\"level\":\"warn\",\"ts\":\"2022-04-13T11:42:48.846Z\",\"logger\":\"etcd-client\",\"caller\":\"v3/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc0001508c0/10.213.2.127:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \\"transport: Error while dialing dial tcp 10.213.2.127:2379: connect: connection refused\\"\"}\nError: failed to fetch endpoints from etcd cluster member list: context deadline exceeded", "stderr_lines": ["{\"level\":\"warn\",\"ts\":\"2022-04-13T11:42:48.846Z\",\"logger\":\"etcd-client\",\"caller\":\"v3/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc0001508c0/10.213.2.127:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \\"transport: Error while dialing dial tcp 10.213.2.127:2379: connect: connection refused\\"\"}", "Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded"], "stdout": "", "stdout_lines": []}
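For reference, the failing task is just running the etcdctl health check shown in the cmd above. A minimal sketch of re-running it by hand on the failing node; the certificate file names under /etc/ssl/etcd/ssl/ are assumptions based on kubespray's usual layout, so adjust them to what is actually present on your node:
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://10.213.2.127:2379   # endpoint taken from the log above
export ETCDCTL_CACERT=/etc/ssl/etcd/ssl/ca.pem       # assumed path
export ETCDCTL_CERT=/etc/ssl/etcd/ssl/admin-$(hostname).pem      # assumed file name
export ETCDCTL_KEY=/etc/ssl/etcd/ssl/admin-$(hostname)-key.pem   # assumed file name
/usr/local/bin/etcdctl endpoint --cluster status
/usr/local/bin/etcdctl endpoint --cluster health
# "connection refused" here means nothing is listening on 2379 at all, i.e. the etcd service itself is not running on that node.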
Default playbook from the kubespray repo.
Is this on a cloud or on-prem deployment? What are the specifications of your nodes? If in a cloud deployment did you open the necessary ports in your security policies/security groups?
On prem
On prem on my side also. Should containerd be running on the hosts?
It depends on how you configured etcd; the default is to run etcd as a systemd service, so it does not depend on containerd.
Please check the logs of etcd.service and see what errors the etcd service is reporting.
Alternatively, please try a deployment with etcd_version: v3.5.2 to rule out any initialisation bug with 3.5.1 (the default).
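A minimal sketch of both checks, assuming the default host deployment where etcd runs under systemd (the inventory path is only an example):
sudo systemctl status etcd                 # on each etcd node
sudo journalctl -u etcd --no-pager -n 100  # last 100 log lines of the service
ansible-playbook -i inventory/mycluster/hosts.yaml -b cluster.yml -e etcd_version=v3.5.2   # from the control host, redeploy with etcd 3.5.2 pinned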
Should etcd/containerd be installed on the machines themselves? (Not on the Ansible machine that I am using for the deployment.)
https://github.com/kubernetes-sigs/kubespray/issues/8374#issuecomment-1007377820 seems helpful for this issue. Could you try
sudo rm -rf /var/lib/etcd2/*
sudo rm -f /etc/systemd/system/etcd*
on all etcd nodes before running the ansible-playbook?
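For example, the cleanup can be applied to the whole etcd group in one go; the inventory path is only an example, and the group name etcd matches kubespray's sample inventory:
ansible -i inventory/mycluster/hosts.yaml etcd -b -m shell -a "rm -rf /var/lib/etcd2/* && rm -f /etc/systemd/system/etcd*"
ansible -i inventory/mycluster/hosts.yaml etcd -b -m systemd -a "daemon_reload=yes"   # pick up the removed unit files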
/cc @oomichi
@oomichi, it didn't work.
@vladi14 Thanks for trying that.
Could you provide the following information based on the issue template?
Especially the Kubespray version, the network plugin used, and the etcd-related configuration are important for reproducing this issue.
I put some items based on your previous info.
**Environment**:
- **Cloud provider or hardware configuration:**
on-premises
- **OS (`printf "$(uname -srm)\n$(cat /etc/os-release)\n"`):**
Ubuntu 20.04.4 LTS
- **Version of Ansible** (`ansible --version`):
ansible 5.5.0
- **Version of Python** (`python --version`):
python 3.8.10
**Kubespray version (commit) (`git rev-parse --short HEAD`):**
**Network plugin used**:
**Full inventory with variables (`ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"`):**
<!-- We recommend using snippets services like https://gist.github.com/ etc. -->
**Command used to invoke ansible**:
**Output of ansible run**:
<!-- We recommend using snippets services like https://gist.github.com/ etc. -->
**Anything else we need to know**:
<!-- By running scripts/collect-info.yaml you can get a lot of useful information.
Script can be started by:
ansible-playbook -i <inventory_file_path> -u <ssh_user> -e ansible_ssh_user=<ssh_user> -b --become-user=root -e dir=`pwd` scripts/collect-info.yaml
(If you are using CoreOS, remember to add '-e ansible_python_interpreter=/opt/bin/python'.)
After running this command you can find the logs in `pwd`/logs.tar.gz. You can even upload the entire file somewhere and paste the link here. -->
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Mark this issue as rotten with /lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
At this moment, I have this problem too.
TASK [etcd : Configure | Ensure etcd is running] *********************************************************************************************************************************************************************************************************************************
ok: [node1]
ok: [node3]
ok: [node2]
FAILED - RETRYING: [node1]: Configure | Wait for etcd cluster to be healthy (4 retries left).
FAILED - RETRYING: [node1]: Configure | Wait for etcd cluster to be healthy (3 retries left).
FAILED - RETRYING: [node1]: Configure | Wait for etcd cluster to be healthy (2 retries left).
FAILED - RETRYING: [node1]: Configure | Wait for etcd cluster to be healthy (1 retries left).
TASK [etcd : Configure | Wait for etcd cluster to be healthy] ********************************************************************************************************************************************************************************************************************
fatal: [node1]: FAILED! => {"attempts": 4, "changed": false, "cmd": "set -o pipefail && /usr/local/bin/etcdctl endpoint --cluster status && /usr/local/bin/etcdctl endpoint --cluster health 2>&1 | grep -v 'Error: unhealthy cluster' >/dev/null", "delta": "0:00:05.020942", "end": "2022-09-01 17:29:49.746232", "msg": "non-zero return code", "rc": 1, "start": "2022-09-01 17:29:44.725290", "stderr": "{\"level\":\"warn\",\"ts\":\"2022-09-01T17:29:49.744+0200\",\"logger\":\"etcd-client\",\"caller\":\"v3/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc0003fc8c0/10.0.0.1:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \\\"transport: authentication handshake failed: remote error: tls: bad certificate\\\"\"}\nError: failed to fetch endpoints from etcd cluster member list: context deadline exceeded", "stderr_lines": ["{\"level\":\"warn\",\"ts\":\"2022-09-01T17:29:49.744+0200\",\"logger\":\"etcd-client\",\"caller\":\"v3/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc0003fc8c0/10.0.0.1:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \\\"transport: authentication handshake failed: remote error: tls: bad certificate\\\"\"}", "Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded"], "stdout": "", "stdout_lines": []}
I use Ubuntu 22.04 and kubespray from master (36bec19). Other params are left at their defaults.
I deleted:
rm -rf /var/lib/etcd2/*
rm -f /etc/systemd/system/etcd*
rm -f /var/backups
rm -rf /etc/ssl/etcd
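The "tls: bad certificate" in the log above usually means the etcd certificates left over from an earlier run no longer match the node. A quick way to inspect the member certificate; the /etc/ssl/etcd/ssl path follows kubespray's usual layout and node1 is a placeholder, so both are assumptions:
sudo openssl x509 -in /etc/ssl/etcd/ssl/member-node1.pem -noout -subject -dates
sudo openssl x509 -in /etc/ssl/etcd/ssl/member-node1.pem -noout -text | grep -A2 "Subject Alternative Name"
# The SANs should include the node's current IP (10.0.0.1 in the log above); if they do not, remove /etc/ssl/etcd on all etcd nodes and rerun the playbook so the certificates get regenerated.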
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Has anyone ever come up with a solution to this? Or is it solved in a subsequent version of kubespray, Ansible, or K8s, or by a specific combination of the three?
Thank you!
Hi,
I tried to install k8s with kubespray, but it failed at the step "Configure | Wait for etcd cluster to be healthy".
Full log:
TASK [etcd : Configure | Ensure etcd is running] **
ok: [node2]
ok: [node1]
ok: [node3]
Sunday 10 April 2022 10:50:23 +0000 (0:00:00.465) 0:03:06.692 **
Sunday 10 April 2022 10:50:23 +0000 (0:00:00.057) 0:03:06.750 **
FAILED - RETRYING: Configure | Wait for etcd cluster to be healthy (4 retries left).
FAILED - RETRYING: Configure | Wait for etcd cluster to be healthy (3 retries left).
FAILED - RETRYING: Configure | Wait for etcd cluster to be healthy (2 retries left).
FAILED - RETRYING: Configure | Wait for etcd cluster to be healthy (1 retries left).
TASK [etcd : Configure | Wait for etcd cluster to be healthy] ***** fatal: [node1]: FAILED! => {"attempts": 4, "changed": false, "cmd": "set -o pipefail && /usr/local/bin/etcdctl endpoint --cluster status && /usr/local/bin/etcdctl endpoint --cluster health 2>&1 | grep -v 'Error: unhealthy cluster' >/dev/null", "delta": "0:00:05.025825", "end": "2022-04-10 10:51:14.329654", "msg": "non-zero return code", "rc": 1, "start": "2022-04-10 10:51:09.303829", "stderr": "{\"level\":\"warn\",\"ts\":\"2022-04-10T10:51:14.327Z\",\"logger\":\"etcd-client\",\"caller\":\"v3/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc000452c40/10.173.64.39:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}\nError: failed to fetch endpoints from etcd cluster member list: context deadline exceeded", "stderr_lines": ["{\"level\":\"warn\",\"ts\":\"2022-04-10T10:51:14.327Z\",\"logger\":\"etcd-client\",\"caller\":\"v3/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc000452c40/10.173.64.39:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}", "Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded"], "stdout": "", "stdout_lines": []}
NO MORE HOSTS LEFT ****
PLAY RECAP ****
localhost : ok=4 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
node1 : ok=467 changed=18 unreachable=0 failed=1 skipped=579 rescued=0 ignored=0
node2 : ok=445 changed=18 unreachable=0 failed=0 skipped=327 rescued=0 ignored=0
node3 : ok=384 changed=16 unreachable=0 failed=0 skipped=296 rescued=0 ignored=0
How can I fix the problem?
Thanks, Shai
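For a "context deadline exceeded" failure like the one above, where the endpoint simply never answers and there is no TLS error, a first check is whether etcd is actually up and listening on its client port on each etcd node; 10.173.64.39 is the endpoint from the log and only serves as an example:
sudo systemctl is-active etcd   # should print "active"
sudo ss -tlnp | grep 2379       # should show etcd listening on the client port
nc -vz 10.173.64.39 2379        # run from another node to rule out firewall problems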