shikam closed this issue 2 years ago.
You could start by sharing some information according to the bug template so we know what version you are running, what OS, which Ansible and Python versions, and your Ansible inventory and inventory variables.
Also reproduced, same error. Python 3.8.10, Ansible 5.5.0, Ubuntu 20.04.4 LTS.
TASK [etcd : Configure | Wait for etcd cluster to be healthy] ***** fatal: [co-node-1-127.mtr.labs.mlnx]: FAILED! => {"attempts": 4, "changed": false, "cmd": "set -o pipefail && /usr/local/bin/etcdctl endpoint --cluster status && /usr/local/bin/etcdctl endpoint --cluster health 2>&1 | grep -v 'Error: unhealthy cluster' >/dev/null", "delta": "0:00:05.020973", "end": "2022-04-13 11:42:48.848592", "msg": "non-zero return code", "rc": 1, "start": "2022-04-13 11:42:43.827619", "stderr": "{\"level\":\"warn\",\"ts\":\"2022-04-13T11:42:48.846Z\",\"logger\":\"etcd-client\",\"caller\":\"v3/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc0001508c0/10.213.2.127:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \\"transport: Error while dialing dial tcp 10.213.2.127:2379: connect: connection refused\\"\"}\nError: failed to fetch endpoints from etcd cluster member list: context deadline exceeded", "stderr_lines": ["{\"level\":\"warn\",\"ts\":\"2022-04-13T11:42:48.846Z\",\"logger\":\"etcd-client\",\"caller\":\"v3/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc0001508c0/10.213.2.127:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \\"transport: Error while dialing dial tcp 10.213.2.127:2379: connect: connection refused\\"\"}", "Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded"], "stdout": "", "stdout_lines": []}
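For reference, the failing task is just running the etcdctl health check shown in the cmd above. A minimal sketch of re-running it by hand on the failing node; the certificate file names under /etc/ssl/etcd/ssl/ are assumptions based on kubespray's usual layout, so adjust them to what is actually present on your node:
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://10.213.2.127:2379   # endpoint taken from the log above
export ETCDCTL_CACERT=/etc/ssl/etcd/ssl/ca.pem       # assumed path
export ETCDCTL_CERT=/etc/ssl/etcd/ssl/admin-$(hostname).pem      # assumed file name
export ETCDCTL_KEY=/etc/ssl/etcd/ssl/admin-$(hostname)-key.pem   # assumed file name
/usr/local/bin/etcdctl endpoint --cluster status
/usr/local/bin/etcdctl endpoint --cluster health
# "connection refused" here means nothing is listening on 2379 at all, i.e. the etcd service itself is not running on that node.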
Default playbook from the kubespray repo.
Is this on a cloud or on-prem deployment? What are the specifications of your nodes? If in a cloud deployment did you open the necessary ports in your security policies/security groups?
On prem
On prem on my side also. Should containerd be running on the hosts?
It depends on how you configured etcd; the default is to run etcd as a systemd service, so it does not depend on containerd.
Please check the logs of etcd.service and see what errors the etcd service is reporting.
Alternatively, please try a deployment with etcd_version: v3.5.2 to rule out any initialisation bug with 3.5.1 (the default).
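A minimal sketch of both checks, assuming the default host deployment where etcd runs under systemd (the inventory path is only an example):
sudo systemctl status etcd                 # on each etcd node
sudo journalctl -u etcd --no-pager -n 100  # last 100 log lines of the service
ansible-playbook -i inventory/mycluster/hosts.yaml -b cluster.yml -e etcd_version=v3.5.2   # from the control host, redeploy with etcd 3.5.2 pinned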
Should etcd/containerd be installed on the machines themselves? (Not on the Ansible machine that I am using for the deployment.)
https://github.com/kubernetes-sigs/kubespray/issues/8374#issuecomment-1007377820 seems helpful for this issue. Could you try
sudo rm -rf /var/lib/etcd2/*
sudo rm -f /etc/systemd/system/etcd*
on all etcd nodes before running the ansible-playbook?
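For example, the cleanup can be applied to the whole etcd group in one go; the inventory path is only an example, and the group name etcd matches kubespray's sample inventory:
ansible -i inventory/mycluster/hosts.yaml etcd -b -m shell -a "rm -rf /var/lib/etcd2/* && rm -f /etc/systemd/system/etcd*"
ansible -i inventory/mycluster/hosts.yaml etcd -b -m systemd -a "daemon_reload=yes"   # pick up the removed unit files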
/cc @oomichi
@oomichi, it didn't work.
@vladi14 Thanks for trying that.
Could you provide the following information based on the issue template?
Especially the Kubespray version, the network plugin used, and the etcd-related configuration are important for reproducing this issue.
I put some items based on your previous info.
**Environment**:
- **Cloud provider or hardware configuration:**
on-premises
- **OS (`printf "$(uname -srm)\n$(cat /etc/os-release)\n"`):**
Ubuntu 20.04.4 LTS
- **Version of Ansible** (`ansible --version`):
ansible 5.5.0
- **Version of Python** (`python --version`):
python 3.8.10
**Kubespray version (commit) (`git rev-parse --short HEAD`):**
**Network plugin used**:
**Full inventory with variables (`ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"`):**
<!-- We recommend using snippets services like https://gist.github.com/ etc. -->
**Command used to invoke ansible**:
**Output of ansible run**:
<!-- We recommend using snippets services like https://gist.github.com/ etc. -->
**Anything else we need to know**:
<!-- By running scripts/collect-info.yaml you can get a lot of useful information.
Script can be started by:
ansible-playbook -i <inventory_file_path> -u <ssh_user> -e ansible_ssh_user=<ssh_user> -b --become-user=root -e dir=`pwd` scripts/collect-info.yaml
(If you are using CoreOS, remember to add '-e ansible_python_interpreter=/opt/bin/python'.)
After running this command you can find the logs in `pwd`/logs.tar.gz. You can even upload the entire file somewhere and paste the link here. -->
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Mark this issue as rotten with /lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
At this moment, I have this problem too.
TASK [etcd : Configure | Ensure etcd is running] *********************************************************************************************************************************************************************************************************************************
ok: [node1]
ok: [node3]
ok: [node2]
FAILED - RETRYING: [node1]: Configure | Wait for etcd cluster to be healthy (4 retries left).
FAILED - RETRYING: [node1]: Configure | Wait for etcd cluster to be healthy (3 retries left).
FAILED - RETRYING: [node1]: Configure | Wait for etcd cluster to be healthy (2 retries left).
FAILED - RETRYING: [node1]: Configure | Wait for etcd cluster to be healthy (1 retries left).
TASK [etcd : Configure | Wait for etcd cluster to be healthy] ********************************************************************************************************************************************************************************************************************
fatal: [node1]: FAILED! => {"attempts": 4, "changed": false, "cmd": "set -o pipefail && /usr/local/bin/etcdctl endpoint --cluster status && /usr/local/bin/etcdctl endpoint --cluster health 2>&1 | grep -v 'Error: unhealthy cluster' >/dev/null", "delta": "0:00:05.020942", "end": "2022-09-01 17:29:49.746232", "msg": "non-zero return code", "rc": 1, "start": "2022-09-01 17:29:44.725290", "stderr": "{\"level\":\"warn\",\"ts\":\"2022-09-01T17:29:49.744+0200\",\"logger\":\"etcd-client\",\"caller\":\"v3/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc0003fc8c0/10.0.0.1:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \\\"transport: authentication handshake failed: remote error: tls: bad certificate\\\"\"}\nError: failed to fetch endpoints from etcd cluster member list: context deadline exceeded", "stderr_lines": ["{\"level\":\"warn\",\"ts\":\"2022-09-01T17:29:49.744+0200\",\"logger\":\"etcd-client\",\"caller\":\"v3/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc0003fc8c0/10.0.0.1:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \\\"transport: authentication handshake failed: remote error: tls: bad certificate\\\"\"}", "Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded"], "stdout": "", "stdout_lines": []}
I use Ubuntu 22.04 and kubespray from master (36bec19). Other params are left at their defaults.
I deleted:
rm -rf /var/lib/etcd2/*
rm -f /etc/systemd/system/etcd*
rm -f /var/backups
rm -rf /etc/ssl/etcd
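The "tls: bad certificate" in the log above usually means the etcd certificates left over from an earlier run no longer match the node. A quick way to inspect the member certificate; the /etc/ssl/etcd/ssl path follows kubespray's usual layout and node1 is a placeholder, so both are assumptions:
sudo openssl x509 -in /etc/ssl/etcd/ssl/member-node1.pem -noout -subject -dates
sudo openssl x509 -in /etc/ssl/etcd/ssl/member-node1.pem -noout -text | grep -A2 "Subject Alternative Name"
# The SANs should include the node's current IP (10.0.0.1 in the log above); if they do not, remove /etc/ssl/etcd on all etcd nodes and rerun the playbook so the certificates get regenerated.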
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Has anyone ever come up with a solution to this? Or is it solved in a subsequent version of kubespray, Ansible, or K8s, or by a specific combination of the three?
Thank you!
Hi,
I tried to install k8s with kubespray, but it failed at the step "Configure | Wait for etcd cluster to be healthy".
Full log:
TASK [etcd : Configure | Ensure etcd is running] **
ok: [node2]
ok: [node1]
ok: [node3]
Sunday 10 April 2022 10:50:23 +0000 (0:00:00.465) 0:03:06.692 **
Sunday 10 April 2022 10:50:23 +0000 (0:00:00.057) 0:03:06.750 **
FAILED - RETRYING: Configure | Wait for etcd cluster to be healthy (4 retries left).
FAILED - RETRYING: Configure | Wait for etcd cluster to be healthy (3 retries left).
FAILED - RETRYING: Configure | Wait for etcd cluster to be healthy (2 retries left).
FAILED - RETRYING: Configure | Wait for etcd cluster to be healthy (1 retries left).
TASK [etcd : Configure | Wait for etcd cluster to be healthy] ***** fatal: [node1]: FAILED! => {"attempts": 4, "changed": false, "cmd": "set -o pipefail && /usr/local/bin/etcdctl endpoint --cluster status && /usr/local/bin/etcdctl endpoint --cluster health 2>&1 | grep -v 'Error: unhealthy cluster' >/dev/null", "delta": "0:00:05.025825", "end": "2022-04-10 10:51:14.329654", "msg": "non-zero return code", "rc": 1, "start": "2022-04-10 10:51:09.303829", "stderr": "{\"level\":\"warn\",\"ts\":\"2022-04-10T10:51:14.327Z\",\"logger\":\"etcd-client\",\"caller\":\"v3/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc000452c40/10.173.64.39:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}\nError: failed to fetch endpoints from etcd cluster member list: context deadline exceeded", "stderr_lines": ["{\"level\":\"warn\",\"ts\":\"2022-04-10T10:51:14.327Z\",\"logger\":\"etcd-client\",\"caller\":\"v3/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc000452c40/10.173.64.39:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}", "Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded"], "stdout": "", "stdout_lines": []}
NO MORE HOSTS LEFT ****
PLAY RECAP ****
localhost : ok=4 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
node1 : ok=467 changed=18 unreachable=0 failed=1 skipped=579 rescued=0 ignored=0
node2 : ok=445 changed=18 unreachable=0 failed=0 skipped=327 rescued=0 ignored=0
node3 : ok=384 changed=16 unreachable=0 failed=0 skipped=296 rescued=0 ignored=0
How can I fix the problem?
Thanks, Shai
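For a "context deadline exceeded" failure like the one above, where the endpoint simply never answers and there is no TLS error, a first check is whether etcd is actually up and listening on its client port on each etcd node; 10.173.64.39 is the endpoint from the log and only serves as an example:
sudo systemctl is-active etcd   # should print "active"
sudo ss -tlnp | grep 2379       # should show etcd listening on the client port
nc -vz 10.173.64.39 2379        # run from another node to rule out firewall problems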