kubernetes-sigs / kubespray

Deploy a Production Ready Kubernetes Cluster
Apache License 2.0

etcd cluster is unavailable or misconfigured: connection refused #2767

Closed MaxCCC closed 4 years ago

MaxCCC commented 6 years ago

Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG

Environment:

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"


**Version of Ansible** (`ansible --version`):
`ansible 2.5.2`

**Kubespray version (commit) (`git rev-parse --short HEAD`):**
REL v2.5.0
commit: `02cd5418`
**Network plugin used**:
default: calico

**Copy of your inventory file:**

[all]
node1 ansible_host=192.168.140.191 ip=192.168.140.191
node2 ansible_host=192.168.140.192 ip=192.168.140.192
node3 ansible_host=192.168.140.193 ip=192.168.140.193

[kube-master]
node1
node2

[kube-node]
node1
node2
node3

[etcd]
node1
node2
node3

[k8s-cluster:children]
kube-node
kube-master

[calico-rr]

[vault]
node1
node2
node3



**Command used to invoke ansible**:
`ansible-playbook --flush-cache -u myuser -b -i inventory/mycluster/hosts.ini cluster.yml`

**Output of ansible run**:
Errors along the lines of:
fatal: [node1]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl --no-sync --endpoints=https://192.168.140.191:2379,https://192.168.140.192:2379,https://192.168.140.193:2379 member list | grep -q 192.168.140.191", "delta": "0:00:00.020942", "end": "2018-05-13 18:28:37.103184", "msg": "non-zero return code", "rc": 1, "start": "2018-05-13 18:28:37.082242", "stderr": "client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 192.168.140.191:2379: getsockopt: connection refused\n; error #1: dial tcp 192.168.140.192:2379: getsockopt: no route to host\n; error #2: dial tcp 192.168.140.193:2379: getsockopt: no route to host", "stderr_lines": ["client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 192.168.140.191:2379: getsockopt: connection refused", "; error #1: dial tcp 192.168.140.192:2379: getsockopt: no route to host", "; error #2: dial tcp 192.168.140.193:2379: getsockopt: no route to host"], "stdout": "", "stdout_lines": []}

and:

`fatal: [node2]: FAILED! => {"attempts": 10, "changed": false, "content": "", "msg": "Status code was -1 and not [200]: Request failed: <urlopen error ('_ssl.c:563: The handshake operation timed out',)>", "redirected": false, "status": -1, "url": "https://192.168.140.192:2379/health"}`

**Anything else do we need to know**:
`Firewall disabled, ssh access + root priv work for ansible, sudo swapoff -a`
woopstar commented 6 years ago

It's fixed in this PR #2577

manunmathew commented 6 years ago

I am facing same issue TASK [etcd : Configure | Check if etcd cluster is healthy] *** Thursday 17 May 2018 13:53:52 +0000 (0:00:00.508) 0:05:09.972 ** FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left). FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left). FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left). FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left). FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left). FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left). FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left). FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left). FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left). FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left). fatal: [node2]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://10.9.0.36:2379,https://10.9.0.37:2379,https://10.9.0.38:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:06.027081", "end": "2018-05-17 13:54:35.677364", "msg": "non-zero return code", "rc": 1, "start": "2018-05-17 13:54:29.650283", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.38:2379 exceeded header timeout\n; error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout\n; error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout\n\nerror #0: client: endpoint https://10.9.0.38:2379 exceeded header timeout\nerror #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout\nerror #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "; error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "; error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "", "error #0: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []} FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left). FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left). fatal: [node3]: FAILED! 
=> {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://10.9.0.36:2379,https://10.9.0.37:2379,https://10.9.0.38:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:06.021069", "end": "2018-05-17 13:54:51.668894", "msg": "non-zero return code", "rc": 1, "start": "2018-05-17 13:54:45.647825", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.36:2379 exceeded header timeout\n; error #1: client: endpoint https://10.9.0.38:2379 exceeded header timeout\n; error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout\n\nerror #0: client: endpoint https://10.9.0.36:2379 exceeded header timeout\nerror #1: client: endpoint https://10.9.0.38:2379 exceeded header timeout\nerror #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "; error #1: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "; error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "", "error #0: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "error #1: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []} fatal: [node1]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://10.9.0.36:2379,https://10.9.0.37:2379,https://10.9.0.38:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:06.035036", "end": "2018-05-17 13:54:52.431413", "msg": "non-zero return code", "rc": 1, "start": "2018-05-17 13:54:46.396377", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.37:2379 exceeded header timeout\n; error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout\n; error #2: client: endpoint https://10.9.0.38:2379 exceeded header timeout\n\nerror #0: client: endpoint https://10.9.0.37:2379 exceeded header timeout\nerror #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout\nerror #2: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "; error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "; error #2: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "", "error #0: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "error #2: client: endpoint https://10.9.0.38:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT *** to retry, use: --limit @/etc/ansible/roles/kubespray/cluster.retry

PLAY RECAP *** localhost : ok=2 changed=0 unreachable=0 failed=0
node1 : ok=177 changed=11 unreachable=0 failed=1
node2 : ok=173 changed=11 unreachable=0 failed=1
node3 : ok=173 changed=11 unreachable=0 failed=1
node4 : ok=152 changed=9 unreachable=0 failed=0
node5 : ok=149 changed=9 unreachable=0 failed=0
node6 : ok=149 changed=9 unreachable=0 failed=0
node7 : ok=149 changed=9 unreachable=0 failed=0
node8 : ok=149 changed=9 unreachable=0 failed=0
node9 : ok=149 changed=9 unreachable=0 failed=0

Thursday 17 May 2018 13:54:52 +0000 (0:01:00.404) 0:06:10.376 **
===============================================================================
etcd : Configure | Check if etcd cluster is healthy ----------------------------------------------- 60.40s
gather facts from all instances ------------------------------------------------------------------- 18.74s
kubernetes/preinstall : Install packages requirements --------------------------------------------- 17.78s
kubernetes/preinstall : install growpart ---------------------------------------------------------- 14.13s
download : Download items -------------------------------------------------------------------------- 8.41s
download : Sync container -------------------------------------------------------------------------- 7.29s
etcd : Configure | Check if etcd cluster is healthy ------------------------------------------------ 6.78s
download : Sync container -------------------------------------------------------------------------- 5.79s
download : Download items -------------------------------------------------------------------------- 5.70s
download : Sync container -------------------------------------------------------------------------- 5.63s
download : Download items -------------------------------------------------------------------------- 5.61s
download : Sync container -------------------------------------------------------------------------- 5.60s
download : Download items -------------------------------------------------------------------------- 5.46s
docker : Ensure old versions of Docker are not installed. | RedHat --------------------------------- 5.24s
etcd : Configure | Check if etcd-events cluster is healthy ----------------------------------------- 4.71s
kubernetes/preinstall : Hosts | populate inventory into hosts file --------------------------------- 4.21s
download : container_download | Download containers if pull is required or told to always pull (all nodes) --- 3.82s
docker : ensure docker packages are installed ------------------------------------------------------ 3.64s
kubernetes/preinstall : Update package management cache (YUM) - Redhat ----------------------------- 3.29s
kubernetes/preinstall : Create kubernetes directories ---------------------------------------------- 2.82s

manunmathew commented 6 years ago

Ansible node: RHEL 7.2, ansible==2.4.2.0. Master and agent nodes: RHEL 7.5.

dlifanov commented 6 years ago

Same issue with Ubuntu 16.04.4 Ansible 2.5.1

DukeHarris commented 6 years ago

Same issue using Vagrant with default configuration on current master

Vagrant 2.1.1 ansible 2.5.3

ArieLevs commented 6 years ago

Had exactly the same issue here, using CentOS 7 (3.10.0-693.21.1.el7.x86_64), ansible 2.5.4.

Well, your cluster is actually up and running; it's just that the health check is failing. Here is my workaround.

You can manually check your cluster by providing cert and key values

SSH to one of the nodes and:

etcdctl --endpoints https://127.0.0.1:2379 --ca-file=/etc/ssl/etcd/ssl/ca.pem --cert-file=/etc/ssl/etcd/ssl/member-node1.pem --key-file=/etc/ssl/etcd/ssl/member-node1-key.pem --debug cluster-health

You should get successful check for all cluster members

Just remove/comment out the following tasks:

Configure | Check if etcd cluster is healthy
Configure | Check if etcd-events cluster is healthy

The same health check later fails in the playbook at Calico | wait for etcd.

You can also check that by doing

curl --cert /etc/ssl/etcd/ssl/member-node1.pem --key /etc/ssl/etcd/ssl/member-node1-key.pem https://127.0.0.1:2379/health

So also remove/comment out the playbook task:

Calico | wait for etcd

Hope this gets fixed soon; I wasted a lot of time figuring this out.

woopstar commented 6 years ago

The checks:

Configure | Check if etcd cluster is healthy
Configure | Check if etcd-events cluster is healthy

should not fail if the cluster is healthy and the certificates needed for the check are present. Removing the checks is not a solution at all.

pablodav commented 6 years ago

After investigating this, the only way I could replicate the issue in my case was with incorrect no_proxy env settings and an http_proxy var in /etc/environment.

I just removed http_proxy from /etc/environment and fixed the no_proxy environment variable.

example:

no_proxy: "localhost,127.0.0.1,.local.domain,10.3.0.1,10.3.0.2,10.3.0.4,10.3.0.5,10.3.0.6" # no_proxy for subnets is ignored

You must have all your hosts' IPs in no_proxy when using a proxy.

This was my case, don't know if it is yours.
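If you want to check whether a proxy is actually intercepting the etcd health traffic on a node, a quick manual test is sketched below; the node IP and cert paths are just the ones used earlier in this thread, so adjust them to your environment.

# Show any proxy settings the health checks would inherit
env | grep -i _proxy
grep -i proxy /etc/environment

# Hit the etcd health endpoint directly, explicitly bypassing any proxy
curl --noproxy '*' \
     --cacert /etc/ssl/etcd/ssl/ca.pem \
     --cert /etc/ssl/etcd/ssl/member-node1.pem \
     --key /etc/ssl/etcd/ssl/member-node1-key.pem \
     https://192.168.140.191:2379/health

If this succeeds while the same curl without --noproxy fails, the proxy configuration is the culprit.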

There is also a strange empty `when:` here that I removed in my tests:

https://github.com/kubernetes-incubator/kubespray/blob/master/roles/etcd/tasks/main.yml#L9

- include_tasks: "gen_certs_{{ cert_management }}.yml"
  when:
  tags:
    - etcd-secrets
ArieLevs commented 6 years ago

Just to update: the above error is probably due to a firewalld issue. On a dev env, just stop and disable the firewalld service; on production, open all the relevant ports (2379, 2380, etc.).

running on Centos 7 Linux 4.17.3-1.el7.elrepo.x86_64 x86_64

mikimtm commented 6 years ago

I am having the same issue. Disabled firewall and it does not help. Running on CentOS 7

mikimtm commented 6 years ago

Actually, with firewalld disabled it seems to be starting to work. Does anyone know the full list of ports I need to open?

ArieLevs commented 6 years ago

Run on master nodes:

firewall-cmd --permanent --add-port=6443/tcp
firewall-cmd --permanent --add-port=2379/tcp
firewall-cmd --permanent --add-port=2380/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=10251/tcp
firewall-cmd --permanent --add-port=10252/tcp
firewall-cmd --permanent --add-port=10255/tcp
firewall-cmd --reload

Run on all nodes:

firewall-cmd --permanent --add-port=30000-32767/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=10255/tcp
firewall-cmd --permanent --add-port=6783/tcp

BTW, SELinux is working fine; I did not have to make any adjustments or disable it.
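To verify the rules actually took effect and that etcd is listening, something like this on each etcd/master node should be enough (port list taken from the commands above):

# Confirm the ports were opened in firewalld
firewall-cmd --list-ports
# Confirm etcd is actually listening on the client and peer ports
ss -ltn | grep -E ':2379|:2380'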

fauzan-n commented 6 years ago

Same issue here With Centos 7

fatal: [node3]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.5.70:2379,https://192.168.5.71:2379,https://192.168.5.72:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:02.239063", "end": "2018-07-16 15:27:23.188711", "msg": "non-zero return code", "rc": 1, "start": "2018-07-16 15:27:20.949648", "stderr": "Error:  client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate has expired or is not yet valid\n; error #1: x509: certificate has expired or is not yet valid\n; error #2: client: endpoint https://192.168.5.72:2379 exceeded header timeout\n\nerror #0: x509: certificate has expired or is not yet valid\nerror #1: x509: certificate has expired or is not yet valid\nerror #2: client: endpoint https://192.168.5.72:2379 exceeded header timeout", "stderr_lines": ["Error:  client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate has expired or is not yet valid", "; error #1: x509: certificate has expired or is not yet valid", "; error #2: client: endpoint https://192.168.5.72:2379 exceeded header timeout", "", "error #0: x509: certificate has expired or is not yet valid", "error #1: x509: certificate has expired or is not yet valid", "error #2: client: endpoint https://192.168.5.72:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}

my configuration

> [all]
> node1    ansible_host=192.168.5.70 ip=192.168.5.70
> node2    ansible_host=192.168.5.71 ip=192.168.5.71
> node3    ansible_host=192.168.5.72 ip=192.168.5.72
>
> [kube-master]
> node1
>
> [kube-node]
> node2
> node3
>
> [etcd]
> node1
> node2
> node3
>
> [k8s-cluster:children]
> kube-node
> kube-master
>
> [calico-rr]
>
> [vault]
> node1
> node2
> node3
frippe75 commented 6 years ago

I think I might have the same issue and can't figure out why. I get both the connection error as well as a complaint about the CA cert being self-signed.

There is a task:

- name: Gen_certs | update ca-certificates (Debian/Ubuntu/Container Linux by CoreOS)
  command: update-ca-certificates
  when: etcd_ca_cert.changed

Not sure it's working as expected. It reports success, but there is no update-ca-certificates script on my installation (CoreOS7).

So I'm also stuck on waiting for etcd health status to check out ok.

Will try the workaround of disabling the check task for now. I noticed that update-ca-certificates is part of the overlay filesystem of the etcd container. Should that task really be run on the node?
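For reference, on a CentOS/RHEL host the rough equivalent of update-ca-certificates would be something like the sketch below; the ca.pem path is the one kubespray uses elsewhere in this thread, and this is only a manual workaround, not what the role itself does.

# Trust the etcd CA in the host trust store (RHEL/CentOS equivalent of update-ca-certificates)
cp /etc/ssl/etcd/ssl/ca.pem /etc/pki/ca-trust/source/anchors/etcd-ca.pem
update-ca-trust extract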

hdave commented 6 years ago

This issue was stopping deployment in 2.6.0, but with 2.7.0 my Ubuntu 18.04 cluster gets deployed. However, there is still an etcd health check failing (it is ignored). As per @ArieLevs, I can confirm that running the etcdctl check with the certs on the command line works. I think the root cause of this error is NOT a firewall issue (although that has the same symptoms); it is a self-signed cert error. If you run etcdctl in debug mode without the certs, it complains: error #0: remote error: tls: bad certificate

The offending check is in file kubespray/roles/etcd/tasks/configure.yml as follows:

- name: Configure | Check if etcd cluster is healthy
  shell: "{{ bin_dir }}/etcdctl --endpoints={{ etcd_access_addresses }} cluster-health | grep -q 'cluster is healthy'"
  register: etcd_cluster_is_healthy
  until: etcd_cluster_is_healthy.rc == 0
  retries: 4
  delay: "{{ retry_stagger | random + 3 }}"
  ignore_errors: false
  changed_when: false
  check_mode: no
  when: is_etcd_master and etcd_cluster_setup
  tags:
    - facts
  environment:
    ETCDCTL_CERT_FILE: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem"
    ETCDCTL_KEY_FILE: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem"

I believe the environment variables here are not respected by etcdctl. A better way to do this (and one that works) is the Calico configuration, where the certs are passed in explicitly, as follows:

- name: Calico | wait for etcd
  uri:
    url: "{{ etcd_access_addresses.split(',') | first }}/health"
    validate_certs: no
    client_cert: "{{ etcd_cert_dir }}/node-{{ inventory_hostname }}.pem"
    client_key: "{{ etcd_cert_dir }}/node-{{ inventory_hostname }}-key.pem"
  register: result
  until: result.status == 200 or result.status == 401
  retries: 10
  delay: 5
  run_once: true
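A quick way to test whether etcdctl actually honors those environment variables is to run the same check by hand on an etcd node. This is a minimal sketch, assuming the node is called node1 and the default kubespray cert paths; if it succeeds without any --cert-file/--key-file flags, the env vars are being picked up, and if it fails with a TLS error, the suspicion above holds.

# Same check the task runs, but driven purely by environment variables
export ETCDCTL_CA_FILE=/etc/ssl/etcd/ssl/ca.pem
export ETCDCTL_CERT_FILE=/etc/ssl/etcd/ssl/admin-node1.pem
export ETCDCTL_KEY_FILE=/etc/ssl/etcd/ssl/admin-node1-key.pem
/usr/local/bin/etcdctl --endpoints=https://127.0.0.1:2379 cluster-health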
truncj commented 6 years ago

Alternatively, if you want to feed the cli arguments to the shell task:


- name: Configure | Check if etcd cluster is healthy
  shell: "{{ bin_dir }}/etcdctl --endpoints={{ etcd_access_addresses }} --cert-file {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem --key-file {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem cluster-health | grep -q 'cluster is healthy'"
  register: etcd_cluster_is_healthy
  ignore_errors: true
  changed_when: false
  check_mode: no
  when: is_etcd_master and etcd_cluster_setup
  tags:
    - facts
  environment:
    ETCDCTL_CERT_FILE: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem"
    ETCDCTL_KEY_FILE: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem"
ChiKenNeg commented 6 years ago

I'm having the same issue with default vars using vagrant.

I've tried to verify etcd cluster health with the admin/member certificates and still get a request-exceeded error. Is there any progress on this?

ezzoueidi commented 6 years ago

same issue and behavior with Ubuntu 16.04. Ansible version 2.6.6

$ etcdctl --debug cluster-health
Cluster-Endpoints: http://127.0.0.1:4001, http://127.0.0.1:2379
cURL Command: curl -X GET http://127.0.0.1:4001/v2/members
cURL Command: curl -X GET http://127.0.0.1:2379/v2/members
cluster may be unhealthy: failed to list members
Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:4001: getsockopt: connection refused
; error #1: client: endpoint http://127.0.0.1:2379 exceeded header timeout

error #0: dial tcp 127.0.0.1:4001: getsockopt: connection refused
error #1: client: endpoint http://127.0.0.1:2379 exceeded header timeout

I have disabled ufw but no luck. As mentioned above, when I try to edit my inventory.cfg or hosts.ini file to use only one etcd node, it does not work either.

My understanding from this weird behavior and from debugging it is the following:

etcdctl --debug cluster-health

- I also tried enabling the firewall again and accepting traffic on the etcd ports:
`iptables -I INPUT  -p tcp -m tcp --dport 2379 -j ACCEPT && iptables -I INPUT  -p tcp -m tcp --dport 2380 -j ACCEPT`

Finally, I had to do this: delete the old etcd docker image and `gcr.io/google_containers/cluster-proportional-autoscaler-amd64` to prevent k8s from pulling the old etcd image back, then run the docker image manually, including the SSL and certificate paths, and change the behavior of the etcd docker container so that it runs without the `--net host` option, gets an IP from the docker0 interface, and exposes the needed ports.

docker run -d -v /usr/share/ca-certificates/:/etc/ssl/certs \
  -p 4001:4001 -p 2380:2380 -p 2379:2379 \
  --name etcd quay.io/coreos/etcd:v2.3.8 \
  -name etcd0 \
  -advertise-client-urls http://IP:2379,http://IP:4001 \
  -listen-client-urls http://0.0.0.0:2379,http://0.0.0.0:4001 \
  -initial-advertise-peer-urls http://IP:2380 \
  -listen-peer-urls http://0.0.0.0:2380 \
  -initial-cluster-token etcd-cluster-1 \
  -initial-cluster etcd0=http://IP1:2380,etcd1=http://IP2:2380,etcd2=http://IP3:2380 \
  -initial-cluster-state new



I am still looking for any workaround/solution for this.
PexMor commented 5 years ago

I have seen the very same error :-( It seems that the IP chosen for etcd is not always the appropriate one. My deployment was on OpenStack+CoreOS using 1 master and 2 nodes (a pretty plain and basic setup). I found that when the nodes are exposed via public IPs (all 3 nodes had a floating_ip_address associated) and at the same time have internal IPs (from the internal subnet/LAN), etcd gets configured to use the external/floating IP. Unfortunately, that IP is not present on the hosts themselves, and it does not even make sense to use an IP behind the router for an etcd cluster (of 1 node).

The above failed every time. When I switched the setup to 1 bastion, 1 master and 2 nodes (neither master nor node having a floating IP associated), then after a little fiddling with inventory/sample/no-floating.yml, moving it into the correct inventory/$CLUSTER/ directory, and running both terraform and ansible from the root of the kubespray git repo ... magic happened and the cluster was up and running without any further issue.

To conclude, I would find it nice to have an automated test for OpenStack deployment with a working setup. Even the how-to guide should be slightly updated to reflect the actual steps to be done.

To-Do: fix the deployment to work with DNS, w/o bastion and name-based certs (FreeIPA cert-monger would be nice?)

Eventually, I can create a pull request?

mvasilenko commented 5 years ago

@PexMor hitting the same issue, OpenStack + Ubuntu, please take a look at #2606, those changes were approved but not merged

fentas commented 5 years ago

I also have the etcd health task failing. But what is weird is that if I run the task manually (after the playbook is done), it works perfectly (setting the env vars and calling cluster-health).

As @ArieLevs said etcd seems healthy.

laimison commented 5 years ago

In my case I ran this successfully only after checking out release-2.8 branch instead of using master. Used defaults and modified only hosts.ini

Note that the configuration was exactly the same when I tried release-2.8 and master.

Errors that disappeared:

fatal: [k8s-1]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://172.17.8.101:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:00.009745", "end": "2019-02-02 16:15:22.366223", "msg": "non-zero return code", "rc": 1, "start": "2019-02-02 16:15:22.356478", "stderr": "Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.17.8.101:2379: getsockopt: connection refused\n\nerror #0: dial tcp 172.17.8.101:2379: getsockopt: connection refused", "stderr_lines": ["Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.17.8.101:2379: getsockopt: connection refused", "", "error #0: dial tcp 172.17.8.101:2379: getsockopt: connection refused"], "stdout": "", "stdout_lines": []}
...ignoring

this error above was ignored and build continued

AND at the end:

fatal: [k8s-1]: FAILED! => {"msg": "The conditional check 'kube_token_auth' failed. The error was: error while evaluating conditional (kube_token_auth): 'kube_token_auth' is undefined\n\nThe error appears to have been in '/Users/music/Documents/git/kubespray/roles/kubernetes/tokens/tasks/check-tokens.yml': line 2, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n---\n- name: \"Check_tokens | check if the tokens have already been generated on first master\"\n  ^ here\n"}

Ansible 2.6.0 Vagrant 2.0.4 VirtualBox 5.2.26 Ubuntu 18.04

Ran Kubespray from Mac OS El Capitan

llarsson commented 5 years ago

> I have seen the very same error :-( It seems that the IP to be used by the etcd is not always appropriate for it. My deployment was on OpenStack+CoreOS using 1 master 2 nodes (pretty plain and basic setup). I have found that while having those exposed to public IP (all 3 nodes have had a floating_ip_address associated) and at the same time having internal IPs (from the internal subnet/lan) then the etcd is configured to use external/floating IP. Unfortunately, such IP is not present at the hosts and does not even make any sense to have such IP behind the router for etcd cluster (of 1 node).

This happened for me, but opening the security group to allow traffic to port 2379 from "everywhere" (laziness) on the master made it possible for itself to connect via the floating IP and the playbook could complete.

Seems to me that the solution is to either not use the floating IP or make sure that the security group allows access to it.
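If you go the security-group route, a rule like the one below opens the etcd client port; this is just a sketch, and the security group name and remote CIDR are placeholders (ideally you would restrict the source to the cluster subnet rather than 0.0.0.0/0):

# Allow etcd client traffic (2379/tcp) into the master's security group
openstack security group rule create \
  --ingress --protocol tcp --dst-port 2379 \
  --remote-ip 0.0.0.0/0 \
  my-k8s-master-secgroup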

rodrigc commented 5 years ago

I ran into the same problem when trying to run kubespray against 3 bare metal Centos 7.6 servers.

It turns out that I had not set up the bare metal servers properly, because the system time was not correct on the three machines. So what was happening was that kubespray generated certificates whose start time was later than the system time on 2 out of 3 of my machines.

I solved this by installing chrony and starting chronyd on each machine to set the correct time. I could also have used ntpd.
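To confirm this kind of clock-skew problem, it is enough to compare the certificate validity window with the node's clock; a minimal sketch, assuming the kubespray default cert path used earlier in this thread:

# Certificate validity window vs. the node's current time
openssl x509 -noout -dates -in /etc/ssl/etcd/ssl/ca.pem
date -u
# Check and enable time synchronization
timedatectl status
systemctl enable --now chronyd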

riponbanik commented 5 years ago

I hit the same issue: etcd runs only on the master/controller node; on the other nodes it is not running. No issue with the firewall (it is not even running): RHEL 7.5 in AWS, no firewalld/iptables.

fatal: [machine-01]: FAILED! => { "attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://10.14.5.141:2379,https://10.14.6.49:2379,https://10.14.7.118:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:02.018172", "end": "2019-02-15 17:22:52.241082", "invocation": { "module_args": { "_raw_params": "/usr/local/bin/etcdctl --endpoints=https://10.14.5.141:2379,https://10.14.6.49:2379,https://10.14.7.118:2379 cluster-health | grep -q 'cluster is healthy'", "_uses_shell": true, "argv": null, "chdir": null, "creates": null, "executable": null, "removes": null, "stdin": null, "warn": true } }, "msg": "non-zero return code", "rc": 1, "start": "2019-02-15 17:22:50.222910", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.14.5.141:2379 exceeded header timeout\n; error #1: dial tcp 10.14.7.118:2379: getsockopt: connection refused\n; error #2: dial tcp 10.14.6.49:2379: getsockopt: connection refused\n\nerror #0: client: endpoint https://10.14.5.141:2379 exceeded header timeout\nerror #1: dial tcp 10.14.7.118:2379: getsockopt: connection refused\nerror #2: dial tcp 10.14.6.49:2379: getsockopt: connection refused", "stderr_lines": [ "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.14.5.141:2379 exceeded header timeout", "; error #1: dial tcp 10.14.7.118:2379: getsockopt: connection refused", "; error #2: dial tcp 10.14.6.49:2379: getsockopt: connection refused", "", "error #0: client: endpoint https://10.14.5.141:2379 exceeded header timeout", "error #1: dial tcp 10.14.7.118:2379: getsockopt: connection refused", "error #2: dial tcp 10.14.6.49:2379: getsockopt: connection refused" ], "stdout": "", "stdout_lines": [] }

vterry commented 5 years ago

Hey guys,

A possible workaround for this issue is to flush iptables (`iptables -F`); this works for me.

Setup: CentOS Linux release 7.6.1810 (Core), kubespray commit a8dd69cf (git rev-parse --short HEAD), CNI: canal.

qasim9641 commented 5 years ago

I am having the same issue with the latest release (v2.9.0) on Ubuntu 16.04 with the firewall disabled on my machine. Did anyone resolve this issue?

vterry commented 5 years ago

> I am having the same issue with latest release(V 2.9.0) on Ubuntu 16.04 with Firewall disabled on my machine.Did anyone resolve this issue ?

Have you tried flushing your iptables?

qasim9641 commented 5 years ago

Hi vterry, yes I have flushed the iptables and I am still seeing these errors in the following 2 places:

TASK [etcd : Configure | Check if member is in etcd cluster] *****
Wednesday 10 April 2019 14:32:04 -0400 (0:00:00.118) 0:02:17.215 *
fatal: [node1]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl --no-sync --endpoints=https://192.168.19.247:2379,https://192.168.19.248:2379,https://192.168.19.249:2379 member list | grep -q 192.168.19.247", "delta": "0:00:00.029426", "end": "2019-04-10 14:32:04.720756", "msg": "non-zero return code", "rc": 1, "start": "2019-04-10 14:32:04.691330", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

After above error it continues the playbook but it fails at this place

TASK [etcd : Join Member | Add member to etcd cluster] ***
Wednesday 10 April 2019 14:32:07 -0400 (0:00:00.200) 0:02:20.150 *
FAILED - RETRYING: Join Member | Add member to etcd cluster (4 retries left).
FAILED - RETRYING: Join Member | Add member to etcd cluster (3 retries left).
FAILED - RETRYING: Join Member | Add member to etcd cluster (2 retries left).
FAILED - RETRYING: Join Member | Add member to etcd cluster (1 retries left).
fatal: [node1]: FAILED! => {"attempts": 4, "changed": true, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.19.247:2379,https://192.168.19.248:2379,https://192.168.19.249:2379 member add etcd1 https://192.168.19.247:2380", "delta": "0:00:02.045849", "end": "2019-04-10 14:32:38.279139", "msg": "non-zero return code", "rc": 1, "start": "2019-04-10 14:32:36.233290", "stderr": "client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 192.168.19.250:4001: getsockopt: connection refused\n; error #1: client: etcd member https://192.168.19.248:2379 has no leader\n; error #2: dial tcp 192.168.19.250:2379: getsockopt: connection refused\n; error #3: client: etcd member https://192.168.19.249:2379 has no leader", "stderr_lines": ["client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 192.168.19.250:4001: getsockopt: connection refused", "; error #1: client: etcd member https://192.168.19.248:2379 has no leader", "; error #2: dial tcp 192.168.19.250:2379: getsockopt: connection refused", "; error #3: client: etcd member https://192.168.19.249:2379 has no leader"], "stdout": "", "stdout_lines": []}

And it just stopped the playbook after this error. I am not sure how to debug this issue further :(

vterry commented 5 years ago

Hi Vterry Yes I have flush the Ip tables and I am still seeing these errors in following 2 places

TASK [etcd : Configure | Check if member is in etcd cluster] ***** Wednesday 10 April 2019 14:32:04 -0400 (0:00:00.118) 0:02:17.215 * fatal: [node1]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl --n o-sync --endpoints=https://192.168.19.247:2379,https://192.168.19.248:2379,https ://192.168.19.249:2379 member list | grep -q 192.168.19.247", "delta": "0:00:00. 029426", "end": "2019-04-10 14:32:04.720756", "msg": "non-zero return code", "rc ": 1, "start": "2019-04-10 14:32:04.691330", "stderr": "", "stderr_lines": [], " stdout": "", "stdout_lines": []}

After above error it continues the playbook but it fails at this place

TASK [etcd : Join Member | Add member to etcd cluster] *** Wednesday 10 April 2019 14:32:07 -0400 (0:00:00.200) 0:02:20.150 * FAILED - RETRYING: Join Member | Add member to etcd cluster (4 retries left). FAILED - RETRYING: Join Member | Add member to etcd cluster (3 retries left). FAILED - RETRYING: Join Member | Add member to etcd cluster (2 retries left). FAILED - RETRYING: Join Member | Add member to etcd cluster (1 retries left). fatal: [node1]: FAILED! => {"attempts": 4, "changed": true, "cmd": "/usr/local/b in/etcdctl --endpoints=https://192.168.19.247:2379,https://192.168.19.248:2379,h ttps://192.168.19.249:2379 member add etcd1 https://192.168.19.247:2380", "delta ": "0:00:02.045849", "end": "2019-04-10 14:32:38.279139", "msg": "non-zero retur n code", "rc": 1, "start": "2019-04-10 14:32:36.233290", "stderr": "client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 192.168.19.250:4001 : getsockopt: connection refused\n; error #1: client: etcd member https://192.16 8.19.248:2379 has no leader\n; error #2: dial tcp 192.168.19.250:2379: getsockop t: connection refused\n; error #3: client: etcd member https://192.168.19.249:23 79 has no leader", "stderr_lines": ["client: etcd cluster is unavailable or misc onfigured; error #0: dial tcp 192.168.19.250:4001: getsockopt: connection refuse d", "; error #1: client: etcd member https://192.168.19.248:2379 has no leader", "; error #2: dial tcp 192.168.19.250:2379: getsockopt: connection refused", "; error #3: client: etcd member https://192.168.19.249:2379 has no leader"], "stdo ut": "", "stdout_lines": []} And it just stopped the playbook after this error I am not sure how to debug this issue further :(

Can you share your hosts.ini and your all.yml?

qasim9641 commented 5 years ago

Hi vterry, I have pasted my hosts.ini, inventory.ini and all.yml below. As it does not allow me to attach files, if you can share your email I can also send those files.

**I am using inventory.ini because if I use the hosts.ini I get the error: Failed to Parse -2.9.0/kubespray-2.9.0/inventory/mycluster/hosts.ini:4: Expected key=value host variable assignment, got: 192.168.19.247

[WARNING]: Unable to parse /nfsdata/home/qraza/kubespraydeploymentapril9/kubespray-2.9.0/kubespray-2.9.0/inventory/mycluster/hosts.ini as an inventory source**

File hosts.ini:

all:
  hosts:
    node1:
      access_ip: 192.168.19.247
      ip: 192.168.19.247
      ansible_host: 192.168.19.247
    node2:
      access_ip: 192.168.19.248
      ip: 192.168.19.248
      ansible_host: 192.168.19.248
    node3:
      access_ip: 192.168.19.249
      ip: 192.168.19.249
      ansible_host: 192.168.19.249
  children:
    kube-master:
      hosts:
        node1:
        node2:
    kube-node:
      hosts:
        node3:
        node1:
        node2:
    etcd:
      hosts:
        node3:
        node1:
        node2:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}

File inventory.ini:

## Configure 'ip' variable to bind kubernetes services on a
## different ip than the default iface
## We should set etcd_member_name for etcd cluster. The node that is not a etcd member do not need to set the value, or can set the empty string value.
[all]
node1 ansible_host=192.168.19.247 ip=192.168.19.247 etcd_member_name=etcd1
node2 ansible_host=192.168.19.248 ip=192.168.19.248 etcd_member_name=etcd2
node3 ansible_host=192.168.19.249 ip=192.168.19.249 etcd_member_name=etcd3
# node4 ansible_host=95.54.0.15 # ip=10.3.0.4 etcd_member_name=etcd4
# node5 ansible_host=95.54.0.16 # ip=10.3.0.5 etcd_member_name=etcd5
# node6 ansible_host=95.54.0.17 # ip=10.3.0.6 etcd_member_name=etcd6

## configure a bastion host if your nodes are not directly reachable
# bastion ansible_host=x.x.x.x ansible_user=some_user

[kube-master]
node1
node2

[etcd]
node1
node2
node3

[kube-node]
node2
node3

[k8s-cluster:children]
kube-master
kube-node

File all.yml:

## Directory where etcd data stored
etcd_data_dir: /var/lib/etcd

## Directory where the binaries will be installed
bin_dir: /usr/local/bin

## The access_ip variable is used to define how other nodes should access
## the node. This is used in flannel to allow other flannel nodes to see
## this node for example. The access_ip is really useful AWS and Google
## environments where the nodes are accessed remotely by the "public" ip,
## but don't know about that address themselves.
# access_ip: 1.1.1.1

## External LB example config
# apiserver_loadbalancer_domain_name: "elb.some.domain"
# loadbalancer_apiserver:
#   address: 1.2.3.4
#   port: 1234

## Internal loadbalancers for apiservers
# loadbalancer_apiserver_localhost: true

## Local loadbalancer should use this port
## And must be set port 6443
# nginx_kube_apiserver_port: 6443

## If nginx_kube_apiserver_healthcheck_port variable defined, enables proxy liveness check.
# nginx_kube_apiserver_healthcheck_port: 8081

## OTHER OPTIONAL VARIABLES

## For some things, kubelet needs to load kernel modules. For example, dynamic kernel services are needed
## for mounting persistent volumes into containers. These may not be loaded by preinstall kubernetes
## processes. For example, ceph and rbd backed volumes. Set to true to allow kubelet to load kernel
## modules.
# kubelet_load_modules: false

## Upstream dns servers
# upstream_dns_servers:
#   - 8.8.8.8
#   - 8.8.4.4

## There are some changes specific to the cloud providers
## for instance we need to encapsulate packets with some network plugins
## If set the possible values are either 'gce', 'aws', 'azure', 'openstack', 'vsphere', 'oci', or 'external'
## When openstack is used make sure to source in the openstack credentials
## like you would do when using openstack-client before starting the playbook.
## Note: The 'external' cloud provider is not supported.
## TODO(riverzhang): https://kubernetes.io/docs/tasks/administer-cluster/running-cloud-controller/#running-cloud-controller-manager
# cloud_provider:

## Set these proxy values in order to update package manager and docker daemon to use proxies
# http_proxy: ""
# https_proxy: ""

## Refer to roles/kubespray-defaults/defaults/main.yml before modifying no_proxy
# no_proxy: ""

## Some problems may occur when downloading files over https proxy due to ansible bug
## https://github.com/ansible/ansible/issues/32750. Set this variable to False to disable
## SSL validation of get_url module. Note that kubespray will still be performing checksum validation.
# download_validate_certs: False

## If you need exclude all cluster nodes from proxy and other resources, add other resources here.
# additional_no_proxy: ""

## Certificate Management
## This setting determines whether certs are generated via scripts.
## Chose 'none' if you provide your own certificates.
## Option is "script", "none"
## note: vault is removed
# cert_management: script

## Set to true to allow pre-checks to fail and continue deployment
# ignore_assert_errors: false

## The read-only port for the Kubelet to serve on with no authentication/authorization. Uncomment to enable.
# kube_read_only_port: 10255

## Set true to download and cache container
# download_container: true

## Deploy container engine
## Set false if you want to deploy container engine manually.
# deploy_container_engine: true

## Set Pypi repo and cert accordingly
# pyrepo_index: https://pypi.example.com/simple
# pyrepo_cert: /etc/ssl/certs/ca-certificates.crt

ansible_user: tmp1
ansible_password: password
ansible_become_pass: password

markpenner34 commented 5 years ago

Has anyone found a fix for this issue?

bagbag commented 5 years ago

I tried deploying Kubernetes in combination with WireGuard. It just didn't work. After some deeper digging, I found out that ip (the public IP) instead of access_ip (the private WireGuard IP) is used as the listening address for etcd.

This commit in my fork fixed it for me: https://github.com/bagbag/kubespray/commit/209eb8a5118bd61a178cd08b7d802100dfd4e32e
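To see which address etcd actually binds to on a node, something like the following helps; the /etc/etcd.env path is an assumption based on how kubespray typically renders the etcd service environment, so adjust if your deployment differs.

# Which addresses is etcd actually listening on?
ss -ltnp | grep 2379
# Which listen/advertise URLs did kubespray render for this member?
grep -E 'LISTEN|ADVERTISE' /etc/etcd.env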

ArieLevs commented 5 years ago

@markpenner34 I've just deployed a 3-node cluster on CentOS 7 with kernel 5.1.3-1, ansible 2.8.0, using the latest kubespray repo (from a week ago), with SELinux on and firewalld on with these rules:

execute on master nodes:

firewall-cmd --permanent --add-port=6443/tcp
firewall-cmd --permanent --add-port=2379/tcp
firewall-cmd --permanent --add-port=2380/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=10251/tcp
firewall-cmd --permanent --add-port=10252/tcp
firewall-cmd --permanent --add-port=10255/tcp
firewall-cmd --reload

execute on all nodes:

firewall-cmd --permanent --add-port=30000-32767/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=10255/tcp
firewall-cmd --permanent --add-port=6783/tcp
firewall-cmd --permanent --add-port=443/tcp
firewall-cmd --reload

If installing Calico open these ports on all nodes:

firewall-cmd --permanent --add-port=179/tcp
firewall-cmd --permanent --add-port=5473/tcp
firewall-cmd --permanent --add-port=4789/udp
firewall-cmd --reload

And it all went perfectly fine. What is the error you are getting? (Please don't bomb the thread with a really long log.)

markpenner34 commented 5 years ago

Hi @ArieLevs, I don't have firewalld installed on the servers. Running Ubuntu 16.04, the latest version of kubespray, and ansible 2.7.10.

Failing in the etcd/configure role, specifically the health checks.

error #0: dial tcp 127.0.0.1:4001: getsockopt: connection refused

error #1: malformed HTTP response "\x15\x03\x01\x00\x02\x02"

Any help would be appreciated.

ArieLevs commented 5 years ago

@markpenner34 I've noticed that this etcd issue regarding port 4001 appears to occur on Ubuntu (port 4001 is legacy and, per the etcd documentation, should not be used).

What happens if you ssh to node1 and execute

etcdctl --endpoints https://127.0.0.1:2379 --ca-file=/etc/ssl/etcd/ssl/ca.pem --cert-file=/etc/ssl/etcd/ssl/member-node1.pem --key-file=/etc/ssl/etcd/ssl/member-node1-key.pem --debug cluster-health

Try ports 4001 and 2379 (the certificate file paths may be different on Ubuntu, as this command was executed on Centos, change to relevant paths if needed)

BTW, a response of \x15\x03\x01\x00\x02\x02 means a plain-HTTP request was sent to an HTTPS (TLS) endpoint.

markpenner34 commented 5 years ago

@ArieLevs

Node 1 is not an etcd node. This is my cluster.yml; is this correct?

https://pastebin.com/403G71pC

When I run sudo lsof -i:2379 on the etcd nodes, I can see that there are no ports listening.

However, when I run that in the docker container running etcd, I can see the ports are listening correctly.

ArieLevs commented 5 years ago

@markpenner34 The config files look different for me; I use the official ones from https://github.com/kubernetes-sigs/kubespray#usage

So my inventory.ini only contain (everything else is commented out)

[k8s-cluster:children]
kube-master
kube-node

And the node information is declared in the hosts.yml file. I'm sorry I cannot assist much more, as I've never deployed k8s (using kubespray) on Ubuntu.

csayler commented 5 years ago

same issue on:

aneeshwara commented 5 years ago

Hi, we are trying to create an etcd cluster but we are facing the following error; please find it below:

Error: client: etcd cluster is unavailable or misconfigured error #0: dial tcp 192.168.2.139:2379: getsockopt: connection refused

please help us to get out of this

thanks in advance

We are following the link below to implement Kubernetes on bare metal:

https://medium.com/faun/configuring-ha-kubernetes-cluster-on-bare-metal-servers-with-kubeadm-1-2-1e79f0f7857b

ewtang commented 5 years ago

My test as follows:

  1. Ubuntu host using Vagrant with kubespray master branch.

  2. Captioned issue resulted: fatal: [k8s-1]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://172.17.8.101:2379,https://172.17.8.102:2379,https://172.17.8.103:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:06.027821", "end": "2019-07-23 00:28:29.583827", "msg": "non-zero return code", "rc": 1, "start": "2019-07-23 00:28:23.556006", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://172.17.8.103:2379 exceeded header timeout\n; error #1: client: endpoint https://172.17.8.101:2379 exceeded header timeout\n; error #2: client: endpoint https://172.17.8.102:2379 exceeded header timeout\n\nerror #0: client: endpoint https://172.17.8.103:2379 exceeded header timeout\nerror #1: client: endpoint https://172.17.8.101:2379 exceeded header timeout\nerror #2: client: endpoint https://172.17.8.102:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://172.17.8.103:2379 exceeded header timeout", "; error #1: client: endpoint https://172.17.8.101:2379 exceeded header timeout", "; error #2: client: endpoint https://172.17.8.102:2379 exceeded header timeout", "", "error #0: client: endpoint https://172.17.8.103:2379 exceeded header timeout", "error #1: client: endpoint https://172.17.8.101:2379 exceeded header timeout", "error #2: client: endpoint https://172.17.8.102:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}

  3. VMs have 2 network interfaces: eth0 for the public network and eth1 for the private network. The issue is fixed if access_ip is assigned the public network IP and access_ip is used instead of ip as the etcd address.

ok: [k8s-1] => {"attempts": 2, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.121.208:2379,https://192.168.121.108:2379,https://192.168.121.186:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:04.715343", "end": "2019-07-23 01:28:09.521148", "rc": 0, "start": "2019-07-23 01:28:04.805805", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []} ok: [k8s-2] => {"attempts": 2, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.121.208:2379,https://192.168.121.108:2379,https://192.168.121.186:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:01.726660", "end": "2019-07-23 01:28:09.588888", "rc": 0, "start": "2019-07-23 01:28:07.862228", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []} ok: [k8s-3] => {"attempts": 2, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.121.208:2379,https://192.168.121.108:2379,https://192.168.121.186:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:04.637851", "end": "2019-07-23 01:28:09.587249", "rc": 0, "start": "2019-07-23 01:28:04.949398", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

Is this a proper fix?

ruimaciel commented 5 years ago

I'm also experiencing this problem while using kubespray to try to deploy a 2-node Kubernetes cluster on OpenStack instances running Ubuntu 18.04.

How to reproduce:

Here are the contents of ./inventory/mycluster/hosts.yml:

all:
  hosts:
    node1:
      ansible_user: myuser
      ansible_host: 185.178.87.56
      ip: 192.168.0.8
      access_ip: 185.178.87.56
    node2:
      ansible_user: myuser
      ansible_host: 185.178.87.47
      ip: 192.168.0.9
      access_ip: 185.178.87.47
  children:
    kube-master:
      hosts:
        node1:
    kube-node:
      hosts:
        node1:
        node2:
    etcd:
      hosts:
        node1:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}

Result:

TASK [etcd : Configure | Check if etcd cluster is healthy] **************************************************************************************************************************************************************
Thursday 01 August 2019  15:47:02 +0100 (0:00:00.023)       0:02:44.977 ******* 
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).
fatal: [node1]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://185.178.87.56:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:02.018130", "end": "2019-08-01 14:47:30.642472", "msg": "non-zero return code", "rc": 1, "start": "2019-08-01 14:47:28.624342", "stderr": "Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://185.178.87.56:2379 exceeded header timeout\n\nerror #0: client: endpoint https://185.178.87.56:2379 exceeded header timeout", "stderr_lines": ["Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://185.178.87.56:2379 exceeded header timeout", "", "error #0: client: endpoint https://185.178.87.56:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT *****************************************************************************************************************************
    to retry, use: --limit @/home/rmam/development/CORDS/other/creodias_kubespray/kubespray/cluster.retry

PLAY RECAP *****************************************************************************************************************************
localhost                  : ok=1    changed=0    unreachable=0    failed=0   
node1                      : ok=462  changed=12   unreachable=0    failed=1   
node2                      : ok=312  changed=9    unreachable=0    failed=0 
ewtang commented 5 years ago

For my test with vagrant provider=libvirt, the problem turned out to be that ip address 172.17.8.1 of the private (virtual) network is occasionally used as src ip in TLS handshake instead of the host ip 172.17.8.10x of the etcd cluster nodes.

<network ipv6='yes'>
  <name>kubespray0</name>
  <uuid>a502bbbb-7118-4e4a-8443-7ae1195dc93d</uuid>
  <forward mode='nat'/>
  <bridge name='virbr2' stp='on' delay='0'/>
  <mac address='52:54:00:43:3f:ac'/>
  <ip address='172.17.8.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='172.17.8.1' end='172.17.8.254'/>
    </dhcp>
  </ip>
</network>

The workaround in such case is to add the relevant ip to the following setting:

etcd_cert_alt_ips: [172.17.8.1]

rjdsd commented 5 years ago

Hi, we are facing the same issue while deploying a 3 master 2 worker Kubernetes cluster on Azure.

Kubernetes version: 1.15.3
Node type: Azure VM
OS: CoreOS 1967.6.0
Kubespray: release-2.11
etcd version: 3.3.10

Surprisingly, cluster setup worked fine a few days back with this code.

Any help would be greatly appreciated.

Following is the etcd status on a Node


$ etcdctl --endpoints https://127.0.0.1:2379 --ca-file=/etc/ssl/etcd/ssl/ca.pem --cert-file=/etc/ssl/etcd/ssl/member.pem --key-file=/etc/ssl/etcd/ssl/member-key.pem --debug cluster-health
Cluster-Endpoints: https://127.0.0.1:2379
cURL Command: curl -X GET https://127.0.0.1:2379/v2/members
cluster may be unhealthy: failed to list members
Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:2379: connect: connection refused

error #0: dial tcp 127.0.0.1:2379: connect: connection refused
rjdsd commented 5 years ago

etcd gets configured with the wrong IP addresses and keeps crashing:

ETCD_INITIAL_CLUSTER=master01=https://x.y.z.47:2380,master02=https://x.y.z.46:2380,master03=https://x.y.z.45:2380
...

Sep 19 02:37:43 kubemaster01 etcd[12077]: 2019-09-19 02:37:43.056677 C | etcdmain: listen tcp x.y.z.47:2380: bind: cannot assign requested address
Sep 19 02:37:43 kubemaster01 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Sep 19 02:37:43 kubemaster01 systemd[1]: etcd.service: Failed with result 'exit-code'.

while the IP addresses of the masters are different:

[all]
kubemaster01    ansible_host=x.y.z.44  node_name=kubemaster01 etcd_member_name=master01
kubemaster02    ansible_host=x.y.z.45  node_name=kubemaster02 etcd_member_name=master02
kubemaster03    ansible_host=x.y.z.43  node_name=kubemaster03 etcd_member_name=master03
kubenode01    ansible_host=x.y.z.47      node_name=kubenode01
kubenode02    ansible_host=x.y.z.46      node_name=kubenode02
bastion101 ansible_host=bastion101

[bastion]
bastion101

[master]
kubemaster01
kubemaster02
kubemaster03

[etcd]
kubemaster01
kubemaster02
kubemaster03

[node]
kubenode01
kubenode02

[k8s-cluster:children]
master
node

[kube-master:children]
master

[kube-node:children]
node

[calico-rr]

[vault]
kubemaster01
kubemaster02
kubemaster03
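Given the "cannot assign requested address" failure above, a quick sanity check is to compare the addresses etcd was told to use with the addresses actually present on the host; a sketch, assuming kubespray renders the etcd environment into /etc/etcd.env (adjust the path if your deployment differs):

# Addresses etcd is configured to bind to / advertise
grep -E 'ETCD_(LISTEN|INITIAL_ADVERTISE|INITIAL_CLUSTER)' /etc/etcd.env
# Addresses actually present on this host
ip -4 addr show | grep inet
# etcd can only bind to an address that appears in the second list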
rjdsd commented 5 years ago

We found the root cause: the issue was caused by stale Ansible cache files.

The IPs of the VMs changed when we recreated the VMs using Terraform scripts.

Ansible was failing to overwrite its cached JSON files with the new IP information (due to a permission issue), so the IP information was read from the old cache.
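If you hit this, forcing Ansible to re-gather facts is usually enough; what (if anything) to delete depends on the fact_caching settings in your ansible.cfg, and the cache path below is only an example jsonfile location:

# Re-gather facts instead of using the cache
ansible-playbook --flush-cache -i inventory/mycluster/hosts.ini cluster.yml
# Or remove the cached JSON fact files directly (path depends on fact_caching_connection)
rm -rf /tmp/ansible_fact_cache/*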

timhughes commented 5 years ago

moved this comment to https://github.com/kubernetes-sigs/kubespray/issues/5118#issuecomment-533837327 as i think it is actually that bug and not this one

ewtang commented 5 years ago

Please check if docker is installed on vagrant host. If so, please uninstall and reboot. Then try again.

sunxingyu commented 5 years ago

etcd cannot start.

After restarting etcd and watching the logs, I found that etcd was listening on port 2379, but the port could not be reached.

Digging deeper, kube-proxy was occupying port 2379, so my guess was that someone had started a svc whose NodePort took port 2379.

iptables-save > a
cat a

Searching for 2379 shows which NodePort is occupying it. From how kube-proxy works, if the NodePort's endpoints are not up, it writes a -j REJECT rule to reject the traffic.

So even if kube-proxy is not actually holding 2379, connections to the 2379 that etcd listens on will still be rejected.

Solution idea: get etcd started and get rid of that NodePort svc.

  1. Stop kubelet
  2. Delete all containers
  3. Rename the kube-proxy image
  4. Temporarily change the registry address to something unusable

Steps 3 and 4 are there so that kube-proxy cannot start and therefore cannot modify iptables.

  1. iptables -F: this flushes all of kube-proxy's iptables rules
  2. systemctl start kubelet, then get rid of that svc and bring kube-proxy back
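To check whether a NodePort service has grabbed etcd's port as described above, something along these lines works (the grep pattern simply looks for 2379 anywhere in the rules and the service list):

# Look for 2379 in the kube-proxy generated rules
iptables-save | grep 2379
# Find any Service that was allocated 2379 as a NodePort
kubectl get svc --all-namespaces -o wide | grep 2379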

GRomR1 commented 4 years ago

@markpenner34 I've just deployed a 3 node cluster on a centos7 with kernel 5.1.3-1, ansible 2.8.0, using latest kubespray repo (from a week ago) with selinux on, and firewalld on with these roles

execute on master nodes:

firewall-cmd --permanent --add-port=6443/tcp
firewall-cmd --permanent --add-port=2379/tcp
firewall-cmd --permanent --add-port=2380/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=10251/tcp
firewall-cmd --permanent --add-port=10252/tcp
firewall-cmd --permanent --add-port=10255/tcp
firewall-cmd --reload

execute on all nodes:

firewall-cmd --permanent --add-port=30000-32767/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=10255/tcp
firewall-cmd --permanent --add-port=6783/tcp
firewall-cmd --permanent --add-port=443/tcp
firewall-cmd --reload

If installing Calico open these ports on all nodes:

firewall-cmd --permanent --add-port=179/tcp
firewall-cmd --permanent --add-port=5473/tcp
firewall-cmd --permanent --add-port=4789/udp
firewall-cmd --reload

and it all went perfectly fine. what is the error you are getting? (please don't bomb with a really long log)

Thank you, bro =) Maybe Ansible should run this itself (add the firewall rules)? I've spent a lot of time finding the solution. I think kubespray should do everything I need to install K8s.
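If you want to script it rather than wait for kubespray to manage firewalld, a small loop over the ports quoted above does the job; this is just a sketch, so adjust the port list per node role (master vs. worker) and add the Calico ports if you use Calico:

# Open the control-plane ports listed above on a master node
for p in 6443 2379 2380 10250 10251 10252 10255; do
  firewall-cmd --permanent --add-port=${p}/tcp
done
firewall-cmd --reload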