Closed MaxCCC closed 4 years ago
It's fixed in this PR #2577
I am facing same issue TASK [etcd : Configure | Check if etcd cluster is healthy] *** Thursday 17 May 2018 13:53:52 +0000 (0:00:00.508) 0:05:09.972 ** FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left). FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left). FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left). FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left). FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left). FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left). FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left). FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left). FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left). FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left). fatal: [node2]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://10.9.0.36:2379,https://10.9.0.37:2379,https://10.9.0.38:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:06.027081", "end": "2018-05-17 13:54:35.677364", "msg": "non-zero return code", "rc": 1, "start": "2018-05-17 13:54:29.650283", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.38:2379 exceeded header timeout\n; error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout\n; error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout\n\nerror #0: client: endpoint https://10.9.0.38:2379 exceeded header timeout\nerror #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout\nerror #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "; error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "; error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "", "error #0: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []} FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left). FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left). fatal: [node3]: FAILED! 
=> {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://10.9.0.36:2379,https://10.9.0.37:2379,https://10.9.0.38:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:06.021069", "end": "2018-05-17 13:54:51.668894", "msg": "non-zero return code", "rc": 1, "start": "2018-05-17 13:54:45.647825", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.36:2379 exceeded header timeout\n; error #1: client: endpoint https://10.9.0.38:2379 exceeded header timeout\n; error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout\n\nerror #0: client: endpoint https://10.9.0.36:2379 exceeded header timeout\nerror #1: client: endpoint https://10.9.0.38:2379 exceeded header timeout\nerror #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "; error #1: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "; error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "", "error #0: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "error #1: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []} fatal: [node1]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://10.9.0.36:2379,https://10.9.0.37:2379,https://10.9.0.38:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:06.035036", "end": "2018-05-17 13:54:52.431413", "msg": "non-zero return code", "rc": 1, "start": "2018-05-17 13:54:46.396377", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.37:2379 exceeded header timeout\n; error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout\n; error #2: client: endpoint https://10.9.0.38:2379 exceeded header timeout\n\nerror #0: client: endpoint https://10.9.0.37:2379 exceeded header timeout\nerror #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout\nerror #2: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "; error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "; error #2: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "", "error #0: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "error #2: client: endpoint https://10.9.0.38:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}
NO MORE HOSTS LEFT *** to retry, use: --limit @/etc/ansible/roles/kubespray/cluster.retry
PLAY RECAP ***
localhost : ok=2 changed=0 unreachable=0 failed=0
node1 : ok=177 changed=11 unreachable=0 failed=1
node2 : ok=173 changed=11 unreachable=0 failed=1
node3 : ok=173 changed=11 unreachable=0 failed=1
node4 : ok=152 changed=9 unreachable=0 failed=0
node5 : ok=149 changed=9 unreachable=0 failed=0
node6 : ok=149 changed=9 unreachable=0 failed=0
node7 : ok=149 changed=9 unreachable=0 failed=0
node8 : ok=149 changed=9 unreachable=0 failed=0
node9 : ok=149 changed=9 unreachable=0 failed=0
Thursday 17 May 2018 13:54:52 +0000 (0:01:00.404) 0:06:10.376 ** =============================================================================== etcd : Configure | Check if etcd cluster is healthy -------------------------------------------------------------------------------------------------- 60.40s gather facts from all instances ---------------------------------------------------------------------------------------------------------------------- 18.74s kubernetes/preinstall : Install packages requirements ------------------------------------------------------------------------------------------------ 17.78s kubernetes/preinstall : install growpart ------------------------------------------------------------------------------------------------------------- 14.13s download : Download items ----------------------------------------------------------------------------------------------------------------------------- 8.41s download : Sync container ----------------------------------------------------------------------------------------------------------------------------- 7.29s etcd : Configure | Check if etcd cluster is healthy --------------------------------------------------------------------------------------------------- 6.78s download : Sync container ----------------------------------------------------------------------------------------------------------------------------- 5.79s download : Download items ----------------------------------------------------------------------------------------------------------------------------- 5.70s download : Sync container ----------------------------------------------------------------------------------------------------------------------------- 5.63s download : Download items ----------------------------------------------------------------------------------------------------------------------------- 5.61s download : Sync container ----------------------------------------------------------------------------------------------------------------------------- 5.60s download : Download items ----------------------------------------------------------------------------------------------------------------------------- 5.46s docker : Ensure old versions of Docker are not installed. | RedHat ------------------------------------------------------------------------------------ 5.24s etcd : Configure | Check if etcd-events cluster is healthy -------------------------------------------------------------------------------------------- 4.71s kubernetes/preinstall : Hosts | populate inventory into hosts file ------------------------------------------------------------------------------------ 4.21s download : container_download | Download containers if pull is required or told to always pull (all nodes) -------------------------------------------- 3.82s docker : ensure docker packages are installed --------------------------------------------------------------------------------------------------------- 3.64s kubernetes/preinstall : Update package management cache (YUM) - Redhat -------------------------------------------------------------------------------- 3.29s kubernetes/preinstall : Create kubernetes directories ------------------------------------------------------------------------------------------------- 2.82s [integrationteam@IntegrationTeam-Ansible-Vm1 kubespray]$
RHEL 7.2 Ansible node, ansible==2.4.2.0; RHEL 7.5 master and agent nodes.
Same issue with Ubuntu 16.04.4 Ansible 2.5.1
Same issue using Vagrant with default configuration on current master
Vagrant 2.1.1 ansible 2.5.3
Had exactly the same issue here, using CentOS 7, 3.10.0-693.21.1.el7.x86_64, ansible 2.5.4.
Well, your cluster is actually up and running; it's just the health check that is failing. Here is my workaround.
You can manually check your cluster by providing cert and key values
SSH to one of the nodes and:
etcdctl --endpoints https://127.0.0.1:2379 --ca-file=/etc/ssl/etcd/ssl/ca.pem --cert-file=/etc/ssl/etcd/ssl/member-node1.pem --key-file=/etc/ssl/etcd/ssl/member-node1-key.pem --debug cluster-health
You should get a successful check for all cluster members.
Just remove/comment out these tasks from the playbooks:
Configure | Check if etcd cluster is healthy
Configure | Check if etcd-events cluster is healthy
The same health check later fails in the playbook at Calico | wait for etcd.
You can also check that by running:
curl --cert /etc/ssl/etcd/ssl/member-node1.pem --key /etc/ssl/etcd/ssl/member-node1-key.pem https://127.0.0.1:2379/health
So just remove/comment out the playbook task:
Calico | wait for etcd
Hope this gets fixed soon; I wasted a lot of time figuring this out.
The checks:
Configure | Check if etcd cluster is healthy
Configure | Check if etcd-events cluster is healthy
should not fail if the cluster is healthy and the certificates needed for the check are present. Removing the checks is not a solution at all.
After investigating this, the only way to replicate the issue in my case was with incorrect no_proxy env settings and an http_proxy var in /etc/environment. I just removed http_proxy from /etc/environment and fixed the no_proxy environment.
example:
no_proxy: "localhost,127.0.0.1,.local.domain,10.3.0.1,10.3.0.2,10.3.0.4,10.3.0.5,10.3.0.6" # no_proxy for subnets is ignored
You must have all your host IPs in no_proxy when using a proxy.
This was my case, don't know if it is yours.
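For reference, this is roughly what a working /etc/environment could look like in that situation. The proxy URL is just a placeholder and the IPs are the example ones from above; the important part is that every cluster node IP appears in no_proxy:
```
# /etc/environment -- illustrative values only; replace the proxy URL and IPs with your own
http_proxy="http://proxy.example.com:3128"
https_proxy="http://proxy.example.com:3128"
no_proxy="localhost,127.0.0.1,.local.domain,10.3.0.1,10.3.0.2,10.3.0.4,10.3.0.5,10.3.0.6"
```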
There is also a strange empty when condition here that I removed in my tests:
https://github.com/kubernetes-incubator/kubespray/blob/master/roles/etcd/tasks/main.yml#L9
- include_tasks: "gen_certs_{{ cert_management }}.yml"
  when:
  tags:
    - etcd-secrets
Just to update: the above error is probably due to a firewalld issue. On a dev env, just stop and disable the firewalld service; on production, open all the relevant ports (2379, 2380, etc.).
Running on CentOS 7, Linux 4.17.3-1.el7.elrepo.x86_64 x86_64.
I am having the same issue. Disabled the firewall and it does not help. Running on CentOS 7.
Actually, with firewalld disabled it seems that it is starting to work. Does anyone know the full list of ports I need to open?
Run on master nodes:
firewall-cmd --permanent --add-port=6443/tcp
firewall-cmd --permanent --add-port=2379/tcp
firewall-cmd --permanent --add-port=2380/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=10251/tcp
firewall-cmd --permanent --add-port=10252/tcp
firewall-cmd --permanent --add-port=10255/tcp
firewall-cmd --reload
Run on all nodes:
firewall-cmd --permanent --add-port=30000-32767/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=10255/tcp
firewall-cmd --permanent --add-port=6783/tcp
BTW, SELinux is working fine; I did not have to make any adjustments or disable it.
Same issue here with CentOS 7:
fatal: [node3]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.5.70:2379,https://192.168.5.71:2379,https://192.168.5.72:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:02.239063", "end": "2018-07-16 15:27:23.188711", "msg": "non-zero return code", "rc": 1, "start": "2018-07-16 15:27:20.949648", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate has expired or is not yet valid\n; error #1: x509: certificate has expired or is not yet valid\n; error #2: client: endpoint https://192.168.5.72:2379 exceeded header timeout\n\nerror #0: x509: certificate has expired or is not yet valid\nerror #1: x509: certificate has
expired or is not yet valid\nerror #2: client: endpoint https://192.168.5.72:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate has expired or is not yet valid", "; error #1: x509: certificate has expired or is not yet valid", "; error #2: client: endpoint https://192.168.5.72:2379 exceeded header timeout", "", "error #0: x509: certificate has expired or is not yet valid", "error #1: x509: certificate has expired or is not yet valid", "error #2: client: endpoint https://192.168.5.72:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}
My configuration:
> [all]
> node1 ansible_host=192.168.5.70 ip=192.168.5.70
> node2 ansible_host=192.168.5.71 ip=192.168.5.71
> node3 ansible_host=192.168.5.72 ip=192.168.5.72
>
> [kube-master]
> node1
>
> [kube-node]
> node2
> node3
>
> [etcd]
> node1
> node2
> node3
>
> [k8s-cluster:children]
> kube-node
> kube-master
>
> [calico-rr]
>
> [vault]
> node1
> node2
> node3
I think I might have the same issue and can't figure out why. I get both the connection error as well as a complaint about the CA certs being self-signed.
There is a task:
- name: Gen_certs | update ca-certificates (Debian/Ubuntu/Container Linux by CoreOS)
  command: update-ca-certificates
  when: etcd_ca_cert.changed
Not sure it's working as expected. It is reported as successful, but there is no update-ca-certificates script on my installation (CoreOS7).
So I'm also stuck on waiting for etcd health status to check out ok.
Will try the workaround disabling the check task for now. Noticed the update-ca-certificate is part of the overlay filesystem of the etcd container. Should that task really be run on the node?
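For what it's worth, a quick way to see which CA update tool (if any) a node actually has is something like the check below; update-ca-certificates is the Debian/Ubuntu tool, while the RHEL/CentOS family uses update-ca-trust instead:
```
# Check which CA bundle update tool exists on this node
command -v update-ca-certificates && echo "Debian/Ubuntu style tool present"
command -v update-ca-trust && echo "RHEL/CentOS style tool present (run: update-ca-trust extract)"
```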
This issue was stopping deployment in 2.6.0, but with 2.7.0 my Ubuntu 18.04 cluster gets deployed. However, there is still an etcd health check failing (it is ignored). As per @ArieLevs, I can confirm that running the etcdctl check with the certs on the command line works. I think the root cause of this error is NOT a firewall issue (although that has the same symptoms); it is a self-signed cert error. If you run etcdctl in debug mode without the certs, it complains: error #0: remote error: tls: bad certificate
The offending check is in the file kubespray/roles/etcd/tasks/configure.yml, as follows:
- name: Configure | Check if etcd cluster is healthy
  shell: "{{ bin_dir }}/etcdctl --endpoints={{ etcd_access_addresses }} cluster-health | grep -q 'cluster is healthy'"
  register: etcd_cluster_is_healthy
  until: etcd_cluster_is_healthy.rc == 0
  retries: 4
  delay: "{{ retry_stagger | random + 3 }}"
  ignore_errors: false
  changed_when: false
  check_mode: no
  when: is_etcd_master and etcd_cluster_setup
  tags:
    - facts
  environment:
    ETCDCTL_CERT_FILE: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem"
    ETCDCTL_KEY_FILE: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem"
I believe the environment variables here are not respected by etcdctl. A better way to do this (and one that works) is in the Calico configuration, where the certs are passed in via the command line as follows:
- name: Calico | wait for etcd
  uri:
    url: "{{ etcd_access_addresses.split(',') | first }}/health"
    validate_certs: no
    client_cert: "{{ etcd_cert_dir }}/node-{{ inventory_hostname }}.pem"
    client_key: "{{ etcd_cert_dir }}/node-{{ inventory_hostname }}-key.pem"
  register: result
  until: result.status == 200 or result.status == 401
  retries: 10
  delay: 5
  run_once: true
Alternatively, if you want to feed the CLI arguments to the shell task:
  shell: "{{ bin_dir }}/etcdctl --endpoints={{ etcd_access_addresses }} --cert-file {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem --key-file {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem cluster-health | grep -q 'cluster is healthy'"
  register: etcd_cluster_is_healthy
  ignore_errors: true
  changed_when: false
  check_mode: no
  when: is_etcd_master and etcd_cluster_setup
  tags:
    - facts
  environment:
    ETCDCTL_CERT_FILE: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem"
    ETCDCTL_KEY_FILE: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem"
I'm having the same issue with default vars using vagrant.
I've tried to verify the etcd cluster health with the admin/member certificates and still get the request exceeded error. Is there any progress on this?
Same issue and behavior with Ubuntu 16.04, Ansible version 2.6.6.
$ etcdctl --debug cluster-health
Cluster-Endpoints: http://127.0.0.1:4001, http://127.0.0.1:2379
cURL Command: curl -X GET http://127.0.0.1:4001/v2/members
cURL Command: curl -X GET http://127.0.0.1:2379/v2/members
cluster may be unhealthy: failed to list members
Error: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:4001: getsockopt: connection refused
; error #1: client: endpoint http://127.0.0.1:2379 exceeded header timeout
error #0: dial tcp 127.0.0.1:4001: getsockopt: connection refused
error #1: client: endpoint http://127.0.0.1:2379 exceeded header timeout
I have disabled ufw but no luck. As mentioned above, when I try to edit my inventory.cfg or hosts.ini file to use only one etcd node, it does not work either.
My understanding from this weird behavior and from debugging it is the following:
- etcd is started with the --net host option, so the container runs on the host network.
- etcdctl only worked with the --no-sync option. Example: etcdctl --no-sync --endpoint http://ip:2379 set /hello world.
kill -9 "$(ps aux | grep etcd | grep -v grep | sed 's/^[^ ][^ ]*[ ][ ]*\([0-9][0-9]*\).*$/\1/g')"
etcd2 --name infra1 --initial-advertise-peer-urls http://10.0.0.101:2380 \
--listen-peer-urls http://IP:2380 \
--listen-client-urls http://IP:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://IP:2379 \
--discovery https://discovery.etcd.io/<token>
etcdctl --debug cluster-health
- I also tried to enable the firewall again and accept the traffic on the etcd ports:
`iptables -I INPUT -p tcp -m tcp --dport 2379 -j ACCEPT && iptables -I INPUT -p tcp -m tcp --dport 2380 -j ACCEPT`
Finally, I had to do this: delete the old etcd docker image and `gcr.io/google_containers/cluster-proportional-autoscaler-amd64` (to prevent k8s from pulling the old etcd image back), then run the docker image manually, including the SSL and certificate paths, and change the behavior of the etcd docker container so that it runs without the `--net host` option, gets an IP from the docker0 interface, and exposes the needed ports.
docker run -d -v /usr/share/ca-certificates/:/etc/ssl/certs -p 4001:4001 -p 2380:2380 -p 2379:2379 \
  --name etcd quay.io/coreos/etcd:v2.3.8 \
  -name etcd0 \
  -advertise-client-urls http://IP:2379,http://IP:4001 \
  -listen-client-urls http://0.0.0.0:2379,http://0.0.0.0:4001 \
  -initial-advertise-peer-urls http://IP:2380 \
  -listen-peer-urls http://0.0.0.0:2380 \
  -initial-cluster-token etcd-cluster-1 \
  -initial-cluster etcd0=http://IP1:2380,etcd1=http://IP2:2380,etcd2=http://IP3:2380 \
  -initial-cluster-state new
I am still looking for any workaround/solution for this.
I have seen the very same error :-( It seems that the IP to be used by the etcd is not always appropriate for it. My deployment was on OpenStack+CoreOS using 1 master 2 nodes (pretty plain and basic setup). I have found that while having those exposed to public IP (all 3 nodes have had a floating_ip_address associated) and at the same time having internal IPs (from the internal subnet/lan) then the etcd is configured to use external/floating IP. Unfortunately, such IP is not present at the hosts and does not even make any sense to have such IP behind the router for etcd cluster (of 1 node).
The above was failing every time. When I switched the setup to 1 bastion and 1 master and 2 nodes (neither master nor node having the floating IP associated), then after a little fiddling with the inventory/sample/no-floating.yml and moving it into correct inventory/$CLUSTER/ directory and running both terraform and ansible from the root of the kubespray git repo ... magic happened and the cluster was up and running without any further issue.
To conclude, it would be nice to have an automated test for OpenStack deployment with a working setup. Even the how-to guide should be slightly updated to reflect the actual steps to be done.
To-Do: fix the deployment to work with DNS, w/o bastion and name-based certs (FreeIPA cert-monger would be nice?)
Eventually, I can create a pull request?
@PexMor hitting the same issue, OpenStack + Ubuntu, please take a look at #2606, those changes were approved but not merged
I also have the etcd health task failing. But what is weird: if I run the task manually (after the playbook is done), it works perfectly (setting the envs and calling cluster-health). As @ArieLevs said, etcd seems healthy.
In my case I ran this successfully only after checking out the release-2.8 branch instead of using master. I used the defaults and modified only hosts.ini.
Note that the configuration was exactly the same when I tried release-2.8 and master.
Errors that disappeared:
fatal: [k8s-1]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://172.17.8.101:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:00.009745", "end": "2019-02-02 16:15:22.366223", "msg": "non-zero return code", "rc": 1, "start": "2019-02-02 16:15:22.356478", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.17.8.101:2379: getsockopt: connection refused\n\nerror #0: dial tcp 172.17.8.101:2379: getsockopt: connection refused", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.17.8.101:2379: getsockopt: connection refused", "", "error #0: dial tcp 172.17.8.101:2379: getsockopt: connection refused"], "stdout": "", "stdout_lines": []}
...ignoring
The error above was ignored and the build continued.
AND at the end:
fatal: [k8s-1]: FAILED! => {"msg": "The conditional check 'kube_token_auth' failed. The error was: error while evaluating conditional (kube_token_auth): 'kube_token_auth' is undefined\n\nThe error appears to have been in '/Users/music/Documents/git/kubespray/roles/kubernetes/tokens/tasks/check-tokens.yml': line 2, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n---\n- name: \"Check_tokens | check if the tokens have already been generated on first master\"\n ^ here\n"}
Ansible 2.6.0 Vagrant 2.0.4 VirtualBox 5.2.26 Ubuntu 18.04
Ran Kubespray from Mac OS El Capitan
I have seen the very same error :-( It seems that the IP to be used by the etcd is not always appropriate for it. My deployment was on OpenStack+CoreOS using 1 master 2 nodes (pretty plain and basic setup). I have found that while having those exposed to public IP (all 3 nodes have had a floating_ip_address associated) and at the same time having internal IPs (from the internal subnet/lan) then the etcd is configured to use external/floating IP. Unfortunately, such IP is not present at the hosts and does not even make any sense to have such IP behind the router for etcd cluster (of 1 node).
This happened for me, but opening the security group to allow traffic to port 2379 from "everywhere" (laziness) on the master made it possible for the master to connect to itself via the floating IP, and the playbook could complete.
Seems to me that the solution is to either not use the floating IP or make sure that the security group allows access to it.
I ran into the same problem when trying to run kubespray against 3 bare metal Centos 7.6 servers.
It turns out that I had not set up the bare metal servers properly: the system time was not correct across the three machines. So kubespray generated certificates whose start time was later than the system time on 2 out of 3 of my machines.
I solved this by installing chronyd and starting it on each machine, to set the correct time on each machine. I could have also installed ntpd.
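For anyone hitting the same thing, this is roughly what that fix looks like on CentOS 7 (standard package and service names; adjust for your distro):
```
# Install and start chrony, then confirm the clock is actually synchronized
yum install -y chrony
systemctl enable --now chronyd
chronyc tracking   # "Leap status : Normal" and a small offset means time is OK
date               # compare this across all nodes before re-running kubespray
```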
Same issue here: etcd runs only on the master/controller node, and on the other nodes it is not running. No issue with the firewall, it is not even running. RHEL 7.5 in AWS, no firewalld/iptables.
fatal: [machine-01]: FAILED! => { "attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://10.14.5.141:2379,https://10.14.6.49:2379,https://10.14.7.118:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:02.018172", "end": "2019-02-15 17:22:52.241082", "invocation": { "module_args": { "_raw_params": "/usr/local/bin/etcdctl --endpoints=https://10.14.5.141:2379,https://10.14.6.49:2379,https://10.14.7.118:2379 cluster-health | grep -q 'cluster is healthy'", "_uses_shell": true, "argv": null, "chdir": null, "creates": null, "executable": null, "removes": null, "stdin": null, "warn": true } }, "msg": "non-zero return code", "rc": 1, "start": "2019-02-15 17:22:50.222910", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.14.5.141:2379 exceeded header timeout\n; error #1: dial tcp 10.14.7.118:2379: getsockopt: connection refused\n; error #2: dial tcp 10.14.6.49:2379: getsockopt: connection refused\n\nerror #0: client: endpoint https://10.14.5.141:2379 exceeded header timeout\nerror #1: dial tcp 10.14.7.118:2379: getsockopt: connection refused\nerror #2: dial tcp 10.14.6.49:2379: getsockopt: connection refused", "stderr_lines": [ "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.14.5.141:2379 exceeded header timeout", "; error #1: dial tcp 10.14.7.118:2379: getsockopt: connection refused", "; error #2: dial tcp 10.14.6.49:2379: getsockopt: connection refused", "", "error #0: client: endpoint https://10.14.5.141:2379 exceeded header timeout", "error #1: dial tcp 10.14.7.118:2379: getsockopt: connection refused", "error #2: dial tcp 10.14.6.49:2379: getsockopt: connection refused" ], "stdout": "", "stdout_lines": [] }
Hey guys,
A possible workaround for this issue is to flush the iptables (# iptables -F); this worked for me.
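Before flushing everything blindly, it may be worth checking what is actually blocking the etcd ports; these are plain iptables commands, nothing kubespray-specific:
```
# Look for REJECT/DROP rules or anything referencing the etcd ports
iptables -L INPUT -n --line-numbers | grep -E 'REJECT|DROP'
iptables-save | grep -E '2379|2380'
# The workaround mentioned above (not persistent across reboots):
iptables -F
```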
setup: CentOS Linux release 7.6.1810 (Core) kubespray commit: a8dd69cf (git rev-parse --short HEAD) cni: canal
I am having the same issue with the latest release (v2.9.0) on Ubuntu 16.04 with the firewall disabled on my machine. Did anyone resolve this issue?
Have you tried flushing your iptables?
Hi vterry, yes, I have flushed the iptables and I am still seeing these errors in the following 2 places:
TASK [etcd : Configure | Check if member is in etcd cluster] ***** Wednesday 10 April 2019 14:32:04 -0400 (0:00:00.118) 0:02:17.215 * fatal: [node1]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl --n o-sync --endpoints=https://192.168.19.247:2379,https://192.168.19.248:2379,https ://192.168.19.249:2379 member list | grep -q 192.168.19.247", "delta": "0:00:00. 029426", "end": "2019-04-10 14:32:04.720756", "msg": "non-zero return code", "rc ": 1, "start": "2019-04-10 14:32:04.691330", "stderr": "", "stderr_lines": [], " stdout": "", "stdout_lines": []}
After the above error it continues the playbook, but it fails at this place:
TASK [etcd : Join Member | Add member to etcd cluster] *** Wednesday 10 April 2019 14:32:07 -0400 (0:00:00.200) 0:02:20.150 * FAILED - RETRYING: Join Member | Add member to etcd cluster (4 retries left). FAILED - RETRYING: Join Member | Add member to etcd cluster (3 retries left). FAILED - RETRYING: Join Member | Add member to etcd cluster (2 retries left). FAILED - RETRYING: Join Member | Add member to etcd cluster (1 retries left). fatal: [node1]: FAILED! => {"attempts": 4, "changed": true, "cmd": "/usr/local/b in/etcdctl --endpoints=https://192.168.19.247:2379,https://192.168.19.248:2379,h ttps://192.168.19.249:2379 member add etcd1 https://192.168.19.247:2380", "delta ": "0:00:02.045849", "end": "2019-04-10 14:32:38.279139", "msg": "non-zero retur n code", "rc": 1, "start": "2019-04-10 14:32:36.233290", "stderr": "client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 192.168.19.250:4001 : getsockopt: connection refused\n; error #1: client: etcd member https://192.16 8.19.248:2379 has no leader\n; error #2: dial tcp 192.168.19.250:2379: getsockop t: connection refused\n; error #3: client: etcd member https://192.168.19.249:23 79 has no leader", "stderr_lines": ["client: etcd cluster is unavailable or misc onfigured; error #0: dial tcp 192.168.19.250:4001: getsockopt: connection refuse d", "; error #1: client: etcd member https://192.168.19.248:2379 has no leader", "; error #2: dial tcp 192.168.19.250:2379: getsockopt: connection refused", "; error #3: client: etcd member https://192.168.19.249:2379 has no leader"], "stdo ut": "", "stdout_lines": []} And it just stopped the playbook after this error I am not sure how to debug this issue further :(
Can you share your hosts.ini and your all.yml?
Hi vterry, I am pasting my hosts.ini, inventory.ini and all.yml, as it does not allow me to attach the files. If you can share your email I can also attach those files.
**I am using inventory.ini because if I use hosts.ini I get the error: Failed to Parse -2.9.0/kubespray-2.9.0/inventory/mycluster/hosts.ini:4: Expected key=value host variable assignment, got: 192.168.19.247
[WARNING]: Unable to parse /nfsdata/home/qraza/kubespraydeploymentapril9/kubespray-2.9.0/kubespray-2.9.0/inventory/mycluster/hosts.ini as an inventory source**
File hosts.ini
all:
  hosts:
    node1:
      access_ip: 192.168.19.247
      ip: 192.168.19.247
      ansible_host: 192.168.19.247
    node2:
      access_ip: 192.168.19.248
      ip: 192.168.19.248
      ansible_host: 192.168.19.248
    node3:
      access_ip: 192.168.19.249
      ip: 192.168.19.249
      ansible_host: 192.168.19.249
  children:
    kube-master:
      hosts:
        node1:
        node2:
    kube-node:
      hosts:
        node3:
        node1:
        node2:
    etcd:
      hosts:
        node3:
        node1:
        node2:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}
File inventory.ini
[all]
node1 ansible_host=192.168.19.247 ip=192.168.19.247 etcd_member_name=etcd1
node2 ansible_host=192.168.19.248 ip=192.168.19.248 etcd_member_name=etcd2
node3 ansible_host=192.168.19.249 ip=192.168.19.249 etcd_member_name=etcd3

[kube-master]
node1
node2

[etcd]
node1
node2
node3

[kube-node]
node2
node3

[k8s-cluster:children]
kube-master
kube-node
File all.yml
etcd_data_dir: /var/lib/etcd
bin_dir: /usr/local/bin
nginx_kube_apiserver_port: 6443
nginx_kube_apiserver_healthcheck_port: 8081
kube_read_only_port: 10255
ansible_user: tmp1
ansible_password: password
ansible_become_pass: password
Has anyone found a fix for this issue?
I tried deploying Kubernetes in combination with WireGuard. It just didn't work. After some deeper digging, I found out that ip (the public IP) instead of access_ip (the private, WireGuard IP) is used as the listening address for etcd.
This commit in my fork fixed it for me: https://github.com/bagbag/kubespray/commit/209eb8a5118bd61a178cd08b7d802100dfd4e32e
@markpenner34 I've just deployed a 3 node cluster on CentOS 7 with kernel 5.1.3-1, ansible 2.8.0, using the latest kubespray repo (from a week ago), with SELinux on and firewalld on with these rules:
execute on master nodes:
firewall-cmd --permanent --add-port=6443/tcp
firewall-cmd --permanent --add-port=2379/tcp
firewall-cmd --permanent --add-port=2380/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=10251/tcp
firewall-cmd --permanent --add-port=10252/tcp
firewall-cmd --permanent --add-port=10255/tcp
firewall-cmd --reload
execute on all nodes:
firewall-cmd --permanent --add-port=30000-32767/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=10255/tcp
firewall-cmd --permanent --add-port=6783/tcp
firewall-cmd --permanent --add-port=443/tcp
firewall-cmd --reload
If installing Calico open these ports on all nodes:
firewall-cmd --permanent --add-port=179/tcp
firewall-cmd --permanent --add-port=5473/tcp
firewall-cmd --permanent --add-port=4789/udp
firewall-cmd --reload
And it all went perfectly fine. What is the error you are getting? (Please don't bomb the thread with a really long log.)
Hi @ArieLevs, I don't have firewalld installed on the servers. Running Ubuntu 16.04, the latest version of kubespray and Ansible 2.7.10.
Failing in the etcd/configure role, specifically the health checks.
error #0: dial tcp 127.0.0.1:4001: getsockopt: connection refused
error #1: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
Any help would be appreciated.
@markpenner34 I've noticed that the etcd issue regarding port 4001 appears to occur on Ubuntu (port 4001 is legacy and, per the etcd documentation, should not be used).
What happens if you ssh to node1 and execute
etcdctl --endpoints https://127.0.0.1:2379 --ca-file=/etc/ssl/etcd/ssl/ca.pem --cert-file=/etc/ssl/etcd/ssl/member-node1.pem --key-file=/etc/ssl/etcd/ssl/member-node1-key.pem --debug cluster-health
Try ports 4001 and 2379 (the certificate file paths may be different on Ubuntu, as this command was executed on Centos, change to relevant paths if needed)
BTW, the response \x15\x03\x01\x00\x02\x02 means a non-HTTPS request was sent to a TLS endpoint.
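A rough way to confirm that on a node (the cert paths are the CentOS ones used earlier in this thread; adjust them if yours differ):
```
# Plain HTTP against the TLS listener fails / returns the TLS alert bytes
curl -v http://127.0.0.1:2379/health
# With the client cert/key it should return {"health": "true"}
curl --cacert /etc/ssl/etcd/ssl/ca.pem \
     --cert /etc/ssl/etcd/ssl/member-node1.pem \
     --key /etc/ssl/etcd/ssl/member-node1-key.pem \
     https://127.0.0.1:2379/health
```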
@ArieLevs node1 is not an etcd node; this is my cluster.yml, is this correct?
When I run sudo lsof -i:2379 on the etcd nodes I can see that no ports are listening.
However, when I run that in the docker container running etcd, I can see the ports are listening correctly.
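For anyone else debugging this, comparing what listens on the host versus what the container is doing can be done roughly like this (the name filter for the etcd container is just an illustration, check docker ps for the real name):
```
# On the host: is anything bound to the etcd client port?
sudo ss -tlnp | grep 2379
# Is the etcd container running, and in which network mode?
docker ps --filter name=etcd
docker inspect -f '{{ .HostConfig.NetworkMode }}' "$(docker ps -q --filter name=etcd | head -n1)"
```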
@markpenner34 The config files look different for me; I use the official ones from https://github.com/kubernetes-sigs/kubespray#usage
So my inventory.ini only contains (everything else is commented out):
[k8s-cluster:children]
kube-master
kube-node
And the node information is declared in the hosts.yml file.
I'm sorry I cannot assist much more, as I've never deployed k8s (using kubespray) on Ubuntu.
same issue on:
Hi, we are trying to create an etcd cluster but we are facing the following error; please find the error below:
Error: client: etcd cluster is unavailable or misconfigured error #0: dial tcp 192.168.2.139:2379: getsockopt: connection refused
please help us to get out of this
thanks in advance
We are following the below link to implement Kubernetes on bare metal.
My test is as follows:
Ubuntu host using Vagrant with kubespray master branch.
Captioned issue resulted: fatal: [k8s-1]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://172.17.8.101:2379,https://172.17.8.102:2379,https://172.17.8.103:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:06.027821", "end": "2019-07-23 00:28:29.583827", "msg": "non-zero return code", "rc": 1, "start": "2019-07-23 00:28:23.556006", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://172.17.8.103:2379 exceeded header timeout\n; error #1: client: endpoint https://172.17.8.101:2379 exceeded header timeout\n; error #2: client: endpoint https://172.17.8.102:2379 exceeded header timeout\n\nerror #0: client: endpoint https://172.17.8.103:2379 exceeded header timeout\nerror #1: client: endpoint https://172.17.8.101:2379 exceeded header timeout\nerror #2: client: endpoint https://172.17.8.102:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://172.17.8.103:2379 exceeded header timeout", "; error #1: client: endpoint https://172.17.8.101:2379 exceeded header timeout", "; error #2: client: endpoint https://172.17.8.102:2379 exceeded header timeout", "", "error #0: client: endpoint https://172.17.8.103:2379 exceeded header timeout", "error #1: client: endpoint https://172.17.8.101:2379 exceeded header timeout", "error #2: client: endpoint https://172.17.8.102:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}
The VMs have 2 network interfaces: eth0 for the public network and eth1 for the private network. The issue is fixed if access_ip is assigned to the public network IP and access_ip is used instead of ip as etcd_address.
ok: [k8s-1] => {"attempts": 2, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.121.208:2379,https://192.168.121.108:2379,https://192.168.121.186:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:04.715343", "end": "2019-07-23 01:28:09.521148", "rc": 0, "start": "2019-07-23 01:28:04.805805", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []} ok: [k8s-2] => {"attempts": 2, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.121.208:2379,https://192.168.121.108:2379,https://192.168.121.186:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:01.726660", "end": "2019-07-23 01:28:09.588888", "rc": 0, "start": "2019-07-23 01:28:07.862228", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []} ok: [k8s-3] => {"attempts": 2, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.121.208:2379,https://192.168.121.108:2379,https://192.168.121.186:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:04.637851", "end": "2019-07-23 01:28:09.587249", "rc": 0, "start": "2019-07-23 01:28:04.949398", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
Is this a proper fix?
I'm also experiencing this problem while using kubespray to try to deploy a 2-node Kubernetes cluster on OpenStack instances running Ubuntu 18.04.
How to reproduce:
- create 2 OpenStack instances running Ubuntu 18.04
- follow the instructions in Kubespray's Quick Start section, setting ip with the node's private IP and access_ip with the node's floating IP, and also the node's ansible_user
- run ansible-playbook as stated in the Quick Start guide
Here are the contents of ./inventory/mycluster/hosts.yml:
all:
  hosts:
    node1:
      ansible_user: myuser
      ansible_host: 185.178.87.56
      ip: 192.168.0.8
      access_ip: 185.178.87.56
    node2:
      ansible_user: myuser
      ansible_host: 185.178.87.47
      ip: 192.168.0.9
      access_ip: 185.178.87.47
  children:
    kube-master:
      hosts:
        node1:
    kube-node:
      hosts:
        node1:
        node2:
    etcd:
      hosts:
        node1:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}
Result:
TASK [etcd : Configure | Check if etcd cluster is healthy] **************************************************************************************************************************************************************
Thursday 01 August 2019 15:47:02 +0100 (0:00:00.023) 0:02:44.977 *******
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).
fatal: [node1]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://185.178.87.56:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:02.018130", "end": "2019-08-01 14:47:30.642472", "msg": "non-zero return code", "rc": 1, "start": "2019-08-01 14:47:28.624342", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://185.178.87.56:2379 exceeded header timeout\n\nerror #0: client: endpoint https://185.178.87.56:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://185.178.87.56:2379 exceeded header timeout", "", "error #0: client: endpoint https://185.178.87.56:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}
NO MORE HOSTS LEFT *****************************************************************************************************************************
to retry, use: --limit @/home/rmam/development/CORDS/other/creodias_kubespray/kubespray/cluster.retry
PLAY RECAP *****************************************************************************************************************************
localhost : ok=1 changed=0 unreachable=0 failed=0
node1 : ok=462 changed=12 unreachable=0 failed=1
node2 : ok=312 changed=9 unreachable=0 failed=0
For my test with the Vagrant provider=libvirt, the problem turned out to be that the IP address 172.17.8.1 of the private (virtual) network is occasionally used as the source IP in the TLS handshake, instead of the host IPs 172.17.8.10x of the etcd cluster nodes.
<network ipv6='yes'>
<name>kubespray0</name>
<uuid>a502bbbb-7118-4e4a-8443-7ae1195dc93d</uuid>
<forward mode='nat'/>
<bridge name='virbr2' stp='on' delay='0'/>
<mac address='52:54:00:43:3f:ac'/>
<ip address='172.17.8.1' netmask='255.255.255.0'>
<dhcp>
<range start='172.17.8.1' end='172.17.8.254'/>
</dhcp>
</ip>
</network>
The workaround in such case is to add the relevant ip to the following setting:
etcd_cert_alt_ips: [172.17.8.1]
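To verify whether a given IP actually made it into the generated etcd certificate (and therefore whether etcd_cert_alt_ips is needed), you can inspect the SANs; the cert path below is the one used earlier in this thread, adjust as needed:
```
# List the Subject Alternative Names baked into the etcd member certificate
openssl x509 -in /etc/ssl/etcd/ssl/member-node1.pem -noout -text \
  | grep -A1 'Subject Alternative Name'
```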
Hi, we are facing the same issue while deploying a 3 master 2 worker Kubernetes cluster on Azure.
Kubernetes Version: 1.15.3 Node Type : Azure VM OS : CoreOs 1967.6.0 Kubespray : release-2.11 ETCD Version: 3.3.10
Surprisingly, the cluster setup worked fine a few days back with this code.
Any help would be greatly appreciated.
Following is the etcd status on a Node
$ etcdctl --endpoints https://127.0.0.1:2379 --ca-file=/etc/ssl/etcd/ssl/ca.pem --cert-file=/etc/ssl/etcd/ssl/member.pem --key-file=/etc/ssl/etcd/ssl/member-key.pem --debug cluster-health
Cluster-Endpoints: https://127.0.0.1:2379
cURL Command: curl -X GET https://127.0.0.1:2379/v2/members
cluster may be unhealthy: failed to list members
Error: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:2379: connect: connection refused
error #0: dial tcp 127.0.0.1:2379: connect: connection refused
etcd gets configured with the wrong IP addresses and keeps crashing:
ETCD_INITIAL_CLUSTER=master01=https://x.y.z.47:2380,master02=https://x.y.z.46:2380,master03=https://x.y.z.45:2380
...
Sep 19 02:37:43 kubemaster01 etcd[12077]: 2019-09-19 02:37:43.056677 C | etcdmain: listen tcp x.y.z.47:2380: bind: cannot assign requested address
Sep 19 02:37:43 kubemaster01 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Sep 19 02:37:43 kubemaster01 systemd[1]: etcd.service: Failed with result 'exit-code'.
while the IP addresses of the masters are different:
[all]
kubemaster01 ansible_host=x.y.z.44 node_name=kubemaster01 etcd_member_name=master01
kubemaster02 ansible_host=x.y.z.45 node_name=kubemaster02 etcd_member_name=master02
kubemaster03 ansible_host=x.y.z.43 node_name=kubemaster03 etcd_member_name=master03
kubenode01 ansible_host=x.y.z.47 node_name=kubenode01
kubenode02 ansible_host=x.y.z.46 node_name=kubenode02
bastion101 ansible_host=bastion101
[bastion]
bastion101
[master]
kubemaster01
kubemaster02
kubemaster03
[etcd]
kubemaster01
kubemaster02
kubemaster03
[node]
kubenode01
kubenode02
[k8s-cluster:children]
master
node
[kube-master:children]
master
[kube-node:children]
node
[calico-rr]
[vault]
kubemaster01
kubemaster02
kubemaster03
We found the root cause: the issue was caused by stale Ansible cache files.
The IP of a VM changed when we recreated the VM using Terraform scripts.
Ansible was failing to overwrite its cached JSON files with the new IP information (due to a permission issue), so the IP information was read from the old cache.
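If you suspect the same thing, it may help to check whether fact caching is enabled and wipe the cache before re-running the playbook; the cache directory below is only an example, use whatever fact_caching_connection points to in your ansible.cfg:
```
# See whether jsonfile fact caching is configured and where it stores the files
grep -E 'fact_caching' ansible.cfg
# Remove the stale cached facts (example path, substitute your own fact_caching_connection)
rm -rf /tmp/ansible_fact_cache/*
```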
moved this comment to https://github.com/kubernetes-sigs/kubespray/issues/5118#issuecomment-533837327 as i think it is actually that bug and not this one
Please check if Docker is installed on the Vagrant host. If so, please uninstall it and reboot, then try again.
etcd fails to start.
After restarting etcd and watching the logs, I found that etcd was listening on port 2379 but it could not be reached.
Digging deeper, I found that kube-proxy had taken over port 2379, so I suspect someone had started a Service whose NodePort was 2379.
iptables-save > a; cat a
Search for 2379 and you can see which NodePort is occupying it. From how kube-proxy works, if a NodePort's endpoints are not up, it writes a -j REJECT rule to refuse the traffic.
So even if kube-proxy itself is not holding 2379, the port 2379 that etcd listens on will still be rejected.
Solution, the idea being: get etcd started and kill that NodePort Service.
Steps 3 and 4 are meant to keep kube-proxy from starting, so it cannot rewrite iptables.
iptables -F: this step flushes all of kube-proxy's iptables rules.
systemctl start kubelet, then kill that Service and bring kube-proxy back.
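A rough way to confirm this situation (a Service NodePort sitting on 2379) before going through the steps above:
```
# Look for kube-proxy / NodePort rules that reference the etcd client port
iptables-save | grep -w 2379
# If you have kubectl access, find the Service whose NodePort is 2379
kubectl get svc --all-namespaces | grep 2379
```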
@markpenner34 I've just deployed a 3 node cluster on a centos7 with kernel 5.1.3-1, ansible 2.8.0, using latest kubespray repo (from a week ago) with selinux on, and firewalld on with these roles
execute on master nodes:
firewall-cmd --permanent --add-port=6443/tcp
firewall-cmd --permanent --add-port=2379/tcp
firewall-cmd --permanent --add-port=2380/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=10251/tcp
firewall-cmd --permanent --add-port=10252/tcp
firewall-cmd --permanent --add-port=10255/tcp
firewall-cmd --reload
execute on all nodes:
firewall-cmd --permanent --add-port=30000-32767/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=10255/tcp
firewall-cmd --permanent --add-port=6783/tcp
firewall-cmd --permanent --add-port=443/tcp
firewall-cmd --reload
If installing Calico open these ports on all nodes:
firewall-cmd --permanent --add-port=179/tcp
firewall-cmd --permanent --add-port=5473/tcp
firewall-cmd --permanent --add-port=4789/udp
firewall-cmd --reload
and it all went perfectly fine. what is the error you are getting? (please don't bomb with a really long log)
Thank you, bro =) Maybe Ansible should run this (add the firewall rules)? I've spent a lot of time finding the solution. I thought kubespray would do everything that I need to install K8s.
Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG
Environment:
Cloud provider or hardware configuration: VMware Fusion
OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"): CENTOS_MANTISBT_PROJECT="CentOS-7" CENTOS_MANTISBT_PROJECT_VERSION="7" REDHAT_SUPPORT_PRODUCT="centos" REDHAT_SUPPORT_PRODUCT_VERSION="7"
[all]
node1 ansible_host=192.168.140.191 ip=192.168.140.191
node2 ansible_host=192.168.140.192 ip=192.168.140.192
node3 ansible_host=192.168.140.193 ip=192.168.140.193

[kube-master]
node1
node2

[kube-node]
node1
node2
node3

[etcd]
node1
node2
node3

[k8s-cluster:children]
kube-node
kube-master

[calico-rr]

[vault]
node1
node2
node3