Closed: hbokh closed this issue 8 years ago.
I just encountered the same issue when removing the etcd/master nodes from the kube-node group. When I keep all nodes in the group, everything works. Well, except that we don't want masters/etcds in the kube-node group.
Thanks for the report, @jjungnickel @hbokh.
Just tested a new deployment on Debian 8 without facing the issue. Currently running another one on Ubuntu Xenial.
It works on Ubuntu Xenial too; let's try CoreOS.
I'm curious to see what the role distribution is for @hbokh's environment. My suspicions for how this could happen:
If this issue can still be reproduced in master, let's improve the check here so it verifies that the pod started okay (and, if not, dumps some useful output).
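Something along these lines could work as a more verbose check (a rough sketch only; the dns_server variable, the k8s-app=dnsmasq label, and the exact task layout are assumptions, not the role's actual code):
```yaml
# Sketch of a more talkative port check for roles/dnsmasq/tasks/main.yml.
# Variable names and the label selector below are assumptions.
- name: Check for dnsmasq port (pulling image and running container)
  wait_for:
    host: "{{ dns_server | default('10.233.0.2') }}"
    port: 53
    timeout: 180
  register: dnsmasq_port_check
  ignore_errors: true

- name: Collect dnsmasq pod state when the port check failed
  command: "{{ bin_dir }}/kubectl --namespace=kube-system describe pods -l k8s-app=dnsmasq"
  register: dnsmasq_describe
  when: dnsmasq_port_check.failed | default(false)

- name: Fail with the collected output
  fail:
    msg: "dnsmasq did not come up: {{ dnsmasq_describe.stdout | default('') }}"
  when: dnsmasq_port_check.failed | default(false)
```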
Thank you, @mattymo. I have just set up a new CoreOS env on 3 VMware VMs, based on the latest Stable 1068.8.0 (using this repo: https://github.com/gclayburg/coreos-vmware-deploy).
Kubespray's Kargo deployment used to work fine until about one month ago.
The issue remains (same error at the task Check for dnsmasq port ...) with a fresh pull from this git repo today.
What specific information would be helpful in this case?
To get started, here's my inventory.cfg:
[kube-master]
node1
[etcd]
node1
node2
node3
[kube-node]
node2
node3
[k8s-cluster:children]
kube-node
kube-master
There are pods too (checked with kubectl on OS X, after installing the new certificates from node1):
$ kubectl get pods --namespace=kube-system
NAME READY STATUS RESTARTS AGE
dnsmasq-i7g3u 1/1 Running 0 11m
dnsmasq-jk63c 1/1 Running 0 11m
flannel-node2 2/2 Running 2 11m
flannel-node3 2/2 Running 2 11m
kube-proxy-node2 1/1 Running 1 11m
kube-proxy-node3 1/1 Running 2 11m
Weird thing is, the Docker image andyshinn/dnsmasq is there on both node2 and node3.
Is it supposed to be downloaded on node1 too, I wonder?
Anyway, I'll leave this (incomplete) setup running to provide any needed information.
This is an architectural issue.
https://github.com/kubespray/kargo/blob/master/roles/dnsmasq/tasks/main.yml#L58
kube-proxy only runs on nodes with the kube-node role. If you change the line above to the following:
when: inventory_hostname == groups['kube-node'][0]
it should work.
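For reference, here is a minimal sketch of that task with the changed condition (only the when: line is the actual suggestion; the rest of the task body is abbreviated and assumed):
```yaml
- name: Check for dnsmasq port (pulling image and running container)
  wait_for:
    host: "{{ dns_server | default('10.233.0.2') }}"  # assumed variable name
    port: 53
    timeout: 180
  # run the check from a host that actually runs kube-proxy
  when: inventory_hostname == groups['kube-node'][0]
```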
Do we want to run kube-proxy on kube-master[0]? I don't see the benefit, but we do configure DNS to point to 10.233.0.2 on all kube-masters, so we're effectively breaking their DNS configuration if they aren't also kube-node roles.
Thanks heaps! I have applied the above change and re-run the playbook: no more error / failure indeed. I was missing the architectural point of view here... @jjungnickel seems to have the same consideration as I have / had.
Would you be so kind as to suggest a better setup / inventory.cfg in this case with 3 hosts? If adding a 4th host / VM to my setup is needed, I consider that an option too.
I prefer the following 3-node setup:
node1
node2
[etcd]
node1
node2
node3
[kube-node]
node1
node2
node3
[k8s-cluster:children]
kube-node
kube-master
It won't run into this bug.
While I don't really mind running masters and etcd on the same host and also as a node, this directly contradicts the deployment options outlined in the examples shown in the README.
@mattymo Just to be 100% sure, you mean adding a 2nd node to [kube-master]? In the above case [kube-master] seems to have fallen off the first line, so:
[kube-master]
node1
node2
[etcd]
node1
...
Yes, sorry. Copy and paste error on my side.
@jjungnickel Thanks for pointing that out, however I'm missing your point here, since this is in "Basic usage" regarding 3 VMs:
3 vms, all 3 have etcd installed, all 3 are nodes (running pods), 2 of them run master components
and that looks exactly like what @mattymo is suggesting with his inventory.cfg.
@hbokh Oh, you're right - I should have made myself clearer. I was referring to the other deployment options outlined in the document, specifically the one isolating etcd/master/nodes. Those are being contradicted; the first one is not.
Hi, I have the same issue even with the suggested inventory configuration (3 nodes, 2 of them masters). I tried with freshly installed Ubuntu 16.04 and CentOS 7; same error.
I have the same issue on CentOS 7. Kargo's version is v2.1.0; the command and inventory are:
ansible-playbook -i inventory/inventory.ini cluster.yml -b -v --private-key=~/.ssh/id_rsa
[kube-master]
node1
node2
[etcd]
node1
node2
node3
[kube-node]
node1
node2
node3
[k8s-cluster:children]
kube-node
kube-master
TASK [dnsmasq : Check for dnsmasq port (pulling image and running container)] **
Thursday 23 March 2017 18:23:10 +0800 (0:00:01.552) 0:06:00.043 ********
fatal: [node1]: FAILED! => {"changed": false, "elapsed": 301, "failed": true, "msg": "Timeout when waiting for 10.233.0.2:53"}
The IP address 10.233.0.2 is a k8s service cluster IP; is it supposed to be exposed to the host?
@antonyfr @hellwen I also ran into this situation. I was trying to set up kubernetes on nodes that had previously been used for another purpose. Upon closer inspection, I noticed the firewall rules may not be as permissive as kargo would like. I suggest you log into each of your nodes and clear the iptables rules:
iptables -P INPUT ACCEPT     # reset default policies to ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
iptables -t nat -F           # flush the nat table
iptables -t mangle -F        # flush the mangle table
iptables -F                  # flush all filter rules
iptables -X                  # delete all non-default chains
After doing this, running ansible-playbook cluster.yml -i inventory/inventory.cfg completed the cluster setup for me. Hope this helps!
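If you want to apply the same reset to every node in one pass, a throwaway playbook along these lines should work (a sketch only; it assumes the k8s-cluster group from the inventories above and a hypothetical file name flush-iptables.yml):
```yaml
# flush-iptables.yml: one-off helper, not part of kargo.
- hosts: k8s-cluster
  become: true
  tasks:
    - name: Reset iptables policies and flush all rules
      shell: |
        iptables -P INPUT ACCEPT
        iptables -P FORWARD ACCEPT
        iptables -P OUTPUT ACCEPT
        iptables -t nat -F
        iptables -t mangle -F
        iptables -F
        iptables -X
```
Run it with ansible-playbook -i inventory/inventory.cfg flush-iptables.yml before re-running cluster.yml.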
I've gotten this issue every time I've tried to use kargo. I've normally struggled through to get it working, using the solution listed by @hbokh but am hitting the problem again and can't seem to get past it.
I've got only two nodes: one master and one node. It seems that, no matter the configuration of the inventory, this step fails like this:
TASK [dnsmasq : Check for dnsmasq port (pulling image and running container)] ************************************************************************************************************************************************************************************************
task path: .../kargo/roles/dnsmasq/tasks/main.yml:83
Saturday 22 April 2017 20:40:04 -0400 (0:00:02.709) 0:05:35.279 ********
fatal: [master]: FAILED! => {"changed": false, "elapsed": 180, "failed": true, "msg": "Timeout when waiting for 10.233.0.2:53"}
Strangely, this failure does not stop the run immediately; about 30 seconds later it ends at
TASK [kubernetes/preinstall : run xfs_growfs] ********************************************************************************************************************************************************************************************************************************
task path: .../kargo/roles/kubernetes/preinstall/tasks/growpart-azure-centos-7.yml:25
Saturday 22 April 2017 20:43:19 -0400 (0:00:00.042) 0:08:51.066 ********
META: ran handlers
META: ran handlers
then hits
PLAY [kube-master[0]] ********************************************************************************************************************************************************************************************************************************************************
to retry, use: --limit .../kargo/cluster.retry
PLAY RECAP *******************************************************************************************************************************************************************************************************************************************************************
localhost : ok=3 changed=0 unreachable=0 failed=0
master : ok=352 changed=75 unreachable=0 failed=1
node1 : ok=353 changed=72 unreachable=0 failed=0
My inventory file looks like this:
master ansible_ssh_host=1.2.3.4
node1 ansible_ssh_host=5.6.7.8
[kube-master]
master
[etcd]
master
node1
[kube-node]
master
node1
[k8s-cluster:children]
kube-node
kube-master
[k8s-cluster:vars]
ansible_python_interpreter="/opt/bin/python"
The systems both have the latest CoreOS, but I've experienced this same problem on 5 nodes running CentOS and Debian every time I tried to use Kargo, all vanilla installs (including latest CoreOS).
Relevant settings, from all.yml:
bootstrap_os: coreos
bin_dir: /opt/bin
...
and from k8s-cluster.yml:
kube_version: v1.5.6
kube_network_plugin: weave
dns_mode: dnsmasq_kubedns
resolvconf_mode: host_resolvconf
using commit 502f2f040db01f3e2237010359b0d26b058fd4cf from master (cloned).
Any help would be much appreciated!
@twister915 are you deploying on Azure? Calico does not work there. Try the Flannel or Weave network plugin.
@mattymo I'm deploying to a bare metal server I purchased from OVH, using Weave (as noted in the k8s-cluster.yml included), and have had the same issues on Vultr doing a similar configuration.
Note: this is the log content of my dnsmasq autoscaler:
Error while getting cluster status: Get https://10.233.0.1:443/api/v1/nodes: dial tcp 10.233.0.1:443: getsockopt: no route to host
Like 100 times over, and this is all the log contains.
That's the service IP for kubernetes in the default namespace:
master ~ # kubectl -n default describe svc/kubernetes
Name: kubernetes
Namespace: default
Labels: component=apiserver
provider=kubernetes
Annotations: <none>
Selector: <none>
Type: ClusterIP
IP: 10.233.0.1
Port: https 443/TCP
Endpoints: [master ip]:6443
Session Affinity: ClientIP
Events: <none>
Also, possibly less relevant, but this is the routing table from the master:
master ~ # ip route
default via [master gateway] dev eth0 proto static
10.233.64.0/18 dev weave proto kernel scope link src 10.233.64.1
[master network]/28 dev eth0 proto kernel scope link src [master ip]
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
Some testing:
master ~ # curl 10.233.0.1:443
[no output, but no error]
master ~ # curl [master ip]:6443
[no output, but no error]
Obviously that works as expected, but I'm not sure why the pods are having such trouble talking to that IP. Here are the same debug commands run inside a pod:
[root@test-centos-370406107-06j5w /]# ip route
default via 10.233.96.0 dev eth0
10.233.64.0/18 dev eth0 proto kernel scope link src 10.233.96.3
[root@test-centos-370406107-06j5w /]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
33: eth0@if34: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UP
link/ether 56:7d:a6:7a:bc:e5 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.233.96.3/18 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::547d:a6ff:fe7a:bce5/64 scope link
valid_lft forever preferred_lft forever
[root@test-centos-370406107-06j5w /]# curl 10.233.0.1:443
curl: (7) Failed connect to 10.233.0.1:443; No route to host
[root@test-centos-370406107-06j5w /]# env
HOSTNAME=test-centos-370406107-06j5w
KUBERNETES_PORT_443_TCP_PORT=443
KUBERNETES_PORT=tcp://10.233.0.1:443
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_HOST=10.233.0.1
KUBERNETES_PORT_443_TCP_PROTO=tcp
no_proxy=*.local, 169.254/16
KUBERNETES_SERVICE_PORT_HTTPS=443
KUBERNETES_PORT_443_TCP_ADDR=10.233.0.1
KUBERNETES_PORT_443_TCP=tcp://10.233.0.1:443
and these are some of my network settings from k8s-cluster.yml:
kube_network_plugin: weave
enable_network_policy: false
kube_service_addresses: 10.233.0.0/18
kube_pods_subnet: 10.233.64.0/18
kube_network_node_prefix: 24
Hopefully that might shed some more light on what's causing this.
As far as I can see you are lacking a third node in your inventory (just as I was). Have a look at the first image, below "Basic usage", at https://github.com/kubespray/kargo-cli: "3 vms, all 3 have etcd installed, all 3 are nodes (running pods), 2 of them run master components"
@hbokh Appreciate the info. I managed to get kargo to deploy to two nodes last night, though. I've been trying to put together two different clusters for two different projects, and I've only been having problems with this particular cluster. I know it says I need three nodes there, but I've gotten my other cluster up and running using the exact same kargo folder:
pc:kargo user$ kubectl get nodes
NAME STATUS AGE
master Ready 19h
node1 Ready 19h
I think the issue I'm having has something to do with the node I'm deploying to (master is identical to the other cluster's master). The install of CoreOS was performed by the host (OVH), not by me with my cloud-config. I don't know specifically what they might have done wrong, though, and I'd like to pin down the cause instead of just trying to replicate the state of the other cluster's node as closely as possible.
First let me state that I'm really impressed by these scripts. I started using them about 2 weeks ago. However, today the playbook fails on both CoreOS Alpha and Ubuntu 16.04. Using commit 52a85d5757799669f4d10502d048e0e78c1b98db, this is the error from the playbook:
On host etcd0 this is the command failing, AFAIK:
Any clues?