kubernetes-sigs / kubespray

Deploy a Production Ready Kubernetes Cluster
Apache License 2.0

Check for dnsmasq port fails #368

Closed: hbokh closed this issue 8 years ago

hbokh commented 8 years ago

First let me state that I'm really impressed by these scripts. I started using them approximately two weeks ago. However, today the playbook fails on both CoreOS Alpha and Ubuntu 16.04. Using commit 52a85d5757799669f4d10502d048e0e78c1b98db, this is the error from the playbook:

TASK [dnsmasq : Check for dnsmasq port (pulling image and running container)] **
skipping: [etcd1]
skipping: [etcd2]
fatal: [etcd0]: FAILED! => {"changed": false, "elapsed": 301, "failed": true, "msg": "Timeout when waiting for 10.233.0.2:53"}

On host etcd0 this is the command failing AFAIK:

/usr/local/bin/kubelet --v=2 --address=0.0.0.0 --hostname_override=etcd0 --allow_privileged=true --cluster_dns=10.233.0.2 --cluster_domain=cluster.local --kubeconfig=/etc/kubernetes/node-kubeconfig.yaml --config=/etc/kubernetes/manifests --resolv-conf=/etc/resolv.conf --register-node=false

Any clues?
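
For anyone debugging the same failure: the task is just waiting for something to answer on the cluster DNS VIP, so a quick manual check from the failing host (a rough sketch, assuming nc and dig are available there) would look like this:

# Is anything answering on the DNS VIP at all?
nc -vz -w 3 10.233.0.2 53

# Does it resolve an in-cluster name?
dig +short +time=3 kubernetes.default.svc.cluster.local @10.233.0.2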

jjungnickel commented 8 years ago

I just encountered the same issue when removing the etcd/master nodes from the kube-node group. When I keep all nodes in the group, everything works. Well, except that we don't want masters/etcds in the kube-node group.

ant31 commented 8 years ago

Thanks for the report @jjungnickel @hbokh.

Smana commented 8 years ago

Just tested a new deployment with Debian 8 without facing the issue. Currently running another one with Ubuntu Xenial.

Smana commented 8 years ago

It works on Ubuntu Xenial too; let's try on CoreOS.

mattymo commented 8 years ago

I'm curious to see what the role distribution is for @hbokh's environment. My suspicions for how this could happen:

If this issue can still be reproduced in master, let's try to improve the check here so it verifies whether the pod started okay (and, if not, dumps some useful output).

hbokh commented 8 years ago

Thank you, @mattymo - I have just set up a new CoreOS environment on 3 VMware VMs, based on the latest Stable 1068.8.0 (deployed with this repo: https://github.com/gclayburg/coreos-vmware-deploy). Kubespray's Kargo deployment used to work fine up until approximately one month ago. The issue remains (same error at the Check for dnsmasq port ... task) with a fresh pull from this git repo today. What specific information would be helpful in this case? To get started, here's my inventory.cfg:

[kube-master]
node1

[etcd]
node1
node2
node3

[kube-node]
node2
node3

[k8s-cluster:children]
kube-node
kube-master

There are pods too (checked with kubectl on OS X, after installing the new certificates from node1):

$ kubectl get pods --namespace=kube-system
NAME               READY     STATUS    RESTARTS   AGE
dnsmasq-i7g3u      1/1       Running   0          11m
dnsmasq-jk63c      1/1       Running   0          11m
flannel-node2      2/2       Running   2          11m
flannel-node3      2/2       Running   2          11m
kube-proxy-node2   1/1       Running   1          11m
kube-proxy-node3   1/1       Running   2          11m

The weird thing is, the Docker image andyshinn/dnsmasq is present on both node2 and node3. Is it supposed to be downloaded on node1 too, I wonder?
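
As a quick check (a sketch, not something from the thread), the wide pod listing shows which nodes the dnsmasq pods were actually scheduled to, and therefore which nodes need the image:

$ kubectl get pods --namespace=kube-system -o wide | grep dnsmasq
# the NODE column tells you where each dnsmasq pod landed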

Anyway, I'll leave this (incomplete) setup running to provide any needed information.

mattymo commented 8 years ago

This is an architectural issue.

https://github.com/kubespray/kargo/blob/master/roles/dnsmasq/tasks/main.yml#L58

kube-proxy only runs on nodes with the kube-node role. If you change the line above to the following, it should work: when: inventory_hostname == groups['kube-node'][0]
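
For reference, a rough sketch of the adjusted task, assuming the check is an Ansible wait_for on port 53 (which matches the timeout message above); the exact task body in the role may differ:

- name: Check for dnsmasq port (pulling image and running container)
  wait_for:
    host: 10.233.0.2   # the cluster DNS VIP
    port: 53
    timeout: 300
  # previously restricted to the first kube-master; kube-proxy only runs on
  # kube-node hosts, so the VIP is only reachable from a node in that group
  when: inventory_hostname == groups['kube-node'][0]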

Do we want to run kube-proxy on kube-master[0]? I don't see the benefit, but we do configure DNS to point to 10.233.0.2 on all kube-masters, so we're effectively breaking their DNS configuration if they aren't also in the kube-node group.

hbokh commented 8 years ago

Thanks heaps! I have applied the above change and re-run the playbook: no more error / failure indeed. I'm missing the architectural point of view on this, though... @jjungnickel seems to have the same concern as I have / had.

Would you be so kind as to suggest a better setup / inventory.cfg in this case with 3 hosts? If adding a 4th host / VM to my setup is needed, I consider that an option too.

mattymo commented 8 years ago

I prefer the following 3-node setup:

node1
node2

[etcd]
node1
node2
node3

[kube-node]
node1
node2
node3

[k8s-cluster:children]
kube-node
kube-master

It won't run into this bug.

jjungnickel commented 8 years ago

While I don't really mind running masters and etcd on the same host and also as a node, this directly contradicts the deployment options outlined in the examples shown in the README.

hbokh commented 8 years ago

@mattymo Just to be 100% sure, you mean adding a 2nd node to [kube-master]? In the above case [kube-master] seems to have fallen off the first line, so:

[kube-master]
node1
node2

[etcd]
node1
...

mattymo commented 8 years ago

Yes, sorry. Copy and paste error on my side.

hbokh commented 8 years ago

@jjungnickel Thanks for pointing that out; however, I'm missing your point here, since this is in "Basic usage" regarding 3 VMs:

3 vms, all 3 have etcd installed, all 3 are nodes (running pods), 2 of them run master components

and that looks exactly like what @mattymo is suggesting with his inventory.cfg.

jjungnickel commented 8 years ago

@hbokh Oh, you're right - I should have made myself more clear. I was referring to the other deployment options outlined in the document, specifically the one isolating etcd/master/nodes. Those are contradicted; the first one is not.

antonyfr commented 7 years ago

Hi, I have the same issue even with the suggested inventory configuration (3 nodes, 2 of which are masters). I tried with freshly installed Ubuntu 16.04 and CentOS 7; same error.

hellwen commented 7 years ago

I have the same issue on CentOS 7. Kargo's version is v2.1.0.

ansible-playbook -i inventory/inventory.ini cluster.yml -b -v --private-key=~/.ssh/id_rsa

Inventory:

[kube-master]
node1
node2

[etcd]
node1
node2
node3

[kube-node]
node1
node2
node3

[k8s-cluster:children]
kube-node
kube-master

TASK [dnsmasq : Check for dnsmasq port (pulling image and running container)] **
Thursday 23 March 2017  18:23:10 +0800 (0:00:01.552)       0:06:00.043 ******** 
fatal: [node1]: FAILED! => {"changed": false, "elapsed": 301, "failed": true, "msg": "Timeout when waiting for 10.233.0.2:53"}

The IP address 10.233.0.2 is a k8s service cluster IP; is it supposed to be exposed to the host?
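
One way to check whether kube-proxy has actually programmed the host to reach that cluster IP (assuming the default iptables proxy mode) is to look for it in the nat table that kube-proxy maintains:

# kube-proxy in iptables mode creates the KUBE-SERVICES chain;
# the DNS cluster IP should show up there if the proxy is running on this host
iptables -t nat -L KUBE-SERVICES -n | grep 10.233.0.2

If nothing matches, kube-proxy isn't running (or hasn't synced) on that host, which is exactly the situation described above when the host isn't in the kube-node group.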

jdowning commented 7 years ago

@antonyfr @hellwen I also ran into this situation. I was trying to set up Kubernetes on nodes that had previously been used for another purpose. Upon closer inspection, I noticed the firewall rules may not have been as permissive as Kargo would like. I suggest you log into each of your nodes and clear the iptables rules:

iptables -P INPUT ACCEPT      # reset the default policies to ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
iptables -t nat -F            # flush the nat and mangle tables
iptables -t mangle -F
iptables -F                   # flush all rules in the filter table
iptables -X                   # delete any user-defined chains

After doing this, running ansible-playbook cluster.yml -i inventory/inventory.cfg completed the cluster setup for me. Hope this helps!

Twister915 commented 7 years ago

I've gotten this issue every time I've tried to use Kargo. I've normally struggled through to get it working using the solution listed by @hbokh, but I'm hitting the problem again and can't seem to get past it.

I've got only two nodes: one master and one node. It seems that, no matter the configuration of the inventory, this step fails like this:

TASK [dnsmasq : Check for dnsmasq port (pulling image and running container)] ************************************************************************************************************************************************************************************************
task path: .../kargo/roles/dnsmasq/tasks/main.yml:83
Saturday 22 April 2017  20:40:04 -0400 (0:00:02.709)       0:05:35.279 ******** 
fatal: [master]: FAILED! => {"changed": false, "elapsed": 180, "failed": true, "msg": "Timeout when waiting for 10.233.0.2:53"}

Strangely, this failure does not stop the run until about 30 seconds later; it ends at

TASK [kubernetes/preinstall : run xfs_growfs] ********************************************************************************************************************************************************************************************************************************
task path: .../kargo/roles/kubernetes/preinstall/tasks/growpart-azure-centos-7.yml:25
Saturday 22 April 2017  20:43:19 -0400 (0:00:00.042)       0:08:51.066 ******** 
META: ran handlers
META: ran handlers

then hits

PLAY [kube-master[0]] ********************************************************************************************************************************************************************************************************************************************************
    to retry, use: --limit .../kargo/cluster.retry

PLAY RECAP *******************************************************************************************************************************************************************************************************************************************************************
localhost                  : ok=3    changed=0    unreachable=0    failed=0   
master                     : ok=352  changed=75   unreachable=0    failed=1   
node1                      : ok=353  changed=72   unreachable=0    failed=0  

My inventory file looks like this:

master ansible_ssh_host=1.2.3.4
node1 ansible_ssh_host=5.6.7.8

[kube-master]
master

[etcd]
master
node1

[kube-node]
master
node1

[k8s-cluster:children]
kube-node
kube-master

[k8s-cluster:vars]
ansible_python_interpreter="/opt/bin/python"

The systems both have the latest CoreOS, but I've experienced this same problem on 5 nodes running CentOS and Debian every time I tried to use Kargo, all vanilla installs (including latest CoreOS).

Relevant: all.yml

bootstrap_os: coreos
bin_dir: /opt/bin

...

k8s-cluster.yml

kube_version: v1.5.6
kube_network_plugin: weave
dns_mode: dnsmasq_kubedns
resolvconf_mode: host_resolvconf

Using commit 502f2f040db01f3e2237010359b0d26b058fd4cf from master (cloned).

Any help would be much appreciated!

mattymo commented 7 years ago

@twister915 are you deploying on Azure? Calico does not work there. Try the Flannel or Weave network plugin.

Twister915 commented 7 years ago

@mattymo I'm deploying to a bare metal server I purchased from OVH, using Weave (as noted in the k8s-cluster.yml included), and have had the same issues on Vultr doing a similar configuration.

Twister915 commented 7 years ago

Note, this is the log content of my dnsmasq autoscaler:

Error while getting cluster status: Get https://10.233.0.1:443/api/v1/nodes: dial tcp 10.233.0.1:443: getsockopt: no route to host

Like 100 times over, and this is all the log contains.

That's the service IP for kubernetes in the default namespace:

master ~ # kubectl -n default describe svc/kubernetes
Name:           kubernetes
Namespace:      default
Labels:         component=apiserver
            provider=kubernetes
Annotations:        <none>
Selector:       <none>
Type:           ClusterIP
IP:         10.233.0.1
Port:           https   443/TCP
Endpoints:      [master ip]:6443
Session Affinity:   ClientIP
Events:         <none>

Also, possibly less relevant, but this is the routing table from the master

master ~ # ip route
default via [master gateway] dev eth0  proto static 
10.233.64.0/18 dev weave  proto kernel  scope link  src 10.233.64.1 
[master network]/28 dev eth0  proto kernel  scope link  src [master ip]
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.0.1 

Some testing

master ~ # curl 10.233.0.1:443
[no output, but no error]

master ~ # curl [master ip]:6443
[no output, but no error]

Obviously that works as expected, but I'm not sure why the pods are having such trouble talking to that IP. Here are the same debug commands run inside a pod:

[root@test-centos-370406107-06j5w /]# ip route
default via 10.233.96.0 dev eth0 
10.233.64.0/18 dev eth0  proto kernel  scope link  src 10.233.96.3 

[root@test-centos-370406107-06j5w /]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
33: eth0@if34: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UP 
    link/ether 56:7d:a6:7a:bc:e5 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.233.96.3/18 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::547d:a6ff:fe7a:bce5/64 scope link 
       valid_lft forever preferred_lft forever

[root@test-centos-370406107-06j5w /]# curl 10.233.0.1:443
curl: (7) Failed connect to 10.233.0.1:443; No route to host

[root@test-centos-370406107-06j5w /]# env
HOSTNAME=test-centos-370406107-06j5w
KUBERNETES_PORT_443_TCP_PORT=443
KUBERNETES_PORT=tcp://10.233.0.1:443
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_HOST=10.233.0.1
KUBERNETES_PORT_443_TCP_PROTO=tcp
no_proxy=*.local, 169.254/16
KUBERNETES_SERVICE_PORT_HTTPS=443
KUBERNETES_PORT_443_TCP_ADDR=10.233.0.1
KUBERNETES_PORT_443_TCP=tcp://10.233.0.1:443

And these are some of my network settings from k8s-cluster.yml:

kube_network_plugin: weave
enable_network_policy: false
kube_service_addresses: 10.233.0.0/18
kube_pods_subnet: 10.233.64.0/18
kube_network_node_prefix: 24

Hopefully that sheds some more light on what's causing this.
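
A couple of things that might be worth checking on the failing host, as a rough sketch (not from the thread): whether the host's FORWARD policy is dropping pod traffic, and whether the Weave peers can actually reach each other:

# A DROP/REJECT default policy here is a common cause of "no route to host"
# from inside pods
iptables -S FORWARD | head -n 1

# Weave peers talk to each other on 6783/tcp (and udp); check it's reachable
nc -vz -w 3 [other node ip] 6783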

hbokh commented 7 years ago

As far as I can see, you are lacking a third node in your inventory (just as I was). Have a look at the first image below "Basic usage" at https://github.com/kubespray/kargo-cli: "3 vms, all 3 have etcd installed, all 3 are nodes (running pods), 2 of them run master components"

Twister915 commented 7 years ago

@hbokh Appreciate the info. I managed to get kargo to deploy to two nodes last night, though. I've been trying to put together two different clusters for two different projects, and I've only been having problems with this particular cluster. I know it says I need three nodes there, but I've gotten my other cluster up and running using the exact same kargo folder:

pc:kargo user$ kubectl get nodes
NAME      STATUS    AGE
master    Ready     19h
node1     Ready     19h

I think the issue I'm having has something to do with the node I'm deploying to (master is identical to the other cluster's master). The install of CoreOS was performed by the host (OVH), not by me with my cloud-config. I don't know specifically what they might have done wrong, though, and I'd like to pin down the cause instead of just trying to replicate the state of the other cluster's node as closely as possible.