kubernetes-sigs / kubespray

Deploy a Production Ready Kubernetes Cluster
Apache License 2.0

timeout when wait_for # dnsmasq : Check for dnsmasq port (pulling image and running container) #1164

Closed: 4admin2root closed this issue 6 years ago

4admin2root commented 7 years ago

Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

Environment:

Kargo version (commit) (git rev-parse --short HEAD): f6cd42e

Network plugin used: calico

Copy of your inventory file:

[kube-master]
kg1
kg2

[etcd]
kg1
kg2
kg3

[kube-node]
kg2
kg3
kg4

[k8s-cluster:children]
kube-node
kube-master

Command used to invoke ansible:

Output of ansible run:

TASK [dnsmasq : Start Resources] ***
task path: /usr/local/lvzj/github/kargo/roles/dnsmasq/tasks/main.yml:65
Tuesday 21 March 2017 13:14:24 +0800 (0:00:01.126) 0:05:52.038 *
Using module file /usr/local/lvzj/github/kargo/library/kube.py

ESTABLISH SSH CONNECTION FOR USER: None SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/ansible-ssh-%h-%p-%r kg1 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"'' changed: [kg1] => (item={'_ansible_parsed': True, u'src': u'/root/.ansible/tmp/ansible-tmp-1490073263.04-72781134800908/source', '_ansible_item_result': True, u'group': u'root', u'uid': 0, u'dest': u'/etc/kubernetes/dnsmasq-deploy.yml', u'checksum': u'052a2046899441a059031f3bb891dc4bb6ec2382', u'md5sum': u'e26f59fca90df2c15968992852ce872f', u'owner': u'root', 'item': {u'type': u'deployment', u'name': u'dnsmasq', u'file': u'dnsmasq-deploy.yml'}, u'state': u'file', u'gid': 0, u'secontext': u'system_u:object_r:etc_t:s0', u'mode': u'0644', 'changed': True, 'invocation': {u'module_args': {u'src': u'/root/.ansible/tmp/ansible-tmp-1490073263.04-72781134800908/source', u'directory_mode': None, u'force': True, u'remote_src': None, u'unsafe_writes': None, u'selevel': None, u'seuser': None, u'serole': None, u'follow': True, u'content': None, u'dest': u'/etc/kubernetes/dnsmasq-deploy.yml', u'setype': None, u'original_basename': u'dnsmasq-deploy.yml', u'delimiter': None, u'mode': None, u'regexp': None, u'owner': None, u'group': None, u'validate': None, u'backup': False}}, u'size': 1562, '_ansible_no_log': False}) => { "changed": true, "invocation": { "module_args": { "all": false, "filename": "/etc/kubernetes/dnsmasq-deploy.yml", "force": false, "kubectl": "/usr/local/bin/kubectl", "label": null, "log_level": 0, "name": "dnsmasq", "namespace": "kube-system", "resource": "deployment", "server": null, "state": "latest" }, "module_name": "kube" }, "item": { "changed": true, "checksum": "052a2046899441a059031f3bb891dc4bb6ec2382", "dest": "/etc/kubernetes/dnsmasq-deploy.yml", "gid": 0, "group": "root", "invocation": { "module_args": { "backup": false, "content": null, "delimiter": null, "dest": "/etc/kubernetes/dnsmasq-deploy.yml", "directory_mode": null, "follow": true, "force": true, "group": null, "mode": null, "original_basename": "dnsmasq-deploy.yml", "owner": null, "regexp": null, "remote_src": null, "selevel": null, "serole": null, "setype": null, "seuser": null, "src": "/root/.ansible/tmp/ansible-tmp-1490073263.04-72781134800908/source", "unsafe_writes": null, "validate": null } }, "item": { "file": "dnsmasq-deploy.yml", "name": "dnsmasq", "type": "deployment" }, "md5sum": "e26f59fca90df2c15968992852ce872f", "mode": "0644", "owner": "root", "secontext": "system_u:object_r:etc_t:s0", "size": 1562, "src": "/root/.ansible/tmp/ansible-tmp-1490073263.04-72781134800908/source", "state": "file", "uid": 0 }, "msg": "success: deployment \"dnsmasq\" deleted deployment \"dnsmasq\" replaced" } Using module file /usr/local/lvzj/github/kargo/library/kube.py ESTABLISH SSH CONNECTION FOR USER: None SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/ansible-ssh-%h-%p-%r kg1 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"'' changed: [kg1] => (item={'_ansible_parsed': True, u'src': u'/root/.ansible/tmp/ansible-tmp-1490073263.39-79301315044098/source', '_ansible_item_result': True, 
u'group': u'root', u'uid': 0, u'dest': u'/etc/kubernetes/dnsmasq-svc.yml', u'checksum': u'a11ee168b81ff8c9ad9e519b24161db8d36d1759', u'md5sum': u'c0e0d5d3fa51ee9fdd6137728afe6f39', u'owner': u'root', 'item': {u'type': u'svc', u'name': u'dnsmasq', u'file': u'dnsmasq-svc.yml'}, u'state': u'file', u'gid': 0, u'secontext': u'system_u:object_r:etc_t:s0', u'mode': u'0644', 'changed': True, 'invocation': {u'module_args': {u'src': u'/root/.ansible/tmp/ansible-tmp-1490073263.39-79301315044098/source', u'directory_mode': None, u'force': True, u'remote_src': None, u'unsafe_writes': None, u'selevel': None, u'seuser': None, u'serole': None, u'follow': True, u'content': None, u'dest': u'/etc/kubernetes/dnsmasq-svc.yml', u'setype': None, u'original_basename': u'dnsmasq-svc.yml', u'delimiter': None, u'mode': None, u'regexp':None, u'owner': None, u'group': None, u'validate': None, u'backup': False}}, u'size': 395, '_ansible_no_log': False}) => { "changed": true, "invocation": { "module_args": { "all": false, "filename": "/etc/kubernetes/dnsmasq-svc.yml", "force": false, "kubectl": "/usr/local/bin/kubectl", "label": null, "log_level": 0, "name": "dnsmasq", "namespace": "kube-system", "resource": "svc", "server": null, "state": "latest" }, "module_name": "kube" }, "item": { "changed": true, "checksum": "a11ee168b81ff8c9ad9e519b24161db8d36d1759", "dest": "/etc/kubernetes/dnsmasq-svc.yml", "gid": 0, "group": "root", "invocation": { "module_args": { "backup": false, "content": null, "delimiter": null, "dest": "/etc/kubernetes/dnsmasq-svc.yml", "directory_mode": null, "follow": true, "force": true, "group": null, "mode": null, "original_basename": "dnsmasq-svc.yml", "owner": null, "regexp": null, "remote_src": null, "selevel": null, "serole": null, "setype": null, "seuser": null, "src": "/root/.ansible/tmp/ansible-tmp-1490073263.39-79301315044098/source", "unsafe_writes": null, "validate": null } }, "item": { "file": "dnsmasq-svc.yml", "name": "dnsmasq", "type": "svc" }, "md5sum": "c0e0d5d3fa51ee9fdd6137728afe6f39", "mode": "0644", "owner": "root", "secontext": "system_u:object_r:etc_t:s0", "size": 395, "src": "/root/.ansible/tmp/ansible-tmp-1490073263.39-79301315044098/source", "state": "file", "uid": 0 }, "msg": "success: service \"dnsmasq\" deleted service \"dnsmasq\" replaced" } Using module file /usr/local/lvzj/github/kargo/library/kube.py ESTABLISH SSH CONNECTION FOR USER: None SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/ansible-ssh-%h-%p-%r kg1 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"'' changed: [kg1] => (item={'_ansible_parsed': True, u'src': u'/root/.ansible/tmp/ansible-tmp-1490073263.74-208071795770388/source', '_ansible_item_result': True, u'group': u'root', u'uid': 0, u'dest': u'/etc/kubernetes/dnsmasq-autoscaler.yml', u'checksum': u'891e717050f9918ffdb41b789c4f6c04011db569', u'md5sum': u'1d6ada9b626a15f7e73482afb06d4c22', u'owner': u'root', 'item': {u'type': u'deployment', u'name': u'dnsmasq-autoscaler', u'file': u'dnsmasq-autoscaler.yml'}, u'state': u'file', u'gid': 0, u'secontext': u'system_u:object_r:etc_t:s0', u'mode': u'0644', 'changed': True, 'invocation': {u'module_args': {u'src': u'/root/.ansible/tmp/ansible-tmp-1490073263.74-208071795770388/source', u'directory_mode': None, u'force': True, u'remote_src': None, u'unsafe_writes': None, 
u'selevel': None, u'seuser': None, u'serole': None, u'follow': True, u'content': None, u'dest': u'/etc/kubernetes/dnsmasq-autoscaler.yml', u'setype': None, u'original_basename': u'dnsmasq-autoscaler.yml', u'delimiter': None, u'mode': None, u'regexp': None, u'owner': None, u'group': None, u'validate': None, u'backup': False}}, u'size': 1832, '_ansible_no_log': False}) => { "changed": true, "invocation": { "module_args": { "all": false, "filename": "/etc/kubernetes/dnsmasq-autoscaler.yml", "force": false, "kubectl": "/usr/local/bin/kubectl", "label": null, "log_level": 0, "name": "dnsmasq-autoscaler", "namespace": "kube-system", "resource": "deployment", "server": null, "state": "latest" }, "module_name": "kube" }, "item": { "changed": true, "checksum": "891e717050f9918ffdb41b789c4f6c04011db569", "dest": "/etc/kubernetes/dnsmasq-autoscaler.yml", "gid": 0, "group": "root", "invocation": { "module_args": { "backup": false, "content": null, "delimiter": null, "dest": "/etc/kubernetes/dnsmasq-autoscaler.yml", "directory_mode": null, "follow": true, "force": true, "group": null, "mode": null, "original_basename": "dnsmasq-autoscaler.yml", "owner": null, "regexp": null, "remote_src": null, "selevel": null, "serole": null, "setype": null, "seuser": null, "src": "/root/.ansible/tmp/ansible-tmp-1490073263.74-208071795770388/source", "unsafe_writes": null, "validate": null } }, "item": { "file": "dnsmasq-autoscaler.yml", "name": "dnsmasq-autoscaler", "type": "deployment" }, "md5sum": "1d6ada9b626a15f7e73482afb06d4c22", "mode": "0644", "owner": "root", "secontext": "system_u:object_r:etc_t:s0", "size": 1832, "src": "/root/.ansible/tmp/ansible-tmp-1490073263.74-208071795770388/source", "state": "file", "uid": 0 }, "msg": "success: deployment \"dnsmasq-autoscaler\" deleted deployment \"dnsmasq-autoscaler\" replaced" }

TASK [dnsmasq : Check for dnsmasq port (pulling image and running container)] **
task path: /usr/local/lvzj/github/kargo/roles/dnsmasq/tasks/main.yml:76
Tuesday 21 March 2017 13:14:28 +0800 (0:00:04.364) 0:05:56.402 *********
Using module file /usr/lib/python2.7/site-packages/ansible/modules/core/utilities/logic/wait_for.py
ESTABLISH SSH CONNECTION FOR USER: None
SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/ansible-ssh-%h-%p-%r kg2 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
fatal: [kg2]: FAILED! => { "changed": false, "elapsed": 301, "failed": true, "invocation": { "module_args": { "connect_timeout": 5, "delay": 15, "exclude_hosts": null, "host": "10.233.0.2", "path": null, "port": 53, "search_regex": null, "state": "started", "timeout": 300 }, "module_name": "wait_for" }, "msg": "Timeout when waiting for 10.233.0.2:53" }

NO MORE HOSTS LEFT *************************************************************
        to retry, use: --limit @/usr/local/lvzj/github/kargo/cluster.retry

PLAY RECAP *********************************************************************
kg1        : ok=343  changed=40  unreachable=0  failed=0
kg2        : ok=370  changed=51  unreachable=0  failed=1
kg3        : ok=306  changed=27  unreachable=0  failed=0
kg4        : ok=263  changed=19  unreachable=0  failed=0
localhost  : ok=3    changed=0   unreachable=0  failed=0

**Anything else do we need to know**:

In this case, I can get the pods with kubectl; the output is as follows:

[root@cloud4ourself-kg1 ~]# /usr/local/bin/kubectl get pod --all-namespaces -o wide
NAMESPACE     NAME                                         READY   STATUS    RESTARTS   AGE   IP               NODE
kube-system   dnsmasq-411420702-mzhl9                      1/1     Running   2          19h   10.233.102.128   cloud4ourself-kg4
kube-system   dnsmasq-autoscaler-1155841093-9jd94          1/1     Running   2          19h   10.233.65.192    cloud4ourself-kg3
kube-system   kube-apiserver-cloud4ourself-kg1             1/1     Running   2          19h   10.9.5.105       cloud4ourself-kg1
kube-system   kube-apiserver-cloud4ourself-kg2             1/1     Running   0          19h   10.9.5.104       cloud4ourself-kg2
kube-system   kube-controller-manager-cloud4ourself-kg1    0/1     Pending   0          24m                    cloud4ourself-kg1
kube-system   kube-controller-manager-cloud4ourself-kg2    0/1     Pending   0          24m                    cloud4ourself-kg2
kube-system   kube-proxy-cloud4ourself-kg1                 1/1     Running   3          19h   10.9.5.105       cloud4ourself-kg1
kube-system   kube-proxy-cloud4ourself-kg2                 1/1     Running   0          19h   10.9.5.104       cloud4ourself-kg2
kube-system   kube-proxy-cloud4ourself-kg3                 1/1     Running   3          19h   10.9.5.103       cloud4ourself-kg3
kube-system   kube-proxy-cloud4ourself-kg4                 1/1     Running   3          19h   10.9.5.102       cloud4ourself-kg4
kube-system   kube-scheduler-cloud4ourself-kg1             0/1     Pending   0          24m                    cloud4ourself-kg1
kube-system   kube-scheduler-cloud4ourself-kg2             0/1     Pending   0          24m                    cloud4ourself-kg2
kube-system   nginx-proxy-cloud4ourself-kg3                1/1     Running   3          19h   10.9.5.103       cloud4ourself-kg3
kube-system   nginx-proxy-cloud4ourself-kg4                1/1     Running   3          19h   10.9.5.102       cloud4ourself-kg4

## Change this to use another Kubernetes version, e.g. a current beta release
kube_version: v1.5.4
4admin2root commented 7 years ago

logs.tar.gz

4admin2root commented 7 years ago

I changed kube_network_plugin to flannel and it works.
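For anyone else hitting this, the switch is just the kube_network_plugin variable in the inventory group vars; a minimal sketch, assuming the usual k8s-cluster group vars file (adjust the path to your own layout):

# inventory/group_vars/k8s-cluster.yml (path is an assumption; use wherever you keep your cluster vars)
kube_network_plugin: flannel   # was: calico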

jicki commented 7 years ago

Me too:

fatal: [node1]: FAILED! => {"changed": false, "elapsed": 180, "failed": true, "msg": "Timeout when waiting for 10.233.0.2:53"}

skuda commented 7 years ago

Same here with k8s 1.6.0 and Ubuntu 16.04 (Xenial) as the host.

skuda commented 7 years ago

OK, my problem could be related to installing it in Azure, although I had the same problem yesterday in DigitalOcean. In the coming weeks I will install it on some bare-metal servers and will try calico again then; for now I will stick with Azure and flannel for testing purposes.

jduhamel commented 7 years ago

For some reason this failed for me on flannel with CoreOS beta as the base OS.

RRAlex commented 7 years ago

I have the same issue with both 1.5.3 and 1.6.0, on Ubuntu 16.04, running inside OpenStack, using Ansible 2.2.1 (in a venv because 2.2.2 is broken).

The only options added on top of the defaults are:

ipip: true
calico_mtu: 1340

It seems this dependency in roles/dnsmasq/meta/main.yml:

---
dependencies:
  - role: download
    file: "{{ downloads.dnsmasq }}"
    when: dns_mode == 'dnsmasq_kubedns' and download_localhost|default(false)
    tags: [download, dnsmasq]

...never gets to run, as I don't see the andyshinn/dnsmasq:2.72 image anywhere on the nodes.

I'm not sure whether it's related to download_run_once and download_localhost in roles/download/defaults/main.yml, whose raison d'être I'm not sure about, but I don't need local downloads.
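For reference, those toggles sit in roles/download/defaults/main.yml and look roughly like this (a sketch; the exact defaults may differ in your checkout):

# roles/download/defaults/main.yml (sketch; verify against your tree)
download_run_once: False    # if true, download images/files once and push them to the other nodes
download_localhost: False   # if true, perform those downloads on the Ansible control host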

I'm new to kubernetes, but my guess is that deployments should be able to pull their own image when they need to, through docker...?

More debug / info:

# kubectl get deployment --all-namespaces
NAMESPACE     NAME                        DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
kube-system   deploy/dnsmasq              1         0         0            0           1h
kube-system   deploy/dnsmasq-autoscaler   1         0         0            0           1h

# kubectl describe -f /etc/kubernetes/dnsmasq-svc.yml
Name:           dnsmasq
Namespace:      kube-system
Labels:         k8s-app=dnsmasq
            kubernetes.io/cluster-service=true
Selector:       k8s-app=dnsmasq
Type:           ClusterIP
IP:         10.233.0.2
Port:           dns-tcp 53/TCP
Endpoints:      <none>
Port:           dns 53/UDP
Endpoints:      <none>
Session Affinity:   None
No events.      

# curl http://localhost:8080/api/v1/proxy/namespaces/kube-system/services/dnsmasq
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "no endpoints available for service \"dnsmasq\"",
  "reason": "ServiceUnavailable",

  "code": 503   
}
bradbeam commented 7 years ago

Any luck if you add the calico network to the allowed address ranges in OpenStack (mentioned at the bottom of the doc below)? https://github.com/kubernetes-incubator/kargo/blob/master/docs/openstack.md

RRAlex commented 7 years ago

I'm already using ipip: true and calico_mtu: 1340. If I also set cloud_provider: "openstack" (with or without a neutron port-update for the 10.233.0.0/16 range), it gets stuck at:

RUNNING HANDLER [kubernetes/master : Master | wait for the apiserver to be running] ***
[...]
FAILED - RETRYING: HANDLER: kubernetes/master : Master | wait for the apiserver to be running (1 retries left).
FAILED - RETRYING: HANDLER: kubernetes/master : Master | wait for the apiserver to be running (1 retries left).
FAILED - RETRYING: HANDLER: kubernetes/master : Master | wait for the apiserver to be running (1 retries left).
fatal: [staging_test_002]: FAILED! => {"attempts": 20, "changed": false, "content": "", "failed": true, "msg": "Status code was not [200]: Request failed: <urlopen error [Errno 111] Connection refused>", "redirected": false, "status": -1, "url": "http://localhost:8080/healthz"}
fatal: [staging_test_001]: FAILED! => {"attempts": 20, "changed": false, "content": "", "failed": true, "msg": "Status code was not [200]: Request failed: <urlopen error [Errno 111] Connection refused>", "redirected": false, "status": -1, "url": "http://localhost:8080/healthz"}
fatal: [staging_test_003]: FAILED! => {"attempts": 20, "changed": false, "content": "", "failed": true, "msg": "Status code was not [200]: Request failed: <urlopen error [Errno 111] Connection refused>", "redirected": false, "status": -1, "url": "http://localhost:8080/healthz"}

EDIT³: With flannel or calico, I now get stuck on the first run at RUNNING HANDLER [kubernetes/master : Master | wait for kube-scheduler] *********, and at the problem above on the second run. Not sure if the latest git pull is responsible for this...

Looks like with the latest version (as of 2017-04-06) it still times out, but after about 15 minutes the service is running, though the curl test still fails and I can't resolve names inside a container.

justicel commented 7 years ago

So it seems that if you set cloud_provider, Kubernetes tries to reach out to OpenStack before the DNS services are set up, and this breaks the install flow. Hope this helps figure out how to troubleshoot and resolve it!

RRAlex commented 7 years ago

@justicel: strangely, I have the same result with 1.6.1 (released today) and no cloud_provider set...

skuda commented 7 years ago

I had no problem installing 1.6.1 on bare metal using calico; everything is working fine. I did upgrade some components in the config, though. The upgrade is probably not needed, but I wanted to test with the latest releases.

etcd_version: v3.1.4
calico_version: "v1.1.0"
calico_cni_version: "v1.6.1"
calico_policy_version: "v0.5.4"
flannel_version: v0.7.0

Calico had a problem with Kubernetes 1.6 that was fixed in Calico CNI 1.6.1; you can find the details here: http://docs.projectcalico.org/v2.1/releases/

Anyway, my problem with calico was clearly related to trying to use it in Azure, so maybe the update is not needed.

justicel commented 7 years ago

The problem I have seen is not specifically with the network driver, as long as you follow the instructions for your platform. The issue is that the Kubernetes API/management pods need to be spun up after DNS services are established, or they need to temporarily use some name servers other than the kubedns/dnsmasq ones. What happens is that the OpenStack API (or other cloud APIs) can't be reached until these pods are up, and it's a chicken-and-egg thing.
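If it really is that ordering problem, one workaround to try would be giving the hosts an upstream resolver during bootstrap so nothing depends on the in-cluster dnsmasq; something along these lines (upstream_dns_servers is the kargo variable I would expect to use here, so treat the exact name and placement as an assumption):

# cluster group vars (sketch; variable name/placement is an assumption)
upstream_dns_servers:
  - 8.8.8.8
  - 8.8.4.4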

Starefossen commented 7 years ago

Similar failure with the weave network plugin. Will try again with flannel to see if that helps.