openshift / openshift-ansible

Install and config an OpenShift 3.x cluster
https://try.openshift.com
Apache License 2.0

If the default route is owned by an interface that is unmanaged, dnsmasq dispatcher fails #2115

Closed: sdodson closed 6 years ago

sdodson commented 8 years ago

Possibly a common thing on Azure?

FilipVozar commented 8 years ago

It was CentOS 7.2 on Azure. I can provide more input if needed.

sdodson commented 8 years ago

@FilipVozar after #2112 is merged, can you see if this works on a clean install? It's a workaround, but it should address the issue for now.

sdodson commented 8 years ago

@FilipVozar is there a particular CentOS image you used? It doesn't look like there's an official CentOS image.

FilipVozar commented 8 years ago

@sdodson I'm using the image provided by OpenLogic (http://www.openlogic.com/products-services/services/cloud-services/azure). That's the only "plain" CentOS image I could find in the Azure Marketplace.

I created a new host using the same image and did a clean install. dnsmasq was restarted, it is using the correct config, and DNS works.

[root@node1 ~]# systemctl status dnsmasq
● dnsmasq.service - DNS caching server.
   Loaded: loaded (/usr/lib/systemd/system/dnsmasq.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2016-07-05 15:58:27 UTC; 4min 54s ago
 Main PID: 6852 (dnsmasq)
   CGroup: /system.slice/dnsmasq.service
           └─6852 /usr/sbin/dnsmasq -k
Jul 05 15:58:27 node1 systemd[1]: Started DNS caching server..
Jul 05 15:58:27 node1 systemd[1]: Starting DNS caching server....
Jul 05 15:58:27 node1 dnsmasq[6852]: started, version 2.66 cachesize 150
Jul 05 15:58:27 node1 dnsmasq[6852]: compile time options: IPv6 GNU-getopt DBus no-i18n IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth
Jul 05 15:58:27 node1 dnsmasq[6852]: using nameserver 172.30.0.1#53 for domain cluster.local
Jul 05 15:58:27 node1 dnsmasq[6852]: read /etc/hosts - 2 addresses
[root@node1 ~]# dig docker-registry.default.svc.cluster.local @localhost
; <<>> DiG 9.9.4-RedHat-9.9.4-29.el7_2.3 <<>> docker-registry.default.svc.cluster.local @localhost
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 24506
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;docker-registry.default.svc.cluster.local. IN A
;; ANSWER SECTION:
docker-registry.default.svc.cluster.local. 30 IN A 172.30.28.58
;; Query time: 2 msec
;; SERVER: ::1#53(::1)
;; WHEN: Tue Jul 05 15:59:20 UTC 2016
;; MSG SIZE  rcvd: 75

sdodson commented 8 years ago

Actually, it looks like the description of this issue is wrong, at least based on my testing. Here are the NM debug logs: https://gist.github.com/sdodson/1034166301747486549d015991fa40e7. ipv4.dns is set but IP4_NAMESERVERS is empty. I need to see how to get access to that value.
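
For context, NetworkManager hands the DNS servers of a connection to dispatcher scripts in the IP4_NAMESERVERS environment variable as a space-separated list. The following is a minimal Python sketch of turning that value into dnsmasq `server=` lines. The function name `upstream_lines` is hypothetical, and this is not the dispatcher script shipped by openshift-ansible; it only illustrates why an empty IP4_NAMESERVERS would leave dnsmasq without upstream servers.

```python
def upstream_lines(ip4_nameservers):
    """Build dnsmasq 'server=' lines from a space-separated nameserver list.

    `ip4_nameservers` stands in for the value of the IP4_NAMESERVERS
    environment variable that a dispatcher script would see.
    """
    # str.split() with no arguments drops empty fields, so "" yields [].
    return ["server=" + ns for ns in ip4_nameservers.split()]


# A populated variable produces one upstream line per nameserver.
print(upstream_lines("168.63.129.16"))
# An empty variable produces no upstream lines at all.
print(upstream_lines(""))
```

With an empty value, the generated upstream configuration would be empty, which matches the symptom described here.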

sdodson commented 8 years ago

@FilipVozar Can you verify that you can still resolve external hosts via dnsmasq?

FilipVozar commented 8 years ago

@sdodson I can't (and I couldn't before, this hasn't changed). /etc/resolv.conf inside the pod has 2 nameservers: the host IP and the nameserver IP inherited from the host. Resolving external domains using the host's dnsmasq fails.

sdodson commented 8 years ago

@FilipVozar Thanks. It's designed so that dnsmasq is capable of answering all queries and becomes the only nameserver at the host level, but I believe a bug in NetworkManager is keeping it from working properly on Azure. This will be required in the future, but it's not fatal today. I'll keep trying to figure out why this isn't working on Azure.

sdodson commented 8 years ago

@FilipVozar I think this is another manifestation of https://bugzilla.redhat.com/show_bug.cgi?id=1316138. Can you try updating to NetworkManager-1.0.6-29.el7_2 or later and then rebooting your machine? What I'm looking for is that the node's /etc/resolv.conf points at the node itself (dnsmasq) and /etc/dnsmasq.d/origin-upstream-dns.conf lists the otherwise default nameservers. At that point dnsmasq should be able to resolve all hostnames, both cluster.local and external (e.g. google.com).
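
The first half of that expected state can be checked mechanically. Below is a minimal Python sketch, assuming the node's own address is known; the function names and the address `10.0.0.4` are hypothetical and used only for illustration.

```python
def resolv_nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        # Comment lines start with ';' or '#', so they never match here.
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def points_at_self(resolv_conf_text, node_ip):
    """True when the node's own address is the only nameserver listed."""
    return resolv_nameservers(resolv_conf_text) == [node_ip]


# Hypothetical contents; on a real node, read /etc/resolv.conf instead.
sample = "search cluster.local\nnameserver 10.0.0.4\n"
print(points_at_self(sample, "10.0.0.4"))
```

On a real host the text would come from reading /etc/resolv.conf, and a similar check could list the `server=` lines in /etc/dnsmasq.d/origin-upstream-dns.conf.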

FilipVozar commented 8 years ago

NM 1.0.6-29.el7_2 was installed since the beginning (or openshift-ansible installed it, I only logged in after running ansible).

[root@node1 etc]# yum info NetworkManager
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
Installed Packages
Name        : NetworkManager
Arch        : x86_64
Epoch       : 1
Version     : 1.0.6
Release     : 29.el7_2
Size        : 9.1 M
Repo        : installed
From repo   : CentOS-Updates
....
[root@node1 etc]# cat /etc/resolv.conf
; generated by /usr/sbin/dhclient-script
search o2pdvy5yxbcu1ds0vouklfrdlg.cx.internal.cloudapp.net
nameserver 168.63.129.16

Also, there is no /etc/dnsmasq.d/origin-upstream-dns.conf, only /etc/dnsmasq.d/origin-dns.conf. In /etc/dnsmasq.conf and /etc/dnsmasq.d/origin-dns.conf, these are the only uncommented lines:

[root@node1 etc]# grep -r "^[^#]" /etc/dnsmasq*
/etc/dnsmasq.conf:conf-dir=/etc/dnsmasq.d
/etc/dnsmasq.d/origin-dns.conf:strict-order
/etc/dnsmasq.d/origin-dns.conf:no-resolv
/etc/dnsmasq.d/origin-dns.conf:domain-needed
/etc/dnsmasq.d/origin-dns.conf:server=/cluster.local/172.30.0.1
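
This configuration is consistent with external lookups failing: `no-resolv` tells dnsmasq not to read upstream servers from /etc/resolv.conf, and the only `server=` line is scoped to cluster.local. A simplified Python model of that selection is sketched below. The function `upstreams_for` is hypothetical, handles only a single domain per `server=` line, and is not dnsmasq's actual implementation.

```python
def upstreams_for(name, config_lines):
    """Return the upstream servers a query for `name` could be sent to.

    Simplified model: only 'server=' lines are considered, and the
    longest matching domain wins over general (unscoped) servers.
    """
    general, best_domain, best_servers = [], "", []
    name = name.rstrip(".").lower()
    for line in config_lines:
        if not line.startswith("server="):
            continue
        value = line[len("server="):]
        if value.startswith("/"):
            # Domain-specific form: server=/<domain>/<address>
            _, domain, address = value.split("/", 2)
            domain = domain.lower()
            if name == domain or name.endswith("." + domain):
                if len(domain) > len(best_domain):
                    best_domain, best_servers = domain, [address]
                elif domain == best_domain:
                    best_servers.append(address)
        else:
            # General form: server=<address>
            general.append(value)
    return best_servers if best_servers else general


config = [
    "strict-order",
    "no-resolv",
    "domain-needed",
    "server=/cluster.local/172.30.0.1",
]
print(upstreams_for("docker-registry.default.svc.cluster.local", config))
print(upstreams_for("google.com", config))
```

In this model the cluster.local name maps to 172.30.0.1, while google.com has no upstream until a general `server=` line (for example, one generated into origin-upstream-dns.conf) is present.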