openshift / origin

Conformance test suite for OpenShift
http://www.openshift.org
Apache License 2.0

DNS resolution fails if default search domain has a wildcard match #17316

Open ikus060 opened 6 years ago

ikus060 commented 6 years ago

Name resolution from inside the pod seems to be broken because of multiple factors.

Version
# oc version
oc v3.7.0-rc.0+e92d5c5
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://127.0.0.1:8443
openshift v3.7.0-rc.0+e92d5c5
kubernetes v1.7.6+a08f5eeb62

Steps To Reproduce

It looks like the /etc/resolv.conf file generated by OpenShift does not work in every scenario.

First, to show that resolution works with a plain resolv.conf on the host:

# cat /etc/resolv.conf
nameserver 8.8.8.8
search patrikdufresne.com

# nslookup -debug dl-cdn.alpinelinux.org
Server:     8.8.8.8
Address:    8.8.8.8#53

------------
    QUESTIONS:
    dl-cdn.alpinelinux.org, type = A, class = IN
    ANSWERS:
    ->  dl-cdn.alpinelinux.org
    canonical name = global.prod.fastly.net.
    ttl = 59
    ->  global.prod.fastly.net
    internet address = 151.101.0.249
    ttl = 19
    ->  global.prod.fastly.net
    internet address = 151.101.64.249
    ttl = 19
    ->  global.prod.fastly.net
    internet address = 151.101.128.249
    ttl = 19
    ->  global.prod.fastly.net
    internet address = 151.101.192.249
    ttl = 19
    AUTHORITY RECORDS:
    ADDITIONAL RECORDS:
------------
Non-authoritative answer:
dl-cdn.alpinelinux.org  canonical name = global.prod.fastly.net.
Name:   global.prod.fastly.net
Address: 151.101.0.249
Name:   global.prod.fastly.net
Address: 151.101.64.249
Name:   global.prod.fastly.net
Address: 151.101.128.249
Name:   global.prod.fastly.net
Address: 151.101.192.249

This is the /etc/resolv.conf generated in the pod; resolution is not working:

# cat /etc/resolv.conf 
nameserver 8.8.8.8
search default.svc.cluster.local svc.cluster.local cluster.local patrikdufresne.com
options ndots:5

# nslookup -debug dl-cdn.alpinelinux.org
Server:     8.8.8.8
Address:    8.8.8.8#53

------------
    QUESTIONS:
    dl-cdn.alpinelinux.org.default.svc.cluster.local, type = A, class = IN
    ANSWERS:
    AUTHORITY RECORDS:
    ->  .
    origin = a.root-servers.net
    mail addr = nstld.verisign-grs.com
    serial = 2017111401
    refresh = 1800
    retry = 900
    expire = 604800
    minimum = 86400
    ttl = 86385
    ADDITIONAL RECORDS:
------------
** server can't find dl-cdn.alpinelinux.org.default.svc.cluster.local: NXDOMAIN
Server:     8.8.8.8
Address:    8.8.8.8#53

------------
    QUESTIONS:
    dl-cdn.alpinelinux.org.svc.cluster.local, type = A, class = IN
    ANSWERS:
    AUTHORITY RECORDS:
    ->  .
    origin = a.root-servers.net
    mail addr = nstld.verisign-grs.com
    serial = 2017111401
    refresh = 1800
    retry = 900
    expire = 604800
    minimum = 86400
    ttl = 86394
    ADDITIONAL RECORDS:
------------
** server can't find dl-cdn.alpinelinux.org.svc.cluster.local: NXDOMAIN
Server:     8.8.8.8
Address:    8.8.8.8#53

------------
    QUESTIONS:
    dl-cdn.alpinelinux.org.cluster.local, type = A, class = IN
    ANSWERS:
    AUTHORITY RECORDS:
    ->  .
    origin = a.root-servers.net
    mail addr = nstld.verisign-grs.com
    serial = 2017111401
    refresh = 1800
    retry = 900
    expire = 604800
    minimum = 86400
    ttl = 86378
    ADDITIONAL RECORDS:
------------
** server can't find dl-cdn.alpinelinux.org.cluster.local: NXDOMAIN
Server:     8.8.8.8
Address:    8.8.8.8#53

------------
    QUESTIONS:
    dl-cdn.alpinelinux.org.patrikdufresne.com, type = A, class = IN
    ANSWERS:
    AUTHORITY RECORDS:
    ->  patrikdufresne.com
    origin = ns2.no-ip.com
    mail addr = hostmaster.no-ip.com
    serial = 2010091255
    refresh = 10800
    retry = 1800
    expire = 604800
    minimum = 1800
    ttl = 1799
    ADDITIONAL RECORDS:
------------
Non-authoritative answer:
*** Can't find dl-cdn.alpinelinux.org: No answer

If I remove my domain name patrikdufresne.com from the search line, it works:

# cat /etc/resolv.conf 
nameserver 8.8.8.8
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
root@tymara:/home/ikus060# nslookup dl-cdn.alpinelinux.org
Server:     8.8.8.8
Address:    8.8.8.8#53

Non-authoritative answer:
dl-cdn.alpinelinux.org  canonical name = global.prod.fastly.net.
Name:   global.prod.fastly.net
Address: 151.101.0.249
Name:   global.prod.fastly.net
Address: 151.101.64.249
Name:   global.prod.fastly.net
Address: 151.101.128.249
Name:   global.prod.fastly.net
Address: 151.101.192.249

It also works if I remove options ndots:5:

# cat /etc/resolv.conf 
nameserver 8.8.8.8
search default.svc.cluster.local svc.cluster.local cluster.local patrikdufresne.com
root@tymara:/home/ikus060# nslookup dl-cdn.alpinelinux.org
Server:     8.8.8.8
Address:    8.8.8.8#53

Non-authoritative answer:
dl-cdn.alpinelinux.org  canonical name = global.prod.fastly.net.
Name:   global.prod.fastly.net
Address: 151.101.0.249
Name:   global.prod.fastly.net
Address: 151.101.64.249
Name:   global.prod.fastly.net
Address: 151.101.128.249
Name:   global.prod.fastly.net
Address: 151.101.192.249
johnfosborneiii commented 6 years ago

I ran into this exact same issue with a fresh installation of OCP 3.7 on a RHEL 7.4 VM.

Outbound networking worked from the VM itself, and it also worked when I ran a container out of band from Kubernetes (using docker run). But when OCP ran the container, outbound networking broke. It could be fixed by removing either options ndots:5 or "search josborne.com" from the pod's /etc/resolv.conf. I couldn't figure out where "search josborne.com" was even coming from, because I didn't set it anywhere in the Ansible advanced installation. I changed my /etc/hostname file from openshift.josborne.com to openshift and rebooted; at that point "search josborne.com" was removed from the pod /etc/resolv.conf and everything started working.

Is this user error or a bug? I've installed every release of OCP from scratch using an FQDN in my /etc/hostname file, and it first broke in either 3.6 or 3.7, so I think something has changed in the platform.

danwinship commented 6 years ago

Right, so the problem is that if a domain listed in the search line does wildcard matching, then because of the ndots:5, basically all hostnames end up being treated as subdomains of that domain. E.g., *.josborne.com appears to resolve to a particular AWS hostname, so if you look up, say, github.com, it ends up matching as github.com.josborne.com, which resolves to the AWS IP.
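
A quick way to check whether a given search domain does this kind of wildcard matching (a sketch; "no-such-host-12345" is a made-up label, and you would substitute your own search domain for josborne.com):

# nslookup no-such-host-12345.josborne.com   # made-up, presumed-nonexistent name

An NXDOMAIN response means the domain has no wildcard; an A record answer means every name under the domain matches, and with ndots:5 external lookups will be swallowed by it.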

I guess the search field in the pod resolv.conf is set automatically from the node hostname?

What we really want is to make service name lookups behave like ndots:5, but make other lookups not do that. We can't make the libc resolver do that, but in cases where we're running a DNS server inside the cluster, we could do the ndots-like special-casing inside that server, and then we could give the pods a resolv.conf without ndots.
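
As a concrete sketch of that idea: CoreDNS's autopath plugin implements exactly this server-side handling of the search path (this assumes a CoreDNS-based cluster DNS, which is not what this release ships):

# cat Corefile
.:53 {
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods verified        # autopath needs the pod-IP to namespace mapping
    }
    autopath @kubernetes     # follow the search path server-side, so pods
                             # can be given a resolv.conf without ndots:5
    forward . 8.8.8.8        # everything unresolved goes upstream
}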

The other possibility would be to stop including the node's domain in the pod resolv.conf's search field, but that would break any existing pods that were depending on the current behavior, so we'd need some sort of compatibility option.

ikus060 commented 6 years ago

Since the recommended way to install OpenShift is via the Ansible playbook, I would add extra validation in Ansible to make sure the provided DNS domain behaves as expected. If not, the playbook should fail and warn the user.
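
A minimal sketch of such a validation (hypothetical, not an actual playbook task): probe a random, presumed-nonexistent label under the candidate search domain; if it resolves anyway, the domain has a wildcard record and the playbook should fail before generating ndots:5 configs.

# getent hosts preflight-check-12345.patrikdufresne.com && echo "ERROR: wildcard DNS on search domain" >&2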

openshift-bot commented 6 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 6 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

gbraad commented 6 years ago

This is still an issue. /remove-lifecycle rotten

gbraad commented 6 years ago

For Minishift this is an issue with some hypervisors that force a search entry via the DHCP offer. E.g., Hyper-V on the "default switch" uses search mshome.net, which can cause lookups to github.com during S2I to fail.

gbraad commented 6 years ago

Note: the options ndots:5 setting has been part of Kubernetes since about 2015 => https://github.com/kubernetes/kubernetes/pull/10266/commits/23caf446ae69236641da0fdc432d4cfb5fff098d#diff-0db82891d463ba14dd59da9c77f4776eR66 (ref: https://github.com/kubernetes/kubernetes/pull/10266)

xpflying commented 6 years ago

Same issue with an Ansible install of OpenShift 3.10.

shadowlord017 commented 6 years ago

Same for me: ndots:5 makes the resolver append the domain names from the search line before trying the original address.
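
That is the documented resolver rule: a name with fewer dots than the ndots value is tried with each search suffix before being tried as-is, and the search stops at the first suffix where the name exists. With the pod resolv.conf from the original report, the query order for dl-cdn.alpinelinux.org (2 dots < 5) is exactly what the nslookup -debug trace above showed:

    dl-cdn.alpinelinux.org.default.svc.cluster.local   -> NXDOMAIN, continue
    dl-cdn.alpinelinux.org.svc.cluster.local           -> NXDOMAIN, continue
    dl-cdn.alpinelinux.org.cluster.local               -> NXDOMAIN, continue
    dl-cdn.alpinelinux.org.patrikdufresne.com          -> name exists (wildcard), search stops
    dl-cdn.alpinelinux.org                             -> never queried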

openshift-bot commented 5 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

danwinship commented 5 years ago

/remove-lifecycle stale /lifecycle frozen

sponte commented 4 years ago

Hello, is there a workaround for this? I seem to be facing the same issue with k8s 1.19, CoreDNS, and my external domain, which is part of the DNS search path and has a wildcard match.
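
One commonly used mitigation on the Kubernetes side is to lower ndots per pod via spec.dnsConfig (available since Kubernetes 1.10). A minimal sketch, assuming the workload can live with search-list expansion only for dot-free names (the pod name and image here are placeholders):

# cat pod-dns-override.yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-example          # placeholder name
spec:
  containers:
  - name: test
    image: busybox           # placeholder image
    command: ["sleep", "3600"]
  dnsConfig:
    options:
    - name: ndots            # merged into the pod's generated /etc/resolv.conf
      value: "1"             # names containing a dot are now tried as-is first

With ndots:1, bare service names like "backend" still go through the search list, while anything containing a dot (github.com, dl-cdn.alpinelinux.org) is tried as given before any suffix is appended, sidestepping the wildcard match.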