Closed timothysc closed 7 years ago
Trello card associated with this issue: https://trello.com/c/uIvwy5Yj
Post deployment, our default configuration leaves DNS in a state where pods cannot communicate with the outside world.
The article you've linked to solves many problems, but it shouldn't be directly related to this unless google.com overlaps with your default subdomain, overlaps with your host's search path, or your hosts themselves cannot resolve google.com. Can you post the contents of a pod's /etc/resolv.conf?
Below you can see that the pod has skydns listed as its first resolver, with the host's resolvers next. SkyDNS returns a SERVFAIL, so the resolver moves on to the host's resolvers, which succeed.
[root@hello-openshift1 /]# cat /etc/resolv.conf
nameserver 172.30.0.1
nameserver 192.168.122.1
search default.svc.cluster.local svc.cluster.local cluster.local example.com
options ndots:5
[root@hello-openshift1 /]# dig +nofail google.com
;; Got SERVFAIL reply from 172.30.0.1, trying next server
; <<>> DiG 9.9.4-RedHat-9.9.4-29.el7_2.2 <<>> +nofail google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 21539
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 6, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;google.com. IN A
;; ANSWER SECTION:
google.com. 27 IN A 173.194.68.139
google.com. 27 IN A 173.194.68.101
google.com. 27 IN A 173.194.68.102
google.com. 27 IN A 173.194.68.138
google.com. 27 IN A 173.194.68.100
google.com. 27 IN A 173.194.68.113
;; Query time: 0 msec
;; SERVER: 192.168.122.1#53(192.168.122.1)
;; WHEN: Mon Feb 01 15:50:49 EST 2016
;; MSG SIZE rcvd: 124
[root@hello-openshift1 /]# curl https://www.google.com -v 2>&1 | grep -e subject -e OK
* subject: CN=www.google.com,O=Google Inc,L=Mountain View,ST=California,C=US
< HTTP/1.1 200 OK
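The fall-through behavior above can also be checked by querying each nameserver from resolv.conf individually. A minimal sketch (the sample file below just mirrors the pod's resolv.conf so the snippet is self-contained; on a real host, point the function at /etc/resolv.conf and run a real `dig @"$ns" google.com` per server):

```shell
# Extract the nameserver list from a resolv.conf so each server can be
# queried on its own, mirroring the manual dig debugging above.
list_nameservers() {
  awk '/^nameserver/ {print $2}' "$1"
}

# Self-contained demo against a sample file matching the pod's resolv.conf:
sample=$(mktemp)
cat > "$sample" <<'EOF'
nameserver 172.30.0.1
nameserver 192.168.122.1
search default.svc.cluster.local svc.cluster.local cluster.local example.com
options ndots:5
EOF
for ns in $(list_nameservers "$sample"); do
  # On a real host you would run: dig +time=2 @"$ns" google.com
  echo "would query $ns"
done
rm -f "$sample"
```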
Also, what is the dnsPolicy of the pod?
nameserver 172.24.0.1
nameserver 10.1.4.30
pod was run via: sudo ./e2e.test --provider="local" --ginkgo.v=true --ginkgo.focus="should provide Internet connection for containers" --kubeconfig="/etc/origin/master/admin.kubeconfig" --repo-root="/home/cloud-user/kubernetes"
It got a SERVFAIL reply from 172.24.0.1, but then bailed. Once dnsmasq was enabled, all was well.
I suspect this was introduced in the recent rebase
Introduced via https://github.com/kubernetes/kubernetes/pull/18089
issue being discussed at https://github.com/kubernetes/kubernetes/issues/20090
yup.
@liggitt or @smarterclayton Do you guys know yet what we intend to do about the change to ClusterFirst dnsPolicy? Will this be critical for 3.2?
I thought we were carrying a patch to preserve the old behavior... are you seeing a change in behavior in origin?
I deployed origin/master:
commit 86a15eb95324991b0de57de04e71b29cec5ab63f
Date: Thu Jan 28 09:55:15 2016 -0500
and I'm seeing the behavior outlined.
Our tests routinely clone from github.com, etc, inside pods...
https://github.com/openshift/origin/commit/7d76e2e6880c1c9dfddb37e5c270bc1930c059a6 is the carry, which preserves the pre-rebase behavior of including the clusterDNS in the list of nameservers
@timothysc can you also show the output of
oc get service kubernetes -n default -o yaml
oc get endpoints kubernetes -n default -o yaml
Trying to figure out why you're seeing a behavior change
@liggitt details requested are here: https://paste.fedoraproject.org/317540/42926714/
I'm guessing the fact that we are running on openstack may have something to do with it.
was this setup working previously (and if so, when)? trying to nail down what changed
I think this is a problem with the busybox image in question not moving on to the next nameserver on SERVFAIL; this fails on my OSE 3.1.1.6 cluster as well.
So that is likely the issue with inconsistent resolver behavior with multiple non-overlapping nameservers, which was the reason upstream removed the clusterDNS+hostDNS chaining.
172.30.0.1 is my kubernetes svc
[root@ose3-master ~]# docker run -it gcr.io/google_containers/busybox
# cat /etc/resolv.conf
# Generated by NetworkManager
search example.com
nameserver 192.168.122.1
# wget -s google.com
Connecting to google.com (173.194.208.138:80)
Connecting to www.google.com (74.125.226.18:80)
# cat /etc/resolv.conf
# Generated by NetworkManager
search example.com
nameserver 172.30.0.1
nameserver 192.168.122.1
# wget -s google.com
wget: bad address 'google.com'
and that works correctly with another image, like one of our origin images?
Yeah, it's fine with fedora/rhel based images which use glibc's resolver.
So, @smarterclayton, ruling on this?
A. turn skydns into an open resolver, and undo our carry
B. break non-glibc resolvers
C. require additional setup of dnsmasq
Also works with latest docker.io/busybox image
imo end user experience is what matters most here, not a fan of (B).
@liggitt @smarterclayton
A. turn skydns into an open resolver, and undo our carry
Instead of turning skydns into an open resolver, could we just configure forwarders to a provided set of dns hosts through the master config?
B. break non-glibc resolvers
I don't like the idea of potentially inconsistent behavior for the user, especially if we are advertising running upstream docker containers.
C. require additional setup of dnsmasq?
I'm not sure we want to travel down that path just yet... Unless there was no other choice.
I don't think it's a bad idea to do it for demo/POC use cases, but supporting it in a production use case is something else...
From this: https://github.com/skynetservices/skydns#configuration it looks like configuring the forwarders is possible and it defaults to the entries in /etc/resolv.conf (which should be a good fallback).
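For reference, SkyDNS reads its configuration from etcd under `/skydns/config`; a forwarders setup along the lines discussed could look something like the following (the port and the forwarder address are placeholders, not values from this cluster):

```shell
etcdctl set /skydns/config \
  '{"dns_addr":"0.0.0.0:8053","nameservers":["192.168.122.1:53"],"no_rec":false}'
```

With `nameservers` set, skydns forwards anything outside its authoritative domain to those servers instead of failing with SERVFAIL; when unset, it falls back to the entries in /etc/resolv.conf.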
Right... turning on no_rec was explicitly done in https://github.com/openshift/origin/commit/8ff419a71968caf36dde8501112e7a37c982bc81... hence needing a ruling
If we switch to upstream behavior, I won't need to write the filtering that prevents duplicate entries in a pod's resolv.conf when we add the kubernetes svc IP to the host's resolv.conf, because clusterFirst is really cluster-only upstream.
@brenton This is the issue we discussed during standup, in particular we need to reach a decision on Jordan's 3 proposed options.
My opinion is that we have proof positive that at least some resolvers screw up and that we should go with option A while attempting to secure skydns in a way that it's only accessible to the cluster.
We can make this an optional configuration for admins and tell them to block external DNS traffic into the cluster. They turn on open resolver, they need to close the resolver loop.
I don't want C - I don't think it helps us in the short term, and the long term this is a cluster problem.
@smarterclayton the issue with B is that it'll require further custom logic to avoid duplicating the DNS resolver in the pods.
The simple use case is easy, where the host DNS resolver is pointed at the service IP for skydns. The other use cases are more difficult (where they are pointed to skydns through an IP on a master host or a public IP assigned to a master).
For A, we can always default to locking down the skydns port from all but the service network.
The service network is just local NAT on a node to the node IPs for the master. So wouldn't we have to ensure it's open to all subnets where nodes exist? This seems challenging without having a deny-all rule and having the api-servers add allow rules for each known node IP.
That is true... you would need a deny all for 53 and add an entry for each node on the master.
The plus side is that you could limit the impact on the other rules by pushing port 53 into its own chain for processing.
As an alternative, since we already require the masters be on the pod network, we could potentially expose it that way.
Ultimately nodes and ramp nodes will have to access DNS as well - both of those need access to the pod network as well.
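The deny-all-plus-per-node-allow idea with port 53 in its own chain could look roughly like this (node IPs and the chain name are placeholders; the rules are emitted as text rather than applied, so the shape is visible without touching a live firewall):

```shell
# Hypothetical sketch: one dedicated chain for DNS, an ACCEPT per known
# node IP, and a trailing DROP. An agent on the master would regenerate
# the ACCEPT list as nodes come and go.
emit_dns_rules() {
  echo "iptables -N OS_DNS"
  echo "iptables -A INPUT -p udp --dport 53 -j OS_DNS"
  echo "iptables -A INPUT -p tcp --dport 53 -j OS_DNS"
  for node in "$@"; do
    echo "iptables -A OS_DNS -s $node -j ACCEPT"
  done
  echo "iptables -A OS_DNS -j DROP"
}

emit_dns_rules 192.168.122.10 192.168.122.11
```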
While working on adding cluster DNS to hosts, I discussed with the networking team how best to automate reconfiguring the hosts' resolv.conf, etc. They actually suggested using NetworkManager's dns=dnsmasq setting and providing a dnsmasq config to selectively resolve cluster DNS zones via the kube svc IP.
Perhaps Option C is the best way forward, given it's probably how we'll end up giving host processes access to cluster DNS? If so, should we set our dnsPolicy to Default so that pods just use the host's resolv.conf and we're effectively all in on dnsmasq?
This would also mean we'd require NetworkManager, which is slightly controversial but if it's the right tool...
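A sketch of what that split-horizon setup could look like (filenames are assumptions; the 172.30.0.1 service IP and cluster.local domain are taken from earlier in the thread):

```ini
# /etc/NetworkManager/conf.d/dns.conf  (hypothetical filename)
[main]
dns=dnsmasq

# /etc/NetworkManager/dnsmasq.d/openshift.conf  (hypothetical filename)
# Route only cluster zones to the kube service IP; everything else
# follows the host's normal upstream servers.
server=/cluster.local/172.30.0.1
server=/30.172.in-addr.arpa/172.30.0.1
```

With dns=dnsmasq, NetworkManager runs a local dnsmasq instance, points resolv.conf at 127.0.0.1, and loads any fragments from its dnsmasq.d directory, so cluster names resolve via skydns while everything else uses the host's resolvers.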
I'm not sure we're going to be able to sell our customer base on using NetworkManager.
I'm of the thinking that if all of the hosts that need to talk to DNS are all on/have access to the pod network, then maybe that is the best option for securing access and having a centralized authoritative DNS solution.
https://github.com/openshift/origin/pull/7598 adds dnsmasq via a NetworkManager dispatcher script and works out of the box assuming default cluster DNS values. My intention is to complete testing of that and deliver it via the origin/atomic-openshift node RPMs. We can then configure dnsIP = ansible_default_ipv4 to switch over to using node-local dnsmasq once we're sure of how robust dnsmasq proves to be.
https://trello.com/c/a2vUr9KE/12-dns-integration seems related to this discussion. If a "centralized authoritative DNS solution" is used, as @detiber points out, we have a much better story on how our DNS solution is "pluggable", letting you swap out the provided solution for your own.
So what's the resolution, as of today (v3.2.0.9) the end user experience is still:
/usr/libexec/atomic-openshift/extended.test --ginkgo.v=true --ginkgo.focus="Conformance"
[Fail] ClusterDns [Feature:Example] [It] should create pod that uses dns [Conformance]
/builddir/build/BUILD/atomic-openshift-git-0.b99af7d/_thirdpartyhacks/src/k8s.io/kubernetes/test/e2e/util.go:1537
[Fail] DNS [It] should provide DNS for the cluster [Conformance]
/builddir/build/BUILD/atomic-openshift-git-0.b99af7d/_thirdpartyhacks/src/k8s.io/kubernetes/test/e2e/dns.go:229
[Fail] DNS [It] should provide DNS for services [Conformance]
/builddir/build/BUILD/atomic-openshift-git-0.b99af7d/_thirdpartyhacks/src/k8s.io/kubernetes/test/e2e/dns.go:229
[Fail] Networking [It] should provide Internet connection for containers [Conformance]
/builddir/build/BUILD/atomic-openshift-git-0.b99af7d/_thirdpartyhacks/src/k8s.io/kubernetes/test/e2e/networking.go:53
/cc @danmcp
What are these tests doing specifically? Are they running from a pod, or from a node?
DNS resolution should work just fine from a pod and should provide the pod with internet connectivity. Do the nodes have internet connectivity/resolution themselves? Is the test using a container image that does not iterate through the list of DNS resolvers?
Is the test using a container image that does not iterate through the list of DNS resolvers?
It's the uclibc resolver in the busybox image, as mentioned above.
This issue has been inactive for quite some time. Please update and reopen this issue if this is still a priority you would like to see action on.
Inactive but not resolved? The ask here (as I understand it) is to have OCP install a DNS solution. Has the idea here changed?
No, this is all about internal dns. The card you've referenced is for providing external facing authoritative dns but that's completely different from the original subject of this issue. In my opinion this should've been closed when #1588 merged.
When running upstream's networking conformance tests, the pods fail to resolve google.com. In order to rectify this, I had to set up dnsmasq on the master: http://developerblog.redhat.com/2015/11/19/dns-your-openshift-v3-cluster/
This is not spelled out as a separate step post-installation, but imho there is really no reason the installer cannot properly configure dnsmasq on the master.
/cc @rrati @jayunit100 @jeremyeder @mattf