openshift / openshift-ansible

Install and config an OpenShift 3.x cluster
https://try.openshift.com
Apache License 2.0

99-origin-dns.sh and /etc/resolv.conf #4088

Closed Bak3y closed 4 years ago

Bak3y commented 7 years ago

Description

When using the Advanced Installer for OSCP 3.5, if you don't make NetworkManager aware of your DNS settings, the 99-origin-dns.sh script leaves your system without working DNS, even though you had valid entries in /etc/resolv.conf prior to the install.

Version
ansible 2.2.1.0
atomic-openshift-utils-3.5.60-1.git.0.b6f77a6.el7.noarch
openshift-ansible-3.5.60-1.git.0.b6f77a6.el7.noarch
Steps To Reproduce
  1. Install the base OS and configure network settings via /etc/sysconfig/network-scripts and /etc/resolv.conf, without using NetworkManager (sketched below)
  2. Install OSCP 3.5 using the advanced installer
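
For context, a rough sketch of the kind of NetworkManager-less static configuration described in step 1; the device name, addresses, and search domains are illustrative assumptions, not values from this report:

```sh
# Sketch only: static network configuration managed outside NetworkManager.
# Device name, addresses and domains below are assumptions.
cat > /etc/sysconfig/network-scripts/ifcfg-eth0 <<'EOF'
DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
NM_CONTROLLED=no
IPADDR=192.0.2.10
PREFIX=24
GATEWAY=192.0.2.1
EOF

# resolv.conf dropped in place directly by configuration management,
# which is the pattern this issue is about.
cat > /etc/resolv.conf <<'EOF'
search corp.example.com example.com
nameserver 192.0.2.53
nameserver 192.0.2.54
EOF
```
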
Expected Results

The advanced installer completes, resulting in a working OSCP 3.5 install.
Observed Results

Right after the dnsmasq sections of the playbook are run and the script 99-origin-dns.sh runs, further steps fail due to DNS lookup failures.
Bak3y commented 7 years ago

I also have an RFE open with Red Hat to have this become aware of /etc/resolv.conf rather than relying on NetworkManager alone for DNS settings. CASE 01842852

sdodson commented 7 years ago

@Bak3y Exactly how are you configuring your dns servers?

You shouldn't be setting them in /etc/resolv.conf directly; all other methods should convey the DNS server configuration to our dispatcher script, though we do require that NetworkManager be enabled.

Bak3y commented 7 years ago

Can you expand on your "you shouldn't be setting them in /etc/resolv.conf directly" comment? We're largely using Ansible for server configs these days, and yes, we're dropping a preconfigured resolv.conf directly onto servers. It works 100% fine in every other scenario but this one. I just think it'd be super simple to check /etc/resolv.conf in addition to NetworkManager here rather than only checking one place.

sdodson commented 7 years ago

Rather than editing /etc/resolv.conf, can you try setting DNS1 and DNS2 values in /etc/sysconfig/network-scripts/ifcfg-eth0 (or whatever your device is)?

The network configuration guide for RHEL 7 never mentions editing /etc/resolv.conf as a valid way to configure name servers. The two options it gives are NetworkManager or sysconfig scripts, so that's what we've targeted. And since we want to update the dnsmasq configuration if/when new DNS servers are set via DHCP, we've implemented this via a NetworkManager dispatcher script, which notifies us of changes.
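
For reference, a rough sketch of what that sysconfig change might look like; the device name and server addresses are assumptions:

```sh
# Sketch only: carry the DNS servers in the interface's sysconfig file
# (device name and addresses are assumptions).
cat >> /etc/sysconfig/network-scripts/ifcfg-eth0 <<'EOF'
DNS1=192.0.2.53
DNS2=192.0.2.54
EOF

# Restart NetworkManager so it re-reads the profile and the dispatcher
# script (99-origin-dns.sh) sees the new servers.
systemctl restart NetworkManager
```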

Bak3y commented 7 years ago

We've always configured resolv.conf, going back to RHEL6. Since it continued to work as we transitioned to 7, we didn't see a need to change a bunch of Ansible playbooks to account for anything different.

Further, we have more than just nameserver lines in our resolv.conf; we also have search domains set, which I see no way to account for via sysconfig scripts.

I can see why you'd want to stick to Red Hat's recommendations here, but the script still leaves the boxes it touches in an unusable state if it doesn't find a valid DNS server in NetworkManager, which is less than ideal. I'd rather the playbook error out on a 'NULL' value when querying for this than take that NULL and shove it into the dnsmasq config.

sdodson commented 7 years ago

I agree, we shouldn't leave /etc/resolv.conf in a bad state no matter how a host is configured.

Does it work when setting those values though?

Bak3y commented 7 years ago

We got around it by using nmcli to inform NetworkManager of the proper DNS settings, yes.
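
For anyone hitting the same thing, a rough nmcli sketch of that workaround; the connection name, servers, and search domains are assumptions:

```sh
# Sketch only: make NetworkManager aware of the DNS servers and search
# domains so the dispatcher script can pick them up.
# Connection name, addresses and domains below are assumptions.
nmcli connection modify eth0 \
  ipv4.dns "192.0.2.53 192.0.2.54" \
  ipv4.dns-search "corp.example.com example.com"

# Re-activate the connection so the change takes effect and the
# dispatcher script runs again.
nmcli connection up eth0
```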

dmytroleonenko commented 7 years ago

I wonder why you use this 99-origin-dns.sh and /etc/dnsmasq.d/* instead of setting dns=dnsmasq in the [main] section of /etc/NetworkManager/NetworkManager.conf, adding items to /etc/NetworkManager/dnsmasq.d/, and using an iptables REDIRECT to 127.0.0.1:53. What are the cons of that approach? It looks less intrusive than 99-origin-dns.sh.
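
A rough sketch of the alternative being proposed, assuming a drop-in under /etc/NetworkManager/conf.d/ is equivalent to editing the [main] section of NetworkManager.conf; the forwarder address is a placeholder:

```sh
# Sketch only: let NetworkManager manage its own dnsmasq instance instead
# of the dispatcher-script approach.

# Equivalent to adding dns=dnsmasq to the [main] section of
# /etc/NetworkManager/NetworkManager.conf.
cat > /etc/NetworkManager/conf.d/dns.conf <<'EOF'
[main]
dns=dnsmasq
EOF

# Cluster-specific forwarding rules go into NetworkManager's own dnsmasq
# drop-in directory; the address below is a placeholder.
cat > /etc/NetworkManager/dnsmasq.d/origin-upstream.conf <<'EOF'
server=/cluster.local/192.0.2.100
EOF

systemctl restart NetworkManager
```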

sdodson commented 7 years ago

@dmytroleonenko see https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_node_dnsmasq/files/networkmanager/99-origin-dns.sh#L4-L9

Hmm, not sure about iptables, can you show me what that rule would look like?

dmytroleonenko commented 7 years ago

Sure, I've seen that. I still can't see the difference between that and dns=dnsmasq plus an iptables rule like `iptables -t nat -A PREROUTING -s ..../16 -d ho.st.ip.ad -p udp --dport 53 -j DNAT --to-destination 127.0.0.1:53`. A `sysctl -w net.ipv4.conf.eth0.route_localnet=1` would also be required. The iptables rule can of course be refined to include the interface and other conditionals.
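
Laid out as a sketch, with the pod CIDR and node IP as placeholders (as in the rule above):

```sh
# Sketch only: DNAT pod traffic aimed at the node's port 53 onto the
# NetworkManager-managed dnsmasq listening on 127.0.0.1.
# POD_CIDR and NODE_IP are placeholders.
POD_CIDR=10.0.0.0/16
NODE_IP=192.0.2.10

# Allow traffic arriving on eth0 to be routed to 127.0.0.1.
sysctl -w net.ipv4.conf.eth0.route_localnet=1

iptables -t nat -A PREROUTING -s "$POD_CIDR" -d "$NODE_IP" \
  -p udp --dport 53 -j DNAT --to-destination 127.0.0.1:53
# A matching rule for -p tcp would likely be needed as well.
```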

dmytroleonenko commented 7 years ago

Any thoughts? The reason I'm asking is that I can't modify the default behavior of dnsmasq by just editing the config created by the script; it will get overridden after all.

sdodson commented 7 years ago

@dcbw any thoughts on using dns=dnsmasq and iptables to let pods reach dnsmasq at 127.0.0.1:53 on the node via DNAT? The only reason we went with this dispatcher script was that dns=dnsmasq only binds to 127.0.0.1.

brenton commented 7 years ago

Gently bumping this...

@dcbw

DanyC97 commented 6 years ago

I'm interested in this answer too, thank you @dcbw

jnach commented 6 years ago

This is still an open issue affecting one of our projects right now. The root cause is that the 99-origin script does not handle existing search domains in /etc/resolv.conf; it overwrites all search domain values with just 'search cluster.local'.
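
A quick illustration of the symptom; the pre-existing domains are assumptions, the resulting line is as described above:

```sh
grep ^search /etc/resolv.conf   # before the dispatcher script runs
# search corp.example.com example.com

grep ^search /etc/resolv.conf   # after 99-origin-dns.sh has run
# search cluster.local
```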

vrutkovs commented 6 years ago

@jnach is it still happening after https://github.com/openshift/openshift-ansible/pull/7103 is merged? Which search domains were removed?

jnach commented 6 years ago

@vrutkovs I was wrong - after patching this file to scrape search domains, I discovered it actually runs many times, and the real root cause is that NetworkManager creates a duplicate interface every time you try to configure a search domain after restarting the NM service. This is reproducible out of the box in my case, on Azure with a vanilla 7.4 instance. The only reliable workaround I've found is to keep adding the search domains to the new interfaces; eventually the behavior stops, but the machine thinks it has 3-4 NICs. These seem to be in-memory; I haven't had time to chase this any further.
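
For anyone debugging the same behaviour, a couple of nmcli commands that can help spot (and remove) duplicate connection profiles; the UUID variable is a placeholder:

```sh
# List connection profiles and the devices they are bound to; duplicates
# show up as several profiles attached to the same device.
nmcli -f NAME,UUID,TYPE,DEVICE connection show

# Delete a duplicate profile by its UUID (placeholder value).
DUPLICATE_UUID=00000000-0000-0000-0000-000000000000
nmcli connection delete uuid "$DUPLICATE_UUID"
```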

danuamirudin commented 4 years ago

I had the same issue and solved it in one of two ways (sketched below):

  1. chmod 000 /etc/resolv.conf == when using root in the Ansible inventory
  2. chmod 600 /etc/resolv.conf == when using an unprivileged user in the Ansible inventory; with this method you must also add ansible_become=true to the inventory
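
Sketched out (the inventory line is hypothetical):

```sh
# Option 1: the Ansible inventory connects as root.
chmod 000 /etc/resolv.conf

# Option 2: the inventory connects as an unprivileged user, which also
# requires ansible_become=true in the inventory
# (hypothetical inventory line shown below).
chmod 600 /etc/resolv.conf
#   node1.example.com ansible_user=deployer ansible_become=true
```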

Hope this helps :)

openshift-bot commented 4 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 4 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

/remove-lifecycle stale

openshift-bot commented 4 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci-robot commented 4 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/openshift-ansible/issues/4088#issuecomment-686786676):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting `/reopen`. Mark the issue as fresh by commenting `/remove-lifecycle rotten`. Exclude this issue from closing again by commenting `/lifecycle frozen`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.