openshift / openshift-ansible

Install and config an OpenShift 3.x cluster
https://try.openshift.com
Apache License 2.0

Install will fail if `dns=none` in `/etc/NetworkManager/NetworkManager.conf` #9351

Closed: itewk closed this issue 4 years ago

itewk commented 6 years ago

Description

If `/etc/NetworkManager/NetworkManager.conf` has `dns=none` in it and the OpenShift installer is expected to install and configure dnsmasq, the install will fail because `/etc/NetworkManager/dispatcher.d/99-origin-dns.sh` cannot properly pull the nameservers from NetworkManager.
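
For reference, the problematic setting lives in the `[main]` section; a minimal excerpt of such an `/etc/NetworkManager/NetworkManager.conf` might look like this (any other keys are omitted here):

```ini
# /etc/NetworkManager/NetworkManager.conf
[main]
# dns=none tells NetworkManager to stop managing DNS / /etc/resolv.conf,
# which leaves 99-origin-dns.sh without a source for the upstream
# nameservers it needs when setting up dnsmasq.
dns=none
```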

Version
https://github.com/openshift/openshift-ansible/tree/openshift-ansible-3.9.31-1/playbooks
Steps To Reproduce
  1. configure an ifcfg device file for DNS1/DNS2 (a sample is shown after this list)
  2. have dns=none in /etc/NetworkManager/NetworkManager.conf
  3. run installer
  4. be frustrated for 2 days while you try to figure out why the heck dnsmasq can't find the upstream DNS servers
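
A sketch of the ifcfg file from step 1; the interface name and DNS addresses below are placeholders, not values from the original report:

```ini
# /etc/sysconfig/network-scripts/ifcfg-eth0  (interface name and IPs are assumed)
DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
DNS1=192.0.2.10
DNS2=192.0.2.11
```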
Expected Results

It would be great if the ansible playbooks could detect that:

  1. dnsmasq has not been preconfigured per https://docs.openshift.com/container-platform/3.9/install_config/install/prerequisites.html#prereq-dns, and therefore the installer should be configuring dnsmasq
  2. /etc/NetworkManager/NetworkManager.conf contains dns=none

And then if those two things are true, bail EARLY.

Right now the installer runs for about 30 minutes before it gets to the dnsmasq configuration and then attempts to install a package, which fails because DNS is busted. It would be great to detect this condition early and bail before that time is wasted.
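
A minimal sketch of what such an early check could look like as Ansible tasks; the task names, grep pattern, and placement are my assumptions, not the project's actual implementation:

```yaml
# Hypothetical pre-flight tasks; not the real openshift-ansible code.
- name: Detect dns=none in NetworkManager.conf
  command: grep -Eq '^[[:space:]]*dns[[:space:]]*=[[:space:]]*none' /etc/NetworkManager/NetworkManager.conf
  register: nm_dns_none
  changed_when: false
  failed_when: false

- name: Bail early if dnsmasq would be configured but NetworkManager DNS handling is disabled
  fail:
    msg: >-
      dns=none is set in /etc/NetworkManager/NetworkManager.conf, so
      99-origin-dns.sh cannot pull upstream nameservers from NetworkManager.
      Remove dns=none (or preconfigure dnsmasq per the prerequisites docs)
      before re-running the installer.
  when: nm_dns_none.rc == 0
```

Running something like this in the prerequisites playbook would surface the misconfiguration before the long install run begins.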

Observed Results

dnsmasq isn't getting configured correctly because of dns=none in /etc/NetworkManager/NetworkManager.conf, which causes installs to fail with no clear indication of why.

Additional Information

From talking to my fellow OpenShift deployers, a bunch of people have been running into this and spending massive amounts of time debugging it. It would be great to detect the issue automatically and report it back to users, to avoid hours (in my case, days) of troubleshooting.

This comes up because it is common for large, slower-moving organizations to still not be using NetworkManager and to have it disabled as part of their kickstart or VM cloning process.

Maybe the detection should happen during the prerequisites playbook?

JayKayy commented 6 years ago

I have hit this issue a few times. A fix would be very much welcomed.

curt-matthews commented 6 years ago

I've hit this at two customers as well!

clodter commented 6 years ago

I've run into this problem at a customer site, too.

dstockdreher commented 6 years ago

Seen this as well multiple times; it is now part of my standard "don't forget to check this" procedure. It would be helpful for this to be blatantly called out, which would save many hours of re-install attempts in some places. From what I've gathered, NetworkManager needs to be enabled AND the NM_CONTROLLED=yes flag set, not just installed, which is sometimes a point of confusion for inexperienced Linux folks validating prereqs on their own.
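
A quick manual spot check along those lines (the interface name `eth0` is a placeholder):

```sh
# Confirm NetworkManager is running and enabled at boot
systemctl is-active NetworkManager
systemctl is-enabled NetworkManager

# Confirm the interface is NetworkManager-controlled (eth0 is a placeholder).
# On RHEL/CentOS 7 a missing NM_CONTROLLED line typically defaults to yes;
# an explicit NM_CONTROLLED=no takes the device out of NetworkManager's hands.
grep -i NM_CONTROLLED /etc/sysconfig/network-scripts/ifcfg-eth0
```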

jnach commented 5 years ago

Joining the party. I can add that a common troubleshooting step for some enterprise networking teams facing DNS issues is to set 'dns=none' (perhaps when you aren't looking), so it would be great to have an explicit message about this as part of the pre-flight checks.

abessifi commented 5 years ago

Just tested with an OpenShift 3.11.104 fresh install and it seems to be working correctly.

Description:

After installation:

openshift-bot commented 4 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 4 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

openshift-bot commented 4 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci-robot commented 4 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/openshift-ansible/issues/9351#issuecomment-667450814):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting `/reopen`. Mark the issue as fresh by commenting `/remove-lifecycle rotten`. Exclude this issue from closing again by commenting `/lifecycle frozen`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.