Closed · itewk · closed 4 years ago
I have hit this issue a few times. A fix would be very much welcomed.
I've hit this at two customers as well!
I've run into this problem at a customer site, too.
Seen this multiple times as well; it is now part of our standard "don't forget to check this" procedure. It would be helpful for this to be called out explicitly, which would save many hours of re-install attempts in some places. From what I've gathered, NetworkManager needs to be enabled AND the NM_CONTROLLED=yes flag set, not just installed, which is sometimes a point of confusion for inexperienced Linux folks validating the prerequisites on their own. A quick way to check is sketched below.
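For anyone validating this by hand, a quick sanity check along these lines usually surfaces it (ifcfg-eth0 is a placeholder interface name, not something from this issue):

```sh
# Verify NetworkManager is enabled and running
systemctl is-enabled NetworkManager
systemctl is-active NetworkManager

# Verify the interface config is NetworkManager-controlled.
# ifcfg-eth0 is a placeholder; check each interface the cluster nodes use.
# (On RHEL 7 a missing NM_CONTROLLED line defaults to yes.)
grep -i 'NM_CONTROLLED' /etc/sysconfig/network-scripts/ifcfg-eth0
```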
Joining the party, and I can add that a common troubleshooting step for some enterprise networking teams facing DNS issues is to set dns=none (perhaps while you aren't looking). It would be great to have an explicit message about this as part of the pre-flight checks.
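Spotting that setting is a one-liner that could easily live in such a pre-flight check (just a sketch, not the installer's actual logic):

```sh
# A match here means NetworkManager will NOT manage /etc/resolv.conf,
# which breaks the installer's dnsmasq setup.
grep -En '^\s*dns\s*=\s*none' /etc/NetworkManager/NetworkManager.conf
```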
Just tested with a fresh OpenShift 3.11.104 install and it seems to be working correctly.
Description:
- NM_CONTROLLED=yes
- /etc/NetworkManager/NetworkManager.conf contains dns=none
- /etc/resolv.conf is pre-set manually

After installation:
- dnsmasq has the correct configuration (which means that the 99-origin-dns.sh dispatch script was correctly executed by NetworkManager)
- /etc/resolv.conf points to dnsmasq as a nameserver
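For anyone wanting to confirm the same end state on their own install, a few spot checks (paths are the usual OpenShift 3.x locations; adjust for your environment):

```sh
# /etc/resolv.conf should now contain a nameserver line pointing at the
# node's own IP, which is where dnsmasq answers
cat /etc/resolv.conf

# the dispatcher script installed by openshift-ansible should be present
ls -l /etc/NetworkManager/dispatcher.d/99-origin-dns.sh

# dnsmasq itself should be running
systemctl is-active dnsmasq
```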
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten /remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closing this issue.
Description
If /etc/NetworkManager/NetworkManager.conf has dns=none in it and the OpenShift installer is expected to install and configure dnsmasq, the install will fail, because /etc/NetworkManager/dispatcher.d/99-origin-dns.sh cannot properly pull the nameservers from NetworkManager.
Version
Steps To Reproduce
Set dns=none in /etc/NetworkManager/NetworkManager.conf.
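A minimal way to reproduce on a disposable test box, assuming the file already has a [main] section (the sed edit is purely an illustration, not a recommended change):

```sh
# dns=none belongs in the [main] section of NetworkManager.conf.
# GNU sed appends it right after that header, then NetworkManager is reloaded.
sudo sed -i '/^\[main\]/a dns=none' /etc/NetworkManager/NetworkManager.conf
sudo systemctl reload NetworkManager
```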
Expected Results
It would be great if the ansible playbooks could detect that:
- the installer is expected to install and configure dnsmasq, and
- /etc/NetworkManager/NetworkManager.conf contains dns=none
and, if both of those things are true, bail EARLY.
Right now the installer runs for about 30 minutes before it gets to the dnsmasq configuration, then attempts to install a package, which fails because DNS is busted. Detecting this up front and bailing immediately would save that time; a rough sketch of such a check follows.
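A sketch of what such an early check could do, written here as plain shell purely for illustration; presumably the real fix would be an assert/fail task in the prerequisites play rather than this script:

```sh
#!/bin/bash
# Hypothetical pre-flight check: refuse to continue if NetworkManager's
# resolv.conf handling is disabled while dnsmasq configuration is expected.
set -euo pipefail

nm_conf=/etc/NetworkManager/NetworkManager.conf

if ! systemctl is-active --quiet NetworkManager; then
    echo "ERROR: NetworkManager is not running; 99-origin-dns.sh cannot work." >&2
    exit 1
fi

if grep -Eq '^\s*dns\s*=\s*none' "$nm_conf"; then
    echo "ERROR: dns=none is set in $nm_conf; dnsmasq configuration will fail." >&2
    exit 1
fi

echo "DNS pre-flight checks passed."
```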
Observed Results
dnsmasq isn't getting configured correctly because of dns=none in /etc/NetworkManager/NetworkManager.conf, which causes installs to fail with no clear indication as to why. Some checks that help confirm this are listed below.
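A few diagnostics that help confirm this is what happened (unit names and paths are the usual OpenShift 3.x ones; verify against your environment):

```sh
# Is NetworkManager refusing to manage resolv.conf?
grep -n 'dns' /etc/NetworkManager/NetworkManager.conf

# Did the dispatcher script ever run? (grepping the full journal can be slow)
journalctl --no-pager | grep -i '99-origin-dns'

# The dispatcher normally drops the upstream nameservers into a file under
# /etc/dnsmasq.d/; an empty directory here is a strong hint it never ran properly.
ls -l /etc/dnsmasq.d/
systemctl status dnsmasq --no-pager
```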
Additional Information
From talking to my fellow OpenShift deployers, a bunch of people have been running into this issue and spending massive amounts of time debugging it. It would be great to have automated detection of the issue that reports it back to users, to avoid hours (in my case, days) of troubleshooting.
This comes up because it is common for large, slower-moving organizations to still not be using NetworkManager and to have it disabled as part of their kickstart or VM cloning process.
Maybe the detection should happen during the prerequisites playbook?