ansible code should not by default change hostnames

cgwalters commented 9 years ago

I think we should require working DNS and hostnames before the ansible run sets up. In particular, it was really, super extra double plus confusing to me that it was the ansible run that was resetting the hostnames I had carefully set.

The reason that was happening is because cloud-init drops the metadata it retrieves in /var/lib/cloud, and Ansible was somehow pulling this out and re-setting the hostname even though I'd told cloud-init to preserve my hostname.

diff --git a/roles/openshift_common/tasks/main.yml b/roles/openshift_common/tasks/main.yml
index a7c5650..97e0f09 100644
--- a/roles/openshift_common/tasks/main.yml
+++ b/roles/openshift_common/tasks/main.yml
@@ -12,6 +12,3 @@
       use_openshift_sdn: "{{ openshift_use_openshift_sdn | default(None) }}"
       sdn_network_plugin_name: "{{ os_sdn_network_plugin_name | default(None) }}"
       deployment_type: "{{ openshift_deployment_type }}"
-
-- name: Set hostname
-  hostname: name={{ openshift.common.hostname }}

?

cgwalters commented 9 years ago

Or alternatively, we could validate here and have a hard error?

detiber commented 9 years ago

@cgwalters The main reason that we override the detected hostname is that we want to support installations with the minimal number of bootstrapping steps. Requiring all users to manually set valid hostnames in advance goes against that.

We provide the openshift_hostname and openshift_public_hostname variables to override the detected defaults.

Alternatively we could also add a flag to force the use of the system hostname, but I would still want it to default to using the detected defaults, since we've had issues in the past where some components defaulted to using hostname -f to determine the system hostname without providing a configuration option to override (In particular we've had issues with node registration and also with SDN configuration in the past because of this).

cgwalters commented 9 years ago

The really nefarious thing here is how "detected" is being pulled out of the cloud-init metadata by Ansible, even though I told cloud-init not to set the hostname. It defeats all documentation for how to configure the hostname.

Can we break out some of the "bootstrapping"/"preparation" as a separate playbook?

A good thing to do in a bootstrap playbook would be to ensure things like Docker storage are sane.

detiber commented 9 years ago

@cgwalters I don't see an issue with breaking the bootstrapping steps out as a separate playbook, but I don't expect I'll get a chance to get to it before other issues/features are dealt with first.

Another thing to note is that as the installer wrapper is developed more, we expect more installations to use that as opposed to the ansible playbooks directly. With the installer wrapper, the user is presented with the detected hostnames before proceeding further with the install.

If there is a reliable way for us to detect when the user configures cloud-init to not set the hostname, then we can have the default values take that into account as well. That would be a quicker, easier fix than refactoring the playbooks/roles to separate out the bootstrapping steps.
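One possible signal is cloud-init's own `preserve_hostname` setting. The following is only a rough sketch, not what openshift-ansible does: the function name is hypothetical, it scans cloud-config text with a regular expression instead of a real YAML parser, and it assumes the caller has already read the relevant config (for example `/etc/cloud/cloud.cfg` and any drop-in files, concatenated in load order). It would not see a value supplied only through user-data.

```python
import re

# Matches a top-level "preserve_hostname: <value>" line in cloud-config text.
# Indented occurrences are ignored on purpose, since the key is top-level.
_PRESERVE_RE = re.compile(r"^preserve_hostname\s*:\s*(\S+)", re.MULTILINE)


def cloud_init_preserves_hostname(config_text):
    """Return True if the last preserve_hostname setting in the text is truthy.

    Hypothetical helper: later settings are assumed to win over earlier ones,
    mirroring how later config files override earlier ones.
    """
    values = _PRESERVE_RE.findall(config_text)
    if not values:
        return False
    return values[-1].lower() in ("true", "yes", "on")
```

If this returned True, the detected defaults could fall back to the system hostname instead of the metadata value.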

The whole goal of the detected defaults is to do the right thing in the majority of cases, and any way we can improve that is welcome. From my experience helping maintain the v2 installer, the more bootstrapping steps there are outside of the installer, the more trouble people have getting up and going. As much as we'd like our end users to carefully read the docs before getting started, most just expect to run an installer and go with it.

All of that said, I do see that the v3 docs do not mention the openshift_hostname, openshift_public_hostname, openshift_ip and openshift_public_ip variables as part of the Advanced Installation (https://docs.openshift.com/enterprise/3.0/admin_guide/install/advanced_install.html). I created https://bugzilla.redhat.com/show_bug.cgi?id=1245023 for tracking that issue.

akostadinov commented 9 years ago

Sounds strange to me that we take info from cloud-init to set the hostname. If cloud-init was instructed to change the hostname, the hostname would already be changed. On the other hand, if the hostname is not what cloud-init thinks it should be, then cloud-init was instructed NOT to change the hostname. In both cases, having Ansible set the hostname based on cloud-init metadata is either useless or undesired/harmful. Am I missing something?

The issue is that most OpenStack installations, it seems, do not have a facility to set proper hostnames. Perhaps the approach works on public clouds but not so much on private clouds.

detiber commented 9 years ago

@akostadinov It is not just about setting hostnames, it is about detecting what the hostnames used within the installation should be as well. Our goal is to limit the number of bootstrapping steps needed to go from 0 to OpenShift with as little effort as possible. If we require the user to set hostnames on each of their boxes to do that, then we are adding an additional bootstrapping step for little gain (especially since in a cloud environment we need to differentiate internal and externally used hostnames for a proper installation).

As far as I'm aware, the OpenStack problems that we frequently see come down to one of two causes: 1) misconfiguration of the OpenStack cloud that produces bogus metadata 2) nova-network based installations that provide internal hostnames that are only resolvable when guests are on the same hypervisor host.

I'm more than happy to change the approach we take for detecting (and setting) hostnames, but I want to make sure that any changes we make do not break things that are working for current users in properly configured environments.

The current approach for detection is as follows:

1) If the user passes values for the hostname/ip address variables for hosts, honor them.
2) If the host is in a cloud, attempt to read the metadata and test the provided values before using them.
3) If the host is not in a cloud (vm, baremetal, etc) or unknown cloud attempt to use the values detected by ansible (either doing a local hostname lookup or trying to lookup the fqdn of the default interface).

Several values are tested to avoid a bare hostname without a domain or using localhost.localdomain.
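The precedence described above can be sketched roughly as follows. This is an illustration only, not the code in openshift_facts.py: the function names are hypothetical, the validity test is assumed to be just "has a domain and is not a localhost name", and falling through to the system candidates when the cloud metadata fails that test is also an assumption.

```python
def is_usable_hostname(name):
    """Reject empty names, bare hostnames without a domain, and localhost names."""
    if not name:
        return False
    name = name.rstrip(".").lower()
    return "." in name and not name.startswith("localhost.")


def choose_hostname(user_value=None, cloud_metadata_value=None, system_candidates=()):
    """Pick a hostname using the three-step precedence from the comment above."""
    # 1) An explicit user override (e.g. openshift_hostname) is honored as-is.
    if user_value:
        return user_value
    # 2) Cloud metadata is used only if it passes the validity test.
    if is_usable_hostname(cloud_metadata_value):
        return cloud_metadata_value
    # 3) Otherwise try the values detected on the system, in order.
    for candidate in system_candidates:
        if is_usable_hostname(candidate):
            return candidate
    return None
```

Returning None when nothing passes leaves the caller free to either stop with an error or ask the user for an explicit value.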

Then during installation, we set the local hostname to match the openshift_internal_hostname variable if set or the detected hostname. Previously, this step was needed because of issues when the system hostname didn't match the configured hostname for OpenShift. I'm unsure if this is still the case, tagging @smarterclayton to see if he knows off the top of his head.

If there is a better method that I can use for detecting when I can and can't trust the OpenStack metadata, I'll be happy to implement it, otherwise we've provided documentation on how to test the detected values and override them.

This is going to become even trickier once we start providing some of the future features for the installer, such as provisioning cloud systems to use for installation, but should be mitigated quite a bit by the installer wrapper as it gains features (since it will provide feedback about the detected values before an installation can proceed).

smarterclayton commented 9 years ago

OpenShift calls uname -n if you don't provide one. If you don't trust the hostname, pass one.

detiber commented 9 years ago

@smarterclayton We pass in the nodeName for the node config, but I don't see a similar setting for the master (other than corsOrigins and the names we add to the certificates). Should that be sufficient to avoid any usage of uname?

detiber commented 9 years ago

If so, that will save us from having to set the hostname on the system. It will not help with the issue of detecting hostnames in poorly configured OpenStack deployments (or those using nova networking).

akostadinov commented 9 years ago

If the cloud-init package is installed, it should already have set the hostname based on the cloud metadata. So I think we can assume the hostname is already correct when running in any kind of cloud with cloud-init installed. We can't know better than cloud-init. Perhaps we can change it only if that hostname is invalid.

uname and other heuristics should only be used when running without cloud-init. I'd say, first check whether the hostname is correct and points at the machine, then try a reverse lookup, and as a last resort use uname, etc.
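A minimal sketch of that fallback order, assuming the caller already knows the machine's own IP addresses. The function names are hypothetical and the lookups are passed in as parameters so they can be swapped out; the defaults use the standard `socket` and `os` modules.

```python
import os
import socket


def _resolve(name):
    """Forward lookup; returns an IPv4 address string or None on failure."""
    try:
        return socket.gethostbyname(name)
    except OSError:
        return None


def _reverse(ip):
    """Reverse lookup; returns the primary name for the address or None."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def pick_hostname(current, local_ips, resolve=_resolve, reverse=_reverse,
                  uname=lambda: os.uname().nodename):
    """Apply the order: current hostname, then reverse lookup, then uname."""
    # 1) Keep the current hostname if it resolves to one of our own addresses.
    if resolve(current) in local_ips:
        return current
    # 2) Otherwise try a reverse lookup on each local address.
    for ip in local_ips:
        name = reverse(ip)
        if name:
            return name
    # 3) Last resort: whatever uname reports as the node name.
    return uname()
```

For example, `pick_hostname(socket.gethostname(), ["10.0.0.5"])` would keep the current name only if DNS maps it back to 10.0.0.5 (an address made up for illustration).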

openshift_hostname and openshift_public_hostname are a useful workaround and we're using them currently.

detiber commented 9 years ago

@akostadinov it's not as straightforward as that in cloud environments... the hostname should be set to the internal hostname for the instance, we would still need to consult the metadata for the public hostname.

Even if we did not set the hostname (which I believe we may no longer need to do, but would require some regression testing by QA before we could officially remove it) and used the system hostname, if the openstack instance is providing bogus metadata, you will still end up with a broken installation, since we would then be using a bogus value for the external hostname of the instance. This type of scenario may work for testing/development where only the internal hostnames are being used, but it would most definitely not work for a customer deployment of OpenShift.

Also, if we are hitting any of the places that OpenShift internally calls uname with a system configured by openshift-ansible, that should definitely be considered a bug, since we should be providing overriding values in the associated configs explicitly.

For computing the hostname values used within openshift-ansible we use the roles/openshift_facts/library/openshift_facts.py module, which queries the host to determine whether it is running in GCE, AWS or OpenStack and, if so, queries the cloud metadata to set the default hostname and IP address values. Only if the host isn't detected as a cloud deployment (or the metadata is unavailable) do we fall back to the Ansible-provided Python facts, which use either the system hostname (through uname or hostnamectl, I can't remember off the top of my head) or attempt to reverse resolve the IP address associated with the default route on the machine.

akostadinov commented 9 years ago

I see, it's a mess..

I can't understand why AWS sets internal hostname. Only complicates things... but that's unrelated. I'm guessing that for AWS your algorithm is correct and does not need changes.

But I think it is incorrect to set as hostname the name that openstack instance has. AFAIK that name is meant to be user readable name, not a hostname. I'm not sure who to blame about that - openstack limitations, cloud-init or openshift-ansible. But in any case setting the instance name as seen in OpenStack web UI does not make sense to me. At least not with the 3-4 OpenStack instances we have running internally.

If we can detect situations where the user has explicitly configured the hostname, that should be respected. Otherwise we can fall back to guesswork.

detiber commented 9 years ago

@akostadinov I completely agree that it is a complete mess. I've created https://bugzilla.redhat.com/show_bug.cgi?id=1275395 to track at least removing the setting of the hostname, since I don't believe that workaround is needed anymore. I suspect it'll be targeted post 3.1/1.1 release though, since I would like to avoid introducing any major regressions if we hit an edge case where it's still needed.

From doing a quick bit of research, it looks like the OpenStack metadata story is an even bigger mess than I had previously thought, with some previous versions using neutron networking providing the wrong internal hostnames. I'll create a trello card to track this work as well: https://trello.com/c/J78fjNy5

tbielawa commented 7 years ago

Errata has been released for this issue.

This issue has been inactive for quite some time. Please update and reopen this issue if this is still a priority you would like to see action on.

openshift / openshift-ansible

ansible code should not by default change hostnames #377