rhtconsulting / rhc-ose

OpenShift Automation and Utilities by Red Hat Consulting

Introducing support for OSEv3.1 to the deployment tools #80

Closed oybed closed 8 years ago

oybed commented 8 years ago

What does this PR do?

This PR changes the rhc-ose tools to allow for installation of OSEv3.1, and adds the ability to choose which version to install, i.e. 3.0.x or 3.1.x.

Please note the following:

Several steps to test - from a high level:

  1. Generate new base images for both 3.0 and 3.1
  2. Use images from step 1 to install OSEv3.0 and a second env for OSEv3.1

NOTE: More testing than what's outlined here is most likely needed, but these are the basic steps that should be executed.

To generate new images, run:

cd provisioning/openstack
./create_base_image -k=<OS1_key> -v=3.0 # to generate a base image for OSEv3.0
./create_base_image -k=<OS1_key> -v=3.1 # to generate a base image for OSEv3.1

Promote the images to become the new base images:

cd provisioning
./osc-manage-image --action=promote --image-name=<3.0_name_from_prev_step> --ose-version=3.0 
./osc-manage-image --action=promote --image-name=<3.1_name_from_prev_step> --ose-version=3.1 

Next, use the newly created images to deploy new environments:

cd provisioning
./osc-provision --num-nodes=2 --key=<OS1_key> # this will deploy the default (3.0)
./osc-provision --num-nodes=2 --key=<OS1_key> --ose-version=3.1 # this will deploy OSEv3.1

In the newly created environments, among other things, validate:

N/A - however, this takes care of issue #71
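
In practice, the reviewers below check the new environments with commands like these (summarized here as a sketch; the oc client is the one seen later in this thread, and the service unit name is the 3.1 one - 3.0 uses different names):

# run on the master of each newly provisioned environment
oc get nodes                                     # all nodes should be listed and report Ready
systemctl status atomic-openshift-node.service   # node service should be active (3.1 unit name)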

Who would you like to review this?

/cc @etsauer

etsauer commented 8 years ago

@oybed the 3.0 install looks to have completed successfully; however, the 3.1 install failed:

TASK: [openshift_node | Start and enable node] ******************************** 
failed: [node2.ose.example.com] => {"failed": true}
msg: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.

failed: [master.ose.example.com] => {"failed": true}
msg: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.

failed: [node3.ose.example.com] => {"failed": true}
msg: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.

FATAL: all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/root/config.retry

localhost                  : ok=14   changed=0    unreachable=0    failed=0   
master.ose.example.com     : ok=328  changed=91   unreachable=0    failed=1   
node2.ose.example.com      : ok=47   changed=18   unreachable=0    failed=1   
node3.ose.example.com      : ok=47   changed=18   unreachable=0    failed=1   

ERROR: Ansible install had failures

Master:
  - Hostname: master.ose.example.com
  - IPs: 172.16.252.56|10.3.8.170
Nodes: 
  - Node
  -- Hostname: master.ose.example.com
  -- IPs: 172.16.252.56|10.3.8.170
  - Node
  -- Hostname: node2.ose.example.com
  -- IPs: 172.16.252.67|10.3.10.223
  - Node
  -- Hostname: node3.ose.example.com
  -- IPs: 172.16.252.66|10.3.10.179
DNS Server: node3.ose.example.com
Environment ID: testenv-4ppEeJcP
[root@737e0701041f provisioning]# ssh root@10.3.8.170
Last login: Mon Nov 30 21:09:50 2015 from node1.ose.example.com
[root@master ~]# oc get nodes
[root@master ~]# systemctl status atomic-openshift-node.service
● atomic-openshift-node.service - Atomic OpenShift Node
   Loaded: loaded (/usr/lib/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/atomic-openshift-node.service.d
           └─openshift-sdn-ovs.conf
   Active: failed (Result: start-limit) since Mon 2015-11-30 21:09:54 EST; 7min ago
     Docs: https://github.com/openshift/origin
  Process: 15484 ExecStart=/usr/bin/openshift start node --config=${CONFIG_FILE} $OPTIONS (code=exited, status=255)
 Main PID: 15484 (code=exited, status=255)

Nov 30 21:09:54 master.ose.example.com systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Nov 30 21:09:54 master.ose.example.com systemd[1]: Failed to start Atomic OpenShift Node.
Nov 30 21:09:54 master.ose.example.com systemd[1]: Unit atomic-openshift-node.service entered failed state.
Nov 30 21:09:54 master.ose.example.com systemd[1]: atomic-openshift-node.service failed.
Nov 30 21:09:54 master.ose.example.com systemd[1]: atomic-openshift-node.service holdoff time over, scheduling restart.
Nov 30 21:09:54 master.ose.example.com systemd[1]: start request repeated too quickly for atomic-openshift-node.service
Nov 30 21:09:54 master.ose.example.com systemd[1]: Failed to start Atomic OpenShift Node.
Nov 30 21:09:54 master.ose.example.com systemd[1]: Unit atomic-openshift-node.service entered failed state.
Nov 30 21:09:54 master.ose.example.com systemd[1]: atomic-openshift-node.service failed.
oybed commented 8 years ago

@etsauer it appears that this is a problem with the OSE installer, and not necessarily our tools. Looking a bit closer at the problem, the output from journalctl -xe shows:

Dec 01 02:12:24 node3.ose.example.com atomic-openshift-node[21429]: Invalid NodeConfig /etc/origin/node/node-config.yaml
Dec 01 02:12:24 node3.ose.example.com atomic-openshift-node[21429]: dnsIP: invalid value '{{.spec.clusterIP}}', Details: must be a valid IP

The "dnsIP" entry for a successful install contains the ip of the "kubernetes" service, e.g.:

[root@master ~]# grep dnsIP /etc/origin/node/node-config.yaml 
dnsIP: 172.30.0.1
[root@master ~]# oc get services | grep '172.30.0.1'
kubernetes        172.30.0.1      <none>        443/TCP,53/UDP,53/TCP   <none>                                               20d

As this isn't something we control from our tools, I suspect it's a bug in the actual OSE installer, maybe "openshift-ansible". Will have to investigate more to get to the bottom of it, but thought I'd share this info for now.
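
As a side note, a manual workaround on an affected host could look roughly like this (purely illustrative, based on the paths and output shown above; the real fix belongs in the installer):

# hypothetical manual workaround for the bad dnsIP value (3.1 paths as shown above)
KUBE_SVC_IP=$(oc get services | awk '/^kubernetes /{print $2}')   # clusterIP of the kubernetes service
sed -i "s|^dnsIP:.*|dnsIP: ${KUBE_SVC_IP}|" /etc/origin/node/node-config.yaml
systemctl restart atomic-openshift-node.service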

etsauer commented 8 years ago

@oybed yes, I saw the same thing. I'll try pulling down the upstream for the installer and see if it's fixed there.

etsauer commented 8 years ago

@oybed turns out we're still downloading and running the bleeding-edge installer from GitHub rather than the packaged one.

https://github.com/rhtconsulting/rhc-ose/blob/ose3_1/provisioning/osc-install#L195-L197

In those lines we need a case statement saying something like...

case "${ose_version}" in   # variable name illustrative
  3.0) ;;   # keep using the existing installer steps
  3.1) ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml ;;
esac
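
(As an aside, /usr/share/ansible/openshift-ansible is the location the packaged openshift-ansible RPMs install to, so the 3.1 branch assumes those packages are present on the host running the install.)
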
oybed commented 8 years ago

@etsauer updated the code to use the correct Ansible playbooks based on version - please check it.

etsauer commented 8 years ago

@oybed the htpasswd_auth template still has a reference to /etc/openshift/master, causing auth to fail.

https://github.com/rhtconsulting/rhc-ose/blob/ose3_1/provisioning/templates/htpasswd_auth

We'll probably want to do the same split of that template as we do for the ansible_hosts template.
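
For illustration, the split could mirror the existing version handling, e.g. (a sketch only; the variable and template file names here are hypothetical, not the repo's actual ones):

# hypothetical per-version template selection, mirroring the ansible_hosts split;
# 3.1 moved the master config dir from /etc/openshift/master to /etc/origin/master
case "${ose_version}" in
  3.0) htpasswd_template="templates/htpasswd_auth-3.0" ;;   # paths under /etc/openshift/master
  3.1) htpasswd_template="templates/htpasswd_auth-3.1" ;;   # paths under /etc/origin/master
esac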

oybed commented 8 years ago

@etsauer that functionality used to be part of this PR - let me check what happened and why it's no longer here.

oybed commented 8 years ago

@etsauer ok, fixed the htpasswd aspect. Give it another try - hopefully this is it. I just re-ran all tests and everything seems to be working as expected.

etsauer commented 8 years ago

@oybed it works!