rhtconsulting / rhc-ose

OpenShift Automation and Utilities by Red Hat Consulting

Introducing support for OSEv3.1 to the deployment tools #80

Closed oybed closed 8 years ago

oybed commented 8 years ago

What does this PR do?

This PR changes the rhc-ose tools to allow for installation of OSEv3.1, and adds the ability to choose which version to install, i.e. 3.0.x or 3.1.x.

Please note the following:

Several steps to test - from a high level:

  1. Generate new base images for both 3.0 and 3.1
  2. Use images from step 1 to install OSEv3.0 and a second env for OSEv3.1

NOTE: More testing than what's outlined here is most likely needed, but these are the basic steps that should be executed.

To generate new images, run:

cd provisioning/openstack
./create_base_image -k=<OS1_key> -v=3.0 # to generate a base image for OSEv3.0
./create_base_image -k=<OS1_key> -v=3.1 # to generate a base image for OSEv3.1

Promote the images to become the new base images:

cd provisioning
./osc-manage-image --action=promote --image-name=<3.0_name_from_prev_step> --ose-version=3.0 
./osc-manage-image --action=promote --image-name=<3.1_name_from_prev_step> --ose-version=3.1 

Next, use the newly created images to deploy new environments:

cd provisioning
./osc-provision --num-nodes=2 --key=<OS1_key> # this will deploy the default (3.0)
./osc-provision --num-nodes=2 --key=<OS1_key> --ose-version=3.1 # this will deploy OSEv3.1

In the newly created environments, among other things, validate:

N/A - however, this takes care of issue #71
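
In practice, the reviewers below check the new environments with commands like these (summarized here as a sketch; the oc client is the one seen later in this thread, and the service unit name is the 3.1 one - 3.0 uses different names):

# run on the master of each newly provisioned environment
oc get nodes                                     # all nodes should be listed and report Ready
systemctl status atomic-openshift-node.service   # node service should be active (3.1 unit name)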

Who would you like to review this?

/cc @etsauer

etsauer commented 8 years ago

@oybed the 3.0 install looks to have completed successfully; however, the 3.1 install failed:

TASK: [openshift_node | Start and enable node] ******************************** 
failed: [node2.ose.example.com] => {"failed": true}
msg: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.

failed: [master.ose.example.com] => {"failed": true}
msg: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.

failed: [node3.ose.example.com] => {"failed": true}
msg: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.

FATAL: all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/root/config.retry

localhost                  : ok=14   changed=0    unreachable=0    failed=0   
master.ose.example.com     : ok=328  changed=91   unreachable=0    failed=1   
node2.ose.example.com      : ok=47   changed=18   unreachable=0    failed=1   
node3.ose.example.com      : ok=47   changed=18   unreachable=0    failed=1   

ERROR: Ansible install had failures

Master:
  - Hostname: master.ose.example.com
  - IPs: 172.16.252.56|10.3.8.170
Nodes: 
  - Node
  -- Hostname: master.ose.example.com
  -- IPs: 172.16.252.56|10.3.8.170
  - Node
  -- Hostname: node2.ose.example.com
  -- IPs: 172.16.252.67|10.3.10.223
  - Node
  -- Hostname: node3.ose.example.com
  -- IPs: 172.16.252.66|10.3.10.179
DNS Server: node3.ose.example.com
Environment ID: testenv-4ppEeJcP
[root@737e0701041f provisioning]# ssh root@10.3.8.170
Last login: Mon Nov 30 21:09:50 2015 from node1.ose.example.com
[root@master ~]# oc get nodes
[root@master ~]# systemctl status atomic-openshift-node.service
● atomic-openshift-node.service - Atomic OpenShift Node
   Loaded: loaded (/usr/lib/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/atomic-openshift-node.service.d
           └─openshift-sdn-ovs.conf
   Active: failed (Result: start-limit) since Mon 2015-11-30 21:09:54 EST; 7min ago
     Docs: https://github.com/openshift/origin
  Process: 15484 ExecStart=/usr/bin/openshift start node --config=${CONFIG_FILE} $OPTIONS (code=exited, status=255)
 Main PID: 15484 (code=exited, status=255)

Nov 30 21:09:54 master.ose.example.com systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Nov 30 21:09:54 master.ose.example.com systemd[1]: Failed to start Atomic OpenShift Node.
Nov 30 21:09:54 master.ose.example.com systemd[1]: Unit atomic-openshift-node.service entered failed state.
Nov 30 21:09:54 master.ose.example.com systemd[1]: atomic-openshift-node.service failed.
Nov 30 21:09:54 master.ose.example.com systemd[1]: atomic-openshift-node.service holdoff time over, scheduling restart.
Nov 30 21:09:54 master.ose.example.com systemd[1]: start request repeated too quickly for atomic-openshift-node.service
Nov 30 21:09:54 master.ose.example.com systemd[1]: Failed to start Atomic OpenShift Node.
Nov 30 21:09:54 master.ose.example.com systemd[1]: Unit atomic-openshift-node.service entered failed state.
Nov 30 21:09:54 master.ose.example.com systemd[1]: atomic-openshift-node.service failed.
oybed commented 8 years ago

@etsauer it appears that this is a problem with the OSE installer, and not necessarily our tools. Looking a bit closer at the problem, the output from journalctl -xe shows:

Dec 01 02:12:24 node3.ose.example.com atomic-openshift-node[21429]: Invalid NodeConfig /etc/origin/node/node-config.yaml
Dec 01 02:12:24 node3.ose.example.com atomic-openshift-node[21429]: dnsIP: invalid value '{{.spec.clusterIP}}', Details: must be a valid IP

The "dnsIP" entry for a successful install contains the ip of the "kubernetes" service, e.g.:

[root@master ~]# grep dnsIP /etc/origin/node/node-config.yaml 
dnsIP: 172.30.0.1
[root@master ~]# oc get services | grep '172.30.0.1'
kubernetes        172.30.0.1      <none>        443/TCP,53/UDP,53/TCP   <none>                                               20d

As this isn't something we control from our tools, I suspect it's a bug in the actual OSE installer, maybe "openshift-ansible". Will have to investigate more to get to the bottom of it, but thought I'd share this info for now.
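
As a side note, a manual workaround on an affected host could look roughly like this (purely illustrative, based on the paths and output shown above; the real fix belongs in the installer):

# hypothetical manual workaround for the bad dnsIP value (3.1 paths as shown above)
KUBE_SVC_IP=$(oc get services | awk '/^kubernetes /{print $2}')   # clusterIP of the kubernetes service
sed -i "s|^dnsIP:.*|dnsIP: ${KUBE_SVC_IP}|" /etc/origin/node/node-config.yaml
systemctl restart atomic-openshift-node.service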

etsauer commented 8 years ago

@oybed yes, I saw the same thing. I'll try pulling down the upstream for the installer and see if it's fixed there.

etsauer commented 8 years ago

@oybed turns out we're still downloading and running the bleeding-edge installer from GitHub rather than the packaged one.

https://github.com/rhtconsulting/rhc-ose/blob/ose3_1/provisioning/osc-install#L195-L197

In those lines we need a case statement saying something like...

case "${ose_version}" in   # variable name illustrative
  3.0) ;;   # keep using the existing installer steps
  3.1) ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml ;;
esac
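
(As an aside, /usr/share/ansible/openshift-ansible is the location the packaged openshift-ansible RPMs install to, so the 3.1 branch assumes those packages are present on the host running the install.)
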
oybed commented 8 years ago

@etsauer updated the code to use the correct Ansible playbooks based on version - please check it.

etsauer commented 8 years ago

@oybed the htpasswd_auth template still has a reference to /etc/openshift/master, causing auth to fail.

https://github.com/rhtconsulting/rhc-ose/blob/ose3_1/provisioning/templates/htpasswd_auth

We'll probably want to do the same split of that template as we do for the ansible_hosts template.
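
For illustration, the split could mirror the existing version handling, e.g. (a sketch only; the variable and template file names here are hypothetical, not the repo's actual ones):

# hypothetical per-version template selection, mirroring the ansible_hosts split;
# 3.1 moved the master config dir from /etc/openshift/master to /etc/origin/master
case "${ose_version}" in
  3.0) htpasswd_template="templates/htpasswd_auth-3.0" ;;   # paths under /etc/openshift/master
  3.1) htpasswd_template="templates/htpasswd_auth-3.1" ;;   # paths under /etc/origin/master
esac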

oybed commented 8 years ago

@etsauer that functionality used to be part of this PR - let me check what happened and why it's no longer here.

oybed commented 8 years ago

@etsauer ok, fixed the htpasswd aspect. Give it another try - hopefully this is it. I just re-ran all tests and everything seems to be working as expected.

etsauer commented 8 years ago

@oybed it works!