Closed by @oybed 8 years ago
@oybed the 3.0 install looks to have completed successfully; however, the 3.1 install failed:
TASK: [openshift_node | Start and enable node] ********************************
failed: [node2.ose.example.com] => {"failed": true}
msg: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.
failed: [master.ose.example.com] => {"failed": true}
msg: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.
failed: [node3.ose.example.com] => {"failed": true}
msg: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.
FATAL: all hosts have already failed -- aborting
PLAY RECAP ********************************************************************
to retry, use: --limit @/root/config.retry
localhost : ok=14 changed=0 unreachable=0 failed=0
master.ose.example.com : ok=328 changed=91 unreachable=0 failed=1
node2.ose.example.com : ok=47 changed=18 unreachable=0 failed=1
node3.ose.example.com : ok=47 changed=18 unreachable=0 failed=1
ERROR: Ansible install had failures
Master:
- Hostname: master.ose.example.com
- IPs: 172.16.252.56|10.3.8.170
Nodes:
- Node
-- Hostname: master.ose.example.com
-- IPs: 172.16.252.56|10.3.8.170
- Node
-- Hostname: node2.ose.example.com
-- IPs: 172.16.252.67|10.3.10.223
- Node
-- Hostname: node3.ose.example.com
-- IPs: 172.16.252.66|10.3.10.179
DNS Server: node3.ose.example.com
Environment ID: testenv-4ppEeJcP
[root@737e0701041f provisioning]# ssh root@10.3.8.170
Last login: Mon Nov 30 21:09:50 2015 from node1.ose.example.com
[root@master ~]# oc get nodes
[root@master ~]# systemctl status atomic-openshift-node.service
● atomic-openshift-node.service - Atomic OpenShift Node
Loaded: loaded (/usr/lib/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/atomic-openshift-node.service.d
└─openshift-sdn-ovs.conf
Active: failed (Result: start-limit) since Mon 2015-11-30 21:09:54 EST; 7min ago
Docs: https://github.com/openshift/origin
Process: 15484 ExecStart=/usr/bin/openshift start node --config=${CONFIG_FILE} $OPTIONS (code=exited, status=255)
Main PID: 15484 (code=exited, status=255)
Nov 30 21:09:54 master.ose.example.com systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Nov 30 21:09:54 master.ose.example.com systemd[1]: Failed to start Atomic OpenShift Node.
Nov 30 21:09:54 master.ose.example.com systemd[1]: Unit atomic-openshift-node.service entered failed state.
Nov 30 21:09:54 master.ose.example.com systemd[1]: atomic-openshift-node.service failed.
Nov 30 21:09:54 master.ose.example.com systemd[1]: atomic-openshift-node.service holdoff time over, scheduling restart.
Nov 30 21:09:54 master.ose.example.com systemd[1]: start request repeated too quickly for atomic-openshift-node.service
Nov 30 21:09:54 master.ose.example.com systemd[1]: Failed to start Atomic OpenShift Node.
Nov 30 21:09:54 master.ose.example.com systemd[1]: Unit atomic-openshift-node.service entered failed state.
Nov 30 21:09:54 master.ose.example.com systemd[1]: atomic-openshift-node.service failed.
@etsauer it appears that this is a problem with the OSE installer, and not necessarily our tools. Looking a bit closer at the problem, the output from journalctl -xe shows:
Dec 01 02:12:24 node3.ose.example.com atomic-openshift-node[21429]: Invalid NodeConfig /etc/origin/node/node-config.yaml
Dec 01 02:12:24 node3.ose.example.com atomic-openshift-node[21429]: dnsIP: invalid value '{{.spec.clusterIP}}', Details: must be a valid IP
The "dnsIP" entry for a successful install contains the ip of the "kubernetes" service, e.g.:
[root@master ~]# grep dnsIP /etc/origin/node/node-config.yaml
dnsIP: 172.30.0.1
[root@master ~]# oc get services | grep '172.30.0.1'
kubernetes 172.30.0.1 <none> 443/TCP,53/UDP,53/TCP <none> 20d
As this isn't something we control from our tools (the installer appears to write the '{{.spec.clusterIP}}' template literally instead of resolving it to the kubernetes service IP), I suspect it's a bug in the actual OSE installer, maybe "openshift-ansible". Will have to investigate more to get to the bottom of it, but thought I'd share this info for now.
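For reference, a possible manual workaround on an affected node (just a sketch, assuming the paths from the logs above; 172.30.0.1 is the service IP from the successful-install example, so verify it with "oc get services" in your own environment):

# Replace the unresolved template value with the kubernetes service IP, then restart the node
sed -i 's|^dnsIP:.*|dnsIP: 172.30.0.1|' /etc/origin/node/node-config.yaml
systemctl restart atomic-openshift-node.service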
@oybed yes, I saw the same thing. I'll try pulling down the upstream for the installer and see if it's fixed there.
@oybed turns out we're still downloading and running the bleeding edge installer from github rather than the packaged one.
https://github.com/rhtconsulting/rhc-ose/blob/ose3_1/provisioning/osc-install#L195-L197
In those lines we need a case statement on the target version, something like the following (the OSE_VERSION variable name is just illustrative):
case "${OSE_VERSION}" in
  3.0*)
    # keep the existing behavior: download and run the installer from GitHub
    ;;
  3.1*)
    ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml
    ;;
esac
@etsauer updated code to use the correct ansible playbooks based on version - please check it.
@oybed the htpasswd_auth template still has a reference to /etc/openshift/master, causing auth to fail.
https://github.com/rhtconsulting/rhc-ose/blob/ose3_1/provisioning/templates/htpasswd_auth
We'll probably want to do the same version-based split of that template as we do for the ansible_hosts template.
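Something like this could work for the split (illustrative only; the 3.1 filename is hypothetical, and the real change is just the move from /etc/openshift to /etc/origin in 3.1):

# Keep the existing template for 3.0 and generate a 3.1 variant pointing at the new config path
sed 's|/etc/openshift/master|/etc/origin/master|g' templates/htpasswd_auth > templates/htpasswd_auth_3.1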
@etsauer that functionality used to be part of this PR - let me check what happened and why it's no longer here.
@etsauer ok, fixed the htpasswd aspect. Give it another try - hopefully this is it. I just re-ran all tests and everything seems to be working as expected.
@oybed it works!
What does this PR do?
Changes made to the rhc-ose tools to allow for installation of OSEv3.1. It also allows choosing which version to install, i.e. 3.0.x or 3.1.x.
Please note the following:
How should this be manually tested?
Several steps to test - from a high level:
NOTE: There is most likely more testing needed than what's outlined here, but these are the basic steps that should be executed.
To generate new images, run:
Promote the images to become the new base images:
Next, use the newly created images to deploy new environments
In the newly created environments, among other things, validate:
Is there a relevant Issue open for this?
N/A - however, this takes care of issue #71
Who would you like to review this?
/cc @etsauer