gshipley / installcentos


3.11 deployment issue #107

Open ryannix123 opened 5 years ago

ryannix123 commented 5 years ago

TASK [openshift_control_plane : Wait for all control plane pods to become ready] *****
FAILED - RETRYING: Wait for all control plane pods to become ready (60 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (59 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (58 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (57 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (56 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (55 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (54 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (53 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (52 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (51 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (50 retries left).
ok: [10.0.1.31] => (item=etcd)
FAILED - RETRYING: Wait for all control plane pods to become ready (60 retries left).
ok: [10.0.1.31] => (item=api)
FAILED - RETRYING: Wait for all control plane pods to become ready (60 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (59 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (58 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (57 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (56 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (55 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (54 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (53 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (52 retries left).

TASK [openshift_node_group : Wait for the sync daemonset to become ready and available] **
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (60 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (59 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (58 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (57 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (56 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (55 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (54 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (53 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (52 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (51 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (50 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (49 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (48 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (47 retries left).

marekjelen commented 5 years ago

Any chance you have Ansible 2.7?

fclaudiopalmeira commented 5 years ago

@gshipley, the dreaded "wait for all control plane pods to become ready" error has appeared for me too. When I run "journalctl -flu docker.service" in another SSH session I get:

Oct 21 08:39:44 optung.vm.local oci-umount[59912]: umounthook : prestart container_id:3501626da860 rootfs:/var/lib/docker/overlay2/d1c0efea2c3ec01638c000b736c49744ded80645d6c63c2cc7e77e011fc8fa30/merged
Oct 21 08:39:45 optung.vm.local dockerd-current[43275]: time="2018-10-21T08:39:45.074482088-04:00" level=error msg="containerd: deleting container" error="exit status 1: \"container 3501626da8607e40433476414cc19237102900d1b5e50f2236c0e305eb75a623 does not exist\none or more of the container deletions failed\n\""
Oct 21 08:39:45 optung.vm.local dockerd-current[43275]: time="2018-10-21T08:39:45.082623052-04:00" level=warning msg="3501626da8607e40433476414cc19237102900d1b5e50f2236c0e305eb75a623 cleanup: failed to unmount secrets: invalid argument"

It keeps repeating the block above; the only difference is that the container ID in the level=warning msg="xxx" cleanup line changes each time (where "xxx" is the ID). Also, when it reaches the last retry, it shows the following message before starting all 60 retries again:

failed: [10.84.51.10] (item=etcd) => {"attempts": 60, "changed": false, "item": "etcd", "msg": {"cmd": "/usr/bin/oc get pod master-etcd-optung.vm.local -o json -n kube-system", "results": [{}], "returncode": 1, "stderr": "The connection to the server optung.vm.local:8443 was refused - did you specify the right host or port?\n", "stdout": ""}}

The VM was created with 8 cores (Core i7), 16 GB RAM, and a 300 GB SSD. The Ansible version is the one from the script, and I touched nothing in the scripts. Are you able to help?
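For reference, the "connection to the server ... was refused" part usually just means nothing is listening on 8443 yet, i.e. the master API container never came up. A quick way to confirm that (hostname taken from the log above; the static pod directory is an assumption based on a default 3.11 install, not anything specific to this script) is something like:

    # Is anything answering on the API port at all?
    curl -k https://optung.vm.local:8443/healthz

    # In 3.11 the control plane runs as static pods defined under this directory:
    ls /etc/origin/node/pods/

    # The corresponding containers show up at the docker level:
    docker ps --filter "name=master" --format "{{.Names}}\t{{.Status}}"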

marekjelen commented 5 years ago

Can you check the logs to see whether the system complains about not being able to create certificates?

ryannix123 commented 5 years ago

Looks like it's the correct version, 2.6.5.

Installing : ansible-2.6.5-1.el7.ans.noarch 6/6

fclaudiopalmeira commented 5 years ago

Hey guys, I found my problem: for some reason, during the installation Ansible was being updated to version 2.7, which doesn't make any sense given these two lines in the script:

    curl -o ansible.rpm https://releases.ansible.com/ansible/rpm/release/epel-7-x86_64/ansible-2.6.5-1.el7.ans.noarch.rpm
    yum -y --enablerepo=epel install ansible.rpm

At first I thought I had installed Ansible on the system before running the script, so I went drastic and installed a minimal CentOS 7.5 from scratch... it happened again. What I did to solve it was to add the line "yum remove ansible" before those two install lines, and it is now working as intended. Weird stuff, though. Do any of you happen to know whether OpenContrail/Tungsten Fabric support is officially added in Origin/OKD?
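For anyone else hitting this, a minimal sketch of that workaround (the curl/yum lines are the ones already in the script; the versionlock part is an extra suggestion of mine, not something installcentos does):

    # Remove whatever Ansible is already installed (e.g. a 2.7 build pulled in from EPEL)
    yum -y remove ansible

    # Re-install the pinned 2.6.5 build the script expects
    curl -o ansible.rpm https://releases.ansible.com/ansible/rpm/release/epel-7-x86_64/ansible-2.6.5-1.el7.ans.noarch.rpm
    yum -y --enablerepo=epel install ansible.rpm

    # Optionally lock the version so a later 'yum update' cannot drag in 2.7 again
    # (needs the yum-plugin-versionlock package; this step is my addition, not in the script)
    yum -y install yum-plugin-versionlock
    yum versionlock add ansible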

ryannix123 commented 5 years ago

Post-install, mine is still 2.6.5.

ansible --version
ansible 2.6.5
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Jul 13 2018, 13:06:57) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

ryannix123 commented 5 years ago

I'd happily send the logs, but it seems like the logging location changes with each version of OpenShift, so I'm not sure where to look and Google isn't helping.

gshipley commented 5 years ago

@ryannix123 Why do lumberjacks get frustrated with OpenShift?

Answer: Because they can never find the logs.

Okay, okay - a Dad joke for sure. We are working on the logging situation and much improvement will happen in the 4.0 release.
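In the meantime, for a 3.11 install like this one, the usual places to look are roughly the following (paths and service names are assumptions based on a default install, not anything specific to this repo):

    # Docker daemon and node-level logs are in the journal:
    journalctl -u docker.service --since "1 hour ago"
    journalctl -u origin-node --since "1 hour ago"    # OKD service name; OCP calls it atomic-openshift-node

    # The api / controllers / etcd logs come out of their containers; the
    # master-logs helper wraps that on the master, if the install got that far:
    /usr/local/bin/master-logs api api
    /usr/local/bin/master-logs etcd etcd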

marekjelen commented 5 years ago

@fclaudiopalmeira so far the only cause I have encountered for the control plane failing with these messages is incorrect certificates generated by Ansible 2.7.
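If anyone wants to sanity-check that on their own cluster, the master serving certificate can be inspected directly; the /etc/origin/master path assumes a default 3.11 layout:

    # Expiry dates of the master serving cert:
    openssl x509 -in /etc/origin/master/master.server.crt -noout -dates

    # Hostnames/IPs the cert actually covers; clients will reject the cert
    # if the name they connect with is missing here:
    openssl x509 -in /etc/origin/master/master.server.crt -noout -text | grep -A1 "Subject Alternative Name"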

fclaudiopalmeira commented 5 years ago

@marekjelen My certificates were OK; the Ansible version, however, was not. I'm inclined to believe that whenever you have Ansible 2.7 installed, weird stuff will happen! Luckily I got past that error, and now I'm dealing with another one, related to Git. When I try to create an app I'm getting:

error: fatal: unable to access 'https://github.com/gshipley/simplephp/': The requested URL returned error: 503

That started happening after I set the GIT_SSL_NO_VERIFY=true environment variable (if I don't, it gives me "the Peer's certificate issuer has been marked as not trusted by the user"). So far I've had no luck finding a solution.
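For reference, the way I would expect that variable to be applied to the build (the build config name here is just an example taken from the repo being cloned) is:

    # Put GIT_SSL_NO_VERIFY on the build config so the git clone inside the build uses it:
    oc set env bc/simplephp GIT_SSL_NO_VERIFY=true

    # Re-run the build and watch the clone step:
    oc start-build simplephp --follow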

fclaudiopalmeira commented 5 years ago

Well... no luck at all with this certificate stuff. Can anyone help?

marekjelen commented 5 years ago

@ryannix123 I reran the setup script and all the control plane pods come up just fine. Can you go down to the Docker level (docker ps, docker logs), check which containers are failing, and extract some logs?
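For example, something along these lines (the grep on the k8s_ prefix is an assumption based on the kubelet's usual Docker naming convention):

    # Containers that have exited; the control plane ones are usually prefixed k8s_:
    docker ps -a --filter "status=exited" --format "{{.Names}}\t{{.Status}}" | grep k8s_

    # Grab the tail of a failing container's output:
    docker logs --tail 200 <container-name>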

marekjelen commented 5 years ago

@fclaudiopalmeira can you provide more info on how you are trying to deploy the app?

I have tried to clone the repo on the machine

[screenshot, 2018-10-22 10:28: cloning the repo on the machine]

as well as deploy the app on OpenShift

[screenshot, 2018-10-22 10:27: deploying the app on OpenShift]

and both seem to work ...

fclaudiopalmeira commented 5 years ago

Hey @marekjelen, I was trying to deploy it by following the YouTube video exactly (from the OpenShift dashboard).

marekjelen commented 5 years ago

Hmm, that is the second screenshot, @fclaudiopalmeira, and it worked fine on a cluster I had just provisioned.

javabeanz commented 5 years ago

you can alter the ansible version in the installation script from 2.6.x to 2.7.1.1 as a temporary workaround.
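That is, assuming 2.7.1.1 refers to the 2.7.1-1 build and that releases.ansible.com follows the same filename pattern as the 2.6.5 line already in the script (I have not verified that exact URL), the change would look like:

    curl -o ansible.rpm https://releases.ansible.com/ansible/rpm/release/epel-7-x86_64/ansible-2.7.1-1.el7.ans.noarch.rpm
    yum -y --enablerepo=epel install ansible.rpm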

vrutkovs commented 5 years ago

Please attach the inventory and the output of ansible-playbook -vvv.

The sync daemonset might fail if some nodes haven't applied their configuration, so the output of oc describe nodes would be handy too.
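Something along these lines (the playbook path is the one used elsewhere in this thread; substitute your own inventory file):

    # Re-run the failing playbook with verbose output and capture it:
    ansible-playbook -vvv -i <your-inventory> /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml 2>&1 | tee deploy.log

    # Node state, to see whether the sync daemonset actually applied the node config:
    oc describe nodes > nodes.txt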

choudharirahul commented 5 years ago

I fixed this by doing the following steps (a consolidated version is sketched after the list):

  1. yum remove atomic-openshift* (on all nodes)
  2. yum install atomic-openshift* (on all nodes)
  3. mv /etc/origin /etc/origin.old
  4. mv /etc/kubernetes /etc/kubernetes.old
  5. mv ~/.kube/config /tmp/kube_config_backup
  6. ansible-playbook -i /tmp/test /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml
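Putting those steps together as a rough script (commands taken from the list above; /tmp/test is that inventory path, so substitute your own, and note that OKD installs may ship origin-* packages instead of atomic-openshift-*):

    # On every node: reinstall the OpenShift packages (quote the glob so yum expands it, not the shell)
    yum -y remove "atomic-openshift*"
    yum -y install "atomic-openshift*"

    # Move the old cluster state out of the way (the original steps do not say which hosts;
    # these paths exist on the master at least)
    mv /etc/origin /etc/origin.old
    mv /etc/kubernetes /etc/kubernetes.old
    mv ~/.kube/config /tmp/kube_config_backup

    # Then re-run the deploy playbook against your inventory
    ansible-playbook -i /tmp/test /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml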

Please let me know if that works for you.

choudharirahul commented 5 years ago

If the above steps don't work, then edit /usr/share/ansible/openshift-ansible/roles/openshift_control_plane/tasks/main.yml and replace the corresponding line with the one below:

  - "{{ 'etcd' if inventory_hostname in groups['oo_etcd_to_config'] else omit }}"

sivalanka commented 5 years ago

Still no luck, same issue

rahulchoudhari commented 5 years ago

Can you paste the exact error? And have you tried both ways?

ryannix123 commented 5 years ago

Looks like these deployments are going to radically change in OpenShift 4: https://www.youtube.com/watch?v=-xJIvBpvEeE

dennislabajo commented 5 years ago

"well... no luck at all with this certificate stuff... anyone could help?"

@fclaudiopalmeira - have you found a solution to the certificate issue?