Open dougbtv opened 6 years ago
Apparently I now have a run that is working. It resulted in:
ok: [openshift-master-3 -> openshift-master-1.nfvpe.site]
TASK [openshift_master_certificates : Create the master server certificate] *****************************************************************************************************************************************************************
Friday 27 July 2018 14:06:38 -0400 (0:00:03.887) 0:22:18.826 ***********
changed: [openshift-master-1 -> openshift-master-1.nfvpe.site] => (item=openshift-master-2)
changed: [openshift-master-1 -> openshift-master-1.nfvpe.site] => (item=openshift-master-3)
TASK [openshift_master_certificates : copy]
[...snip...]
workstation / "where I run ansible" information
$ ansible --version
ansible 2.4.3.0
config file = /tmp/openshift-ansible/ansible.cfg
configured module search path = [u'/home/doug/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python2.7/site-packages/ansible
executable location = /usr/bin/ansible
python version = 2.7.14 (default, Feb 27 2018, 20:43:24) [GCC 7.3.1 20180130 (Red Hat 7.3.1-2)]
$ cat /etc/redhat-release
Fedora release 27 (Twenty Seven)
$ uname -a
Linux yoda 4.15.9-300.fc27.x86_64 #1 SMP Mon Mar 12 17:07:55 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Playbook almost completed -- apparently a different error entirely.
TASK [sa_telemetry : include_role] **********************************************************************************************************************************************************************************************************
Friday 27 July 2018 15:11:48 -0400 (0:00:02.890) 1:27:29.230 ***********
ERROR! The field 'loop' is supposed to be a string type, however the incoming data structure is a <class 'ansible.parsing.yaml.objects.AnsibleSequence'>
The error appears to have been in '/tmp/openshift-ansible/roles/sa_telemetry_prometheus/tasks/main.yml': line 94, column 3, but may
be elsewhere in the file depending on the exact syntax problem.
The offending line appears to be:
# deploy separate instances via the operator
- include_tasks: deploy.yml
^ here
Edit: this is a feature from Ansible version 2.5, according to this issue comment.
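For anyone else hitting this, a pre-flight check along these lines might catch the version mismatch before a 90-minute run does. It is only a sketch: it assumes the `ansible --version` output format shown above (first line `ansible X.Y.Z`), and the function names `parse_ansible_version` and `supports_loop` are made up for illustration.

```python
import re

# `loop` is only understood by Ansible 2.5 and later.
MINIMUM_VERSION = (2, 5)


def parse_ansible_version(version_output):
    """Return the version as a tuple of ints from `ansible --version` output.

    Assumes the first line looks like "ansible 2.4.3.0", as in the
    workstation output above. Other formats raise ValueError.
    """
    first_line = version_output.splitlines()[0]
    match = re.match(r"ansible\s+(\d+(?:\.\d+)*)", first_line)
    if match is None:
        raise ValueError("unrecognized version line: %r" % first_line)
    return tuple(int(part) for part in match.group(1).split("."))


def supports_loop(version_output):
    """True if the reported major.minor version is at least 2.5."""
    return parse_ansible_version(version_output)[:2] >= MINIMUM_VERSION
```

Feeding it the workstation output above (`ansible 2.4.3.0`) would report that `loop` is not supported, which matches the error.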
Attempting runs with Ansible 2.5, but only having... limited success. This might be related to network connectivity on the lab side? Unsure.
At any rate, here's what the errors tend to look like, and they tend to happen during gluster-fs plays...
TASK [openshift_storage_glusterfs : Label GlusterFS nodes] **********************************************************************************************************************************************************************************
Monday 30 July 2018 15:29:43 +0000 (0:00:00.073) 0:27:26.013 ***********
changed: [openshift-master-1] => (item=openshift-node-1)
failed: [openshift-master-1] (item=openshift-node-2) => {"changed": false, "item": "openshift-node-2", "msg": {"cmd": "/usr/local/bin/oc label node openshift-node-2.nfvpe.site glusterfs=storage-host --overwrite", "results": {}, "returncode": 1, "stderr": "Error from server (NotFound): nodes \"openshift-node-2.nfvpe.site\" not found\n", "stdout": ""}}
failed: [openshift-master-1] (item=openshift-node-3) => {"changed": false, "item": "openshift-node-3", "msg": {"cmd": "/usr/local/bin/oc label node openshift-node-3.nfvpe.site glusterfs=storage-host --overwrite", "results": {}, "returncode": 1, "stderr": "Error from server (NotFound): nodes \"openshift-node-3.nfvpe.site\" not found\n", "stdout": ""}}
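One possible reading of the `NotFound` errors is that node-2 and node-3 had not registered with the API server when the label task ran. A quick way to confirm that might be to compare the inventory against what the cluster reports. The sketch below assumes its input is the text of `oc get nodes -o name` with one `kind/name` entry per line; the helper name `missing_nodes` is hypothetical.

```python
def missing_nodes(expected, oc_get_nodes_output):
    """Return the expected node names that the cluster does not list.

    `oc_get_nodes_output` is assumed to be the text printed by
    `oc get nodes -o name`, one "kind/name" entry per line.
    """
    registered = set()
    for line in oc_get_nodes_output.splitlines():
        line = line.strip()
        if line:
            # Keep only the part after the "kind/" prefix, if there is one.
            registered.add(line.split("/", 1)[-1])
    return [name for name in expected if name not in registered]
```

If the result is non-empty, the labeling failure is probably a symptom of the nodes never joining rather than a GlusterFS problem as such.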
Apparently, some VMs sometimes can't reach the internet?
[doug@overcloud-cephstorage-0 openshift-ansible]$ ssh centos@openshift-lb.nfvpe.site
[..snip..]
[centos@openshift-lb ~]$ sudo su
[root@openshift-lb centos]# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
[root@openshift-lb centos]# docker pull docker.io/openshift/origin-haproxy-router
Using default tag: latest
Trying to pull repository docker.io/openshift/origin-haproxy-router ...
^C [just hung...]
[root@openshift-lb centos]# ping 4.2.2.2
PING 4.2.2.2 (4.2.2.2) 56(84) bytes of data.
^C
--- 4.2.2.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms
[root@openshift-lb centos]# ip route
default via 10.19.110.254 dev eth0 proto static metric 100
10.19.110.0/24 dev eth0 proto kernel scope link src 10.19.110.64 metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
[root@openshift-lb centos]#
I'm wondering if there is a problem with MAC tables or something, or ARP cache? This has been working for weeks now, so I'm not really sure what could have changed so dramatically.
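To tell "no route at all" apart from "ICMP is dropped but TCP works", a small TCP probe run from each VM could help. This is a rough sketch, not something the playbooks do; `can_connect` is a hypothetical helper, and the host and port to probe are up to whoever runs it.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout`.

    DNS failures, refused connections and timeouts all count as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running it against the default gateway and then against an external registry on port 443 would show whether the break is on the local segment or further out.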
Would be good to run virtually as well. I'm actually home tomorrow and might be able to find an hour to test this outside the hardware lab.
Quick update: I rebooted that VM and it got WAN access back. However, I was still getting intermittent failures pulling images.
TASK [openshift_master : Pre-pull master image] *********************************************************************************************************************************************************************************************
Monday 30 July 2018 18:01:30 +0000 (0:00:00.629) 0:03:50.774 ***********
ok: [openshift-master-1]
ok: [openshift-master-3]
fatal: [openshift-master-2]: FAILED! => {"changed": false, "cmd": ["docker", "pull", "openshift/origin:v3.9.0"], "delta": "0:00:15.028812", "end": "2018-07-30 18:01:45.703100", "msg": "non-zero return code", "rc": 1, "start": "2018-07-30 18:01:30.674288", "stderr": "Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)", "stderr_lines": ["Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"], "stdout": "Trying to pull repository docker.io/openshift/origin ... ", "stdout_lines": ["Trying to pull repository docker.io/openshift/origin ... "]}
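Since the pull failures are intermittent, retrying with a growing delay might paper over them while the network issue is tracked down. The sketch below is a generic retry helper, not what openshift-ansible does; the name `retry` and its parameters are made up for illustration.

```python
import time


def retry(action, attempts=3, delay=5.0, backoff=2.0, sleep=time.sleep):
    """Call `action()` until it returns True or `attempts` are used up.

    Waits `delay` seconds after the first failure and multiplies the wait
    by `backoff` after each further failure. Returns True on success and
    False if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if action():
            return True
        if attempt < attempts:
            sleep(delay)
            delay *= backoff
    return False
```

For the failing task above, `action` could be something like `lambda: subprocess.call(["docker", "pull", "openshift/origin:v3.9.0"]) == 0` (with `import subprocess`), run on the affected master.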
@atyronesmith has taken over .9 & .11, and is doing a start-from-scratch on them.
Another quick one is -- I was having trouble with long waits after VM spinup during the "get IP address loop" of vm-spinup, and I needed to apply this patch to make it happen in reasonable time frames:
https://github.com/redhat-nfvpe/ansible-role-vm-spinup/pull/24
Here is the issue:
And another issue with restarting a script that had a glusterfs failure
From Leif:
More information in: https://gist.github.com/leifmadsen/e91a143655fbb3baabacfac539f307b2
cc: @leifmadsen @atyronesmith