redhat-nfvpe / base-infra-bootstrap

Generic node bootstrap for virtual KVM or baremetal
Apache License 2.0

Generate the loopback master client config: failed expects hostvars is a dict #49

Open dougbtv opened 6 years ago

dougbtv commented 6 years ago

From Leif:

TASK [openshift_master_certificates : Generate the loopback master client config] *************************************
Friday 27 July 2018  11:51:54 -0400 (0:00:03.523)       0:11:06.691 *********** 
changed: [openshift-master-1 -> openshift-master-1.nfvpe.site] => (item=openshift-master-2)
changed: [openshift-master-1 -> openshift-master-1.nfvpe.site] => (item=openshift-master-3)
ERROR! |failed expects hostvars is a dict

More information in: https://gist.github.com/leifmadsen/e91a143655fbb3baabacfac539f307b2

cc: @leifmadsen @atyronesmith

dougbtv commented 6 years ago

Apparently I have a run that is working. It resulted in:

ok: [openshift-master-3 -> openshift-master-1.nfvpe.site]

TASK [openshift_master_certificates : Create the master server certificate] *****************************************************************************************************************************************************************
Friday 27 July 2018  14:06:38 -0400 (0:00:03.887)       0:22:18.826 *********** 
changed: [openshift-master-1 -> openshift-master-1.nfvpe.site] => (item=openshift-master-2)
changed: [openshift-master-1 -> openshift-master-1.nfvpe.site] => (item=openshift-master-3)

TASK [openshift_master_certificates : copy] 
[...snip...]

dougbtv commented 6 years ago

workstation / "where I run ansible" information

$ ansible --version
ansible 2.4.3.0
  config file = /tmp/openshift-ansible/ansible.cfg
  configured module search path = [u'/home/doug/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.14 (default, Feb 27 2018, 20:43:24) [GCC 7.3.1 20180130 (Red Hat 7.3.1-2)]

$ cat /etc/redhat-release 
Fedora release 27 (Twenty Seven)

$ uname -a
Linux yoda 4.15.9-300.fc27.x86_64 #1 SMP Mon Mar 12 17:07:55 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

dougbtv commented 6 years ago

Playbook almost completed -- apparently a different error entirely.

TASK [sa_telemetry : include_role] **********************************************************************************************************************************************************************************************************
Friday 27 July 2018  15:11:48 -0400 (0:00:02.890)       1:27:29.230 ***********
ERROR! The field 'loop' is supposed to be a string type, however the incoming data structure is a <class 'ansible.parsing.yaml.objects.AnsibleSequence'>

The error appears to have been in '/tmp/openshift-ansible/roles/sa_telemetry_prometheus/tasks/main.yml': line 94, column 3, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

# deploy separate instances via the operator
- include_tasks: deploy.yml
  ^ here

Edit: This is a feature from Ansible version 2.5, according to this issue comment.
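
A version check on the control node before starting the run would catch this earlier. The following is a minimal sketch, assuming the first line of `ansible --version` has the same shape as the `ansible 2.4.3.0` output above. `parse_version` and `supports_loop_keyword` are hypothetical helper names, not part of Ansible.

```python
import re


def parse_version(text):
    """Extract the version tuple from text shaped like 'ansible 2.4.3.0'."""
    match = re.search(r"ansible\s+(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError("no version found in: %r" % text)
    return tuple(int(part) for part in match.group(1).split("."))


def supports_loop_keyword(version_text, minimum=(2, 5)):
    """The task-level `loop` keyword was introduced in Ansible 2.5."""
    # Compare only (major, minor) against the minimum.
    return parse_version(version_text)[:2] >= minimum


print(supports_loop_keyword("ansible 2.4.3.0"))  # False
print(supports_loop_keyword("ansible 2.5.0"))    # True
```

The version text could come from `subprocess.check_output(["ansible", "--version"])`, decoded to a string. Newer Ansible releases print the version in a different format, so the regular expression would need adjusting there.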

dougbtv commented 5 years ago

Attempting runs with Ansible 2.5, only having... limited successes. Might be related to network connectivity on the lab side? Unsure.

At any rate, here's what the errors tend to look like, and they tend to happen during gluster-fs plays...

TASK [openshift_storage_glusterfs : Label GlusterFS nodes] **********************************************************************************************************************************************************************************
Monday 30 July 2018  15:29:43 +0000 (0:00:00.073)       0:27:26.013 *********** 
changed: [openshift-master-1] => (item=openshift-node-1)
failed: [openshift-master-1] (item=openshift-node-2) => {"changed": false, "item": "openshift-node-2", "msg": {"cmd": "/usr/local/bin/oc label node openshift-node-2.nfvpe.site glusterfs=storage-host --overwrite", "results": {}, "returncode": 1, "stderr": "Error from server (NotFound): nodes \"openshift-node-2.nfvpe.site\" not found\n", "stdout": ""}}
failed: [openshift-master-1] (item=openshift-node-3) => {"changed": false, "item": "openshift-node-3", "msg": {"cmd": "/usr/local/bin/oc label node openshift-node-3.nfvpe.site glusterfs=storage-host --overwrite", "results": {}, "returncode": 1, "stderr": "Error from server (NotFound): nodes \"openshift-node-3.nfvpe.site\" not found\n", "stdout": ""}}
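
The `NotFound` errors suggest the nodes had not registered with the API server when the label task ran. One way to confirm that before labeling is to compare the expected node names against what the cluster reports. This is a rough sketch, assuming the input is the text of `oc get nodes -o name` with one `node/<name>` entry per line. `missing_nodes` is a hypothetical helper, not part of openshift-ansible.

```python
def missing_nodes(oc_output, expected):
    """Return the expected node names that do not appear in the oc output."""
    registered = set()
    for line in oc_output.splitlines():
        line = line.strip()
        if line:
            # Drop the resource prefix ("node/...") and keep only the name.
            registered.add(line.split("/", 1)[-1])
    return [name for name in expected if name not in registered]


sample = "node/openshift-master-1.nfvpe.site\nnode/openshift-node-1.nfvpe.site\n"
wanted = [
    "openshift-node-1.nfvpe.site",
    "openshift-node-2.nfvpe.site",
    "openshift-node-3.nfvpe.site",
]
print(missing_nodes(sample, wanted))
# ['openshift-node-2.nfvpe.site', 'openshift-node-3.nfvpe.site']
```

The sample output here is made up for illustration and is not taken from the lab.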

dougbtv commented 5 years ago

Apparently, some VMs sometimes can't reach the internet?

[doug@overcloud-cephstorage-0 openshift-ansible]$ ssh centos@openshift-lb.nfvpe.site
[..snip..]

[centos@openshift-lb ~]$ sudo su
[root@openshift-lb centos]# docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
[root@openshift-lb centos]# docker pull docker.io/openshift/origin-haproxy-router
Using default tag: latest
Trying to pull repository docker.io/openshift/origin-haproxy-router ... 
^C [just hung...]

[root@openshift-lb centos]# ping 4.2.2.2
PING 4.2.2.2 (4.2.2.2) 56(84) bytes of data.
^C
--- 4.2.2.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

[root@openshift-lb centos]# ip route
default via 10.19.110.254 dev eth0 proto static metric 100 
10.19.110.0/24 dev eth0 proto kernel scope link src 10.19.110.64 metric 100 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 
[root@openshift-lb centos]# 
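
To run the same kind of check on every VM without logging in by hand, a small reachability probe might help. This is a minimal sketch, assuming Python 3 is available on the host. It uses a TCP connect instead of the ICMP ping shown above, so it also covers the case where ping is filtered but the registry is reachable. `can_connect` is a hypothetical helper name.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable networks
        # and DNS resolution failures.
        return False
```

For example, `can_connect("registry-1.docker.io", 443)` returning `False` on a VM would point at the same problem as the hung `docker pull`.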

leifmadsen commented 5 years ago

I'm wondering if there is a problem with MAC tables or something, or ARP cache? This has been working for weeks now, so I'm not really sure what could have changed so dramatically.

Would be good to run virtually as well. I'm actually home tomorrow and might be able to find an hour to test this outside the hardware lab.


dougbtv commented 5 years ago

Quick update: I rebooted that VM and it got WAN access. However, I was getting intermittent failures pulling images.

TASK [openshift_master : Pre-pull master image] *********************************************************************************************************************************************************************************************
Monday 30 July 2018  18:01:30 +0000 (0:00:00.629)       0:03:50.774 *********** 
ok: [openshift-master-1]
ok: [openshift-master-3]
fatal: [openshift-master-2]: FAILED! => {"changed": false, "cmd": ["docker", "pull", "openshift/origin:v3.9.0"], "delta": "0:00:15.028812", "end": "2018-07-30 18:01:45.703100", "msg": "non-zero return code", "rc": 1, "start": "2018-07-30 18:01:30.674288", "stderr": "Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)", "stderr_lines": ["Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"], "stdout": "Trying to pull repository docker.io/openshift/origin ... ", "stdout_lines": ["Trying to pull repository docker.io/openshift/origin ... "]}

@atyronesmith has taken over .9 & .11, and is doing a start-from-scratch on them.
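
For the intermittent pull failures above, retrying with a growing delay is one possible workaround while the network problem is investigated. This is a sketch only, and it does not fix the underlying connectivity issue. `retry` and `pull_image` are hypothetical helpers, and the attempt count and delays are arbitrary assumptions.

```python
import subprocess
import time


def retry(action, attempts=5, delay=2.0, backoff=2.0, sleep=time.sleep):
    """Call action() until it succeeds; re-raise the last error when attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= backoff  # wait longer before each further attempt


def pull_image(image):
    """Run `docker pull`; raises CalledProcessError on a non-zero return code."""
    subprocess.run(["docker", "pull", image], check=True)
```

Usage would look like `retry(lambda: pull_image("openshift/origin:v3.9.0"))`. The `sleep` parameter is injectable so the delays can be skipped when testing.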

dougbtv commented 5 years ago

Another quick one: I was having trouble with long waits after VM spin-up during the "get IP address loop" of vm-spinup, and I needed to apply this patch to make it finish in a reasonable time frame:

https://github.com/redhat-nfvpe/ansible-role-vm-spinup/pull/24
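
As a general illustration of bounding a wait like this, here is a generic poll-until-deadline sketch. It is not what the linked patch does. `wait_for` and its parameters are hypothetical, and the probe that looks up the VM's IP address is left to the caller.

```python
import time


def wait_for(probe, timeout=120.0, interval=2.0,
             clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns a truthy value or timeout seconds elapse."""
    deadline = clock() + timeout
    while True:
        result = probe()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError("no result after %.0f seconds" % timeout)
        sleep(interval)
```

A short `interval` keeps the loop responsive once the address appears, and the `timeout` prevents it from waiting indefinitely on a VM that never gets one.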

atyronesmith commented 5 years ago

Here is the issue:

https://github.com/openshift/openshift-ansible/issues/7596

atyronesmith commented 5 years ago

And another issue, this one with restarting a script that had a GlusterFS failure:

https://github.com/openshift/openshift-ansible/issues/9068