openshift / openshift-ansible

Install and config an OpenShift 3.x cluster
https://try.openshift.com
Apache License 2.0

Service Catalog install failed - 3.11.59 - multi master, multi etcd - Openstack #10926

Closed: frippe75 closed this issue 5 years ago

frippe75 commented 5 years ago

Description

Service Catalog install failed.

Seeing the same symptoms as in #10819, but my issue (from what I can tell) is not caused by an incorrect wildcard DNS entry, as his was.

Another similarity, though: we are both running etcd non-co-located with the masters:

openshift_openstack_num_masters: 3
openshift_openstack_num_infra: 1
openshift_openstack_num_cns: 0
openshift_openstack_num_nodes: 3
openshift_openstack_num_etcd: 3

I'm deploying with the OpenStack playbook on CentOS 7.5.1804, against an OpenStack Rocky release, and running behind a Squid proxy. I'm using a single MS DNS server and setting "-pub" as a hostname suffix for floating IPs, i.e. master-0-pub.os.lab.net for the floating IP and master-0.os.lab.net for the internal address.
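To make the naming scheme concrete, here is a minimal sketch of how the "-pub" suffix maps internal names to floating-IP names (host and domain values taken from this setup; the variable names are just for illustration):

```shell
# Hostname convention used here: the public (floating IP) record is the
# internal name plus the "-pub" suffix inserted before the domain.
host=master-0
domain=os.lab.net
suffix=-pub
echo "internal: ${host}.${domain}"           # record for the internal address
echo "floating: ${host}${suffix}.${domain}"  # record for the floating IP
```

Both records point at the same node; only the suffixed name resolves to the externally reachable address.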

Version
$ ansible --version
ansible 2.7.4
  config file = /home/centos/openshift-on-openstack/openshift-ansible-openshift-ansible-3.11.59-1/ansible.cfg
  configured module search path = [u'/home/centos/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Jul 13 2018, 13:06:57) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

openshift-ansible installed from openshift-ansible-3.11.59-1.zip
Steps To Reproduce

ansible-playbook --user openshift \
  -vvvv \
  -i openshift-ansible/playbooks/openstack/inventory.py \
  -i inventory \
  openshift-ansible/playbooks/openstack/openshift-cluster/provision_install.yml

Expected Results

Getting OKD 3.11 provisioned

Failure summary:

  1. Hosts:    master-0.os.lab.net
     Play:     Service Catalog
     Task:     Report errors
     Message:  Catalog install failed.
Observed Results
PLAY RECAP **************************************************************************************************************************************************
app-node-0.os.lab.net      : ok=193  changed=22   unreachable=0    failed=0
app-node-1.os.lab.net      : ok=193  changed=22   unreachable=0    failed=0
app-node-2.os.lab.net      : ok=209  changed=22   unreachable=0    failed=0
etcd-0.os.lab.net          : ok=105  changed=7    unreachable=0    failed=0
etcd-1.os.lab.net          : ok=105  changed=7    unreachable=0    failed=0
etcd-2.os.lab.net          : ok=122  changed=8    unreachable=0    failed=0
infra-node-0.os.lab.net    : ok=193  changed=22   unreachable=0    failed=0
localhost                  : ok=141  changed=17   unreachable=0    failed=0
master-0.os.lab.net        : ok=726  changed=155  unreachable=0    failed=1
master-1.os.lab.net        : ok=349  changed=49   unreachable=0    failed=0
master-2.os.lab.net        : ok=349  changed=49   unreachable=0    failed=0

And... the wait for the rollout alone took over an hour (3792 seconds):

Sunday 23 December 2018  00:03:49 +0000 (0:00:00.131)       2:15:40.222 *******
===============================================================================
openshift_service_catalog : Wait for API Server rollout success ----------------------------------------------------------------------------------- 3792.31s
/home/centos/openshift-on-openstack/openshift-ansible-openshift-ansible-3.11.59-1/roles/openshift_service_catalog/tasks/start.yml:2 ------------------------
openshift_service_catalog : Verify that the Catalog API Server is running ------------------------------------------------------------------------- 1320.29s
/home/centos/openshift-on-openstack/openshift-ansible-openshift-ansible-3.11.59-1/roles/openshift_service_catalog/tasks/start.yml:25 -----------------------
openshift_web_console : Verify that the console is running ----------------------------------------------------------------------------------------- 452.61s

The actual failure in detail:

fatal: [master-0.os.lab.net]: FAILED! => {
    "attempts": 60,
    "changed": false,
    "cmd": [
        "curl",
        "-k",
        "https://apiserver.kube-service-catalog.svc/healthz"
    ],
    "delta": "0:00:11.543649",
    "end": "2018-12-23 00:03:43.767866",
    "invocation": {
        "module_args": {
            "_raw_params": "curl -k https://apiserver.kube-service-catalog.svc/healthz",
            "_uses_shell": false,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "warn": false
        }
    },
    "msg": "non-zero return code",
    "rc": 7,
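curl's return code 7 means "failed to connect to host", so the catalog apiserver service was never reachable at the TCP level; it is not that the service answered with an unhealthy status. The same rc can be reproduced locally against any port that refuses connections (127.0.0.1:1 is just an illustrative choice):

```shell
# rc 7 = CURLE_COULDNT_CONNECT: the TCP connection itself failed,
# before any HTTP request or TLS handshake could happen.
curl -sk --max-time 2 https://127.0.0.1:1/
echo "curl exit code: $?"
```

In this cluster that points at the apiserver pods in kube-service-catalog not being up, or the service DNS name not resolving from the master, rather than a failing health check.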
Additional Information
[centos@deploy] $ cat all.yml | grep -v '^#' | grep -v '^$'
---
openshift_disable_check: memory_availability
openshift_http_proxy: http://webproxy.lab.net:3128
openshift_https_prox: http://webproxy.lab.net:3128
openshift_openstack_use_nsupdate: False
openshift_openstack_clusterid: "os"
openshift_openstack_public_dns_domain: "lab.net"
openshift_openstack_dns_nameservers: [10.0.1.99]
openshift_openstack_public_hostname_suffix: "-pub"
openshift_openstack_keypair_name: "openshift"
openshift_openstack_external_network_name: "external_network"
openshift_openstack_private_network_name:  "openshift-{{ openshift_openstack_stack_name }}"
openshift_openstack_default_image_name: "CentOS-7-proxy"
openshift_openstack_num_masters: 3
openshift_openstack_num_infra: 1
openshift_openstack_num_cns: 0
openshift_openstack_num_nodes: 3
openshift_openstack_num_etcd: 3
openshift_openstack_default_flavor: "m1.medium"
openshift_openstack_docker_volume_size: "15"
openshift_openstack_subnet_cidr: "192.168.99.0/24"
openshift_openstack_pool_start: "192.168.99.3"
openshift_openstack_pool_end: "192.168.99.254"
openshift_openstack_nsupdate_zone: os.lab.net
openshift_openstack_external_nsupdate_keys:
  public:
     server: '10.0.1.99'
  private:
    server: '10.0.1.99'
ansible_user: openshift
openshift_openstack_disable_root: true
openshift_openstack_user: openshift
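One thing worth checking here: the inventory above sets openshift_http_proxy but spells the second variable openshift_https_prox (missing the trailing y), and Ansible silently ignores unrecognized variable names, so HTTPS traffic (including registry pulls) could bypass the proxy entirely. A minimal grep sanity check (assuming the file is named all.yml in the current directory):

```shell
# Both proxy lines should appear in the output; a misspelled variable
# such as "openshift_https_prox" won't match this pattern and would be
# silently ignored by the playbooks.
grep -E '^openshift_https?_proxy:' all.yml
```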
frippe75 commented 5 years ago

I re-ran the ansible playbook for the 40th time, and now it passed this step and almost finished. It could be that I'm resource-starved, or that it's something around my networking/proxy; docker pulls have been ridiculously slow (in a segmented lab-network type of setup).

My OpenStack environment consists of a single controller node and three compute nodes, each with 96 GB of memory and 12 cores. m1.medium is the recommended flavor to use and it's my default one, so I should have the resources available to be successful...