multi master OKD-3.11 setup fails if master-1 nodes is down

reddybhavaniprasad commented 3 years ago

I am trying to install multi-master openshift-3.11 setup in openstack VMs as per the inventory file present in the official documentation. https://docs.openshift.com/container-platform/3.11/install/example_inventories.html#multi-masters-single-etcd-using-native-ha

Version

[centos@master1 ~]$ oc version
oc v3.11.0+62803d0-1
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://master1.167.254.204.74.nip.io:8443
openshift v3.11.0+ff2bdbd-531
kubernetes v1.11.0+d4cacc0

Steps To Reproduce

Bring up an okd-3.11 multi master setup as per the inventory file mentioned in here, https://docs.openshift.com/container-platform/3.11/install/example_inventories.html#multi-masters-single-etcd-using-native-ha

Current Result

The setup is successful but struck with two issues as mentioned below,

unable to list down the load balancer nodes on issue of "oc get nodes" command.

[centos@master1 ~]$ oc get nodes
NAME                            STATUS    ROLES          AGE       VERSION
master1.167.254.204.74.nip.io   Ready     infra,master   6h        v1.11.0+d4cacc0
master2.167.254.204.58.nip.io   Ready     infra,master   6h        v1.11.0+d4cacc0
master3.167.254.204.59.nip.io   Ready     infra,master   6h        v1.11.0+d4cacc0
node1.167.254.204.82.nip.io     Ready     compute        6h        v1.11.0+d4cacc0

The master nodes and the load balancer are totally dependent on master-1 node because if master-1 is down then rest of the master nodes or load balancer unable to run any of the oc commands,
```
[centos@master2 ~]$ oc get nodes
Unable to connect to the server: dial tcp 167.254.204.74:8443: connect: no route to host
```
The OKD setup works fine if the other master nodes (other than master-1) or the load balancer are down.

Expected Result

The OKD setup should be up & running though any one of the master nodes went down.

Inventory file:

[OSEv3:children]
masters
nodes
etcd
lb

[masters]
master1.167.254.204.74.nip.io
master2.167.254.204.58.nip.io
master3.167.254.204.59.nip.io

[etcd]
master1.167.254.204.74.nip.io
master2.167.254.204.58.nip.io
master3.167.254.204.59.nip.io

[lb]
lb.167.254.204.111.nip.io

[nodes]
master1.167.254.204.74.nip.io openshift_ip=167.254.204.74 openshift_schedulable=true openshift_node_group_name='node-config-master'
master2.167.254.204.58.nip.io openshift_ip=167.254.204.58 openshift_schedulable=true openshift_node_group_name='node-config-master'
master3.167.254.204.59.nip.io openshift_ip=167.254.204.59 openshift_schedulable=true openshift_node_group_name='node-config-master'
node1.167.254.204.82.nip.io openshift_ip=167.254.204.82 openshift_schedulable=true openshift_node_group_name='node-config-compute'

[OSEv3:vars]
debug_level=4
ansible_ssh_user=centos
ansible_become=true
ansible_ssh_common_args='-o StrictHostKeyChecking=no'
openshift_enable_service_catalog=true
ansible_service_broker_install=true

openshift_node_groups=[{'name': 'node-config-master', 'labels': ['node-role.kubernetes.io/master=true', 'node-role.kubernetes.io/infra=true']}, {'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=true']}]

containerized=false
os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant'
openshift_disable_check=disk_availability,docker_storage,memory_availability,docker_image_availability

deployment_type=origin
openshift_deployment_type=origin

openshift_release=v3.11.0
openshift_pkg_version=-3.11.0
openshift_image_tag=v3.11.0
openshift_service_catalog_image_version=v3.11.0
template_service_broker_image_version=v3.11
osm_use_cockpit=true

# put the router on dedicated infra1 node
openshift_master_cluster_method=native
openshift_master_default_subdomain=sub.master1.167.254.204.74.nip.io
openshift_public_hostname=master1.167.254.204.74.nip.io
openshift_master_cluster_hostname=master1.167.254.204.74.nip.io

Please let me know the entire setup dependency on master-node and also any work around to fix this.

openshift-bot commented 3 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 3 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

openshift-bot commented 3 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci[bot] commented 3 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/origin/issues/25758#issuecomment-837158971): >Rotten issues close after 30d of inactivity. > >Reopen the issue by commenting `/reopen`. >Mark the issue as fresh by commenting `/remove-lifecycle rotten`. >Exclude this issue from closing again by commenting `/lifecycle frozen`. > >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

openshift / origin