openshift / openshift-ansible

Install and config an OpenShift 3.x cluster
https://try.openshift.com
Apache License 2.0

byo 3.9 to 3.10 automated inplace upgrade fails with apiservices/v1beta1.metrics.k8s.io missing #10784

Closed · cameronbraid closed this issue 4 years ago

cameronbraid commented 5 years ago

Description

Upgrade from 3.9 to 3.10 failed


  1. Hosts:    node04-2018
     Play:     Pre master upgrade - Upgrade all storage
     Task:     Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered
     Message:  non-zero return code
Version

Linux version: CentOS Linux release 7.5.1804 (Core)

docker version

Client:
 Version:         1.13.1
 API version:     1.26
 Package version: docker-1.13.1-75.git8633870.el7.centos.x86_64
 Go version:      go1.9.4
 Git commit:      8633870/1.13.1
 Built:           Fri Sep 28 19:45:08 2018
 OS/Arch:         linux/amd64

Server:
 Version:         1.13.1
 API version:     1.26 (minimum version 1.12)
 Package version: docker-1.13.1-75.git8633870.el7.centos.x86_64
 Go version:      go1.9.4
 Git commit:      8633870/1.13.1
 Built:           Fri Sep 28 19:45:08 2018
 OS/Arch:         linux/amd64
 Experimental:    false

oc version (after failed upgrade run; was 3.9 before upgrade)

oc v3.10.0+0c4577e-1
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://node04-2018:8443
openshift v3.9.0+ba7faec-1
kubernetes v1.9.1+a0ce1bc657

Ansible version: v2.4.6.0-1

openshift-ansible version: openshift-ansible-3.10.80-1

Steps To Reproduce

1. Install a 3.7 cluster
2. Upgrade to 3.9
3. Run playbooks/openshift-master/openshift_node_group.yml
4. Run playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.yml
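For reference, the invocations behind steps 3 and 4 look roughly like the following, run from the openshift-ansible checkout (the inventory path is a placeholder; substitute wherever the inventory shown under Additional Information lives):

cd /root/openshift-ansible
# step 3: create the node group configmaps that 3.10 requires
ansible-playbook -i /path/to/inventory playbooks/openshift-master/openshift_node_group.yml
# step 4: the automated in-place upgrade that fails as shown below
ansible-playbook -i /path/to/inventory playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.yml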

Expected Results

Upgrade to complete successfully

Observed Results

Upgrade fails

TASK [openshift_control_plane : Check for apiservices/v1beta1.metrics.k8s.io registration] *******************************************************************************************************************************************************************************************
Wednesday 28 November 2018  13:40:10 +1100 (0:00:00.082)       0:06:24.049 **** 
FAILED - RETRYING: Check for apiservices/v1beta1.metrics.k8s.io registration (30 retries left).

...

FAILED - RETRYING: Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered (2 retries left).
FAILED - RETRYING: Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered (1 retries left).
fatal: [node04-2018]: FAILED! => {"attempts": 30, "changed": true, "cmd": ["oc", "get", "--raw", "/apis/servicecatalog.k8s.io/v1beta1"], "delta": "0:00:00.187128", "end": "2018-11-28 13:45:37.855544", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2018-11-28 13:45:37.668416", "stderr": "Error from server (NotFound): the server could not find the requested resource", "stderr_lines": ["Error from server (NotFound): the server could not find the requested resource"], "stdout": "", "stdout_lines": []}
        to retry, use: --limit @/root/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.retry
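For anyone debugging this by hand: the failing task is only polling an aggregated API endpoint, so the same checks can be run directly against the master with the stock oc client (nothing here is playbook-specific):

# list registered aggregated APIs and whether they are Available
oc get apiservices
# the two endpoints the upgrade polls; NotFound means the apiservice
# was never registered, e.g. when the service catalog is disabled
oc get --raw /apis/servicecatalog.k8s.io/v1beta1
oc get --raw /apis/metrics.k8s.io/v1beta1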
Additional Information

inventory file

[OSEv3:children]
masters
etcd
nodes
glusterfs

[OSEv3:vars]
openshift_master_upgrade_pre_hook=/root/origin-cluster-2018/upgade-pre_master.yml
openshift_master_upgrade_post_hook=/root/origin-cluster-2018/upgade-post_master.yml
ansible_ssh_user=root
deployment_type=origin
docker_version=1.13.1
openshift_enable_service_catalog=false
os_firewall_use_firewalld=True
openshift_disable_check=docker_image_availability
osn_storage_plugin_deps=['glusterfs']
osm_cluster_network_cidr=10.128.0.0/14
openshift_hosted_manage_router=false
openshift_logging_install_logging=false
openshift_master_cluster_method=native
openshift_master_cluster_hostname=master-private-2018-456.drivenow.com.au
openshift_master_cluster_public_hostname=master-2018-456.drivenow.com.au
openshift_master_named_certificates=[{"certfile":"/root/origin-cluster-2018/drivenow.com.au.crt","keyfile":"/root/origin-cluster-2018/drivenow.com.au.key", "cafile":"/root/origin-cluster-2018/thawte_Primary_Root_CA.pem", "names": ["*.drivenow.com.au", "drivenow.com.au"]}]
openshift_master_overwrite_named_certificates=true
openshift_master_identity_providers=redacted
openshift_master_htpasswd_users=redacted
openshift_master_default_subdomain=drivenow.com.au
openshift_metrics_install_metrics=true
openshift_metrics_hawkular_hostname=hawkular-metrics-2018-456.drivenow.com.au
openshift_install_examples=false
openshift_examples_load_db_templates=false
openshift_examples_load_quickstarts=false
openshift_examples_load_centos=false
openshift_examples_load_rhel=false
openshift_node_kubelet_args={'image-gc-high-threshold': ['80'], 'image-gc-low-threshold': ['60']}
openshift_node_groups=[{'name': 'node-config-master', 'labels': ['node-role.kubernetes.io/master=true']}, {'name': 'node-config-infra', 'labels': ['node-role.kubernetes.io/infra=true',]}, {'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=true']},{'name': 'node-config-all-in-one', 'labels': ['node-role.kubernetes.io/master=true','node-role.kubernetes.io/infra=true','node-role.kubernetes.io/compute=true']}, {'name': 'node-config-all-in-one-node04', 'labels': ['node-role.kubernetes.io/master=true','node-role.kubernetes.io/infra=true','node-role.kubernetes.io/compute=true','node=node04','region=primary']},{'name': 'node-config-all-in-one-node05', 'labels': ['node-role.kubernetes.io/master=true','node-role.kubernetes.io/infra=true','node-role.kubernetes.io/compute=true','node=node05','region=primary']},{'name': 'node-config-all-in-one-node06', 'labels': ['node-role.kubernetes.io/master=true','node-role.kubernetes.io/infra=true','node-role.kubernetes.io/compute=true','node=node06','region=primary']}]

[masters]
node04-2018
node05-2018
node06-2018

[etcd]
node04-2018
node05-2018
node06-2018

[nodes]
node04-2018 node=True storage=True master=True openshift_ip=10.118.56.23 openshift_kubelet_name_override=node04-2018 openshift_public_hostname=node04-2018.drivenow.com.au openshift_public_ip=168.1.85.42 openshift_schedulable=true openshift_node_group_name='node-config-all-in-one-node04' # until we migrate to using kubernetes host name label and update region labels
node05-2018 node=True storage=True master=True openshift_ip=10.118.56.13 openshift_kubelet_name_override=node05-2018 openshift_public_hostname=node05-2018.drivenow.com.au openshift_public_ip=168.1.85.37 openshift_schedulable=true openshift_node_group_name='node-config-all-in-one-node05' # until we migrate to using kubernetes host name label and update region labels
node06-2018 node=True storage=True master=True openshift_ip=10.118.56.4 openshift_kubelet_name_override=node06-2018 openshift_public_hostname=node06-2018.drivenow.com.au openshift_public_ip=168.1.85.46 openshift_schedulable=true openshift_node_group_name='node-config-all-in-one-node06' # until we migrate to using kubernetes host name label and update region labels

[glusterfs]
node04-2018 glusterfs_devices="[ '/dev/xvdc2' ]"
node05-2018 glusterfs_devices="[ '/dev/xvdc2' ]"
node06-2018 glusterfs_devices="[ '/dev/xvdc2' ]"

[app-nodes]
node04-2018
node05-2018
node06-2018

cameronbraid commented 5 years ago

One fact that may be relevant is that this cluster started as v3.7 and was upgraded to 3.9.

cameronbraid commented 5 years ago

It seems a change introduced in openshift-ansible-3.10.63-1 brought in a fix that was intended for a bug in 3.11:

c018682c     Scott Dodson, 3 months ago   (September 12th, 2018 6:16am)

Add a wait for aggregated APIs when restarting control plane
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1623571
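Judging from the failure output above, the task that commit adds is essentially a retry loop around oc get --raw. A rough reconstruction, not the exact role source (the 30 retries are taken from the log; the delay is inferred from the ~5 minutes the task spends retrying):

- name: Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered
  command: oc get --raw /apis/servicecatalog.k8s.io/v1beta1
  register: api_get
  until: api_get.rc == 0
  retries: 30
  delay: 10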

cameronbraid commented 5 years ago

I managed to successfully upgrade when using openshift-ansible-3.10.62-1
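For anyone else needing the workaround, a sketch of pinning that release on CentOS, assuming the rpm is still available in your configured repos (if you run openshift-ansible from a git checkout instead, checking out the matching tag should do the same job; tag name assumed to match the rpm version):

# replace the current openshift-ansible rpm with the last known-good release;
# subpackages such as openshift-ansible-playbooks may need the same pin
yum downgrade -y openshift-ansible-3.10.62-1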

hakanisaksson commented 5 years ago

Had the same issue with openshift-ansible-3.10.83-1, and yes, the upgrade worked with openshift-ansible-3.10.62-1 for me too. Thanks!

raffaelespazzoli commented 5 years ago

Hi, I just hit the same issue with openshift-ansible-3.10.83-1. In addition, my customer's cluster is configured not to install the service catalog, so it would make sense to skip this step altogether.
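One way that could look, as a hypothetical guard rather than the actual upstream change, is conditioning the wait on the same inventory flag used in the report above:

- name: Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered
  command: oc get --raw /apis/servicecatalog.k8s.io/v1beta1
  register: api_get
  until: api_get.rc == 0
  retries: 30
  delay: 10
  # hypothetical: skip the wait when the catalog was never deployed
  when: openshift_enable_service_catalog | default(true) | bool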

jdeenada commented 5 years ago

Hi, I am upgrading from 3.9 to 3.10 using the latest rpm, openshift-ansible-3.10.101-1, and I am still getting the error "FAILED - RETRYING: Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered", and the upgrade fails. Please advise.

dlbewley commented 5 years ago

Seeing the same thing when upgrading from 3.10.127 to 3.10.153.

openshift-bot commented 4 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 4 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

openshift-bot commented 4 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci-robot commented 4 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/openshift-ansible/issues/10784#issuecomment-667491616):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting `/reopen`.
> Mark the issue as fresh by commenting `/remove-lifecycle rotten`.
> Exclude this issue from closing again by commenting `/lifecycle frozen`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.