nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Update old OpenShift Test cluster from 4.13 to 4.15 #527

Closed joachimweyl closed 1 month ago

joachimweyl commented 3 months ago

Motivation

We want to see how the update from 4.13 to 4.14 goes on the old test cluster, as a rehearsal for making the same update in production.

Completion Criteria

Old test cluster updated from 4.13 to 4.14 and reviewed.

Description

Completion dates

Desired - ASAP
Required - 2024-05-08

joachimweyl commented 2 months ago

Need to make sure RHOAI is shown instead of RH Data Science.

Milstein commented 2 months ago

Also, we have some issues reported with Knative serverless functions on the prod cluster.

dystewart commented 2 months ago

Going to use these docs for upgrading: 4.13 to 4.14 upgrade docs

This step overlaps with: 286

dystewart commented 2 months ago

Updated timeline in this issue

dystewart commented 2 months ago

Here is our own upgrade path: upgrade-path

Upgrades test cluster from 4.13.13 -> 4.13.40: 441

dystewart commented 1 month ago

Upgrade step 4.13.13 -> 4.13.40 completed flawlessly

Putting things in place for 4.13.40 -> 4.14.21.

dystewart commented 1 month ago

Update 4.13.40 -> 4.14.21 is underway. Had to bounce a couple of pods off some of the worker nodes to kick this one off after the prior upgrade (some pods were stuck in the Terminating state following rescheduling).
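For anyone repeating this on another cluster, here is a rough sketch of how the stuck pods could be spotted before kicking off the update. It relies on the fact that a pod being deleted carries a `metadata.deletionTimestamp`; it is not what was actually run here. The `SAMPLE` data (namespace, pod, and node names) is hypothetical and stands in for the output of `oc get pods -A -o json`.

```python
# Hypothetical stand-in for the parsed output of `oc get pods -A -o json`.
SAMPLE = {
    "items": [
        {
            "metadata": {"namespace": "example-ns", "name": "healthy-pod"},
            "spec": {"nodeName": "worker-0"},
        },
        {
            "metadata": {
                "namespace": "example-ns",
                "name": "stuck-pod",
                # Set by the API server once deletion has been requested.
                "deletionTimestamp": "2024-05-01T12:00:00Z",
            },
            "spec": {"nodeName": "worker-1"},
        },
    ]
}


def terminating_pods(pod_list):
    """Return (namespace, name, node) for every pod that has a deletionTimestamp."""
    stuck = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        if "deletionTimestamp" in meta:
            node = pod.get("spec", {}).get("nodeName")
            stuck.append((meta.get("namespace"), meta.get("name"), node))
    return stuck


for namespace, name, node in terminating_pods(SAMPLE):
    print(f"{namespace}/{name} on {node}")
```

A pod that shows up here is not necessarily stuck; it may simply be shutting down. How long it has been in that state is what matters.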

dystewart commented 1 month ago

We are officially on 4.14.21 in the test cluster. First step to get us ready for 4.15 is bumping the ODF operator subscription's update channel to stable-4.14. Here is the PR
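The channel bump itself is a one-field change on the OLM Subscription (`spec.channel`). As an illustration only, and not the contents of the PR, a merge patch for that field could be built like this. The helper name `channel_patch` is made up for the example.

```python
import json


def channel_patch(channel):
    """Build a JSON merge patch that sets spec.channel on an OLM Subscription."""
    return json.dumps({"spec": {"channel": channel}})


patch = channel_patch("stable-4.14")
print(patch)  # {"spec": {"channel": "stable-4.14"}}
```

On a cluster that is not managed through Git, a patch like this could be applied with `oc patch subscription <name> -n <namespace> --type merge -p '<patch>'`. The subscription name and namespace are left as placeholders because they are not stated in this thread.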

dystewart commented 1 month ago

Docs for moving from 4.14 to 4.15: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.15/html-single/updating_clusters/index

dystewart commented 1 month ago

Looks like the upgrade also broke our cluster alerts setup. Here is the error we're seeing:

***AlertmanagerReceiversNotConfigured***
Alerts are not configured to be sent to a notification system, meaning that you may not be notified in a timely fashion when important failures occur. Check the OpenShift documentation to learn how to configure notifications with Alertmanager.

UPDATE: This doesn't actually look to be related to the upgrade, as I'm also seeing it in the prod and infra clusters.

dystewart commented 1 month ago

We've reached v4.15.11 in the test cluster!
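For reference when planning the production update, the hops recorded in this thread can be written down and sanity-checked with a small script. This is only a sketch: it checks that each hop moves forward and does not jump more than one minor version, which is how OpenShift minor updates normally have to be done. It does not consult the real update graph, so a path that passes here may still not be offered by the cluster. The function names are made up for the example.

```python
# Hops recorded in this thread, in order.
PATH = ["4.13.13", "4.13.40", "4.14.21", "4.15.11"]


def parse(version):
    """Split 'X.Y.Z' (optionally prefixed with 'v') into a tuple of ints."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def check_path(path):
    """Return a list of problems; an empty list means every hop passed the checks."""
    problems = []
    for current, target in zip(path, path[1:]):
        cur, tgt = parse(current), parse(target)
        if tgt <= cur:
            problems.append(f"{current} -> {target}: not an upgrade")
        elif tgt[0] != cur[0] or tgt[1] - cur[1] > 1:
            problems.append(f"{current} -> {target}: skips a minor version")
    return problems


print(check_path(PATH) or "path OK")
```

Going straight from 4.13.40 to 4.15.11 would be flagged as skipping a minor version, which is why 4.14.21 sits in between.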

dystewart commented 1 month ago

Closing this issue and continuing work over at: https://github.com/nerc-project/operations/issues/286