operator-framework / operator-lifecycle-manager

A management framework for extending Kubernetes with Operators
https://olm.operatorframework.io
Apache License 2.0

Support for ensuring an upgrade is completed before starting another #2618

Open phantomjinx opened 2 years ago

phantomjinx commented 2 years ago

Feature Request

Is your feature request related to a problem? Please describe.

The Syndesis application is made up of an operator that spawns a number of image instances, i.e. server, meta, ui, db, upgrade, s2i.

Most of these operand images are stateless, so they can be terminated and upgraded easily. However, the db image includes a mount containing the database instance. This has to be upgraded using postgres' own tools, either to modify the schema or to upgrade to a new postgres version.

The automatic upgrade process, implemented by OLM, takes care of starting an upgrade. To get from 7.9.2 -> 7.9.3, the 7.9.2 operator is first replaced, then the 7.9.3 operator goes into an upgrade phase in order to update the database.

We have come across the following use case:

  1. User has remained on a 7.9.x channel, which has the latest version of 7.9.3;
  2. A new channel has been published, 7.10.x, which is at first populated with version 7.10.0 and then soon after version 7.10.1 (due to a respin & priority bug fix);
  3. The user changes channel to 7.10.x and the upgrades kick off. Due to the upgrade graph, the following occurs: 7.9.3 -> 7.10.0 -> 7.10.1.

This upgrade path is correct. The problem is how quickly it is performed. The moment the 7.10.0 operator replaces the 7.9.3 operator, OLM begins the upgrade to 7.10.1. At that point the upgrade phase being performed by the 7.10.0 operator has not yet completed (the database schema in the db image is still being backed up and upgraded). Therefore, at best the user ends up with a schema upgrade jump on the database and at worst a corrupt database.

Describe the solution you'd like

Is it possible to implement a callback or event that delays an OLM operator upgrade until such time that the operator being upgraded is in a fully installed or quiet state? Thus ...

  1. User changes channel from 7.9.x to 7.10.x
  2. OLM replaces the 7.9.3 operator with the 7.10.0 operator
  3. 7.10.0 operator begins its own upgrade routine
  4. 7.10.0 operator advertises to OLM that it should not be upgraded
  5. 7.10.0 operator completes its upgrade routine and broadcasts that it is now ok to be upgraded
  6. OLM starts upgrading 7.10.0 to 7.10.1
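The requested sequence can be modelled as a simple gate. The sketch below is only an illustration of the desired behaviour: the `operator` type, the `upgradeable` flag and `tryUpgrade` are hypothetical names, and the rule that a freshly installed operator starts out blocked is an assumption of this sketch, not something OLM is known to do.

```go
package main

import "fmt"

// operator is a hypothetical model of an installed operator version.
type operator struct {
	version     string
	upgradeable bool // false while the operator's own upgrade routine runs
}

// tryUpgrade replaces cur with next only if cur reports that it is safe to
// be upgraded. The new operator starts out blocked (assumption) until it
// finishes its own upgrade routine, as in steps 3 and 4.
func tryUpgrade(cur operator, next string) (operator, bool) {
	if !cur.upgradeable {
		return cur, false
	}
	return operator{version: next, upgradeable: false}, true
}

func main() {
	op := operator{version: "7.9.3", upgradeable: true}

	op, ok := tryUpgrade(op, "7.10.0") // step 2
	fmt.Println(op.version, ok)        // 7.10.0 true

	op, ok = tryUpgrade(op, "7.10.1") // blocked: upgrade routine still running
	fmt.Println(op.version, ok)       // 7.10.0 false

	op.upgradeable = true             // step 5: operator signals it is ready
	op, ok = tryUpgrade(op, "7.10.1") // step 6
	fmt.Println(op.version, ok)       // 7.10.1 true
}
```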

It is possible for us to work around this issue by having users set the Update Approval to Manual. However, this means the user has to approve each upgrade step, which is not necessarily preferred.

dmesser commented 2 years ago

OLM has a feature where the operator controller can stamp out an OperatorCondition object to tell OLM not to update it right now (even if update policy is set to manual) because it's in a critical operation. Check this out: https://olm.operatorframework.io/docs/advanced-tasks/communicating-operator-conditions-to-olm/

phantomjinx commented 2 years ago

This seems like just what we are after. Thanks!!

phantomjinx commented 2 years ago

Having tested the use of the Upgradeable OperatorCondition, I've found that it does not function in the way I thought it would. With 7.10.2 of our operator installed, I change the channel to latest, which is loaded with 7.11.0 and 7.11.1. The 7.10.2 operator is immediately upgraded to 7.11.0, but that new operator never actually reaches the Reconcile function loop before it is in turn upgraded to 7.11.1. Since this function is where I have added the SetUpgradeCondition calls for controlling the upgrade, they are never called and no OperatorCondition is ever created. Any thoughts on how I can get around this, please?