operator-framework / operator-lifecycle-manager

A management framework for extending Kubernetes with Operators
https://olm.operatorframework.io
Apache License 2.0

Operator upgrade is stuck #2459

Open NiharDudam opened 2 years ago

NiharDudam commented 2 years ago

Bug Report

What did you do?

Hi everyone, I am facing an issue where my operator upgrade is stuck, with an error on my Subscription UI page as shown here.

[Screenshot: Screen Shot 2021-11-15 at 11 28 08 AM]

Quick context:

I am thinking the problem is that OLM is doing some checks during upgrades which are different from fresh installs, causing upgrades to get stuck.

I tested basic upgrades in base Kubernetes and it works as expected (install old CRDs + install old operator + update CRDs + update operator).

I am hoping someone can shed some light on whether this is a known issue with OLM.

What did you expect to see? Upgrade completes

What did you see instead? Under which circumstances? The upgrade is stuck.

Environment OpenShift 4.7 and OpenShift 4.8

Possible Solution Workaround is to delete and re-create the operator

Additional context I already raised this issue in #olm-dev channel at https://kubernetes.slack.com/archives/C0181L6JYQ2/p1637004932140400 but did not get precise answers.

cdjohnson commented 2 years ago

Workaround is to delete and re-create the cluster

I think this should be: Workaround is to delete and re-create the subscription. Not the whole cluster.

NiharDudam commented 2 years ago

Thanks for pointing it out, just edited my comment.

NiharDudam commented 2 years ago

any updates here?

dmesser commented 2 years ago

@NiharDudam the error you are hitting comes from OLM's preflight checks while updating CRDs. They essentially check the compatibility of the new CRD against existing CRs on the cluster to prevent data loss. The check failed in your case, and that's why the upgrade was aborted. You'd either need to make these fields optional in your new CRD or adjust the CRs on the cluster.
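
To make the mechanism concrete, here is a rough Go sketch of what such a preflight check amounts to: validate every existing CR against the incoming CRD schema and abort the upgrade on the first failure. The function shape and helper names are hypothetical, not OLM's actual API (the real code is linked further down in this thread):

package preflight

import "fmt"

// preflightCheck is a conceptual stand-in for OLM's CRD-upgrade check:
// the upgrade is refused if any existing CR fails validation against
// the new CRD schema, to prevent data loss.
func preflightCheck(existingCRs []map[string]interface{},
	validate func(map[string]interface{}) error) error {
	for i, cr := range existingCRs {
		if err := validate(cr); err != nil {
			// One incompatible CR is enough to abort the upgrade.
			return fmt.Errorf("CR %d is incompatible with the new schema: %w", i, err)
		}
	}
	return nil
}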

NiharDudam commented 2 years ago

You'd either need to make these fields optional in your new CRD

The CRD fields are optional, and that is the same reason why kubectl allows my kubectl apply -f new-crds.yaml, AFAIK.

NiharDudam commented 2 years ago

I could perform kubectl apply -f new-crds.yaml, but somehow OLM is not honoring that? Remember, these CRDs are internal objects, so they are not part of the CSV.

NiharDudam commented 2 years ago

@dmesser - I am assuming that the pre-flight checks are validating existing CRs against the new CRDs, as per the error message, so I would assume they would ignore those fields because I have manually set preserveUnknownFields: false in my new CRD.

I'm not exactly sure what validation the pre-flight checks are doing. Can you explain?

If it is basic schema validation, I would assume the kubectl client would also not ALLOW applying the new CRDs, which is not happening in my case.

cdjohnson commented 2 years ago

@NiharDudam and I did some experimentation and found the issue, which we need to discuss.

Evidently, the REST API and older versions of kubectl (1.20 and earlier) allow you to persist a null value in any field (object, array, primitive, ...).

In kube 1.19, when you upgraded a CRD from v1beta1 to v1, these null values were preserved when you retrieved the objects, for fields that are declared in the v1 schema.

Example:

spec:
  myvalue: null

In kube 1.20 and later, when the CRD is upgraded from v1beta1 to v1, ALL null values are removed in favor of empty objects:

Example: v1 CRD:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
...
spec:
  preserveUnknownFields: false
  ...
  versions:
  - name: v1
    schema:
      openAPIV3Schema:
        type: object
        properties:
          ...
          spec:
            type: object
            properties:
              myvalue:
                type: object
                ...

Value persisted with the v1beta1 CRD via the API or kubectl 1.19:

spec:
  myvalue: null

When the v1 CRD is applied, this value becomes:

spec:
  myvalue: {}

So, it appears there is a toleration for converting null values to empty objects in the Kubernetes conversion logic; however, the validator that OLM is using does not tolerate this.
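
One way to see why the two forms are not interchangeable for a strict validator: when the object is decoded as JSON (as the validator stack ultimately does), null becomes a nil value while {} becomes an empty map, and a schema declaring type: object can reject the former while accepting the latter. A minimal, runnable illustration in Go (the field name myvalue is just the example from above):

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	for _, doc := range []string{
		`{"spec": {"myvalue": null}}`, // as persisted by older clients
		`{"spec": {"myvalue": {}}}`,   // what kube 1.20+ conversion yields
	} {
		var obj map[string]interface{}
		_ = json.Unmarshal([]byte(doc), &obj)
		spec := obj["spec"].(map[string]interface{})
		v, isObject := spec["myvalue"].(map[string]interface{})
		// null -> nil (not an object); {} -> empty map (a valid object).
		fmt.Printf("myvalue=%v, is object: %v\n", v, isObject)
	}
}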

The crunchydata operator, unfortunately, does this by default: it persists null instead of default values (e.g. empty object, false, empty string), which is tolerated by kubectl and the API server, but not by the validator code.

The workaround is to remove any null values from the CRs prior to upgrading the operator (the customer would need to do this, or there would need to be an intermediate version of the operator that repairs any CRs, running before the version that upgrades the CRD to v1). It MAY also be a workaround to add nullable: true to the schema, as long as the operator truly can tolerate null values and isn't assuming that kube is converting them (which it doesn't in 1.19 / OCP 4.6).
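
As an illustration of the first workaround, here is a minimal sketch of a null-scrubbing pass over a CR decoded as unstructured JSON (illustrative only, not an official OLM or Crunchy Data tool; null array elements are left in place since removing them would change list semantics):

package main

import (
	"encoding/json"
	"fmt"
)

// scrubNulls recursively deletes null-valued fields from a decoded JSON
// object, so a CR persisted with "myvalue: null" validates like one that
// simply omits the field.
func scrubNulls(v interface{}) interface{} {
	switch t := v.(type) {
	case map[string]interface{}:
		for k, val := range t {
			if val == nil {
				delete(t, k)
				continue
			}
			t[k] = scrubNulls(val)
		}
	case []interface{}:
		for i, val := range t {
			t[i] = scrubNulls(val)
		}
	}
	return v
}

func main() {
	raw := []byte(`{"spec": {"myvalue": null, "replicas": 3}}`)
	var cr map[string]interface{}
	_ = json.Unmarshal(raw, &cr)
	out, _ := json.Marshal(scrubNulls(cr))
	fmt.Println(string(out)) // {"spec":{"replicas":3}}
}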

Alternatives we should discuss:

  1. Update the Kubernetes validation API to support an optional toleration for null-valued fields.
  2. Update OLM to scrub the CR prior to validating, injecting default values.

dmesser commented 2 years ago

@cdjohnson interesting find. @NiharDudam the reason you don't get any errors with kubectl is that no validation of existing CRs takes place there. This is an OLM feature that is meant to protect you from data loss. In this particular case, it seems we could do something to automatically remediate incompatibilities without data loss, but it also seems the validation could simply be enhanced to tolerate null values.

cdjohnson commented 2 years ago

The code in question here is in the Catalog Operator: https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/operators/catalog/operator.go#L1829

It invokes the Kubernetes validation go library: https://github.com/operator-framework/operator-lifecycle-manager/blob/40b8b7104e3ce577c1c0f0f08e6854d62978169c/vendor/k8s.io/apiextensions-apiserver/pkg/apiserver/validation/validation.go#L47
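
Roughly, the call sequence is: build a schema validator from the new CRD's validation block, then run every existing CR through it. A sketch against the 2021-era vendored signatures (they have changed in later Kubernetes releases, so treat this as illustrative rather than drop-in):

package preflight

import (
	"k8s.io/apiextensions-apiserver/pkg/apis/apiextensions"
	"k8s.io/apiextensions-apiserver/pkg/apiserver/validation"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/util/validation/field"
)

// validateExistingCRs mirrors the shape of the Catalog Operator code
// linked above: one schema validator, applied to each existing CR.
func validateExistingCRs(crValidation *apiextensions.CustomResourceValidation, crs []unstructured.Unstructured) error {
	validator, _, err := validation.NewSchemaValidator(crValidation)
	if err != nil {
		return err
	}
	for _, cr := range crs {
		// This is the step that rejects `myvalue: null` against a
		// schema declaring `type: object`.
		if errs := validation.ValidateCustomResource(field.NewPath(""), cr.UnstructuredContent(), validator); len(errs) > 0 {
			return errs.ToAggregate()
		}
	}
	return nil
}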

dinhxuanvu commented 2 years ago

I will try to reproduce this and investigate how k8s apiserver is handling this particular situation.