operator-framework / operator-lifecycle-manager

A management framework for extending Kubernetes with Operators
https://olm.operatorframework.io
Apache License 2.0

Subscription and CSV don't bind each other #2201

Open horis233 opened 3 years ago

horis233 commented 3 years ago

Bug Report

This is an intermittent defect we observe during operator install and upgrade: the operator's Subscription and CSV do not get bound to each other.

What did you do?

The issue has been seen in both fresh installs and upgrades.

What did you expect to see?

I expected the operator to be deployed or upgraded successfully.

What did you see instead? Under which circumstances?

The operator's CSV is created, but the Subscription status is never updated. As a result, even after the InstallPlan completes, the Subscription stays in an unknown state and the CSV shows Cannot Update.

Screen Shot 2021-06-09 at 10 01 47 AM

It also blocks the catalog operator from syncing other operators.

E0609 14:02:55.116596       1 queueinformer_operator.go:290] sync "ibm-common-services" failed: constraints not satisfiable: pkgunique/ibm-odlm permits at most 1 of opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.6.0, opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.5.0, @existing/ibm-common-services//operand-deployment-lifecycle-manager.v1.6.0, gvkunique/operator.ibm.com/v1alpha1/OperandRegistry permits at most 1 of opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.6.0, @existing/ibm-common-services//operand-deployment-lifecycle-manager.v1.6.0, @existing/ibm-common-services//operand-deployment-lifecycle-manager.v1.6.0 is mandatory, ibm-odlm is mandatory, ibm-odlm requires at least one of opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.6.0, opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.5.0
I0609 14:02:55.116762       1 event.go:278] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"", Name:"ibm-common-services", UID:"92621a07-5877-4ef2-bffa-dfb5e4252992", APIVersion:"v1", ResourceVersion:"146584", FieldPath:""}): type: 'Warning' reason: 'ResolutionFailed' constraints not satisfiable: pkgunique/ibm-odlm permits at most 1 of opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.6.0, opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.5.0, @existing/ibm-common-services//operand-deployment-lifecycle-manager.v1.6.0, gvkunique/operator.ibm.com/v1alpha1/OperandRegistry permits at most 1 of opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.6.0, @existing/ibm-common-services//operand-deployment-lifecycle-manager.v1.6.0, @existing/ibm-common-services//operand-deployment-lifecycle-manager.v1.6.0 is mandatory, ibm-odlm is mandatory, ibm-odlm requires at least one of opencloud-operators/openshift-market...
E0609 14:02:58.912745       1 queueinformer_operator.go:290] sync "ibm-common-services" failed: constraints not satisfiable: ibm-odlm requires at least one of opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.6.0, opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.5.0, @existing/ibm-common-services//operand-deployment-lifecycle-manager.v1.6.0 is mandatory, pkgunique/ibm-odlm permits at most 1 of opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.6.0, opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.5.0, @existing/ibm-common-services//operand-deployment-lifecycle-manager.v1.6.0, gvkunique/operator.ibm.com/v1alpha1/OperandRegistry permits at most 1 of opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.6.0, @existing/ibm-common-services//operand-deployment-lifecycle-manager.v1.6.0, ibm-odlm is mandatory
I0609 14:02:58.912840       1 event.go:278] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"", Name:"ibm-common-services", UID:"92621a07-5877-4ef2-bffa-dfb5e4252992", APIVersion:"v1", ResourceVersion:"146584", FieldPath:""}): type: 'Warning' reason: 'ResolutionFailed' constraints not satisfiable: ibm-odlm requires at least one of opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.6.0, opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.5.0, @existing/ibm-common-services//operand-deployment-lifecycle-manager.v1.6.0 is mandatory, pkgunique/ibm-odlm permits at most 1 of opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.6.0, opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.5.0, @existing/ibm-common-services//operand-deployment-lifecycle-manager.v1.6.0, gvkunique/operator.ibm.com/v1alpha1/OperandRegistry permits at most 1 of opencloud-operators/openshift-marketplace/v3/...
I0609 14:03:02.744559       1 event.go:278] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"", Name:"ibm-common-services", UID:"92621a07-5877-4ef2-bffa-dfb5e4252992", APIVersion:"v1", ResourceVersion:"146584", FieldPath:""}): type: 'Warning' reason: 'ResolutionFailed' constraints not satisfiable: ibm-odlm requires at least one of opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.6.0, opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.5.0, pkgunique/ibm-odlm permits at most 1 of opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.5.0, opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.6.0, @existing/ibm-common-services//operand-deployment-lifecycle-manager.v1.6.0, @existing/ibm-common-services//operand-deployment-lifecycle-manager.v1.6.0 is mandatory, gvkunique/operator.ibm.com/v1alpha1/OperandConfig permits at most 1 of opencloud-operators/openshift-marketplace/v3/op...
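
To see at a glance which installed CSVs the resolver is complaining about, the @existing/<namespace>//<csv> entries can be pulled out of a ResolutionFailed message. The following is a minimal sketch and not part of OLM: the function name existing_csvs is hypothetical, and the pattern assumes the entry format seen in the log lines above.

```python
import re

# Matches entries of the form "@existing/<namespace>//<csv-name>".
# The format is inferred from the log lines above, not from OLM documentation.
EXISTING_ENTRY = re.compile(r"@existing/([^/\s,]+)//([^\s,]+)")


def existing_csvs(message):
    """Return the sorted, de-duplicated (namespace, csv_name) pairs in a message."""
    return sorted(set(EXISTING_ENTRY.findall(message)))


# Shortened excerpt of one of the messages above.
message = (
    "constraints not satisfiable: pkgunique/ibm-odlm permits at most 1 of "
    "opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.6.0, "
    "@existing/ibm-common-services//operand-deployment-lifecycle-manager.v1.6.0, "
    "@existing/ibm-common-services//operand-deployment-lifecycle-manager.v1.6.0 is mandatory"
)

for namespace, csv_name in existing_csvs(message):
    print(namespace, csv_name)
```

Messages that are truncated with "..." in the log may cut an entry short, so the result is only as complete as the message itself.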

I have uploaded the CSV, Subscription, and InstallPlan YAML files for further investigation: yaml-files.zip

Environment

operator-lifecycle-manager version: 0.16.1

Since we have seen this issue on OCP 4.7 and 4.8, I believe this defect is present in 0.17.0 and 0.17.1 as well.

Kubernetes version information: OCP 4.6, 4.7, 4.8

Kubernetes cluster kind: OCP

Possible Solution

Delete the operator CSV and let the catalog operator reconcile it again.
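
The workaround above could be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the CSV name and namespace are the ones from the logs in this report, and the oc CLI is assumed to be on the PATH and logged in to the cluster. By default the function only returns the command so it can be reviewed before anything is deleted.

```python
import subprocess


def delete_csv_command(csv_name, namespace, cli="oc"):
    """Build the CLI invocation that deletes one ClusterServiceVersion."""
    return [cli, "delete", "clusterserviceversion", csv_name, "-n", namespace]


def delete_csv(csv_name, namespace, cli="oc", dry_run=True):
    """Delete the CSV so the catalog operator can reconcile it again.

    With dry_run=True (the default) nothing is executed; the command is
    only returned for inspection.
    """
    command = delete_csv_command(csv_name, namespace, cli)
    if not dry_run:
        # Raises CalledProcessError if the CLI exits with a non-zero status.
        subprocess.run(command, check=True)
    return command


# Example values taken from the logs above.
print(" ".join(delete_csv(
    "operand-deployment-lifecycle-manager.v1.6.0", "ibm-common-services"
)))
```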

Additional context

cc @pgodowski

exdx commented 3 years ago

The handoff between upgrades of different versions of an operator has some known visibility issues, most of which are planned to be addressed in the new APIs. Some work was done here recently, but the fact that operators in a namespace are treated as a set during installation, where one failure affects all subsequent installs, is a problematic consequence of the multitenant nature of the OLM v1 APIs. Relates to #1565.

This problem could be addressed by the new v2 Bundle APIs and resolution.

horis233 commented 3 years ago

@njhale

Please let me know if this gives you any ideas about the root cause :)

We have seen this issue again in our product.

The operator CSV has succeeded and the InstallPlan is complete, but the Subscription status has no fields such as installplan or currentCSV.

This is the status of the operator's Subscription:

status:
  catalogHealth:
  - catalogSourceRef:
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: ace
      namespace: openshift-marketplace
      resourceVersion: "49186"
      uid: 063ddf21-b2c7-48a0-9f9d-4d09998d96d9
    healthy: true
    lastUpdated: "2021-06-21T11:59:19Z"
  - catalogSourceRef:
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: automation-base-pak-operators
      namespace: openshift-marketplace
      resourceVersion: "46005"
      uid: 23ab4d68-1ea1-46c3-80ca-3358b737ace4
    healthy: true
    lastUpdated: "2021-06-21T11:59:19Z"
  - catalogSourceRef:
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: certified-operators
      namespace: openshift-marketplace
      resourceVersion: "49182"
      uid: 73615a74-1d38-4972-8f37-9a0000ba465b
    healthy: true
    lastUpdated: "2021-06-21T11:59:19Z"
  - catalogSourceRef:
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: community-operators
      namespace: openshift-marketplace
      resourceVersion: "49185"
      uid: d45a5fc0-132c-4c39-b695-c7a1233e8703
    healthy: true
    lastUpdated: "2021-06-21T11:59:19Z"
  - catalogSourceRef:
      apiVersion: operators.coreos.com/v1alpha1
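
A quick way to spot a Subscription in this state is to check which of the usual binding fields are absent from its status. The sketch below is an illustration only: the function name is hypothetical, and it assumes the Subscription has already been loaded into a Python dict (for example from the YAML output of oc get subs). The field names checked are currentCSV, installedCSV, installPlanRef, and state from the Subscription status.

```python
# Status fields that are normally populated once a Subscription is bound
# to an InstallPlan and a CSV.
BINDING_FIELDS = ("currentCSV", "installedCSV", "installPlanRef", "state")


def missing_binding_fields(subscription):
    """Return the binding-related status fields that are absent or empty."""
    status = subscription.get("status") or {}
    return [field for field in BINDING_FIELDS if not status.get(field)]


# A Subscription like the one above: catalogHealth is present, nothing else.
stuck = {"status": {"catalogHealth": [{"healthy": True}]}}
print(missing_binding_fields(stuck))
```

An empty result means all four fields are set; any other result lists what the catalog operator has not recorded.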

The catalog operator logs show:

I0621 16:28:10.713590       1 event.go:278] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"", Name:"ibm-common-services", UID:"95a40970-4685-4d80-bcf5-8dddbfb091e7", APIVersion:"v1", ResourceVersion:"77866", FieldPath:""}): type: 'Warning' reason: 'ResolutionFailed' constraints not satisfiable: @existing/ibm-common-services//ibm-platform-api-operator.v3.10.0 is mandatory, opencloud-operators/openshift-marketplace/v3/ibm-platform-api-operator.v3.9.0, opencloud-operators/openshift-marketplace/v3/ibm-platform-api-operator.v3.9.1, opencloud-operators/openshift-marketplace/v3/ibm-platform-api-operator.v3.10.0 and @existing/ibm-common-services//ibm-platform-api-operator.v3.10.0 originate from package ibm-platform-api-operator-app, subscription ibm-platform-api-operator requires at least one of opencloud-operators/openshift-marketplace/v3/ibm-platform-api-operator.v3.10.0, opencloud-operators/openshift-marketplace/v3/ibm-platform-api-operator.v3.9.1 or opencloud-operators/openshift-marketplace/v3/ibm-platform-api-operator.v3.9.0, subscription ibm-platform-api-operator exists

E0621 16:33:13.732081 1 queueinformer_operator.go:290] sync "ibm-common-services" failed: constraints not satisfiable: opencloud-operators/openshift-marketplace/v3/ibm-platform-api-operator.v3.10.0, @existing/ibm-common-services//ibm-platform-api-operator.v3.10.0, opencloud-operators/openshift-marketplace/v3/ibm-platform-api-operator.v3.9.0 and opencloud-operators/openshift-marketplace/v3/ibm-platform-api-operator.v3.9.1 originate from package ibm-platform-api-operator-app, @existing/ibm-common-services//ibm-platform-api-operator.v3.10.0 is mandatory, opencloud-operators/openshift-marketplace/v3/ibm-platform-api-operator.v3.10.0 and @existing/ibm-common-services//ibm-platform-api-operator.v3.10.0 provide PlatformAPI (operator.ibm.com/v1alpha1), subscription ibm-platform-api-operator exists, subscription ibm-platform-api-operator requires at least one of opencloud-operators/openshift-marketplace/v3/ibm-platform-api-operator.v3.10.0, opencloud-operators/openshift-marketplace/v3/ibm-platform-api-operator.v3.9.1 or opencloud-operators/openshift-marketplace/v3/ibm-platform-api-operator.v3.9.0

benluddy commented 3 years ago

I suspect this can happen when an error occurs here: https://github.com/operator-framework/operator-lifecycle-manager/blob/2c623e1e4877608fd16a6089a6aeeac5b1217f18/pkg/controller/operators/catalog/operator.go#L948

Since the InstallPlan is created successfully, the new operator version will be created, but the information necessary to populate the Subscription status is lost.

horis233 commented 3 years ago

@benluddy Thanks for the information.

Please check if my analysis here is correct.

  1. Updating the Subscription status fails at https://github.com/operator-framework/operator-lifecycle-manager/blob/cd40303284a287d6bb920c18807e4f70fd7dd048/pkg/controller/operators/catalog/operator.go#L948

  2. On the next reconcile, operator resolution fails. See https://github.com/operator-framework/operator-lifecycle-manager/issues/2201#issuecomment-865184465 for an example.
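
The two steps above can be illustrated with a toy model of the three constraints quoted in the original log (mandatory, requires at least one of, permits at most 1 of). This is a brute-force sketch for illustration, not how the OLM resolver is implemented. The second check rests on an assumption: that once the Subscription records its installed CSV, the existing CSV counts toward the "requires at least one of" constraint.

```python
from itertools import product

EXISTING = "@existing/ibm-common-services//operand-deployment-lifecycle-manager.v1.6.0"
CATALOG_160 = "opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.6.0"
CATALOG_150 = "opencloud-operators/openshift-marketplace/v3/operand-deployment-lifecycle-manager.v1.5.0"
VARIABLES = [EXISTING, CATALOG_160, CATALOG_150]


def satisfiable(variables, constraints):
    """Try every true/false assignment and report whether any satisfies all constraints."""
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(constraint(assignment) for constraint in constraints):
            return True
    return False


def mandatory(assignment):
    # "@existing/... is mandatory"
    return assignment[EXISTING]


def at_most_one(assignment):
    # "pkgunique/ibm-odlm permits at most 1 of ..."
    return sum(assignment[name] for name in VARIABLES) <= 1


def requires_catalog_entry(assignment):
    # "ibm-odlm requires at least one of ..." as logged: catalog entries only.
    return assignment[CATALOG_160] or assignment[CATALOG_150]


def requires_any_entry(assignment):
    # Assumed shape of the same constraint when the Subscription knows its installed CSV.
    return assignment[EXISTING] or requires_catalog_entry(assignment)


print(satisfiable(VARIABLES, [mandatory, at_most_one, requires_catalog_entry]))
print(satisfiable(VARIABLES, [mandatory, at_most_one, requires_any_entry]))
```

In the first case the existing CSV must be selected and a catalog entry must also be selected, which exceeds the at-most-one limit for the package, so no assignment works.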

@benluddy @njhale Please correct me if I am wrong and please advise if there is an enhancement we can do to prevent this issue.

benluddy commented 3 years ago

Yes, exactly. Setting .status.installedCSV changes the system of constraints when the enclosing namespace is resolved:

That much is a current limitation due to the lack of a globally-unique bundle identity. That is, we can't be sure that a given CSV named "foo" represents exactly the same operator as another named "foo." Also, we don't have a record of the catalog that an operator was installed from -- or whether the catalog contents themselves have changed since installation.

The InstallPlan is supposed to be the record of the changes applied to the namespace due to resolution. Preventing the issues caused by an error on https://github.com/operator-framework/operator-lifecycle-manager/blob/cd40303284a287d6bb920c18807e4f70fd7dd048/pkg/controller/operators/catalog/operator.go#L948 probably involves deriving the relevant parts of Subscription status from the latest InstallPlan.
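
As a rough sketch of what "deriving the relevant parts of Subscription status from the latest InstallPlan" might look like, the code below picks the InstallPlan with the highest spec.generation in a namespace and reads the fields a Subscription status could be rebuilt from. It is not OLM code: the function names and the candidateCSVs key are hypothetical, the InstallPlans are assumed to be plain dicts, and mapping a specific CSV back to a specific Subscription is left out because an InstallPlan can list several CSVs.

```python
def latest_install_plan(install_plans):
    """Return the InstallPlan with the highest spec.generation, or None if there are none."""
    if not install_plans:
        return None
    return max(install_plans, key=lambda plan: plan.get("spec", {}).get("generation", 0))


def derive_status_from_install_plan(install_plans):
    """Collect status fields that could be recovered from the latest InstallPlan."""
    plan = latest_install_plan(install_plans)
    if plan is None:
        return {}
    metadata = plan.get("metadata", {})
    spec = plan.get("spec", {})
    return {
        "installPlanGeneration": spec.get("generation", 0),
        "installPlanRef": {
            "kind": "InstallPlan",
            "name": metadata.get("name"),
            "namespace": metadata.get("namespace"),
        },
        # Hypothetical key: CSVs named by the plan, one of which would
        # become currentCSV for the Subscription in question.
        "candidateCSVs": list(spec.get("clusterServiceVersionNames", [])),
    }


# Illustrative data only; the InstallPlan names are made up.
plans = [
    {"metadata": {"name": "install-aaaaa", "namespace": "ibm-common-services"},
     "spec": {"generation": 1,
              "clusterServiceVersionNames": ["operand-deployment-lifecycle-manager.v1.5.0"]}},
    {"metadata": {"name": "install-bbbbb", "namespace": "ibm-common-services"},
     "spec": {"generation": 2,
              "clusterServiceVersionNames": ["operand-deployment-lifecycle-manager.v1.6.0"]}},
]
print(derive_status_from_install_plan(plans))
```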

horis233 commented 3 years ago

@benluddy @njhale

Is there a plan to fix this defect, or at least to reduce the risk? When this issue happens, operator install and upgrade are blocked and users can't easily find the cause.

gmarcy commented 1 year ago

When debugging the catalog-operator during one of these failed operator deployments, the container crashed with the error "fatal error: concurrent map writes". Details can be found here: crash.log

gmarcy commented 1 year ago

Adding this comment from a thread on the olm-dev channel on Kubernetes Slack:

I have reached a tentative conclusion after several days of continuous testing that the version of OLM I'm using from OCP 4.11.9 is working, and that previous releases of OCP included an OLM with an intermittent container crash.

Since there have been multiple occasions on which new code added to the catalog operator resulted in such a crash, and the OCP process for choosing a version of OLM to ship has been unlucky more than once, I am wondering if there is any way to handle this type of failure better.

taylormgeorge91 commented 1 year ago

We appear to be seeing this more often now. It is having more of an impact on our product teams and has hit some customers. It may be related to https://github.com/openshift/operator-framework-olm/pull/415

anik120 commented 1 year ago

FYI, we just pushed through https://github.com/openshift/operator-framework-olm/pull/415, and this issue should be fixed in the next 4.10.z release that includes it.

teethediva34 commented 1 year ago

@anik120 when will the next 4.10.z release be available? fyi @yuchen-fan

teethediva34 commented 1 year ago

@anik120 is there any update on the fix for this issue in 4.10.z, and is it included in 4.11 and 4.12?

anik120 commented 1 year ago

@teethediva34 this KCS article has all the information about the concurrent map write fix for OCP (including which z streams the fix is available in).

kapilrajyaguru commented 10 months ago

While upgrading CPD from 4.5.3 to 4.7.3, running the apply-olm command produced the following error.

Conditions:
  Last Transition Time:  2023-11-13T21:33:00Z
  Message:               targeted catalogsource ibm-cpd-operators/ibm-cpd-ccs-operator-catalog missing
  Reason:                UnhealthyCatalogSourceFound
  Status:                True
  Type:                  CatalogSourcesUnhealthy
  Message:               constraints not satisfiable: no operators found from catalog ibm-cpd-ccs-operator-catalog in namespace ibm-cpd-operators referenced by subscription ibm-cpd-ccs-operator, subscription ibm-cpd-ccs-operator exists
  Reason:                ConstraintsNotSatisfiable
  Status:                True
  Type:                  ResolutionFailed
Install Plan Generation:  7
Last Updated:             2023-11-13T21:41:25Z
Events:                   <none>

teethediva34 commented 4 months ago

@anik120 This issue is still outstanding, and we are hitting it on OCP versions later than 4.10.

oc version

Client Version: 4.14.17
Kustomize Version: v5.0.1
Server Version: 4.14.17
Kubernetes Version: v1.27.11+d8e449a

Problem Description: We are trying to upgrade from 4.8.4 to 4.8.5 with the following services installed:

cpd_platform,edb_cp4d,mongodb_cp4d,watson_assistant,watson_speech,watsonx_orchestrate

Command used: [image]

[✘] Error in /tmp/work/cpfs_scripts/4.8.5/cp3pt0-deployment/common/utils.sh at line 126 in function wait_for_condition: Timeout after 10 minutes waiting for operator ibm-common-service-operator to be upgraded
[ERROR] 2024-04-19T08:25:15.597425Z cmd.Run() failed with exit status 1
[ERROR] 2024-04-19T08:25:15.597500Z Command exception: The setup-instance-topology command failed (exit status 1). You may find output and logs in the /tmp/work/cpd-cli-workspace/olm-utils-workspace/work directory.
[ERROR] 2024-04-19T08:25:15.598237Z RunPluginCommand:Execution error: exit status 1

The output of oc get subs ibm-common-service-operator -o yaml -n operator-ns showed the error below. We followed https://www.ibm.com/docs/en/cloud-paks/foundational-services/4.5?topic=issues-olm-known-issue-resolutionfailed-message