operator-framework / operator-lifecycle-manager

A management framework for extending Kubernetes with Operators
https://olm.operatorframework.io
Apache License 2.0
1.72k stars 545 forks source link

when create multiple subscriptions,installplan status is null #2686

Open aiyijing opened 2 years ago

aiyijing commented 2 years ago

Bug Report

What did you do? I created multiple subscriptions in a short time, then these subscription status is InstallPlanPending, Like the following:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
...
status:
  catalogHealth:
    - catalogSourceRef:
        apiVersion: operators.coreos.com/v1alpha1
        kind: CatalogSource
        name: catalog
        namespace: cpaas-system
        resourceVersion: '5728'
        uid: 514fc6ad-b5e5-4ade-a731-37ba50f7eeda
      healthy: true
      lastUpdated: '2022-03-10T10:02:41Z'
    - catalogSourceRef:
        apiVersion: operators.coreos.com/v1alpha1
        kind: CatalogSource
        name: system
        namespace: cpaas-system
        resourceVersion: '5435'
        uid: 2aae92cc-f946-4cac-ac3f-4459d793a7fb
      healthy: true
      lastUpdated: '2022-03-10T10:02:41Z'
  conditions:
    - lastTransitionTime: '2022-03-10T10:02:41Z'
      message: all available catalogsources are healthy
      reason: AllCatalogSourcesHealthy
      status: 'False'
      type: CatalogSourcesUnhealthy
    - lastTransitionTime: '2022-03-10T10:08:47Z'
      reason: InstallPlanNotYetReconciled
      status: 'True'
      type: InstallPlanPending
  currentCSV: redis-operator.v3.8.0-13-gc1a15353
  installPlanGeneration: 439
  installPlanRef:
    apiVersion: operators.coreos.com/v1alpha1
    kind: InstallPlan
    name: install-7g6lp
    namespace: operators
    resourceVersion: '94336'
    uid: ae6b8a8b-126a-40ce-b4fc-a213fb7143d3
  installplan:
    apiVersion: operators.coreos.com/v1alpha1
    kind: InstallPlan
    name: install-7g6lp
    uuid: ae6b8a8b-126a-40ce-b4fc-a213fb7143d3
  lastUpdated: '2022-03-10T10:08:47Z'
  state: UpgradePending

I can see the installplan status is nil

apiVersion: operators.coreos.com/v1alpha1
kind: InstallPlan
metadata:
...
spec:
  approval: Automatic
  approved: true
  clusterServiceVersionNames:
  - rds-operator.v3.8.0
  - pmm-operator.v3.8.0
  - percona-xtradb-cluster-operator.v3.8.0
  - strimzi-kafka-operator.v3.1.0
  - tekton.v0.54.0
  - redis-operator.v0.8.0
  generation: 439

At the same time,the catalog-operator logs:

time="2022-03-10T10:08:41Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:41Z" level=info msg=syncing id=fi43d ip=install-shmzv namespace=operators phase=
time="2022-03-10T10:08:41Z" level=info msg="skip processing installplan without status - subscription sync responsible for initial status" id=fi43d ip=install-shmzv namespace=operators phase=
time="2022-03-10T10:08:41Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:41Z" level=info msg=syncing id=vK+jd ip=install-fr9n8 namespace=operators phase=
time="2022-03-10T10:08:41Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:41Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:41Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:41Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:41Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:41Z" level=info msg="skip processing installplan without status - subscription sync responsible for initial status" id=vK+jd ip=install-fr9n8 namespace=operators phase=
E0310 10:08:41.682192       1 queueinformer_operator.go:290] sync "operators" failed: etcdserver: request is too large
E0310 10:08:41.894006       1 queueinformer_operator.go:290] sync {"update" "operators"} failed: etcdserver: request is too large
time="2022-03-10T10:08:41Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:41Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:41Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:41Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:41Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:41Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:42Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:42Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:42Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:42Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:42Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:42Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:44Z" level=warning msg="no installplan found with matching generation, creating new one" id=FRSXt namespace=operators
time="2022-03-10T10:08:44Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:44Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:44Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:44Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:44Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:44Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:44Z" level=info msg=syncing id=D8ka2 ip=install-7g6lp namespace=operators phase=
time="2022-03-10T10:08:44Z" level=info msg="skip processing installplan without status - subscription sync responsible for initial status" id=D8ka2 ip=install-7g6lp namespace=operators phase=
time="2022-03-10T10:08:45Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:45Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:45Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:45Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:45Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:45Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:46Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:46Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:46Z" level=info msg=syncing id=zRSXl ip=install-7g6lp namespace=operators phase=
time="2022-03-10T10:08:46Z" level=info msg="skip processing installplan without status - subscription sync responsible for initial status" id=zRSXl ip=install-7g6lp namespace=operators phase=
time="2022-03-10T10:08:46Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:46Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:46Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-10T10:08:46Z" level=info msg=syncing id=UyQb+ ip=install-7g6lp namespace=operators phase=
time="2022-03-10T10:08:46Z" level=info msg="skip processing installplan without status - subscription sync responsible for initial status" id=UyQb+ ip=install-7g6lp namespace=operators phase=
time="2022-03-10T10:08:46Z" level=info msg=syncing id=xcBMx ip=install-7g6lp namespace=operators phase=
time="2022-03-10T10:08:46Z" level=info msg="skip processing installplan without status - subscription sync responsible for initial status" id=xcBMx ip=install-7g6lp namespace=operators phase=
time="2022-03-10T10:08:46Z" level=info msg=syncing id=RcYlH ip=install-7g6lp namespace=operators phase=
time="2022-03-10T10:08:46Z" level=info msg="skip processing installplan without status - subscription sync responsible for initial status" id=RcYlH ip=install-7g6lp namespace=operators phase=
time="2022-03-10T10:08:47Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=

Environment

Possible Solution

I guess catalog-operator needs to retry here https://github.com/operator-framework/operator-lifecycle-manager/blob/2d649b0d5935ecfecfefbe56f266cc7b04d0b290/pkg/controller/operators/catalog/operator.go#L1255-L1260

exdx commented 2 years ago

Hi @aiyijing,

Looking at this issue more in-depth it's not immediately clear what the problem is. The interesting log here is

sync "operators" failed: etcdserver: request is too large

it's an unusual error. When you encounter this issue, could you also provide the api-server logs on the cluster? That should help us look into this issue more.

Also, for reproducibility, if you could provide the manifests (subscription, catalogsources) that you used to generate the installplan that would also be helpful.

aiyijing commented 2 years ago

@exdx Unfortunately, the log no longer exists。

But I remember that I checked the audit log about apiserver. The catalog-operator update installplan.status caused the apiserver/etcd 500 error.

As you said, maybe installplan.status is too large?

I found some instructions about etcd https://etcd.io/docs/v3.3/dev-guide/limit/

So,Does installplan need to limit the number of CSV installations or updates? 😄😄😄

aiyijing commented 2 years ago

There is very confused why single installplan need to install multiple csv and dependent resource of csv. Maybe it need to improve here because of etcd requests limit

dmesser commented 2 years ago

@aiyijing This is because OLM install / update resolution logic works at the namespace level and therefore considers all operators currently installed in the namespace via Subscription. The solution is to install these operators in other namespaces.

aiyijing commented 2 years ago

@dmesser This seems to be another solution: Installplan be split according to Resolving of step, and then create and update the installplan.

Otherwise we have to modify etcd request limit,obviously it is not reasonable.

aiyijing commented 2 years ago

There are more restrictions:

I extend etcd --max-request-bytes to 10Mi but it is sadly that grpc server has restrictions of grpc send message size limit.

If I install a lot of operator in global NS and at upgrade the index image, all operators will fail to upgrade because installplan cannot be updated successfully.

time="2022-03-29T17:16:17Z" level=warning msg="no installplan found with matching generation, creating new one" id=RpdmR namespace=operators
time="2022-03-29T17:16:17Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-29T17:16:17Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-29T17:16:17Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-29T17:16:17Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-29T17:16:17Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-29T17:16:17Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-29T17:16:17Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-29T17:16:17Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-29T17:16:17Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-29T17:16:17Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-29T17:16:17Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-29T17:16:17Z" level=info msg=syncing id=wdOG+ ip=install-fnm47 namespace=operators phase=
time="2022-03-29T17:16:17Z" level=info msg="skip processing installplan without status - subscription sync responsible for initial status" id=wdOG+ ip=install-fnm47 namespace=operators phase=
time="2022-03-29T17:16:17Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-03-29T17:16:17Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
E0329 17:16:18.184570       1 queueinformer_operator.go:290] sync "operators" failed: rpc error: code = ResourceExhausted desc = trying to send message larger than max (2627600 vs. 2097152)