operator-framework / operator-lifecycle-manager

A management framework for extending Kubernetes with Operators
https://olm.operatorframework.io
Apache License 2.0

Per-CRD RBAC resources not being created #1957

Closed LCaparelli closed 3 years ago

LCaparelli commented 3 years ago

Bug Report

What did you do? I am attempting to install an operator via the following CSV:

Click to expand ```yaml apiVersion: v1 items: - apiVersion: operators.coreos.com/v1alpha1 kind: ClusterServiceVersion metadata: annotations: alm-examples: |- [ { "apiVersion": "apps.m88i.io/v1alpha1", "kind": "Nexus", "metadata": { "name": "nexus3" }, "spec": { "networking": { "expose": false }, "persistence": { "persistent": false }, "replicas": 1, "resources": { "limits": { "cpu": "2", "memory": "2Gi" }, "requests": { "cpu": "1", "memory": "2Gi" } }, "useRedHatImage": false } } ] capabilities: Seamless Upgrades categories: Developer Tools certified: "false" containerImage: quay.io/m88i/nexus-operator:0.5.0 createdAt: "2019-11-16T13:12:22Z" description: Nexus Operator to deploy and manage Nexus 3.x servers olm.operatorGroup: test-operators olm.operatorNamespace: test-operators operators.operatorframework.io/builder: operator-sdk-v1.2.0 operators.operatorframework.io/project_layout: go.kubebuilder.io/v2 repository: https://github.com/m88i/nexus-operator support: m88i Labs tectonic-visibility: ocs creationTimestamp: "2021-01-14T12:20:46Z" generation: 1 labels: olm.api.d9aed9862d42d8e0: provided olm.copiedFrom: test-operators managedFields: - apiVersion: operators.coreos.com/v1alpha1 fieldsType: FieldsV1 fieldsV1: f:metadata: f:annotations: .: {} f:alm-examples: {} f:capabilities: {} f:categories: {} f:certified: {} f:containerImage: {} f:createdAt: {} f:description: {} f:olm.operatorGroup: {} f:olm.operatorNamespace: {} f:olm.targetNamespaces: {} f:operators.operatorframework.io/builder: {} f:operators.operatorframework.io/project_layout: {} f:repository: {} f:support: {} f:tectonic-visibility: {} f:labels: .: {} f:olm.api.d9aed9862d42d8e0: {} f:olm.copiedFrom: {} f:spec: .: {} f:apiservicedefinitions: {} f:customresourcedefinitions: .: {} f:owned: {} f:description: {} f:displayName: {} f:icon: {} f:install: .: {} f:spec: .: {} f:clusterPermissions: {} f:deployments: {} f:permissions: {} f:strategy: {} f:installModes: {} f:keywords: {} f:labels: .: {} f:name: {} 
f:links: {} f:maintainers: {} f:maturity: {} f:provider: .: {} f:name: {} f:version: {} f:status: .: {} f:conditions: {} f:lastTransitionTime: {} f:lastUpdateTime: {} f:message: {} f:phase: {} f:reason: {} f:requirementStatus: {} manager: olm operation: Update time: "2021-01-14T12:20:46Z" name: nexus-operator.v0.5.0 namespace: operators resourceVersion: "1431" selfLink: /apis/operators.coreos.com/v1alpha1/namespaces/operators/clusterserviceversions/nexus-operator.v0.5.0 uid: fd231b17-7702-4400-92ec-8c1d2be80c1a spec: apiservicedefinitions: {} customresourcedefinitions: owned: - displayName: Nexus kind: Nexus name: nexus.apps.m88i.io version: v1alpha1 description: |- Creates a new Nexus 3.x deployment in a Kubernetes cluster. Will help DevOps to have a quick Nexus application exposed to the world that can be used in a CI/CD process: * Deploys a new Nexus 3.x server based on either Community or Red Hat images * Creates an [Ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/) in Kubernetes (1.14+) environments to expose the application to the world * On OpenShift, creates a Route to expose the service outside the cluster * Automatically creates Apache, Red Hat and JBoss Maven repositories * Automatically updates Nexus within the same minor version [See our documentation](https://github.com/m88i/nexus-operator/blob/main/README.md) for more installation and usage scenarios. 
If you experience any issues or have any ideas for new features, please [file an issue in our Github repository](https://github.com/m88i/nexus-operator/issues) or send an email to our maillist: [nexus-operator@googlegroups.com](mailto:nexus-operator@googlegroups.com) *Please note that the operator is an individual work and it's not provided nor supported by Sonatype.* displayName: Nexus Operator icon: - base64data:  mediatype: image/svg+xml install: spec: clusterPermissions: - rules: - apiGroups: - apps resources: - deployments verbs: - create - delete - get - list - patch - update - watch - apiGroups: - apps resources: - deployments/finalizers verbs: - update - apiGroups: - apps resources: - replicasets verbs: - get - apiGroups: - apps.m88i.io resources: - nexus verbs: - create - delete - get - list - patch - update - watch - apiGroups: - apps.m88i.io resources: - nexus/finalizers verbs: - get - patch - update - apiGroups: - apps.m88i.io resources: - nexus/status verbs: - get - patch - update - apiGroups: - "" resources: - configmaps verbs: - create - get - apiGroups: - "" resources: - events - persistentvolumeclaims - secrets - serviceaccounts - services verbs: - create - delete - get - list - patch - update - watch - apiGroups: - "" resources: - pods verbs: - get - apiGroups: - monitoring.coreos.com resources: - servicemonitors verbs: - create - get - apiGroups: - networking.k8s.io resources: - ingresses verbs: - create - delete - get - list - patch - update - watch - apiGroups: - route.openshift.io resources: - routes verbs: - create - delete - get - list - patch - update - watch - apiGroups: - authentication.k8s.io resources: - tokenreviews verbs: - create - apiGroups: - authorization.k8s.io resources: - subjectaccessreviews verbs: - create serviceAccountName: default deployments: - name: nexus-operator-controller-manager spec: replicas: 1 selector: matchLabels: control-plane: controller-manager strategy: {} template: metadata: creationTimestamp: null labels: 
control-plane: controller-manager spec: containers: - args: - --secure-listen-address=0.0.0.0:8443 - --upstream=http://127.0.0.1:8080/ - --logtostderr=true - --v=10 image: gcr.io/kubebuilder/kube-rbac-proxy:v0.5.0 name: kube-rbac-proxy ports: - containerPort: 8443 name: https protocol: TCP resources: {} - args: - --metrics-addr=127.0.0.1:8080 - --enable-leader-election command: - /manager image: quay.io/m88i/nexus-operator:0.5.0 name: manager resources: requests: cpu: 100m memory: 20Mi terminationGracePeriodSeconds: 10 permissions: - rules: - apiGroups: - "" resources: - configmaps verbs: - get - list - watch - create - update - patch - delete - apiGroups: - "" resources: - configmaps/status verbs: - get - update - patch - apiGroups: - "" resources: - events verbs: - create - patch serviceAccountName: default strategy: deployment installModes: - supported: true type: OwnNamespace - supported: true type: SingleNamespace - supported: true type: MultiNamespace - supported: true type: AllNamespaces keywords: - nexus - sonatype - maven - docker - ci - continuous integration - continuous delivery - repository - repository manager - dev tools labels: name: nexus-operator links: - name: Documentation url: https://github.com/m88i/nexus-operator/blob/main/README.md - name: Nexus Operator source code repository url: https://github.com/m88i/nexus-operator maintainers: - email: nexus-operator@googlegroups.com name: m88i Labs maturity: alpha provider: name: m88i Labs version: 0.5.0 ```

What did you expect to see? All RBAC resources are created according to the CSV and the operator is installed successfully.

What did you see instead? Under which circumstances? The RBAC resources are not created and installation fails.

I dug around a bit and I believe the most relevant fields are the following annotations:

      olm.operatorGroup: test-operators
      olm.operatorNamespace: test-operators

This OperatorGroup exists and is the only one present in that namespace:

❯ kubectl --kubeconfig /tmp/kubeconfig -n test-operators get operatorgroup
NAME             AGE
test-operators   75m
❯ kubectl --kubeconfig /tmp/kubeconfig -n test-operators get operatorgroup -o yaml
apiVersion: v1
items:
- apiVersion: operators.coreos.com/v1
  kind: OperatorGroup
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"operators.coreos.com/v1","kind":"OperatorGroup","metadata":{"annotations":{},"labels":{"operator":"test"},"name":"test-operators","namespace":"test-operators"}}
      olm.providedAPIs: Nexus.v1alpha1.apps.m88i.io
    creationTimestamp: "2021-01-14T12:20:15Z"
    generation: 2
    labels:
      operator: test
    managedFields:
    - apiVersion: operators.coreos.com/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:kubectl.kubernetes.io/last-applied-configuration: {}
          f:labels:
            .: {}
            f:operator: {}
      manager: kubectl-client-side-apply
      operation: Update
      time: "2021-01-14T12:20:15Z"
    - apiVersion: operators.coreos.com/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            f:olm.providedAPIs: {}
        f:spec: {}
        f:status:
          .: {}
          f:lastUpdated: {}
          f:namespaces:
            .: {}
            v:"": {}
      manager: olm
      operation: Update
      time: "2021-01-14T12:20:46Z"
    name: test-operators
    namespace: test-operators
    resourceVersion: "1408"
    selfLink: /apis/operators.coreos.com/v1/namespaces/test-operators/operatorgroups/test-operators
    uid: 14d88ad1-1e6d-4006-92f6-2e441111e664
  spec: {}
  status:
    lastUpdated: "2021-01-14T12:20:15Z"
    namespaces:
    - ""
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

If I'm reading the OperatorGroup docs correctly, this should suffice for this operator to be a member of the test-operators group. And if I'm reading the RBAC section of the docs correctly, being a member of the OperatorGroup should suffice for the appropriate ClusterRoles to be created following the policies defined in the CSV.

However, these ClusterRoles don't get created:

❯ kubectl --kubeconfig /tmp/kubeconfig get clusterroles | grep -ci nexus
0

Which ultimately causes the deployment to fail as the CRD is installed, the ServiceAccount is present, but the policies are not satisfied by any role:

  status:
    conditions:
    - lastTransitionTime: "2021-01-14T12:20:46Z"
      lastUpdateTime: "2021-01-14T12:20:46Z"
      message: requirements not yet checked
      phase: Pending
      reason: RequirementsUnknown
    - lastTransitionTime: "2021-01-14T12:20:46Z"
      lastUpdateTime: "2021-01-14T12:20:46Z"
      message: one or more requirements couldn't be found
      phase: Pending
      reason: RequirementsNotMet
    lastTransitionTime: "2021-01-14T12:20:46Z"
    lastUpdateTime: "2021-01-14T12:20:46Z"
    message: The operator is running in test-operators but is managing this namespace
    phase: Pending
    reason: Copied
    requirementStatus:
    - group: apiextensions.k8s.io
      kind: CustomResourceDefinition
      message: CRD is present and Established condition is true
      name: nexus.apps.m88i.io
      status: Present
      uuid: e8b9e900-0a5c-4543-af3b-019fda6c8976
      version: v1
    - dependents:
      - group: rbac.authorization.k8s.io
        kind: PolicyRule
        message: namespaced rule:{"verbs":["get","list","watch","create","update","patch","delete"],"apiGroups":[""],"resources":["configmaps"]}
        status: NotSatisfied
        version: v1
      - group: rbac.authorization.k8s.io
        kind: PolicyRule
        message: namespaced rule:{"verbs":["get","update","patch"],"apiGroups":[""],"resources":["configmaps/status"]}
        status: NotSatisfied
        version: v1
      - group: rbac.authorization.k8s.io
        kind: PolicyRule
        message: namespaced rule:{"verbs":["create","patch"],"apiGroups":[""],"resources":["events"]}
        status: NotSatisfied
        version: v1
      - group: rbac.authorization.k8s.io
        kind: PolicyRule
        message: cluster rule:{"verbs":["create","delete","get","list","patch","update","watch"],"apiGroups":["apps"],"resources":["deployments"]}
        status: NotSatisfied
        version: v1
      - group: rbac.authorization.k8s.io
        kind: PolicyRule
        message: cluster rule:{"verbs":["update"],"apiGroups":["apps"],"resources":["deployments/finalizers"]}
        status: NotSatisfied
        version: v1
      - group: rbac.authorization.k8s.io
        kind: PolicyRule
        message: cluster rule:{"verbs":["get"],"apiGroups":["apps"],"resources":["replicasets"]}
        status: NotSatisfied
        version: v1
      - group: rbac.authorization.k8s.io
        kind: PolicyRule
        message: cluster rule:{"verbs":["create","delete","get","list","patch","update","watch"],"apiGroups":["apps.m88i.io"],"resources":["nexus"]}
        status: NotSatisfied
        version: v1
      - group: rbac.authorization.k8s.io
        kind: PolicyRule
        message: cluster rule:{"verbs":["get","patch","update"],"apiGroups":["apps.m88i.io"],"resources":["nexus/finalizers"]}
        status: NotSatisfied
        version: v1
      - group: rbac.authorization.k8s.io
        kind: PolicyRule
        message: cluster rule:{"verbs":["get","patch","update"],"apiGroups":["apps.m88i.io"],"resources":["nexus/status"]}
        status: NotSatisfied
        version: v1
      - group: rbac.authorization.k8s.io
        kind: PolicyRule
        message: cluster rule:{"verbs":["create","get"],"apiGroups":[""],"resources":["configmaps"]}
        status: NotSatisfied
        version: v1
      - group: rbac.authorization.k8s.io
        kind: PolicyRule
        message: cluster rule:{"verbs":["create","delete","get","list","patch","update","watch"],"apiGroups":[""],"resources":["events","persistentvolumeclaims","secrets","serviceaccounts","services"]}
        status: NotSatisfied
        version: v1
      - group: rbac.authorization.k8s.io
        kind: PolicyRule
        message: cluster rule:{"verbs":["get"],"apiGroups":[""],"resources":["pods"]}
        status: NotSatisfied
        version: v1
      - group: rbac.authorization.k8s.io
        kind: PolicyRule
        message: cluster rule:{"verbs":["create","get"],"apiGroups":["monitoring.coreos.com"],"resources":["servicemonitors"]}
        status: NotSatisfied
        version: v1
      - group: rbac.authorization.k8s.io
        kind: PolicyRule
        message: cluster rule:{"verbs":["create","delete","get","list","patch","update","watch"],"apiGroups":["networking.k8s.io"],"resources":["ingresses"]}
        status: NotSatisfied
        version: v1
      - group: rbac.authorization.k8s.io
        kind: PolicyRule
        message: cluster rule:{"verbs":["create","delete","get","list","patch","update","watch"],"apiGroups":["route.openshift.io"],"resources":["routes"]}
        status: NotSatisfied
        version: v1
      - group: rbac.authorization.k8s.io
        kind: PolicyRule
        message: cluster rule:{"verbs":["create"],"apiGroups":["authentication.k8s.io"],"resources":["tokenreviews"]}
        status: NotSatisfied
        version: v1
      - group: rbac.authorization.k8s.io
        kind: PolicyRule
        message: cluster rule:{"verbs":["create"],"apiGroups":["authorization.k8s.io"],"resources":["subjectaccessreviews"]}
        status: NotSatisfied
        version: v1
      group: ""
      kind: ServiceAccount
      message: Policy rule not satisfied for service account
      name: default
      status: PresentNotSatisfied
      version: v1
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
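
The requirementStatus list above is long, so a per-status tally of the PolicyRule dependents can make it easier to see at a glance whether any rule is satisfied. The sketch below assumes the field layout shown in the status output (requirementStatus[].dependents[] with status and message). count_rule_statuses is a hypothetical helper name, and the namespace and CSV name in the usage comment are taken from this report.

```shell
# count_rule_statuses reads "status<TAB>message" lines on stdin (one per
# PolicyRule dependent) and prints "status<TAB>count", sorted by status.
# Hypothetical helper; not part of kubectl or OLM.
count_rule_statuses() {
  awk -F '\t' '{ n[$1]++ } END { for (s in n) printf "%s\t%d\n", s, n[s] }' | sort
}

# Usage against a live cluster (assumes the CSV lives in test-operators):
# kubectl -n test-operators get csv nexus-operator.v0.5.0 \
#   -o jsonpath='{range .status.requirementStatus[*]}{range .dependents[*]}{.status}{"\t"}{.message}{"\n"}{end}{end}' \
#   | count_rule_statuses
```

If every line of the tally reads NotSatisfied, as in the status above, no Role or ClusterRole is covering any of the rules, which points at the RBAC objects never having been created rather than at one missing rule.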

Environment

OLM version: 0.16.1

❯ kubectl --kubeconfig /tmp/kubeconfig version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:09:25Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.1", GitCommit:"206bcadf021e76c27513500ca24182692aabd17e", GitTreeState:"clean", BuildDate:"2020-09-14T07:30:52Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}

Kind v0.9.0

Additional context

Digging into the olm-operator pod logs, I found:

time="2021-01-14T12:20:46Z" level=info msg="csv in operatorgroup" csv=nexus-operator.v0.5.0 id=W7Gfw namespace=test-operators opgroup=test-operators phase=Pending

So it is indeed a member of the expected group. This was followed by:

time="2021-01-14T12:20:46Z" level=info msg="requirements were not met" csv=nexus-operator.v0.5.0 id=W7Gfw namespace=test-operators phase=Pending
time="2021-01-14T12:20:46Z" level=info msg="couldn't ensure RBAC in target namespaces" csv=nexus-operator.v0.5.0 error="no owned roles found" id=7ckY6 namespace=test-operators phase=Pending
E0114 12:20:46.872746       1 queueinformer_operator.go:290] sync {"update" "test-operators/nexus-operator.v0.5.0"} failed: no owned roles found
time="2021-01-14T12:20:47Z" level=info msg="csv in operatorgroup" csv=nexus-operator.v0.5.0 id=D8bUy namespace=test-operators opgroup=test-operators phase=Pending

Full log output

I am facing this issue while I test a new release using these tests, though I'm not sure if this is relevant.

exdx commented 3 years ago

Hi @LCaparelli, thanks for the detailed writeup. I believe we already addressed a bug very similar to this one; I think @dinhxuanvu looked into it. It may be the same thing and already fixed in master. Could you try the same installation off master and see whether the same issue comes up?

LCaparelli commented 3 years ago

EDIT: tried out with minikube v1.16 this time around, btw, not kind

Hey @exdx thanks for the reply! I pulled the repo and ran make run-local while checked out at the master branch.

Installation went fine, so I followed the steps in this guide, though I created the Subscription in the operators namespace instead of default:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: my-nexus-operator-m88i
  namespace: operators
spec:
  channel: alpha
  name: nexus-operator-m88i
  source: my-test-catalog
  sourceNamespace: olm

Which I believe should be enough for the operator to be part of the global-operators group, which is the only one present in that namespace:

❯ kubectl get operatorgroup -n operators
NAME               AGE
global-operators   15m

Indeed the annotations are correct:

❯ kubectl -n operators get csv nexus-operator.v0.5.0 -o json | jq '.metadata.annotations' | grep "  \"olm"
  "olm.operatorGroup": "global-operators",
  "olm.operatorNamespace": "operators",
  "olm.targetNamespaces": "",

So far so good, but still we run into the same issue, no ClusterRole gets created:

❯ kubectl get clusterroles | grep -ci nexus
0

And the CSV's status says we got the CRD and the ServiceAccount, but no roles to satisfy the policies.

Not sure if this is relevant, but I created the catalog image keeping only the operator we're interested in:

❯ kubectl -n olm get catalogsource my-test-catalog
NAME              DISPLAY   TYPE   PUBLISHER   AGE
my-test-catalog             grpc               23m
❯ kubectl get packagemanifests
NAME                  CATALOG   AGE
nexus-operator-m88i             24m

I also deleted the operatorhub CatalogSource for good measure, just to avoid any possible conflicts.

Taking a look at the olm-operator logs they also seem to reflect the same issue: correct OperatorGroup, but fails to ensure RBAC at target namespace.

time="2021-02-09T02:01:25Z" level=info msg="couldn't ensure RBAC in target namespaces" csv=nexus-operator.v0.5.0 error="no owned roles found" id=I9cot namespace=operators phase=Pending
time="2021-02-09T02:01:25Z" level=debug msg="done syncing CSV" csv=nexus-operator.v0.5.0 id=I9cot namespace=operators phase=Pending
E0209 02:01:25.408107       1 queueinformer_operator.go:290] sync {"update" "operators/nexus-operator.v0.5.0"} failed: no owned roles found
time="2021-02-09T02:01:25Z" level=debug msg="copying CSV" csv=nexus-operator.v0.5.0 id=8kxW0 namespace=operators phase=Pending
time="2021-02-09T02:01:25Z" level=debug msg="copying csv to targets" csv=nexus-operator.v0.5.0 id=8kxW0 namespace=operators phase=Pending targetNamespaces=
time="2021-02-09T02:01:25Z" level=debug msg="checking annotations" csv=nexus-operator.v0.5.0 operator-ns=operators target-ns=kube-system
time="2021-02-09T02:01:25Z" level=debug msg="checking status" csv=nexus-operator.v0.5.0 operator-ns=operators target-ns=kube-system
time="2021-02-09T02:01:25Z" level=debug msg="checking annotations" csv=nexus-operator.v0.5.0 operator-ns=operators target-ns=kube-public
time="2021-02-09T02:01:25Z" level=debug msg="checking status" csv=nexus-operator.v0.5.0 operator-ns=operators target-ns=kube-public
time="2021-02-09T02:01:25Z" level=debug msg="checking annotations" csv=nexus-operator.v0.5.0 operator-ns=operators target-ns=kube-node-lease
time="2021-02-09T02:01:25Z" level=debug msg="checking status" csv=nexus-operator.v0.5.0 operator-ns=operators target-ns=kube-node-lease
time="2021-02-09T02:01:25Z" level=debug msg="checking annotations" csv=nexus-operator.v0.5.0 operator-ns=operators target-ns=default
time="2021-02-09T02:01:25Z" level=debug msg="checking status" csv=nexus-operator.v0.5.0 operator-ns=operators target-ns=default
time="2021-02-09T02:01:25Z" level=debug msg="checking annotations" csv=nexus-operator.v0.5.0 operator-ns=operators target-ns=olm
time="2021-02-09T02:01:25Z" level=debug msg="checking status" csv=nexus-operator.v0.5.0 operator-ns=operators target-ns=olm
time="2021-02-09T02:01:25Z" level=debug msg="syncing CSV" csv=nexus-operator.v0.5.0 id=CoaRM namespace=operators phase=Pending
time="2021-02-09T02:01:25Z" level=debug msg="annotations correct" annotationTargets= opgroupTargets=
time="2021-02-09T02:01:25Z" level=debug msg="csv in operatorgroup" csv=nexus-operator.v0.5.0 id=iKgFv namespace=operators opgroup=global-operators phase=Pending
time="2021-02-09T02:01:25Z" level=debug msg="no intersecting operatorgroups provide the same apis" apis=Nexus.v1alpha1.apps.m88i.io csv=nexus-operator.v0.5.0 id=iKgFv namespace=operators phase=Pending
time="2021-02-09T02:01:25Z" level=debug msg="checking if csv is replacing an older version"
time="2021-02-09T02:01:25Z" level=debug msg="unable to get previous csv" error="clusterserviceversions.operators.coreos.com \"nexus-operator.v0.4.0\" not found" replacing=nexus-operator.v0.4.0
time="2021-02-09T02:01:25Z" level=debug msg="perm.ServiceAccountName: default"
time="2021-02-09T02:01:25Z" level=debug msg="perm.ServiceAccountName: default"
time="2021-02-09T02:01:25Z" level=debug msg="permissions/requirements not met" minKubeMet=true permMet=false reqMet=true
time="2021-02-09T02:01:25Z" level=debug msg="checking if csv is replacing an older version"
time="2021-02-09T02:01:25Z" level=debug msg="unable to get previous csv" error="clusterserviceversions.operators.coreos.com \"nexus-operator.v0.4.0\" not found" replacing=nexus-operator.v0.4.0
time="2021-02-09T02:01:25Z" level=info msg="requirements were not met" csv=nexus-operator.v0.5.0 id=iKgFv namespace=operators phase=Pending
time="2021-02-09T02:01:25Z" level=debug msg="opgroup is global" csv=nexus-operator.v0.5.0 opgroup=global-operators
time="2021-02-09T02:01:25Z" level=debug msg="perm.ServiceAccountName: default"
time="2021-02-09T02:01:25Z" level=debug msg="perm.ServiceAccountName: default"
time="2021-02-09T02:01:25Z" level=debug msg="lift roles/rolebindings to clusterroles/rolebindings" csv=nexus-operator.v0.5.0 opgroup=global-operators
time="2021-02-09T02:01:25Z" level=info msg="couldn't ensure RBAC in target namespaces" csv=nexus-operator.v0.5.0 error="no owned roles found" id=CoaRM namespace=operators phase=Pending

So yeah, looks like there is no change from my original report, unfortunately. I'll also attempt to perform the same tests using the bundle format instead when I get some time, do you think that may help? Any other pointers I could try following?

I took a look at the code, and it seems that this is the only place this "lift roles/rolebindings to clusterroles/rolebindings" log can come up in:

https://github.com/operator-framework/operator-lifecycle-manager/blob/2294bcc907c834c160c5b99fbf15988d0706853c/pkg/controller/operators/olm/operatorgroup.go#L445-L448

Which is where the final "no owned roles" error comes from:

https://github.com/operator-framework/operator-lifecycle-manager/blob/2294bcc907c834c160c5b99fbf15988d0706853c/pkg/controller/operators/olm/operatorgroup.go#L456-L464

What I don't understand is why this section attempts to "upgrade" existing Roles into ClusterRoles. Are namespaced RBAC resources created first in all installations of Operators that watch all namespaces? Should we actually be investigating why these aren't created in the first place?

ricardozanini commented 3 years ago

@J0zi @mvalarh looks like we might have a bug in the OLM. That's why our PR can't pass the CI? Can you guys give us a hand or any advice? It's been a month :cry:. Our operator is there waiting to be released on k8s. We already merged the OCP counterpart.

exdx commented 3 years ago

It's not entirely evident on our end what the issue is exactly from the logs, but I think you're right in pinpointing the no owned roles found error message. The packaging format of the operator bundle shouldn't be affecting this however.

Could you say whether you supply the SA and the role in the operator CSV, and what they look like? I can see how, conceptually, we lift roles to clusterroles to enable cluster-scoped operators to function. If you specify a single-namespace installation (via the target namespace field on the operator group) and then install the operator, do you hit the same error? It could be useful to check this.

J0zi commented 3 years ago

OK, we are waiting for a fixed OLM version.

LCaparelli commented 3 years ago

> Could you say whether you supply the SA and the role in the operator CSV, and what they look like?

We don't supply either, and we never did on past releases which work fine.

But I'm confused, aren't these supposed to be generated by OLM based on the CSV and OperatorGroup membership? It's what seems to happen on our previous releases.

In case you want to have a closer look at the PackageManifest and what not, check out operator-framework/community-operators#2933

> If you specify a single-namespace installation (via the target namespace field on the operator group) and then install the operator, do you hit the same error?

I created the following OperatorGroup:

apiVersion: operators.coreos.com/v1alpha2
kind: OperatorGroup
metadata:
  name: my-operatorgroup
  namespace: default
spec:
  targetNamespaces:
  - default

And the following Subscription on the same namespace:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: namespaced-test
  namespace: default
spec:
  channel: alpha
  name: nexus-operator-m88i
  source: my-test-catalog
  sourceNamespace: olm

And the result is the same: CRD present, SA present, no role satisfying the policies.

exdx commented 3 years ago

Potential solution:

LCaparelli commented 3 years ago

Full InstallPlan

So, the status' conditions point to a failure:

❯ kubectl -n operators get installplan install-s8q4j -o yaml | yq -Y '.status.conditions'
- lastTransitionTime: "2021-02-24T21:59:53Z"
  lastUpdateTime: "2021-02-24T21:59:53Z"
  message: the server could not find the requested resource
  reason: InstallComponentFailed
  status: "False"
  type: Installed

Taking a look at individual components (removed the manifest for readability):

❯ kubectl -n operators get installplan install-s8q4j -o yaml | yq -Y '.status.plan[] | del(.resource.manifest)'
resolving: nexus-operator.v0.5.0
resource:
  group: operators.coreos.com
  kind: ClusterServiceVersion
  name: nexus-operator.v0.5.0
  sourceName: my-test-catalog
  sourceNamespace: olm
  version: v1alpha1
status: Present
---
resolving: nexus-operator.v0.5.0
resource:
  group: apiextensions.k8s.io
  kind: CustomResourceDefinition
  name: nexus.apps.m88i.io
  sourceName: my-test-catalog
  sourceNamespace: olm
  version: v1beta1
status: Present
---
resolving: nexus-operator.v0.5.0
resource:
  group: monitoring.coreos.com
  kind: ServiceMonitor
  name: nexus-operator-controller-manager-metrics-monitor
  sourceName: my-test-catalog
  sourceNamespace: olm
  version: v1
status: Unknown
---
resolving: nexus-operator.v0.5.0
resource:
  group: ""
  kind: Service
  name: nexus-operator-controller-manager-metrics-service
  sourceName: my-test-catalog
  sourceNamespace: olm
  version: v1
status: Unknown
---
resolving: nexus-operator.v0.5.0
resource:
  group: rbac.authorization.k8s.io
  kind: ClusterRole
  name: nexus-operator-metrics-reader
  sourceName: my-test-catalog
  sourceNamespace: olm
  version: v1beta1
status: Unknown
---
resolving: nexus-operator.v0.5.0
resource:
  group: ""
  kind: ServiceAccount
  name: default
  sourceName: my-test-catalog
  sourceNamespace: olm
  version: v1
status: Unknown
---
resolving: nexus-operator.v0.5.0
resource:
  group: rbac.authorization.k8s.io
  kind: Role
  name: nexus-operator.v0.5.0-default-7ff78f79d
  sourceName: my-test-catalog
  sourceNamespace: olm
  version: v1
status: Unknown
---
resolving: nexus-operator.v0.5.0
resource:
  group: rbac.authorization.k8s.io
  kind: RoleBinding
  name: nexus-operator.v0.5.0-default-7ff78f79d
  sourceName: my-test-catalog
  sourceNamespace: olm
  version: v1
status: Unknown
---
resolving: nexus-operator.v0.5.0
resource:
  group: rbac.authorization.k8s.io
  kind: ClusterRole
  name: nexus-operator.v0.5.0-568f6bddd8
  sourceName: my-test-catalog
  sourceNamespace: olm
  version: v1
status: Unknown
---
resolving: nexus-operator.v0.5.0
resource:
  group: rbac.authorization.k8s.io
  kind: ClusterRoleBinding
  name: nexus-operator.v0.5.0-568f6bddd8
  sourceName: my-test-catalog
  sourceNamespace: olm
  version: v1
status: Unknown

Oddly enough, it says the SA status is unknown, even though it's there:

❯ kubectl get sa default -n operators
NAME      SECRETS   AGE
default   1         22m

And even though the CSV status deems it present as well:

❯ kubectl -n operators describe csv nexus-operator.v0.5.0
(...)
    Group:      
    Kind:       ServiceAccount
    Message:    Policy rule not satisfied for service account
    Name:       default
    Status:     PresentNotSatisfied
    Version:    v1

I'm not sure what caused the installplan to enter a failed state. Any ideas? Shouldn't we be seeing a failed status for at least one of the resources as well?

Is it possible the installplan enters a failed state incorrectly, causing the controller to give up reconciling the remaining resources prematurely? Would explain why the roles never get created, leading to the "no owned roles" message and the failure to lift roles into clusterroles.

I've been having a look at the code myself and would love to potentially work on a fix if this confirms there's a bug somewhere, but could use some pointers on what to check. :-)

benluddy commented 3 years ago

The resources in an InstallPlan are applied sequentially, so I am interested in the first one that doesn't say Present:

resolving: nexus-operator.v0.5.0
resource:
  group: monitoring.coreos.com
  kind: ServiceMonitor
  name: nexus-operator-controller-manager-metrics-monitor
  sourceName: my-test-catalog
  sourceNamespace: olm
  version: v1
status: Unknown

Is the ServiceMonitor API available on your cluster? I just tried to install your latest downstream bundle on a vanilla cluster and it also fails on the ServiceMonitor step. The catalog-operator logs show:

time="2021-02-25T06:58:30Z" level=debug msg="execute resource" kind=ServiceMonitor name=nexus-operator-controller-manager-metrics-monitor
time="2021-02-25T06:58:30Z" level=info msg="could not query for GVK in api discovery" err="the server could not find the requested resource" group=monitoring.coreos.com kind=ServiceMonitor version=v1

That's consistent with the InstallComponentFailed Condition:

- lastTransitionTime: "2021-02-25T06:58:30Z"
  lastUpdateTime: "2021-02-25T06:58:30Z"
  message: the server could not find the requested resource
  reason: InstallComponentFailed
  status: "False"
  type: Installed

That failure message could be much better.

> Oddly enough, it says the SA status is unknown

Unknown is the initial status for all of the steps in the InstallPlan -- they're updated as the plan executes, but this one terminated before it reached that step.
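
To find the step where a plan stopped, it may help to print the steps in order and pick the first one that is not Present. A minimal sketch, assuming the .status.plan[] layout shown in the InstallPlan output earlier in this thread (resource.kind, resource.name, status). first_unfinished_step is a hypothetical helper, not part of kubectl or OLM.

```shell
# first_unfinished_step reads "kind<TAB>name<TAB>status" lines on stdin, one
# per InstallPlan step in plan order, and prints the first step whose status
# is not "Present". Later steps are ignored because they were never reached.
first_unfinished_step() {
  awk -F '\t' '$3 != "Present" { print; exit }'
}

# Usage against a live cluster (InstallPlan name taken from this thread):
# kubectl -n operators get installplan install-s8q4j \
#   -o jsonpath='{range .status.plan[*]}{.resource.kind}{"\t"}{.resource.name}{"\t"}{.status}{"\n"}{end}' \
#   | first_unfinished_step
```

Since every step starts as Unknown, the line this prints is the step to investigate; any Unknown steps after it say nothing about those resources themselves.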

ricardozanini commented 3 years ago

@LCaparelli @J0zi looks like the Community Operators CI needs to have the Prometheus CRDs available, or we can work on an install script to not install our manifests if ServiceMonitor is not available in the cluster. That's why the CI passed in OCP (which should have Prometheus Operator installed), but not on k8s.

@LCaparelli for now let's just remove the ServiceMonitor from our bundle to unblock this release.

@J0zi this can be a problem for others as well.

> That failure message could be much better.

Can we use this issue to update this message? Many thanks @benluddy !!!

benluddy commented 3 years ago

You may want to specify ServiceMonitor as a required API in your CSV spec (https://github.com/operator-framework/operator-lifecycle-manager/blob/master/doc/design/building-your-csv.md#required-crds), which would cause OLM to attempt to automatically install an operator (e.g. Prometheus) that provides it. If you don't want the dependency automatically satisfied, you can also use nativeAPIs to declare a GVK that you expect to be available on the cluster (https://olm.operatorframework.io/docs/tasks/creating-operator-manifests/#nativeapis-recommended). That should cause your CSV status to reflect the missing API.

LCaparelli commented 3 years ago

Ah, this makes more sense, thanks a lot @benluddy!

From v0.4.0 to v0.5.0 we updated the sdk and this sort of "sneaked in". This whole thing was driving me crazy hahahaha

I'm not sure we want to add it as a hard dependency, the operator works just fine without servicemonitors. It's more of a "if it's available, use it. Don't worry if it isn't" sort of scenario. From what I can tell, either option (required API or nativeAPIs) will not successfully install if the GVK isn't present in the cluster, right?

Is there a way to declare a "weak" dependency? It looks like this was brought up before (#819), but I'm under the impression that this still isn't supported.

If a user installs our operator by simply applying a manifest with everything we wanted to have, it might fail to apply the ServiceMonitor, but the operator itself will still be installed and available. On the other hand, it seems we can't do that with OLM, which we were hoping would be the "canon" way to install our operator. As suggested in that issue, we may document this and suggest that users install these resources manually, but that's not ideal. OLM makes managing operators a no-brainer task; it shouldn't require users to find "missing" parts (even if not fully required) in upstream docs IMO. What do you think?
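
Outside OLM, the "use it if it's available" behaviour can be approximated with a discovery check before applying the optional manifest. The sketch below only illustrates that idea for a plain-manifest install and is not something OLM does: has_api_resource is a hypothetical helper and servicemonitor.yaml is a hypothetical file name. The only cluster call it relies on is kubectl api-resources with --api-group and -o name.

```shell
# has_api_resource succeeds if API discovery lists the given resource.
# $1 is "plural.group", e.g. servicemonitors.monitoring.coreos.com, which is
# the form printed by `kubectl api-resources -o name`.
has_api_resource() {
  # "${1#*.}" strips everything up to the first dot, leaving the API group.
  kubectl api-resources --api-group="${1#*.}" -o name 2>/dev/null | grep -qxF "$1"
}

# Usage (hypothetical manifest path):
# if has_api_resource servicemonitors.monitoring.coreos.com; then
#   kubectl apply -f servicemonitor.yaml
# else
#   echo "ServiceMonitor API not available; skipping the metrics monitor" >&2
# fi
```

The check uses a fixed-string, whole-line match so that the dots in the resource name are not treated as regular-expression wildcards.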

LCaparelli commented 3 years ago

Also... Shouldn't this have been marked as failed then? Not unknown? :thinking:

resolving: nexus-operator.v0.5.0
resource:
  group: monitoring.coreos.com
  kind: ServiceMonitor
  name: nexus-operator-controller-manager-metrics-monitor
  sourceName: my-test-catalog
  sourceNamespace: olm
  version: v1
status: Unknown

LCaparelli commented 3 years ago

Anyway, thank you all very much for your support. Closing the issue, as the original problem was fixed. :-)