Open Daniel-Fan opened 2 years ago
I am seeing this bug relatively frequently on a couple different clusters. Is there any plan to resolve this or at least make the bug more clear? The install plan reports that the clusterrole and clusterrolebinding are present when they do not exist on the cluster. Is it possible to include some kind of check in OLM to verify the presence of these resources and then re-attempt the install/creation of these resources?
The install plan was designed as a one-shot style API. It sounds like the install plan either found or successfully created those objects at the time it ran, but OLMv0 will not re-create them if they were deleted after the fact.
We know this is a limitation of OLMv0 and have accounted for it in the OLMv1 design that we are currently implementing. Between the architecture of the API and our attention turned primarily toward OLMv1, there is no plan to make changes in this area of OLMv0.
However, if the bug is "I can prove that the InstallPlan reconciliation lied about creating the objects in the first place", that is something we would consider fixing.
@joelanford I am not sure whether the resource is created and later deleted, or whether it is never present and the install plan is "lying", but I will try to find evidence either way. The difficult part is that the issue is intermittent and only recognizable once whatever relies on these ClusterRole/ClusterRoleBindings is far enough along in its deployment to start failing for lack of permissions outside its home namespace.
@joelanford We have been able to reproduce this issue consistently on the same cluster (and occasionally on different clusters) over the last week. To verify whether or not the ClusterRole/ClusterRoleBindings were created, we checked the kube-apiserver and openshift-apiserver audit logs. Searching these audit logs for the ClusterRole/ClusterRoleBinding (which share a name) returns 0 results. The name does not appear in any create or delete event, which we would expect to see if the object had been successfully created by the install plan and then deleted by something else. To confirm that the audit logs do capture such events, we created/edited/deleted a dummy ClusterRole/ClusterRoleBinding and saw it included as a create/edit/delete event in the audit logs.
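For anyone who wants to repeat the audit-log search described above, here is a minimal sketch in Python. It assumes the audit logs are JSON lines in the standard Kubernetes audit event format (fields such as `verb`, `objectRef`, `user` and `stageTimestamp`); the function name `rbac_audit_events` is made up for illustration. Because the object name of a create request may only be present in the request body, the sketch also looks at `requestObject.metadata.name`, which is only logged at audit levels that record request bodies.

```python
import json

# Resource names as they appear in objectRef.resource of an audit event.
RBAC_RESOURCES = {"clusterroles", "clusterrolebindings"}


def rbac_audit_events(lines, name):
    """Yield (timestamp, verb, resource, username) for events touching `name`.

    lines: iterable of audit log lines, one JSON event per line (assumed format).
    name:  the shared ClusterRole/ClusterRoleBinding name to search for.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(event, dict):
            continue
        ref = event.get("objectRef") or {}
        if ref.get("resource") not in RBAC_RESOURCES:
            continue
        # On create requests the name may only appear in the request body.
        body = event.get("requestObject") or {}
        body_name = (body.get("metadata") or {}).get("name")
        if name not in (ref.get("name"), body_name):
            continue
        yield (
            event.get("stageTimestamp"),
            event.get("verb"),
            ref.get("resource"),
            (event.get("user") or {}).get("username"),
        )
```

Feeding it an open audit log file, for example `list(rbac_audit_events(open(path), name))`, returns an empty list when the name never appears in a ClusterRole or ClusterRoleBinding event, which matches the "0 results" observation. Running it with the dummy object's name should return its create/edit/delete events.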
Open questions:
@joelanford can we get word on whether this will be addressed?
Bug Report
What did you do? Install ibm-namespace-scope-operator by creating a subscription.
What did you expect to see? The operator is installed successfully and the CSV is in the succeeded phase.
What did you see instead? Under which circumstances? The operator is in the Pending phase and shows an error.
Environment
Possible Solution
This looks like an intermittent issue in OLM. A re-installation should solve the problem, but we have not applied that workaround yet in case further information is required.
Additional context
InstallPlan for operator is completed
ClusterRoleBinding shows created in InstallPlan
ClusterRoleBinding is missing in cluster
OLM logs
CSV for namespace-scope operator
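The mismatch listed above (InstallPlan step shows the ClusterRoleBinding as created, but the object is missing on the cluster) can be checked mechanically. The following is a rough sketch in Python, not something OLM provides. It assumes the InstallPlan JSON exposes its steps under `status.plan`, each with a `resource` (`kind`, `name`) and a `status` string such as `Created` or `Present`. The function names are hypothetical.

```python
# Step statuses that claim the object exists (assumed OLMv0 status strings).
CLAIMS_EXISTING = {"Created", "Present"}
RBAC_KINDS = {"ClusterRole", "ClusterRoleBinding"}


def existing_from_list(kube_list):
    """Build a set of (kind, name) pairs from a Kubernetes List object,
    e.g. the parsed output of
    `kubectl get clusterrole,clusterrolebinding -o json`."""
    return {
        (item["kind"], item["metadata"]["name"])
        for item in kube_list.get("items", [])
    }


def missing_rbac_steps(install_plan, existing):
    """Return (kind, name) pairs the InstallPlan claims exist but are absent.

    install_plan: dict parsed from `kubectl get installplan <name> -o json`.
    existing:     set of (kind, name) pairs actually found on the cluster.
    """
    missing = []
    for step in install_plan.get("status", {}).get("plan", []):
        resource = step.get("resource", {})
        key = (resource.get("kind"), resource.get("name"))
        if key[0] not in RBAC_KINDS:
            continue  # only cluster-scoped RBAC objects are of interest here
        if step.get("status") in CLAIMS_EXISTING and key not in existing:
            missing.append(key)
    return missing
```

A non-empty result only shows that the object is missing now. It cannot tell whether the object was never created or was deleted afterwards, so it complements the audit-log evidence and does not replace it.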