Closed pedjak closed 1 year ago
I think we should also check the memory leak metrics on a namespace based installation and a cluster wide operator installation. To assess how broad/large the impact is.
So feature wise SBOperator should:
CSV watching should be removed IMHO - it is core of the problem.
Also this approach will allow sbo to exist on clusters without putting any load on them. Openshift dedicated on start has 962 objects what operator will watch. When this is applied operator will only watch for SB objects and become invisible to admins when not used
With little bit of research I think that moving to that model will not cause any API changes - only behaviour under some corner cases. It is good to adapt this model before 1.0 as later it would be hard to do so without braking people's expectations.
This will:
Currently operator looks for all CSV on cluster to get spec (even when user do not want to use it). We can make change to fetch spec (metadata) on reconcile loop only when binding was created.
We can stop reconciling binding that has errors continuously by flagging them
We could listen to kubernetes events on the secrets rather than continuous reconciliations
CC @gorkem
To improve the performance:
What can be read: The above list doesn't necessarily impact what can be 'read'. The controller may still be allowed to read a wide variety of types - that's OK!
In addition, we must introduce a way for an admin to configure what can be watched.
What is the usecase we would like to cover by watching anything besides Service Bindings? As @wtrocki stated above, it is very surprising for app developer that binding between application and service changes on its own. Such systems are hard to reason about.
So, you are going to stop watching CSVs, right? Our Sandbox clusters have thousands of them.
We are going to implement watches for resources bound via label selectors. The work is tracked here https://issues.redhat.com/browse/APPSVC-1112
As the investigation in #605 shown, the operator monitors a huge number of resources in all namespaces:
Such strategy puts a very high pressure on the operator's event queue on clusters with non-trivial workload, increasing the operator memory consumption, and eventually crashing it.
In order to reduce resource usage and behave reliably, we should start monitoring resources only if they are referred through a service binding:
This strategy is going to reduce significantly the number of events landing in operator's queue and open up possibility to scale better in large clusters containing a lot of namespaces.