pedjak commented 3 years ago

As the investigation in #605 shown, the operator monitors a huge number of resources in all namespaces:

all config maps and secrets
all CSVs
all CRDs owned by CSV

Such strategy puts a very high pressure on the operator's event queue on clusters with non-trivial workload, increasing the operator memory consumption, and eventually crashing it.

In order to reduce resource usage and behave reliably, we should start monitoring resources only if they are referred through a service binding:

at the operator startup, start monitoring service bindings in all or given namespaces
at service binding reconcile, start monitor referred service and application kinds, but only in the namespace of reconciled service binding. On top of that start monitoring service CRD and its CSV if any

This strategy is going to reduce significantly the number of events landing in operator's queue and open up possibility to scale better in large clusters containing a lot of namespaces.

Avni-Sharma commented 3 years ago

I think we should also check the memory leak metrics on a namespace based installation and a cluster wide operator installation. To assess how broad/large the impact is.

wtrocki commented 3 years ago

So feature wise SBOperator should:

Watch only for SBO objects (operate on existing state)
when it is created validate if all provided info is right and resources exist. Provide error if problems as part of the status
Mount resources to the app container according to the spec
new flag should be introduced for SBO to watch secrets and underlying CRs for changes so config can be adjusted for those.. but only if user wants that. I see SBO as tool that does the job when asked (SB object created) and optionally can also watch and mount again changes in service CR or secret - but that is optional - I would not want some operator to randomly redeploy my app every time my CR change

CSV watching should be removed IMHO - it is core of the problem.

wtrocki commented 3 years ago

Also this approach will allow sbo to exist on clusters without putting any load on them. Openshift dedicated on start has 962 objects what operator will watch. When this is applied operator will only watch for SB objects and become invisible to admins when not used

wtrocki commented 3 years ago

With little bit of research I think that moving to that model will not cause any API changes - only behaviour under some corner cases. It is good to adapt this model before 1.0 as later it would be hard to do so without braking people's expectations.

This will:

reduce codebase by half
reduce memory usage and overal footprint of operator when not used
simplify binding processing and it's documentation
would allow to adapt proper error handling using conditions applied in reconcile loop.
would be still relatively easy as it will be based on reduction of the features that are not functioning/little value

example simplifications

Currently operator looks for all CSV on cluster to get spec (even when user do not want to use it). We can make change to fetch spec (metadata) on reconcile loop only when binding was created.

We can stop reconciling binding that has errors continuously by flagging them

We could listen to kubernetes events on the secrets rather than continuous reconciliations

CC @gorkem

sbose78 commented 3 years ago

To improve the performance:

The controller should stop watching changes to CSVs
The controller should watch only specific workload types we've tested with ( may or may not be from OLM operators )
The controller should watch only specific backing service types we've tested with ( may or may not be from OLM operators )

What can be read: The above list doesn't necessarily impact what can be 'read'. The controller may still be allowed to read a wide variety of types - that's OK!

In addition, we must introduce a way for an admin to configure what can be watched.

pedjak commented 3 years ago

What is the usecase we would like to cover by watching anything besides Service Bindings? As @wtrocki stated above, it is very surprising for app developer that binding between application and service changes on its own. Such systems are hard to reason about.

alexeykazakov commented 3 years ago

So, you are going to stop watching CSVs, right? Our Sandbox clusters have thousands of them.

pmacik commented 1 year ago

We are going to implement watches for resources bound via label selectors. The work is tracked here https://issues.redhat.com/browse/APPSVC-1112

redhat-developer / service-binding-operator

Watch only for changes on resources participating in a service binding #846

example simplifications