knative / eventing

Event-driven application platform for Kubernetes
https://knative.dev/docs/eventing
Apache License 2.0

Make Knative eventing control plane more serverless and scalable #2161

Closed. aslom closed this issue 3 years ago.

aslom commented 4 years ago

Problem

When running very small clusters, it would be good to minimize the footprint of the Knative Eventing control plane (as it is used rarely). When running large, possibly multi-tenant clusters, there may be many source and channel types installed that are used rarely but still consume resources (memory, CPU).

Persona:

System Integrator, Event consumer (developer)

Exit Criteria

Knative Eventing runs with the minimal footprint required to operate its control plane.

Time Estimate (optional):

weeks

Additional context (optional)

Ref https://github.com/knative/eventing/issues/2152

aslom commented 4 years ago

We have an ongoing discussion about scaling the Knative Eventing control plane in https://github.com/knative/eventing/issues/2152#issuecomment-550409573; I am moving it here as a separate issue.

aslom commented 4 years ago

The way I see the control-plane scaling problem: each source and channel type runs its own controller, consuming resources (memory, CPU) even when there is nothing to do.

Instead we could run one API server event listener in the control plane (with a minimal footprint, scaling almost to "zero") that spins up a reconciler on demand and stores the status of reconciliation as Kubernetes events for ease of use.
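For illustration, a minimal sketch of such a listener, assuming client-go's dynamic client and a hypothetical PingSource GVR; this is not Knative's actual implementation:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	// Hypothetical GVR; any installed source or channel CRD would be
	// watched the same way.
	gvr := schema.GroupVersionResource{
		Group: "sources.knative.dev", Version: "v1", Resource: "pingsources",
	}

	// A single cheap watch replaces a permanently resident controller:
	// nothing beyond this goroutine runs until an object actually changes.
	w, err := client.Resource(gvr).Namespace(metav1.NamespaceAll).
		Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for ev := range w.ResultChan() {
		if obj, ok := ev.Object.(*unstructured.Unstructured); ok {
			// Spin up a short-lived reconciler only when there is work to do.
			go reconcile(ev.Type, obj)
		}
	}
}

func reconcile(t watch.EventType, obj *unstructured.Unstructured) {
	// Placeholder: a real reconciler would converge the object and record
	// the outcome as a Kubernetes Event on the object, as suggested above.
	fmt.Printf("%s %s/%s\n", t, obj.GetNamespace(), obj.GetName())
}
```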

antoineco commented 4 years ago

You're right about the two scenarios, and each poses a different problem.

In the "very busy" scenario, it is not recommended to scale controllers horizontally, because multiple instances may end up synchronizing the same object at the same time, which could have undesired effects. You would have to move the work queues outside of the controllers if you wanted to scale horizontally, so that one object can only be synchronized once at a given time regardless of the number of active controllers. That's why controllers usually scale vertically instead, with a configurable (possibly dynamic) amount of worker threads.

In the "not busy at all" scenario, scaling down to 0 may provide some benefits, but I doubt there would be much gain if we have to keep another watcher active to wake controllers up whenever an event is received from a watch. Controllers are very passive components, they do use some memory due to caching (in-memory object stores) but barely any CPU when nothing happens to watched resources. We could save on the memory consumption with cache-less watchers and reconcilers though.

The topic is worth discussing though. Thanks for creating this issue!

slinkydeveloper commented 4 years ago

In the "very busy" scenario, it is not recommended to scale controllers horizontally, because multiple instances may end up synchronizing the same object at the same time, which could have undesired effects

Oh, so the k8s APIs don't guarantee that only one instance picks up an object and synchronizes it?

Let me drop the :bomb:: since Knative Eventing is a pluggable system, what if, to make controllers pluggable, we used the Go plugin system to create one big controller that manages all resources and scales? That would surely make the controllers more efficient and also avoid tons of boilerplate.
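For concreteness, a sketch of what that could look like with Go's standard library plugin package (which in practice is Linux-only, and requires host and plugins to be built with the exact same toolchain); the symbol name and .so paths here are hypothetical:

```go
package main

import (
	"fmt"
	"plugin"
)

// Reconciler is the contract every controller plugin would export. In a real
// setup this interface would live in a shared package imported by both the
// host binary and every plugin, so the type assertion below can succeed.
type Reconciler interface {
	Reconcile(key string) error
}

func loadControllerPlugin(path string) (Reconciler, error) {
	p, err := plugin.Open(path) // plugins built with `go build -buildmode=plugin`
	if err != nil {
		return nil, err
	}
	// By convention, each plugin exports `var Controller Reconciler`.
	sym, err := p.Lookup("Controller")
	if err != nil {
		return nil, err
	}
	r, ok := sym.(*Reconciler)
	if !ok {
		return nil, fmt.Errorf("%s: Controller does not implement Reconciler", path)
	}
	return *r, nil
}

func main() {
	// One "big" controller process loading per-resource reconcilers at startup.
	for _, path := range []string{"pingsource.so", "apiserversource.so"} {
		r, err := loadControllerPlugin(path)
		if err != nil {
			fmt.Println("skipping", path, "-", err)
			continue
		}
		_ = r // register r's informers and work queue with the shared manager
	}
}
```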

aslom commented 4 years ago

If one of the plugins panics, wouldn't the mega controller restart, potentially in a loop?

Knative controllers are already doing bundling:

https://github.com/knative/eventing/blob/1cb281d91da222f6f86974d5aca8b04f3e577fa0/cmd/controller/main.go#L36-L44

https://github.com/knative/eventing/blob/1cb281d91da222f6f86974d5aca8b04f3e577fa0/cmd/sources_controller/main.go#L30-L33
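The pattern in the linked files boils down to handing many controller constructors to a single sharedmain entry point; a condensed sketch (the constructor import paths are illustrative stand-ins for the ones in the linked files):

```go
package main

import (
	"knative.dev/pkg/injection/sharedmain"

	"knative.dev/eventing/pkg/reconciler/apiserversource"
	"knative.dev/eventing/pkg/reconciler/pingsource"
)

func main() {
	// One process, one shared informer cache, N reconcilers.
	sharedmain.Main("controller",
		apiserversource.NewController,
		pingsource.NewController,
	)
}
```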

Would using the Go plugin system make it possible to dynamically load a new controller?

slinkydeveloper commented 4 years ago

@aslom of course we need to take care to recover panics, isolating them so the "mega controller" doesn't crash, but yes, that's more or less the idea. This could also improve actual resource (CPU/mem) efficiency by avoiding running multiple controllers when just one could be used.
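A minimal sketch of that isolation, reusing the hypothetical Reconciler contract from the plugin sketch above:

```go
package main

import "fmt"

// Reconciler is the hypothetical contract exported by each plugin.
type Reconciler interface {
	Reconcile(key string) error
}

// safeReconcile converts a panic in a plugin into an ordinary error, so the
// worker loop can log it and move on instead of crashing the whole process.
func safeReconcile(r Reconciler, key string) (err error) {
	defer func() {
		if p := recover(); p != nil {
			err = fmt.Errorf("plugin panicked reconciling %s: %v", key, p)
		}
	}()
	return r.Reconcile(key)
}

// boom stands in for a misbehaving plugin.
type boom struct{}

func (boom) Reconcile(string) error { panic("kaboom") }

func main() {
	if err := safeReconcile(boom{}, "default/x"); err != nil {
		fmt.Println("recovered:", err) // the process keeps running
	}
}
```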

Knative controllers are already doing bundling:

yeah but not the bits from eventing-contrib

I think this could be an interesting first step :)

aslom commented 4 years ago

@slinkydeveloper sounds good to me: a simple experiment to see whether we can get an eventing-contrib "mega" (or mini?) controller to work, and what the difference in CPU/mem usage is? And to see whether it could use plugins?

aslom commented 4 years ago

Following up on the Eventing WG meeting (https://docs.google.com/document/d/1uGDehQu493N_XCAT5H4XEw5T9IWlPN1o19ULOWKuPnY/edit#heading=h.br6wvqieqj5l), recording the related links mentioned in Slack:

From @mikehelmick "Update from WG - the “serverless controllers” for kubernetes work is more of a thought experiment and not as far along as I have thought. So, no prior art to worry about. Still if we come up with something here (along those lines at least), I think it is generally useful to the k8s community."

From @grantr "The prior state of the art is probably metacontroller: https://github.com/GoogleCloudPlatform/metacontroller. But that project seems abandoned. There were a few issues keeping it from being widely usable. IIRC, security (no mTLS on hook deliveries) and inability to receive content of related objects in hook payloads."

Kyle Schlosser (@kyleschlosser), @lionelvillard, and @aslom did work on a serverless operator, and we wrote three blog posts about it:

https://www.ibm.com/cloud/blog/new-builders/being-frugal-with-kubernetes-operators
https://www.ibm.com/cloud/blog/new-builders/being-frugal-with-kubernetes-operators-part-2
https://www.ibm.com/cloud/blog/new-builders/being-frugal-with-kubernetes-operators-part-3

github-actions[bot] commented 3 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

aslom commented 3 years ago

/reopen

slinkydeveloper commented 3 years ago

Is this still relevant? Since I see you worked on this in more specific issues, can we close this more general issue?

github-actions[bot] commented 3 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.