Data plane should be able to handle connections from several control plane pods

slinkydeveloper commented 3 years ago

When developing multitenant components, this is an issue.

Per https://github.com/knative-sandbox/eventing-kafka-broker/pull/657#issuecomment-831194059

slinkydeveloper commented 3 years ago

So I've analyzed a bit the use cases, in order to find out what we're talking about here. Broadly speaking, we have three non-exclusive solutions to this problem:

When sending from data plane to control plane, just broadcast to all connected controllers. Because the controller then knows if a resource is assigned to it, it will just ignore it, hence this should work without much changes.
Make sure no data plane pod handles resources from multiple control plane pods. In other words, the multi tenant resource scheduling should be aware of the bucket where the resource is assigned, and to which data pod it can communicate. This requires number of control plane pods < number of data plane pods (which is definitely a reasonable assumption).
Create a new API that overlays Service, in order for the data plane to route messages to specific pods. This requires a logical addressing between resources uuid and control plane pods, and such assignment must be kept in sync.

The third solution is awfully complicated. I want to aim for a mixed first and second solution, because I think (and I might be wrong) there's no reason for controllers to scale more than the data plane.

Because the third solution requires a new API, we can always add it later.

slinkydeveloper commented 3 years ago

After banging my head on this problem for 3 days, I've found out broadcasting is not a proper solution. The problem is not of the network protocol implementation per se, but it's a semantic problem: control protocol guarantees at least once delivery assuming that at some point, it will be able to reach the other party who's interested in the message you're sending. With one connection at the time, this is easy to implement: when no connection is there, just enqueue waiting for a new connection to come. But you know that whoever will connect, it's interested in that message.

Introducing broadcasting, this assumption is no longer true: because the data plane doesn't know who's interested in what message, there's no way to guarantee at least once, because even with broadcasting, i might have sent the message only to controllers not interested in that particular message, and I can't do any assumption on when the controller interested in that message is going to reconnect.

IMO we should always enforce the rule "one data plane pod is connected to only one control plane pod", which is the second point of this comment https://github.com/knative-sandbox/control-protocol/issues/66#issuecomment-831274925. This is reasonable, and I can't honestly think to a use case where one user needs more "buckets" than data plane pods.

pierDipi commented 3 years ago

In our Kafka components, we associate a Trigger/Subscription to a consumer group, so we might have n consumers for a given Trigger/Subscription that are scheduled on different data plane pods.

Situation:

3 dispatcher pods
2 controller pods
2 buckets
trigger1 is handled by controller1 (it's the leader of bucket 1)
trigger2 is handled by controller2 (it's the leader of bucket 2)
trigger1 topic has 100 partitions
trigger2 topic has 1 partition
controller1 schedules the first consumer to dispatcher1 (so dispatcher1 can only communicate with controller1)
controller2 schedules the first consumer to dispatcher2 (so dispatcher2 can only communicate with controller2)
now we want to scale trigger1's consumer group to 11 consumers
let's say that a single dispatcher can handle only 5 consumers
controller1 can schedule 5 consumers to dispatcher1, 1 consumer to dispatcher2 and 5 consumers to dispatcher3.
dispatcher2 has to communicate with both controller1 and controller2

I need to think a little more, but there could be other cases where the assumption doesn't hold.

slinkydeveloper commented 3 years ago

now we want to scale trigger1's consumer group to 11 consumers let's say that a single dispatcher can handle only 5 consumers

What does that consumer mean in this context? For the data plane, one trigger == one consumer verticle.

3 dispatcher pods 2 controller pods 2 buckets controller1 can schedule 5 consumers to dispatcher1, 1 consumer to dispatcher2 and 5 consumers to dispatcher3.

As soon as N buckets < N dispatcher pods, the control plane can perform the scheduling in a way that the assumption holds. If I understood your comment, what you're trying to say here is that this might break fairness when N buckets ~= N dispatcher pods ~= N k8s resources (e.g. triggers).

But if you think at a scale, the scale where the user effectively needs to have more than one control plane pods, I expect:

N buckets <<< N dispatcher pods <<< N k8s resources

At least of an order of magnitude, for example N dispatcher pods = 10 * N buckets and N triggers = 10 * N dispatcher pods. Such inequality has very important consequences:

There will be plenty of computing resources to allocate such Triggers, so fairness will be easier to hold.
Assuming distribution of the triggers among the buckets is random, assuming each trigger load is random, the load of each bucket sum (load(t) for each t in bucket x) tend towards sum(load(t) for each t)/N buckets. https://en.wikipedia.org/wiki/Law_of_large_numbers. Because of that, even if the available computing resources associated to a bucket are shortening (because most of dispatcher pods are 100% allocated), that means that computing resources in another bucket are shortening too

pierDipi commented 3 years ago

What does that consumer mean in this context? For the data plane, one trigger == one consumer verticle.

Nope, 1 trigger == 1 consumer group == N consumer verticles (distributed in multiple pods).

what you're trying to say here is that this might break fairness

No, I'm not talking about fairness, it's just an example, you can place 11 consumers however you want in the given dispatcher pool, this

dispatcher2 has to communicate with both controller1 and controller2

still happens.

Also, the ingress tier (receiver) is another example, we use a single service, so every receiver pod should be able to accept events for all brokers, therefore every receiver should be able to communicate with all controllers.

slinkydeveloper commented 3 years ago

1 trigger == 1 consumer group == N consumer verticles (distributed in multiple pods).

We don't have such granularity today AFAIK, right?

Also, the ingress tier (receiver) is another example, we use a single service, so every receiver pod should be able to accept events for all brokers, therefore every receiver should be able to communicate with all controllers.

That's a fair point, TBH I didn't thought about that at all... But, that also means we need on each receiver pod a kafka producer for every Ingress, right? Is that reasonably scalable as well? Shouldn't we try to redirect traffic somehow?

pierDipi commented 3 years ago

We don't have such granularity today AFAIK, right?

We sort of do, I can just scale up and down the dispatcher deployment. When N > num partitions, we have consumers doing nothing but the granularity is there.

Is that reasonably scalable as well? Shouldn't we try to redirect traffic somehow?

Ideally, yes, we should try to redirect traffic, but I wouldn't rely on a single pod anyway since it becomes a SPOF. This is a separate problem tho.

slinkydeveloper commented 3 years ago

Ideally, yes, we should try to redirect traffic, but I wouldn't rely on a single pod anyway since it becomes a SPOF. This is a separate problem tho.

Yeah agree. But, related to this issue, there's no point to have more than one bucket for receivers (aka for broker objects) if then there's no way to partition that load, right?

We sort of do, I can just scale up and down the dispatcher deployment.

Yeah but that's an unintended effect I guess? Or better, that's something we cannot control now, but we will (and granularly) in the future, right?

pierDipi commented 3 years ago

Yeah agree. But, related to this issue, there's no point to have more than one bucket for receivers (aka for broker objects) if then there's no way to partition that load, right?

Data plane traffic is split by the usual mechanism of a k8s service, we don't have a mechanism to redirect it to a subset of replicas. I can have a lot of control-plane activities and low data plane traffic (or the other way around), control-plane and data-plane are 2 separate components and they should be scaled and partitioned indipendently.

Yeah but that's an unintended effect I guess? Or better, that's something we cannot control now, but we will (and granularly) in the future, right?

Why unintended? We can control it atm, but it's just not optimal (more on https://github.com/knative-sandbox/eventing-kafka-broker/issues/785)

slinkydeveloper commented 3 years ago

I can have a lot of control-plane activities and low data plane traffic (or the other way around).

Sure, that might be the case, but it's quite rare compared to the other way around. And still, even with high traffic, the control plane needs to do very little work for each resource. Because there is no partitioning to do, the work is going to be even simpler in this case: just broadcast to everybody the same thing.

That's why I'm questioning in this thread, is it really useful to support horizontal control-plane scaling given the data plane can scale independently from it?

Why unintended? We can control it atm, but it's just not optimal (more on knative-sandbox/eventing-kafka-broker#785)

Ah ok, I get what you mean. With control protocol and scheduling (with a partitioning like described in your proposal), we'll be able to do it too.

So let me ask you another question. If we can't get this issue solved, what alternatives do you propose?

github-actions[bot] commented 3 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

knative-extensions / control-protocol

Data plane should be able to handle connections from several control plane pods #66