kedacore / keda

KEDA is a Kubernetes-based Event Driven Autoscaling component. It provides event driven scale for any container running in Kubernetes
https://keda.sh
Apache License 2.0

KEDA for multi-cluster use-case #1587

Open dmeytin opened 3 years ago

dmeytin commented 3 years ago

Support workload expansion on multiple clusters

Use-Case

HTTP/gRPC-based workloads have first-class support for multi-cluster expansion through several multi-cluster ingress controllers, but queue workers are more challenging to schedule across clusters. The use cases where multi-cluster support is useful are the following:

Specification

It would be great to have a prototype for Kubernetes Federation that enables ScaledObject/ScaledJob integration with FederatedDeployment/FederatedJob.
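
To make the request concrete, here is a purely hypothetical sketch of what such an integration could look like: a ScaledObject whose scaleTargetRef points at a kubefed FederatedDeployment. This assumes FederatedDeployment exposed a /scale subresource that KEDA could drive, which is not established anywhere in this thread; the trigger type and all metadata values are placeholders.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker-federated
spec:
  scaleTargetRef:
    # Hypothetical target: assumes FederatedDeployment could be scaled via /scale
    apiVersion: types.kubefed.io/v1beta1
    kind: FederatedDeployment
    name: queue-worker
  minReplicaCount: 0
  maxReplicaCount: 100
  triggers:
    - type: kafka                       # placeholder trigger
      metadata:
        bootstrapServers: kafka:9092    # placeholder
        consumerGroup: queue-worker     # placeholder
        topic: work-items               # placeholder
        lagThreshold: "20"
```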

tomkerkhove commented 3 years ago

WDYT @zroubalik @jeffhollan @ahmelsayed @anirudhgarg ?

coderanger commented 3 years ago

controller-runtime support for multi-cluster operators is still in progress. The latest versions have some basics but are probably not ready for full use yet. I would very much avoid spending time on integrating with kubefed tooling, given that its level of community uptake has remained extremely low, and last I heard the SIG was focusing on a kubefed2 rebuild.

dmeytin commented 3 years ago

That is correct, but the link I referenced is to the kubefed2 project. Still, I believe it could be a great showcase for expansion across multiple clusters. Queue workers are a significant part of modern workloads, but they are left without appropriate treatment for many of a cluster's day-two operations.

coderanger commented 3 years ago

The overall direction of controller-runtime is for multi-cluster operators to handle things directly (i.e. talk to the API of every cluster) rather than use a secondary federation backend. Not required, of course, but we're explicitly trying to support that use case.

dmeytin commented 3 years ago

Sounds great. Can you please share a link to the PR so I can follow it?

coderanger commented 3 years ago

There's a lot of little pieces. https://github.com/kubernetes-sigs/controller-runtime/pull/1075 has already happened (there's now a Cluster struct distinct from Manager) and https://github.com/kubernetes-sigs/controller-runtime/pull/1192/files will allow multi-instantiation which is part of the same overarching use case in the end.

dmeytin commented 3 years ago

Awesome! I would recommend opening a parent feature that aggregates all the pieces together, for simplicity of tracking. I'm willing to be a beta tester for this feature. In general, the solution will not be complete without a multi-cluster ingress controller and a DNS service. For a full reference implementation we should add these components from existing third-party projects. WDYT?

coderanger commented 3 years ago

Maybe? KEDA is usually scaling some kind of task worker or consumer system. These are not usually the same components as the web services; instead they communicate through some kind of broker, which KEDA monitors, creating or removing consumers as needed :) The web tier, if you want a federated approach, would 100% need the tools you describe. But the kinds of consumer pods/jobs/etc. that KEDA is managing are usually separate from that (maybe needing some Service integration for Prometheus metrics discovery, but more often you would do that cluster-local and federate at the Prometheus level instead, since it has powerful tooling for that already).

There's work being done in the keda-http addon to look at request-based scaling for web services; that would also need this kind of thing, but it's still in the early phases, so let's get it working in a simpler setup first :D

dmeytin commented 3 years ago

That is absolutely correct. I agree - it's better to keep components loosely coupled and avoid unnecessary integrations. As an infra provider I will need to glue all the components together into a holistic solution, and it would be great to ensure that the integration goes smoothly.

dmeytin commented 3 years ago

Do we have any news on this issue?

tomkerkhove commented 3 years ago

Any thoughts on this @zroubalik?

zroubalik commented 3 years ago

I am happy to see any POC :)

dmeytin commented 3 years ago

I have a list of use cases for multi-cluster support:

All of the use cases above could be satisfied by having the required distribution of the workload across clusters, plus the ability to fill the gap when other clusters do not complete the request before the timeout.

Does it make sense?

zroubalik commented 3 years ago

Yeah, that does make sense.

What I'd love to see is an actual proposal for how we want to achieve this from a technical point of view.

dmeytin commented 3 years ago

We need to add a few operations:

1) Join cluster - responsible for configuring the KEDA operators of the other clusters in the squad with the certificate of the current cluster.

2) ScaledObject extra configuration, for example:

  - capacity:
     cluster-eus: 20
     cluster-eus: 20
     cluster-wus: 60

The configuration done on any cluster is immediately synced with the squad (if the configurations differ, the last one wins).

The actual replica counts will be synced with the squad as well.

When one of the clusters has a discrepancy between its requested capacity and its actual size, the cluster's status will be set to frozen and its required capacity will not be increased.

When one of the clusters fails to communicate with the squad, it will be set to unhealthy and its capacity will be decreased to zero.

A frozen cluster will be periodically tested to check whether it can scale again.

For cross-cluster data synchronization we can use HashiCorp Consul.

3) Leave cluster - this operation will notify the squad that the cluster is leaving.
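
To make the proposal easier to picture, here is a purely hypothetical sketch of where such a capacity section might sit inside a ScaledObject. The multiCluster/capacity fields do not exist in KEDA today, and the cluster names, trigger type, and values are placeholders.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker
spec:
  scaleTargetRef:
    name: queue-worker
  # Hypothetical extension - not part of the current KEDA API
  multiCluster:
    capacity:              # desired share of replicas per cluster, in percent
      cluster-eus: 20      # placeholder cluster names
      cluster-cus: 20
      cluster-wus: 60
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: queue-worker
        topic: work-items
        lagThreshold: "20"
```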

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

dmeytin commented 2 years ago

@tomkerkhove, what will happen if KEDA is run jointly with Admiralty? Maybe that would achieve the requested functionality?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

rwkarg commented 2 years ago

We run in a multi-cluster scenario (for the ability to destroy/recreate individual clusters without impact) and I don't know that this can be addressed outside of a full scheduling manager.

One option: if KEDA knew how many clusters were participating, it could scale the reported metrics by that proportion (e.g. with 2 clusters, report metrics at 50% of their value). This would put roughly half the instances on each cluster. KEDA would need to constantly check whether all participating clusters are "working", to know if one of them is down or unable to provision more instances. If a cluster is down, then updating the cluster count to rescale the metrics would distribute load to the other clusters. "Unable to scale instances" is harder to detect. A workload at 2/10 is below the threshold, but it's not clear why. Is the cluster hitting some quota that will never let that workload scale? Is it waiting for a cluster auto-scale event to get more worker nodes? Is it pulling an image or just waiting for initial scheduling? Or are the new instances failing to start up? There doesn't seem to be a clear way to determine this in general.

The other option is to set your Trigger values based on the expected cluster count and desired resilience. If you expect to normally have 4 clusters and want to be able to tolerate losing one of them, then scale your Trigger config to 3x what you want in total.

Example (normally 4 clusters, tolerate losing one):

Desired global trigger value: 600
Individual cluster trigger value: 600 * (4 - 1) = 600 * 3 = 1800

which results in

Example raw metric value (from a source like queue length): 1800
Perfectly efficient global instance count (same as if running on one cluster): 3
Actual global instance count (when 4 clusters are running): 4 (one per cluster)

This will result in having an "extra" 33% of instances running when there are 4 clusters, but:

Running multi-cluster implies a level of overprovisioning as the tradeoff for decreased risk. If your capacity is "efficiently" packed so that you aren't overprovisioned, then you don't get the risk mitigation of multiple clusters (if you have two clusters, you need to be 100% overprovisioned to be able to absorb losing one of them).

Of the two methods above, the first (actively discovering and monitoring clusters and trying to infer whether they're "healthy" for scaling or not) seems like a very involved and nuanced problem that may be difficult to generalize. The second, over-provisioning method (scaling the trigger values per cluster) is achievable today. While more manual, it does provide reliability at the desired level of overprovisioning. This could potentially be formalized by having an optional multi-cluster scaling section on triggers that allows configuring the maximum number of participating clusters and the desired resiliency (4 and 1 in the example above) and performing the above scaling for each cluster's HPA.
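
As a minimal sketch of that second method with the numbers from the example above: the same manifest would be applied to each of the 4 clusters, with the per-cluster threshold set to the global target multiplied by (expected clusters minus tolerated failures). A Kafka trigger is used purely for illustration; the broker address, topic, and consumer group are placeholders.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker
spec:
  scaleTargetRef:
    name: queue-worker
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.example.internal:9092   # placeholder
        consumerGroup: queue-worker                      # placeholder
        topic: work-items                                # placeholder
        # Globally we want 1 replica per 600 of lag; with 4 clusters and
        # tolerance for losing 1, each cluster uses 600 * (4 - 1) = 1800.
        lagThreshold: "1800"
```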

tomkerkhove commented 2 years ago

I tend to agree that this is more of a scheduling problem than an autoscaling problem. We are working on CloudEvents support (#479) that should give insight into how apps are autoscaling, but we can evaluate whether we can add more events to the list that could be helpful in this scenario.

However, it will most probably be related to scheduling again, which comes back to the same question - is this up to KEDA or not? It depends.

dmeytin commented 2 years ago

Any plans to PoC CloudEvents for multi-cloud support?

tomkerkhove commented 2 years ago

Not at the moment, since CloudEvents support is still being added, but I'm curious to hear what events you would like to have.

dmeytin commented 1 year ago

@tomkerkhove, check

tomkerkhove commented 1 year ago

Check for what? :)

dmeytin commented 1 year ago

Checking the status

tomkerkhove commented 1 year ago

Nothing was posted here so it's safe to assume no changes. I think we are still looking for solid use-cases and needs.

CloudEvent support is tracked in dedicated issues.

gabrieljones commented 1 year ago

I have a set of 20 or so microservices that are replicated across three OpenShift clusters, one cluster for each availability zone. Each replicated microservice deployment has its own kafka topic. All three clusters/AZs pull from a single kafka cluster. I would like for the required pod count calculated by the Apache Kafka Scaler to be spread evenly across the currently up clusters/AZs.

Is this possible today? I've seen several fits and starts in this regard but they all seem to fade away.
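
One way to approximate this today, following the over-provisioning approach described earlier in the thread, is to give each of the three clusters the same ScaledObject and multiply the per-cluster lagThreshold by the number of clusters. A minimal sketch with placeholder names and values follows; note that this only splits load roughly evenly while all clusters are up and does not react to a cluster going down.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-service            # placeholder microservice name
spec:
  scaleTargetRef:
    name: orders-service
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.shared.example:9092   # the shared Kafka cluster (placeholder)
        consumerGroup: orders-service                 # placeholder
        topic: orders                                 # placeholder
        # Want 1 pod per 100 of lag globally; with 3 clusters/AZs each
        # cluster scales at 3 * 100 = 300 so the pods split roughly evenly.
        lagThreshold: "300"
```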

gabrieljones commented 1 year ago

Just found Karmada. How do I combine FederatedHPA with KEDA Apache Kafka Scaler?

Orfeasfil commented 3 months ago

> Just found Karmada. How do I combine FederatedHPA with KEDA Apache Kafka Scaler?

Did you find a way?