microsoft / service-fabric-issues

This repo is for the reporting of issues found with Azure Service Fabric.

Please support Kubernetes Federation #721

Closed. AceHack closed this issue 4 years ago.

AceHack commented 6 years ago

https://kubernetes.io/docs/concepts/cluster-administration/federation/

Please allow Service Fabric to act as a federated cluster in a Kubernetes federation.

masnider commented 6 years ago

That's incredibly unlikely to ever happen. Federation in K8s itself is incredibly new and doesn't work terribly well - See their readme.

With SF, you can already achieve all of the things that you would do in a federated multi-cluster environment just by having a single geographically spanned cluster in SF. So it's not immediately clear what this gains.

Maybe you can provide more information about why this is interesting to you?

mkosieradzki commented 6 years ago

Please never do this. SF federation is so much better.

AceHack commented 6 years ago

@masnider I disagree that you can do everything in a single SF cluster that you can do in K8s federation. I agree that K8s federation is very early, but those are implementation details at this point, not design. One of the biggest benefits of K8s federation is the internal and external DNS-based service discovery that can work across clusters. It also has the ability to do cluster-level tagging instead of only node-level tagging. This is hugely beneficial. There are several other benefits as well.

My company worked for a long time to get geo HA SF clusters and it was hard as hell. We still have several unsolved problems. Logs still end up in one region, incurring lots of cost overhead; stateful services have horrible performance and latency when trying to replicate across regions. The networking between the two clusters is very complicated and difficult to get right. Fault tolerance is not well isolated between clusters. Adding new public IPs is hard, with no ability within Service Fabric to automate that in any way.

Because of all of this, it would be much better to have a first-class distinction for regions/clusters, just like you do for fault domains and upgrade domains, instead of having a single cluster span multiple regions.

@mkosieradzki SF does not have federation

In any case, I would like to federate SF and K8s together. This would reduce orchestrator lock-in and allow both sets of APIs to be used individually.

mkosieradzki commented 6 years ago

@AceHack this is a technical problem with networking. The SF team has mentioned multiple times during Q&A that they do not officially support geo clusters for external customers without proper QoS. This is one of many reasons why Microsoft is investing in strong networking and Azure features like VNET Peering.

Microsoft itself runs geo-distributed SF clusters in production.

Global VNET Peering in Azure should be a solution to many of the problems mentioned, but it is not out yet. Latency comes from the speed of light and is something you need to deal with.

You also have Availability Zones for lower-latency synchronous replication, and multiple new geos that are close to each other with low latency between them.

Hierarchical fault domains have great support in SF. Did you use the fault domain prefix override in your geo cluster? Because it sounds like you did not.

Regarding federation - https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-architecture#federation-subsystem - it is just designed in a different way (the way Ubernetes was planned initially). Eventually K8s dropped its plans for Ubernetes and went with a simplistic loose-federation approach. This is not how SF is designed to work, for example because SF needs much more advanced fault detection and cluster-wide leader election. I don't believe you can get things like that with "federated clusters".

Going through all that burden just to get DNS-based discovery sounds like a really bad joke. However, I definitely support something like Virtual Kubelet support (#720).

AceHack commented 6 years ago

@mkosieradzki Cross-region VNET peering will be good when both clusters are in Azure, but this will not help at all when doing cross-provider clusters. Even with VNET peering, one has to keep huge spreadsheets to make sure all the different addresses do not conflict and overlap; this is a management nightmare even for VNET peering within a single region.

Also, the federation subsystem you point to is node federation, i.e. federation between computers, not cluster or region federation; these are two different concepts.

We did try to use hierarchical fault domains, but this was not a silver bullet; it added a lot of complication, and independent tags not arranged in a hierarchy seem a better solution. I do not know what prefix override is; I can't find any documentation on the subject. For instance, what we really wanted to do was define how one region looked by writing service manifests geared towards a single region and then, say, sync that to other regions. This is exactly what K8s cluster federation's cluster sync gives you.

Also, global leader election is something I don't want; things like that are exactly what I mean by not isolating faults and latency. You can almost think of this as an eventually consistent vs. strongly consistent approach. SF without cluster federation is strongly consistent; K8s with federation is eventually consistent. This is not an exact metaphor but close enough. There are benefits and drawbacks to each.

You can run a globally distributed single K8s cluster today; it's similar to the way one would do it for SF. The difference is that with federation I now get that choice. Also, K8s brings to the table much more than DNS-based discovery. It has several different layers of SDN that can be configured through its manifest files. It can reach out and bend infrastructure to fit the needs of the services inside it without having to do coordination outside of the normal service provisioning process. Even in Azure, its management plane is PaaS whereas SF's is IaaS, and K8s in Azure can provision LBs, create public IPs on the fly, attach disks, and even provision MySQL databases, all on the fly as services need them, without external coordination like needing to deploy additional ARM templates for external resources.

I'm not trying to have an SF vs. K8s debate or even a cluster federation vs. single cluster debate. I would just really like the ability to have several orchestrators collaborating through something like federation, where each can operate on its own, no one is the global "leader", yet they can collaborate so that I can start thinking of apps at a higher abstraction level and not tie myself to any one orchestrator. I would like my app to stay portable across any orchestrator or even a mix of orchestrators.

I would love to see swarm, DC/OS and others, all part of this orchestrator federation.

AceHack commented 6 years ago

@mkosieradzki one other issue with a single global cluster with a single elected leader vs. multi-master or multiple "federated" leaders is usually quorum loss. Let's say there is a temporary DNS or routing issue across the internet that only prevents the regions from talking to each other. Customers in region 1 can still route to cloud region 1, and customers in region 2 can still route to cloud region 2. In this case, with a single global cluster, one region is dead even though it could still serve traffic to customers, because a quorum cannot form; only one of the two regions will have the majority that allows the quorum to form. K8s federation does not suffer from this issue. From my, albeit limited, testing, a single global Service Fabric cluster would suffer from this quorum loss issue. This seems to be backed up by this SF documentation:

https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-disaster-recovery
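A minimal sketch of the quorum arithmetic behind this scenario (the seed-node counts are hypothetical, not taken from any real deployment):

```python
# Toy illustration of why a partitioned region in a single stretched cluster can
# go down: only the side holding a strict majority of seed nodes keeps quorum.
def has_quorum(seeds_reachable: int, seeds_total: int) -> bool:
    return seeds_reachable > seeds_total // 2

SEEDS_TOTAL = 5                      # e.g. 3 seed nodes in region 1, 2 in region 2
print(has_quorum(3, SEEDS_TOTAL))    # True  -> region 1 keeps operating
print(has_quorum(2, SEEDS_TOTAL))    # False -> region 2 stops, even though it is healthy
```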

Thanks.

mkosieradzki commented 6 years ago

@AceHack You are using the term "both regions". Just to make sure: you cannot do a cross-geo SF cluster across 2 regions. You need 3+, because otherwise you will not be able to have 3 top-level fault domains, and that will have terrible results, but I am sure you know this.

Prefix override is an Azure facility for creating hierarchical FDs. If you are not using Azure clusters, you are not affected.

What I am defending here is the strongly consistent nature of SF federation. I believe that sooner or later we will be able to retain this nature, and it's only a matter of connection quality/redundancy, even in a multi-provider setup.

However, the provisioning of appliances etc. that you have mentioned is definitely a different (important) area and might be somewhat interesting, but I am not sure it should be handled by federation.

Honestly, I would personally prefer the SF team committing resources to creating the best experience for homogeneous SF cross-geo, cross-provider clusters rather than to creating a federation-based interoperability layer. But that's only my wishful thinking...

I also think it would be best for the entire ecosystem to have two orchestrators based on different philosophies rather than trying to make them as similar as possible; while I agree with most of your concerns, I think that would be the most beneficial.

mkosieradzki commented 6 years ago

@AceHack I am very aware of this issue, and for many "stateless" scenarios a federated approach works better, especially today.

I think we could learn a lot from Microsoft's internal CosmosDB... because it does not require strong consistency and, if I understand correctly, it runs on top of multiple non-federated SF clusters. I could be very wrong here...

But as I have said before: with improving infrastructure the strongly consistent approach will become more and more feasible, and the strongly consistent approach is much easier to reason about.

AceHack commented 6 years ago

@mkosieradzki To get multi-master support in CosmosDB it looks like you have to do some work in your data layer, so it pushes some of the complexity to your code. https://docs.microsoft.com/en-us/azure/cosmos-db/multi-region-writers

It seems like DynamoDB from AWS handles some of this complexity for you, I could be wrong here. https://aws.amazon.com/dynamodb/global-tables/

Also, in all situations, because of the laws of physics, any strongly consistent system will hit limits that it can't scale past. As technology gets faster and more reliable this incrementally increases the scale, but there will always be a limit. Eventually consistent systems will always have the advantage when it comes to scaling and availability. This is the basics of the CAP theorem. SF, for now, chooses only C, whereas in K8s you get the choice between C and A: C being a single global cluster and A being multiple federated clusters.

mkosieradzki commented 6 years ago

CAP is so 1999 (and has nothing to do with scalability) ;). Seriously, the main reason you create 3+ regions is to avoid P :). This is why there is the more modern PACELC theorem, which addresses scalability in the absence of partitions :), and SF is mostly CP+EL (correction: should be CP+EC). And yes, federated K8s should have lower latency.

But you must understand that you cannot run stateful services on top of K8s; even if you do, using StatefulSets, those systems usually have their own Paxos/Raft and/or fault detection, and you need to do all the reasoning from the beginning.

SF and K8s are just fundamentally different. Both are well designed, with different purposes in mind. If you don't need stateful, trivial federation is good enough; if you need stateful, then you need advanced fault detection, global leader election, etc. - simple as that.

AceHack commented 6 years ago

@mkosieradzki The CAP theorem is the easy one to mention, since when I say it most people are aware of what I'm talking about; I've not run into a lot of people who reference or know about PACELC. When saying choosing A makes you more scalable, I will agree I was making a number of assumptions not specifically called out in the CAP theorem; I was really talking about PA+EL. It seems to me SF is more PC+EC, but I would be curious to know why specifically you consider it PC+EL.

You can run stateful services on top of K8s; it just does not have first-class built-in in-memory replicated data types for .NET and a handful of other languages. It's very trivial to get RabbitMQ, Redis, or Ceph via Rook.io stood up in HA mode in K8s. I realize this requires an extra network hop for reads as compared to a reliable dictionary, for instance, but it comes with many benefits like easy backups, tools to view the data, no need to conflate my business logic with a lot of performance-based partitioning rules, etc. If I want similar features on K8s I can just rely on Apache Ignite or Hazelcast. Also, I know that in preview SF Linux used Firebird for stateful actors; I'm not sure if this is still the case.

When it comes to stateful, you absolutely don't need global leader election in all use cases. For me, I'm in an IoT world, so I'm not dealing with transferring money between bank accounts. Even if I were, not all systems in a bank need the same level of consistency as money transfers. In my perfect world, I would love a data store that chose CP+(EC or EL, on some app-determined basis) within a region/cluster and then chose PA+EL across regions/clusters. Heck, it would be great if all these knobs were configurable based on some node/cluster/region selectors. Also, why not throw in the option to switch between blockchain consensus and Raft/Paxos, if I could ask for anything. And if your problem domain can fit into a conflict-free replicated data type, then Riak would be great, and having split brain is not really an issue since it's trivial to resolve. I just want the choice at runtime in my app: I can play it safe with consistency and lose some scale, or I can add some complexity, with availability and reduced latency, to get better scale if that makes sense for my particular use case.

Again, I just want to reiterate that the whole ask of this enhancement request is about federation between different orchestrators, in the hope it will lead to making my application easier to write and administer in a more orchestrator-agnostic way. We are already aware of vendor lock-in for things like cloud and *aaS offerings; I just want to avoid orchestrator lock-in as well, if at all possible.

mkosieradzki commented 6 years ago

@AceHack It was late at night yesterday, my bad - I of course meant PC+EC - my apologies for the confusion.

Yup, much agreed - I was raising eventual consistency and specifically CRDT scenarios during SF Q&A, but I don't believe it's the way to go with SF. I believe that sooner you will be able to run SF on top of a StatefulSet in K8s, or Virtual Kubelets on top of remote SF.

BTW, wouldn't running SF on top of a K8s StatefulSet solve most problems for you? I believe that during the latest Q&A the SF team said that the SF Linux version should work in containers (it does not have dependencies on kernel drivers, AFAIK).

That would be much closer to what I think about those orchestrators:

You can then run multiple SF clusters on top of a federated K8s cluster family.

In any case, thanks for an interesting discussion.

masnider commented 6 years ago

@AceHack - apologies for the late reply (holidays) and for the long post here, but I really want to clear some of this up for you.

@masnider I disagree that you can do everything in a single SF cluster that you can do in K8s federation.

Then please be specific about what you're looking for. These could make for some good feature requests, if you can help us dig them out.

I agree that K8s federation is very early but that is implementation details at this point, not design. One of the biggest benefits to K8s federation is the internal and external DNS based service discovery that can work across clusters.

The naming service and DNS service within SF would also work across the multi-region cluster, so there's no difference here, except that our federation already works :)

It also has the ability to have cluster level tagging instead of only node level tagging.

What of those capabilities is not achievable with node type level tags, or the placement policies?

Stateful services have horrible performance and latency when trying to replicate across regions.

Ok well then don't replicate them across regions, or put your regions closer together, or get a better network. You have the choice between latency and safety/HA. If you don't want to pay for it, then don't do it, that's fine. Constrain the service to a particular region. If you don't require RPO/RTO of 0 then multiple clusters with some other backup and restore mechanism is usually simpler for people.

The networking between the two clusters is very complicated and difficult to get right. Adding new public IPs is hard, no ability within service fabric to automate that in any way.

  1. What about this do you think is specific to SF?
  2. Can you point to how K8s (or any other orchestrator) solves the problem of obtaining new publicly routable IPs, say across or between current cloud providers? AFAIK all orchestrators are bound at the end of the day to the real physical resources provided by whatever infrastructure they're running on top of, and most don't do tons of integration with those specific hardware environments because they're trying to ensure portability.

Fault tolerance is not well isolated between clusters.

What do you mean by this and why would that be desirable? Ex: If one region fails, you'd like the other region to pick up the workloads right? Just as a note, with the placement policies, you can configure SF either way.

In any case, I would like to federate SF and K8s together. This would reduce orchestrator lock-in and allow both sets of APIs to be used individually.

Today the platforms are entirely too different in terms of the guarantees they offer, and even then that's not the pattern that other platforms have taken, nor in all likelihood one that we would take either.

Cross region vnet peering will be good when both clusters are in Azure but this will not help at all when doing cross provider clusters. Even with VNET peering one has to keep huge spreadsheets to make sure all the different addresses do not conflict and overlap, this is a management nightmare even for vnet peering within a single region.

This is just how networking is. Yes it is hard, no argument, but it's got nothing to do with SF. How would you directly IP connect two services running in those providers today?

Also, the federation subsystem you point to is node federation i.e. federation between computers, not cluster or region federation, these are two different concepts.

In Service Fabric, those are the same. You're looking for federation of two specifically different clusters, and the question then is what do you think that really buys you? In our comparisons to other platforms and looking at customer reports, the main downside to that layout is that management of the whole environment and handling things like upgrades of the underlying platform become harder. When you have a single cluster/environment stretched across those regions, then you don't have those problems, and if you have the ability to express your workload the way you want then it shouldn't matter. Early thinking about this sort of difficulty is why we built SF to be able to span regions without having to stand up separate clusters in the first place.

I can't find any documentation on the subject.

Here's some useful links on this topic:

For instance, what we really wanted to do was define how one region looked by writing service manifests geared towards a single region and then, say, sync that to other regions. This is exactly what K8s cluster federation's cluster sync gives you.

To be clear, it doesn't really give you that today, because the feature isn't released as a part of the platform, and that alone doesn't meet all the IP space issues and requirements that you are combining above. Let's compare SF to things that actually exist and are supported, and let's try to narrow the conversation to things that SF can actually control, can we please?

global leader election is something I don't want; things like that are exactly what I mean by not isolating faults and latency.

It would be great if you could clarify what you think the actual problem is here. So for example, why don't you want this? You may not need it - ok then so don't use it. Or what problem is this actually causing you today, in concrete terms? Feedback like that will help us figure out what we can do that will actually improve things for you.

In most of this discussion, you seem to be complaining that you have built a geo-cluster, but don't really want any of the geo features and aren't willing to pay for them. Ok - so then don't build a geographically spanned cluster. But then in other places you say you want things to automatically sync, you want some sort of automatic deployment but not(?) automatic failover, etc. I'm confused :) It may end up being that multiple clusters are better for your requirements, or that you just need to configure the geo cluster a little differently.

Again I just want to reiterate this whole ask of this enhancement request is about federation between different orchestrators

Sure - that's a nice thing to want. That said, there aren't really other orchestrators out there with any sort of standard federation mechanism that everyone could snap to. Also, while I think that orchestrators will eventually all reach some sort of feature parity, they aren't there today; each is just different enough that there are things you can do in one that you can't do in others. Stateful services are one differentiator for SF, whereas a lot of the isolated networking stuff that you mention from K8s is an area where they shine.

I would love to get more information about your actual problems today though, and to figure out how SF could help you. Most of what you've pointed out seems like you're using SF and it's almost working, but turns out just to be a little frustrating. Rather than sending us down the CAP rabbit hole, let's try to discuss those specific issues. Are there corresponding issues you've already created for these difficulties, or can you open them?

AceHack commented 6 years ago

First off thanks for the long and detailed response, I really appreciate it. See my responses below.

Then please be specific about what you're looking for. These could make for some good feature requests, if you can help us dig them out.

I have been creating many feature requests on the SF issues. When you have one geographically distributed cluster, all "master" nodes have to communicate, which causes lots of cross-traffic between regions and drives up costs, since cross-region traffic is not free. The biggest advantage of "federating" clusters is much, much, much less crosstalk. Basically, each cluster is somewhat its own entity; it does not have to communicate with the others constantly for agreement on leadership, and it is able to continue operating even if it loses communications to other regions, since it does not require a single leader. This does come at the cost of being only eventually consistent rather than strongly consistent, which increases complexity in exchange for reduced costs and availability in the face of network partitions. Some "multi-master" systems can accomplish this using only one cluster, again at the cost of potential "split brain" issues that have to be resolved, but Service Fabric only allows a single master. In a perfect world, within a region "replication" (or erasure coding) would be strongly consistent, but cross-region it would be eventually consistent.

The naming service and DNS service within SF would also work across the multi-region cluster, so there's no difference here, except that our federation already works :)

Service Fabric DNS is not usable as external DNS; it's only available within the cluster. It also cannot reach out and configure Azure DNS or Amazon Route 53. The Service Fabric naming service also lacks the concept of namespaces, so that I could further subdivide a single cluster into "sub-clusters", allowing development teams to share a single cluster for Dev1, Dev2, Staging, QA1, QA2.

What of those capabilities is not achievable with node type level tags, or the placement policies?

It's all hierarchy-based; there is no way to define more "tags/labels" that are orthogonal to fault and upgrade domains. Well, you get one extra label, node type. Imagine trying to say I only want to schedule something in the eastern region on machines that do not have GPUs exposed. Something more like the taint/toleration and label/selector mechanism from Kubernetes is much easier to reason about for certain things. I think for fault domains and upgrade domains Service Fabric's built-in understanding is great!!! I just want more scheduling control for heterogeneous nodes and workloads.

Ok well then don't replicate them across regions, or put your regions closer together, or get a better network. You have the choice between latency and safety/HA. If you don't want to pay for it, then don't do it, that's fine. Constrain the service to a particular region. If you don't require RPO/RTO of 0 then multiple clusters with some other backup and restore mechanism is usually simpler for people.

I don't have the choice between latency and safety/HA within the Service Fabric model. If I choose latency then I need two clusters that have no ability to communicate or sync with each other, even in an eventually consistent way. It's not all about the costs, but it would be nice to consider the latency, bandwidth, and actual dollar costs when running a distributed system. The same tradeoffs likely don't make sense across the globe as they do within the cluster. It would be great to not have to completely abandon any cluster cohesiveness just to get latency over safety. The thing is, I want to basically specify the RPO/RTO for cross-cluster read-replica partitions. That way, when someone comes to the east, all writes happen in the east for the "east" partitions, and there are RPO/RTO 0 read replicas in the east but much more batched, further-behind replicas in other regions, kind of like geo-redundant storage in Azure. I want the mirror of this to happen in other regions, with the ability, in the case of a data center outage or network partition, for other regions to take over the writes. This part would be great to have eventually consistent, and I know this adds complication, because in the case of network partitions more than one region could actually take over writes for a single partition and conflicts would have to be resolved.

What about this do you think is specific to SF? Can you point to how K8s (or any other orchestrator) solves the problem of obtaining new publicly routable IPs, say across or between current cloud providers? AFAIK all orchestrators are bound at the end of the day to the real physical resources provided by whatever infrastructure they're running on top of, and most don't do tons of integration with those specific hardware environments because they're trying to ensure portability.

K8s automates infrastructure, including the creation of public IPs: just specify type LoadBalancer, for instance, and you get an Azure, AWS, or Google public IP in front of a load balancer, all through infrastructure automation. Yes, K8s has tons of provider-specific integrations with Azure, AWS, and Google, but I don't have to; it abstracts all that away from me, which is what I want as an application developer building apps that can run anywhere: in any cloud or on-premises.
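For illustration, here is a minimal sketch of that `type: LoadBalancer` flow using the official `kubernetes` Python client; the service name, selector, and ports are placeholders rather than anything from this thread:

```python
from kubernetes import client, config

config.load_kube_config()  # use the current kubeconfig context
v1 = client.CoreV1Api()

# Declaring type=LoadBalancer is all the application asks for; the cloud provider
# integration (Azure/AWS/GCP) provisions the load balancer and public IP behind it.
service = client.V1Service(
    metadata=client.V1ObjectMeta(name="web"),  # placeholder name
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        selector={"app": "web"},  # placeholder label selector
        ports=[client.V1ServicePort(port=80, target_port=8080)],
    ),
)
v1.create_namespaced_service(namespace="default", body=service)
```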

What do you mean by this and why would that be desirable? Ex: If one region fails, you'd like the other region to pick up the workloads right? Just as a note, with the placement policies, you can configure SF either way.

There are lots of meanings. If an attacker gets into one region and compromises the cluster, he has access to all regions. It's also not really a great story for network partitions: even in a 3-region setup where US East is unable to talk to any other region but can still serve traffic in US East, it won't serve that traffic, because it does not have a majority. This problem could become worse as more regions are added, since the minority partition could grow much larger and still get shut down.

Today the platforms and entirely too different in terms of the guarantees they offer, and even then that's not the pattern that other platforms have taken or in any likelihood that we would either.

Please just follow suit with Pivotal, Mesosphere, OpenShift, Docker, etc. and run K8s inside Service Fabric. That would be AMAZING; I can't imagine a better pairing. This is something I asked for a while back when Docker made their announcement, and I never got any response, which is why I've been asking for all these Kubernetes-like features within Service Fabric. If you just shipped with K8s built in, it would pretty much solve most of my problems.

This is just how networking is. Yes it is hard, no argument, but it's got nothing to do with SF. How would you directly IP connect two services running in those providers today?

All I would say is orchestrator infrastructure and networking automation that is abstracted to the software developers and operations in a cloud-agnostic way simplifies the problem. It does not solve all problems but it simplifies it a lot.

In Service Fabric, those are the same. You're looking for federation of two specifically different clusters, and the question then is what do you think that really buys you? In our comparisons to other platforms and looking at customer reports, the main downside to that layout is that management of the whole environment and handling things like upgrades of the underlying platform become harder. When you have a single cluster/environment stretched across those regions, then you don't have those problems, and if you have the ability to express your workload the way you want then it shouldn't matter. Early thinking about this sort of difficulty is why we built SF to be able to span regions without having to stand up separate clusters in the first place.

Node federation and Cluster federation are not the same thing in any system. Service fabric offers node federation but does not offer cluster federation. In service fabric, you completely lose any concept of two different cluster identities. In a dream world, I would set things up how I want for a phantom fake region and just checkbox what regions I actually want to support. Please see all my above comments for what I mean about how cross region communications and latencies should be accomplished. Just naive strong consistency and a single leader will not accomplish what I'm looking for.

This disaster recovery topic, which was already referenced, discusses when and why you may want to build multi-regional clusters. Check out this case study from Cortana. I want to be very clear that clusters spanned across regions are a scenario that SF has supported for years. The main issues you've identified around networking are correct, but we don't have much control over that. This is a video where we demonstrate both cross-region and cross-Azure-AZ clusters. In the associated decks, please see slides 16-27 specifically on this topic. This is a link to a template for ARM which should get you started.

In none of these documents or videos do I see a concept of "prefix override". Maybe I'm just missing it, but this is specifically the thing I said I was unable to find any documentation on, as it was suggested as a way to make geo clusters "easier". Thanks for the information you've shared, though; it's great to have all those links in one place.

To be clear, it doesn't really give you that today because the feature isn't released as a part of the platform, and that alone doesn't meet all your IP space issues and requirements that you are combining in above. Let's compare SF to things that actually exist and are supported, and let's try to narrow the conversation to things that SF can actually control, can we please?

Federation has been released for Kubernetes; some parts are beta, some are alpha, so changes could occur, but the feature itself has been released. Also, I'm not sure why a feature request to Service Fabric would require that some other orchestrator have it fully implemented and supported; it seems fresh ideas would be welcome as well. And I know SF currently does not do networking and infrastructure automation, so it can't control those things today, but I really wish it could.

It would be great if you could clarify what you think the actual problem is here. So for example, why don't you want this? You may not need it - ok then so don't use it. Or what problem is this actually causing you today, in concrete terms? Feedback like that will help us figure out what we can do that will actually improve things for you.

In most of this discussion, you seem to be complaining that you have built a geo-cluster, but don't really want any of the geo features and aren't willing to pay for them. Ok - so then don't build a geographically spanned cluster. But then in other places you say you want things to automatically sync, you want some sorts of automatic deployment but not(?) automatic failover, etc. I'm confused :) It may end up being that multiple clusters are better for your requirements, or that you just need to configure the geo cluster a little differently.

I think I've covered this one pretty much above: chatty, expensive, synchronous, strongly consistent, single leader/master, single cluster across geographic regions. Take a look at geo-redundant storage; it's not perfect, but its ability to asynchronously replicate to other regions is the kind of tradeoff I'm talking about that makes sense across regions but not within regions.

Sure - that's a nice thing to want. That said there aren't really other orchestrators out there that have any sort of standard federation mechanism that anyone could all snap to. Also while I think that orchestrators will eventually all reach some sort of feature parity, they aren't there today, each is just different enough that there's things you can do in one that you can't do in others. Stateful services in SF is one differentiator from SF, whereas a lot of the isolated networking stuff that you mention from K8s is an area where they shine.

I would love to get more information about your actual problems today though, and to figure out how SF could help you. Most of what you've pointed out seems like you're using SF and it's almost working, but turns out just to be a little frustrating. Rather than sending us down the CAP rabbit hole, let's try to discuss those specific issues. Are there corresponding issues you've already created for these difficulties, or can you open them?

You could be the leader in creating cross-orchestrator federation, but it seems like it's not something you are interested in, so please just follow suit with all the other orchestrators and include Kubernetes as part of Service Fabric. I would not say stateful services are an SF differentiator; other orchestrators can use orchestrator-agnostic in-memory data grids (IMDGs) like Apache Ignite or Hazelcast, which in some ways are ahead of SF stateful services, with built-in backup and restore, the ability to run SQL queries, cross-partition queries, and even cross-partition transactions. I don't think the new Service Fabric block storage work is going to be a differentiator either, with technologies like Rook.io using Ceph, a tried and battle-tested distributed block storage system capable of erasure coding as well as naive replication. Microsoft has done some great research in the area of erasure coding. There is also Heketi on top of GlusterFS and several other very innovative stateful and software-defined storage projects.

I work for a large corporation; I actually sit right behind our representative from Microsoft, and he's set up several meetings for us with the Service Fabric team. The last time I spoke with you on the phone, I was trying to drive home how important it is to pick up the 3rd-party stateful open source container projects and make them supported as first class in Service Fabric. Many of these projects ship out of the gate with Kubernetes Helm charts, operators, StatefulSets, and/or deployment YAML files. Some of them ship with Docker Compose files, but none of them ship with Service Fabric application manifests. There are also very complicated, interwoven dependencies between projects. Supporting some of the Compose YAML API is a great start, but the lack of core networking automation fundamentals, of StorageClass-like interfaces for persistence in containers, and of open source community support and adoption makes getting any 3rd-party stateful application up and running in Service Fabric in HA mode a very difficult proposition. I've tried with several and it's TOUGH, compared with AKS, where I was able to get 15 different ones up and running in HA mode in a weekend. All of the many, many feature requests I've submitted to Service Fabric are in the hope of making it a better product with similar parity to the orchestrators the open source communities are adopting like wildfire. I'm really a big SF fan.

Thank you so much for this conversation, I really enjoy going deep on subjects and I'm so glad you've taken the time and consideration to respond in such a detailed way.

masnider commented 6 years ago

Thank you so much for this conversation, I really enjoy going deep on subjects and I'm so glad you've taken the time and consideration to respond in such a detailed way.

ACK, +1, 👍 , golfclap :)

Now on to the good stuff.

When you have one geographically distributed cluster all "master" nodes have to communicate and cause lots of cross traffic between regions spinning up costs since cross regions traffic is not free. The biggest advantage to "federating" clusters is much, much, much less crosstalk

I don't know how we can be making these comparisons without real implementations running similar workloads. Sure - theoretically! - I can see that maybe you could get away with less. But I don't know that for sure. One way to get away with less communication is to do less work and provide fewer guarantees :) Is that what you really want? :) Unless there are specific comparisons, we're comparing something that does exist to something that would/could/should work. That's hard, and vaporware/whiteboard discussions tend to win, because they can promise anything and be designed in any which way.

This is not to say that the communication costs aren't real, just that 1) you're making a comparison to something that isn't concrete (at least to me) and 2) to be a fair comparison anyway there would need to be a "why" and "can you turn it off" part of the discussion. So far we're not having that discussion. It's a bit too high level.

it does not have to communicate with others constantly for agreement on leadership and also it's able to continue operating even if it loses communications to other regions since it does not require a single leader.

In a lot of SF there isn't a single leader. The services for the most part run themselves and don't touch the system services. And again, if a) you don't want leaders, then don't have stateful services, and b) if you don't want services communicating across these boundaries and would rather backup/restore in the background, then just do that. A lot of this (again) feels like you've done something you don't want to do. Try it the other way around: decide what you want, and then ask if there's some way to achieve that.

In a perfect world within a region, "replication" (or erasure coding) would be strongly consistent, but cross region it would be eventually consistent.

Lots of applications that require perfect consistency (0 RPO/RTO) would disagree with this statement. The thing to know is that in SF we support both. So if you don't want this but still want a single cluster, then that can be configured.

Service Fabric DNS is not used as external DNS, it's only available within the cluster. It also cannot reach out and configure Azure DNS or Amazon Route 53.

How is this different from anything else? Is that a feature request? I thought the initial statement was the assertion that, within these other (theoretical) federated clusters, there was thankfully some DNS that could be used for addressing even though the clusters were federated. I was saying - ok - SF has that. These are different requirements now. Please, please try to be specific and not hop around, or we're never going to reach any understanding of what the requirements actually are.

Service Fabric naming service also lacks the concept of namespaces so that I can further subdivide up a single cluster into "sub-clusters", allowing development teams share a single cluster for Dev1, Dev2, Staging, QA1, QA2, within a single cluster.

Good, solid, actual feature request! We're working on this. First drops will be basically isolated networks between different Service Fabric application instances. Today, to be clear, you can absolutely share the cluster between teams (lots of people do); however, you have to be careful not to claim global resources like service names. That will still be mostly the case in the first releases, but at least services won't be able to communicate with other services accidentally. This separation and tenancy will be strengthened over time.

It's all hierarchical based, no way to define more "tags/labels" that are orthogonal to fault and upgrade domains. Well, you get one extra label, node type. Imagine trying to say I only want to schedule something in the eastern region on machines that do not have GPUs exposed.

PlacementPolicy: RequiredDomain = fd:/East
PlacementConstraints: (NodeType == "MyNodeTypeThatHasGPUs") || (NodeType == "MyOtherNodeTypeThatHasGPUs")

Alternatively, instead of relying on node types, you can use specific NodeProperties that happen to be the same across node types, if you find node types too cumbersome or not granular enough. See the example here; HasSSD, for example, could be a property that exists on multiple node types. NodeType is just a shortcut, since lots of sets of properties are consistent across node types because they correlate with VM types (VMSS) in Azure. This looks like a good match for the "heterogeneous" scenario that you mentioned.

Seriously - just ask these questions rather than saying that you can't do certain things. We'll help you or we'll go make the features work so that you can. Bringing this sort of stuff into some multi-cluster federation thing is just confusing.

I don't have the choice between latency and safety/HA within the service fabric model.

Of course you do, come off it :)

If I choose latency then I need two clusters that have no ability to communicate or sync with each other even in an eventual consistency way.

This sounds like it could be an actual feature request :) What do you want to sync and what do you want to not sync? We've already discussed how you can easily prevent workloads from replicating or doing election across these boundaries. What else? Be specific.

It would be great to not have to completely abandon any cluster cohesiveness just to get latency over safety.

You don't and I don't understand what makes you think that you do. Give an example. Maybe break that example question out into a separate thing since this is getting hard to track :)

The thing is, I want to basically specify the RPO/RTO for cross-cluster read-replica partitions. That way, when someone comes to the east, all writes happen in the east for the "east" partitions, and there are RPO/RTO 0 read replicas in the east but much more batched, further-behind replicas in other regions, kind of like geo-redundant storage in Azure. I want the mirror of this to happen in other regions, with the ability, in the case of a data center outage or network partition, for other regions to take over the writes. This part would be great to have eventually consistent, and I know this adds complication, because in the case of network partitions more than one region could actually take over writes for a single partition and conflicts would have to be resolved.

We would probably just tell you to build a single cluster, and then constrain the services so that they're running in only one region, then to have them stash their state in a georeplicated azure storage blob that their corresponding other-region partner would pick up and apply every so often (CosmosDB would also work). That's not so bad of a setup, right? Or what's the problem? What specifically do you wish was different?
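A hedged sketch of that pattern (this is not a built-in SF feature; the container and blob names, the connection string, and the `apply_snapshot` callback are placeholders), using the `azure-storage-blob` SDK:

```python
from azure.storage.blob import BlobServiceClient

# The "east" service periodically publishes a snapshot of its state to a
# geo-replicated (RA-GRS) storage account; the "west" partner polls and applies
# it on its own schedule, so the polling interval effectively becomes the RPO.
blobs = BlobServiceClient.from_connection_string("<ra-grs-connection-string>")  # placeholder
snapshot = blobs.get_blob_client(container="state-sync", blob="east/latest-snapshot")

def publish_snapshot(serialized_state: bytes) -> None:
    # Called from the primary region after it batches up local writes.
    snapshot.upload_blob(serialized_state, overwrite=True)

def apply_latest_snapshot(apply_snapshot) -> None:
    # Called on a timer by the partner region; apply_snapshot is app-specific.
    apply_snapshot(snapshot.download_blob().readall())
```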

There are lots of meanings. If an attacker gets into one region and compromises the cluster he has access to all regions.

Sure - so again, how is this different from the other options out there? Be specific. If the whole point of federation is to provide a single management plane then this should be the same for everyone.

Also not really a great story for network partitions, even in a 3 region setup where us east is unable to talk to any other regions but still serve traffic in us east, it won't serve that traffic because it does not have a majority.

? Of course it still can serve traffic. Today lease durations are 30 seconds in Azure by default. That region gets cut off, everything pretty much keeps working for 30 seconds. Look - fundamentally this is all configurable. You want a region to stay up and serving reads and only locally replicating for a week even when it's cut off from the rest of the world, set your lease duration to a week and be done. But that sounds like a really weird configuration and I wonder what you're really gaining by having a single cluster at that point.

This problem could become worse as more regions are added, since the minority partition could grow much larger and still get shut down.

Then don't federate that many. Conversely - put it this way: SF is trying to give you guarantees about the reliability of the work that is going on, and you keep saying you want to trade those guarantees away for perf. There's probably other ways to get perf, but fine. Just have separate clusters and backsync to some eventually consistent store and you're done.

In none of these documents or videos do I see a concept of "prefix override".

Yeah, I don't know where that term came from. In this template, see "FaultDomainOverride".

I'm not sure why a feature request to service fabric would require some other orchestrator has it implemented fully and supported.

Because otherwise how do you expect them to interact? When K8s breaks their stuff again getting it to GA, who do you call when it doesn't work? :)

I think I've covered this one pretty much above, chatty, expensive, synchronous, strongly consistent, single leader/master, single cluster across geographic regions.

We don't really have that, and there are ways to configure your service placement so that you don't have that, which I've touched on. I don't understand what you're reacting to, other than I guess you didn't have this configured the way you really wanted and had a bad experience? Also, you can always batch in your service code, tell the client it's all good and then write to the collections in the background, or just sync back and forth to Cosmos. Lots of different patterns for doing what you want! If you can help us get specific on the semantics you're trying to achieve, there are ways to do them all already today (as far as I can tell).

You could be the leader in creating cross orchestrator federation but it seems like it's not something you are interested in

I really don't get this statement and it makes me sad. We built this and it works today for most workloads. It sounds like you had some configuration difficulties and I want to help you fix those. But just because someone else did it a different way doesn't immediately make that way the best way, particularly when there's already alternative solutions out there. Again - maybe it would help to separate out 1) the things you want from the infrastructure (that SF doesn't control today), 2) the things you want from the orchestration management layer (such as the isolated networks that you mentioned) and 3) the semantics around availability and consistency you want for your applications. I'm confident that SF can already meet the requirements around 2 and 3 today.

Supporting some of the Compose YAML API is a great start, but the lack of core networking automation fundamentals, of StorageClass-like interfaces for persistence in containers, and of open source community support and adoption makes getting any 3rd-party stateful application up and running in Service Fabric in HA mode a very difficult proposition

This is all well taken. I think that now that the container work is in place, making progress here will be easier. I also think that hopefully we'll be able to get the community engaged in building out support for other tools and utilities. Examples of this are already starting to pop up, like Traefik. WooHoo! Hopefully the new year will also bring some progress on the open source front.

But most of this is far and away from the actual issue you have here around K8s Federation. For that I would say call us back when their solution is a little more concrete, at which point it would be more sensible for us to consider it. However, I would also strongly encourage you to work with us so that we can figure out what design for your cluster and app would work for you today. I'm really pretty sure there are already ways to solve your problems, given that we've already got some applications that sound a lot like yours running in production in these types of environments. We also have plans to support separate clusters that do some sort of top-level federation so that you don't have to do as much at the application level, but that's still different than this ask about K8s federation interop, and we would like to have more of a discussion on what behavior you're looking for so we know what to go build :)

Getting to a concrete list of behaviors that are not achievable today would be really, really good, and I suggest that you open specific issues for those that you're having trouble configuring as you desire today. We're happy to help.

Hope this all makes sense and thank you also for the discussion (and pushing us :) )

mkosieradzki commented 6 years ago

@masnider To be honest, the most important thing we need (from the perspective of this discussion) is good (or any) documentation about what happens when something goes wrong. Apparently we have some features that neither @AceHack nor I are aware of.

I am confident that Service Fabric is a great and mature solution - I have been working with SF for a long time, and I am amazed that every time I need something I had not predicted, the feature is already there (often undocumented, but usually all scenarios are addressable).

We cannot conduct experiments for every feature and scenario, especially in a geo HA setup. So please let me ask a couple of questions that might clear up our incorrect assumptions.

Let's consider a cluster running in 4 regions (3 nodes per region) (VNETs peered with upcoming global VNET Peering)

  1. What amount of traffic is generated by Federation Subsystem?
  2. What is the communication topology for Federation Subsystem: is it more like a ring or ring of rings? Is the Federation Subsystem hierarchical-FD aware or are the hierarchical FDs handled at Failover Manager level only?
  3. What happens when region 2 loses network connectivity?
     a) I assume that a global Stateful Service would move its primary to one of the 3 other regions, and the lease for primary replicas in region 2 will expire or be evicted sooner by fault detection in the lease layer. Right?
     b) What happens to a region-2-bound Stateful Service? Is it able to keep the local leader leases up and running?
     c) What happens to Stateless Services in region 2? Are they able to continue working in region 2? My understanding is that additional instances will be created to keep the proper instance count in regions 1, 3, and 4. What will happen to the instances in region 2? Will those be kept alive or shut down by the infrastructure?
     d) (Interesting only when the answers for b and c are true) Which system services are available from region 2 during a network split? HealthManager - can services still report health, with the health reports becoming visible after the network rejoins? FailoverManager - when a service dies in region 2, would it be restarted inside region 2? ImageStore - when a node dies in region 2, would the services be moved to another node inside region 2? NamingService - are the services in region 2 able to keep resolving names of services inside region 2, or of external services from regions 1, 3, and 4 using a last-seen snapshot?

masnider commented 6 years ago
  1. Small (only a few bytes, maybe up to a couple of K) every few seconds, based on your leasing intervals. Slightly more when nodes are entering or leaving the ring, or when there are failures that need to be managed, or messages that are getting routed around (these are usually point to point, but can bounce and require additional routing). There's no single number, but we have never seen this layer be accountable for an appreciable portion of traffic in the cluster.

  2. The federation is a ring of overlapping neighborhoods. Each neighborhood is centered on each node, with some neighbors out in each direction around the ring. Leases are within neighborhoods for the most part. Most communication is point to point within the cluster, however if a node needs to route to a node that it doesn't already know an address for, it uses ring routing to bounce the message to the closest node that it does know about, which will then forward the message in a similar manner (either directly or in another hop closer to the desired node). FDs are really only handled at the higher levels. The actual federation may use FD info to decide how long a lease can wait, which means that on the whole the shorter leases that will be in place between neighbors that landed in the same region will be the ones that detect failures. The cross-fd lease duration is higher only to prevent nodes that end up with neighbors in the other regions from being penalized.
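A toy model of the greedy ring routing idea described above may help; this is purely illustrative and is not Service Fabric's actual implementation (the id space, neighborhood map, and routing rule are all simplified assumptions):

```python
# Toy greedy ring routing: each node knows only its own neighborhood, and a message
# is repeatedly forwarded to the known node whose id is closest (on the ring) to the
# target, until the target is reached or no closer neighbor is known.
RING_SIZE = 256  # tiny id space for the example

def ring_distance(a: int, b: int) -> int:
    d = abs(a - b) % RING_SIZE
    return min(d, RING_SIZE - d)

def route(source: int, target: int, neighbors: dict) -> list:
    path, current = [source], source
    while current != target:
        closest = min(neighbors[current], key=lambda n: ring_distance(n, target))
        if ring_distance(closest, target) >= ring_distance(current, target):
            break  # no known node is closer; a real system would report a failure here
        path.append(closest)
        current = closest
    return path

# Example: node 10 only knows a small neighborhood, yet the message still converges.
hops = route(10, 100, {10: {5, 20, 60}, 60: {50, 70, 90}, 90: {80, 100}, 100: {90}})
print(hops)  # [10, 60, 90, 100]
```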

3a) Leases are at the node level; there's no such thing as a "leader lease" at the replica or service level. When the network failed, everything would initially keep working normally, and then nodes within region 2 would find themselves unable to obtain new leases with any of their neighbors in the other regions and would go into arbitration. In arbitration, their partners would be able to contact the seed nodes while the nodes in region 2 would not, and hence the region 2 nodes would declare themselves to have lost and would take themselves down. Their partners would know that they must be down and would advertise that the nodes were down, which is what the higher levels observe and use to trigger appropriate actions, like failing over primary replicas from those nodes to other nodes. You'd expect the lease failures to start at the "edges" (any nodes with neighbors in the other regions) and propagate through the nodes in that region. The way we auto-bootstrap the cluster and the nodes and decide what neighborhoods they're in means that in most neighborhoods there will be neighbors in the other regions, hence you'd expect the entire region to fail within 1 lease duration (~30 seconds by default). For normally configured stateful services (with replicas in all regions), any primaries would remain up for this duration but unable to make any stateful progress, as writes to the replicas in the other regions would be unsuccessful.

3b/c) More or less the same things happen regardless of whether the service is stateless or stateful. If the service is stateful and constrained entirely to a region, then this service will continue operating and making progress until the nodes underneath it shut down. If its replicas are in other regions, it will remain up until the nodes underneath it go down, but it wouldn't be able to make additional progress (it could still serve reads). Same deal for a stateless service: it stays up and running until the nodes go down under it. If network connectivity is re-established during or after this, the nodes will rejoin the cluster and the services will be restarted.

Whether additional services are created to replace the ones that were in region 2 when they shut down depends a lot on how the service is configured. Ex: if the service is constrained to only run in region 2, then no new replicas will be created because no place would be valid. If the stateless service is configured with instancecount -1, then it's already running on all nodes in all other regions and so there's no additional instances to create. Otherwise, in general, if the service is free to run in the other regions and there's available capacity then as it fails in one region it will be failed over to some other region as you indicate in order to maintain desired target replica and instance counts.

3d) There's nothing magic here. The system services are just like other services and follow the same rules. So if the primary of that system service is in some other region, then it won't be accessible, if it's in region 2 which is isolated then it will continue functioning but might not be able to make further progress, but could still serve reads. So to take some examples, in the case of the health manager, that's a stateful service, so it follows those rules. What you'll end up seeing in health is the nodes failing and being down, and then the replica/instance counts for those services being below target until they are brought back up to target. Health reports from those nodes that went down may show up later, but it depends a lot on the nature of the network interruption (slow is the same as failure, so if they show up later they could get recorded, but they may also have been superseded by that time). As another example, naming information is cached on the nodes, so any information that was still accurate would continue to be used and so services would be able to continue to discover each other.

Hope this helps -Matt

mkosieradzki commented 6 years ago

@masnider I would like to thank you for taking your time to describe this architecture in such a detail. I really appreciate it and it would definitely help!

I really need to clear things up here:

When the network failed, [...] nodes within region 2 would find themselves unable to obtain new leases with any of their neighbors that were in the other regions, and would go into arbitration. In Arbitration their partner would be able to contact the seed nodes while the nodes in region 2 would not, and hence would declare themselves to have lost would take themselves down. [...] you'd expect the entire region to fail within 1 lease duration (~30 seconds by default).

If the service is stateful and constrained entirely to a region, then this service will continue operating and making progress until the nodes underneath it shut down.

The cross-fd lease duration is higher only to prevent nodes that end up with neighbors in the other regions from being penalized.

If I understand correctly, this means that we can expect the entire region to be down (even though some of the region-constrained services can still make progress) within one lease duration (30 seconds), and the cross-FD lease duration is not a mechanism intended to give such services a lot more time, like 30 minutes, but more like 60 seconds.

Are there any practical setups with a lease duration like 30 minutes? My understanding was that it makes no sense, and that the lease duration is the key factor in failover efficiency. But you have also said that Service Fabric has a very strong fault-detection mechanism, so I am a bit confused here. My understanding was by analogy to the standard simplistic etcd approach of using leases on a key in a KV store to store the leader information for services that don't do their own consensus.

My previous understanding was that each stateful service replica set elects its local leader by acquiring a lease, but this is apparently wrong, as you have written:

Leases are at the node level, there's no such thing as a "leader lease" at the replica or service level.

Does this mean that there is another, similar lease mechanism provided by one of the system services? Or is the primary replica re-election constrained in the edge cases by the cluster lease duration? My understanding is the latter.


If I am reasoning correctly, it all means that if one wanted to create an eventually consistent and network-fault-resilient system on top of SF, they should create multiple independent SF clusters and build some layer on top of them.

From my perspective, this is not a problem, as I agree, that "get better networking" is the way to go in Azure-era, but I also understand some concerns (for scenarios where getting better networking is too expensive ;) ):

Consider that you need to run two services: a region-bound service A and a global service.

Today in SF, to get the best resiliency, you should run the region-bound service A on a local or cross-AZ cluster and the global service on a multi-region cluster.

Because a dark-fibre outage for the local region would take service A down, the best way is to have two separate clusters: local and global.

I believe that @AceHack considers this an SF weakness in comparison to K8s:

As federated K8s does not provide any interesting guarantees, only some management facilities, I think that the ultimate solution for him is to run multiple SF clusters on top of federated K8s.

Federated K8s will provide a single "infrastructure view", and properly configured multiple SF clusters will provide the guarantees and do the heavy lifting. I believe this is the best fit for what @AceHack expects. I don't think containerizing SF on Linux should be very difficult, especially since it does not depend on any kernel drivers.

K8s seems to be getting local volume support, which could enable containerized SF to use ephemeral SSDs, if I am not wrong. So this setup might not actually be as stupid as it sounds at first.

However, I still cannot see any benefit in federating multiple SF clusters, nor SF with K8s. Why make everything more complicated? The only configurations that make sense to me in this context:

I would not use those myself as I would rather get a better network connection ;) and run a multi-region SF cluster and benefit from all the guarantees, abstractions and services.

masnider commented 4 years ago

I'm going to close down this conversation for now. There are some good feature asks in here that we should pull out into separate, specific requests. An "eventually consistent" model between separate SF clusters, or the ability to somehow designate certain clusters as owners or managers of other clusters seems at least partially legitimate, but again I'll reiterate that almost everyone just deploys separate clusters and uses some shared storage solution (or their own client->service syncing) to manage the relationships between clusters.

At this time there are no plans to:

I want to thank @mkosieradzki and @AceHack for the conversation. If there are other specific questions you want to pull out of here that I missed along the way, please do and tag me.