linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Multi-cluster Discovery Proposal #10747

Closed rushi47 closed 1 year ago

rushi47 commented 1 year ago
tags: linkerd kubernetes service mesh multicluster

Multi-cluster Discovery Proposal

LFX Mentorship proposal By Rushikesh Butley & Matei David

Problem:

Linkerd's multicluster extension allows operators to export headless services, which are primarily used for workloads deployed as StatefulSets (traditionally databases).

Working with a StatefulSet in complex cluster topologies (three or more clusters) is unwieldy. It requires constant manual intervention from cluster operators and application owners. For each new (or removed) exported service, the StatefulSet has to be edited, and all of the workloads have to be redeployed.

The above problem makes endpoint discovery for stateful workloads hard to deal with.

Goals & Constraints:

Below are the goals and some of the constraints we have scoped out for this proposal.

How should the problem be solved?

Design Exploration:

Background:

Currently, in multicluster discovery, if there are multiple peers, a mirror service is created for each peer when a service is mirrored from a target cluster into the source. To elaborate, here is an example:

We start with two clusters named source and target. Let's assume that a service named foo exists in both source and target, and that it needs to discover all the other replicas existing in target.

As we can imagine, various applications perform DNS-based service discovery for purposes including, but not limited to, running StatefulSets in high availability, e.g. Thanos and NGINX upstreams.

Background for this proposal:

While this works fine for a cluster with one target, it becomes quite cumbersome when there is more than one target cluster. To solve this problem, we want a standalone service in the cluster which, when queried, can return all the Endpoints from its peered clusters.

Solution:

Global Service:

A possible solution to this problem is creating a global service in each linked cluster. This global service will act as an aggregator and hold all the mirrored EndpointSlices. Below is an example to clarify it further.

Once the proposal is implemented, below will be the view from the source cluster.

With Current Release

(diagram: currentRelease)

Proposed Solution

(diagram: comingRelease)

Summary:

In summary, if a StatefulSet workload is deployed in a source cluster and it needs to discover replicas from other clusters, each cluster will need to be linked against the source. Each individual headless service will be mirrored in the source cluster. This leads to operational problems, where the StatefulSet needs to be manually modified and rolled out each time a new service is added. Consider the following example where a single target cluster is linked against source:

- name: foo
  args:
    - join
    - foo,foo-target1

Linking another cluster leads to a manual intervention to add the new service:

- name: foo
  args:
    - join
    - foo,foo-target1,foo-target2

Implementation Ideas:

To achieve the above solution using EndpointSlices, we might start by building on the service mirror component. We will deploy a controller in each cluster, alongside the current Linkerd services. The controller will watch all the local services inside the cluster.

To break it down into steps:

  1. As discussed above, we will create a new Service suffixed with -global.

  2. In the source cluster, we can have a controller/operator which watches for new Services being added. The controller will filter on the label mirror.linkerd.io/mirrored-service: "true"; this label is already placed by the Service Mirror component when it mirrors a service into the source cluster.

  3. This lets us execute our logic whenever a new service with the respective label appears in our cluster.

  4. Once our controller receives an event matching this filter, it checks whether a respective global service already exists in our cluster for the mirrored service, identified by the -global suffix.

    a) If it doesn't exist, our controller first creates a headless Service, named by appending the -global suffix to the mirror service name.

    b) If it exists, we skip creating the headless Service.

  5. To find the respective synthetic endpoints for a mirror service existing in our cluster, we query the Kubernetes API for the Service's Endpoints.

  6. All the retrieved Endpoints for mirrored services will be used to create a new EndpointSlice, owned by the -global service. We will start by maintaining only one EndpointSlice and, if required, split or scale it later. A sketch of the resulting objects is shown below.
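A minimal sketch of the objects described in steps 4-6, assuming the discovery.k8s.io/v1 EndpointSlice API; the service name foo, the namespace, the port, and the address are illustrative, not the exact objects the prototype creates:

# Hypothetical headless -global Service created by the controller (step 4a)
apiVersion: v1
kind: Service
metadata:
  name: foo-global
  namespace: default
spec:
  clusterIP: None      # headless: DNS queries return the backing endpoints directly
  ports:
    - port: 80
---
# Hypothetical aggregated EndpointSlice owned by foo-global (step 6)
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: foo-global-slice
  namespace: default
  labels:
    kubernetes.io/service-name: foo-global   # ties the slice to the -global service
addressType: IPv4
ports:
  - port: 80
endpoints:
  - addresses:
      - "10.43.26.159"   # copied from the mirrored foo-target1 endpoints (step 5)
    conditions:
      ready: true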

Development Plan and Testing Strategy:

We can deliver the above feature in two phases.

How will we run it?

We will build Docker images of our controller and deploy them in our multicluster environment using Helm charts. The controller will monitor all the local Services inside its respective cluster and take the required action when it encounters the label of interest.
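As a rough illustration of what the chart's configuration could expose; every key and value below is hypothetical, since no such chart exists yet:

# Hypothetical Helm values for the prototype controller chart
image:
  repository: ghcr.io/rushi47/service-mirror-prototype   # illustrative image location
  tag: latest
controller:
  # Label used to select mirrored services (placed by the Service Mirror component)
  mirrorServiceLabel: mirror.linkerd.io/mirrored-service
  # Suffix appended to the aggregated headless service's name
  globalServiceSuffix: -global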

For testing, we will simply run the controller against a local Kubernetes cluster with the default config.

Open Questions:

Language of Choice

Would you like to work on this feature?

yes

We have started the initial development here: https://github.com/rushi47/service-mirror-prototype

alpeb commented 1 year ago

Thanks for the writeup @rushi47. It seems the doc has some duplicated paragraphs (verbatim), and also a lot of duplicated ideas. It'd be easier to read if it were more concise. Also, you could remove the alternative proposal and focus on the selected one. And I think this is ripe for some diagrams that would make things clearer.

About the actual design, IIUC, there's gonna be a global service fronting all the mirrored services that have the same name (except for the target cluster name suffix), and requests to that global service would be round-robin'd across those mirrored services. There should be one global service per StatefulSet ordinal index. Like, for each index x, a service mysql-x-global fronting mysql-x-north, mysql-x-south, etc. (I'm not sure if ordinal indices are mirrored like that; @mateiidavid should have a clear idea). Otherwise the stable network identifiers that StatefulSets are supposed to guarantee would no longer be guaranteed. It wasn't clear to me from your exposition whether that would be the case.

Looking forward to your thoughts (and diagrams! :wink:)

rushi47 commented 1 year ago

@alpeb Thanks a lot for reading the proposal and for the suggestions. 😅 Yeah, apologies for the duplication; we were trying to follow the RFC format, and as I am new to writing, I might have mentioned the same things a couple of times. Regarding the second part, you are correct: we are trying to have one global service per stateful ordinal index. To shed more light: for mysql-x-north and mysql-x-south there will be mysql-x-global; for foo-x-north and foo-x-south there will be foo-x-global. As you stated, I think diagrams will be quite helpful here. I will try to get them in as well.

mateiidavid commented 1 year ago

@rushi47 thanks again for the proposal, here are some more notes:

> StatefulSet workloads are able to use newly added or removed exported services without being restarted.

This doesn't sound right to me. We don't want to use newly exported services, we want to use newly linked clusters, without restarting the workload.

Each linked cluster currently creates its own service. We want it to have its endpoints mirrored in the global service instead.

Design Exploration

Background

> Currently, in mutlicluster discovery if there are multiple peers, when service gets mirrored from target cluster to source , for each peer mirror service is created. To elaborate, below is the example :
>
> We start with two clusters named east (source) and west (target). Let's assume that there is service name foo-mysql existing in East and bar-mysql existing in west.

We should be consistent with our naming throughout the proposal. If we refer to the two linked clusters as source and target, we should stick with it instead of giving an example where we use cardinal points.

Similarly, I think I mentioned this before, we don't need foo-mysql and bar-mysql. Let's simplify it by saying that each cluster has a mysql service. When this service is mirrored, it will exist as mysql-target. Let's lower the cognitive burden here, it's an already complicated proposal and problem, with a lot of specialised terminology.

> As we can imagine, services like mysql operate in cluster to provide high availablity, performance and failovers.
>
> To do so, we need to discover the sister services running along side our main service. In this case, bar-mysql could be sister to foo-mysql and they might want to operate in consensus.
>
> Keeping the same analogy, we can imagine that foo-mysql might form the consesus with bar-mysql. The problem is foo-mysql can't find bar-mysql in east cluster directly/locally (it can if it's exported), the way it can find other services natively in east.

I don't think this really explains the problem, and we're using complicated terminology. How about we replace it with a much smaller paragraph:

"Stateful applications (e.g. mysql, redis, and other distributed databases) perform service discovery on start-up. Since a stateful workload may be replicated, it needs to elect a leader for writes. In a multicluster context, we want stateful workloads to discover replicas across local cluster boundaries".

> For above purpose, we need to make sure that foo-mysql can find bar-mysql natively by calling fqdn like bar-mysql.default.svc.east-cluster.local. And this will be backed by all the Endpoints, existing in its native west cluster.

Again, let's change those names. It's very hard to follow.

"To discover replicas, stateful workloads generally rely on a DNS hostname to connect to. To allow discovery across more complex cluster topologies, stateful workloads need hostnames that resolve to endpoints not present locally in the cluster".

> service_names followed by eps :
>
> bar-mysql-west bar-mysql-0-west bar-mysql-1-west
> zoo-mysql-north zoo-mysql-0-north zoo-mysql-1-north
> woo-mysql-south woo-mysql-0-south woo-mysql-1-south

This is a convoluted example. What do we want to illustrate here, what the database needs to accept for service discovery? Or how services are actually organised? If it's the former, we do not need to specify any hostname services since we won't use these when configuring service discovery.

" Assuming a topology that includes three clusters, here is an example of the DNS records that will be created when two targets clusters are linked against a source:

# Original DNS records created when mysql was deployed
Headless Service: mysql.default.svc.cluster.local (may resolve to any A records corresponding to statefulset pods)
Endpoints:
  - mysql-0.mysql.default.svc.cluster.local
  - mysql-1.mysql.default.svc.cluster.local

# DNS records created when clusters were linked
Headless Services:
  - mysql-target1.default.svc.cluster.local
  - mysql-target2.default.svc.cluster.local
Endpoints (one per cluster for simplicity):
  - mysql-0.mysql-target1.default.svc.cluster.local
  - mysql-1.mysql-target2.default.svc.cluster.local

"

This already lets me know that a consideration we should have made is not present in the proposal. Currently, DNS records are configured based on the mirror service name: <pod-name>.<svc>-<link-name>.<namespace>.svc.cluster.local. If we include a global service, we will need: <pod-name>-<link-name> since we can no longer guarantee that DNS records are unique.
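To illustrate the collision (hypothetical records, following the two formats above):

# Without the link name in the pod-level record, pods from different clusters
# produce the same record under the single global service:
mysql-0.mysql-global.default.svc.cluster.local   # from target1
mysql-0.mysql-global.default.svc.cluster.local   # from target2 (duplicate)

# Including the link name keeps every record unique:
mysql-0-target1.mysql-global.default.svc.cluster.local
mysql-0-target2.mysql-global.default.svc.cluster.local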

Proposed solution

Let's remove the alternative proposal present in the version you sent at first, per alpeb's suggestion.


Recommended solution

dig mysql-global.default.svc.east-cluster.local

:: ANSWERS

bar-mysql-0-west 172.43.128.182
bar-mysql-1-west 172.43.224.245
zoo-mysql-0-north 172.43.225.74
zoo-mysql-1-north 172.43.128.182
woo-mysql-1-south 172.42.0.36

Consistent naming is key here to keep it simple. Let's eliminate cardinal points and keep it to mysql. There's no reason why an operator would need to give this service different names.

For your summary:

> To summarise, currently in east cluster, services from west, north, south are mirrored. Although they are like sisters, each service exists as an individual service. This could lead to a problem where foo-mysql will have to go through the process of joining each individual cluster, like foo-mysql join= bar-west, zoo-north, woo-south, and places a dependency ordering. It will also trigger redeployments. This might also lead to various issues as described in https://github.com/linkerd/linkerd2/issues/7566, like manual intervention from cluster operators and application owners. And also for each new (or removed) exported service, the StatefulSet has to be edited, and all of the workloads have to be redeployed.

We're not illustrating what the actual problem is. Being a bit more concrete about this example (maybe even providing a yaml snippet) is more helpful. Here's an example:

"In summary, if a statefulset workload is deployed in a source cluster, and it needs to discover replicas from other clusters, each cluster will need to be linked against the source. Each individual headless service will be mirrored in the source cluster. This leads to operational problems, where a statefulset needs to be rolled out and manually modified each time a new service is added. Consider the following example where a single target cluster is linked against source:

- name: mysql
  args:
    - join
    - mysql,mysql-target1

Linking another cluster leads to a manual intervention to add the new service:

- name: mysql
  args:
    - join
    - mysql,mysql-target1,mysql-target2

"

> In source cluster, we can have controller/operator, which keep looking for the new Service being added and has annotation mirror.linkerd.io/mirrored-service: "true", this annotation is already place by Service Mirror component, when it mirrors service in source cluster.

How will this be deployed? By whom? Will this controller monitor all clusters, or just one? If it monitors all, how will it get access to them? Is the plan here to read link resources, or?

> As we get all the endpoint, backed for the service. We can use this, to create EndpointSlice. This EndpointSlice will back the Headless Service, created by the name -global.

We only mention one EndpointSlice. Is the goal to maintain one EndpointSlice or many EndpointSlices?


Misc feedback

mateiidavid commented 1 year ago

re: @alpeb

> About the actual design, IIUC, there's gonna be a global service fronting all the mirrored services that have the same name (except for the target cluster name suffix), and requests to that global service would be round-robin'd across those mirrored services.

I think this is more about how the workload does discovery. There will be two services in total:

  1. the service deployed with the workload itself
  2. a global service that aggregates all endpoints from other clusters.

The goal is to have a reliable way for workloads to do service discovery without restarting pods (or patching) whenever clusters are linked or unlinked.

> There should be one global service per StatefulSet ordinal index. Like, for each index x, a service mysql-x-global fronting mysql-x-north, mysql-x-south, etc. (I'm not sure if ordinal indices are mirrored like that; @mateiidavid should have a clear idea). Otherwise the stable network identifiers that StatefulSets are supposed to guarantee would no longer be guaranteed. It wasn't clear to me from your exposition whether that would be the case.

This is a good point. I think the idea is to still create a ClusterIP for each host backing a service (mysql-x-<cluster>). We will need these for A records and to support actual pod-to-pod communication through the gateway. However, instead of having multiple headless services, we have only one.

The goal is to have one service that can serve as our SRV record and resolve to all the other endpoints. Presumably, the StatefulSet workloads will use the hostname to resolve all the A records it points to.

# Proposed
mysql-global.svc.cluster.local will resolve to
- mysql-0-target1.mysql-global.svc.cluster.local
- mysql-1-target1.mysql-global.svc.cluster.local
- mysql-0-target2.mysql-global.svc.cluster.local
- mysql-1-target2.mysql-global.svc.cluster.local

# Current behaviour as of 2.13
mysql-target1.svc.cluster.local will resolve to
- mysql-0.mysql-target1.svc.cluster.local
- mysql-1.mysql-target1.svc.cluster.local

mysql-target2.svc.cluster.local will resolve to
- mysql-0.mysql-target2.svc.cluster.local
- mysql-1.mysql-target2.svc.cluster.local

Looking forward to your thoughts (and diagrams! 😉 )

Diagrams would be 100% helpful here 👍🏻

rushi47 commented 1 year ago

Hello @mateiidavid, thanks a lot for all the comments. I am working on refactoring this proposal, but before I post the edited version I have a slight doubt.

> The goal is to have one service that can serve as our SRV record and resolve to all the other endpoints. Presumably, the StatefulSet workloads will use the hostname to resolve all the A records it points to.
>
> Proposed
>
> mysql-global.svc.cluster.local will resolve to
>
> - mysql-0-target1.mysql-global.svc.cluster.local
> - mysql-1-target1.mysql-global.svc.cluster.local
> - mysql-0-target2.mysql-global.svc.cluster.local
> - mysql-1-target2.mysql-global.svc.cluster.local
>
> Current behaviour as of 2.13
>
> mysql-target1.svc.cluster.local will resolve to
>
> - mysql-0.mysql-target1.svc.cluster.local
> - mysql-1.mysql-target1.svc.cluster.local
>
> mysql-target2.svc.cluster.local will resolve to
>
> - mysql-0.mysql-target2.svc.cluster.local
> - mysql-1.mysql-target2.svc.cluster.local

I think I am a little confused here, as after our discussions I was thinking we will have a parallel service named global, running alongside all the other existing replicas. This will also help us with incremental rollout, even if we publish this as an extension.

# Proposed
mysql-global.svc.cluster.local will resolve to
- mysql-0-target1.mysql-global.svc.cluster.local
- mysql-1-target1.mysql-global.svc.cluster.local
- mysql-0-target2.mysql-global.svc.cluster.local
- mysql-1-target2.mysql-global.svc.cluster.local

mysql-target1.svc.cluster.local will resolve to
- mysql-0.mysql-target1.svc.cluster.local
- mysql-1.mysql-target1.svc.cluster.local

mysql-target2.svc.cluster.local will resolve to
- mysql-0.mysql-target2.svc.cluster.local
- mysql-1.mysql-target2.svc.cluster.local

I have tested this locally and I noticed that we can have two services pointing to the same endpoints. Maybe we can start by using only one EndpointSlice and, if required, separate it later.

mysql-svc-global       10.43.26.159:80,10.43.128.231:80,10.43.164.5:80   
mysql-svc-target1     10.43.26.159:80,10.43.128.231:80,10.43.164.5:80       

I was thinking our controller will be deployed in each linked cluster, i.e. the source cluster will have its own controller, and so will each target. The controller deployed in a given cluster will monitor all of that cluster's local services (including mirrored services), perhaps filtered by the label mirror.linkerd.io/mirrored-service: "true".

I hope this also answers the question above:

> How will this be deployed? By whom? Will this controller monitor all clusters, or just one? If it monitors all, how will it get access to them? Is the plan here to read link resources, or?

mateiidavid commented 1 year ago

@rushi47

I mentioned this in some DMs. It's important we don't hold any assumptions (or knowledge) only in our brains when designing this. Any assumptions about this system and the conditions it runs in should be put in the proposal. For example, one contentious point: how will this discover changes in the remote cluster? Your plan to monitor local services is good, but we need to have it written down in the main proposal. The proposal does not mention any assumption that the services will exist in parallel (i.e. a mirror service and a global service).

In my opinion discovery and deployment here will be closely related; we should also include a paragraph on how we intend to have this prototype deployed. If it's just a helm chart with an image, that's fine. If it's just a manifest, that's also fine. As long as we are explicit about ordering, dependencies, and all other assumptions.


> I have tested this locally and I noticed that we can have two services pointing to the same endpoints

I appreciate you want to keep this easy, but I'm not sure it's a good approach here. In fact, it might complicate a little bit what we are trying to do. In the service mirror, for example, we keep track of EndpointSlice ownership through the service label. If we want multiple services to own a single EndpointSlice object, we would need a tiered ownership reference object.

I think it's best if we keep those resources separate. Your idea of filtering services by label is good. Once we filter services, we can either create a separate slice (owned by the "global" service) or append to an existing one. I think it is a bit clearer to keep the slices separate here, as sketched below.
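For example, the separation might look like this, assuming the discovery.k8s.io/v1 EndpointSlice API; the slice names are illustrative, and the kubernetes.io/service-name label is the standard way a slice is associated with the service it backs:

# Slice created by the service mirror, owned by the mirror service
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: mysql-target1-1
  labels:
    kubernetes.io/service-name: mysql-target1   # backs the mirror service only
addressType: IPv4
endpoints:
  - addresses: ["172.43.128.182"]
    conditions:
      ready: true
---
# Separate slice created by the new controller, owned by the global service
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: mysql-global-1
  labels:
    kubernetes.io/service-name: mysql-global    # backs the global service only
addressType: IPv4
endpoints:
  - addresses: ["172.43.128.182"]   # same underlying endpoint, distinct slice object
    conditions:
      ready: true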

Let's put all of the suggestions into the proposal, and maybe link the repo we're using for the prototype.

rushi47 commented 1 year ago

@mateiidavid & @alpeb Thank you for your comments. I have updated the proposal. It would be a great help if you could take a look.

mateiidavid commented 1 year ago

Thanks @rushi47 but some parts have still not changed. See below:

> Currently, in mutlicluster discovery if there are multiple peers, when service gets mirrored from target cluster to source , for each peer mirror service is created. To elaborate, below is the example :
>
> We start with two clusters named east (source) and west (target). Let's assume that there is service name mysql existing in east and mysql existing in west. As we can imagine, services like mysql operate in cluster to provide high availablity, performance and failovers.
>
> Stateful applications (e.g. mysql, redis, and other distributed databases) perform service discovery on start-up
>
> Possible solution to this problem is, creating global service in each linked cluster

endpoints:
  - addresses:
      - "172.43.128.182"
      - "172.43.224.245"
      - "172.43.129.182"
    conditions:
      ready: true
    ...

The addresses need to contain a hostname in order for DNS to be set up.
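For illustration, here is a sketch of the same snippet with hostnames, assuming the discovery.k8s.io/v1 EndpointSlice API; note that hostname is a per-endpoint field, so each address that needs its own DNS record gets its own endpoint entry (the hostnames below are illustrative):

endpoints:
  - addresses:
      - "172.43.128.182"
    hostname: mysql-0-target1   # yields mysql-0-target1.<svc>.<ns>.svc.cluster.local
    conditions:
      ready: true
  - addresses:
      - "172.43.224.245"
    hostname: mysql-1-target1
    conditions:
      ready: true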

...the rest of it looks good, nice job :)

alpeb commented 1 year ago

Agreed with @mateiidavid, the idea of the diagrams is to present an architectural overview, clarifying the relationship between clusters and their services, and illustrating how connections would flow through.

rushi47 commented 1 year ago

Hey @mateiidavid and @alpeb, apologies for the to and fro. I have refactored the proposal again and tried to make it simpler. I have also removed a lot of references pointing to any specific service and tried to keep the proposal generic. I also tried to refactor the diagrams and make them neater. I hope this will be helpful. Looking forward to your feedback.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.