linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Multi-cluster Discovery Proposal #10747

Closed rushi47 closed 1 year ago

rushi47 commented 1 year ago
tags: linkerd kubernetes service mesh multicluster

Multi-cluster Discovery Proposal

LFX Mentorship proposal By Rushikesh Butley & Matei David

Problem:

Linkerd's multicluster extension allows operators to export headless services, which are primarily used for workloads deployed as StatefulSets (traditionally databases).

Working with a StatefulSet in complex cluster topologies (three or more clusters) is unwieldy. It requires constant manual intervention from cluster operators and application owners. For each new (or removed) exported service, the StatefulSet has to be edited, and all of the workloads have to be redeployed.

The above problem makes endpoint discovery for stateful workloads hard to deal with.

Goals & Constraints:

Below are the goals and some of the constraints we have scoped out for this proposal.

How should the problem be solved?

Design Exploration:

Background:

Currently, in multicluster discovery, if there are multiple peers, a mirror service is created for each peer when a service is mirrored from a target cluster into the source. To elaborate, here is an example:

We start with two clusters named source and target. Let's assume that a service named foo exists in both source and target, and that it needs to discover all the other replicas existing in target.

As we can imagine, various applications perform DNS-based service discovery for purposes including, but not limited to, running StatefulSets in high availability, e.g. Thanos and NGINX upstreams.

Background for this proposal:

While this works fine for a cluster with one target, it becomes quite cumbersome when there is more than one target cluster. To solve this problem, we want a standalone service in the cluster which, when queried, can return all the Endpoints from its peered clusters.

Solution:

Global Service:

A possible solution to this problem is creating a global service in each linked cluster. This global service will act as an aggregator and hold all the mirrored EndpointSlices. Below is an example to clarify it further.

Once the proposal is implemented, below will be the view from the source cluster.

With Current Release

(diagram: currentRelease)

Proposed Solution

(diagram: comingRelease)

Summary:

In summary, if a StatefulSet workload is deployed in a source cluster and it needs to discover replicas from other clusters, each cluster will need to be linked against the source. Each individual headless service will be mirrored in the source cluster. This leads to operational problems, where the StatefulSet needs to be manually modified and rolled out each time a new service is added. Consider the following example where a single target cluster is linked against source:

- name: foo
  args:
    - join
    - foo,foo-target1

Linking another cluster leads to a manual intervention to add the new service:

- name: foo
  args:
    - join
    - foo,foo-target1,foo-target2

Implementation Ideas:

To achieve the above solution using EndpointSlices, we might start by building on the service mirror component. We will deploy a controller in each cluster, alongside the current Linkerd services. The controller will watch all the local services inside the cluster.

To break it down into steps:

  1. As discussed above, we will create a new Service suffixed with -global.

  2. In the source cluster, we can have a controller/operator which watches for new Services being added. The controller will filter on the label mirror.linkerd.io/mirrored-service: "true"; this label is already placed by the Service Mirror component when it mirrors a service into the source cluster.

  3. This lets us execute our logic whenever a new service with the respective label appears in our cluster.

  4. Once our controller receives an event matching this filter, it checks whether a respective global service already exists in our cluster for the mirrored service, identified by the -global suffix.

    a) If it doesn't exist, our controller first creates a headless Service, named by appending the -global suffix to the mirror service name.

    b) If it exists, we skip creating the headless Service.

  5. To find the respective synthetic endpoints for a mirror service existing in our cluster, we query the Kubernetes API for the Service's Endpoints.

  6. All the retrieved Endpoints for mirrored services will be used to create a new EndpointSlice, owned by the -global service. We will start by maintaining only one EndpointSlice and, if required, split or scale it later. A sketch of the resulting objects is shown below.
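A minimal sketch of the objects described in steps 4-6, assuming the discovery.k8s.io/v1 EndpointSlice API; the service name foo, the namespace, the port, and the address are illustrative, not the exact objects the prototype creates:

# Hypothetical headless -global Service created by the controller (step 4a)
apiVersion: v1
kind: Service
metadata:
  name: foo-global
  namespace: default
spec:
  clusterIP: None      # headless: DNS queries return the backing endpoints directly
  ports:
    - port: 80
---
# Hypothetical aggregated EndpointSlice owned by foo-global (step 6)
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: foo-global-slice
  namespace: default
  labels:
    kubernetes.io/service-name: foo-global   # ties the slice to the -global service
addressType: IPv4
ports:
  - port: 80
endpoints:
  - addresses:
      - "10.43.26.159"   # copied from the mirrored foo-target1 endpoints (step 5)
    conditions:
      ready: true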

Development Plan and Testing Strategy:

We can deliver the above feature in two phases.

How will we run it?

We will build Docker images of our controller and deploy them in our multicluster environment using Helm charts. The controller will monitor all the local Services inside its respective cluster and take the required action when it encounters the label of interest.
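As a rough illustration of what the chart's configuration could expose; every key and value below is hypothetical, since no such chart exists yet:

# Hypothetical Helm values for the prototype controller chart
image:
  repository: ghcr.io/rushi47/service-mirror-prototype   # illustrative image location
  tag: latest
controller:
  # Label used to select mirrored services (placed by the Service Mirror component)
  mirrorServiceLabel: mirror.linkerd.io/mirrored-service
  # Suffix appended to the aggregated headless service's name
  globalServiceSuffix: -global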

For testing, we will simply run the controller against a local Kubernetes cluster with the default config.

Open Questions:

Language of Choice

Would you like to work on this feature?

yes

We have started the initial development here: https://github.com/rushi47/service-mirror-prototype

alpeb commented 1 year ago

Thanks for the writeup @rushi47. It seems the doc has some duplicated paragraphs (verbatim), and also a lot of duplicated ideas. It'd be easier to read if it were more concise. Also, you could remove the alternative proposal and focus on the selected one. And I think this is ripe for some diagrams that would make things clearer.

About the actual design, IIUC, there's gonna be a global service fronting all the mirrored services that have the same name (except for the target cluster name suffix), and requests to that global service would be round-robin'd across those mirrored services. There should be one global service per StatefulSet ordinal index. Like, for each index x, a service mysql-x-global fronting mysql-x-north, mysql-x-south, etc. (I'm not sure if ordinal indices are mirrored like that; @mateiidavid should have a clear idea). Otherwise the stable network identifiers that StatefulSets are supposed to guarantee would no longer be guaranteed. It wasn't clear to me from your exposition whether that would be the case.

Looking forward to your thoughts (and diagrams! :wink:)

rushi47 commented 1 year ago

@alpeb Thanks a lot for reading the proposal and for the suggestions. 😅 Yeah, apologies for the duplication; we were trying to follow the RFC format, and as I am new to writing, I might have mentioned the same things a couple of times. Regarding the second part, you are correct: we are trying to have one global service per stateful ordinal index. To shed more light: for mysql-x-north and mysql-x-south there will be mysql-x-global; for foo-x-north and foo-x-south there will be foo-x-global. As you stated, I think diagrams will be quite helpful here. I will try to get them in as well.

mateiidavid commented 1 year ago

@rushi47 thanks again for the proposal, here are some more notes:

> StatefulSet workloads are able to use newly added or removed exported services without being restarted.

This doesn't sound right to me. We don't want to use newly exported services, we want to use newly linked clusters, without restarting the workload.

Each linked cluster currently creates its own service. We want it to have its endpoints mirrored in the global service instead.

Design Exploration

Background

> Currently, in mutlicluster discovery if there are multiple peers, when service gets mirrored from target cluster to source , for each peer mirror service is created. To elaborate, below is the example :
>
> We start with two clusters named east (source) and west (target). Let's assume that there is service name foo-mysql existing in East and bar-mysql existing in west.

We should be consistent with our naming throughout the proposal. If we refer to the two linked clusters as source and target, we should stick with it instead of giving an example where we use cardinal points.

Similarly, I think I mentioned this before, we don't need foo-mysql and bar-mysql. Let's simplify it by saying that each cluster has a mysql service. When this service is mirrored, it will exist as mysql-target. Let's lower the cognitive burden here, it's an already complicated proposal and problem, with a lot of specialised terminology.

> As we can imagine, services like mysql operate in cluster to provide high availablity, performance and failovers.
>
> To do so, we need to discover the sister services running along side our main service. In this case, bar-mysql could be sister to foo-mysql and they might want to operate in consensus.
>
> Keeping the same analogy, we can imagine that foo-mysql might form the consesus with bar-mysql. The problem is foo-mysql can't find bar-mysql in east cluster directly/locally (it can if it's exported), the way it can find other services natively in east.

I don't think this really explains the problem, and we're using complicated terminology. How about we replace it with a much smaller paragraph:

"Stateful applications (e.g. mysql, redis, and other distributed databases) perform service discovery on start-up. Since a stateful workload may be replicated, it needs to elect a leader for writes. In a multicluster context, we want stateful workloads to discover replicas across local cluster boundaries".

> For above purpose, we need to make sure that foo-mysql can find bar-mysql natively by calling fqdn like bar-mysql.default.svc.east-cluster.local. And this will be backed by all the Endpoints, existing in its native west cluster.

Again, let's change those names. It's very hard to follow.

"To discover replicas, stateful workloads generally rely on a DNS hostname to connect to. To allow discovery across more complex cluster topologies, stateful workloads need hostnames that resolve to endpoints not present locally in the cluster".

> service_names followed by eps :
>
> bar-mysql-west bar-mysql-0-west bar-mysql-1-west
> zoo-mysql-north zoo-mysql-0-north zoo-mysql-1-north
> woo-mysql-south woo-mysql-0-south woo-mysql-1-south

This is a convoluted example. What do we want to illustrate here, what the database needs to accept for service discovery? Or how services are actually organised? If it's the former, we do not need to specify any hostname services since we won't use these when configuring service discovery.

" Assuming a topology that includes three clusters, here is an example of the DNS records that will be created when two targets clusters are linked against a source:

# Original DNS records created when mysql was deployed
Headless Service: mysql.default.svc.cluster.local (may resolve to any A records corresponding to statefulset pods)
Endpoints:
  - mysql-0.mysql.default.svc.cluster.local
  - mysql-1.mysql.default.svc.cluster.local

# DNS records created when clusters were linked
Headless Services:
  - mysql-target1.default.svc.cluster.local
  - mysql-target2.default.svc.cluster.local
Endpoints (one per cluster for simplicity):
  - mysql-0.mysql-target1.default.svc.cluster.local
  - mysql-1.mysql-target2.default.svc.cluster.local

"

This already lets me know that a consideration we should have made is not present in the proposal. Currently, DNS records are configured based on the mirror service name: <pod-name>.<svc>-<link-name>.<namespace>.svc.cluster.local. If we include a global service, we will need: <pod-name>-<link-name> since we can no longer guarantee that DNS records are unique.
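To illustrate the collision (hypothetical records, following the two formats above):

# Without the link name in the pod-level record, pods from different clusters
# produce the same record under the single global service:
mysql-0.mysql-global.default.svc.cluster.local   # from target1
mysql-0.mysql-global.default.svc.cluster.local   # from target2 (duplicate)

# Including the link name keeps every record unique:
mysql-0-target1.mysql-global.default.svc.cluster.local
mysql-0-target2.mysql-global.default.svc.cluster.local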

Proposed solution

Let's remove the alternative proposal present in the version you sent at first, per alpeb's suggestion.


Recommended solution

dig mysql-global.default.svc.east-cluster.local

:: ANSWERS

bar-mysql-0-west 172.43.128.182
bar-mysql-1-west 172.43.224.245
zoo-mysql-0-north 172.43.225.74
zoo-mysql-1-north 172.43.128.182
woo-mysql-1-south 172.42.0.36

Consistent naming is key here to keep it simple. Let's eliminate cardinal points and keep it to mysql. There's no reason why an operator would need to give this service different names.

For your summary:

> To summarise, currently in east cluster, services from west, north, south are mirrored. Although they are like sisters, each service exists as an individual service. This could lead to a problem where foo-mysql will have to go through the process of joining each individual cluster, like foo-mysql join= bar-west, zoo-north, woo-south, and places a dependency ordering. It will also trigger redeployments. This might also lead to various issues as described in https://github.com/linkerd/linkerd2/issues/7566, like manual intervention from cluster operators and application owners. And also for each new (or removed) exported service, the StatefulSet has to be edited, and all of the workloads have to be redeployed.

We're not illustrating what the actual problem is. Being a bit more concrete about this example (maybe even providing a yaml snippet) is more helpful. Here's an example:

"In summary, if a statefulset workload is deployed in a source cluster, and it needs to discover replicas from other clusters, each cluster will need to be linked against the source. Each individual headless service will be mirrored in the source cluster. This leads to operational problems, where a statefulset needs to be rolled out and manually modified each time a new service is added. Consider the following example where a single target cluster is linked against source:

- name: mysql
  args:
    - join
    - mysql,mysql-target1

Linking another cluster leads to a manual intervention to add the new service:

- name: mysql
  args:
    - join
    - mysql,mysql-target1,mysql-target2

"

> In source cluster, we can have controller/operator, which keep looking for the new Service being added and has annotation mirror.linkerd.io/mirrored-service: "true", this annotation is already place by Service Mirror component, when it mirrors service in source cluster.

How will this be deployed? By whom? Will this controller monitor all clusters, or just one? If it monitors all, how will it get access to them? Is the plan here to read link resources, or?

> As we get all the endpoint, backed for the service. We can use this, to create EndpointSlice. This EndpointSlice will back the Headless Service, created by the name -global.

We only mention one EndpointSlice. Is the goal to maintain one EndpointSlice or many EndpointSlices?


Misc feedback

mateiidavid commented 1 year ago

re: @alpeb

> About the actual design, IIUC, there's gonna be a global service fronting all the mirrored services that have the same name (except for the target cluster name suffix), and requests to that global service would be round-robin'd across those mirrored services.

I think this is more about how the workload does discovery. There will be two services in total:

  1. the service deployed with the workload itself
  2. a global service that aggregates all endpoints from other clusters.

The goal is to have a reliable way for workloads to do service discovery without restarting pods (or patching) whenever clusters are linked or unlinked.

> There should be one global service per StatefulSet ordinal index. Like, for each index x, a service mysql-x-global fronting mysql-x-north, mysql-x-south, etc. (I'm not sure if ordinal indices are mirrored like that; @mateiidavid should have a clear idea). Otherwise the stable network identifiers that StatefulSets are supposed to guarantee would no longer be guaranteed. It wasn't clear to me from your exposition whether that would be the case.

This is a good point. I think the idea is to still create a ClusterIP for each host backing a service (mysql-x-<cluster>). We will need these for A records and to support actual pod-to-pod communication through the gateway. However, instead of having multiple headless services, we have only one.

The goal is to have one service that can serve as our SRV record and resolve to all the other endpoints. Presumably, the StatefulSet workloads will use the hostname to resolve all the A records it points to.

# Proposed
mysql-global.svc.cluster.local will resolve to
- mysql-0-target1.mysql-global.svc.cluster.local
- mysql-1-target1.mysql-global.svc.cluster.local
- mysql-0-target2.mysql-global.svc.cluster.local
- mysql-1-target2.mysql-global.svc.cluster.local

# Current behaviour as of 2.13
mysql-target1.svc.cluster.local will resolve to
- mysql-0.mysql-target1.svc.cluster.local
- mysql-1.mysql-target1.svc.cluster.local

mysql-target2.svc.cluster.local will resolve to
- mysql-0.mysql-target2.svc.cluster.local
- mysql-1.mysql-target2.svc.cluster.local

Looking forward to your thoughts (and diagrams! 😉 )

Diagrams would be 100% helpful here 👍🏻

rushi47 commented 1 year ago

Hello @mateiidavid, thanks a lot for all the comments. I am working on refactoring this proposal, but before I post the edited version I have a slight doubt.

> The goal is to have one service that can serve as our SRV record and resolve to all the other endpoints. Presumably, the StatefulSet workloads will use the hostname to resolve all the A records it points to.
>
> Proposed
>
> mysql-global.svc.cluster.local will resolve to
>
> - mysql-0-target1.mysql-global.svc.cluster.local
> - mysql-1-target1.mysql-global.svc.cluster.local
> - mysql-0-target2.mysql-global.svc.cluster.local
> - mysql-1-target2.mysql-global.svc.cluster.local
>
> Current behaviour as of 2.13
>
> mysql-target1.svc.cluster.local will resolve to
>
> - mysql-0.mysql-target1.svc.cluster.local
> - mysql-1.mysql-target1.svc.cluster.local
>
> mysql-target2.svc.cluster.local will resolve to
>
> - mysql-0.mysql-target2.svc.cluster.local
> - mysql-1.mysql-target2.svc.cluster.local

I think I am a little confused here, as after our discussions I was thinking we will have a parallel service named global, running alongside all the other existing replicas. This will also help us with incremental rollout, even if we publish this as an extension.

# Proposed
mysql-global.svc.cluster.local will resolve to
- mysql-0-target1.mysql-global.svc.cluster.local
- mysql-1-target1.mysql-global.svc.cluster.local
- mysql-0-target2.mysql-global.svc.cluster.local
- mysql-1-target2.mysql-global.svc.cluster.local

mysql-target1.svc.cluster.local will resolve to
- mysql-0.mysql-target1.svc.cluster.local
- mysql-1.mysql-target1.svc.cluster.local

mysql-target2.svc.cluster.local will resolve to
- mysql-0.mysql-target2.svc.cluster.local
- mysql-1.mysql-target2.svc.cluster.local

I have tested this locally and I noticed that we can have two services pointing to the same endpoints. Maybe we can start by using only one EndpointSlice and, if required, separate it later.

mysql-svc-global       10.43.26.159:80,10.43.128.231:80,10.43.164.5:80   
mysql-svc-target1     10.43.26.159:80,10.43.128.231:80,10.43.164.5:80       

I was thinking our controller will be deployed in each linked cluster, i.e. the source cluster will have its own controller, and so will each target. The controller deployed in a given cluster will monitor all of that cluster's local services (including mirrored services), perhaps filtered by the label mirror.linkerd.io/mirrored-service: "true".

I hope this also answers the question above:

> How will this be deployed? By whom? Will this controller monitor all clusters, or just one? If it monitors all, how will it get access to them? Is the plan here to read link resources, or?

mateiidavid commented 1 year ago

@rushi47

I mentioned this in some DMs. It's important we don't hold any assumptions (or knowledge) only in our brains when designing this. Any assumptions about this system and the conditions it runs in should be put in the proposal. For example, one contentious point: how will this discover changes in the remote cluster? Your plan to monitor local services is good, but we need to have it written down in the main proposal. The proposal does not mention any assumption that the services will exist in parallel (i.e. a mirror service and a global service).

In my opinion discovery and deployment here will be closely related; we should also include a paragraph on how we intend to have this prototype deployed. If it's just a helm chart with an image, that's fine. If it's just a manifest, that's also fine. As long as we are explicit about ordering, dependencies, and all other assumptions.


> I have tested this locally and I noticed that we can have two services pointing to the same endpoints

I appreciate you want to keep this easy, but I'm not sure it's a good approach here. In fact, it might complicate a little bit what we are trying to do. In the service mirror, for example, we keep track of EndpointSlice ownership through the service label. If we want multiple services to own a single EndpointSlice object, we would need a tiered ownership reference object.

I think it's best if we keep those resources separate. Your idea of filtering services by label is good. Once we filter services, we can either create a separate slice (owned by the "global" service) or append to an existing one. I think it is a bit clearer to keep the slices separate here, as sketched below.
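For example, the separation might look like this, assuming the discovery.k8s.io/v1 EndpointSlice API; the slice names are illustrative, and the kubernetes.io/service-name label is the standard way a slice is associated with the service it backs:

# Slice created by the service mirror, owned by the mirror service
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: mysql-target1-1
  labels:
    kubernetes.io/service-name: mysql-target1   # backs the mirror service only
addressType: IPv4
endpoints:
  - addresses: ["172.43.128.182"]
    conditions:
      ready: true
---
# Separate slice created by the new controller, owned by the global service
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: mysql-global-1
  labels:
    kubernetes.io/service-name: mysql-global    # backs the global service only
addressType: IPv4
endpoints:
  - addresses: ["172.43.128.182"]   # same underlying endpoint, distinct slice object
    conditions:
      ready: true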

Let's put all of the suggestions into the proposal, and maybe link the repo we're using for the prototype.

rushi47 commented 1 year ago

@mateiidavid & @alpeb Thank you for your comments. I have updated the proposal. It would be a great help if you could take a look.

mateiidavid commented 1 year ago

Thanks @rushi47 but some parts have still not changed. See below:

> Currently, in mutlicluster discovery if there are multiple peers, when service gets mirrored from target cluster to source , for each peer mirror service is created. To elaborate, below is the example :
>
> We start with two clusters named east (source) and west (target). Let's assume that there is service name mysql existing in east and mysql existing in west. As we can imagine, services like mysql operate in cluster to provide high availablity, performance and failovers.
>
> Stateful applications (e.g. mysql, redis, and other distributed databases) perform service discovery on start-up
>
> Possible solution to this problem is, creating global service in each linked cluster

endpoints:
  - addresses:
      - "172.43.128.182"
      - "172.43.224.245"
      - "172.43.129.182"
    conditions:
      ready: true
    ...

The addresses need to contain a hostname in order for DNS to be set up.
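For illustration, here is a sketch of the same snippet with hostnames, assuming the discovery.k8s.io/v1 EndpointSlice API; note that hostname is a per-endpoint field, so each address that needs its own DNS record gets its own endpoint entry (the hostnames below are illustrative):

endpoints:
  - addresses:
      - "172.43.128.182"
    hostname: mysql-0-target1   # yields mysql-0-target1.<svc>.<ns>.svc.cluster.local
    conditions:
      ready: true
  - addresses:
      - "172.43.224.245"
    hostname: mysql-1-target1
    conditions:
      ready: true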

...the rest of it looks good, nice job :)

alpeb commented 1 year ago

Agreed with @mateiidavid, the idea of the diagrams is to present an architectural overview, clarifying the relationship between clusters and their services, and illustrating how connections would flow through.

rushi47 commented 1 year ago

Hey @mateiidavid and @alpeb, apologies for the to and fro. I have refactored the proposal again and tried to make it simpler. I have also removed a lot of references pointing to any specific service and tried to keep the proposal generic. I also tried to refactor the diagrams and make them neater. I hope this will be helpful. Looking forward to your feedback.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.