Thanks for the writeup @rushi47. It seems the doc has some duplicated paragraphs (verbatim), and also a lot of duplicated ideas. It'd be easier to read if it were more concise. Also, you could remove the alternative proposal and focus on the selected one. And I think this is ripe for some diagrams that would make things clearer.
About the actual design, IIUC, there's gonna be a global service fronting all the mirrored services that have the same name (except for the target cluster name suffix), and requests to that global service would be round-robin'd across those mirrored services. There should be one global service per Stateful ordinal index. Like, for each index x, a service `mysql-x-global` fronting `mysql-x-north`, `mysql-x-south`, etc (I'm not sure if ordinal indices are mirrored like that, @mateiidavid should have a clear idea). Otherwise the stable network identifiers that StatefulSets are supposed to guarantee would no longer be guaranteed. It wasn't clear to me in your exposition if that would be the case.
Looking forward to your thoughts (and diagrams! :wink: )
@alpeb Thanks a lot for reading the proposal and the suggestions.
😅 Yeah, apologies for the duplication, we were trying to follow the RFC format. And as I am new to writing, I might have mentioned the same things a couple of times.
Regarding the second part, you are correct: we are trying to have one global service per stateful ordinal index. To shed more light: for `mysql-x-north` and `mysql-x-south` there will be `mysql-x-global`; for `foo-x-north` and `foo-x-south` there will be `foo-x-global`.
As you stated, I think diagrams will be quite helpful here. I will try to get them in as well.
@rushi47 thanks again for the proposal, here are some more notes:
> StatefulSet workloads are able to use newly added or removed exported services without being restarted.

This doesn't sound right to me. We don't want to use newly exported services, we want to use newly linked clusters, without restarting the workload.
Each linked cluster currently creates its own service. We want it to have its endpoints mirrored in the global service instead.
> Currently, in multicluster discovery if there are multiple peers, when a service gets mirrored from target cluster to source, for each peer a mirror service is created. To elaborate, below is the example:
>
> We start with two clusters named east (source) and west (target). Let's assume that there is a service named foo-mysql existing in east and bar-mysql existing in west.
We should be consistent with our naming throughout the proposal. If we refer to the two linked clusters as source and target, we should stick with it instead of giving an example where we use cardinal points.
Similarly, I think I mentioned this before, we don't need `foo-mysql` and `bar-mysql`. Let's simplify it by saying that each cluster has a `mysql` service. When this service is mirrored, it will exist as `mysql-target`. Let's lower the cognitive burden here; it's an already complicated proposal and problem, with a lot of specialised terminology.
> As we can imagine, services like mysql operate in cluster to provide high availability, performance and failovers. To do so, we need to discover the sister services running alongside our main service. In this case, bar-mysql could be sister to foo-mysql and they might want to operate in consensus. Keeping the same analogy, we can imagine that foo-mysql might form the consensus with bar-mysql. The problem is foo-mysql can't find bar-mysql in east cluster directly/locally (it can if it's exported), the way it can find other services natively in east.
I don't think this really explains the problem, and we're using complicated terminology. How about we replace it with a much smaller paragraph:
"Stateful applications (e.g. mysql, redis, and other distributed databases) perform service discovery on start-up. Since a stateful workload may be replicated, it needs to elect a leader for writes. In a multicluster context, we want stateful workloads to discover replicas across local cluster boundaries".
> For the above purpose, we need to make sure that foo-mysql can find bar-mysql natively by calling a fqdn like bar-mysql.default.svc.east-cluster.local. And this will be backed by all the Endpoints existing in its native west cluster.
Again, let's change those names. It's very hard to follow.
"To discover replicas, stateful workloads generally rely on a DNS hostname to connect to. To allow discovery across more complex cluster topologies, stateful workloads need hostnames that resolve to endpoints not present locally in the cluster".
> service_names followed by eps:
>
> ```
> bar-mysql-west    bar-mysql-0-west    bar-mysql-1-west
> zoo-mysql-north   zoo-mysql-0-north   zoo-mysql-1-north
> woo-mysql-south   woo-mysql-0-south   woo-mysql-1-south
> ```
This is a convoluted example. What do we want to illustrate here, what the database needs to accept for service discovery? Or how services are actually organised? If it's the former, we do not need to specify any hostname services since we won't use these when configuring service discovery.
" Assuming a topology that includes three clusters, here is an example of the DNS records that will be created when two targets clusters are linked against a source:
# Original DNS records created when mysql was deployed
Headless Service: mysql.default.svc.cluster.local (may resolve to any A records corresponding to statefulset pods)
Endpoints:
- mysql-0.mysql.default.svc.cluster.local
- mysql-1.mysql.default.svc.cluster.local
# DNS records created when clusters were linked
Headless Services:
- mysql-target1.default.svc.cluster.local
- mysql-target2.default.svc.cluster.local
Endpoints (one per cluster for simplicity):
- mysql-0.mysql-target1.default.svc.cluster.local
- mysql-1.mysql-target2.default.svc.cluster.local
"
This already lets me know that a consideration we should have made is not present in the proposal. Currently, DNS records are configured based on the mirror service name: `<pod-name>.<svc>-<link-name>.<namespace>.svc.cluster.local`. If we include a global service, we will need `<pod-name>-<link-name>`, since we can no longer guarantee that DNS records are unique.
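For example (hypothetical records): with two linked clusters that each run a `mysql-0` pod, `<pod-name>` alone would yield duplicate records under the global service:

```
mysql-0.mysql-global.default.svc.cluster.local   # from target1
mysql-0.mysql-global.default.svc.cluster.local   # from target2 -- collides
```

Suffixing the link name keeps the records unique:

```
mysql-0-target1.mysql-global.default.svc.cluster.local
mysql-0-target2.mysql-global.default.svc.cluster.local
```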
Let's remove the proposed solution present in the proposal you sent at first, per alpeb's suggestion.
```
dig mysql-global.default.svc.east-cluster.local

;; ANSWERS
bar-mysql-0-west    172.43.128.182
bar-mysql-1-west    172.43.224.245
zoo-mysql-0-north   172.43.225.74
zoo-mysql-1-north   172.43.128.182
woo-mysql-1-south   172.42.0.36
```
Consistent naming is key here to keep it simple. Let's eliminate cardinal points and let's keep it to mysql. There's no reason why an operator would need to give this service different names.
For your summary:
> To summarise, currently in east cluster, services from west, north, south are mirrored. Although they are like sisters, each service exists as an individual service. This could lead to a problem where foo-mysql will have to go through the process of joining each individual cluster, like foo-mysql join=bar-west,zoo-north,woo-south, which places a dependency ordering. It will also trigger redeployments. This might also lead to various issues as described in https://github.com/linkerd/linkerd2/issues/7566, like manual intervention from cluster operators and application owners. And also for each new (or removed) exported service, the StatefulSet has to be edited, and all of the workloads have to be redeployed.
We're not illustrating what the actual problem is. Being a bit more concrete about this example (maybe even providing a yaml snippet) is more helpful. Here's an example:
"In summary, if a statefulset workload is deployed in a source cluster, and it needs to discover replicas from other clusters, each cluster will need to be linked against the source. Each individual headless service will be mirrored in the source cluster. This leads to operational problems, where a statefulset needs to be rolled out and manually modified each time a new service is added. Consider the following example where a single target cluster is linked against source:
- name: mysql
args:
- join
- mysql,mysql-target1
Linking another cluster leads to a manual intervention to add the new service:
- name: mysql
args:
- join
- mysql,mysql-target1,mysql-target2
"
In the source cluster, we can have a controller/operator which keeps looking for new Services being added with the annotation `mirror.linkerd.io/mirrored-service: "true"`; this annotation is already placed by the Service Mirror component when it mirrors a service into the source cluster.
How will this be deployed? By whom? Will this controller monitor all clusters, or just one? If it monitors all, how will it get access to them? Is the plan here to read link resources, or?
As we get all the endpoints backing the service, we can use them to create an EndpointSlice. This EndpointSlice will back the Headless Service created with the `-global` suffix.
We only mention one EndpointSlice. Is the goal to maintain one EndpointSlice or many EndpointSlices?
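For illustration, here is a rough sketch of what a single aggregated slice could look like; the name, port, and hostnames below are assumptions rather than settled design:

```yaml
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  # Hypothetical generated name; the controller would own this object.
  name: mysql-global-aggregated
  namespace: default
  labels:
    # Standard label tying a slice to the service it backs.
    kubernetes.io/service-name: mysql-global
addressType: IPv4
ports:
- name: mysql        # assumed port
  port: 3306
  protocol: TCP
endpoints:
- addresses: ["172.43.128.182"]    # synthetic endpoint mirrored from target1
  hostname: mysql-0-target1
  conditions:
    ready: true
- addresses: ["172.43.224.245"]    # synthetic endpoint mirrored from target2
  hostname: mysql-0-target2
  conditions:
    ready: true
```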
Regarding the `--join` option: can we actually configure mysql to use two different services? I can't find anything in the bitnami chart. If we can't, then let's remove mentions of it. We should have a list of statefulset workloads that will be affected by this change.

re: @alpeb
> About the actual design, IIUC, there's gonna be a global service fronting all the mirrored services that have the same name (except for the target cluster name suffix), and requests to that global service would be round-robin'd across those mirrored services.
I think this is more for the workload to do discovery. There will be two services in total: the mirror service and the global service. The goal is to have a reliable way for workloads to do service discovery without restarting pods (or patching) whenever clusters are linked or unlinked.
> There should be one global service per each Stateful ordinal index. Like, for each index x, a service `mysql-x-global` fronting `mysql-x-north`, `mysql-x-south`, etc (I'm not sure if ordinal indices are mirrored like that, @mateiidavid should have a clear idea). Otherwise the stable network identifiers that StatefulSets are supposed to guarantee would no longer be guaranteed. It wasn't clear to me in your exposition if that would be the case.
This is a good point. I think the idea is to still create a ClusterIP for each host backing a service (`mysql-x-<cluster>`). We will need these for A records and to support actual pod-to-pod communication through the gateway. However, instead of having multiple headless services, we have only one. The goal is to have one service that can serve as our SRV and resolve to all other endpoints. Presumably, the statefulsets will use the hostname to resolve all the A records it points to.
```
# Proposed
mysql-global.svc.cluster.local will resolve to
- mysql-0-target1.mysql-global.svc.cluster.local
- mysql-1-target1.mysql-global.svc.cluster.local
- mysql-0-target2.mysql-global.svc.cluster.local
- mysql-1-target2.mysql-global.svc.cluster.local

# Current behaviour as of 2.13
mysql-target1.svc.cluster.local will resolve to
- mysql-0.mysql-target1.svc.cluster.local
- mysql-1.mysql-target1.svc.cluster.local
mysql-target2.svc.cluster.local will resolve to
- mysql-0.mysql-target2.svc.cluster.local
- mysql-1.mysql-target2.svc.cluster.local
```
> Looking forward to your thoughts (and diagrams! 😉 )
Diagrams would be 100% helpful here 👍🏻
Hello @mateiidavid, thanks a lot for all the comments. I am working on refactoring this proposal, but before I post the edited version I have a slight doubt.
> The goal is to have one service that can serve as our SRV and resolve to all other endpoints. Presumably, the statefulsets will use the hostname to resolve all A records it points to.
>
> ```
> # Proposed
> mysql-global.svc.cluster.local will resolve to
> - mysql-0-target1.mysql-global.svc.cluster.local
> - mysql-1-target1.mysql-global.svc.cluster.local
> - mysql-0-target2.mysql-global.svc.cluster.local
> - mysql-1-target2.mysql-global.svc.cluster.local
>
> # Current behaviour as of 2.13
> mysql-target1.svc.cluster.local will resolve to
> - mysql-0.mysql-target1.svc.cluster.local
> - mysql-1.mysql-target1.svc.cluster.local
> mysql-target2.svc.cluster.local will resolve to
> - mysql-0.mysql-target2.svc.cluster.local
> - mysql-1.mysql-target2.svc.cluster.local
> ```
I think I am a little confused here, as I was thinking that, after our discussions, we would have a parallel service named global running alongside all the other existing replicas. This will also help us with incremental rollout, even if we publish this as an extension.
```
# Proposed
mysql-global.svc.cluster.local will resolve to
- mysql-0-target1.mysql-global.svc.cluster.local
- mysql-1-target1.mysql-global.svc.cluster.local
- mysql-0-target2.mysql-global.svc.cluster.local
- mysql-1-target2.mysql-global.svc.cluster.local

# Existing mirror services keep working in parallel
mysql-target1.svc.cluster.local will resolve to
- mysql-0.mysql-target1.svc.cluster.local
- mysql-1.mysql-target1.svc.cluster.local
mysql-target2.svc.cluster.local will resolve to
- mysql-0.mysql-target2.svc.cluster.local
- mysql-1.mysql-target2.svc.cluster.local
```
I have tested this locally and I noticed that we can have two services pointing to the same endpoints. Maybe we can start with using only one EndpointSlice and, if required, then start separating it.
```
mysql-svc-global    10.43.26.159:80,10.43.128.231:80,10.43.164.5:80
mysql-svc-target1   10.43.26.159:80,10.43.128.231:80,10.43.164.5:80
```
I was thinking our controller will be deployed in each multi-linked cluster, i.e. the source cluster will have its own controller, as will each target. The controller deployed in the respective cluster will monitor all the local services (including mirrored services), maybe filtering on the label `mirror.linkerd.io/mirrored-service: "true"`.
I hope this also answers the question above:

> How will this be deployed? By whom? Will this controller monitor all clusters, or just one? If it monitors all, how will it get access to them? Is the plan here to read link resources, or?
@rushi47
I mentioned this in some DMs. It's important we don't hold any assumptions (or knowledge) in our brains when designing this. Any assumptions about this system and the conditions it is running in should be put in the proposal. For example, one contentious point: how will this discover changes in the remote cluster? Your plan to monitor local services is good, but we need to have it written down in the main proposal. The proposal does not mention any assumptions that the services will exist in parallel (i.e. a mirror service and a global service).
In my opinion discovery and deployment here will be closely related; we should also include a paragraph on how we intend to have this prototype deployed. If it's just a helm chart with an image, that's fine. If it's just a manifest, that's also fine. As long as we are explicit about ordering, dependencies, and all other assumptions.
> I have tested this locally and I noticed that we can have two services pointing to same endpoints
I appreciate you want to keep this easy, but I'm not sure it's a good approach here. In fact, it might complicate a little bit what we are trying to do. In the service mirror, for example, we keep track of endpointslice ownership through the service label. If we want to have multiple services own a single endpointslice object, we would need to use a tiered ownership reference object.
I think it's best if we keep those resources separate. Your idea of looking for annotations to filter services out is good. Once we filter services out, we can either create a separate slice (owned by the "global" service) or append to an existing one. I think it might be a bit clearer to separate slices here.
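To make the separate-slices option concrete, here is a hedged sketch (the generated names and addresses are assumptions) of how ownership stays unambiguous through the `kubernetes.io/service-name` label:

```yaml
# Slice owned by the mirror service (created by the service mirror)
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: mysql-target1-abc12          # hypothetical generated name
  labels:
    kubernetes.io/service-name: mysql-target1
addressType: IPv4
endpoints:
- addresses: ["10.43.26.159"]
  conditions:
    ready: true
---
# Separate slice owned by the global service (created by the new controller)
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: mysql-global-abc12           # hypothetical generated name
  labels:
    kubernetes.io/service-name: mysql-global
addressType: IPv4
endpoints:
- addresses: ["10.43.26.159"]
  conditions:
    ready: true
```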
Let's put all of the suggestions into the proposal, and maybe link the repo we're using for the prototype.
@mateiidavid & @alpeb Thank you for your comments. I have updated the proposal. It will be great help, if you could take a look.
Thanks @rushi47 but some parts have still not changed. See below:
> Currently, in mutlicluster discovery if there are multiple peers, when service gets mirrored from target cluster to source , for each peer mirror service is created. To elaborate, below is the example :
>
> We start with two clusters named east (source) and west (target). Let's assume that there is service name mysql existing in east and mysql existing in west. As we can imagine, services like mysql operate in cluster to provide high availablity, performance and failovers.

> Stateful applications (e.g. mysql, redis, and other distributed databases) perform service discovery on start-up

> Possible solution to this problem is, creating global service in each linked cluster
In each linked cluster or for each linked cluster? The difference is important.
Your yaml example in the development plan is inaccurate:
```yaml
endpoints:
- addresses:
  - "172.43.128.182"
  - "172.43.224.245"
  - "172.43.129.182"
  conditions:
    ready: true
  ...
```
The addresses need to contain a hostname in order for DNS to be set up.
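Something like the sketch below, with assumed hostnames; note that each endpoint entry carries its own hostname, so the three addresses would need to be split into separate endpoints:

```yaml
endpoints:
- addresses:
  - "172.43.128.182"
  hostname: mysql-0-target1   # assumed hostname; needed for the DNS record
  conditions:
    ready: true
- addresses:
  - "172.43.224.245"
  hostname: mysql-1-target1   # assumed hostname
  conditions:
    ready: true
```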
Language of choice: do we care about performance here? We also cannot make a blanket statement about one being faster than the other without comparing the two implementations (although it is true, Rust can fare better). I'm with you on the error front, I don't like Go's error mechanism much either, but I'm not sure that's a compelling enough reason to pick one over the other here, given that the service mirror is written in Go and a bunch of the code can be inspired from what we do there. Both are strongly typed. I think my preference is to do this in Go, since the client-go ecosystem is a bit more mature, and you already have experience with the language which fits our tighter deadline.
Finally, I'm not sure these diagrams are super helpful? I had in mind something more similar to what we have in our multicluster docs. @alpeb wdyt, do these help out to understand the problem?
...the rest of it looks good, nice job :)
Agreed with @mateiidavid, the idea of the diagrams is to present an architectural overview, clarifying the relationship between clusters and their services, and illustrating how connections would flow through.
Hey @mateiidavid and @alpeb, apologies for the to and fro. I have refactored the proposal again and tried to make it simpler. I have removed a lot of references pointing to any specific service and tried to keep the proposal generic. I also tried to refactor the diagrams and make them neater. I hope this will be helpful. Looking forward to your feedback.
Multi-cluster Discovery Proposal
LFX Mentorship proposal by Rushikesh Butley & Matei David
Problem :
Linkerd's multicluster extension allows operators to export headless services, which are primarily used for workloads deployed as StatefulSets (traditionally databases).
Working with a StatefulSet in complex cluster topologies (three or more clusters) is unwieldy. It requires constant manual intervention from cluster operators and application owners. For each new (or removed) exported service, the StatefulSet has to be edited, and all of the workloads have to be redeployed.
The above problem makes endpoint discovery for stateful workloads hard to deal with.
Goals & Constraints :
Below are the goals and some of the constraints which we have scoped out for this proposal.
Goals :
Constraints :
How should the problem be solved?
Design Exploration :
Background :
Currently in multicluster discovery, if there are multiple peers, a mirror service is created for each peer when a service gets mirrored from a target cluster to the source. To elaborate, below is an example:

We start with two clusters named `source` and `target`. Let's assume that there is a service named `foo` existing in both `source` and `target`, and it might need to discover all the other replicas existing in `target`.

As we can imagine, there are various applications which do DNS-based service discovery for various purposes, including but not limited to StatefulSets operating in high availability, e.g. Thanos, Nginx upstreams, etc.
Background for this proposal :
While this works fine for a cluster with one target, it becomes quite cumbersome when there is more than one target cluster. To solve this problem, we want a standalone service existing in the cluster which, when queried, can return all the Endpoints from its peered clusters.
Solution :
Global Service :
A possible solution to this problem is creating a global service in each linked cluster. This global service will act as an aggregator and will hold all the mirrored EndpointSlices. Below is an example to clarify it further.
Once the proposal is implemented, below will be the view for the source cluster, comparing the current release with the proposed solution.
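For illustration, assuming a `mysql` service and two linked target clusters:

```
# With the current release: one mirror service per link
mysql-target1.default.svc.cluster.local -> endpoints mirrored from target1
mysql-target2.default.svc.cluster.local -> endpoints mirrored from target2

# With the proposed solution: an additional global service aggregates them
mysql-global.default.svc.cluster.local  -> endpoints mirrored from target1 and target2
```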
Summary :
In summary, if a Statefulset workload is deployed in a source cluster, and it needs to discover replicas from other clusters, each cluster will need to be linked against the source. Each individual headless service will be mirrored in the source cluster. This leads to operational problems, where a statefulset needs to be rolled out and manually modified each time a new service is added. Consider the following example where a single target cluster is linked against source:
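```yaml
# StatefulSet container spec fragment (mysql example from the discussion above)
- name: mysql
  args:
  - join
  - mysql,mysql-target1
```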
Linking another cluster leads to a manual intervention to add the new service:
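```yaml
- name: mysql
  args:
  - join
  - mysql,mysql-target1,mysql-target2
```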
Implementation Ideas :
To achieve the above solution using EndpointSlices, we might start building on the service mirror component. We will deploy a controller in each cluster, alongside the current Linkerd services. The controller will keep watching all the local services inside the cluster.
To break it down into steps :
1. As we discussed above, we will create the new Service suffixed by `-global`.
2. In the source cluster, we can have a controller/operator which keeps looking for new `Service`s being added. To filter the services, the controller will filter on the label `mirror.linkerd.io/mirrored-service: "true"`; this label is already placed by the Service Mirror component when it mirrors a service into the source cluster. This will help us execute our logic whenever a new service with the respective label appears in our cluster.
3. Once our controller receives an event for this filter, it checks if a respective global service already exists for the mirrored service in our cluster, under the `-global` suffix.
   a) If it doesn't exist, we first create the `Headless Service` using our controller, naming it by snapping `-global` as a suffix onto the end of the mirror service name.
   b) If it exists, we skip creating the `Headless Service`.
4. To find the respective synthetic endpoints for a mirror service existing in our cluster, we query the K8s registry for the Service's respective `Endpoints`.
5. All the retrieved `Endpoints` for mirrored services will be used to create a new `EndpointSlice`, owned by the `-global` service. We will start with maintaining only one `EndpointSlice` and, if required, we can split/scale it later.

Development Plan along with Testing Strategy :
We can deliver the above feature in two phases.
Phase 1 :
In the first phase we can focus on writing a controller (or refactoring the Service Mirror) which looks for new Services being added with the label `mirror.linkerd.io/headless-mirror-svc-name`, and creates a `Headless Service` with the name suffixed with `-global`, derived from the initial service which is mirrored.

Testing Strategy : To test this phase, we can create a multicluster topology of 3-4 clusters and make sure that only one global service is created per cluster, one for each unique type of service being mirrored.
Phase 2 :
With the `Headless Service` ready from Phase 1, in this phase we create only one `EndpointSlice` (maybe we can scale it later if required). All the synthetic endpoints retrieved for the mirrored services will be added to this `EndpointSlice`.

Testing Strategy : Query the `Headless Service` created in Phase 1 and make sure that all the `EndpointSlices` being returned match against the synthetic endpoints created by the `Service Mirror`.
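For example, a possible manual check (hostnames reuse the mysql example from earlier):

```
# All synthetic endpoints mirrored from the linked clusters should appear
# behind the global service:
dig +short mysql-global.default.svc.cluster.local

# ...and their union should match the per-link mirror services:
dig +short mysql-target1.default.svc.cluster.local
dig +short mysql-target2.default.svc.cluster.local
```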
How will we run it? :
We will build Docker images of our controller and deploy these images in our multicluster environment using Helm charts. The controller will monitor all the local Services inside the respective cluster and take the required action when it encounters the label of interest.

For testing, we will simply run the controller against a local Kubernetes cluster with the default config.
Open Questions :
Even after creating this service discovery mechanism, we need to make sure that the consuming service is capable of handling multiple IPs returned by the Service endpoint.

From my initial assumptions, I don't think there will be any change needed in the proxy, as requests will still be handled in the same way they are currently handled.

Is it the best bet to rely on the Service Mirror for the base implementation and write this as a separate controller, or is it better to refactor the Service Mirror code? My thought is that we should go with the former.
Language of Choice
`Rust` instead of `GoLang`, as I think with Go there will be a lot more boilerplate code.

Would you like to work on this feature?
Yes.
We have started the initial development here: https://github.com/rushi47/service-mirror-prototype