rqlite / helm-charts

Helm charts for rqlite
MIT License

k8s multiple geolocation regional cluster configuration #20

Closed: lmtyler closed this issue 3 months ago

lmtyler commented 3 months ago

Looking for direction or an example of how to use the Helm chart to create a single rqlite cluster whose nodes run across multiple geolocated regions; specifically, how to configure the pods to discover the pods in other regions.

[diagram: rqlite multi cluster]

otoolep commented 3 months ago

@jtackaberry -- not sure if you know, but the request is the following. Given a Kubernetes deployment across, say, 3 AWS Availability Zones, how can we ensure that Kubernetes will deploy the 3 rqlite nodes such that there is one node in each AWS Availability Zone?

otoolep commented 3 months ago

FWIW, here is what GPT-4 says to do (I don't know if this is correct, but it might supply some leads):

https://chat.openai.com/share/ae21ced0-e889-491f-b594-ef2b42c2d62d

otoolep commented 3 months ago

Corrected my first comment above.

lmtyler commented 3 months ago

Yes, deploying across multiple availability zones in a single region is pretty straightforward by using taints, affinities, and anti-affinities during the pod scheduling process. I do that pretty commonly already.

I can configure the deployments/monitoring to ensure that there is at least one pod in each geolocated region.

My question is more specific: how do I tell the rqlite Helm chart, when deploying p1 in r1, to discover p2 in r2 and p3 in r3?
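
For reference, the in-region zone spreading described above is typically expressed with `topologySpreadConstraints` on the pod template. A minimal sketch (all names are illustrative, not the chart's actual values):

```yaml
# Sketch only: spread a 3-replica rqlite StatefulSet evenly across
# availability zones within one region. Names are placeholders.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rqlite
spec:
  replicas: 3
  serviceName: rqlite-headless
  selector:
    matchLabels:
      app.kubernetes.io/name: rqlite
  template:
    metadata:
      labels:
        app.kubernetes.io/name: rqlite
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: rqlite
      containers:
        - name: rqlite
          image: rqlite/rqlite
```

This solves scheduling within one K8s cluster only; it does not address the cross-region discovery question.
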

otoolep commented 3 months ago

What do you mean by "discover"? I don't follow that part. Do you mean "how do I get the 3 rqlite nodes, one in each zone, to form a single cluster?"

lmtyler commented 3 months ago

Yes, how to get each pod in a different region (East/West/SouthCentral) to know/find/discover the others so they form a single Raft cluster.

otoolep commented 3 months ago

Yes, how to get each pod in a different region (East/West/SouthCentral) to know/find/discover the others so they form a single Raft cluster.

Why is it any different than normal clustering? Set up DNS as per the Kubernetes guide, and use DNS-based clustering.

https://rqlite.io/docs/clustering/automatic-clustering/#using-dns-for-bootstrapping

The Kubernetes guide on rqlite.io (as well as the Helm charts) use DNS-based bootstrapping. As long as the nodes come up with contactable network addresses, and are in the headless service, what's the issue?
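
For reference, the DNS-based bootstrapping described in that guide boils down to a launch command along these lines (the service name, ports, and data directory are placeholders):

```shell
# Sketch of rqlite's DNS-based bootstrapping, per the linked docs.
# "rqlite-headless" stands in for the headless service name; each node
# waits until the name resolves to at least -bootstrap-expect addresses.
rqlited -http-addr=$POD_IP:4001 -raft-addr=$POD_IP:4002 \
  -disco-mode=dns -disco-config='{"name":"rqlite-headless"}' \
  -bootstrap-expect 3 data
```

Note this takes a single DNS name, which is why it works within one K8s cluster but not across several.
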

jtackaberry commented 3 months ago

The different regions are presumably running their own independent K8s clusters? Unless you're self-hosting anyway, as I'm not aware of any cloud provider that offers managed K8s that spans multiple regions.

Assuming separate clusters, the chart definitely doesn't support this right now. It's currently quite opinionated about service discovery, relying on a K8s headless service for discovery by DNS, which of course will only work within the boundary of one cluster.

Let's set aside the chart and its opinions for the moment and just talk straight K8s: how do you see this working, @lmtyler? I assume your CNI driver allows direct pod-to-pod communication between clusters (e.g. EKS VPC CNI), which would be a prerequisite to a single rqlite cluster spanning multiple K8s clusters. How do you envision service discovery working?

Service discovery seems like the hard part here (or at least the part whose options I'm less experienced with). Something like AWS's CloudMap MCS controller perhaps? (Although based on your example regions I'm guessing you're in Azure?)

lmtyler commented 3 months ago

@jtackaberry yes, each region is an independent k8s cluster, running on Azure. But no, they do not allow direct cluster-to-cluster communication without creating a load balancer service with a fixed IP. Taking into account that each cluster is different, I am thinking I should be able to configure a load balancer on the edge of each cluster and give that IP a DNS entry.

From there, I believe I should be able to create a ConfigMap specific to each region: three distinct deployments of the Helm chart, configured so that r1 knows about r2/r3, r2 about r1/r3, and finally r3 about r1/r2.

Additionally, I am really hoping that if I can bootstrap the above alongside a standard 3-node StatefulSet cluster inside each individual region, then each individual region would see itself as a 5-node cluster.

Now if your Helm chart could do this, it would make rqlite the only open source RDBMS I am aware of that allows for active/active HA redundancy across multiple regions 💯 😃

jtackaberry commented 3 months ago

I thought rqlite (or perhaps it's more accurate to say raft) requires each individual node in the cluster to be addressable by all other nodes in the cluster. @otoolep is it possible to assemble a cluster with a LB in front of each region's nodes in the way @lmtyler described?

otoolep commented 3 months ago

I thought rqlite (or perhaps it's more accurate to say raft) requires each individual node in the cluster to be addressable by all other nodes in the cluster.

Yes it does. Any node (leaving aside read-only nodes) could become leader at any time, and then needs to heartbeat directly with all other nodes.

rqlite does not support multi-cluster configurations (there is no such thing in rqlite, there is no meaning to the statement "rqlite cluster 1 talking to cluster 2"). Either you have a single cluster, or you have a single node. And a single node is still a cluster -- it just has one node in its internal config, and that node is (obviously) the leader.

otoolep commented 3 months ago

As for the original diagram above: rqlite has no concept corresponding to that. It's not really clear what the goal is here.

Is it to create a 3-node cluster, where each node in the cluster is in a different failure domain (usually an AWS Availability Zone in a given AWS region)?

To be clear, you can build an rqlite cluster that spans regions too -- but you may need to work through firewalls and so on. See this rather old, but still reasonably correct, blog post: https://www.philipotoole.com/rqlite-v3-0-1-globally-replicating-sqlite/

jtackaberry commented 3 months ago

Thanks @otoolep, all consistent with my understanding.

Originally I interpreted the goal to be a single rqlite cluster spanning multiple regions, where the rqlite nodes in each region were housed within their own K8s cluster. So 3 regions, 1 K8s cluster per region, 3 rqlite nodes per region (housed within the region's K8s cluster), all composing 1 rqlite cluster.

That's why I had stipulated a K8s CNI plugin that allows pods to be directly addressable from outside the K8s cluster. It's been a while since I've used AKS (ironically, even though I cut my K8s teeth on AKS, I'm much more familiar with EKS these days), but reading about this now it does look like AKS has this capability nowadays. The Azure CNI (in contrast to Kubenet which is what I recall using many years ago) seems to work like the EKS VPC CNI in which AKS pods are given IPs from the VNET's subnet the pod's K8s node resides in. This allows things outside the K8s cluster to address the pod.

So in principle with the Azure CNI plugin, a multi-region rqlite cluster could work from a connectivity perspective. It's the service discovery that's the challenging part. Pods can talk to each other, but K8s services are still in-cluster constructs. So the DNS-based headless service approach the chart currently uses wouldn't work.

But rqlite also supports Consul or etcd as a discovery mechanism. The chart could be updated easily enough to support user-defined discovery configuration that overrides the chart's default headless-service approach. This would however require running a separate Consul or etcd cluster, which is admittedly a bit of an annoying barrier, but at least technically possible.
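
As a rough illustration of the Consul route (the Consul address is a placeholder; flags as documented in rqlite's automatic clustering guide):

```shell
# Sketch: bootstrapping against an external Consul cluster instead of
# K8s DNS, so discovery is not tied to any one K8s cluster's services.
# consul.internal.example:8500 is a placeholder address.
rqlited -http-addr=$POD_IP:4001 -raft-addr=$POD_IP:4002 \
  -disco-mode=consul-kv \
  -disco-config='{"address":"consul.internal.example:8500"}' \
  data
```

Because every region's nodes would register in the same external Consul, this sidesteps the per-cluster DNS boundary, at the cost of operating Consul itself.
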

Here's an idea -- just thinking out loud here -- one could use external-dns to register the pod IPs in an Azure Private DNS zone. Then rqlite's native DNS discovery could be used.

For this to work, rqlite would need to support multiple DNS names for discovery, because each region would need its own DNS name. @otoolep is that supported today?

otoolep commented 3 months ago

For this to work, rqlite would need to support multiple DNS names for discovery, because each region would need its own DNS name. @otoolep is that supported today?

It's not, but I am sure I could add it if it made sense. Can you be more explicit on how it would work? rqlite would resolve all the hostnames passed in (as opposed to just resolving 1 today), and pass those to the bootstrapper? E.g. the launch config would be the following:

```shell
rqlited -node-id 1 -http-addr=$HOST1:4001 -raft-addr=$HOST1:4002 \
  -disco-mode=dns -disco-config='{"name":["rqlite-1.cluster", "rqlite-2.cluster"]}' \
  -bootstrap-expect 3 data
```

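
To make the proposed behaviour concrete, here is a hedged sketch (in Python, purely illustrative; rqlite itself is written in Go) of resolving every name in the disco config and merging the unique addresses before handing them to the bootstrapper:

```python
import socket

def resolve_disco_names(names, port):
    """Resolve each DNS name and merge the unique IPv4 addresses,
    mimicking the proposed multi-name bootstrap behaviour.
    Sketch only; not rqlite's actual implementation."""
    addrs = set()
    for name in names:
        try:
            _, _, ips = socket.gethostbyname_ex(name)
        except socket.gaierror:
            # A region's records may not exist yet; skip and retry later.
            continue
        addrs.update(ips)
    return sorted(f"{ip}:{port}" for ip in addrs)

# With ["rqlite-1.cluster", "rqlite-2.cluster"], bootstrapping would
# proceed once the merged address count reaches -bootstrap-expect.
```
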
lmtyler commented 3 months ago

@otoolep for true real-world use, I think supporting multiple DNS names would be a requirement.

That said, for my specific needs, all the public IPs on the edge of the clusters would be in the same DNS zone.

East2/West2/SouthCentral would all be under *.foo.bar.com behind an internal DMZ. So if I am following the blog post and the other comments correctly, then as long as rqlite supports one external DNS name plus the cluster-internal DNS, I believe that would cover my current needs.

otoolep commented 3 months ago

@otoolep for true real-world use, I think supporting multiple DNS names would be a requirement.

How is an rqlite node supposed to operate if only one of the addresses resolves? Does it just take those addresses and keep going? I don't understand the use case here; maybe I'm missing something.

jtackaberry commented 3 months ago

@lmtyler

That said, for my specific needs, all the public IPs on the edge of the clusters would be in the same DNS zone.

It sounds like there may still be some confusion about what rqlite needs. Remember from above:

Any node (leaving aside read-only nodes) could become leader at anytime, and then need to heartbeat directly with all other nodes.

So there's no "edge of the cluster" allowed here. You can't front the regions with a load balancer. Full global mesh is needed.

Full mesh connectivity should be possible with the Azure CNI (and obviously VNET peering between the regions). The question then becomes one of service discovery: how can rqlite discover the IPs of its peers in both the local region, and all the remote regions.

If the external-dns project does allow registering pod IPs into Azure Private DNS -- and I haven't tried this myself, it just seems like it might be possible based on some limited perusal of GitHub issues/PRs -- then it may be a fairly straightforward tweak to rqlite to support this configuration by allowing multiple DNS names in its native DNS-based discovery.

@otoolep

Can you be more explicit on how it would work? rqlite would resolve all the hostnames passed in (as opposed to just resolving 1 today), and pass those to the bootstrapper?

Exactly, yes.

I could actually prototype this multi-K8s-cluster rqlite cluster in my home lab. My network supports direct pod routing outside the K8s cluster, and I can forward DNS requests under a given subdomain to one or the other cluster, which would enable node discovery between K8s clusters. It would validate the idea at least.

That said, it does seem to me like a multi-region global cluster is a pretty fringe use case, no? I mean the latency for writes, or for reads at anything other than "none" consistency, would have to be pretty brutal for the remote regions.

I've only run multi-continent databases using Cassandra, which has an explicit notion of what's local and what's remote. (But of course Cassandra's not relational.)

otoolep commented 3 months ago

That said, it does seem to me like a multi-region global cluster is a pretty fringe use case, no? I mean the latency for writes, or for reads at anything other than "none" consistency, would have to be pretty brutal for the remote regions.

Yes, it will be quite slow. See https://www.philipotoole.com/rqlite-v3-0-1-globally-replicating-sqlite/ (which I linked earlier). I deployed this to show it was possible in principle, but only got single-digit writes per second. Of course, if those writes contain a large amount of data in each request, you can still move a fair amount of data per second. But the request rate per second is very low.
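
The single-digit figure is easy to sanity-check with back-of-envelope arithmetic. Assuming (hypothetically) a ~150 ms cross-continent round-trip time, and that each sequential write needs at least one leader-to-quorum round trip to commit:

```python
# Back-of-envelope: why multi-continent Raft writes are slow.
# Assumed cross-continent RTT of 150 ms; each sequential write costs
# at least one commit round trip from leader to a remote quorum member.
rtt_s = 0.150
writes_per_sec = 1 / rtt_s
print(f"~{writes_per_sec:.1f} sequential writes/sec")
```

That works out to roughly 6-7 sequential writes per second, consistent with the single-digit rate reported above. Batching many statements per request is what lets the data rate stay reasonable despite the low request rate.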

lmtyler commented 3 months ago

Thanks @otoolep @jtackaberry, all great info. It is really close, but does not seem to fit my use case.