xphoniex opened this issue 2 years ago
Hey @xphoniex, nice work :+1: I'm not the one who should give feedback on devops, but I wanted to ask if you could elaborate on which services you think should be spun up.
From the perspective of an org, I think the most interesting ones would certainly be the ones in the radicle-client-services repo.
EDIT: Sorry, my fault, I did not look at which repo the issue was located in.
I can imagine that there could be an essentials package, and then some optional ones. @cloudhead Could there eventually be an issue with having multiple services reading and eventually writing to the same monorepo? And if the services are distributed across different clusters, I can imagine that there could eventually be replication issues.
Yeah, it's implied here that it would be radicle-client-services.
The issue with the monorepo state already exists, since the client services all read from the same state. As long as there is a single writer, there should be no problem having multiple readers.
@xphoniex could you describe the topology of the cluster(s)? One small issue I could see is that we have both UDP and TCP services, and I know there is some limitation in k8s with having both on the same instance.
I'm also wondering where we keep track of the mappings between orgs and physical instances, i.e. IP addresses. Do we use DNS?
For instance, right now, each org points to a DNS name via its ENS records. This DNS name in turn points to a physical address. How do you imagine this could work in the above scenario?
> @xphoniex could you describe the topology of the cluster(s)? One small issue I could see is that we have both UDP and TCP services, and I know there is some limitation in k8s with having both on the same instance.
I think we're going to have to expose unique tcp/udp ports as NodePort per deployment, which opens these ports on all nodes and limits the number of deployments. Not ideal, but should be okay.
HTTP traffic would be handled using nginx, according to subdomain:
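Roughly, something like this per org (a Pulumi TypeScript sketch; the labels, ports and image name are placeholders, and the org-node Deployment itself is omitted):

```typescript
import * as k8s from "@pulumi/kubernetes";

const org = "x";                            // one Service/Ingress pair like this per org
const labels = { app: `org-node-${org}` };

// Unique p2p port exposed on every node of the cluster via NodePort.
// Without an explicit `nodePort`, k8s assigns one from its NodePort range;
// a TCP port can be added the same way if needed.
const svc = new k8s.core.v1.Service(`org-${org}`, {
  spec: {
    type: "NodePort",
    selector: labels,
    ports: [
      { name: "p2p", port: 5000, protocol: "UDP" },
      { name: "http", port: 5001 },
    ],
  },
});

// HTTP routed by subdomain through nginx (ingress-nginx assumed here).
new k8s.networking.v1.Ingress(`org-${org}`, {
  spec: {
    ingressClassName: "nginx",
    rules: [{
      host: `${org}.aws.monadic.xyz`,
      http: {
        paths: [{
          path: "/",
          pathType: "Prefix",
          backend: { service: { name: svc.metadata.name, port: { number: 5001 } } },
        }],
      },
    }],
  },
});
```

The NodePort is what limits the number of deployments per cluster; the Ingress host rule is the nginx-by-subdomain part.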
> I'm also wondering where we keep track of the mappings between orgs and physical instances, i.e. IP addresses. Do we use DNS?
> For instance, right now, each org points to a DNS name via its ENS records. This DNS name in turn points to a physical address. How do you imagine this could work in the above scenario?
E.g. we'll have a wildcard DNS record for *.aws.monadic.xyz pointing to our AWS LB, and org x.radicle.eth would point to x.aws.monadic.xyz. I suppose tcp/udp endpoints would also look like aws.monadic.xyz:5000 and aws.monadic.xyz:5000/5001.
Makes sense?
Hi @xphoniex :wave:. Nice work on the RFC.
I anticipate some problems with org-nodes living outside of Kubernetes trying to form a cluster with the ones living inside it, due to NATs. At least on GCE (I imagine it to be similar on other clouds), outbound Kubernetes connections are NAT'ed. The way to have inbound connections to the Kubernetes cluster is to set up a load balancer. The problem is the load balancer would need to (1) be QUIC-aware (is this supported by Kubernetes?) and (2) implement some sort of stickiness so that it routes requests concerning the same Peer ID to the same K8s pod and port.
Regarding state storage, am I correct in assuming the plan is to use persistent volumes?
Hi @adaszko :wave:
QUIC uses UDP under the hood; it's not listed here, but it is supported on Google Load Balancer. We might need to test it quickly before proceeding, I guess. (@cloudhead)
Would NAT still be an issue if each org-node has its own separate port? E.g. org x would always hit x.aws.monadic.xyz:5000.
> Regarding state storage, am I correct in assuming the plan is to use persistent volumes?
Yes.
> Hi @adaszko :wave:
> QUIC uses UDP under the hood; it's not listed here, but it is supported on Google Load Balancer. We might need to test it quickly before proceeding, I guess. (@cloudhead)
Such a test would be nice :)
> Would NAT still be an issue if each org-node has its own separate port? E.g. org x would always hit x.aws.monadic.xyz:5000.
Let's take the case of an outside org-node making 2 requests to an inside one. Even if we have distinct ports for every org-node, the load balancer can direct the 2nd request to a different node than the 1st one unless we set some session affinity (respective functionalities in other clouds need to be researched). Out of the session affinity types listed on the linked website, the most apt seems to be the one based on HTTP headers. The target PeerId value would have to be added to some preordained header like Radicle-PeerId: .... Even then, session affinity is still only best effort, according to the documentation.
It's not a trivial issue, unless of course the librad protocol implementation is so resilient that it can handle (1) violation of the assumption that 2 consecutive requests of any type addressed to the same (DNS name, port) pair actually reach the same node (that's the load balancer issue) and (2) IP addresses of nodes changing at unpredictable times due to Kubernetes rebalancing to different pods and/or self-healing (DNS names and ports stay the same, though). If the implementation is in fact so resilient that we have (1) and (2), we can basically go ahead full steam with the K8s deployment. These 2 are a pretty tall order, though, and we need to be aware that the whole cluster will perform worse (more errors, retries, higher latency) even if the implementation handles (1) and (2) 100% correctly.
@kim @FintanH I'm curious what you guys think, especially about the last paragraph from the protocol implementation standpoint.
> Let's take the case of an outside org-node making 2 requests to an inside one. Even if we have distinct ports for every org-node, the load balancer can direct the 2nd request to a different node than the 1st one unless we set some session affinity
I'm not so sure about this. We're gonna have a single deployment/service per org, so even if it hits another node, the packet will always be redirected using iptables rules back to the intended node where the deployment lives. No?
> > Let's take the case of an outside org-node making 2 requests to an inside one. Even if we have distinct ports for every org-node, the load balancer can direct the 2nd request to a different node than the 1st one unless we set some session affinity
>
> I'm not so sure about this. We're gonna have a single deployment/service per org, so even if it hits another node, the packet will always be redirected using iptables rules back to the intended node where the deployment lives. No?
I'm still talking about the case of org-nodes living outside of Kubernetes trying to connect to the ones living inside it. Radicle is a peer-to-peer app (i.e. not a SaaS), so I think it's fair to say the nodes can connect from anywhere.
It is a bit unclear what you're trying to achieve here. I am assuming that you are aware that radicle-link is a peer-to-peer protocol, and thus it is nonsensical to try and cluster / load-balance individual nodes (if not, I'm happy to explain).
You can surely use k8s to spin up singleton instances of a node (like a database; iirc StatefulSet is the thing to use for that). However, it is most likely the case that you would need to employ custom SDN, or else use NodePort and public IP addresses to make nodes able to communicate.
If your goal is to only cluster the HTTP/git interfaces of an org-node for availability reasons, you may be able to do that by mounting a shared volume containing the state. Since network-attached storage is both slow and does not necessarily exhibit POSIX semantics (O_EXCL specifically), I would recommend mounting read-only.
For the latter to work, you would obviously need to be able to run those endpoints standalone, ie. without spinning up the p2p stack.
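To illustrate, the read-only variant could look roughly like this (a Pulumi TypeScript sketch; the PVC name, image and mount path are made up, and the shared volume is assumed to be backed by something like NFS):

```typescript
import * as k8s from "@pulumi/kubernetes";

// Several replicas of just the HTTP/git endpoints, all mounting the same
// monorepo state read-only from a shared PersistentVolumeClaim.
new k8s.apps.v1.Deployment("org-x-http", {
  spec: {
    replicas: 3,
    selector: { matchLabels: { app: "org-x-http" } },
    template: {
      metadata: { labels: { app: "org-x-http" } },
      spec: {
        containers: [{
          name: "http-server",
          image: "radicle-http-api:latest",   // placeholder image
          ports: [{ containerPort: 5001 }],
          volumeMounts: [{
            name: "monorepo",
            mountPath: "/var/lib/radicle",    // placeholder path
            readOnly: true,
          }],
        }],
        volumes: [{
          name: "monorepo",
          persistentVolumeClaim: { claimName: "org-x-monorepo", readOnly: true },
        }],
      },
    },
  },
});
```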
So as I understand the discussion, the LB is not really used to "balance load" between homogeneous backends, but rather just to route traffic? Or did I get this wrong?
In the simplest case, each org/user has 1 replica, and the cluster is heterogeneous, ie. the nodes are not interchangeable.
In the more advanced case, an org may want to deploy multiple replicas. Using a load balancer in front of those nodes would make sense in case we're worried that one of the nodes goes down, but I think we may find that having the clients directly connect to the individual replicas simplifies things, since this is supported by the protocol.
Just to be clear, the reason for choosing k8s here is that it standardizes our lifecycle management. Since it complicates networking for us, we have two options:
Sounds like the second option would be more appropriate at this point. Any comments?
Note: there are some limitations that apply to the VPS route. DigitalOcean, for example, doesn't allow more than 10 droplets unless you talk to support; AWS is at 20. We also need to be careful we don't hit a hard limit on our DNS provider, as we're setting a new record per server.
You can use k8s' LoadBalancer concept as a NAT device to translate to a 1-replica StatefulSet. Whether that's cost-efficient depends on your pain tolerance, and whether you are able to significantly overcommit (ie. most nodes are idle most of the time). Note that you need one external IP per node, unless clients can address using port numbers.
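Concretely, that could amount to one LoadBalancer Service per org-node, something like this (a Pulumi TypeScript sketch; the name and port are placeholders, and the 1-replica StatefulSet behind it is omitted):

```typescript
import * as k8s from "@pulumi/kubernetes";

// One external IP per org-node: a LoadBalancer Service used purely as a NAT
// device in front of a single-replica StatefulSet (not shown) with matching labels.
new k8s.core.v1.Service("org-x-node", {
  spec: {
    type: "LoadBalancer",
    selector: { app: "org-x-node" },
    ports: [{ name: "p2p", port: 5000, protocol: "UDP" }],
  },
});
```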
You can not expect any kind of transparent mapping of a single address to multiple, independent nodes to work as you'd expect. Even if you do that for just the HTTP endpoints and use session affinity, it will probably not yield the web experience you're after, because replication is by design asynchronous. I'm not sure how important this is, though, as long as the node gets restarted automatically if something goes wrong.
I get that what you want is essentially "virtual hosts", but I'm afraid this won't be possible on the p2p layer until HTTP/3 gets standardised. You could consider creating extra SRV or TXT records which would allow a p2p node to discover an IP:port pair (and possibly the peer id, too), but I don't think there's anything off-the-shelf which would automate this on k8s.
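For example (hypothetical names and ports), a record along the lines of _radicle-link._udp.x.aws.monadic.xyz. 300 IN SRV 0 0 5000 x.aws.monadic.xyz. would let a client discover both the port and the target host for org x, rather than relying on a fixed port implied by the A record alone.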
We can bypass the LoadBalancer NAT, we just need to assign every node its own external IP. Still, I'm perfectly fine not taking the k8s route, and building it this way:
Directly order a VPS from providers like DigitalOcean/Hetzner, do the initial setup using Ansible, set DNS records (on our own domains, pointing the org to the new IP), and keep state in our own DB.
The issue with this is that I'll end up writing some glue scripts to tie everything together, and we can't use the idle resources anymore, as kim mentioned.
If no one has any objections to the design, I can start prototyping with pulumi.
The issue is not the NATing per se (well, fingers crossed), but that you need one LB (with one public IP) per instance if you want a fixed port (ie. resolve using only the A record). Or maybe two, to support both TCP and UDP traffic.
If you have a way to assign a fixed port per instance, and clients can discover that, you could potentially have a single LB route all HTTP traffic, and another one route all QUIC traffic.
For that case we won't be using a fixed port; each instance/org will have a separate port.
> For that case we won't be using a fixed port; each instance/org will have a separate port.
Then how do you make that work in the browser?
HTTP traffic is simple: it's not p2p and would come through the LB. We just need to update the ingress rules with our controller.
Ok, I guess I don't know what you're talking about then. Good luck! :)
Does this help?
outside org-node (192.x.x.x:8776)
   |
   |  p2p peering: 192.x.x.x:8776 <-> 200.x.x.1:5000 (for org x),
   |  i.e. directly to the node's public IP and per-org port
   v
k8s cluster
 |- Node#1 (public IP 200.x.x.1)
 |    |- org-node     :5000   (org x, p2p)
 |    |- http-server  :5001   (org x)
 |    `- nginx:  x.radicle.eth -> x.svc:5001
 |               y.radicle.eth -> y.svc:5003
 |- Node#2 (public IP 200.x.x.2)
 |    |- org-node     :5002   (org y, p2p)
 |    |- http-server  :5003   (org y)
 |    `- nginx:  x.radicle.eth -> x.svc:5001
 |               y.radicle.eth -> y.svc:5003
 `- LoadBalancer: HTTP traffic enters here and is fanned out to the nginx on
    either node; nginx then forwards it to the right http-server via that
    org's Service, regardless of which node the request landed on.
That makes sense, if the org-node port is unique per radicle.eth CNAME (assuming that's the p2p port). That is, x.radicle.eth has a different port than y.radicle.eth, and the client has a way to discover that.
There is no Host header on the p2p layer, and even if there was one, you could not make use of it for routing unless the proxy servers know the private keys of every org-node behind them.
Problem
We want to give interested parties a chance to try out Radicle without having to take care of their own infrastructure. The goal is to introduce a low-friction solution which is also reliable.
Proposal
This is not Radicle's core offering and we'd even encourage competition in this space. Thus our design should be transferable and as plug-and-play as possible.
1. We'll have an entrance contract that lists all contracts offering their service. Decisions would be made based on the number of subscribers to each contract and the price each one is asking.
2. Upon deciding to purchase, the user sends money to either topUp(address org) or registerOrgThenTopUp(). We might need a conversion from ether to stablecoin here, to simplify financing for service providers who have obligations in fiat.
3. This will eventually emit a NewTopUp event containing the org address and probably more info like the expiry block. (After talking with Alexis, we decided to keep accounting in block terms on contracts.)
4. Inside each k8s cluster, which ideally lives on a different cloud, we'll have a controller watching NewTopUp events for their respective contract. On new events, we create a Deployment and Service for this new org, with the needed containers inside. If the org already exists, we simply change the expiry block, without affecting anything else. (See the sketch below.)
5. We'll use IaC (Infrastructure as Code), with Terraform managing the cloud resources for us, so a potential third party can offer an alternative once they clone our infra code and fill in their own cloud keys.
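The event-watching side of that controller could be sketched roughly like this in TypeScript with ethers v5 (the RPC endpoint, contract address and exact event signature are assumptions; only the org address and expiry block are mentioned above):

```typescript
import { ethers } from "ethers";

// Assumed RPC endpoint, contract address and event signature; the actual
// contract is not specified in this issue.
const provider = new ethers.providers.JsonRpcProvider("https://rpc.example.org");
const serviceContract = new ethers.Contract(
  "0x0000000000000000000000000000000000000000", // the provider's service contract
  ["event NewTopUp(address indexed org, uint256 expiryBlock)"],
  provider,
);

serviceContract.on("NewTopUp", (org: string, expiryBlock: ethers.BigNumber) => {
  // Create the per-org Deployment + Service if they don't exist yet,
  // otherwise just update the stored expiry block.
  console.log(`top-up for org ${org}, expires at block ${expiryBlock.toString()}`);
});
```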
Issues
- We are relying on the major clouds (AWS, GCP and Azure), which are in the same jurisdiction. Others lack support in our automation tooling because of poor APIs or not enough interest from the community/maintainers.
- GeoDNS. Our p2p system, as is, can't optimize for latency-based routing. I think this needs to be solved at the protocol level so we can ideally have two machines representing the same org-node, ideally in a write-write capacity, but if not, write-read.
- High availability. Same as above.
- Durability. Data can get lost; while in the worst-case scenario data can be partially or fully recovered by connecting to users' p2p nodes, having an HA solution would make our system more robust.