Discuss how to avoid single point of failure at relay

scottyeager commented 1 year ago

While the relay based RMB provides a much better experience than its Yggdrasil based predecessor in day to day operation, it introduced an effective single point of failure for each RMB participant: the relay. The impact of this was seen recently when relay.grid.tf went offline for some time. It became impossible to deploy workloads on any node on the Grid or carry out other RMB dependent functions. This issue is for discussion of potential ways to avoid having the relay be a single point of failure.

RMB participants are free to choose a different relay, and one potential approach is to say that it is the responsibility of downstream software to handle failover if desired. I present that as one option below, along with a couple other ideas:

Software using RMB is responsible to monitor the liveness of the relay it uses and switch relays as needed by updating their relay entry on TF Chain and connecting to the other relay
An optional fallback relay could be configured in an additional slot added to TF Chain and passed to RMB clients. Rerouting messages in the case that a participant's primary relay goes down is handled automatically
Allow simultaneous use of multiple relays. As long as messages can be uniquely identified, duplicates can be discarded. Since a client doesn't necessarily connect directly to the recipient's relays, fanning messages out would happen at the federation stage

Of course, each potential route comes with downsides, namely additional complexity and additional network traffic. Giving the system an inherent, if optional, way to achieve greater resilience seems worthwhile to me, but at the least we should establish a stance on this to inform the development of systems that rely on RMB.

muhamadazmy commented 1 year ago

Hi @scottyeager

it might not be known, but RMB (and the relay) supports exactly this and also support federation. In other words peers can use relay of choice AND still be able to talk to each other even if they are connected to different relays via some mean of federation similar to how matrix workds.

We don't have a mechanism to allow the farmer yet to choose which relay (usually the closest) to him to use yet but it should be easy to implement, and fall back to the default relay

So federation in RMB works as follows:

You connect to your relay of choice
- The peer need to make sure his relay information on his twin object is correct and updated to the right relay (we will call it R0)
Once connected, the peer when sending a message to another peer, he already knows the twin information (he has to know to do e2e) which means he already knows what is the relay config for the other peer (say R1)
The peer attach federation information to the message and send it normally to his relay R0
R0 will receive the message and already can tell that the message is distant to R1, so instead of queuing it to deliver to local peer, it forward it to R1
R1 then can simply queue the message locally until the receiving peer is available to receive the message.

Also note that, the relay is designed to be stateless as a service while it of course keeps a queue of messages per twin in a redis cluster, we can run multiple relay instance next to each other (so relay.grid.tf can in face have multiple processes running behind it instead of one as long as they are using the same redis cluster backend) which allows the relay to handle many many connections at the same time.

Right now we run a single instance of the relay on top of (as far as I know) a single redis instance. this setup already can take up (hard limit in config, to 500K connections)

We already can improve by running multiple relay instances behind relay.grid.tf AND we can also then start creating regional relays.

My plan was (but is totally up to the operation to execute) is to run regional relays. Say:

europe.relay.grid.tf
africa.relay.grid.tf
etc...

This can even go for smaller regions (east, west, etc...)

Then famers can then choose which region they should go to, or just use local information from the nodes to select which relay they need to connect to.

Once they are connected, federation will be effective, and a blackout in one region will not bring the entire grid down.

scottyeager commented 1 year ago

So I am aware of how federation is possible with RMB and see how this can help with preventing a global point of failure at the relay. That will certainly be important for scaling out, especially beyond a single grid.

The part I didn't consider is how a relay can already be scaled. If I'm understanding correctly, a way to introduce redundancy into a single relay is like this:

Scale the single redis instance to a multinode setup (Redis Sentinel or Cluster)
Spin up additional relay processes that connect to the same redis
For a multi site setup use multiple DNS entries for relay.grid.tf, to give basic load balancing and failover

In that case, it would just be an operational task to deploy such a setup, and no further changes would be required in code or client config.

muhamadazmy commented 1 year ago

Draft proposal here #141

xmonader commented 1 year ago

https://github.com/threefoldtech/rmb-rs/issues/141

threefoldtech / rmb-rs

Discuss how to avoid single point of failure at relay #134