threefoldtech / rmb-rs

RMB implementation in rust
Apache License 2.0

RMB redundancy proposal #141

Closed muhamadazmy closed 8 months ago

muhamadazmy commented 11 months ago

Related to #140

Redundancy is implemented for rmb by means of federation. This simply means that when you connect to the rmb relay of your choice, you make sure your public information on tfchain states which rmb server you are connected to.

Right now, a client includes federation information in each message it sends. This helps the rmb relay decide whether it needs to deliver the message to one of its directly connected clients or forward it to a remote relay.
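The relay-side routing decision described above can be sketched roughly as follows. The `Envelope` and `Route` types and their field names are illustrative assumptions, not the actual rmb-rs types:

```rust
// Hypothetical sketch of the relay-side routing decision: deliver locally
// if the envelope's federation info is absent or names this relay,
// otherwise forward to the remote relay it names.

#[derive(Debug, Clone, PartialEq)]
struct Envelope {
    destination_twin: u32,
    // Federation info set by the sending client: the domain of the
    // relay the destination twin is connected to (per tfchain).
    federation: Option<String>,
}

/// Where the relay should push an incoming message.
#[derive(Debug, PartialEq)]
enum Route {
    /// Destination twin is connected to this relay: deliver locally.
    Local,
    /// Destination twin lives behind another relay: forward there.
    Federation(String),
}

fn route(env: &Envelope, own_domain: &str) -> Route {
    match &env.federation {
        // No federation info, or it names this relay: local delivery.
        None => Route::Local,
        Some(domain) if domain == own_domain => Route::Local,
        // Otherwise push to the remote relay named in the envelope.
        Some(domain) => Route::Federation(domain.clone()),
    }
}
```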

This design has two main flaws:

In summary, the problem is that changes in peer information do not propagate fast enough to allow a seamless connection with remote peers when a relay changes.

Proposal:

This way, clients can send messages without needing to know where the remote peer lives. Relays have fresh information about the twins, updated in almost real time, and can then decide where to push each message.

IMPORTANT NOTE:

The local peer will still need to get the remote peer's information to use its public key (PK) for end-to-end encryption. This is why we thought that, since we fetch this information anyway, we could also include the federation information in the message, so the relay has to do nothing at all regarding where to push the message.

The client will still have to fetch and cache twin information locally, but the PK almost never changes after generation, which means clients can cache it for longer, and the relay will then figure out where to send the message as explained above.
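A minimal sketch of such a client-side cache, assuming a simple TTL policy with a long TTL since the PK rarely changes (the types and names here are hypothetical, not the actual rmb-peer implementation):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical client-side cache: since a twin's public key almost never
// changes after generation, it can be cached for a long time. Under this
// proposal the client no longer needs the twin's relay info at all.
struct CachedTwin {
    public_key: Vec<u8>,
    fetched_at: Instant,
}

struct TwinCache {
    ttl: Duration,
    entries: HashMap<u32, CachedTwin>,
}

impl TwinCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Return the cached public key if it is still fresh.
    fn get(&self, twin_id: u32) -> Option<&[u8]> {
        self.entries
            .get(&twin_id)
            .filter(|e| e.fetched_at.elapsed() < self.ttl)
            .map(|e| e.public_key.as_slice())
    }

    /// Store a key fetched from tfchain.
    fn put(&mut self, twin_id: u32, public_key: Vec<u8>) {
        self.entries
            .insert(twin_id, CachedTwin { public_key, fetched_at: Instant::now() });
    }
}
```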

despiegk commented 11 months ago

found some issues, good idea, but I am afraid we're not there yet

muhamadazmy commented 11 months ago

Thabet told me about JSON vs protobuf. I wanted to state that clients that rely on rmb-peer as their agent use the full JSON protocol (which is local, over local Redis). But for several other reasons, rmb-peers talk to each other and to the relay with protobuf (as does any client that talks to the relay directly):

sameh-farouk commented 11 months ago

I read the original issue and the proposal here, but I don’t see how they are related. I believe the proposal here aims to solve the third point in the issue, which is:

make sure its always redundant, so even when rmb server falls away it still works,

I’m not clear on what "rmb server" means, but I guess it’s the rmb relay, right?

My understanding of the issue is that we want to have redundancy for the relays, so that if peer 1 is connected to relay A, and peer 2 is connected to relay B, they can still communicate even if one or both relays fail at some point.

However, the proposal here (RMB redundancy proposal) seems to focus on avoiding cache staleness by scanning new blocks for twin info changes and updating the cache accordingly in near real time. This might improve RMB availability a bit, but it won’t prevent downtime when a relay fails and the on-chain relay info is not yet updated (which could actually take far longer). Also, I don’t see how this proposal adds any redundancy. It is a good one, but it looks like it’s trying to solve a different problem than the one in the issue (at least to my understanding).

Why don’t we use a traditional failover mechanism instead? We could allow the on-chain relay field to store more than one IP address. The first IP address would point to the default relay, and the other IP addresses would point to identical redundant relays.

Then the client or rmb-peer could query and cache the list of IP addresses for each twin. If the first IP address times out, it would try the next IP in the list, and reorder the cached twin info to put the working IP address first.

This way, by enabling failover, the communication between twins can continue even if a relay fails, as the client will eventually find another operational relay.
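The failover loop described above could look roughly like this, with `try_connect` standing in for a real connection attempt; the function and names are illustrative, not part of rmb-rs:

```rust
// Sketch of the proposed failover: try each cached relay address in order,
// and on success move the working address to the front of the cached list
// so it is tried first next time.
fn first_working(
    relays: &mut Vec<String>,
    try_connect: impl Fn(&str) -> bool,
) -> Option<String> {
    // Find the first relay that answers.
    let idx = relays.iter().position(|r| try_connect(r))?;
    // Reorder the cached twin info: working relay goes first.
    let working = relays.remove(idx);
    relays.insert(0, working.clone());
    Some(working)
}
```

If every address in the list times out, the caller would fall back to re-querying the twin info from the chain.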

As for the cache staleness issue, I wonder how often twin info actually changes. Maybe we could simply use a shorter cache period, or skip the cache and re-query the twin info if a request to a relay times out (we would need some kind of circuit breaker as well). Scanning every new block from every relay would require a separate connection to a chain node for each relay just to keep twin info updated, which might be costly despite the data rarely changing. I think one reason we have GraphQL is to offload unnecessary read operations from the blockchain network, and to access on-chain data without having to interact directly with the blockchain nodes, which can be costly and sometimes unreliable.
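The "skip the cache and re-query on timeout, guarded by a circuit breaker" idea could be sketched like this; the thresholds, names, and policy are illustrative assumptions, not an rmb-rs API:

```rust
use std::time::{Duration, Instant};

// Minimal circuit-breaker sketch: after `threshold` consecutive failures
// against the cached relay, stop trying it (and fall back to a fresh
// twin-info query) until `cooldown` has passed.
struct CircuitBreaker {
    threshold: u32,
    cooldown: Duration,
    failures: u32,
    opened_at: Option<Instant>,
}

impl CircuitBreaker {
    fn new(threshold: u32, cooldown: Duration) -> Self {
        Self { threshold, cooldown, failures: 0, opened_at: None }
    }

    /// May we attempt the cached relay, or should we bypass it?
    fn allow(&mut self) -> bool {
        match self.opened_at {
            // Circuit is open and still cooling down: bypass the cache.
            Some(t) if t.elapsed() < self.cooldown => false,
            // Cooldown over: half-open, allow one try again.
            Some(_) => {
                self.opened_at = None;
                self.failures = 0;
                true
            }
            None => true,
        }
    }

    fn record_failure(&mut self) {
        self.failures += 1;
        if self.failures >= self.threshold {
            self.opened_at = Some(Instant::now());
        }
    }

    fn record_success(&mut self) {
        self.failures = 0;
        self.opened_at = None;
    }
}
```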

Also, if we use the redundancy method described above, cache staleness would be less relevant. Again, how often would the twin info change? Probably not more often than a DNS A record. And we still can’t solve the problem of DNS cache staleness (cache is everywhere 🤷🏻‍♂️).

Again, I'm not saying that the proposal is not good. I'm just trying to better understand why we are proposing/implementing this, and to make sure we all share the same understanding of the issue we are trying to solve, before going any further or writing any code.

Please @muhamadazmy let me know if any of this makes sense.

sameh-farouk commented 11 months ago

We have decided to implement redundancy and failover in the system to achieve high availability, after discussing the topic with Azamy. The required changes are summarized as follows:

sameh-farouk commented 10 months ago

Update:

sameh-farouk commented 9 months ago

Update: Azamy suggested an interesting idea of having a relay ranking system that is stored in-memory to optimize the federation routing. I will plan and implement this next.
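One possible shape for such an in-memory ranking, assuming a simple failure count per relay domain (purely illustrative; the actual design may differ):

```rust
use std::collections::HashMap;

// Hypothetical in-memory ranking: count recent delivery failures per
// relay domain and prefer relays with the fewest failures when routing
// a federated message.
#[derive(Default)]
struct RelayRanker {
    failures: HashMap<String, u32>,
}

impl RelayRanker {
    fn report_failure(&mut self, domain: &str) {
        *self.failures.entry(domain.to_string()).or_insert(0) += 1;
    }

    fn report_success(&mut self, domain: &str) {
        // A successful delivery clears the relay's bad record.
        self.failures.remove(domain);
    }

    /// Order candidate relay domains best-first (fewest recorded failures).
    fn rank(&self, candidates: &mut Vec<&str>) {
        candidates.sort_by_key(|d| self.failures.get(*d).copied().unwrap_or(0));
    }
}
```

Being in-memory only, such a ranking costs nothing on-chain and resets harmlessly when the relay restarts.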

sameh-farouk commented 9 months ago

Update:

A-Harby commented 8 months ago

Verified. I used the latest rmb release v1.1.1 and updated my twin relays from the gridproxy. I tried sending a message over those relays, and even closed two relays with the help of @sameh-farouk, and both messages were sent successfully.

I also tried again later to add some more relays, and they were added with no errors.