sudomesh / disaster-radio

A (paused) work-in-progress long-range, low-bandwidth wireless disaster recovery mesh network powered by the sun.
https://disaster.radio

Alternative mesh routing protocol options #57

Open X-Ryl669 opened 4 years ago

X-Ryl669 commented 4 years ago

If I understand your wiki correctly, you're only dealing with two levels of abstraction for the routing table. As I understand it, it is almost impossible to reach a node in a large network (let's say at the 120th hop), because:

  1. The origin node will never be aware of it
  2. The limited routing table size cannot grow exponentially.
  3. The metrics are not static; they can fluctuate depending on the moon, the position of the node, a car passing by, and so on. Thus, sometimes packets will flow through the network correctly, and sometimes they'll just vanish.

Typically, let's say you have this topology:

  [A] => [B] => [C] => [D] => [E] => [F]
                    \=> [E]=> [F]

If node A wants to talk to node F, then it must know about it (this is impossible to solve by storing all possible nodes' MAC addresses, BTW). Let's say that it has learnt about it previously, and now wants to talk to it. It sends a packet with DEST=F and, according to its routing table, the packet must go via NEXTHOP = B.

Similarly, B needs to have F in its routing table too (how could that possibly scale if the number of nodes is very high?). Let's say B is Rain Man and remembers all possible nodes.

When B receives it, it has to choose between C and E for its next hop. The metric for C might be better because C is closer to B, even though there is an additional node D in the chain. So it chooses C. Conditions are bad, and the packet struggles along the chain via D and does not reach E (or F) immediately.

After some time, A resends the packet to F. By then, the metric for F has changed so that B selects E. D finally wakes up and sends the first packet to E, which passes it on to F.
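
As a rough illustration of the scenario above (not disaster.radio's actual routing code), here is a minimal sketch of metric-based next-hop selection at node B. The table layout, the metric values, and the "lower is better" convention are all assumptions made up for the example.

```python
# Hypothetical routing table for node B:
# destination -> {next hop: metric}. Lower metric is assumed to be better.
def best_next_hop(table, dest):
    """Return the neighbour with the lowest metric for dest, or None if unknown."""
    candidates = table.get(dest)
    if not candidates:
        return None  # B has never heard of this destination
    return min(candidates, key=candidates.get)

# Invented metrics: C currently looks better than E for reaching F.
table_b = {"F": {"C": 2.0, "E": 3.5}}
first_choice = best_next_hop(table_b, "F")   # first packet goes via C

# Conditions change before A retransmits; the route via C now looks worse.
table_b["F"]["C"] = 5.0
second_choice = best_next_hop(table_b, "F")  # the retransmission goes via E
```

Both copies are now in flight on different paths, which is exactly how F ends up receiving the packet twice.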

Then F can (and will) receive two packets (one coming via D => E, another directly via E). How can it reply?

Now imagine that F moves closer to B, so that B can talk to it directly. Who is going to remove the old routing table entries in C and D? What if a packet for F is being sent to C while D is removing its entry for F from its routing table, before C has removed its own? In that case, the packet is lost at D (when C sends it to D for F) and no feedback is sent to A about the loss. This is very bad for overall network consumption because, to deal with this kind of issue, either A must resend the packet to ensure it arrives at its destination (wasting global network resources), or any intermediary node before F must somehow tell A that delivery failed (which could be even worse, since a single packet could trigger a spread of packets back to the origin: D tells A it failed, then C tells A it failed, and so on).

The issue described above is not theoretical; it's the main routing issue we observe on the Internet. I'm not even speaking of bad routes or packet loss here.

As I understand it, any approach based on storing a dictionary of all possible participants will fail, since its memory and resource consumption grows exponentially (imagine the above example if all nodes could see F except A).

  1. We need some kind of path detection algorithm to figure out the best route to a destination.
  2. We need some kind of logarithmic abstraction somewhere (something like a tree or forest algorithm) to break the exponential growth required by the current algorithm.
  3. We need a way to avoid sending along dual paths (because it'll be almost impossible to keep the duty cycle at 1% if all paths start sending the same packets twice).
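
To make the third point more concrete, here is a minimal sketch of one common mitigation: a short-lived "seen packet" cache that lets a node refuse to relay or deliver the same packet twice. It does not stop B from picking two different paths; it only limits the damage downstream. The class name, the packet identifier scheme, and the TTL are hypothetical and not taken from disaster.radio.

```python
import hashlib
import time

class SeenCache:
    """Remembers recently seen packet ids so duplicates are not relayed again."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable so the cache can be tested without sleeping
        self._seen = {}      # packet id -> expiry timestamp

    @staticmethod
    def packet_id(source, sequence, payload):
        # Assumed id: truncated hash over origin, sequence number and payload.
        h = hashlib.sha256()
        h.update(source.encode("utf-8"))
        h.update(sequence.to_bytes(4, "big"))
        h.update(payload)
        return h.digest()[:8]

    def should_forward(self, source, sequence, payload):
        now = self.clock()
        # Drop expired entries so memory use stays bounded by recent traffic.
        self._seen = {k: exp for k, exp in self._seen.items() if exp > now}
        pid = self.packet_id(source, sequence, payload)
        if pid in self._seen:
            return False     # duplicate that arrived over another path
        self._seen[pid] = now + self.ttl
        return True
```

In the topology above, E would forward the copy arriving from B and silently drop the late copy arriving from D (or the other way round), so F only has one packet to answer.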

samuk commented 3 years ago

> Ideally, that should not happen. Just one megabyte of memory can accommodate more than 22.000 active links,

So ~4400 as an upper limit on 192Kb devices.

markqvist commented 3 years ago

> So ~4400 as an upper limit on 192Kb devices.

~4400 if you can allocate 192KB to the link table, since other things will also need memory ;) So my guess is a 192KB device could probably handle around 1500 active links. But I think the link throughput and CPU cycles needed to actually pass traffic for 1500 active connections would be exhausted before the RAM runs out.
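
For reference, the arithmetic behind these estimates can be sketched as follows. It assumes the quoted "22.000 links per megabyte" means 22,000 links per MiB and that the per-link cost is constant; under those assumptions the ceiling for 192 KB comes out slightly below the ~4400 mentioned above.

```python
LINKS_PER_MIB = 22_000   # figure quoted in the thread
MIB = 1024 * 1024        # assumption: "one megabyte" means 1 MiB

def max_links(ram_bytes, share_for_links=1.0):
    """Upper bound on active links if a given share of RAM holds the link table."""
    return int(ram_bytes * share_for_links * LINKS_PER_MIB / MIB)

device_ram = 192 * 1024                            # a 192 KB device
bytes_per_link = MIB / LINKS_PER_MIB               # roughly 48 bytes per link
ceiling = max_links(device_ram)                    # every byte used for links
share_for_1500 = 1500 * bytes_per_link / device_ram  # RAM share implied by 1500 links
```

With these numbers, the guess of about 1500 active links corresponds to a bit more than a third of the device's RAM being available for the link table.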

X-Ryl669 commented 3 years ago

> You seem to be talking about LoRaWAN, which that table also seems to be describing

No, it comes from here: https://lora-developers.semtech.com/library/tech-papers-and-guides/lora-and-lorawan/ and the table is about LoRa and not LoRaWAN.

You can probably use your node the way you want (with low SF), not giving a thought about respecting the standards and rules (duty cycle for one) but doing so does not give confidence in your network, and it's bothering other users around (which might be using LoRaWAN, or not). As long as you're not spreading on all channels...

Sure, you can have hundreds of nodes that are close to each other and it'll work.

I never had hundreds of nodes, only a few that are very far apart, and the duty cycle constraint bothers me a lot more than the bandwidth constraint, since sending data takes forever (and I'm talking about ~160 bytes here, not 500) and the nodes must stay awake while they transmit.
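
To give a feel for the numbers, here is a sketch of the LoRa time-on-air formula from the Semtech SX127x datasheet, combined with a 1% duty cycle. The spreading factor, bandwidth, coding rate, and preamble length used below are assumptions chosen for illustration; the thread does not state the actual radio settings.

```python
import math

def lora_time_on_air(payload_bytes, sf, bw_hz=125_000, coding_rate=1,
                     preamble_symbols=8, explicit_header=True, crc=True):
    """Time on air in seconds for one LoRa packet (SX127x datasheet formula).

    coding_rate is 1..4, meaning 4/5..4/8.
    """
    t_sym = (2 ** sf) / bw_hz
    # Low data rate optimisation is mandated when a symbol lasts more than 16 ms.
    de = 1 if t_sym > 0.016 else 0
    ih = 0 if explicit_header else 1
    numerator = 8 * payload_bytes - 4 * sf + 28 + (16 if crc else 0) - 20 * ih
    denominator = 4 * (sf - 2 * de)
    payload_symbols = 8 + max(math.ceil(numerator / denominator) * (coding_rate + 4), 0)
    t_preamble = (preamble_symbols + 4.25) * t_sym
    return t_preamble + payload_symbols * t_sym

def min_silence_after(toa_seconds, duty_cycle=0.01):
    """Seconds the transmitter must then stay quiet to respect the duty cycle."""
    return toa_seconds * (1.0 / duty_cycle - 1.0)

# Assumed long-range settings: SF12, 125 kHz, coding rate 4/5.
toa_sf12 = lora_time_on_air(160, sf=12)
wait_sf12 = min_silence_after(toa_sf12)

# Same payload at an assumed short-range setting, SF7.
toa_sf7 = lora_time_on_air(160, sf=7)
```

Under these assumed settings, a 160-byte packet at SF12 is on air for roughly 5.9 seconds and is followed by close to ten minutes of mandatory silence, while the same packet at SF7 takes about a quarter of a second.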

All in all, I've followed the Reticulum protocol since it was first described and I think it's a good protocol in itself; yet I feel it's too large to work on LoRa. Maybe you'll prove me wrong (and that would be great!). One of the missing things in the protocol, IMHO, is the possibility to pre-provision symmetric keys for group communication, in order to avoid transmitting all the session creation packets. After all, if you control the nodes on the network, you can start communicating with encryption (using a known, secret, symmetric key) without having to set up DH, and that would save a lot of bandwidth.

Another improvement I haven't seen in any mesh network protocol is scheduling the next packet's wake-up time to save battery. Many nodes have a very precise clock (GPS, oscillator, etc.) that they could use to synchronize communication time, including transmission delay (thus allowing better usage of the duty cycle limit), and to compute the varying session encryption keys from the elapsed time/wall-clock time. That's information that no longer needs to be transmitted.
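
A minimal sketch of what this could look like, assuming every node holds the same pre-provisioned group key and a reasonably accurate clock. This is not Reticulum's API; the slot length, the key derivation label, and all function names are hypothetical.

```python
import hashlib
import hmac
import struct

SLOT_SECONDS = 60  # hypothetical length of one communication slot

def slot_index(unix_time, slot_seconds=SLOT_SECONDS):
    """Index of the time slot containing unix_time."""
    return int(unix_time // slot_seconds)

def slot_key(group_key, unix_time, slot_seconds=SLOT_SECONDS):
    """Derive the per-slot key from the shared group key and the clock alone.

    Nothing is exchanged over the air: nodes with the same group key and
    synchronized clocks compute the same 32-byte key independently.
    """
    counter = struct.pack(">Q", slot_index(unix_time, slot_seconds))
    return hmac.new(group_key, b"slot-key" + counter, hashlib.sha256).digest()

def seconds_until_next_slot(unix_time, slot_seconds=SLOT_SECONDS):
    """How long a node can sleep before the next agreed wake-up time."""
    return slot_seconds - (unix_time % slot_seconds)

# Two nodes provisioned with the same (example) key, clocks slightly apart.
GROUP_KEY = b"example-preprovisioned-group-key"
key_node_1 = slot_key(GROUP_KEY, 120.0)
key_node_2 = slot_key(GROUP_KEY, 179.9)   # same slot, so the same key
key_next = slot_key(GROUP_KEY, 180.0)     # next slot, so a different key
```

The derived key would still have to be fed into a real authenticated cipher, and clock drift near slot boundaries needs handling (for example by also trying the adjacent slot's key); both are left out of this sketch.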

markqvist commented 3 years ago

@X-Ryl669, that table is specifically talking about modulation characteristics in the context of LoRaWAN networks. Maybe you should read the document again.

> Sure, you can have hundreds of nodes that are close to each other and it'll work.

Who said I had a hundred nodes close to each other? Quit with the straw manning already dude ;)

> not giving a thought about respecting the standards and rules

Again with the straw man. That's a completely unfounded accusation, which I find just a wee bit offensive ;) I sincerely hope you will take that back. I care a great deal about proper spectrum use, and I've designed and built commercial networks in both licensed and unlicensed spectrum in more or less every band from HF to microwave. To me it seems like you are not that knowledgeable about spectrum use regulation, outside of a rather specific area, which you then seem to believe you can apply to everything. Spectrum access regulation is pretty complex, and while that Semtech page you linked to is probably a fine reference for what you can and cannot do within the LoRaWAN standard, you can extract very little about actual regulation, i.e. law, from that.

Just so you know, there is much more spectrum available (both licensed and unlicensed) where LoRa can be used legally than just the standard channels specified by the LoRa Alliance.

As I said before, I completely agree that using Reticulum on top of, or within the confines of the LoRaWAN standard is pointless at best. There are plenty of other cases where Reticulum over LoRa PHY works wonderfully though. Reticulum and LoRaWAN are very different protocols, that serve very different purposes. They just both happen to be able to use the LoRa PHY.

I think you have a good point about how pre-provisioning of group keys is cumbersome at the moment. It can be done, but there is no easy-to-use API for it. I will look into that in the future.

And by the way, the LoRa MTU is 255 bytes, not 242 as you wrote ;)

markqvist commented 3 years ago

I still don't know if Reticulum is actually a good fit for disaster.radio, since my knowledge of disaster.radio is way too limited. But for what it's worth, Reticulum can now, since commit cd8de6420155dc14f1b4743fc99d8c5bf68589cb, work with MTUs all the way down to 211 bytes. Bandwidth efficiency is negatively impacted by such a low MTU, but it does work. This is not "officially" supported, and should probably be considered rather experimental.

The official default MTU is still 500 bytes, and running any other MTU than 500 will break intercommunication with networks that run the standard MTU.

So while I would not recommend using a different MTU, it is now possible if you want to :)

The change will be in beta 0.2.4, which I am going to release in a few days.

samuk commented 3 years ago

@markqvist thanks for the update. We'd still need a C implementation, I think, before even exploring it. https://github.com/markqvist/Reticulum/issues/2