sudomesh / disaster-radio

A (paused) work-in-progress long-range, low-bandwidth wireless disaster recovery mesh network powered by the sun.
https://disaster.radio

Routing tables filled with garbage after some time, maybe hardware-related? #81

Open deafboy opened 4 years ago

deafboy commented 4 years ago

I'm using 1.0.0-rc.1 on a TTGO v1 and a TTGO v2. They communicate with each other smoothly. However, the routing table on both nodes starts to fill with garbage after a while.

This is what the web UI and BLE client show on the v2:

abfa0cfc01b6 | 91 | 88
31f4019b9288 | 31 | f4
010091ee9bf2 | 01 | 00
000000000100 | 00 | ff
ffff00ff0000 | 00 | ff
00ffff000000 | 01 | 00
008f14088100 | e8 | fb
3fd8f9004001 | 00 | 00
0100d3c8d3f4 | 01 | 00
918d3df10100 | f1 | 44
38f40248f288 | 37 | 92
051c91dba8f2 | 03 | 30
4500cc4c8b01 | 60 | 99
99ff62010660 | 00 | c9
3302cc603505 | 05 | 1c
058f14088301 | f3 | 6c
53b8850061a1 | 15 | 95
52010000 | |

abfa0cfc01b6 is my v1 node connected directly, yet it shows a distance of 91 hops.

When I try to get more info via telnet, the board resets:

< > /lora
Local address: 918838f8
�Connection closed by foreign host.
$

paidforby commented 4 years ago

That garbage in the web GUI and the BLE app is due to the fact that I have not updated either of them to interpret the latest changes to the routing table. The routing protocol (LoRaLayer2) now uses four-byte addresses; as you can see, the addresses reported in the web GUI (and probably the BLE client) are six bytes long. So it's not really garbage, but rather misinterpreted information.

I haven't had time or energy to fix the web gui and I don't know how to fix the ble client.

Also, both use a "hack" to get the routing table in the first place: they just watch for routing table packets, "intercept" them, and then interpret them. Instead, they should actively request the routing table from the LoRaClient. I started writing this feature into recent commits, and it is what is supposed to be used by the Console, but it is not totally finished (or thought out), as evidenced by #77.
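To illustrate the address-width mismatch described above, here is a minimal, standalone sketch. The entry layout (address, hop count, metric) is an assumption for illustration, not the exact LL2 wire format; the point is that slicing routes serialized with four-byte addresses on the old six-byte boundary makes every field land in the wrong column.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative only: assume each route entry is [address | hopCount | metric].
void printRoutes(const std::vector<uint8_t>& buf, size_t addrLen) {
    const size_t entryLen = addrLen + 2;  // address + 1 byte hops + 1 byte metric
    for (size_t i = 0; i + entryLen <= buf.size(); i += entryLen) {
        for (size_t j = 0; j < addrLen; j++) printf("%02x", (unsigned)buf[i + j]);
        printf(" | hops %u | metric %u\n",
               (unsigned)buf[i + addrLen], (unsigned)buf[i + addrLen + 1]);
    }
}

int main() {
    // Two routes serialized with 4-byte addresses.
    std::vector<uint8_t> table = {
        0xab, 0xfa, 0x0c, 0xfc, 0x01, 0xb6,   // addr abfa0cfc, 1 hop, metric 182
        0x31, 0xf4, 0x01, 0x9b, 0x02, 0x88 }; // addr 31f4019b, 2 hops, metric 136
    printf("parsed with 4-byte addresses (what LL2 now emits):\n");
    printRoutes(table, 4);
    printf("parsed with 6-byte addresses (what the web GUI still assumes):\n");
    printRoutes(table, 6);  // the address field swallows hop/metric bytes -> "garbage"
}
```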

BishopPeter1 commented 4 years ago

The problem with garbage in the routing table, and the reboot after the /lora command, is also present in the rc.2 version...

< > /lora
Routing Table: total routes 5
1 hops from 1d562bd8 via 1d562bd8 metric 207
1 hops from d1c8d390 via d1c8d390 metric 191
2 hops from 0ffd1775 via d1c8d390 metric 0
91 hops from 98f764b3 via d1c8d390 metric 18
255 hops from �␒␁␀␀␀␀␀␀␀␀␀␀␀␀␀␔␒␁␀␀i␀␀␀ ␀␀␀ hopts
Stack smashing protect failure!

abort() was called at PC 0x400f8bdf on core 1

just "enhancement" tag ?

paidforby commented 4 years ago

To be clear, the initial issue is related to the web app and the BLE app, which are really separate from the firmware. The bug you are reporting, @BishopPeter1, looks to be of a different nature, probably more closely related to #77 and #88.

That being said, thanks for reporting this. I think you are correct that this should be labeled a bug and not an enhancement, regardless of which part of the code it concerns.

I think I noticed this problem in the simulator, so I will try to reproduce it there. Not sure why the routes are getting all screwy. But I'm thinking the stack smashing is being caused because there is no check on the size of the routing table print-out before sending it to the console (it has to fit in a datagram to be sent to the console). Somewhere between LoRaClient.cpp and the getRoutingTable() function in LoRaLayer2 there needs to be a check that the output routing table does not exceed 239 bytes (one byte per char of the print-out). Alternatively, we could remove the limit on datagram size in the Layer3 code and only impose the 239-byte datagram limit in the LL2 code, which is where it actually matters. I actually like the second option, but it sounds slightly more difficult and could be confusing.
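A rough sketch of the first option, under the assumption that the print-out is built as one string and then handed to the console; sendToConsole() is a placeholder, not the real LoRaClient/LL2 API. The idea is simply to split the print-out into chunks that each fit the 239-byte datagram payload, preferably breaking on newlines so each route line stays intact.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>

static const size_t DATAGRAM_PAYLOAD = 239;  // LL2 datagram payload limit

// Stand-in for whatever actually pushes a datagram to the console client.
void sendToConsole(const char* data, size_t len) {
    printf("datagram (%zu bytes): %.*s\n", len, (int)len, data);
}

// Split an arbitrarily long routing-table print-out into datagram-sized chunks.
void sendRoutingTable(const std::string& table) {
    size_t pos = 0;
    while (pos < table.size()) {
        size_t len = std::min(DATAGRAM_PAYLOAD, table.size() - pos);
        if (pos + len < table.size()) {
            size_t nl = table.rfind('\n', pos + len - 1);  // last newline inside the chunk
            if (nl != std::string::npos && nl > pos) {
                len = nl - pos + 1;                        // end the chunk after that line
            }
        }
        sendToConsole(table.data() + pos, len);
        pos += len;
    }
}

int main() {
    std::string table = "Routing Table: total routes 2\n"
                        "1 hops from 1d562bd8 via 1d562bd8 metric 207\n"
                        "1 hops from d1c8d390 via d1c8d390 metric 191\n";
    sendRoutingTable(table);  // small tables still go out as a single datagram
}
```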

BishopPeter1 commented 4 years ago

As seen in https://github.com/sudomesh/LoRaLayer2/issues/17, something in the routing table is "byte-rotated", as seen below:

Routing Table: total routes 8
1 hops from c0d3f00d via c0d3f00d metric 255 <--- This is the sender... yay! Who is everyone else?!
1 hops from ffff00c0 via ffff00c0 metric 254
2 hops from d3f00d00 via ffff00c0 metric 0
64 hops from 48ce804b via ffff00c0 metric 7
255 hops from 00000000 via ffff00c0 metric 0
1 hops from 444344f8 via 444344f8 metric 254
2 hops from 13412824 via 444344f8 metric 0
41 hops from 1159063d via 444344f8 metric 24

Isn't it something to do with a wrong definition of the MAC address length in the routing table, after it was shortened from the initial version?

paidforby commented 4 years ago

Good catch. I haven't looked closely at the problem yet. It is possible that I missed something in that transition, but this problem seems more recent than that change.

Based on some quick investigation using the simulator, I think this is most likely a problem of memory not being cleared properly. In the simulator I'm not able to reproduce the problem of more routes appearing; however, it does look like my routing packets contain more data than they should, except the extra data is all zeros (probably because the stack/heap are arranged differently).

More investigation is needed to figure out why/where that memory is being overrun.
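If the extra bytes really are uninitialized memory, one cheap mitigation is to zero the packet buffer before filling it and to transmit only the bytes actually written. The sketch below compiles on its own; buildRoutes(), transmit(), and PACKET_LENGTH are stand-ins for illustration, not the actual LL2 names.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

static const size_t PACKET_LENGTH = 256;  // hypothetical maximum packet size

// Stub stand-in: writes route entries into buf and returns the bytes used.
size_t buildRoutes(uint8_t* buf, size_t maxLen) {
    const uint8_t oneRoute[] = {0xab, 0xfa, 0x0c, 0xfc, 0x01, 0xb6};  // addr, hops, metric
    size_t n = sizeof(oneRoute) < maxLen ? sizeof(oneRoute) : maxLen;
    memcpy(buf, oneRoute, n);
    return n;
}

// Stub stand-in for the radio transmit routine.
void transmit(const uint8_t* buf, size_t len) {
    (void)buf;
    printf("transmitting %zu bytes\n", len);
}

void sendRoutingPacket() {
    uint8_t packetBuffer[PACKET_LENGTH];
    memset(packetBuffer, 0, sizeof(packetBuffer));  // no stale stack data leaks into the packet
    size_t used = buildRoutes(packetBuffer, sizeof(packetBuffer));
    transmit(packetBuffer, used);                   // send only the bytes that were written
}

int main() { sendRoutingPacket(); }
```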

paidforby commented 4 years ago

FYI for those interested: I've rewritten the packet success and metric calculations so they actually make sense. I don't think this is the immediate cause of this issue, but it is something I noticed happening, and it is a start towards making sense of these garbage-filled routing tables.

I've still been unable to reproduce this problem, even with two real T-Beam boards. @BishopPeter1, what exactly was your setup? Was it just two nodes, or was it three? Were there any hops involved? Did you have a routing interval set, or were you doing the "manual"/reactive routing where messages must be sent to build the routing tables?

paidforby commented 4 years ago

As noted in https://github.com/sudomesh/LoRaLayer2/issues/17#issuecomment-695855603, I'm no longer seeing this problem in the LL2 sender/receiver example code, so I'm hoping some of my changes (maybe the packetSuccess/metric rewrite) solved the problem. I'll keep this issue open since I still haven't fixed the original bug with the web app and BLE app routing table print-outs, and also because I'm not super confident I've solved the problem, since I can't explain why it's not happening anymore.

deafboy commented 4 years ago

The rotated fake addresses seem to be gone. Partially correct addresses and completely random addresses are still displayed after a while. Currently I have 6 physical nodes running, and everything was fine until after I connected the TTGO v1 (abfa0cfc).

Its address was partially cut off:

Routing Table: total routes 5
1 hops from 918c1ee4 via 918c1ee4 metric 47
1 hops from 336ac791 via 336ac791 metric 47
1 hops from 336ac7b5 via 336ac7b5 metric 31
1 hops from 91a1eed4 via 91a1eed4 metric 47
1 hops from abf�?( �?ifps
Stack smashing protect failure!

After that, things quickly went downhill. For example, the routing table was returned twice:

Routing Table: total routes 9
1 hops from 918c1ee4 via 918c1ee4 metric 63
1 hops from 336ac791 via 336ac791 metric 63
1 hops from 336ac7b5 via 336ac7b5 metric 47
1 hops from 91a1eed4 via 91a1eed4 metric 63
�@�?D�?�A@?����iRouting Table: total routes 9
1 hops from 918c1ee4 via 918c1ee4 metric 63
1 hops from 336ac791 via 336ac791 metric 63
1 hops from 336ac7b5 via 336ac7b5 metric 4
Stack smashing protect failure!

The number of peers on the display seems to have peaked at 41:

Routing Table: total routes 41
1 hops from 336ac7b5 via 336ac7b5 metric 175
1 hops from 91a1eed4 via 91a1eed4 metric 175
1 hops from 336ac791 via 336ac791 metric 175
1 hops from 918c1ee4 via 918c1ee4 metric 175
1 hops from abibop
Stack smashing protect failure!

To me it looks like the bad routes are no longer broadcast by the v2 and T-Beam, but if a malformed packet is received for any reason, the error is propagated throughout the whole network. Some sanity checks may be needed on the receiving side: check whether the value is a hex number, or even add some form of checksum to protect against unintentional RX errors.
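A sketch of the kind of receive-side check being suggested here; the entry layout, the MAX_HOPS limit, and the XOR checksum are illustrative assumptions, not the actual LL2 packet format. The idea is to reject a received routing packet before merging it into the local table if its length, checksum, or field values look implausible.

```cpp
#include <cstddef>
#include <cstdint>

static const size_t ADDR_LENGTH = 4;                 // LL2 uses 4-byte addresses
static const size_t ENTRY_LENGTH = ADDR_LENGTH + 2;  // assumed: address + hops + metric
static const uint8_t MAX_HOPS = 16;                  // hypothetical plausibility limit

// Simple XOR checksum; the sender would append one byte computed the same way.
uint8_t xorChecksum(const uint8_t* data, size_t len) {
    uint8_t c = 0;
    for (size_t i = 0; i < len; i++) c ^= data[i];
    return c;
}

// Accept the payload only if it is a whole number of route entries,
// the trailing checksum matches, and every hop count looks sane.
bool routingPacketLooksValid(const uint8_t* payload, size_t len) {
    if (len < ENTRY_LENGTH + 1) return false;              // at least one entry + checksum
    size_t body = len - 1;
    if (body % ENTRY_LENGTH != 0) return false;            // truncated or misaligned
    if (xorChecksum(payload, body) != payload[body]) return false;
    for (size_t i = 0; i < body; i += ENTRY_LENGTH) {
        uint8_t hops = payload[i + ADDR_LENGTH];
        if (hops == 0 || hops > MAX_HOPS) return false;    // 91- or 255-hop routes get dropped
    }
    return true;
}

int main() {
    uint8_t pkt[] = {0xab, 0xfa, 0x0c, 0xfc, 0x01, 0xb6, 0x00};  // one route + checksum slot
    pkt[6] = xorChecksum(pkt, 6);                                // sender appends the checksum
    return routingPacketLooksValid(pkt, sizeof(pkt)) ? 0 : 1;
}
```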

When it comes to the original UI-related issue, the routing table is not currently displayed in the web UI at all. Maybe we should split this into several separate issues, but at this point I'm not sure how many problems are even present, or which part of the codebase they're related to.

BishopPeter1 commented 4 years ago

> @BishopPeter1 what exactly was your setup?

My setup is:
- TTGO LoRa v2.1_1.6 - WiFi client
- TTGO LoRa v1.0 connected by BT to mobile
- 2x TTGO LoRa v1.0 - just hanging around

On the computer there is a small script, called every hour, that telnets to the v2.1 TTGO and writes something funny to the console. If I choose a longer routing interval, then everything takes a longer time to break.

If I disable route learning (by setting the routing interval to 0), then all the problems disappear... but the two LoRa v1.0 boards never end up in the routing table, as they say nothing; they are just hanging around, switched on...

> until after I connected the TTGO v1 (abfa0cfc).

It looks like the problem is something with the TTGO v1, as we both have them? Maybe something like this? https://stuartsprojects.github.io/2018/06/02/phantom-packets-is-this-the-reason.html

paidforby commented 4 years ago

OK. I'm not sure I have more than 4 working boards at the moment, and I definitely don't have a TTGO V1 board. I'll try to reproduce this in the simulator by creating a network with > 4 nodes (and maybe include some hops as well). If the simulator works for a very long time (overnight?) without causing this issue, then I would feel fairly confident that this is somehow a hardware/LoRa-related problem. Note that @robgil in https://github.com/sudomesh/LoRaLayer2/issues/17 was seeing this problem with Heltec V2 boards. I know the TTGO V1 is a particularly bad board design; I'm not aware of quality issues with the Heltec V2, though @robgil claims not to be seeing the issue since the latest changes to LL2.

I agree with @deafboy on splitting this issue. Let's keep this one as the "routing tables filling with garbage, maybe hardware-related?" issue, since the discussion has turned more in that direction. I'll create a new issue for the original problem of making the routing tables appear correctly (or appear at all?) in the web app and BLE app.

robgil commented 4 years ago

@paidforby I'm still seeing the issues after fixing the packet size. Still testing though and will report back.

paidforby commented 4 years ago

@robgil did you update to the latest commit of LL2 (or at least https://github.com/sudomesh/LoRaLayer2/commit/e173bf43214d9a7f5101465330603bee98d073d4 )? Is your setup just two nodes (one sender and one receiver)? Or are you only seeing this with more than two?

paidforby commented 4 years ago

After leaving the sim running with seven nodes for more than 12 hours, I didn't observe any issues with the routing table when leaving them in proactive routing mode (see below). I did observe some oddities related to the updating of metrics (specifically, metrics for nodes 2 or more hops away); I think I just need to review the logic used for creating those values. I also noticed that a stack smashing failure is generated upon printing the routing table once it becomes larger; this is because I didn't create any check to break the routing table print-out into smaller chunks that fit inside individual datagrams. Both of these issues should be easy to fix, but I don't think they are directly related to the issue being observed on physical devices.

[screenshot: sim_working]

robgil commented 4 years ago

@paidforby I'm running only 2 nodes. So far so good after removing the extra packet length from the datagram. I'm seeing only one neighbor in the routing table now:

Routing Table: total routes 1
1 hops from c0d3f00c via c0d3f00c metric   1 

I'll run it for a while longer to see if anything pops up. I bet the overflow from the packet might somehow be polluting the routes, but this is just conjecture at this point. Latest testing code is here. This adds the non-blocking pseudo-delay() approach and fixes the packet length.

EDIT: Just compiled/built with latest commit and so far so good as well.
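For context, the "non-blocking pseudo delay()" approach mentioned above generally looks like the generic Arduino pattern below; the interval value and function names are illustrative, not taken from the linked testing code. The point is to track elapsed time with millis() instead of calling delay(), so packets and the console keep being serviced between transmissions.

```cpp
#include <Arduino.h>

const unsigned long BEACON_INTERVAL_MS = 15000;  // illustrative routing/beacon interval
unsigned long lastBeacon = 0;

void sendBeacon() { /* placeholder for the real LoRa transmit */ }
void servicePackets() { /* placeholder for receive/console handling */ }

void setup() {}

void loop() {
    unsigned long now = millis();
    if (now - lastBeacon >= BEACON_INTERVAL_MS) {  // unsigned math survives millis() rollover
        lastBeacon = now;
        sendBeacon();
    }
    servicePackets();  // runs every pass; never starved by a blocking delay()
}
```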

deafboy commented 3 years ago

Today I tried lowering the CPU clock to 80 MHz. The two TTGO v2 boards were running stably for several hours, exchanging beacon packets and occasional messages. As soon as I unplugged the USB cable from one of the boards, leaving it only on battery power, the routing table was altered: [image]

The change has been replicated on the other node as well: [image]

Not sure if the CPU clock change was the culprit. Will try to replicate with default clock speed later.