sudomesh / disaster-radio

A (paused) work-in-progress long-range, low-bandwidth wireless disaster recovery mesh network powered by the sun.
https://disaster.radio

Routing table error after long uptime or 3+ nodes #38

Closed 2E0PGS closed 4 years ago

2E0PGS commented 4 years ago

Great firmware, I just successfully range tested two modules with impressive results on the 868MHz band.

A few improvement ideas

Cheers

samuk commented 4 years ago

Yes, both are good ideas; the status has been discussed in the issue below:

https://github.com/sudomesh/disaster-radio/issues/35

Writing scrolling last-sent messages to the OLED has been mentioned on the mailing list.

paidforby commented 4 years ago

@samuk is correct, both of these issues have been on my mind.

For what it is worth, the latest firmware (which I will release as 0.1.1 soon) includes a console interface, accessed over serial, that lets you print the routing table by typing lr -r. See the testing firmware section of the readme for the current abilities of the console interface. The routing table will show "connected" nodes along with the "quality" of each connection in the form of the metric.

I'm planning on figuring out a way to display the routing table in the web app.

paidforby commented 4 years ago

@2E0PGS check out the new "Active Nodes" list feature mentioned in the related issue: https://github.com/sudomesh/disaster-radio/issues/35#issuecomment-574992472

It should be working if you build both the firmware and the web app from latest; otherwise, I will be compiling a pre-built binary for 0.1.1 soon.

No progress on utilizing the OLED screen yet.

2E0PGS commented 4 years ago

ok cool thanks!

2E0PGS commented 4 years ago

Sorry it took me a while to try v0.1.1. I just flashed it on my two boards. Working great! I can now see the hops and metric.

I presume node 000000000000 is the local node's address, with hops 00 and metric 00?

2E0PGS commented 4 years ago

It looks like there may be a bug with the beacon length when the device is left on for a long time.

2020-01-25 17:20:31

image

2020-01-26 00:46:40

image

I didn't change any settings in GQRX.

The only changes I can think of are room temperature, the laptop warming up, the LoRa board warming up, and the HackRF warming up.

Version 0.1.1 from the binary release.

I had two nodes running there. Oddly enough, unplugging and replugging didn't reset it.

However, back this morning with a cold room and cold devices (switched off overnight), they're back to how they were in the first screenshot.

2E0PGS commented 4 years ago

I will try to replicate it by artificially heating up my board, or by leaving one on and one off and comparing after some hours.

2E0PGS commented 4 years ago

No sudden changes from artificially heating my SDR or my LoRa board. I am using TTGO.

I will try leaving one running for now. Then I can turn the other on later and compare.

2E0PGS commented 4 years ago

The only code references I see are these two: https://github.com/search?q=org%3Asudomesh+beaconInterval&type=Code

2E0PGS commented 4 years ago

Or the glitch relates to the route message getting longer. Android phone on the WiFi slowing it down?

2E0PGS commented 4 years ago

I did some testing today. Here are the results.

During all of this testing the GQRX settings were not modified. I am running v0.1.1 firmware from the prebuilt binaries on two TTGO boards.

Ignore the extra harmonics; these are due to one board being powered via a grounded mains-to-5V PSU and the second via a battery power bank. The second, lagging signal is the power bank TTGO; we shall call this node 2.

2020-01-26 17:28:28 "Receiver Options"

This is the beginning of the test; I show a few settings windows. image

2020-01-26 17:28:38 "FFT Settings"

image

2020-01-26 17:28:42 "Input controls"

image

2020-01-26 22:03:16

Several hours into testing I notice an increase in the signal TX length. image

2020-01-26 22:33:48

I power cycle one of the boards to see if this changes its TX length. It makes no change. image

2020-01-26 22:35:50

I decide to try power cycling both boards to see if the issue is related to packets exchanged between the two, maybe routing information. This resolves the problem. image

2E0PGS commented 4 years ago

Running one node on its own for hours with no neighbours didn't produce this behaviour. This makes me think it's route message related.

samuk commented 4 years ago

Interesting stuff, wonder if it's worth testing with the latest code? Realise not that much has changed, but might be worth verifying it's still an issue?

paidforby commented 4 years ago

Highly likely that there may be an unknown error with the routing message logic that only appears after a long uptime. My guess is that a byte gets shifted somewhere and starts filling the routing table with false routes. This would explain why it doesn't go away after only one node is restarted, because the node that was kept on immediately shares those false routes with the rebooted node. However, when both are rebooted, their routing tables are reset and their little network "forgets" about the false routes.
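To make the theory concrete, here is a toy Python model of it (not the LoRaLayer2 code). It assumes each node simply broadcasts its whole routing table and merges whatever it hears; the `Node` class, the `exchange` helper, and the addresses are all hypothetical and exist only for illustration.

```python
# Toy model of the hypothesis: nodes broadcast their whole routing table and
# neighbours merge what they hear. Hypothetical names, not the real firmware.

class Node:
    def __init__(self, address):
        self.address = address
        self.routes = {}  # destination -> (next_hop, hops)

    def reboot(self):
        # A power cycle wipes the in-memory routing table
        self.routes = {}

    def routing_packet(self):
        # Advertise ourselves plus every destination we currently know
        entries = {self.address: 0}
        for destination, (_next_hop, hops) in self.routes.items():
            entries[destination] = hops
        return entries

    def receive(self, sender, entries):
        # Accept any route that is new or shorter than the one we hold
        for destination, hops in entries.items():
            if destination == self.address:
                continue
            known = self.routes.get(destination)
            if known is None or hops + 1 < known[1]:
                self.routes[destination] = (sender.address, hops + 1)


def exchange(a, b):
    # Both nodes transmit the table they held before hearing the other
    packet_a, packet_b = a.routing_packet(), b.routing_packet()
    b.receive(a, packet_a)
    a.receive(b, packet_b)


if __name__ == "__main__":
    node1, node2 = Node("aaaaaaaaaaaa"), Node("bbbbbbbbbbbb")
    # Assumed corruption: a false route appears on node1 after long uptime
    node1.routes["deadbeef0000"] = ("ffffffffffff", 3)
    exchange(node1, node2)

    node1.reboot()
    exchange(node1, node2)
    print("after rebooting one node:", sorted(node1.routes))

    node1.reboot()
    node2.reboot()
    exchange(node1, node2)
    print("after rebooting both nodes:", sorted(node1.routes))
```

In this model, rebooting only one node lets it relearn the false route from its neighbour, while rebooting both clears it, which matches the power-cycle observations reported earlier in the thread. It does not show where the false route comes from in the first place.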

Note: this is just my theory, I would need to do some actual testing and write some debugging code to demonstrate that this is happening.

tlrobinson commented 4 years ago

I don’t know if this is related, but I was seeing an issue where the routing table exploded (dozens of new entries per minute) if I had 3+ nodes running. I’ll try to reproduce it.


2E0PGS commented 4 years ago

Ref the hypothesis, this sounds about right to me. I suspect the routing table is filling up, and this has the knock-on effect of a longer TX length because the routing message is longer.
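As a rough illustration of why a longer routing message would show up as a longer burst on the waterfall, the sketch below uses the standard LoRa time-on-air formula from the Semtech SX127x datasheets. The radio settings (SF9, 125 kHz, coding rate 4/5, 8-symbol preamble) and the payload sizes are assumptions chosen for the example, not values confirmed for the disaster-radio firmware.

```python
import math


def lora_time_on_air(payload_bytes, spreading_factor=9, bandwidth_hz=125_000,
                     coding_rate=1, preamble_symbols=8, explicit_header=True,
                     crc=True, low_data_rate_optimize=False):
    """Return the LoRa packet time on air in seconds.

    coding_rate is 1..4, meaning 4/5..4/8. The default settings are
    assumptions for illustration only.
    """
    symbol_time = (2 ** spreading_factor) / bandwidth_hz
    implicit_header = 0 if explicit_header else 1
    optimize = 1 if low_data_rate_optimize else 0

    numerator = (8 * payload_bytes - 4 * spreading_factor + 28
                 + 16 * int(crc) - 20 * implicit_header)
    denominator = 4 * (spreading_factor - 2 * optimize)
    payload_symbols = 8 + max(
        math.ceil(numerator / denominator) * (coding_rate + 4), 0)

    preamble_time = (preamble_symbols + 4.25) * symbol_time
    return preamble_time + payload_symbols * symbol_time


if __name__ == "__main__":
    # Hypothetical routing packet sizes as the table fills up
    for size in (20, 60, 120, 200):
        print(f"{size:3d} bytes -> {lora_time_on_air(size) * 1000:.1f} ms")
```

Under these assumed settings the airtime grows roughly linearly with payload size, so a routing table that fills with extra entries would produce visibly longer transmissions without any change to the beacon interval.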

samuk commented 4 years ago

Would you be up for trying to replicate your error with the latest routing? Hoping this bug has just gone away: https://github.com/sudomesh/disaster-radio/issues/57#issuecomment-628040381

paidforby commented 4 years ago

Yes, it would be good to test whether this bug is resolved on the 1.0.0-rc.2 branch, which uses the latest updates to LoRaLayer2. LoRaLayer2 has switched to a more dynamic source routing (DSR) style and no longer requires the sharing of routing tables via routing table packets.

paidforby commented 4 years ago

Closing this issue and merging it with #81 since there is more activity on that thread and these seem closely (if not directly) related issues.