meshtastic / firmware

Meshtastic device firmware
https://meshtastic.org
GNU General Public License v3.0
3.3k stars 800 forks source link

some messages are dropped - probably acks are getting starved #681

Closed geeksville closed 2 years ago

geeksville commented 3 years ago

new behavior since 1.0.

pretty high drop rate (about 30%?) - usually hidden by the retry mechanism.

geeksville commented 3 years ago

This issue has been mentioned on Meshtastic. There might be relevant details there:

https://meshtastic.discourse.group/t/lost-messages-while-testing/2455/3

geeksville commented 3 years ago

hmm - stress test shows no dropped packets at the radio or python api level:

~/development/meshtastic/Meshtastic-python$ bin/run.sh --test 
INFO:root:Writing serial debugging to log_dev_ttyUSB0
Connection changed: meshtastic.connection.established
INFO:root:Writing serial debugging to log_dev_ttyUSB1
Connection changed: meshtastic.connection.established
INFO:root:Ports opened, starting test
INFO:root:Found devices, starting tests...
INFO:root:Sending test packet from 2885173400 to 4294967295
From /dev/ttyUSB1: {'from': 681021528, 'to': 4294967295, 'decoded': {'data': {'portnum': 'POSITION_APP', 'payload': b'', 'position': {}}, 'wantResponse': True}, 'id': 1601631314, 'hopLimit': 3, 'fromId': '!246f28979058', 'toId': '^all'}
From /dev/ttyUSB1: {'from': 681021528, 'to': 681021528, 'decoded': {'data': {'portnum': 'POSITION_APP', 'payload': b'', 'position': {}}}, 'id': 1601631315, 'hopLimit': 3, 'fromId': '!246f28979058', 'toId': '!246f28979058'}
From /dev/ttyUSB1: {'from': 681021528, 'to': 681021528, 'decoded': {'data': {'portnum': 'POSITION_APP', 'payload': b'', 'position': {}}}, 'id': 1601631316, 'hopLimit': 3, 'fromId': '!246f28979058', 'toId': '!246f28979058'}
From /dev/ttyUSB1: {'from': 2885173400, 'to': 681021528, 'decoded': {'data': {'portnum': 'POSITION_APP', 'payload': b' dM\xe0\x8b#`', 'position': {'batteryLevel': 100, 'time': 1612942304}}}, 'id': 153691730, 'rxSnr': 9.5, 'hopLimit': 3, 'fromId': '!2462abf84098', 'toId': '!246f28979058'}
From /dev/ttyUSB1: {'from': 681021528, 'to': 4294967295, 'decoded': {'data': {'portnum': 'NODEINFO_APP', 'payload': b'\n\r!246f28979058\x12\x0cUnknown 9058\x1a\x03?58"\x06$o(\x97\x90X', 'user': {'id': '!246f28979058', 'longName': 'Unknown 9058', 'shortName': '?58', 'macaddr': 'JG8ol5BY'}}, 'wantResponse': True}, 'id': 1601631317, 'rxTime': 1612942314, 'hopLimit': 3, 'fromId': '!246f28979058', 'toId': '^all'}
From /dev/ttyUSB1: {'from': 681021528, 'to': 681021528, 'decoded': {'data': {'portnum': 'NODEINFO_APP', 'payload': b'\n\r!246f28979058\x12\x0cUnknown 9058\x1a\x03?58"\x06$o(\x97\x90X', 'user': {'id': '!246f28979058', 'longName': 'Unknown 9058', 'shortName': '?58', 'macaddr': 'JG8ol5BY'}}}, 'id': 1601631318, 'rxTime': 1612942314, 'hopLimit': 3, 'fromId': '!246f28979058', 'toId': '!246f28979058'}
From /dev/ttyUSB1: {'from': 681021528, 'to': 681021528, 'decoded': {'data': {'portnum': 'NODEINFO_APP', 'payload': b'\n\r!246f28979058\x12\x0cUnknown 9058\x1a\x03?58"\x06$o(\x97\x90X', 'user': {'id': '!246f28979058', 'longName': 'Unknown 9058', 'shortName': '?58', 'macaddr': 'JG8ol5BY'}}}, 'id': 1601631319, 'rxTime': 1612942314, 'hopLimit': 3, 'fromId': '!246f28979058', 'toId': '!246f28979058'}
Ignoring sending interface
From /dev/ttyUSB1: {'from': 2885173400, 'to': 681021528, 'decoded': {'data': {'portnum': 'NODEINFO_APP', 'payload': b'\n\r!2462abf84098\x12\x07Node 98\x1a\x02N9"\x06$b\xab\xf8@\x98', 'user': {'id': '!2462abf84098', 'longName': 'Node 98', 'shortName': 'N9', 'macaddr': 'JGKr+ECY'}}}, 'id': 153691731, 'rxSnr': 9.0, 'rxTime': 1612942330, 'hopLimit': 3, 'fromId': '!2462abf84098', 'toId': '!246f28979058'}
From /dev/ttyUSB1: {'from': 2885173400, 'to': 4294967295, 'decoded': {'data': {'portnum': 'TEXT_MESSAGE_APP', 'payload': b'Test 1', 'text': 'Test 1'}}, 'id': 2650926549, 'rxSnr': 10.0, 'rxTime': 1612942340, 'fromId': '!2462abf84098', 'toId': '^all'}
INFO:root:Test succeeded (1 successes so far)
INFO:root:Sending test packet from 2885173400 to 4294967295
Ignoring sending interface
From /dev/ttyUSB1: {'from': 2885173400, 'to': 4294967295, 'decoded': {'data': {'portnum': 'TEXT_MESSAGE_APP', 'payload': b'Test 2', 'text': 'Test 2'}}, 'id': 2650926550, 'rxSnr': 9.0, 'rxTime': 1612942345, 'fromId': '!2462abf84098', 'toId': '^all'}
INFO:root:Test succeeded (2 successes so far)
INFO:root:Sending test packet from 2885173400 to 4294967295
Ignoring sending interface
From /dev/ttyUSB1: {'from': 2885173400, 'to': 4294967295, 'decoded': {'data': {'portnum': 'TEXT_MESSAGE_APP', 'payload': b'Test 3', 'text': 'Test 3'}}, 'id': 2650926551, 'rxSnr': 9.25, 'rxTime': 1612942350, 'fromId': '!2462abf84098', 'toId': '^all'}
INFO:root:Test succeeded (3 successes so far)
INFO:root:Sending test packet from 2885173400 to 4294967295
Ignoring sending interface
From /dev/ttyUSB1: {'from': 2885173400, 'to': 4294967295, 'decoded': {'data': {'portnum': 'TEXT_MESSAGE_APP', 'payload': b'Test 4', 'text': 'Test 4'}}, 'id': 2650926552, 'rxSnr': 9.25, 'rxTime': 1612942356, 'fromId': '!2462abf84098', 'toId': '^all'}
INFO:root:Test succeeded (4 successes so far)
INFO:root:Sending test packet from 2885173400 to 4294967295
Ignoring sending interface
From /dev/ttyUSB1: {'from': 2885173400, 'to': 4294967295, 'decoded': {'data': {'portnum': 'TEXT_MESSAGE_APP', 'payload': b'Test 5', 'text': 'Test 5'}}, 'id': 2650926553, 'rxSnr': 8.75, 'rxTime': 1612942362, 'fromId': '!2462abf84098', 'toId': '^all'}
INFO:root:Test succeeded (5 successes so far)
INFO:root:Sending test packet from 2885173400 to 4294967295
Ignoring sending interface
From /dev/ttyUSB1: {'from': 2885173400, 'to': 4294967295, 'decoded': {'data': {'portnum': 'TEXT_MESSAGE_APP', 'payload': b'Test 6', 'text': 'Test 6'}}, 'id': 2650926554, 'rxSnr': 10.25, 'rxTime': 1612942368, 'fromId': '!2462abf84098', 'toId': '^all'}
INFO:root:Test succeeded (6 successes so far)
INFO:root:Sending test packet from 2885173400 to 4294967295
Ignoring sending interface
...
geeksville commented 3 years ago

based on report from @vfurman-gh changed test config to the following for both nodes:

bin/run.sh --port /dev/ttyUSB1 --set position_broadcast_secs 30 --set ls_secs 15

Still no failures with "meshtastic --test"

@vfurman-gh - how many nodes in your mesh? I'll test with android next (using your tool)

geeksville commented 3 years ago

@vfurman-gh ooh thanks! that helped a lot - can repro easily with your android based test

geeksville commented 3 years ago

ooh I see! yes - with 30 secs between each ping and those position broadcasts I think the acks are getting starved from delivery to the API client device. I'll change the python test to also test this case and then debug/fix the underling device bug.

Also interesting "--set position_broadcast_secs 0" (using the default interval) causes the problem to not be visible. Though I can see that the retransmission delays for reliable messages are way too long...

Also: I bet the fix will involve changing the outgoing message queue in the router to be a priority queue, and add 'priority' to messages. With ack/naks having the highest possible priority and background position broadcasts the lowest. Also the position broadcaster should discard any old (unsent) positions when it tries to queue a new position.

geeksville commented 3 years ago

Screenshot_20210211-171101.png

geeksville commented 3 years ago

@vfurman-gh thanks for this report. Now fixed I think. Can you confirm when you get the release I'm doing later today?

geeksville commented 3 years ago

This issue has been mentioned on Meshtastic. There might be relevant details there:

https://meshtastic.discourse.group/t/android-and-device-builds-1-1-46-now-up-for-alpha-testing/2487/1

vfurman-gh commented 3 years ago

Trying it now with 1.1.46. Unfortunately, doesn't seem the issue is solved...

On Thu, Feb 11, 2021 at 5:14 PM Kevin Hester notifications@github.com wrote:

@vfurman-gh https://github.com/vfurman-gh thanks for this report. Now fixed I think. Can you confirm when you get the release I'm doing later today?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/meshtastic/Meshtastic-device/issues/681#issuecomment-777902663, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASTB466UKZN6EM4BKEPB4I3S6R6I7ANCNFSM4XKUDZIA .

vfurman-gh commented 3 years ago

BTW, I use the slowest channel settings.

On Thu, Feb 11, 2021 at 7:02 PM Vadim Furman vfurman@gmail.com wrote:

Trying it now with 1.1.46. Unfortunately, doesn't seem the issue is solved...

On Thu, Feb 11, 2021 at 5:14 PM Kevin Hester notifications@github.com wrote:

@vfurman-gh https://github.com/vfurman-gh thanks for this report. Now fixed I think. Can you confirm when you get the release I'm doing later today?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/meshtastic/Meshtastic-device/issues/681#issuecomment-777902663, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASTB466UKZN6EM4BKEPB4I3S6R6I7ANCNFSM4XKUDZIA .

vfurman-gh commented 3 years ago

And here are the logs from both devices, in case you need them

On Thu, Feb 11, 2021 at 7:04 PM Vadim Furman vfurman@gmail.com wrote:

BTW, I use the slowest channel settings.

On Thu, Feb 11, 2021 at 7:02 PM Vadim Furman vfurman@gmail.com wrote:

Trying it now with 1.1.46. Unfortunately, doesn't seem the issue is solved...

On Thu, Feb 11, 2021 at 5:14 PM Kevin Hester notifications@github.com wrote:

@vfurman-gh https://github.com/vfurman-gh thanks for this report. Now fixed I think. Can you confirm when you get the release I'm doing later today?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/meshtastic/Meshtastic-device/issues/681#issuecomment-777902663, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASTB466UKZN6EM4BKEPB4I3S6R6I7ANCNFSM4XKUDZIA .

geeksville commented 3 years ago

@vfurman-gh interesting. how many nodes in your mesh? All with the "--set position_broadcast_secs 30 --set ls_secs 15" setting?

I just did a long test with your ping client and had 100% delivery. Though I was only using a two node mesh, prior to this change I would have had many non acked packets.

vfurman-gh commented 3 years ago

I have two nodes.

On Thu, Feb 11, 2021, 10:04 PM Kevin Hester notifications@github.com wrote:

@vfurman-gh https://github.com/vfurman-gh interesting. how many nodes in your mesh? All with the "--set position_broadcast_secs 30 --set ls_secs 15" setting?

I just did a long test with your ping client and had 100% delivery. Though I was only using a two node mesh, prior to this change I would have had many non acked packets.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/meshtastic/Meshtastic-device/issues/681#issuecomment-777996134, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASTB466TCA2TOKUJ6SLFTH3S6TAHVANCNFSM4XKUDZIA .

vfurman-gh commented 3 years ago

And yes, same setting for all nodes. Anything to glean from the device logs?

On Thu, Feb 11, 2021, 10:05 PM Vadim Furman vfurman@gmail.com wrote:

I have two nodes.

On Thu, Feb 11, 2021, 10:04 PM Kevin Hester notifications@github.com wrote:

@vfurman-gh https://github.com/vfurman-gh interesting. how many nodes in your mesh? All with the "--set position_broadcast_secs 30 --set ls_secs 15" setting?

I just did a long test with your ping client and had 100% delivery. Though I was only using a two node mesh, prior to this change I would have had many non acked packets.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/meshtastic/Meshtastic-device/issues/681#issuecomment-777996134, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASTB466TCA2TOKUJ6SLFTH3S6TAHVANCNFSM4XKUDZIA .

geeksville commented 3 years ago

oops - I think the logs are missing (did you intend to attach to the bug)?

one more question: what model of radios?

vfurman-gh commented 3 years ago

Both radios are LILYGO® TTGO T-Beam V1.1 SX1262. Just to make sure, you are firing tests from both phones? Looks like github e-mail interface doesn't pass attachments through. Attaching them from web interface: log2.txt log1.txt

Do we do ALOHA for access control? We are probably operating at around 60% utilization with this test, I would not be surprised if aloha starts to fall apart...

geeksville commented 3 years ago

oh that helps! I bet running the ping on each end will help me stress things even further and see the same problem. In the next couple of days I'll update the python test so it can do this (so I can then make it nice and repeatable as an integration test).

re: aloha no - the implementation is basically just naive flood routing with (optional) acks and DSR (not used for broadcasts). Though I did use aloha a long time ago when I developed a (pre wifi) wireless protocol at Apple.

thebentern commented 2 years ago

Closing this. Should no longer be an issue with the channel tx utilization based throttling.