meshtastic / firmware

Meshtastic device firmware
https://meshtastic.org
GNU General Public License v3.0
3.52k stars 869 forks source link

[Bug]: Firmware 2.5 Alpha - DM messages are unreliable #4712

Closed thatSFguy closed 1 month ago

thatSFguy commented 1 month ago

Category

Other

Hardware

T-Lora v2 1.6

Firmware Version

2.5.0.9ac0e26

Description

I know this is only Alpha, but based on the release video it seems like the right place to go with DM's. I've updated 10 nodes to 2.5 except for a couple that I can't update.

In this bug report/case I have 5 nodes, same model as listed in the bug report, all sitting next to each other to eliminate any RF issues. I can send a message out on the default LongFast channel and all of them get message. The DM's are now hit or miss. Sometimes they go trough with the "Person Icon + Check mark" and other times I just get the "Cloud and Check mark". It's less reliable than it was with the private channel and a DM at this point.

Again, I think it's the way to go from a visionary perspective. Unfortunately, I don't have a ton of time to devote to troubleshoot it right now, but wanted to provide some feedback on this.

Let me know if I can do anything to help.

5 nodes

Relevant log output

No response

jp-bennett commented 1 month ago

Grab a serial log from the receiving node when a message fails to go through. See if you get an error message that explains it.

thatSFguy commented 1 month ago

I'll try to get logs for both sides tomorrow evening. Thanks for your dedication to this, Jonathan!

thatSFguy commented 1 month ago

O.k Here is the log from the receiving note. I verified via CLI that both nodes see each other with --nodes. The output is from "Blue Node" which is the receiving node from BMN4

INFO | 23:24:41 69 BLE disconnect

WARN | 23:24:43 71 [PositionModule] Channel utilization is >25 percent. Skipping this opportunity to send.

WARN | 23:24:48 76 [PositionModule] Channel utilization is >25 percent. Skipping this opportunity to send.

WARN | 23:24:53 81 [PositionModule] Channel utilization is >25 percent. Skipping this opportunity to send.

DEBUG | 23:24:53 81 [Power] Battery: usbPower=1, isCharging=1, batMv=4218, batPct=100

DEBUG | 23:24:57 85 [RadioIf] (bw=250, sf=11, cr=4/5) packet symLen=8 ms, payloadSize=43, time 559 ms

DEBUG | 23:24:57 85 [RadioIf] Lora RX (id=0x2071d65c fr=0x20 to=0xff, WantAck=0, HopLim=3 Ch=0x8 encrypted rxSNR=13.5 rxRSSI=-67 hopStart=3)

DEBUG | 23:24:57 85 [RadioIf] AirTime - Packet received : 559ms

DEBUG | 23:24:57 85 [Router] Add packet record (id=0x2071d65c fr=0x20 to=0xff, WantAck=0, HopLim=3 Ch=0x8 encrypted rxSNR=13.5 rxRSSI=-67 hopStart=3)

DEBUG | 23:24:57 85 [Router] Using channel 0 (hash 0x8)

DEBUG | 23:24:57 85 [Router] Expanding short PSK #1

DEBUG | 23:24:57 85 [Router] Using AES128 key!

DEBUG | 23:24:57 85 [Router] decoded message (id=0x2071d65c fr=0x20 to=0xff, WantAck=0, HopLim=3 Ch=0x0 Portnum=3 rxtime=1726529097 rxSNR=13.5 rxRSSI=-67 hopStart=3)

DEBUG | 23:24:57 85 [Router] handleReceived(REMOTE) (id=0x2071d65c fr=0x20 to=0xff, WantAck=0, HopLim=3 Ch=0x0 Portnum=3 rxtime=1726529097 rxSNR=13.5 rxRSSI=-67 hopStart=3)

DEBUG | 23:24:57 85 [Router] Module 'position' wantsPacket=1

INFO | 23:24:57 85 [Router] Received position from=0x1c169820, id=0x2071d65c, portnum=3, payloadlen=21

DEBUG | 23:24:57 85 [Router] POSITION node=1c169820 l=21 lat=431751168 lon=-856424448 msl=260 hae=0 geo=0 pdop=0 hdop=0 vdop=0 siv=0 fxq=0 fxt=0 pts=0 time=0

INFO | 23:24:57 85 [Router] updatePosition REMOTE node=0x1c169820 time=0 lat=431751168 lon=-856424448

DEBUG | 23:24:57 85 [Router] Node status update: 7 online, 10 total

DEBUG | 23:24:57 85 [Router] Module 'position' considered

DEBUG | 23:24:57 85 [Router] Module 'routing' wantsPacket=1

INFO | 23:24:57 85 [Router] Received routing from=0x1c169820, id=0x2071d65c, portnum=3, payloadlen=21

DEBUG | 23:24:57 85 [Router] Routing sniffing (id=0x2071d65c fr=0x20 to=0xff, WantAck=0, HopLim=3 Ch=0x0 Portnum=3 rxtime=1726529097 rxSNR=13.5 rxRSSI=-67 hopStart=3)

INFO | 23:24:57 85 [Router] Rebroadcasting received floodmsg to neighbors

DEBUG | 23:24:57 85 [Router] Expanding short PSK #1

DEBUG | 23:24:57 85 [Router] Using AES128 key!

DEBUG | 23:24:57 85 [Router] enqueuing for send (id=0x2071d65c fr=0x20 to=0xff, WantAck=0, HopLim=2 Ch=0x8 encrypted rxtime=1726529097 rxSNR=13.5 rxRSSI=-67 hopStart=3 priority=64)

DEBUG | 23:24:57 85 [Router] txGood=3,rxGood=19,rxBad=0

DEBUG | 23:24:57 85 [Router] rx_snr found. hop_limit:2 rx_snr:13.500000

DEBUG | 23:24:57 85 [Router] rx_snr found in packet. Setting tx delay:1386

DEBUG | 23:24:57 85 [Router] Delivering rx packet (id=0x2071d65c fr=0x20 to=0xff, WantAck=0, HopLim=3 Ch=0x0 Portnum=3 rxtime=1726529097 rxSNR=13.5 rxRSSI=-67 hopStart=3)

DEBUG | 23:24:57 85 [Router] Update DB node 0x1c169820, rx_time=1726529097

DEBUG | 23:24:57 85 [Router] Forwarding to phone (id=0x2071d65c fr=0x20 to=0xff, WantAck=0, HopLim=3 Ch=0x0 Portnum=3 rxtime=1726529097 rxSNR=13.5 rxRSSI=-67 hopStart=3)

DEBUG | 23:24:57 85 [Router] Module 'routing' considered

INFO | 23:24:57 85 Lost phone connection

DEBUG | 23:24:57 85 (Client not yet interested in packets)

WARN | 23:24:58 86 [PositionModule] Channel utilization is >25 percent. Skipping this opportunity to send.

DEBUG | 23:24:58 86 [RadioIf] (bw=250, sf=11, cr=4/5) packet symLen=8 ms, payloadSize=41, time 542 ms

DEBUG | 23:24:58 86 [RadioIf] Lora RX (id=0x2071d65c fr=0x20 to=0xff, WantAck=0, HopLim=2 Ch=0x8 encrypted rxSNR=12.25 rxRSSI=-60 hopStart=3)

DEBUG | 23:24:58 86 [RadioIf] AirTime - Packet received : 542ms

DEBUG | 23:24:58 86 [Router] Found existing packet record for fr=0x1c169820,to=0xffffffff,id=0x2071d65c

DEBUG | 23:24:58 86 [Router] Found existing packet record for fr=0x1c169820,to=0xffffffff,id=0x2071d65c

DEBUG | 23:24:58 86 [Router] Add packet record (id=0x2071d65c fr=0x20 to=0xff, WantAck=0, HopLim=2 Ch=0x8 encrypted rxSNR=12.25 rxRSSI=-60 hopStart=3)

DEBUG | 23:24:58 86 [Router] Ignoring incoming msg we've already seen (id=0x2071d65c fr=0x20 to=0xff, WantAck=0, HopLim=2 Ch=0x8 encrypted rxSNR=12.25 rxRSSI=-60 hopStart=3)

DEBUG | 23:24:58 86 [Router] cancelSending id=0x2071d65c, removed=1

DEBUG | 23:24:58 86 [Router] Incoming message was filtered from 0x1c169820

WARN | 23:25:03 91 [PositionModule] Channel utilization is >25 percent. Skipping this opportunity to send.

INFO | 23:25:08 96 [PositionModule] Sending pos@0:6 to mesh (wantReplies=1)

DEBUG | 23:25:08 96 [PositionModule] Skipping location send because precision is set to 0!

DEBUG | 23:25:08 96 [PositionModule] allocReply returned a nullptr

DEBUG | 23:25:09 96 [RadioIf] (bw=250, sf=11, cr=4/5) packet symLen=8 ms, payloadSize=45, time 567 ms

DEBUG | 23:25:09 96 [RadioIf] Lora RX (id=0xaf890800 fr=0xd8 to=0xa4, WantAck=1, HopLim=3 Ch=0x0 encrypted rxSNR=11 rxRSSI=-34 hopStart=3)

DEBUG | 23:25:09 96 [RadioIf] AirTime - Packet received : 567ms

DEBUG | 23:25:09 96 [Router] Add packet record (id=0xaf890800 fr=0xd8 to=0xa4, WantAck=1, HopLim=3 Ch=0x0 encrypted rxSNR=11 rxRSSI=-34 hopStart=3)

DEBUG | 23:25:09 96 [Router] Attempting PKI decryption

INFO | 23:25:09 96 [Router] Random nonce value: 553475508

DEBUG | 23:25:09 97 [Router] DH Output: : 7c 6f 73 99 5d c0 0d 4a b8 dc b1 3a 20 b6 85 26 09 1f d4 ed 7a fd b5 d3 32 22 04 c4 5d 26 d8 55 

DEBUG | 23:25:09 97 [Router] Attempting decrypt using nonce: : 00 08 89 af b4 5d fd 20 d8 bf 2c 1c 00 

DEBUG | 23:25:09 97 [Router] Attempting decrypt using shared_key: : 6a 0f e7 ab 6a 9d 01 bd 77 16 a0 02 59 c4 00 53 91 b9 87 48 35 9d 82 5b ed 39 12 8f ac 38 46 08 

WARN | 23:25:09 97 [Router] No suitable channel found for decoding, hash was 0x0!

DEBUG | 23:25:09 97 [Router] packet decoding failed or skipped (no PSK?) (id=0xaf890800 fr=0xd8 to=0xa4, WantAck=1, HopLim=3 Ch=0x0 encrypted rxtime=1726529109 rxSNR=11 rxRSSI=-34 hopStar

DEBUG | 23:25:09 97 [Router] Module 'routing' wantsPacket=1

INFO | 23:25:09 97 [Router] Received routing from=0x1c2cbfd8, id=0xaf890800, portnum=1886715933, payloadlen=46323

DEBUG | 23:25:09 97 [Router] Routing sniffing (id=0xaf890800 fr=0xd8 to=0xa4, WantAck=1, HopLim=3 Ch=0x0 encrypted rxtime=1726529109 rxSNR=11 rxRSSI=-34 hopStart=3)

DEBUG | 23:25:09 97 [Router] Partially randomized packet id 2349033193

WARN | 23:25:09 97 [Router] Alloc an err=6,to=0x1c2cbfd8,idFrom=0xaf890800,id=0x8c0366e9

DEBUG | 23:25:09 97 [Router] localSend to channel 0

DEBUG | 23:25:09 97 [Router] Add packet record (id=0x8c0366e9 fr=0xa4 to=0xd8, WantAck=0, HopLim=2 Ch=0x0 Portnum=5 requestId=af890800 rxtime=1726529109 priority=120)

DEBUG | 23:25:09 97 [Router] Expanding short PSK #1

DEBUG | 23:25:09 97 [Router] Using AES128 key!

DEBUG | 23:25:09 97 [Router] enqueuing for send (id=0x8c0366e9 fr=0xa4 to=0xd8, WantAck=0, HopLim=2 Ch=0x8 encrypted rxtime=1726529109 hopStart=2 priority=120)

DEBUG | 23:25:09 97 [Router] txGood=3,rxGood=21,rxBad=0

DEBUG | 23:25:09 97 [Router] Delivering rx packet (id=0xaf890800 fr=0xd8 to=0xa4, WantAck=1, HopLim=3 Ch=0x0 encrypted rxtime=1726529109 rxSNR=11 rxRSSI=-34 hopStart=3)

DEBUG | 23:25:09 97 [Router] Forwarding to phone (id=0xaf890800 fr=0xd8 to=0xa4, WantAck=1, HopLim=3 Ch=0x0 encrypted rxtime=1726529109 rxSNR=11 rxRSSI=-34 hopStart=3)

DEBUG | 23:25:09 97 [Router] Attempting PKI decryption

INFO | 23:25:09 97 [Router] Random nonce value: 553475508

DEBUG | 23:25:09 97 [Router] DH Output: : 7c 6f 73 99 5d c0 0d 4a b8 dc b1 3a 20 b6 85 26 09 1f d4 ed 7a fd b5 d3 32 22 04 c4 5d 26 d8 55 

DEBUG | 23:25:09 97 [Router] Attempting decrypt using nonce: : 00 08 89 af b4 5d fd 20 d8 bf 2c 1c 00 

DEBUG | 23:25:09 97 [Router] Attempting decrypt using shared_key: : 6a 0f e7 ab 6a 9d 01 bd 77 16 a0 02 59 c4 00 53 91 b9 87 48 35 9d 82 5b ed 39 12 8f ac 38 46 08 

WARN | 23:25:09 97 [Router] No suitable channel found for decoding, hash was 0x0!

DEBUG | 23:25:09 97 [Router] Module 'routing' considered

WARN | 23:25:09 97 [RadioIf] Can not send yet, busyRx

WARN | 23:25:09 97 [RadioIf] Can not send yet, busyRx

WARN | 23:25:09 97 [RadioIf] Can not send yet, busyRx

WARN | 23:25:09 97 [RadioIf] Can not send yet, busyRx

DEBUG | 23:25:09 97 [RadioIf] (bw=250, sf=11, cr=4/5) packet symLen=8 ms, payloadSize=45, time 567 ms

DEBUG | 23:25:09 97 [RadioIf] Lora RX (id=0xaf890800 fr=0xd8 to=0xa4, WantAck=1, HopLim=2 Ch=0x0 encrypted rxSNR=12 rxRSSI=-59 hopStart=3)

DEBUG | 23:25:09 97 [RadioIf] AirTime - Packet received : 567ms

DEBUG | 23:25:09 97 [Router] Found existing packet record for fr=0x1c2cbfd8,to=0x25c0dba4,id=0xaf890800

DEBUG | 23:25:09 97 [Router] Found existing packet record for fr=0x1c2cbfd8,to=0x25c0dba4,id=0xaf890800

DEBUG | 23:25:09 97 [Router] Add packet record (id=0xaf890800 fr=0xd8 to=0xa4, WantAck=1, HopLim=2 Ch=0x0 encrypted rxSNR=12 rxRSSI=-59 hopStart=3)

DEBUG | 23:25:09 97 [Router] Ignoring incoming msg we've already seen (id=0xaf890800 fr=0xd8 to=0xa4, WantAck=1, HopLim=2 Ch=0x0 encrypted rxSNR=12 rxRSSI=-59 hopStart=3)

DEBUG | 23:25:09 97 [Router] cancelSending id=0xaf890800, removed=0

DEBUG | 23:25:09 97 [Router] Incoming message was filtered from 0x1c2cbfd8

DEBUG | 23:25:10 98 [RadioIf] Starting low level send (id=0x8c0366e9 fr=0xa4 to=0xd8, WantAck=0, HopLim=2 Ch=0x8 encrypted rxtime=1726529109 hopStart=2 priority=120)

DEBUG | 23:25:10 98 [RadioIf] (bw=250, sf=11, cr=4/5) packet symLen=8 ms, payloadSize=29, time 452 ms

DEBUG | 23:25:10 98 [RadioIf] AirTime - Packet transmitted : 452ms

DEBUG | 23:25:10 98 [RadioIf] Completed sending (id=0x8c0366e9 fr=0xa4 to=0xd8, WantAck=0, HopLim=2 Ch=0x8 encrypted rxtime=1726529109 hopStart=2 priority=120)

DEBUG | 23:25:12 99 [RadioIf] (bw=250, sf=11, cr=4/5) packet symLen=8 ms, payloadSize=27, time 436 ms

DEBUG | 23:25:12 99 [RadioIf] Lora RX (id=0x8c0366e9 fr=0xa4 to=0xd8, WantAck=0, HopLim=1 Ch=0x8 encrypted rxSNR=12 rxRSSI=-59 hopStart=2)

DEBUG | 23:25:12 99 [RadioIf] AirTime - Packet received : 436ms

DEBUG | 23:25:12 100 [Router] Rx someone rebroadcasting for us (id=0x8c0366e9 fr=0xa4 to=0xd8, WantAck=0, HopLim=1 Ch=0x8 encrypted rxSNR=12 rxRSSI=-59 hopStart=2)

DEBUG | 23:25:12 100 [Router] didn't find pending packet

DEBUG | 23:25:12 100 [Router] Found existing packet record for fr=0x25c0dba4,to=0x1c2cbfd8,id=0x8c0366e9

DEBUG | 23:25:12 100 [Router] Found existing packet record for fr=0x25c0dba4,to=0x1c2cbfd8,id=0x8c0366e9

DEBUG | 23:25:12 100 [Router] Add packet record (id=0x8c0366e9 fr=0xa4 to=0xd8, WantAck=0, HopLim=1 Ch=0x8 encrypted rxSNR=12 rxRSSI=-59 hopStart=2)

DEBUG | 23:25:12 100 [Router] Ignoring incoming msg we've already seen (id=0x8c0366e9 fr=0xa4 to=0xd8, WantAck=0, HopLim=1 Ch=0x8 encrypted rxSNR=12 rxRSSI=-59 hopStart=2)

DEBUG | 23:25:12 100 [Router] cancelSending id=0x8c0366e9, removed=0

DEBUG | 23:25:12 100 [Router] Incoming message was filtered from 0x25c0dba4

DEBUG | 23:25:13 101 [Power] Battery: usbPower=1, isCharging=1, batMv=4219, batPct=100
jp-bennett commented 1 month ago

@thatSFguy great. So it thinks if has a key pair, and just fails on the decrypt. I'm plotting to change the debug output a bit to make it more useful. For now, you can grab the same log from the sender, and see if the "Attempting decrypt using shared_key" line matches. I suspect it will not, but it might give us a clue. I'll tag you on a PR once I get the logging to be more helpful.

thatSFguy commented 1 month ago

I'll make some notes here:

  1. CLI outputs the current list of the nodes, but android 2.4.5 isn't showing all the nodes. I had to disconnect from bluetooth and reconnect for the node to appear.
  2. The nodeDB was showing the old generic "Meshtastic XXXX" default name, despite the node being renamed a couple of days ago.
  3. I reset the nodeDB again and disconnected bluetooth and reconnected with Android 2.4.5 and the messages went through
  4. While experimenting, I fired up 3 other nodes, but the nodeDB showed the Hardware as "UNSET". Not sure if the HW info has anything to do with the decryption. but I'll see if I can get another working.
  5. The output from Step 4 is that I noticed that I had a private channel "Friends" on 0 and "long fast" on 1. I reset the channel to match the sending node and once the nodes saw each other I was able to send DM's. Are DM's only supposed to work if the telemetry channel is the same on both nodes?
thatSFguy commented 1 month ago

It looks like a full erase via the flasher does not reset the channels. The other node also had the private channel "Friends" on 0 and "long fast" on 1. Again, unable to send DM's. I reset the channels to make sure the "long fast" was on 0 and the DM's started working!

jp-bennett commented 1 month ago

5. The output from Step 4 is that I noticed that I had a private channel "Friends" on 0 and "long fast" on 1. I reset the channel to match the sending node and once the nodes saw each other I was able to send DM's. Are DM's only supposed to work if the telemetry channel is the same on both nodes?

Unless you manually change the LoRa Frequency Slot, using a different primary channel will result in the radios transmitting on different frequencies. That may explain a lot of the problems you were seeing. Even if the slot was the same, your nodeinfo packets go out over the primary channel, and those need to get picked up for DMs to work, since nodeinfo carries the public keys.

thatSFguy commented 1 month ago
  1. The output from Step 4 is that I noticed that I had a private channel "Friends" on 0 and "long fast" on 1. I reset the channel to match the sending node and once the nodes saw each other I was able to send DM's. Are DM's only supposed to work if the telemetry channel is the same on both nodes?

Unless you manually change the LoRa Frequency Slot, using a different primary channel will result in the radios transmitting on different frequencies. That may explain a lot of the problems you were seeing. Even if the slot was the same, your nodeinfo packets go out over the primary channel, and those need to get picked up for DMs to work, since nodeinfo carries the public keys.

Copy that. I was using the QR code that was available on github to set 0 to private and 1 to Longfast, to prevent accidental sharing of locations. The weird thing is that I now have a node that LongFast on 0 and another node with private on 0 with LongFast on 1 and yet DM's work.
I'll probably just reset them all to LongFast on 0, but this node bucks the trend.

P.S. Again I'm really surprised that the "full erase" didn't reset the channel settings.

thatSFguy commented 1 month ago

Closing this. Just a couple of notes for the ones thst come after me:

  1. As recommended, if you have any issues, reset the nodeDB. Even if the node appears you may have to wait 15 min before it starts working. Patience is key!
  2. Not sure if all the nodes need to be on the same primary channel, but that seems to be the most reliable.

Thank you @jp-bennett for all your help! The new DM method eliminates a lot of unnecessary channels!