helium / router

router combines a LoRaWAN Network Server with an API for console, and provides a proxy to the Helium blockchain
Apache License 2.0
70 stars 31 forks source link

Device repeating every uplink #517

Closed buzzware closed 2 years ago

buzzware commented 2 years ago

I'm in Perth Australia on AU915 with only a Sensecap in range, and using a 3rd party sensor I can't configure or modify. It works fine on TTS, but on Helium in the console, almost every uplink appears twice with the same count, and an acknowledgment each time. Sometimes its 3 times, and rarely once. Its the same device as in #463. I suspect the ack isn't getting there in time, and so it sends again, but I don't know why it doesn't try again and again (perhaps just a configured limit). Please see attached the json and a csv I generate from the json.

I believe the RXDelay is 1 and so the delta_ms next to each downlink should be well under 1 second. It is regularly over 800ms, and the Australia -> US latency is typically 200ms, so I am wondering whether the delay shown in the console is the send time, and so we should add the latency onto it. The problem with this theory is that it still repeats when the delta_ms is around 600ms. event-debug (4).json.zip event-debug (4).csv

buzzware commented 2 years ago

I tried setting up a Pycom LoPy4 to reproduce the problem, but wasn't able to - it works fine. A key difference is in the join response :

pynode Join accept :

https://lorawan-packet-decoder-0ta6puiniaut.runkit.sh/?data=IP%2FjxMsUj8HSEo4UuLa8MSs%3D&nwkskey=&appskey= DLSettings.RX1DRoffset = 1 DLSettings.RX2DataRate = 4 RxDelay.Del = 8

3rd party device Join accept :

https://lorawan-packet-decoder-0ta6puiniaut.runkit.sh/?data=INWjXSAFzy8kQK7tMX7RXtw%3D&nwkskey=&appskey=

DLSettings.RX1DRoffset = 6 DLSettings.RX2DataRate = 13 RxDelay.Del = 1

buzzware commented 2 years ago

debugging from my pycom node

event-debug-pynode-3.json.zip event-debug-pynode-3.csv

buzzware commented 2 years ago

TTSV3 Join Accept for 3rd party device :

DLSettings.RX1DRoffset = 4 DLSettings.RX2DataRate = 13 RxDelay.Del = 11

Perhaps the RXDelay should be increased on Helium to give devices more chance to succeed? TTS has a lot more parameters for devices. Perhaps it would be good for Helium to be more customisable.

leroyle commented 2 years ago

Interested bystander 2 cents worth. I took a look at the first event-debug(4).cvs file:

I see a restart at line 938 at line 1085 we see the first #19 uplink ( a second follows the downlink ) at line 1086 we see downlink #36 which contains a MAC command, looks like LINK-ADR-REQ. Normally that downlink MAC command would be sent when the network has seen 20 uplinks. The FCNT counts are 0 based, so this would be the 20th uplink the network has seen.

So, the network is sending the LINK-ADR-REQ at FCNT == 20 From then on we see more MAC commands being sent with the downlinks.

Question: When the network sends a LINK-ADR-REQ is it in response to the 20th uplink it sees or a response to the FCNT==20 that it receives from the device?

If the network responds due to FCNT==20, then it seems the device is not seeing the downlinks per the OP's comments. That could be due to device issues or Helium router/hot spot issues.

The flow is device -> hot spot -> router -> console console -> router -> hot spot -> device

Decoding the MAC commands is a pain. They were added recently so perhaps they are not yet dumped with the .cvs data, but should be. If you can capture these in the console we can easily decode what the MAC commands are.

Again, interested bystander, not a worker bee.

buzzware commented 2 years ago

Thanks @leroyle The CSV should contain everything from the json (also attached). I did raise an issue for the full raw packet binary to be included in json, not just the payload.

buzzware commented 2 years ago

There are several downlink related issues: #437, #497, #510, #491. I wonder if our downlink issues are due to a too-short RxDelay.Del as above, perhaps made worse for non-US users by the internet latency to US based servers (typically 200ms here in Australia). US based users may not see this problem.

Why does my third party device on Helium get given a 1 second RXDelay ? (see decoding above) My theory is that is not enough time for the packet to arrive on the gateway and be scheduled for delivery at the precise time (+/- 20 ms). TTS gives the same device a value of 11, and Helium gives my pynode a value of 8.

image

"The delay RxDelay follows the same convention as the Delay field in the RXTimingSetupReq command." image

Latency is critical for LoRaWAN servers. TTS has regional servers, including one in Australia. I would like to see Helium simply host regional servers (also for data sovereignty reasons) or incentivise them like validators.

leroyle commented 2 years ago

For what it's worth I'm seeing the same repeated FCNT uplinks when I send "CONFIRMED" uplinks. US915, WisBlock, Helium Original hot spot 30ft away, Dallas Tx USA

It looks like the Wisblock runtime is seeing the downlink which is being interpreted as a FRAME_TYPE_DATA_UNCONFIRMED_DOWN. I might have expected a FRAME_TYPE_DATA_CONFIRMED_DOWN type instead but then again I don't know, perhaps it has always been the unconfirmed type that has been returned. Just have not noticed in the past and do not have detailed enough older log files.

I see this behavior on both prod and stage. My Join Accept numbers just for giggles (prod):

DLSettings.RX1DRoffset = 4 DLSettings.RX2DataRate = 1 RxDelay.Del = 7

buzzware commented 2 years ago

@leroyle I think FRAME_TYPE_DATA_CONFIRMED_DOWN would cause a confirmation on the confirmation

buzzware commented 2 years ago

519 just appeared, which refers to #338, both of which make very interesting reading and may explain this issue.

The RxDelay seems to be still 1 and wasn't increased to 2 as suggested. I'm thinking this is extra-important for Australia due to distance, and slower national network. Note that Join Accepts use a fixed value of 5, and so they succeed.

I will look closer at hold_time now.

buzzware commented 2 years ago

Consider my csv file at the top (event-debug (4).csv) . Here is a snapshot with only uplinks and the relevent columns : image

Note :

  1. the zigzagging up and down of the hold_time in sync with the delta_ms
  2. this comment by @Oliv4945 says that a hold_time over 2000ms suggests downlinks won't be received, and the Helium RxDelay should be raised to 2s.
  3. 519 shows a correlation between the node transmit delta and the hold_time ie a 5 minute delta causes 2000ms hold_time & a much less delta cause a much more acceptable hold_time, which is what I'm seeing.

  4. for the first uplink of each pair, there is a delta (since the last uplink) of around 300,000 ms and the hold_time is always over 2000ms, and presumably the downlink is not received, so the node sends the second uplink.
  5. for the second of each pair, the hold_time is much less - between 500 and 800 and presumably the downlink is received because there is only occasionally a third uplink.
  6. however, for fcnt=212, there are 3 uplinks and 🥁 _the first and second holdtimes are around 2s but not the third.
  7. After the device is rejoined, uplink 0 has hold_time 818 and no repeat
  8. uplink 8 has no repeat and a hold_time of 809

My conclusions are :

  1. I don't understand what the delta has to do with the hold_time, but the relationship couldn't be clearer
  2. RxDelay should be raised from 1 to 2
  3. Attention needs to be paid to the production server latency especially as the load is ever increasing
  4. Perhaps Helium should setup regional Network Servers, including for Australia, but :
  5. If an acceptable latency is being achieved for every second downlink, why isn't it for every other ?
buzzware commented 2 years ago

A Theory: What we're seeing is a traffic prioritisation algorithm this gives preference to joins, the first uplink and repeat confirmed uplink attempts after that. This would ensure joins under heavy traffic, but doubles battery consumption and DC burn. It also looks like the server is always in a condition of heavy load. Does Helium inc just need to upgrade their servers? This won't be good enough for Senet and Actility customers.

freedadara commented 2 years ago

hi there, i think i found something intresting i was in the MOC and saw your problem could it be that it's related to this ? https://github.com/libp2p/go-libp2p/issues/1105

buzzware commented 2 years ago

@dcorper it isn't that one uplink is appearing twice; and by the way, my hotspot is only connected by Ethernet not WiFi too. However a Helium engineer has a theory it is a libp2p connection being reused or not that causes a quick or slow response - that makes sense to me. He also says it's not because of network distance - that just makes it worse to the point that it matters.

macpie commented 2 years ago

We will be working on this https://github.com/helium/router/issues/531 to give ability to user to increase RxDelay on the device.

buzzware commented 2 years ago

I will leave the device c31a7a5a-ce77-41d9-9644-6a6ec7229f55 running for testing this. Let me know here if you would like me to trigger a rejoin.

macpie commented 2 years ago

Fixed RX delay https://github.com/helium/router/pull/548

buzzware commented 2 years ago

I added one of these devices to the staging server to test this, and what I'm seeing with RxDelay=3 is much worse than before. So far I think this is the device not obeying the RxDelay setting in the join-accept, which seems correct. This is device https://staging-console.helium.wtf/devices/b4233140-f081-48e5-b643-954219f5ebf0 image

buzzware commented 2 years ago

I've just realised that all my decodings of JoinAccepts have been wrong, and this does it right https://github.com/buzzware/decrypt-join-accept

buzzware commented 2 years ago

@macpie @jdgemm Is it possible that the miner/gateway/packet forwarder need to change to match this new RxDelay feature? I'm using the staging console with my retail Heltec hotspot and seeing this many repeat uplinks problem just above.

macpie commented 2 years ago

@macpie @jdgemm Is it possible that the miner/gateway/packet forwarder need to change to match this new RxDelay feature? I'm using the staging console with my retail Heltec hotspot and seeing this many repeat uplinks problem just above.

Seems like we have an issue on the new rx delay feature, we are working on it

buzzware commented 2 years ago

@macpie @leroyle @jdgemm @lthiery At last this seems to be fixed, and my device is working as it should on the staging console. 1) I left it sending every 5 minutes with RxDelay=5 for days, and when I checked it just now it was not repeating uplinks, and on SF7 (the lowest DR) like it should be 2) when I caused it to rejoin, it did so exactly as I would expect without repeats. It is still in SF12, but I suppose it will adjust that after 20 uplinks

Big thanks to everyone involved :)

macpie commented 2 years ago

@macpie @leroyle @jdgemm @lthiery At last this seems to be fixed, and my device is working as it should on the staging console.

1. I left it sending every 5 minutes with RxDelay=5 for days, and when I checked it just now it was not repeating uplinks, and on SF7 (the lowest DR) like it should be

2. when I caused it to rejoin, it did so exactly as I would expect without repeats. It is still in SF12, but I suppose it will adjust that after 20 uplinks

Big thanks to everyone involved :)

Awesome to hear, thanks for the update