Review, document and debug Adaptive Data Rate (ADR) scheme

buzzware commented 2 years ago

The only description I can find of the Helium ADR scheme is by Leroyle in the Discord. He says it changes the data rate after 20 uplinks. The algorithm should be officially documented so users know what to expect, and also tested to ensure it is working, as it does not work for me. TTN changes my device to SF7 within a few uplinks. On Helium I see it joining at SF7 then uplinking at SF12. If it can join at SF7, surely it can uplink on SF7. Therefore, the algorithm should be reconsidered, as many users will watch the SF for a few uplinks and assume ADR is not functioning. The data provided in #517 and #463 show my device endlessly on SF12 even though it is in the same room as, or almost underneath, the roof antenna. Perhaps the lost downlinks and repeated uplinks are causing ADR to fail. I haven't provided a specific data set here because the data in the issues above should suffice if you can't see similar issues with your own devices. It seems that ADR just hasn't had much attention.

ADR is a critical feature because it controls battery consumption, and for my devices it makes a difference of 25 times. My third-party product uses a non-replaceable, non-rechargeable battery that should last over 10 years, but on Helium it would last much less than one year.
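As a rough sanity check on the "25 times" figure, the sketch below compares LoRa time on air at SF12 and SF7 using the airtime formula from the Semtech SX127x datasheet. The 23-byte PHY payload (10 application bytes plus 13 bytes of LoRaWAN overhead), the 125 kHz bandwidth, coding rate 4/5 and 8-symbol preamble are all assumptions for illustration, not values taken from the device in this issue. Energy per uplink is only approximately proportional to airtime, and only at a fixed transmit power.

```python
import math


def lora_time_on_air_ms(payload_bytes, sf, bw_hz=125_000, coding_rate=1,
                        preamble_symbols=8, explicit_header=True, crc=True):
    """Time on air in milliseconds (Semtech SX127x datasheet formula).

    coding_rate is 1..4, meaning 4/5..4/8.
    """
    t_sym_ms = (2 ** sf) / bw_hz * 1000.0
    # Low data rate optimisation applies when the symbol time exceeds 16 ms
    # (SF11 and SF12 at 125 kHz).
    low_dr_opt = 1 if t_sym_ms > 16.0 else 0
    implicit_header = 0 if explicit_header else 1
    numerator = (8 * payload_bytes - 4 * sf + 28
                 + (16 if crc else 0) - 20 * implicit_header)
    denominator = 4 * (sf - 2 * low_dr_opt)
    payload_symbols = 8 + max(
        math.ceil(numerator / denominator) * (coding_rate + 4), 0)
    return (preamble_symbols + 4.25 + payload_symbols) * t_sym_ms


# Assumed payload: 10 application bytes + 13 bytes of LoRaWAN framing.
PHY_PAYLOAD_BYTES = 23

toa_sf7 = lora_time_on_air_ms(PHY_PAYLOAD_BYTES, sf=7)
toa_sf12 = lora_time_on_air_ms(PHY_PAYLOAD_BYTES, sf=12)
ratio = toa_sf12 / toa_sf7

print(f"SF7: {toa_sf7:.1f} ms, SF12: {toa_sf12:.1f} ms, ratio: {ratio:.1f}x")
```

Under these assumptions the airtime ratio comes out at about 24, which is in line with the difference reported above.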

buzzware commented 2 years ago

@pvolodin I don't actually understand the CFList feature at all. The text near the switch isn't clear on what it does or why you would want it on or off. It also says it only applies to US915, and I'm not sure that is still true. Turning it off didn't make a difference that I noticed to the ADR or repeating uplinks/missed downlinks issues I've reported.

mikev commented 2 years ago

Will do, but I haven't had confirmation that the RxDelay feature was fully completed end to end and deployed? There was some outstanding issue. Sprint 68 seems to contain relevant issues unresolved

@buzzware - Sorry, my mistake. I just spoke with a colleague. The feature is 95% implemented, but getting the rx delay sent to the end-device involves some subtleties in the LoRaWAN protocol. I'll post to this issue when the full rx-delay feature is ready.

I don't actually understand the CFList feature at all.

Yes, it is confusing. Here is the current documentation:

Enable Join-Accept CF List (applicable to US915 devices only) The Join-Accept CF List configures channels according to the LoRaWAN spec to use sub-band 2. Devices that have not correctly implemented the LoRaWAN spec may experience transfer issues when this setting is enabled.

Enabled, the server will send a CF List with every other join.

Disabled, the server will not send a CF List. The channel mask is still transmitted via ADR command.

This feature ONLY applies to US915 regions. Until very recently Helium had mistakenly reversed the ChMask bytes in the Join response, so it's possible this was a work-around for the original ChMask issue. CFList is likely not relevant in this specific case. In any case, since the end-device and gateway are in Australia, this feature is not applicable and will have no effect.

buzzware commented 2 years ago

Will do, but I haven't had confirmation that the RxDelay feature was fully completed end to end and deployed? There was some outstanding issue. Sprint 68 seems to contain relevant issues unresolved

@buzzware - Sorry, my mistake. I just spoke with a colleague. The feature is 95% implemented, but getting the rx delay sent to the end-device involves some subtleties in the LoRaWAN protocol. I'll post to this issue when the full rx-delay feature is ready.

@mikev thanks for the clarification. I'm expecting/hoping an immediate improvement when that is completed, maybe even to this issue.

I don't actually understand the CFList feature at all.

Yes, it is confusing. Here is the current documentation:

Enable Join-Accept CF List (applicable to US915 devices only) The Join-Accept CF List configures channels according to the LoRaWAN spec to use sub-band 2.

TTN uses sub band 2 for AU915. If the switch doesn't apply to AU915, is Helium AU915 on sub band 2 or not?

Could this feature be better described as Sub-band 2 on/off ?

Devices that have not correctly implemented the LoRaWAN spec may experience transfer issues when this setting is enabled.

Does this mean not correctly implemented the LoRaWAN sub-band 2 spec? Of course "Devices that have not correctly implemented the LoRaWAN spec may experience transfer issues"

Enabled, the server will send a CF List with every other join.

"with every other join" Really? Why not every join? This is bound to confuse people, and I can't see a good reason for it. I've always assumed every join was the same. If it sends the CFList with the join, does it still need to do the double LinkAdrReq when changing DR ?

Disabled, the server will not send a CF List. The channel mask is still transmitted via ADR command.

Perhaps "Disabled, the server will not send a CF List in the join-accept" would be better.

mikev commented 2 years ago

@buzzware - Any discussion of the CF List profile config is not relevant to Europe or Australia. For all regions except the U.S., the CF List is implemented exactly according to the LoRaWAN spec. We already have an open issue on this topic. Let's move any discussions of the CFList to this issue - https://github.com/helium/router/issues/556

mikev commented 2 years ago

@buzzware - The comparison traces with TTN are very helpful. Thanks again for posting these comparison logs. I believe this will lead to useful insights which can save everyone a lot of time.

I noticed that TTN's LinkAdrReq always set the TXPower index to 0 (or one time it was 1). That is an interesting difference to explore. Helium may be causing the end-device to lower the power too much, or maybe TTN is not setting it correctly. If power is lowered too much we will not see these messages in the logs.

Let's keep an eye on the multiple MAC commands in a single FOpts. The spec allows this but it certainly could lead to coding errors within your end-device. Can you find out which LoRaWAN library your end-device uses?

The most likely root cause is still the default rx delay of 1 second. I'm very much looking forward to testing with an rx delay of at least 5 seconds.
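To illustrate why a 1 second rx delay could matter, here is a minimal sketch of the Class A receive-window timing: RX1 opens rx delay seconds after the end of the uplink, and RX2 opens one second after RX1. The function name and the 2.5 second latency figure are hypothetical and are not measurements from this issue; the sketch also ignores downlink airtime and scheduling lead time.

```python
def reachable_windows(network_latency_s, rx_delay_s=1.0):
    """Return which Class A receive windows a downlink can still hit.

    network_latency_s is the assumed time from the end of the uplink until
    the gateway is ready to transmit the downlink.
    """
    rx1_opens_s = rx_delay_s          # RX1 opens rx_delay after uplink end
    rx2_opens_s = rx_delay_s + 1.0    # RX2 opens one second after RX1
    windows = []
    if network_latency_s <= rx1_opens_s:
        windows.append("RX1")
    if network_latency_s <= rx2_opens_s:
        windows.append("RX2")
    return windows


# Hypothetical end-to-end latency of 2.5 s:
print(reachable_windows(2.5, rx_delay_s=1.0))  # both windows already missed
print(reachable_windows(2.5, rx_delay_s=5.0))  # both windows still reachable
```

If the downlink misses both windows, the device never sees the acknowledgement or the LinkAdrReq, which would be consistent with repeated uplinks and a data rate that never changes.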

mikev commented 2 years ago

The rx delay feature is coming soon and is deployed on Stage now. ETA is Monday, Jan 31st. https://github.com/helium/router/pull/582#issuecomment-1024457518

mikev commented 2 years ago

The rx delay feature is now deployed to Production. https://github.com/helium/router/issues/597

@buzzware - Can you retest ADR with your end-device? Whether positive or negative, can you post trace logs of the packets?

buzzware commented 2 years ago

@mikev Will do, in the next day or so

buzzware commented 2 years ago

@mikev the repeat uplinks are gone with RxDelay=5 at last. The ADR may be working too, as my long running session was running on SF7 like it should. I have just rejoined and will keep an eye on it and update here. For now, here is the json & csv event-debug-202202182359.csv event-debug-202202182359.json.zip

mikev commented 2 years ago

@buzzware - This is great news! Also thanks for posting your logs and thanks for the thorough investigation. ADR is complex and without your persistence we would not have gotten to the root cause. Also I'm happy we increased unit test coverage by 10X.

OK, this issue is now officially closed. Please re-open if you notice any additional ADR issues.

buzzware commented 2 years ago

After rejoining the ADR did act around uplink 35 and the SF was reduced from 12 to 7, so that's great. 35 uplinks seems like a lot (there was talk of 20) but I'm just glad to see the device and network operating as they should. event-debug-202202190946.json.zip event-debug-202202190946.csv

mikev commented 2 years ago

Thanks for the update. I'll take a closer look at your logs. The ADR history is limited to 20 items. What can happen is that the ADR algorithm state machine checks the current DR and TXIndex and will only trigger an ADR update IF a change is required. So most likely ADR didn't have enough data to decide to reduce DR from 12 to 7 until packet 35. In any case, as long as the entire packet sequence is there I can double-check how the ADR is supposed to behave given your exact sequence.
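For readers trying to picture the behaviour described above, here is a minimal sketch of a history-based ADR decision. It is not the Helium router implementation. It follows the general shape of the commonly published Semtech-style algorithm: keep the last 20 SNR readings, compute a margin, and only emit a change when the target DR or TXPower index differs from the current one. The 20-item history comes from the comment above; the 10 dB installation margin, the SNR table, the maximum TX index and the class name are assumptions.

```python
from collections import deque

HISTORY_LEN = 20        # history size mentioned in the comment above
MARGIN_DB = 10.0        # assumed installation margin
MAX_TX_INDEX = 10       # assumed highest TXPower index (lowest power)
MIN_SF = 7

# Approximate demodulation floor per spreading factor at 125 kHz (assumed).
REQUIRED_SNR_DB = {12: -20.0, 11: -17.5, 10: -15.0,
                   9: -12.5, 8: -10.0, 7: -7.5}


class AdrSketch:
    """Hypothetical ADR state: decides only once the history is full."""

    def __init__(self):
        self.snr_history = deque(maxlen=HISTORY_LEN)

    def on_uplink(self, snr_db, sf, tx_index):
        """Return (new_sf, new_tx_index) if a change is needed, else None."""
        self.snr_history.append(snr_db)
        if len(self.snr_history) < HISTORY_LEN:
            return None  # not enough data yet
        margin = max(self.snr_history) - REQUIRED_SNR_DB[sf] - MARGIN_DB
        steps = int(margin // 3)  # one step per 3 dB of spare margin
        new_sf, new_tx = sf, tx_index
        # Spend steps on a faster data rate first, then on lower TX power.
        while steps > 0 and new_sf > MIN_SF:
            new_sf -= 1
            steps -= 1
        while steps > 0 and new_tx < MAX_TX_INDEX:
            new_tx += 1
            steps -= 1
        if (new_sf, new_tx) == (sf, tx_index):
            return None  # nothing to change, so no LinkAdrReq is sent
        return new_sf, new_tx


adr = AdrSketch()
decisions = [adr.on_uplink(snr_db=8.0, sf=12, tx_index=0) for _ in range(20)]
print(decisions[-1])
```

With this sketch, any event that clears or restarts the history (a rejoin, for example) would push the first decision later than uplink 20. Raising power again on a negative margin is left out for brevity.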

buzzware commented 2 years ago

The device is almost directly under the hotspot antenna, so SF12 is definitely not appropriate. What could cause the count to reset? There were some hiccups in the log. The CSV contains everything in the JSON, which should have every packet.

mikev commented 2 years ago

@buzzware - Thanks for the positive feedback at the last Community townhall!

Follow-up Investigation Summary: Although ADR appears to be working OK with the improved rx_delay, I wanted to re-examine the ADR log just to double-check. Everything looks OK and normal with the individual ADR commands in the posted sequence event-debug-202202190946.csv.

Investigation Details: I've taken a closer look at the ADR trace logs. Here are all decoded FOpts traces starting with the initial Join request (line 705). For reference, the FCnt indicates the frame number (FCnt is reset to zero upon Join).

[These are MAC command decodes for all Uplink LinkAdrAns responses]

FCnt=35 CtrlBits=1000 Uplink FOpts = [{link_adr_ans,1,1,1},{link_adr_ans,1,1,1}]
FCnt=36 CtrlBits=1000 Uplink FOpts = [{link_adr_ans,1,1,1},{link_adr_ans,1,1,1}]
FCnt=37 CtrlBits=1000 Uplink FOpts = [{link_adr_ans,1,1,1},{link_adr_ans,1,1,1}]
FCnt=66 CtrlBits=1000 Uplink FOpts = [{link_adr_ans,1,1,1},{link_adr_ans,1,1,1}]
FCnt=66 CtrlBits=1000 Uplink FOpts = [{link_adr_ans,1,1,1},{link_adr_ans,1,1,1}]
FCnt=68 CtrlBits=1000 Uplink FOpts = [{link_adr_ans,1,1,1},{link_adr_ans,1,1,1}]
FCnt=69 CtrlBits=1000 Uplink FOpts = [{link_adr_ans,1,1,1},{link_adr_ans,1,1,1}]

[Mac command for all Downlink LinkAdrReq commands] FCnt=37 CtrlBits=1010 Downlink FOpts = [{link_adr_req,5,3,0,7,0},{link_adr_req,5,3,65280,0,0}] ChMaskCntl = 7 ChMask = 0x0000 DataRate = 5 TXPower = 3 NbTrans = 0 ChMaskCntl = 0 ChMask = 0xFF00 DataRate = 5 TXPower = 3 NbTrans = 0 FCnt=38 CtrlBits=1010 Downlink FOpts = [{link_adr_req,5,7,0,7,0},{link_adr_req,5,7,65280,0,0}] ChMaskCntl = 7 ChMask = 0x0000 DataRate = 5 TXPower = 7 NbTrans = 0 ChMaskCntl = 0 ChMask = 0xFF00 DataRate = 5 TXPower = 7 NbTrans = 0 FCnt=39 CtrlBits=1010 Downlink FOpts = [{link_adr_req,5,10,0,7,0},{link_adr_req,5,10,65280,0,0}] ChMaskCntl = 7 ChMask = 0x0000 DataRate = 5 TXPower = 10 NbTrans = 0 ChMaskCntl = 0 ChMask = 0xFF00 DataRate = 5 TXPower = 10 NbTrans = 0 FCnt=68 CtrlBits=1010 Downlink FOpts = [{link_adr_req,5,3,0,7,0},{link_adr_req,5,3,65280,0,0}] ChMaskCntl = 7 ChMask = 0x0000 DataRate = 5 TXPower = 3 NbTrans = 0 ChMaskCntl = 0 ChMask = 0xFF00 DataRate = 5 TXPower = 3 NbTrans = 0 FCnt=69 CtrlBits=1010 Downlink FOpts = [{link_adr_req,5,6,0,7,0},{link_adr_req,5,6,65280,0,0}] ChMaskCntl = 7 ChMask = 0x0000 DataRate = 5 TXPower = 6 NbTrans = 0 ChMaskCntl = 0 ChMask = 0xFF00 DataRate = 5 TXPower = 6 NbTrans = 0 FCnt=71 CtrlBits=1010 Downlink FOpts = [{link_adr_req,5,9,0,7,0},{link_adr_req,5,9,65280,0,0}] ChMaskCntl = 7 ChMask = 0x0000 DataRate = 5 TXPower = 9 NbTrans = 0 ChMaskCntl = 0 ChMask = 0xFF00 DataRate = 5 TXPower = 9 NbTrans = 0 FCnt=72 CtrlBits=1010 Downlink FOpts = [{link_adr_req,5,10,0,7,0},{link_adr_req,5,10,65280,0,0}] ChMaskCntl = 7 ChMask = 0x0000 DataRate = 5 TXPower = 10 NbTrans = 0 ChMaskCntl = 0 ChMask = 0xFF00 DataRate = 5 TXPower = 10 NbTrans = 0
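For anyone who wants to reproduce a decode like this from raw FOpts bytes, the following is a small sketch of a LinkADRReq parser based on the LoRaWAN 1.0.x field layout (CID 0x03, then DataRate_TXPower, a little-endian ChMask, and Redundancy). The field order in the result mirrors the tuples in the trace. The example bytes are reconstructed by hand from the decoded FCnt=37 entry, not copied from the posted logs, and the function and class names are hypothetical.

```python
from typing import List, NamedTuple

LINK_ADR_REQ_CID = 0x03   # command identifier for LinkADRReq
LINK_ADR_REQ_LEN = 5      # CID byte + 4 payload bytes


class LinkAdrReq(NamedTuple):
    data_rate: int
    tx_power: int
    ch_mask: int
    ch_mask_cntl: int
    nb_trans: int


def decode_link_adr_reqs(fopts: bytes) -> List[LinkAdrReq]:
    """Decode an FOpts field that contains only LinkADRReq commands."""
    commands = []
    offset = 0
    while offset < len(fopts):
        if fopts[offset] != LINK_ADR_REQ_CID:
            raise ValueError(f"unexpected CID 0x{fopts[offset]:02x}")
        if offset + LINK_ADR_REQ_LEN > len(fopts):
            raise ValueError("truncated LinkADRReq")
        dr_txpow, mask_lo, mask_hi, redundancy = fopts[offset + 1:offset + 5]
        commands.append(LinkAdrReq(
            data_rate=dr_txpow >> 4,            # upper nibble
            tx_power=dr_txpow & 0x0F,           # lower nibble
            ch_mask=mask_lo | (mask_hi << 8),   # little-endian 16-bit mask
            ch_mask_cntl=(redundancy >> 4) & 0x07,
            nb_trans=redundancy & 0x0F,
        ))
        offset += LINK_ADR_REQ_LEN
    return commands


# Hand-built bytes matching the decoded FCnt=37 downlink above.
example = bytes.fromhex("0353000070" "035300ff00")
for cmd in decode_link_adr_reqs(example):
    print(cmd)
```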

Observations: DataRate is set to DR5, corresponding to SF7 / 125 kHz. The device then sends on SF7, so this appears normal and expected for ADR. The TXPower index quickly steps from 0 to 3 to 7 to 10 (a higher index means lower transmit power). This appears normal for ADR.
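To make the channel-mask pair and the TXPower index easier to interpret, here is a sketch based on my reading of the LoRaWAN Regional Parameters for the 72-channel plans (US915/AU915): ChMaskCntl 0 to 4 address 16-channel banks, ChMaskCntl 6 and 7 turn all 125 kHz channels on or off while ChMask applies to channels 64 to 71, and each TXPower index step is 2 dB below the maximum EIRP. The 30 dBm maximum EIRP is an assumed default, and the function names are hypothetical.

```python
def apply_ch_mask(enabled, ch_mask, ch_mask_cntl):
    """Return the set of enabled channels after one LinkADRReq mask."""
    enabled = set(enabled)
    if 0 <= ch_mask_cntl <= 4:
        base = 16 * ch_mask_cntl
        width = 16 if ch_mask_cntl < 4 else 8   # bank 4 covers 64..71 only
        first_bit = 0
    elif ch_mask_cntl in (6, 7):
        if ch_mask_cntl == 6:
            enabled |= set(range(64))           # all 125 kHz channels on
        else:
            enabled -= set(range(64))           # all 125 kHz channels off
        base, width, first_bit = 64, 8, 0       # mask applies to 64..71
    else:
        raise ValueError(f"ChMaskCntl {ch_mask_cntl} not handled here")
    for bit in range(first_bit, width):
        channel = base + bit
        if ch_mask & (1 << bit):
            enabled.add(channel)
        else:
            enabled.discard(channel)
    return enabled


def tx_power_dbm(tx_index, max_eirp_dbm=30):
    """Map a TXPower index to EIRP, assuming 2 dB per step."""
    return max_eirp_dbm - 2 * tx_index


# Apply the two masks from each downlink above, starting with all channels on.
channels = set(range(72))
channels = apply_ch_mask(channels, ch_mask=0x0000, ch_mask_cntl=7)
channels = apply_ch_mask(channels, ch_mask=0xFF00, ch_mask_cntl=0)
print(sorted(channels))
print([tx_power_dbm(i) for i in (0, 3, 7, 10)])
```

Read this way, the pair of commands leaves only the eight 125 kHz channels 8 to 15 enabled, and the index sequence 0, 3, 7, 10 corresponds to a 20 dB reduction in transmit power.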

Questions:

What could cause the count to reset?

The end-device FCnt is reset to zero just prior to a Join request. The Join request also triggers the LNS to reset its Downlink FCnt. This is normal.

There were some hiccups in the log.

I didn't find anything odd or out of the ordinary with this trace. If you see something that even looks odd let us know the specifics.

Thanks

helium / router

Review, document and debug Adaptive Data Rate (ADR) scheme #540