Review, document and debug Adaptive Data Rate (ADR) scheme

buzzware commented 2 years ago

The only description I can find of the Helium ADR scheme is by Leroyle in the Discord. He says it changes the data rate after 20 uplinks. The algorithm should be officially documented so users know what to expect, and also tested to ensure it is working, as it does not work for me. TTN changes my device to SF7 within a few uplinks. On Helium I see it joining at SF7 then uplinking at SF12. If it can join at SF7, surely it can uplink on SF7. Therefore, the algorithm should be reconsidered as many users will watch the SF for a few uplinks and assume ADR was not functioning. The data provided in #517 and #463 show my device endlessly on SF12 even though it is in the same room or almost underneath the roof antenna. Perhaps the lost downlinks and repeated uplinks are causing ADR to fail. I haven't provided a specific data set here because the data in the issues above should suffice if you can't see similar issues with your own devices. It seems that ADR just hasn't had much attention.

ADR is a critical feature because it controls battery consumption, and for my devices it makes a 25-fold difference. My third party product uses a non-replaceable, non-rechargeable battery that should last over 10 years, but on Helium it would last much less than one year.
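To put a rough number on the SF7 versus SF12 difference, here is a minimal sketch of the LoRa time-on-air formula from the Semtech SX127x datasheet. The 20-byte PHY payload, 125 kHz bandwidth, coding rate 4/5 and 8-symbol preamble are assumptions chosen for illustration, not values taken from this device.

```python
import math

def lora_time_on_air_ms(payload_bytes, sf, bw_hz=125_000, coding_rate=1,
                        preamble_symbols=8, explicit_header=True, crc=True):
    """Approximate LoRa time on air in milliseconds (SX127x datasheet formula)."""
    t_sym_ms = (2 ** sf) / bw_hz * 1000.0
    # Low data rate optimisation is normally enabled when a symbol exceeds 16 ms
    # (SF11 and SF12 at 125 kHz).
    low_dr_opt = 1 if t_sym_ms > 16.0 else 0
    implicit_header = 0 if explicit_header else 1
    numerator = (8 * payload_bytes - 4 * sf + 28
                 + (16 if crc else 0) - 20 * implicit_header)
    denominator = 4 * (sf - 2 * low_dr_opt)
    payload_symbols = 8 + max(math.ceil(numerator / denominator) * (coding_rate + 4), 0)
    return (preamble_symbols + 4.25 + payload_symbols) * t_sym_ms

# Assumed example: the same 20-byte frame sent at SF7 and at SF12.
toa_sf7 = lora_time_on_air_ms(20, 7)
toa_sf12 = lora_time_on_air_ms(20, 12)
ratio = toa_sf12 / toa_sf7
print(f"SF7: {toa_sf7:.1f} ms, SF12: {toa_sf12:.1f} ms, ratio: {ratio:.1f}x")
```

With these assumed parameters the SF12 frame is on air a little over 20 times longer than the SF7 frame, which is in the same range as the 25-fold figure quoted above. Transmit energy scales roughly with time on air at a fixed power setting.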

pvolodin commented 2 years ago

TTN changes my device to SF7 within a few uplinks. On Helium I see it joining at SF7 then uplinking at SF12.

In US, for instance, both DR2 and DR4 use SF8. Wouldn't it be better if you indicate DR, not SF?
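A small lookup sketch of why the DR index is less ambiguous than the SF. The tables below are the uplink data rates as commonly listed in the LoRaWAN Regional Parameters (1.0.3) for US915 and AU915; treat them as an assumption and check them against the revision your device implements.

```python
# Uplink DR index -> (spreading factor, bandwidth in kHz).
# Assumed from the LoRaWAN Regional Parameters 1.0.3; verify for your revision.
US915_UPLINK_DR = {0: (10, 125), 1: (9, 125), 2: (8, 125), 3: (7, 125), 4: (8, 500)}
AU915_UPLINK_DR = {0: (12, 125), 1: (11, 125), 2: (10, 125), 3: (9, 125),
                   4: (8, 125), 5: (7, 125), 6: (8, 500)}

def describe(table, dr):
    """Human-readable form of a DR index, e.g. 'SF7/125kHz'."""
    sf, bw = table[dr]
    return f"SF{sf}/{bw}kHz"

def data_rates_for_sf(table, sf):
    """All DR indexes in a region that use the given spreading factor."""
    return sorted(dr for dr, (table_sf, _bw) in table.items() if table_sf == sf)

print(data_rates_for_sf(US915_UPLINK_DR, 8))   # two different DRs share SF8
print(describe(AU915_UPLINK_DR, 5))
```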

If it can join at SF7, surely it can uplink on SF7.

But the packet error rate may be unacceptably high.

It seems that ADR just hasn't had much attention.

At the moment it seems that a lot of LoRaWAN features just haven't had much attention :-) To be honest, it doesn't seem to be a good idea to use Helium in production for something serious right now. But the giant leaps start from the small steps :-)

Therefore, the algorithm should be reconsidered

Ideally, it should be possible for the users to implement their own custom ADR algorithms (as it is possible in Chirpstack).

buzzware commented 2 years ago

Yes, the ability to use a custom ADR JavaScript function would be great. For example, it could make sense to disable SF12 to prevent complaints due to huge battery consumption especially when the battery is not replaceable.

jdgemm commented 2 years ago

The only description I can find of the Helium ADR scheme is by Leroyle in the Discord.

Existing ADR documentation is here, what other information would you like to see? https://docs.helium.com/use-the-network/console/profiles/#adr

buzzware commented 2 years ago

Ok, I hadn't seen that before, that looks adequate for now. The main thing is to understand why this device is stuck on SF12 after several days c31a7a5a-ce77-41d9-9644-6a6ec7229f55 I will leave it running for you to observe.

Also, what is the reasoning for waiting for 20 uplinks? That could take 20 days for some devices to see if ADR is working. TTN acts immediately.

pvolodin commented 2 years ago

@jdgemm > Existing ADR documentation is here, what other information would you like to see? https://docs.helium.com/use-the-network/console/profiles/#adr

There's no mention of repetition rate. I believe the existing implementation just always sets it to 1, but anyhow it would be good to have a couple of words regarding this setting in the docs.

pvolodin commented 2 years ago

@jdgemm Also, when only 8 channels are enabled, a device may transmit only at 26 dBm or less (this is an FCC requirement, not a part of the LoRaWAN spec). So when the server sets the channel mask with ADR commands, this should be taken into account. There's no mention of this in the docs either, and I'm not sure how this is actually implemented in the code.

Though it is the device's responsibility to meet FCC requirements, I believe it would be better to mention these topics in the docs in one form or another.

pvolodin commented 2 years ago

@jdgemm

Also, when only 8 channels are enabled, a device may transmit only at 26 dBm or less (this is FCC req-ment, not a part of LoRaWAN spec).

This is for the US, of course. In other regions with fixed channel plans, regulations may impose other limitations.

pvolodin commented 2 years ago

Existing ADR documentation is here, what other information would you like to see? https://docs.helium.com/use-the-network/console/profiles/#adr

From the doc:

"The network only calculates data-rate/power corrections after it collects 20 contiguous uplink packets with the ADR bit set to 1. ... The network server clears its ADR history whenever a device sends an uplink packet with ADR set to 0."

There's no reason to not take the packets with ADR bit unset into account when the server estimates the link quality. Not sure if the current implementation really works this way. If it doesn't then there's a bug in the doc. If it does, this probably should be corrected.
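The rule quoted from the docs can be modelled in a few lines. This is a hypothetical sketch of the documented behaviour only (20 contiguous ADR-enabled uplinks, history cleared on an ADR=0 uplink); the class and method names are invented for illustration and are not taken from Helium's router code.

```python
from collections import deque

class AdrHistory:
    """Hypothetical model of the documented rule, not Helium's implementation."""

    def __init__(self, window=20):
        self.window = window
        self.samples = deque(maxlen=window)  # keeps only the newest `window` SNRs

    def record_uplink(self, adr_bit, snr_db):
        if not adr_bit:
            # Documented behaviour: an uplink with ADR=0 clears the history.
            self.samples.clear()
            return
        self.samples.append(snr_db)

    def ready(self):
        """True once enough contiguous ADR-enabled uplinks have been collected."""
        return len(self.samples) == self.window

history = AdrHistory()
for _ in range(19):
    history.record_uplink(adr_bit=True, snr_db=8.0)
print(history.ready())          # 19 samples: not enough yet
history.record_uplink(adr_bit=False, snr_db=8.0)
print(len(history.samples))     # cleared by the ADR=0 uplink
```

The alternative suggested above would simply append the SNR regardless of the ADR bit and only gate the sending of LinkADRReq on it.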

jdgemm commented 2 years ago

Ok, I hadn't seen that before, that looks adequate for now. The main thing is to understand why this device is stuck on SF12 after several days c31a7a5a-ce77-41d9-9644-6a6ec7229f55 I will leave it running for you to observe.

Also, what is the reasoning for waiting for 20 uplinks? That could take 20 days for some devices to see if ADR is working. TTN acts immediately.

From TTN: https://www.thethingsnetwork.org/docs/lorawan/adaptive-data-rate/

ADR in The Things Stack To determine the optimal data rate, the network needs some measurements (uplink messages). Currently The Things Stack takes the 20 most recent uplinks, starting at the moment the ADR bit is set. These measurements contains the frame counter, signal-to-noise ratio (SNR) and number of gateways that received each uplink.

jdgemm commented 2 years ago

@jdgemm > Existing ADR documentation is here, what other information would you like to see? https://docs.helium.com/use-the-network/console/profiles/#adr

There's no mention of repetition rate. I believe the existing implementation just always sets it to 1, but anyhow it would be good to have a couple of words regarding this setting in the docs.

How would you leverage this information?

jdgemm commented 2 years ago

@jdgemm

Also, when only 8 channels are enabled, a device may transmit only at 26 dBm or less (this is FCC req-ment, not a part of LoRaWAN spec).

This is for the US, of course. In other regions with fixed channel plans, regulations may impose other limitations.

It's device and region-centric so disagree it belongs with ADR documentation.

buzzware commented 2 years ago

ADR in The Things Stack To determine the optimal data rate, the network needs some measurements (uplink messages). Currently The Things Stack takes the 20 most recent uplinks, starting at the moment the ADR bit is set. These measurements contains the frame counter, signal-to-noise ratio (SNR) and number of gateways that received each uplink.

That's not what I observed with my device - unless it is considering uplinks by the same device before the join. See this image from #463 https://user-images.githubusercontent.com/28285/138627525-fc578aec-ddf7-4eed-8c26-d0a50f14acce.png

jdgemm commented 2 years ago

ADR in The Things Stack To determine the optimal data rate, the network needs some measurements (uplink messages). Currently The Things Stack takes the 20 most recent uplinks, starting at the moment the ADR bit is set. These measurements contains the frame counter, signal-to-noise ratio (SNR) and number of gateways that received each uplink.

That's not what I observed with my device - unless it is considering uplinks by the same device before the join. See this image from #463 https://user-images.githubusercontent.com/28285/138627525-fc578aec-ddf7-4eed-8c26-d0a50f14acce.png

https://lora-developers.semtech.com/documentation/tech-papers-and-guides/understanding-adr

_To understand this process, imagine a device is connected to the network and has announced itself as ADR-enabled via an uplink (Figure 4). This uplink travels through one or more gateways that simply relay the message back to the network server. By default, it will be sent at the lowest data rate, that is, the longest range setting.

What does the network server do? It waits.

Once the network server has amassed several results, it calculates the median of those results and determines both the available link budget and the highest data rate that can be supported, along with a margin for error, to allow for fluctuation in the channel characteristics._

pvolodin commented 2 years ago

@jdgemm

@jdgemm > Existing ADR documentation is here, what other information would you like to see? https://docs.helium.com/use-the-network/console/profiles/#adr There's no mention of repetition rate. I believe the existing implementation just always sets it to 1, but anyhow it would be good to have a couple of words regarding this setting in the docs.

How would you leverage this information?

The NbRep field is an integral part of the LinkADRReq command, same as TxPower and DR. How would you leverage the information that the Helium ADR algo may change TX power? :-)

But there's a real-life example: #517. If I was experiencing that problem, I'd check the repetition rate first thing, as it isn't clearly stated in the docs that the ADR algo never touches this setting.

pvolodin commented 2 years ago

@jdgemm

It's device and region-centric so disagree it belongs with ADR documentation.

ADR at all is a region-centric (OK, region-specific, not region-centric) feature. That's why the interpretation of some LinkADRReq parameters is explained in the Regional Settings doc and not in the LoRaWAN spec itself. But I agree that there could be a better place for this information than ADR docs.

pvolodin commented 2 years ago

@buzzware

Also, what is the reasoning for waiting for 20 uplinks? That could take 20 days for some devices to see if ADR is working. TTN acts immediately.

It has been a long time since I last used the TTN implementation, so I cannot say how it works now, but I don't think they "act immediately". Yes, it "could take 20 days for some devices to see if ADR is working." But acting immediately could just disconnect some devices for a number of days. After all, the goal of any network operator is to provide reliable connectivity, not to demonstrate its capability to do some tricks.

And if 20 uplinks at the lowest DR can seriously discharge your device's battery, then maybe you need to choose another device? Though this is a good reason to have an ability to define your own ADR algo in the console.

buzzware commented 2 years ago

@pvolodin I've regularly seen TTS v3 turn my node's SF to SF8 then SF7 within a few uplinks. Maybe it changed for v3? You can see it here https://github.com/helium/router/issues/463#issuecomment-950480809 I didn't say "20 uplinks at the lowest DR can seriously discharge your device's battery" - it's just annoying and perhaps unnecessary. For business reasons I can't just "choose another device", nor should I have to - it works fine on TTN.

mikev commented 2 years ago

The ADR adjustment algorithm Helium uses is the SemTech algorithm documented here: https://www.thethingsnetwork.org/forum/uploads/default/original/2X/7/7480e044aa93a54a910dab8ef0adfb5f515d14a1.pdf

As a whole, the ADR protocol and mechanism is complex. Having said this - certain engineering simplifications are made in both Helium's LNS code and the LoRaWAN protocol, which when understood do make debugging and ADR comprehension easier.

Brief ADR algorithm observations:
1) Only the most recent 20 packets are considered for computations.
2) Spread factor, bandwidth, RSSI and SNR are stored with each packet in the packet history list.
3) RSSI values are never used in the current computation.
4) Bandwidth is ignored in the current computation. DataRates are restricted to 125 kHz.
5) An integer DataRate (DR) index and Power index adjustment are sent to the end-device. (Conversion to actual spread, bandwidth and power float values is documented in the Regional doc.) These integer indexes range from 0 to somewhere below 20.
6) The end-device normally initializes its state so the DR and Power indexes both start at zero.
7) During a normal, expected sequence the ADR will cause the DR and Power indexes to gradually increase in value.

To simplify - in a normal working system we expect the end-device to initialize and start with a DR index of 0 and a Power index of 0. These settings correspond to the maximum power and high spread factor needed to transmit the packet as far and clearly as possible. However, the initial settings use more time on air and power, and therefore more battery. ADR is designed to decrease power usage.

Therefore in the ADR message we should observe small, positive integer adjustments to the DataRate and Power indexes. ADR downlinks from the LNS to the end-device can be lost. Therefore several acknowledgement mechanisms exist so the end-device can notify the LNS that it received the ADR. These include sending the LinkADRAns or setting the ADR acknowledge request bit in an Uplink.

This is a simplified description.
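The adjustment step described above can be sketched roughly as follows. This follows the commonly published Semtech recommendation (best SNR in the history, a safety margin, steps of 3 dB, data rate raised before power is lowered); it is not taken from Helium's router code. The required-SNR table, the 10 dB margin, the AU915 DR table, the maximum power index of 10 and treating one power index as one step are all assumptions.

```python
# Approximate demodulation SNR floor per spreading factor, in dB (assumed values).
REQUIRED_SNR_DB = {7: -7.5, 8: -10.0, 9: -12.5, 10: -15.0, 11: -17.5, 12: -20.0}

# AU915 uplink DR index -> SF for the 125 kHz data rates (assumed table).
AU915_DR_TO_SF = {0: 12, 1: 11, 2: 10, 3: 9, 4: 8, 5: 7}

def adr_adjust(snr_history_db, dr, tx_power_idx, dr_to_sf=AU915_DR_TO_SF,
               max_tx_power_idx=10, margin_db=10.0):
    """Return the (dr, tx_power_idx) an ADR server might request next."""
    max_dr = max(dr_to_sf)
    # Only SNR is used; RSSI and bandwidth are ignored, as noted in the thread.
    snr_margin = max(snr_history_db) - REQUIRED_SNR_DB[dr_to_sf[dr]] - margin_db
    n_step = int(snr_margin / 3)
    # Spare link budget: raise the data rate first, then lower the power
    # (a higher power index means less transmit power).
    while n_step > 0 and dr < max_dr:
        dr += 1
        n_step -= 1
    while n_step > 0 and tx_power_idx < max_tx_power_idx:
        tx_power_idx += 1
        n_step -= 1
    # Not enough budget: give power back, but leave the data rate alone.
    while n_step < 0 and tx_power_idx > 0:
        tx_power_idx -= 1
        n_step += 1
    return dr, tx_power_idx

# A device close to a gateway: 20 uplinks at DR0 with a best SNR of 9 dB.
print(adr_adjust([9.0] * 20, dr=0, tx_power_idx=0))
```

In this sketch a strong link moves the device from DR0 straight to DR5 in a single request, which matches the "small, positive integer adjustments" described above.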

A description from SemTech of the overall ADR protocol: https://lora-developers.semtech.com/documentation/tech-papers-and-guides/understanding-adr/ A detailed ADR implementation guide from SemTech. https://lora-developers.semtech.com/documentation/tech-papers-and-guides/implementing-adaptive-data-rate-adr/implementing-adaptive-data-rate/

I plan to spend more time reviewing the ADR code and writing additional unit tests. An additional 40 passing unit tests have been added. Consider this a partial update.

mikev commented 2 years ago

TTN changes my device to SF7 within a few uplinks. On Helium I see it joining at SF7 then uplinking at SF12. If it can join at SF7, surely it can uplink on SF7. Therefore, the algorithm should be reconsidered as many users will watch the SF for a few uplinks and assume ADR was not functioning. The data provided in #517 and #463 show my device endlessly on SF12 even though it is in the same room or almost underneath the roof antenna.

I find it odd that your device joins at SF7. All devices should join at DataRate 0 (SF 12 for the EU). Was this a typo? Stabilizing at SF12 would be a symptom of losing or never receiving the ADR downlink command.

Two related issues have been fixed - (1) the ADR bit not set on the ADR downlink (2) the RX delay settings (pending a production deploy). Both those fixes would explain your end-device failing to receive or process the downlink ADR mac command.

@buzzware - Do you still consider the ADR algorithm an issue or do these fixes solve the problem? If your end-device is now actually receiving the ADR downlink as shown by one of the acknowledge methods and you still believe there is a problem, then please re-post a newer packet sequence. Thanks.

buzzware commented 2 years ago

Thanks @mikev

it does join at SF7 (not a typo). See "event-debug (4).csv" at the top of #517 row 246. I'm on AU915 if that makes any difference. I suppose if it failed at SF7 it would try a higher SF.
these fixes sound like they are on target, but when ready I will definitely test to be sure
from experience it seems TTS isn't waiting for 20 uplinks for ADR. I'm not fully across all the issues or algorithm of ADR, I just want it to work and I expect to see SF7 when the node is outdoors, ground level, 1 inch antenna within 100m if not 500m (don't quote me, just guessing) of a gateway.
let me know when I can test the changes with a retail hotspot and the staging console.

mikev commented 2 years ago

@buzzware - A LinkAdrAns mac command is sent by the end-device to acknowledge a LinkAdrReq sent from the LNS. In the log you posted there are 17 LinkAdrAns uplinks. The LinkAdrAns uplinks I see in your logs are usually sent at SF7, following a prior SF12 uplink. It appears your end-device is successfully receiving and processing ADR mac commands and re-adjusting to SF7. From the log let's note that messages are sent Confirmed, so there are also Ack packets, which is evidence that both the gateway and end-device are receiving the packets.

What seems unusual is that after sending a LinkAdrAns your end-device then immediately begins incrementing its uplinks from SF7 up to SF12. Does this mean your end-device is misbehaving?

buzzware commented 2 years ago

@mikev Yes I see in the CSV rows 423 to 434 the link_adr_ans commands and the SF going to 7. These FOPTS are :

0354000070035400FF00
035A000070035A00FF00
0356000070035600FF00
035A000070035A00FF00

After that the SF climbs again. Notice also the downlinks still contain FOpts with LinkADRReq (03) but there are no link_adr_ans. These FOPTS are only sent in the first downlink_ack of each pair which is probably missed, causing the second uplink. The second downlink_ack has FOpts=null. That probably explains the lack of link_adr_ans.

0356000070035600FF00
null 
0356000070035600FF00
null
0356000070035600FF00
null
0354000070035400FF00
null
0354000070035400FF00
null
0354000070035400FF00
null

buzzware commented 2 years ago

It's too late at night here, but I suppose the next thing is to scrutinise these LinkADRReq commands. I wonder about the DataRate_TXPower, e.g. 035A000070 seems to break down as 03: LinkAdrReq 5: DataRate A: TXPower, but in the spreadsheet the DataRate changes even though the value is always 5. Have I got DataRate and TXPower backwards above? https://lorawan-packet-decoder-0ta6puiniaut.runkit.sh/?data=YAQAAEiqVQADWgAAcANaAP8AQrMV8A%3D%3D&nwkskey=&appskey=

buzzware commented 2 years ago

035A000070
03: LinkAdrReq
5A: DataRate_TXPower
0000: ChMask - A bit in the ChMask field set to 1 means that the corresponding channel can be used for uplink transmissions if this channel allows the data rate currently used by the end-device. A bit set to 0 means the corresponding channels should be avoided.
7: ChMaskCntl - All 125 kHz OFF: ChMask applies to channels 64 to 71 *** Suspicious
0: NbTrans - number of transmissions for each uplink message. This applies only to “unconfirmed” uplink frames. 0 means default (1)

035A00FF00
03: LinkAdrReq
5A: DataRate_TXPower
00FF: ChMask - A bit in the ChMask field set to 1 means that the corresponding channel can be used for uplink transmissions if this channel allows the data rate currently used by the end-device. A bit set to 0 means the corresponding channels should be avoided.
0: ChMaskCntl - Channels 0 to 15
0: NbTrans - number of transmissions for each uplink message. This applies only to “unconfirmed” uplink frames. 0 means default (1)
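The manual breakdown above can be reproduced with a short decoder. This is a minimal sketch that assumes the FOpts string contains only LinkADRReq commands (CID 0x03, 4 payload bytes each) and that ChMask is little-endian like other multi-octet LoRaWAN fields; the function name is made up for illustration.

```python
def parse_link_adr_reqs(fopts_hex):
    """Decode a hex string of back-to-back LinkADRReq commands."""
    data = bytes.fromhex(fopts_hex)
    commands = []
    i = 0
    while i < len(data):
        if data[i] != 0x03:
            raise ValueError(f"unexpected CID 0x{data[i]:02X} at offset {i}")
        if len(data) - i < 5:
            raise ValueError("truncated LinkADRReq")
        dr_txpower, mask_lo, mask_hi, redundancy = data[i + 1:i + 5]
        commands.append({
            "data_rate": dr_txpower >> 4,       # high nibble
            "tx_power": dr_txpower & 0x0F,      # low nibble
            "ch_mask": mask_lo | (mask_hi << 8),
            "ch_mask_cntl": (redundancy >> 4) & 0x07,
            "nb_trans": redundancy & 0x0F,
        })
        i += 5
    return commands

for cmd in parse_link_adr_reqs("035A000070035A00FF00"):
    print(cmd)
```

Under these assumptions the high nibble (5) is the DataRate and the low nibble (0xA) is the TXPower index, so the breakdown above has them the right way round. With little-endian byte order the second mask decodes to 0xFF00, which would be channels 8 to 15 rather than 0 to 7.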

mikev commented 2 years ago

Regarding the two LinkAdrReq commands: the AU915 portion of the spec is missing an explanation note, which you can find elsewhere in the Regional spec. The note explains that the first LinkAdr disables ALL 125 kHz channels, while the second LinkAdr enables a bank of 8 125 kHz channels. So this is normal and expected behavior.

mikev commented 2 years ago

@buzzware - Can you share the manufacturer of the end-device you are using? We might be able to purchase and test it locally.

In the meantime I'm working on building better analysis tools for processing hundreds or thousands of trace messages to better help diagnose these types of issues.

buzzware commented 2 years ago

I can't provide the manufacturer. It works fine on TTN and I haven't seen it misbehaving against the LoRaWAN standard. One thing that might separate it from others is the use of confirmed uplinks. You can see the device at https://staging-console.helium.wtf/devices/b4233140-f081-48e5-b643-954219f5ebf0 - let me know if you want me to trigger a rejoin. I also have a RAK USB Concentrator and could run https://github.com/helium/lorawan-sniffer. I will work with anyone to get this device working on Helium.

mikev commented 2 years ago

@buzzware - Yes, if you can run the lorawan-sniffer that will be great. Please run the sniffer and send the results.

lthiery commented 2 years ago

@buzzware I'm a little confused by the data.payload column in your spreadsheet.

If I take bwkBJANhBpgCAAAAAQAAACsOMRITC+UHAAAAJEw=

>>> import base64
>>> raw_bytes = base64.b64decode(b'bwkBJANhBpgCAAAAAQAAACsOMRITC+UHAAAAJEw=')
>>> byte_array = [x for x in raw_bytes]
>>> byte_array
[111, 9, 1, 36, 3, 97, 6, 152, 2, 0, 0, 0, 1, 0, 0, 0, 43, 14, 49, 18, 19, 11, 229, 7, 0, 0, 0, 36, 76]

I assume this is the base64 encoding of the PHY payload?

So the first byte 111 is MHDR. And the two least significant bits are the Major version

And they appear to be 0b11 instead of 0b00

>>> bin(byte_array[0])
'0b1101111'

Am I misinterpreting what data.payload is?
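For reference, this is how the first byte would split if it really were the MHDR of a PHY payload. A minimal sketch based on the LoRaWAN 1.0.x MHDR layout (MType in bits 7..5, RFU in bits 4..2, Major in bits 1..0); the function name is made up for illustration.

```python
import base64

MTYPES = [
    "Join Request", "Join Accept",
    "Unconfirmed Data Up", "Unconfirmed Data Down",
    "Confirmed Data Up", "Confirmed Data Down",
    "RFU", "Proprietary",
]

def parse_mhdr(first_byte):
    """Split a byte into MHDR fields, assuming it is the start of a PHY payload."""
    return {
        "mtype": MTYPES[first_byte >> 5],
        "rfu": (first_byte >> 2) & 0x07,
        "major": first_byte & 0x03,
    }

raw_bytes = base64.b64decode(b"bwkBJANhBpgCAAAAAQAAACsOMRITC+UHAAAAJEw=")
mhdr = parse_mhdr(raw_bytes[0])
print(mhdr)
```

A valid LoRaWAN R1 frame has Major 0 and RFU 0, so non-zero values in both fields are a strong hint that the bytes are not a PHY payload at all.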

mikev commented 2 years ago

To recap the current knowledge.

(1) The end-device is uplinking data at SF12 (which is DR-0).
(2) The end-device successfully receives an ADR mac command. It confirms with an AdrAns uplink sent at SF7.
(3) The end-device then unexpectedly stepwise changes its uplinks from SF7 to SF8 to SF9 to SF10 to SF11 to SF12.
(4) Within this 20 packet sequence there must be some uplink or downlink bit difference compared to the exact same sequence on a TTN network.
(5) We are unable to decode the end-device's Uplink messages. The payload in the log may be mis-logged.

My working theory is that there is likely a one bit difference between Helium and TTN during the ADR sequence. This small difference causes the end-device to confuse the ADR and not stick with SF7 but instead to stepwise revert back to SF12.

Next Steps:
(a) @buzzware will gather a new Sniffer trace
(b) Can we gather comparison packet traces between TTN and Helium for the same device?
(c) Scrutinize the initial ADRReq, ADRAns and following 5 messages.

mikev commented 2 years ago

Some colleagues and I spent some time examining the logs again. We noticed something interesting. Examine the 20 or so packets following the ADR mac command. For example rows 344 to 370. The end-device is using Acks, so when the end-device does not receive an ack it will re-transmit the same message a 2nd time. You can observe a re-transmit because the fcnt value stays the same. For example row 351 is a re-transmit of row 349, i.e. both have fcnt=26.

So, it appears that after an ADR adjustment the end-device does not hear every ack. The end-device's internal logic then triggers it to stepwise change its DR index by one. Then the pattern repeats. The symptom of not hearing the ack should be solved by the RX delay changes which should be deployed to production now.
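Spotting these re-transmits by eye gets tedious in long logs, so here is a minimal sketch of the check. It assumes the uplinks have already been extracted as (row, fcnt) pairs in log order; only rows 349 and 351 with fcnt=26 come from the log discussed above, the other rows are made up for illustration.

```python
def find_retransmits(uplinks):
    """Return (row, fcnt) for every uplink that repeats the previous uplink's fcnt."""
    retransmits = []
    previous_fcnt = None
    for row, fcnt in uplinks:
        if fcnt == previous_fcnt:
            retransmits.append((row, fcnt))
        previous_fcnt = fcnt
    return retransmits

# Uplink rows only (downlink rows removed); partly hypothetical data.
uplink_log = [(347, 25), (349, 26), (351, 26), (353, 27)]
print(find_retransmits(uplink_log))
```

Each hit marks a confirmed uplink whose ack was most likely not heard by the device.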

@buzzware - Can you test this theory? Using the new config try configuring your RX delay for this device from the default 1 second to 5 seconds.

buzzware commented 2 years ago

Am I misinterpreting what data.payload is?

@lthiery Yes, the payload column is straight from the Helium console JSON dump, which can be the node upload data, or for downlink_acks or join_accept is actually the entire raw packet. In this case you are decoding the node data. This seems to have been addressed with #464 so I suppose it would help to work with a fresh log dump. I'm working on getting a sniffer dump.

buzzware commented 2 years ago

Hello @mikev

(3) The end-device then unexpectedly stepwise changes its uplinks from SF7 to SF8 to SF9 to SF10 to SF11 to SF12.

See this comment https://github.com/helium/router/issues/540#issuecomment-1010106224 The server is sending ADRReq commands when the node steps up.

(5) We are unable to decode the end-device's Uplink messages. The payload in the log may be mis-logged.

See comment above https://github.com/helium/router/issues/540#issuecomment-1012705631

Next Steps: (a) @buzzware will gather a new Sniffer trace

Working on it

(b) Can we gather comparison packet traces between TTN and Helium for the same device?

Probably, working on it

mikev commented 2 years ago

See this comment #540 (comment) The server is sending ADRReq commands when the node steps up.

Yes, you are correct, the FOpt commands are included with the Ack. In any case, here is the FOpt ADRReq sequence for rows 343 to 360:

0353000070035300FF00
0357000070035700FF00
0356000070035600FF00
0356000070035600FF00
0356000070035600FF00

The AdrReq is 0x03 followed, if I'm reading it correctly, by the DataRate 0x5. (You mention the same above.) The DataRate is always 0x5 while the TXPowerIndex ranges from 0x3 to 0x7. The DataRate of 0x5 corresponds to SF7 / 125 kHz for the AU915 region.

The ADR is sent repeatedly, which seems wrong (maybe the LNS is not seeing the Adr Ans). However the ADR is consistently setting DR5 / SF7. So this does not explain why the device is changing to SF values above 7. I think a comparison trace with TTN will be helpful.

lthiery commented 2 years ago

Am I misinterpreting what data.payload is?

@lthiery Yes, the payload column is straight from the Helium console JSON dump, which can be the node upload data, or for downlink_acks or join_accept is actually the entire raw packet. In this case you are decoding the node data. This seems to have been addressed with #464 so I suppose it would help to work with a fresh log dump. I'm working on getting a sniffer dump.

Well that explains it! Thanks

mikev commented 2 years ago

Recap:

The root cause is most likely multiple lost Downlink messages sent from the Gateway to the end-device. Evidence of lost packets is seen by observing a repeating fcnt in the end-device Uplinks. This appears to trigger a stepwise increase in SF on the end-device. We've seen latency issues with Australia due to the physical distance to the LNS servers which are located in the U.S. Increasing the RX delay from 1 sec to 5 secs should mitigate the issue.

Next Steps:
@buzzware - Will gather a new Sniffer trace with the new RX delay config settings.
Optionally capture a TTN trace with the same end-device so we can compare TTN and Helium behavior.

Expanding on the working theory. As ADR adjusts SF and Power this can result in a device change which degrades the ability to transmit to a Gateway. Each side of the protocol must take this into account. So if the end-device believes it cannot reach the Gateway it would be normal and expected behavior for the end-device to gradually lower its DataRate (raising the SF) and increase Power until a packet is successfully sent.
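The working theory can be illustrated with a toy model. This is a hypothetical sketch of the behaviour observed in the logs (one step toward a more robust data rate for each unheard ack), not the back-off rule from the LoRaWAN spec and not the device's actual firmware logic. On AU915, stepping from SF7 toward SF12 means lowering the DR index from 5 toward 0.

```python
def simulate_backoff(start_dr, acks_heard, min_dr=0):
    """Return the DR index used for each successive confirmed uplink.

    acks_heard is a list of booleans, one per uplink, saying whether the
    device heard the ack for that uplink.
    """
    dr = start_dr
    trace = [dr]
    for heard in acks_heard:
        if not heard and dr > min_dr:
            dr -= 1  # fall back to a slower, more robust data rate
        trace.append(dr)
    return trace

# Device set to DR5 (SF7) by ADR, then five acks in a row are missed.
print(simulate_backoff(5, [False] * 5))
```

If the acks are heard, the device stays at the DR that ADR assigned, which is what a longer RX delay is expected to achieve.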

fyi - I'll be on holiday until Jan 19th.

buzzware commented 2 years ago

This data was captured with a clean install of Raspbian on Raspberry Pi 3b+ and https://github.com/RAKWireless/rak_common_for_gateway with a RAK7271 USB concentrator.

helium-investigation-20220117.zip

helium-investigation-20220117.zip contains :

109-rpi-ttn-direct: from the most simple TTN gateway setup
helium-debug-4: from the top of #517

These 2 files are analysis of the FOPTS which may contain LinkADRReq etc:

helium-investigation-20220117/109-rpi-ttn-direct/direct-analysis.csv
helium-debug-4/event-debug.4 - study FOpts.csv

Observations

1) TTN doesn't do the double 03 LinkADRReq commands like Helium https://github.com/helium/router/issues/540#issuecomment-1010252465
2) TTN has LinkADRReq's in downlinks and the node replies with LinkADRAns in the following uplinks. On Helium, the node only responds with LinkCheckAns 02, not LinkADRAns, and I don't see any LinkCheckReq that it is replying to.
3) TTN specifies ChMask=00FF, while Helium specifies ChMask=0000 then ChMask=00FF
4) The Helium data is in difficult conditions where RxDelay is too short for many downlinks to succeed, while the TTN example has RxDelay=5 with no such problems.
5) TTN sets the DR immediately to SF8, and then again to SF7 with the first few packets

Perhaps the node isn't happy with the double LinkADRReq commands. Are they necessary, when TTN doesn't seem to think so ? @mikev

lthiery commented 2 years ago

I believe it has its origins in our LNS from the 1.0.3 Regional Specification, which in the US915 section explains:

For example, to reconfigure a device from 64 channel operation to the first 8 channels could contain two LinkAdrReq, the first (ChMaskCntl = 7) to disable all 125 kHz channels and the second (ChMaskCntl = 0) to enable a bank of 8 125 kHz channels. Alternatively, using ChMaskCntl = 5 a device can be re-configured from 64 channel operation to support the first 8 channels in a single LinkAdrReq.

As far as I can tell, the US915 and AU915 definitions are structurally identical:

au915

You would assume the Note from the US915 section would then also apply to AU915. That being said, perhaps the end-device implementation does not tolerate the double LinkAdrReq. I don't know if we saw any less success with the implementation on the LNS side where we send a single LinkAdrReq and don't know if we had a specific reason to adopt the double LinkAdrReq approach.

pvolodin commented 2 years ago

@lthiery @buzzware

The LoRaWAN spec clearly states:

The network server may include multiple LinkAdrReq commands within a single downlink message. For the purpose of configuring the end-device channel mask, the end-device will process all contiguous LinkAdrReq messages, in the order present in the downlink message, as a single atomic block command. The end-device will accept or reject all Channel Mask controls in the contiguous block, and provide consistent Channel Mask ACK status indications for each command in the contiguous block in each LinkAdrAns message, reflecting the acceptance or rejection of this atomic channel mask setting. The device will only process the DataRate, TXPower and NbTrans from the last message in the contiguous block, as these settings govern the end-device global state for these values. The end device will provide consistent ACK status in each LinkAdrAns message reflecting the acceptance or rejection of these final settings.

If the end-device implementation does not tolerate a sequence of LinkAdrReq within a single DL, it's just a bug in the firmware, and it doesn't much matter whether that sequence is really necessary at the moment.
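The atomic-block rule quoted above can be sketched as follows for a US915/AU915-style 72-channel plan. This is an illustrative model, not firmware from any real device: the function name and dict keys are invented, ChMaskCntl = 5 is deliberately not modelled, and rejecting a block that leaves no channel enabled is an assumption about sensible device behaviour.

```python
def apply_link_adr_block(enabled, commands):
    """Apply a contiguous block of LinkADRReq commands as one atomic unit.

    enabled: set of currently enabled channel indexes (0..71).
    Returns (new_enabled, data_rate, tx_power, nb_trans).
    """
    channels = set(enabled)  # work on a copy so a rejected block changes nothing
    for cmd in commands:
        cntl, mask = cmd["ch_mask_cntl"], cmd["ch_mask"]
        if cntl <= 3:
            base, width = cntl * 16, 16        # one bank of 16 125 kHz channels
        elif cntl in (4, 6, 7):
            if cntl == 6:
                channels |= set(range(64))     # all 125 kHz channels on
            elif cntl == 7:
                channels -= set(range(64))     # all 125 kHz channels off
            base, width = 64, 8                # mask applies to channels 64..71
        else:
            raise ValueError(f"ChMaskCntl {cntl} not modelled in this sketch")
        for bit in range(width):
            if (mask >> bit) & 1:
                channels.add(base + bit)
            else:
                channels.discard(base + bit)
    if not channels:
        raise ValueError("block rejected: no channel left enabled")
    last = commands[-1]  # DR, TXPower and NbTrans come from the last command only
    return channels, last["data_rate"], last["tx_power"], last["nb_trans"]

disable_all = {"data_rate": 5, "tx_power": 10, "ch_mask": 0x0000,
               "ch_mask_cntl": 7, "nb_trans": 0}
enable_bank = {"data_rate": 5, "tx_power": 10, "ch_mask": 0xFF00,
               "ch_mask_cntl": 0, "nb_trans": 0}

channels, dr, power, nb_trans = apply_link_adr_block(
    set(range(72)), [disable_all, enable_bank])
print(sorted(channels), dr, power, nb_trans)
```

Processed together, the two commands leave one bank of 8 channels enabled. Processed alone, the first command would leave no channel at all, which is why the pair only makes sense inside a single downlink.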

pvolodin commented 2 years ago

@buzzware I'm not sure if Helium currently supports CFList in JoinAccept for your region. But if it does, this may be a good workaround for your issue.

buzzware commented 2 years ago

@pvolodin

The node does support multiple commands eg. DevStatusReq + LinkADRReq
Helium needs to support as many third party devices as possible, and when it's difficult or impossible to update them, Helium should be accommodating. TTN is the quasi-standard that Helium should be compatible with. I will take it to the manufacturer, but there is no guarantee they will make any change, and they can't change hundreds of thousands of units already sold. There is no firmware update feature I know of.
The bug could be related to the 125 kHz channels, not multiple commands. Is it assuming 64 channels, when few networks are? If the 125 kHz channel command is necessary, is it necessary every time the DR changes? Perhaps it could be sent just once after joining. TTN doesn't seem to need it every time, if ever.
What does it mean that the AU915 docs don't have this combination like other plans ?
Separating the LinkADRReq commands across multiple packets might at least allow the DR change to work even if the other command fails
Note that the node does not ACK the LinkADRReq with a LinkADRAns. Perhaps then it could send them separately.

mikev commented 2 years ago

@buzzware - Were you able to re-test with the device's RX delay increased from 1 second to 5 seconds? (Note - the RX delay is a new config option Helium recently added to device profiles). The only logs I see in your most recent traces for Helium are still time-stamped for 11-19-2021. I believe increasing the RX delay will most likely fix the issue.

The comparison logs for TTN are very helpful. Thanks for capturing these traces. I'll analyze further and reply later.

buzzware commented 2 years ago

@mikev Will do, but I haven't had confirmation that the RxDelay feature was fully completed end to end and deployed? There was some outstanding issue. Sprint 68 seems to contain relevant issues unresolved https://github.com/helium/router/milestone/6.

UPDATE I am seeing the same repeating issue just now on the staging console

pvolodin commented 2 years ago

@buzzware

Separating the LinkADRReq commands across multiple packets might at least allow the DR change to work even if the other command fails

Imagine that 0353000070035300FF00 is sent in two packets. In this case the first one contains 0353000070 , this disables all channels. The device either will be fully disconnected or will reject the command.

To reconfigure a device for an 8-channel plan you have to either send two commands in the same packet or use CFList in JoinAccept (and later insert reasonable values into the channel mask-related fields of the LinkADRReq command, just as placeholders, as you cannot put "nothing" in these fields).

I believe the latter is how TTN works. To not force everyone to dive deep in your logs, can you please post the dumps of both TTN and Helium JoinAccept messages here?

TTN is the quasi-standard

For Oz - maybe. For the US - almost nobody knows what TTN is. Guess which market is bigger and gets more attention from the Helium team? Sad but true.

pvolodin commented 2 years ago

@buzzware

TTN JoinAccept from your logs:

[2022-01-17 09:27:23.825] JoinAccept 926.3 MHz DataRate(SF7, BW500) AppNonce: 000024 NetId: 000013 DevAddr: 260D7CDB DL Settings: DLSettings(8) RxDelay: 5, CFList: FixedChannel([ChannelMask([0, ff]), ChannelMask([0, 0]), ChannelMask([0, 0]), ChannelMask([0, 0])])

So, different to Helium, they don't use LinkADRReq for managing channel mask.
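To see which channels that CFList enables, the masks from the log line can be expanded into channel indexes. This is a minimal sketch that assumes each ChannelMask([a, b]) pair in the log is printed in wire order with the least significant octet first; if the sniffer prints them the other way round, the result would be channels 0 to 7 instead.

```python
def enabled_channels(masks):
    """Expand a list of [low_octet, high_octet] channel masks into channel indexes.

    Mask i covers channels i*16 .. i*16+15 (assumed little-endian octet order).
    """
    channels = []
    for index, (low, high) in enumerate(masks):
        value = low | (high << 8)
        for bit in range(16):
            if (value >> bit) & 1:
                channels.append(index * 16 + bit)
    return channels

# Masks as printed in the TTN JoinAccept above.
ttn_cflist = [[0x00, 0xFF], [0x00, 0x00], [0x00, 0x00], [0x00, 0x00]]
print(enabled_channels(ttn_cflist))
```

Under that byte-order assumption the CFList selects a single bank of 8 channels at join time, so no channel mask change is needed later in LinkADRReq.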

buzzware commented 2 years ago

Yes, I've seen the CFList in the TTN Join-Accept before, and not in Helium's.

I will try to get Join-Accepts, last time it seemed to be missing on TTN somehow.

What is the quasi-standard for the US? Helium? I don't think it has the maturity yet with actual LoRaWAN users (hence the reason for these issues)

buzzware commented 2 years ago

So, different to Helium, they don't use LinkADRReq for managing channel mask.

Makes sense - why send it twice every time ?

pvolodin commented 2 years ago

@buzzware So if you're able to enable CFList for your devices in the Helium console - try this setting and let us know the result. And if there's no such possibility - this is another issue not related to ADR.

buzzware commented 2 years ago

@buzzware So if you're able to enable CFList for your devices in the Helium console - try this setting and let us know the result. And if there's no such possibility - this is another issue not related to ADR.

I've always had CFList enabled on every device

pvolodin commented 2 years ago

@buzzware If the feature is enabled but you're sure it doesn't work - it's a good reason to open an issue :-) Though an issue with CFList doesn't mean the devices have the right to not follow the rules and ignore perfectly valid commands. (OK, to misinterpret, not to ignore.)

helium / router

Review, document and debug Adaptive Data Rate (ADR) scheme #540