mcci-catena / arduino-lmic

LoraWAN-MAC-in-C library, adapted to run under the Arduino environment
https://forum.mcci.io/c/device-software/arduino-lmic/
MIT License
636 stars 207 forks source link

US915 in TTNv3 1.0.2 gives LinkADRAns NACK #688

Open nielskoot opened 3 years ago

nielskoot commented 3 years ago

In TTNv3 with US915 and LoRaWAN 1.0.2 configured for the device the following FOpts is in the downlink: 03 32 00 00 71 03 32 00 ff 01 which consists of: 03 32 00 00 71 = LinkADRReq (DataRate = 3, TXPower = 2, ChMask = 0x0000 (all 16 off), ChMaskCntl = 7, NbTrans = 1) 03 32 00 ff 01 = LinkADRReq (DataRate = 3, TXPower = 2, ChMask = 0x00ff (0-7 off, 8-15 on), ChMaskCntl = 0, NbTrans = 1)

The first LinkADRReq disables all channels, both 125 kHz and 500 kHz and the second LinkADRReq enables channels 8-15 (the 125 kHz channels of sub-band 2).

The next uplink contains the Fopts: 03 06 03 06 which consists of: 03 06 = LinkADRAns (Power ACK = 1, Data rate ACK = 1, Channel mask ACK = 0) 03 06 = LinkADRAns (Power ACK = 1, Data rate ACK = 1, Channel mask ACK = 0)

The set of 2 LinkADRReq is attempted 4 times after the join and rejected 4 times. This not only creates extra downlinks and uplinks, also the channel 65 that TTNv3 tries to disable stays active in the device and is continued to be used.

After checking in the code the issue seems to be the first LinkADRReq, that basically disables all channels and LMIC doesn't accept that. However according to the specification these combined LinkADRReq should be handled as one, and then the second LinkADRReq enables 8 channels again so everything is ok.

This combining of LinkADRReq is described in the 1.0.2 spec.: The network server may include multiple LinkAdrReq commands within a single downlink message. For the purpose of configuring the end-device channel mask, the end-device will process all contiguous LinkAdrReq messages, in the order present in the downlink message, as a single atomic block command.

There is even an example in the US915 regional parameters (RP002-1.0.2) describing this exact scenario: Note: A common network server action may be to reconfigure a device through multiple LinkAdrReq commands in a contiguous block of MAC Commands. For example, to reconfigure a device from 64 channel operation to the first 8 channels could contain two LinkAdrReq, the first (ChMaskCntl = 7) to disable all 125 kHz channels and the second (ChMaskCntl = 0) to enable a bank of 8 125 kHz channels.

As far as I can determine LMIC handles the multiple LinkADRReq fine and as one block. The issue seems to be in the check of an individual LinkAdrReq, there there is a (legacy?) check that an individual LinkAdrReq does not disable all channels. However that is exactly what happens here in the first LinkADRReq, and as channels are enabled again in the second LinkADRReq in the same download that is ok and doesn't leave the device without channels. Actually there is already a check after processing all LinkAdrReq that not all channels are disabled, the return value of LMICbandplan_mapChannels on the last LinkADRReq (function applyAdrRequests in lmic.c, "map_ok = LMICbandplan_mapChannels(chpage, chmap);")

The issue as far as I can determine is in the function LMICuslike_canMapChannels in lmic_us_like.c:

    } else if (chpage == MCMD_LinkADRReq_ChMaskCntl_USLIKE_125ON ||
               chpage == MCMD_LinkADRReq_ChMaskCntl_USLIKE_125OFF) {
                u1_t const en125 = chpage == MCMD_LinkADRReq_ChMaskCntl_USLIKE_125ON;

        // if disabling all 125kHz chans, must have at least one 500kHz chan
        // don't allow reserved bits to be set in chmap.
        if ((! en125 && chmap == 0) || (chmap & 0xFF00) != 0)
            return 0;

The test on (! en125 && chmap == 0) should not be done.

Although I didn't encounter it, I think the same applies to the check for the new ChMaskCntl 5 introduced in 1.0.3 for the sub-bands. Again in the function LMICuslike_canMapChannels in lmic_us_like.c:

    } else if (chpage == MCMD_LinkADRReq_ChMaskCntl_USLIKE_BANK) {
        if (chmap == 0 || (chmap & 0xFF00) != 0) {
            // no bits set, or reserved bitsset , fail.
            return 0;
        }

Here the test on chmap == 0 should not be done.

After removing these 2 tests the complete function LMICuslike_canMapChannels can be simplified, especially as all ChMaskCntl values 4 and higher have 8 channels/subbands instead of 16 so only those need to check there are no more than 8 in the ChMask. This simplifies the function to just:

// verify that a given setting is permitted
bit_t LMICuslike_canMapChannels(u1_t chpage, u2_t chmap) {
    /*
    || MCMD_LinkADRReq_ChMaskCntl_USLIKE_500K only has 8 500 kHz channels,
    || the same applies for all the higher special ChMaskCntl values.
    || MCMD_LinkADRReq_ChMaskCntl_USLIKE_BANK is the exception, then it is
    || subbands instead of channels, still 8 so we can use the same logic.
    */
    if (chpage >= MCMD_LinkADRReq_ChMaskCntl_USLIKE_500K) {
        if ((chmap & 0xFF00) != 0) {
            // reserved bitsset , fail.
            return 0;
        }
    }
    // if we get here, it looks legal.
    return 1;
}

I tested this against TTNv3 and the issue is resolved, the flash size was also reduced by 26 bytes for my build.

I only get this issue when the device is set to 1.0.2 in TTNv3, when the device is set to 1.0.3 in TTNv3 the channels are already set in the CFList of the join-accept (new for US in 1.0.3), however LMIC completely ignores that CFList. This is a seperate issue that I am still looking into.

terrillmoore commented 3 years ago

Sorry you're having problems and thanks for the research.

If you check the commit log, you'll see that the changes in this area are required for passing the compliance checks for 1.0.3 (in particular many of the checks for the channel masks). So we can't take them out completely; and we won't support 1.0.2, as we have no resources to do that. (We have a commitment to passing the LoRaWAN regression tests, and every test to support 1.0.3 mainline causes us to have to retest 1.0.3.)

We will solve bugs for 1.0.3, but please bear in mind that we pass the compliance tests for 1.0.3 apart from channel randomization (we are selecting channels with replacement, but the tests implicitly require a shuffle). The LMIC definitely doesn't ignore the CFList because that is extensively tested, both with TTN and with networks that use other-than subband1. TTNv3 is likely to be formatting mac commands in a way that breaks things for us, but that's a different issue.

I have tested my devices TTN V3 and seen no evidence of a problem. But I selected (in US) subband 1 (channel 8-15). I could tell from uplinks and downlinks that we were receiving all downlinks and not sending on unselected channels.

Thanks and best regards, --Terry

terrillmoore commented 3 years ago

Just to follow up:

Although I didn't encounter it, I think the same applies to the check for the new ChMaskCntl 5 introduced in 1.0.3 for the sub-bands. Again in the function LMICuslike_canMapChannels in lmic_us_like.c:

It's certain that I made mistakes in the code, and clearly there are issues, since you've seen some. .

But: we try to be very careful not to rely on correct network behavior (although I'm sure I've not found every place that is overly trusting of network behavior). This is a security thing. So we cannot rely on the network not to set illegal bits of the channel mask. That's what we're testing for the channel group commands, I think. In any case, this is the rule: please do not rely on correct network behavior, or make changes that are depending on correct network behavior.

The LoRaWAN compliance tests actually do some things to check that we won't, for example, lock up if the network turns off all the channel masks or sends things that might not make sense. They don't test all possibilities because it's not really possible to do so. But this is not about passing compliance, it's about establishing the invariant: "the LMIC will not act on formally invalid requests". Even though we're not using formal methods, this makes a larger set of assertions verifiable. We can relax this for important networks if they're broken, but we should not relax this until we encounter a broken network that cannot be fixed to become compliant.

Best regards, --Terry

nielskoot commented 3 years ago

Hi Terry,

I fully agree not to just change to whatever a specific network does. However if I detect an issue and after checking the specification I come to the conclusion the networks does it right and LMIC not, then I just want to bring this forward. If my interpretation of the specifications is wrong please let me know.

I reference TTNv3 with the device set to 1.0.2 as a way to reproduce the problem. I understand you just target 1.0.3 compliancy, but as far as I can see the specification didn't change in this respect. So in my opinion this is also an issue in 1.0.3. TTNv3 with the device set to 1.0.3 behaves differently so it cannot be reproduced that way, but that doesn't mean there is no problem with 1.0.3 too.

I fully understand if you don't want to make changes when it can only be reproduced on TTNv3 with 1.0.2. Note that it also reduces the flash size by 26 bytes for AVR, maybe that alone is an argument, especially as I don't expect it breaks the compliancy test (see below).

The reason I am chasing this and the CFList is that we recently encountered issues with a new network on US915. The join was successful but after that not a single application upload is received. Unfortunately I don't have access to that network yet and the people that have don't seem to be able to get me usable tracing/logging. I suspect that something goes wrong with the channels, but have no clue yet what. Obviously the network might be doing something wrong here, but I just want to make sure that from the device (i.e. LMIC) everything regarding the channels is handled correctly.

If I look at the commit history I think this functionality was introduced in two steps. First individual LinkADRReq where checked to ensue they didn't remove all channels: https://github.com/mcci-catena/arduino-lmic/commit/ea775f786091b152cbda3bc3da9986d32ef5c904# I suspect this change was made to pass the compliance tests.

Later a check was added that after processing all combined LinkADRReq there are still channels active: https://github.com/mcci-catena/arduino-lmic/commit/4a1839cbd1a59125f8b902714346710b2a4c302f# The function LMICbandplan_mapChannels only returns true when there are still channels active. In the loop parsing the LinkADRReq's this return value is stored in map_ok without immediate action. Only after the loop completed map_ok is checked, so only if no channels are left after all LinkADRReq are applied will they be rejected. Exactly as specified.

The second change should be enough to pass the compliancy tests I expect (just an assumption, I am not familiar with the actual tests). They make the 'no channels active' checks added in the first change obsolete, hence my proposal to remove those so the issue they cause is resolved.

nielskoot commented 3 years ago

Hi Terry,

One question on your remark that LMIC is not ignoring the CFList, that is true for 'EU like', but I am not sure about 'US like'. In lmic_us_like.h:

// no CFList on joins for US-like plans
#define LMICbandplan_hasJoinCFlist()    (0)

Or am I missing something here for US like?