jgromes / RadioLib

Universal wireless communication library for embedded devices
https://jgromes.github.io/RadioLib/
MIT License
1.42k stars 359 forks source link

SX1262 receive error #575

Closed SolarDaniel closed 1 year ago

SolarDaniel commented 1 year ago

Hi

At first many thanks to @jgromes for this excellent library.

I ran into a problem with receiving unknown packets that disturbed my communication. In short: I have built a LORA remote control (sender and receiver) based on

My neighbour has a rf-Solutions remote control also based on LORA. Unfortunately they have the same frequency and somehow same setup, that desturb my communication.

The problem is, that frames sent by rf-Solutions do not match my parameters identically. Though that should not be an issue, since there can be LORA messages with any configurations in the air at any time, but my receive method fails with getPacketLength() with 0 and subsequent access to any radiolib calls fail with "modem not found". The method getPacketLength() returns the packet length, but if an error occurs, there's no access to the error (maybe enhance that in future?). So I placed a serial print statement there and it showed an error -705 RADIOLIB_ERR_SPI_CMD_TIMEOUT.

size_t SX126x::getPacketLength(bool update)
{
    (void)update;
    uint8_t rxBufStatus[2] = {0, 0};
    int error = SPIreadCommand(RADIOLIB_SX126X_CMD_GET_RX_BUFFER_STATUS, rxBufStatus, 2);
    Serial.print("getPacketLength error: ");Serial.println(error);
    return((size_t)rxBufStatus[0]);
}

Any subsequent call to readData, startReceive or startTransmit will fail with -2 RADIOLIB_ERR_CHIP_NOT_FOUND. With the call to standBy() you can recover the chip. If it happens, that my neighbour opens his garage door the same time as I expect a message from my remote my system hangs. I think it can't be a chip issue, it has to be a radiolib issue.

Tests I made:

Setup:

With my Lopy (SX1276) I have written a Python script to emulate the communication with the rf-Solutions receiver. The rf-Solutions parameters are unknown, but the communication is stable and reliable with above settings. The protocol is simple and obviously. No issues at all. Next step was the implementation of the same test on my Adafruit NRF52840 feather with E22 900M30S (SX1262). Result: always (with exceptions) the same. Interrupt after receiving a packet always arrives, but getPacketLength() returns zero and error -705. It happens every ~hundred packets, that the packet is received correctly. The rf-Solutions receiver gets the commands correctly, but the responses lead to this issue. The chip is blocked and unusable, but I found, that with a standby() after getPacketLength() the chip is recoverable.

Adafruit NRF52840 <> Adafruit NRF52840: no problems LoPy <> LoPy: no problems rf-Solutions <> LoPy: no problems Adafruit NRF52840 <> LoPy: no problems rf-Solutions < Adafruit NRF52840: no problems rf-Solutions > Adafruit NRF52840: !!!!!!!!!!!!!

void Test()
{
    int error = RADIOLIB_ERR_NONE;
    int state;
    byte buffer[32];
    byte data[] = "\x00\xAB\x31\xAA\x01\x00\x01\x00\x00\x1F\xA0\x8F\x8F"; // a valid rf-Solutions command
    size_t dataLength = sizeof(data) - 1;   // exclude string terminating zero

    while (true)
    {
        state = lora.startTransmit(data, dataLength);
        if (state == RADIOLIB_ERR_NONE)
        {
            // this is my own LowPower implementation, works very well
            // the Arduino LowPower library is incomplete, does not work at all and is bullshit
            // even if they say it is suitable for nrf52
            wakeup_from_sleep_reason wakeupReasen;
            do
            {
                wakeupReasen = LowPower.sleep();
            } while (wakeupReasen != LORA_DIO1_INTERRUPT);

            int timeout = 5000 * (1/0.015625);  // 15.625 us 1 ms / 0.015625 us
            int16_t state = lora.startReceive(timeout);

            if (state == RADIOLIB_ERR_NONE)
            {
                // here we wait on the receive interrupt
                wakeup_from_sleep_reason wakeupReasen;
                do
                {
                    wakeupReasen = LowPower.sleep();
                } while (wakeupReasen != LORA_DIO1_INTERRUPT);

                size_t packetSize = lora.getPacketLength();
                Serial.print(F("packet size ")); Serial.println(packetSize);
                if (packetSize == 0)
                {
                    // standby() is necessary, otherwise modem is not found on subsequent access to modem, 
                    // getPacketLength has error -705
                    // call to readData or next startReceive lead to -2, somehow SPI hangs
                    lora.standby(); 
                    Serial.println(F("receive packet size 0"));
                    state = PROTOCOL_RECEIVE_DATA_0;    // my own added state number (+1), radiolib is alsways negative
                }
                else    // everything ok
                {
                    state = lora.readData(buffer, packetSize);
                    if (state == RADIOLIB_ERR_NONE)
                    {
                        Serial.printBuffer(buffer, packetSize); Serial.println();
                    }
                    else
                    {
                        if (error == RADIOLIB_ERR_CRC_MISMATCH)
                        {
                            Serial.println("Lora CRC mismatch");
                        }
                        else if (error == RADIOLIB_ERR_RX_TIMEOUT)
                        {
                            Serial.println("RX timeout");
                        }
                        else
                        {
                            Serial.print(F("receive read data failed, code ")); Serial.println(state);
                        }
                    }

                }
            }
            else
            {
                Serial.print(F("receive start failed, code "));
                Serial.println(state);
            }
            delay(10000);
        }
        else
        {
            // some other error occurred
            Serial.print(F("send failed, code ")); Serial.println(state);
        }
    }
}

Thanks for any help.

jgromes commented 1 year ago

Though that should not be an issue, since there can be LORA messages with any configurations in the air at any time

This statements is incorrect - if the two LoRa transmitters have the same frequency (+- 25% of bandwidth) and the same spreading factor, they will interfere.

Next, are you sure it is the interference that's causing the problem? Seeing as you are using reception with timeout in your code, this seems oddly similar to an issue I reported on the LoRa dev forum a while ago: https://forum.lora-developers.semtech.com/t/sx126x-strange-behavior-after-rx-timeout/1080

Try to replicate the issue with a simpler setup.

SolarDaniel commented 1 year ago

Thanks for responding so quickly. I think I already broke it down to the simplest setup. A well tested (with all other devices mentioned) snippet of code. Sending a valid message to the DUT and reading the response. A simple loop: startTransmit, wait for interrupt, receive, wait for interrupt, read data. A second device (LoPy) just listening the same time and printing the messages to the terminal. The DUT receives the command correctly, acts upon it and sends a response immediatly as can be seen with the second listening only device. I'm not talking of interference here. It is a forced communication with known parameters, that work with any other device I used. Unfortunately I don't know the exact Lora parameters of the DUT, except the frequency, but with all other devices using SX1272/6 it works well. The test work well with SX1262 <> SX1262 and SX1272 <> SX1262 with same parameters. As far as I know the code rate can be different. Frequency, SF and BW must match. According to a blog about syncword compatibility https://blog.classycode.com/lora-sync-word-compatibility-between-sx127x-and-sx126x-460324d1787a, syncwords may be different in some parts. I made a scan and found that the DUT responds to all syncwords 0xn2 where n is from 0 to F. How the chip handles different preamble length I don't know.

Did you find a solution for https://forum.lora-developers.semtech.com/t/sx126x-strange-behavior-after-rx-timeout/1080?

SolarDaniel commented 1 year ago

I made another Test: startReceive() without timeout, using my own sleep timer.

void Test()
{
    int state;
    byte buffer[32];
    byte data[] = "\x00\xAB\x31\xAA\x01\x00\x01\x00\x00\x1F\xA0\x8F\x8F";

    while (true)
    {
        state = lora.startTransmit(data, sizeof(data) - 1); // exclude string terminating zero
        if (state == RADIOLIB_ERR_NONE)
        {
            wakeup_from_sleep_reason wakeupReasen;
            do
            {
                wakeupReasen = LowPower.sleep();
            } while (wakeupReasen != LORA_DIO1_INTERRUPT);

            //int timeout = 5000 * (1/0.015625);    // 15.625 us 1 ms / 0.015625 us
            //int16_t state = lora.startReceive(timeout);
            LowPower.setTimeout(5000);
            int16_t state = lora.startReceive();

            if (state == RADIOLIB_ERR_NONE)
            {
                // here we wait on the receive interrupt
                wakeup_from_sleep_reason wakeupReasen;
                do
                {
                    wakeupReasen = LowPower.sleep();
                    if (wakeupReasen == WAKEUP_FROM_SLEEP_TIMER)
                    {
                        break;
                    }
                } while (wakeupReasen != LORA_DIO1_INTERRUPT);

                LowPower.clearTimeout();
                if (wakeupReasen == WAKEUP_FROM_SLEEP_TIMER)
                {
                    lora.standby();
                    Serial.println("rcv timeout");
                    delay(5000);
                    continue;;
                }

                size_t packetSize = lora.getPacketLength();
                if (packetSize == 0)
                {
                    lora.standby(); // this is necessary, otherwise modem is not found (error -2) after error -705
                    Serial.println(F("receive packet size 0"));
                    state = PROTOCOL_RECEIVE_DATA_0;
                }
                else    // everything ok
                {
                    state = lora.readData(buffer, packetSize);
                    if (state == RADIOLIB_ERR_NONE)
                    {
                        Serial.printBuffer(buffer, packetSize); Serial.println();
                    }
                    else
                    {
                        if (state == RADIOLIB_ERR_CRC_MISMATCH)
                        {
                            Serial.println("Lora CRC mismatch");
                        }
                        else if (state == RADIOLIB_ERR_RX_TIMEOUT)
                        {
                            Serial.println("RX timeout");
                        }
                        else
                        {
                            Serial.print(F("receive read data failed, code ")); Serial.println(state);
                        }
                    }

                }
            }
            else
            {
                Serial.print(F("receive start failed, code "));
                Serial.println(state);
            }
        }
        else
        {
            // some other error occurred
            Serial.print(F("send failed, code ")); Serial.println(state);
        }
        delay(10000);
    }
}

It looks like the Interrupt I got with the SX1262 was a timeout, not a "message received interrupt". How can I ask the SX1262 for the interrupt reason (timeout or message received) to avoid calling getPacketLength() ? With getPacketLength() I do not get any error information, that's hidden.

So radiolib behaves correctly. The only thing I would wish is a function to ask for the interrupt reason or another way of handling the error internally. For the moment a call to standBy() after getPacketLength() return zero recovers from the error.

Does anyone know, why SX1272/6 can receive the message and SX1262 can't?? Any differences in the chips? Stronger filters for messages that do not match the parameters exactly? The SX1262 is newer.

SolarDaniel commented 1 year ago

I'm sorry for my verbosity, but I made some progress.

When inspecting the source code I noticed there is a function getIrqStatus() to get the interrupt reason. You can call it only in modem standby after a timeout. The same for getPacketLength(). The reason is, that after a timeout, the modem stays in receive mode and the registers are not accessible. So after an interrupt you always have to call standby() first, check the interrupts and then proceed. When no timeout occurs this is not necessary, modem goes to standby automatically, but you never know if its a timeout or not. The function readData() does this at the beginning, but returns eventuelly the sended data instead of nothing. That happened to me several times. When using state = readData(buffer,length) you never know how much data was read. Though you need to call getPacketLength() first and therefore standby() is necessary.

It also happened that the interrupts register was zero (no interrupts). I hope this information helps anybody.

@jgromes Could you please confirm my experiences? Could you eventuelly update the interrupt examples, if I'm not wrong?

SolarDaniel commented 1 year ago

I should have spent some more time digging in the source code of radiolib and consulting the datasheet, before publishing anything here, but this way you can follow the steps necessary to find an issue like this. I came to the conclusion, that there is a major problem handling the status from the SX1262. The chip goes back to standby automatically after reception of a complete message or a timeout. The only thing, that is strange and not covered in the datasheet, is that after a receive timeout the status contains RADIOLIB_SX126X_STATUS_CMD_TIMEOUT, but it is reproduceable. It could be explained with the receiving command beeing aborted by timer and not beeing completed.

It's this part of code that I think is the problem.

The code of SPItransfer in SX126x.cpp (shortened to the relevant parts):

int16_t SX126x::SPItransfer(uint8_t* cmd, uint8_t cmdLen, bool write, uint8_t* dataOut, uint8_t* dataIn, uint8_t numBytes, bool waitForBusy, uint32_t timeout) {

  // pull NSS low
  _mod->digitalWrite(_mod->getCs(), LOW);

  // ensure BUSY is low (state machine ready)
......

  // start transfer
  _mod->SPIbeginTransaction();

  // send command byte(s)
  for(uint8_t n = 0; n < cmdLen; n++) {
    _mod->SPItransfer(cmd[n]);
  }

  // variable to save error during SPI transfer
  uint8_t status = 0;

  // send/receive all bytes
  if(write) {
    for(uint8_t n = 0; n < numBytes; n++) {
      // send byte
      uint8_t in = _mod->SPItransfer(dataOut[n]);

      // check status
      if(((in & 0b00001110) == RADIOLIB_SX126X_STATUS_CMD_TIMEOUT) ||
         ((in & 0b00001110) == RADIOLIB_SX126X_STATUS_CMD_INVALID) ||
         ((in & 0b00001110) == RADIOLIB_SX126X_STATUS_CMD_FAILED)) {
        status = in & 0b00001110;
        break;
      } else if(in == 0x00 || in == 0xFF) {
        status = RADIOLIB_SX126X_STATUS_SPI_FAILED;
        break;
      }
    }

  } else {
    // skip the first byte for read-type commands (status-only)
    uint8_t in = _mod->SPItransfer(RADIOLIB_SX126X_CMD_NOP);

    // check status
    if(((in & 0b00001110) == RADIOLIB_SX126X_STATUS_CMD_TIMEOUT) ||
       ((in & 0b00001110) == RADIOLIB_SX126X_STATUS_CMD_INVALID) ||
       ((in & 0b00001110) == RADIOLIB_SX126X_STATUS_CMD_FAILED)) {
      status = in & 0b00001110;
    } else if(in == 0x00 || in == 0xFF) {
      status = RADIOLIB_SX126X_STATUS_SPI_FAILED;
    } else {
      for(uint8_t n = 0; n < numBytes; n++) {
        dataIn[n] = _mod->SPItransfer(RADIOLIB_SX126X_CMD_NOP);
      }
    }
  }

  // stop transfer
  _mod->SPIendTransaction();
  _mod->digitalWrite(_mod->getCs(), HIGH);

  // wait for BUSY to go high and then low
.....
  // print debug output
......
  // parse status
......
}

Sequence:

In my opinion it does not make sense to check the failure of a command in the middle of the next command. It would make much more sense to me, issueing a getStatus (only 2 bytes) after each write command to check its success. According to the datasheet a getStatus can be sent at any time, even when busy. A check for all bytes transferred that do not change the status anyway, does not make sense.

My suggestion: Sequence:

In the case of a COMMAND_TIMEOUT/COMMAND_INVALID/COMMAND_FAILED error it's a failure by design. We can assume SPI to be working correctly by design. If we check all parameters to be sent to the chip on the API level properly, none of these errors can occur. One of these errorrs is a design error and should lead to system stop or reset. My design approach: A class member getError or similar, that returns the last cumulative error, instead of propagating it back to the top API level, where it gets lost as in getStatus() or getPacketLength(), etc.

For the moment, to get around the problem after a receive timeout, calling as first radiolib statement standby() solves the issue. Cause standby is a complete 2 byte command not aborted by the status check and therefor clears the status bits.

Well, I'm done now, it's your turn to comment or ask questions. I would appreciate a feedback.

SolarDaniel commented 1 year ago

@jgromes

comment out the status checks for read and write in SX126x::SPItransfer() beginning at line 1636 in SX126xx.cpp and everything works fine.

  // check status
   /*
   if(((in & 0b00001110) == RADIOLIB_SX126X_STATUS_CMD_TIMEOUT) ||
     ((in & 0b00001110) == RADIOLIB_SX126X_STATUS_CMD_INVALID) ||
     ((in & 0b00001110) == RADIOLIB_SX126X_STATUS_CMD_FAILED)) {
     status = in & 0b00001110;
     break;
  } else */
  if(in == 0x00 || in == 0xFF) {
    status = RADIOLIB_SX126X_STATUS_SPI_FAILED;
    break;
  }
jgromes commented 1 year ago

I will try to answer all the questions raised and clear up any possible confusion for future readers. Before that however, two observations:

Responses:

According to a blog about syncword compatibility https://blog.classycode.com/lora-sync-word-compatibility-between-sx127x-and-sx126x-460324d1787a syncwords may be different in some parts.

LoRa sync word is not a 100% filter - it's not like an FSK sync word, the naming is somewhat unfortunate. LoRa sync word affects the receiver sensitivity, and unless you can measure that impact, you should never use any other values than what Semtech recommends. The sync word on SX127x and SX126x is not actually different. It's the same sync word (0x12 for private network, 0x34 for LoRaWAN), the encoding on SX126x is just different (there are extra bits added into the sync word register, that's why 0x1424 on SX126x corresponds to 0x12 on SX127x - in RF, it's the same sync word).

Did you find a solution for https://forum.lora-developers.semtech.com/t/sx126x-strange-behavior-after-rx-timeout/1080

No, I did not.

Could you eventuelly update the interrupt examples, if I'm not wrong?

How? I'm sorry, but your posts are really unclear.

The only thing, that is strange and not covered in the datasheet, is that after a receive timeout the status contains RADIOLIB_SX126X_STATUS_CMD_TIMEOUT, but it is reproduceable. It could be explained with the receiving command beeing aborted by timer and not beeing completed.

How do you know that? As you say, the datasheet mentions nothing about that after Rx timeout, the next command sent should return an "SPI timeout" error code. So until Semtech responds, there's no way to tell if this behavior is intended or not, and I'd rather err on the side of caution here.

In my opinion it does not make sense to check the failure of a command in the middle of the next command. It would make much more sense to me, issueing a getStatus (only 2 bytes) after each write command to check its success.

Again, you are assuming functionality that is unspecified in the datasheet. Adding a getStatus() after a command was sent might be safer, however, documentation doesn't specify when the status byte in the SPI transaction is updated. On each new SPI byte? After SPI transaction is done (NSS goes high)? After some change in the BUSY signal? Etc.

In the case of a COMMAND_TIMEOUT/COMMAND_INVALID/COMMAND_FAILED error it's a failure by design. We can assume SPI to be working correctly by design.

I disagree - I can easily imagine improper hardware inducing these faults randomly, for example by interference. You cannot assume communication is working 100% of the time, the device may receive something different from what you sent.

comment out the status checks for read and write in SX126x::SPItransfer() beginning at line 1636 in SX126xx.cpp and everything works fine.

Of course "everything works fine" when you disable the part telling you if it's not ;)

SolarDaniel commented 1 year ago

Thanks for the clarification. I apologize for the confusing posts. It is my first post here and I just used the buttons in the header without knowing how it works. Now I know at least the code block. '''c++ ..., so sorry. Somewhere I said, that I should have spent some more time digging in the source code of radiolib and consulting the datasheet, before publishing anything here, but this way you can follow the steps necessary to find an issue like this. I was hoping for help and as you always complain about, tried to give as much information as possible. I didn't want to blame you in any way.

I don't wanna fight with you about failure by design, may be you'r right.

I also used the word "could be", not it's a fact. I didn't want to claim something.

I would suggest you document somewhere. that following a receive timeout only standby() or a reset can be used to recover from the error. Even standby() returns an error, but the command is transmitted successfully and resets the error.

Since you know about the problem it would be nice to have a line of comment in the receive interrupt example or somewhere under "known issues" in the readme file.

Thanks anyway for the library. I just wanted to help.

You can delete this post now ( I don't know how), mybe it's too verbose and confusing for future readers.

jgromes commented 1 year ago

Sorry, it was not my intention to come across as irritated in any way - I truly appreciate all feedback about the library. You took the time to investigate an issue, and I would very much like to take the time to improve the user experience.

That being said, it's still not 100% clear to me what exactly are you proposing. Here's what I gathered:

  1. To improve the SPI transaction so that there's an extra call to getStatus, to ensure the previous command was executed. There's actually a precedent for this - SX127x checks all SPI register writes with a subsequent read to make sure what was written in the register matches the expected value. This improves reliability at the cost of increased number of SPI transactions, and can be disabled by disabling the RADIOLIB_SPI_PARANOID macro. I can easily imagine adding the getStatus call to SX126x, guarded by the same macro.

  2. Error codes are sometimes not propagated (e.g. in getPacketLength). This is an issue, and mostly comes down to my own laziness. I usually try to propagate everything, but sometimes it just seems redundant, or complicates the interface in some way. That's definitely something I intend to improve in the long run (e.g. by using an internal variable to cache the error code, and then check the cached value, I think that is similar to what you proposed with getError).

Could you confirm that those are the two points you'd like improved?

SolarDaniel commented 1 year ago

As you may have noticed, I gained more and more insights with every post as I spent a lot of time debugging and analyzing your code. So the last ones are more accurate.

When analyzing (almost) every command I realized that the status never changed during a command. But after the busy line falling again, status is updated. That makes sense to me, cause the SPI transfer is probably just some state machine (some logic), that counts the necessary bytes and verifys and executes the command after it's complete.

That brought me first to the conclusion, that asking for the actual status of the command after it's sent is probably the best time. But meanwhile I think getting the status before sending a command would be better as the status can change from receive to standby for example. Some commands may be sent at any time like getStatus, but others not. So checking the status first would help restricting commands only allowed for the actual state and handle the error before it's done. A two byte transfer is not a huge overhead compared to the LORA speed.

I would not check the status in the SPItransfer function, but in each function that calls SPItransfer. These are not so many. For me the transfer is logically independent of the chip status (Layer separation). Getting the status via SPItransfer() would call itself recursively. A SPI timeout error in the status is in my opinion missleading named. I's not a SPI error, but more an incomplete command error (sending not enough bytes). Again, a well designed board has no SPI errors. That's not the same as a UART with cabling outside the board. That's why I tend to not be paranoid (good naming for your option) on SPI.

Testing of the status could be done like you do with the ASSERT(state), but maybe like this ASSERT_STATE(recommended state) for example.

Concerning error propagation I'd like to use exceptions. But this library is intended for memory constrained systems and many do not support exceptions. On my nrf52840 I didn't succeed to bring exceptions to work.

On UNIX systems it's usual to have a getLastError() function. In C libraries often negative return values indicate an error, when the return value has a positive range. In getPacketLength() this kind of error handling would be suitable. But it should be consistent over all API-functions.

Alternatively use strict return values as error and returning values by reference.

That's all up to you, the author of the library. But keep in mind not to make too many API changes.

SolarDaniel commented 1 year ago

When reading through the datasheet I noticed a possible bug??. In SX126x.cpp at line 1543 there is a possible error. According to the datasheet the register should be 0x0902 instead of 0x0920 as written in the text. The coding example given is probably wrong. It writes to RADIOLIB_SX126X_REG_DIO3_OUT_VOLTAGE_CTRL register, what does not make sense to me, whereas the 0x0902 is the RADIOLIB_SX126X_REG_RTC_CTRL, what makes more sense.

It's not your fault, the datasheet is missleading. I didn't want to open a bug report, cause I don't know exactly and didn't test it. It's just a hint that it could be erroneous.

jgromes commented 1 year ago

@SolarDaniel I implemented the checking of SPI status after a the SPI transaction - it's a good suggestion.

In addition, the last SPI error can be retrieved by a getLastError method. At the moment it's just in the SX126x and SX128x classes, but if this approach turns out beneficial, I'll introduce it to all modules.

I also checked the incorrect register. You're probably right, though I still posted this question to the Semtech forum (https://forum.lora-developers.semtech.com/t/sx126x-implicit-header-mode-timeout-workaround/1299), so maybe somebody will reply.

I think that pretty much concludes all the points raised previously, so I will close this issue now. Again, thanks for the feedback, it is appreciated!