sparkfun / SparkFun_Swarm_Satellite_Arduino_Library

An Arduino library to enable communications with the Swarm M138 satellite modem
Other
28 stars 11 forks source link

Checksum calculation errors #32

Closed hayden-ernst closed 1 year ago

hayden-ernst commented 1 year ago

Hi,

I've been digging into checksum errors that we have been getting while attempting to queue messages. This issue was found while receiving unsolicited messages at a fairly high rate of one every 5 seconds and also trying to queue messages . This issue was intermittent so it was difficult to track down. However, the error was caught during debugging:

15:31:12.588 -> sendCommandWithResponse: ====> 15:31:12.588 -> sendCommand: Command: $TD HD=34819200,"1673299804+9999+9999+9999+9999+9999+999+9999+9999+999+999+999+999+9+408563-966562+367+2+SH2_01;1673299865+9999+9999+9999+9999+9999+999+9999+9999+999+999+9"*4a 15:31:12.588 -> 15:31:12.635 -> waitForResponse: $TD OK,4601487328959*2c 15:31:12.635 -> 15:31:12.635 -> checksum = 50 15:31:12.635 -> expectedChecksum = 2C 15:31:12.635 -> checkChecksum: invalid checksum 15:31:12.635 -> sendCommandWithResponse: <====

'checksum' and 'expectedChecksum' were variables printed from the 'checkChecksum()' function in the 'SparkFun_Swarm_Satellite_Arduino_Library.cpp' file. 'checksum' is the value as calculated from the string and 'expectedChecksum' is the value parsed from the string. With this example, we can see that the checksum found on the response is 2C and the expectedChecksum matches this but the checksum does not match.

Looking at a larger number of communications also helps to give us a picture of what may be happening:

15:16:58.946 -> checkUnsolicitedMsg: event(s) found! ===> 15:16:58.946 -> checkUnsolicitedMsg: start of event: $RT RSSI=-109*15 15:16:58.946 -> checksum = 15 15:16:58.946 -> expectedChecksum = 15 15:16:58.994 -> checkUnsolicitedMsg: end of event 15:16:58.994 -> checkUnsolicitedMsg: <=== end of event(s)! 15:17:07.835 -> sendCommandWithResponse: ====> 15:17:07.880 -> sendCommand: Command: $TD HD=34819200,"1673298960+9999+9999+9999+9999+9999+999+9999+9999+999+999+999+999+9+408564-966562+359+2+SH2_01;1673299020+9999+9999+9999+9999+9999+999+9999+9999+999+999+9"*4b 15:17:07.880 -> 15:17:07.926 -> waitForResponse: $TD OK,4601487328952*27 15:17:07.926 -> 15:17:07.926 -> checksum = 15 15:17:07.926 -> expectedChecksum = 15 15:17:07.926 -> sendCommandWithResponse: <==== 15:17:07.926 -> sendCommandWithResponse: ====> 15:17:07.926 -> sendCommand: Command: $MT C=U*12 15:17:07.926 -> 15:17:08.535 -> waitForResponse: $MT 1*08 15:17:08.535 -> 15:17:08.535 -> checksum = 15 15:17:08.535 -> expectedChecksum = 15 15:17:08.535 -> sendCommandWithResponse: <==== 15:17:08.535 -> The message has been added to the transmit queue. The message ID is 4601487328952 15:17:08.582 -> Waiting for the $TD SENT 15:17:09.562 -> checkUnsolicitedMsg: backlog found! backlog length is 41 15:17:09.562 -> checkUnsolicitedMsg: event(s) found! ===> 15:17:09.562 -> checkUnsolicitedMsg: start of event: $RT RSSI=-109*15 15:17:09.562 -> checksum = 15 15:17:09.609 -> expectedChecksum = 15 15:17:09.609 -> checkUnsolicitedMsg: end of event 15:17:09.609 -> checkUnsolicitedMsg: start of event: $TD OK,4601487328952*27 15:17:09.609 -> checksum = 27 15:17:09.609 -> expectedChecksum = 27 15:17:09.609 -> checkUnsolicitedMsg: end of event 15:17:09.609 -> checkUnsolicitedMsg: start of event: $RT RSSI=-110*1d 15:17:09.609 -> checksum = 1D 15:17:09.609 -> expectedChecksum = 1D 15:17:09.657 -> checkUnsolicitedMsg: end of event 15:17:09.657 -> checkUnsolicitedMsg: <=== end of event(s)!

Here, we can see that the checksum for the first receive test unsolicited message is calculated correctly. Then, both the checksum and expectedChecksum values do not match up with the checksum from the return after the queue message is sent. The received string shows a checksum of 27 but they both stay as 15. The same is true for the response to the get unsent messages count command. It has a checksum of 08 but again we are stuck at 15. The values do still match, so no error is detected. Then when the backlog is being pruned, we find that the checksums begin to match up with the message strings again.

This looked to me like the string being passed into the 'checkChecksum()' function was not being updated. So I looked into where this was and found it in the function below. If we look at the bottom where the checksum is actually checked, we see that the '_swarmBacklog' buffer is passed into the function at the index 'responseStartedAt'. However, when we look at where that index is found, we see that the 'responseDest' buffer is the one being iterated through to find our response. This means that the index we are finding probably won't match with the location of the response in the _swarmBacklog, especially if we are receiving unsolicited messages which are put in there. It is most likely corresponding to the first message in the _swarmBacklog which in this case is probably a receive test message. This will usually result in the wrong message string being checked for a correct checksum which will normally still find that the checksums match. However, it can cause unexpected results.

Swarm_M138_Error_e SWARM_M138::waitForResponse(const char *expectedResponseStart, const char *expectedErrorStart,
                                               char *responseDest, size_t destSize, unsigned long timeout)
{
  unsigned long timeIn;
  bool found = false;
  size_t destIndex = 0;
  bool responseStartSeen = false, errorStartSeen = false;
  int responseIndex = 0, errorIndex = 0;
  size_t responseStartedAt = 0, errorStartedAt = 0;
  Swarm_M138_Error_e err = SWARM_M138_ERROR_ERROR;

  bool printedSomething = false;

  timeIn = millis();

  while ((!found) && ((timeIn + timeout) > millis()))
  {
    size_t hwAvail = hwAvailable();
    if (hwAvail > 0) //hwAvailable can return -1 if the serial port is NULL
    {
      if ((destIndex + hwAvail) < destSize) // Check there is room to store the response (with a null on the end!)
      {
        size_t bytesRead = hwReadChars((char *)&responseDest[destIndex], hwAvail);

        if (_printDebug == true)
        {
          if (printedSomething == false)
          {
            _debugPort->print(F("waitForResponse: "));
            printedSomething = true;
          }
          _debugPort->print((const char *)&responseDest[destIndex]);
        }

        // Check each character to see if it is the expected resonse or error
        for (size_t chrPtr = destIndex; chrPtr < (destIndex + bytesRead); chrPtr++)
        {
          char c = responseDest[chrPtr]; // Check each character

          if (c == expectedResponseStart[responseIndex])
          {
            if ((responseIndex == 0) && (responseStartSeen == false))
              responseStartedAt = chrPtr; // Record where the (possible) response started
            if (++responseIndex == (int)strlen(expectedResponseStart))
            {
              responseStartSeen = true;
            }
          }
          else
          {
            responseIndex = 0;
          }

          if (expectedErrorStart != NULL)
          {
            if (c == expectedErrorStart[errorIndex])
            {
              if ((errorIndex == 0) && (errorStartSeen == false))
                errorStartedAt = chrPtr; // Record where the (possible) error started
              if (++errorIndex == (int)strlen(expectedErrorStart))
              {
                errorStartSeen = true;
              }
            }
            else
            {
              errorIndex = 0;
            }
          }

          // Set found to true if we have seen expectedResponseStart or expectedErrorStart _and_ have received a \n
          if ((responseStartSeen || errorStartSeen) && (c == '\n'))
            found = true;
        }

        // Now also copy the response into the backlog, if there is room
        // _swarmBacklog is a global array that holds the backlog of any events
        // that came in while waiting for response. To be processed later within checkUnsolicitedMsg().
        // Note: the expectedResponse or expectedError will also be added to the backlog.
        // Everything in the backlog is 'printable'. So it is OK to use strlen - so long as there is a
        // \0 at the end of the buffer!
        size_t backlogLength = strlen((const char *)_swarmBacklog);

        if ((backlogLength + bytesRead) < _RxBuffSize) // Is there room to store the new data?
        {
          memcpy((char *)&_swarmBacklog[backlogLength], (char *)&responseDest[destIndex], bytesRead);
        }
        else
        {
          if (_printDebug == true)
          {
            if (printedSomething == true)
            {
              _debugPort->println();
              printedSomething = false;
            }
            _debugPort->println(F("waitForResponse: Panic! _swarmBacklog is full!"));
          }
        }

        destIndex += bytesRead; // Increment destIndex
      }
      else
      {
        if (_printDebug == true)
        {
          if (printedSomething == true)
          {
            _debugPort->println();
            printedSomething = false;
          }
          _debugPort->println(F("waitForResponse: Panic! responseDest is full!"));
        }
      }
    }
    else
      delay(1);
  }

  if (_printDebug == true)
    if (printedSomething)
      _debugPort->println();

  if (found == true)
  {
    if (errorStartSeen) // Error needs priority over response as response is often the beginning of error!
    {
      // if (_printDebug == true)
      // {
      //   _debugPort->print(F("waitForResponse: errorStart: "));
      //   _debugPort->println((char *)&_swarmBacklog[errorStartedAt]);
      // }
      err = checkChecksum((char *)&_swarmBacklog[errorStartedAt]);
      if (err == SWARM_M138_ERROR_SUCCESS)
      {
        extractCommandError((char *)&_swarmBacklog[errorStartedAt]);
        err = SWARM_M138_ERROR_ERR;
      }
    }
    else if (responseStartSeen)
    {
      // if (_printDebug == true)
      // {
      //   _debugPort->print(F("waitForResponse: responseStart: "));
      //   _debugPort->println((char *)&_swarmBacklog[responseStartedAt]);
      // }
      err = checkChecksum((char *)&_swarmBacklog[responseStartedAt]);
    }
  }
  else
    err = SWARM_M138_ERROR_TIMEOUT;

  pruneBacklog(); // Prune any incoming non-actionable URC's and responses/errors from the backlog

  return (err);
}

As a temporary fix, I changed the calls to the checkChecksum function here to look like this: err = checkChecksum((char *)&responseDest[responseStartedAt]);; After implementing this change, the checksums seemed to match up to those on the received messages and after testing this with a large number of messages, we have not seen the error again.

Anyway, that's just what I found. Please take a look and see if you guys end up with similar conclusions.

Best, Hayden

PaulZC commented 1 year ago

Hi Hayden (@hayden-ernst ),

Thank you for reporting this. Excellent detective work!

I will investigate and test this tomorrow, but - having read through the code - I do agree with your conclusion and fix. If I understand correctly, checkChecksum will return the correct value if the backlog is empty. If the backlog contains another message, then, yes, the index is not correct and the wrong checksum is calculated.

Wow. Great find!

Until tomorrow, Paul

PaulZC commented 1 year ago

Hi Hayden,

I wasn't able to fully replicate the issue you were seeing as I did not try to queue messages, but I was able to see responseDest and _swarmBacklog become 'out of sync'. The checksum was indeed being calculated on the wrong data. I've corrected this in v1.1.10.

Thanks again for reporting this.

Very best wishes, Paul