AT Device Error Handling

wrh-dev commented 1 year ago

Some routines ignore AT device errors, such as registerNetwork() in u_cell_net.c and connect() in u_cell_mqtt.c. This results in the code waiting for a timeout even though I believe the module has already given up on the command.

To get around this, does it make sense to add checks for device errors in these and similar routines?

For example, does updating the connect() routine as shown below make sense?

...
    // Note that we retry this if the failure was due to radio conditions
    do {
        uAtClientLock(atHandle);
        pUrcStatus->flagsBitmap = 0;
        // Have seen this take a little while to respond
        uAtClientTimeoutSet(atHandle, 15000);
        uAtClientCommandStart(atHandle, MQTT_COMMAND_AT_COMMAND_STRING(mqttSn));
        // Conveniently log-in/connect is always command 0 and
        // log out/disconnect is always command 1
        uAtClientWriteInt(atHandle, (int32_t) onNotOff);
        if (U_CELL_PRIVATE_HAS(pInstance->pModule,
                               U_CELL_PRIVATE_FEATURE_MQTT_SARA_R4_OLD_SYNTAX)) {
            uAtClientCommandStop(atHandle);
            // Don't need to worry about the MQTT-SN form of the AT
            // command here since the old syntax SARA-R4's do not
            // support MQTT-SN
            uAtClientResponseStart(atHandle, "+UMQTTC:");
            // Skip the first parameter, which is just
            // our UMQTTC command number again
            uAtClientSkipParameters(atHandle, 1);
            status = uAtClientReadInt(atHandle);
            uAtClientResponseStop(atHandle);
        } else {
            uAtClientCommandStopReadResponse(atHandle);
            /*** NEW CODE START ***/
            // The library code ignores device errors here, e.g "+CME ERROR: operation not allowed",
            // which is issued if this command is sent before a previous MQTT command was finished.
            // Also factor in errors here so we don't wait for a command that was never accepted.
            uAtClientDeviceError_t deviceError;
            uAtClientDeviceErrorGet(atHandle, &deviceError);
            status = (deviceError.type == U_AT_CLIENT_DEVICE_ERROR_TYPE_NO_ERROR) ? 1 : 0;
            /*** NEW CODE END   ***/
        }
...

Does this make sense for the registerNetwork routine? I feel less confident in this change because I do occasionally see CME ERROR: No network service issued about 10-20 seconds after attempting a manual registration which results in the connection attempt failing. Not sure if the module would recover gracefully if given more time...

...
        uPortLog("U_CELL_NET: registering on %s...\n", pMccMnc);
        uAtClientLock(atHandle);
        uAtClientTimeoutSet(atHandle, 1000);
        uAtClientCommandStart(atHandle, "AT+COPS=");
        // Manual mode
        uAtClientWriteInt(atHandle, 1);
        // Numeric format
        uAtClientWriteInt(atHandle, 2);
        // The network
        uAtClientWriteString(atHandle, pMccMnc, true);
        uAtClientCommandStop(atHandle);
        /*** NEW CODE START ***/
        bool deviceErrorDetected = false;
        /*** NEW CODE END   ***/
        while (keepGoing && keepGoingLocalCb(pInstance)) {
            uAtClientResponseStart(atHandle, NULL);
            keepGoing = (uAtClientErrorGet(atHandle) < 0);
            /*** NEW CODE START ***/
            // The library code ignores errors here, e.g "+CME ERROR: no network service".
            // Also factor in errors here so we don't wait for something that will never finish.
            uAtClientDeviceError_t deviceError;
            uAtClientDeviceErrorGet(atHandle, &deviceError);
            deviceErrorDetected = (deviceError.type != U_AT_CLIENT_DEVICE_ERROR_TYPE_NO_ERROR);
            keepGoing = keepGoing && !deviceErrorDetected;
            /*** NEW CODE END   ***/
            uAtClientClearError(atHandle);
            uPortTaskBlock(1000);
        }
        uAtClientResponseStop(atHandle);
        uAtClientUnlock(atHandle);
        if (keepGoing) {
            // If we never got an answer, abort the
            // command first.
            abortCommand(pInstance);
        }
        /*** NEW CODE START ***/
        // Let the registration outcome be decided
        // by the code block below, driven by the URCs
        #if 0
        keepGoing = true;
        #else
        // Update: only keep going if we didn't explicitly get an error.
        keepGoing = !deviceErrorDetected;
        #endif
        /*** NEW CODE END   ***/

Also, is the logic for the abortCommand check backwards? Should this:

        if (keepGoing) {
            // If we never got an answer, abort the
            // command first.
            abortCommand(pInstance);
        }

be changed to:

        if (!keepGoing) {
            // If we never got an answer, abort the
            // command first.
            abortCommand(pInstance);
        }

RobMeades commented 1 year ago

Thanks for this, very useful ideas.

The MQTT code, in particular, is a bit of a rats-nest of ifs and buts as it tries to cope with our many and varied MQTT AT command interfaces, so I kind of lost confidence in making it more complex, if you see what I mean, when it was originally written. However it was coded quite a long time ago now and so making a relatively small, discrete, change, as you suggest would probably be fine and, should it not work out, it could easily/obviously be reversed in a subsequent commit.

On the registration one, as you say that's likely more complex - when is an error a "final, I'm not gonna do anything" error, versus just a whiney module. Let me consult with application engineering on this one.

I think you're right about my logic with the keepGoing flag though: problem with his kind of code is that it's quite difficult to check the condition reliably and across all modules without taking more time than I usually like our regression tests to take; my fault.

So I will address this in two parts - the MQTT one and then the registerNetwork() one (including the logic around the keepGoing flag) in discussion with AE (@philwareublox).

RobMeades commented 1 year ago

I've been through the registerNetwork() case again and you're right about needing the device error part but I now think the check that follows to do the abort was correct, though admittedly confusing. I have made a fix for both the MQTT cases and the registerNetwork() case and pushed a preview branch with those fixes here:

https://github.com/u-blox/ubxlib/tree/preview_cell_fix_err_rmea

These changes have passed our internal testing but, as I said, we probably don't cover such scenarios very well so could you take a look and confirm that the changes are what you expect? If you can confirm that I will merge the fixes to master and push that out here properly and then delete the preview_cell_fix_err_rmea branch.

wrh-dev commented 1 year ago

Thanks for making these changes! After some testing, I think they seem appropriate. For the registerNetwork() case, for example, I've seen the code catch CME ERROR: No network service errors and then use the registration URCs to correctly timeout or wait for registration to succeed.

I now follow the abort command logic after reading through the changes.

Feel free to close out this issue when the preview branch is merged!

RobMeades commented 1 year ago

Great: the changes have been submitted for review internally, should be there in a few days.

RobMeades commented 1 year ago

Fixes now pushed here in commits cf61c1beed87f6e30f66a14c70fb70009f82c8fc and e7851e0070bc6dfca8050ad2f15ea2b8fee25e6a, will delete the preview branch now, thanks again for raising these very useful issues.

u-blox / ubxlib

AT Device Error Handling #72