h2zero / esp-nimble-cpp

C++ library for the esp32 NimBLE stack based on and mostly compatible with @nkolban cpp_utils BLE library.
https://h2zero.github.io/esp-nimble-cpp/
Apache License 2.0
181 stars 62 forks source link

Allow task blocking duration to be configured #222

Closed thekurtovic closed 1 week ago

thekurtovic commented 1 week ago

There are instances where the library will block the calling task with BLE_NPL_TIME_FOREVER as the duration. This duration value should be configurable to allow for a more suitable value depending on what the application needs. NimBLEUtils::taskWait(..., BLE_NPL_TIME_FOREVER);

Example https://github.com/h2zero/esp-nimble-cpp/blob/master/src/NimBLEClient.cpp#L320-L345 https://github.com/h2zero/esp-nimble-cpp/blob/master/src/NimBLEClient.cpp#L723-L753

h2zero commented 1 week ago

Thank you for opening this issue.

I wonder what causes the issue you are experiencing. Can you share more details?

thekurtovic commented 1 week ago

For me I think it was some sort of edge case race condition, usually during the secure connection process. If the esp32 is trying to secure right as the peripheral has gone to sleep, secureConnection() would never return + never got the disconnect callback triggered. I was able to reproduce it in the past by inserting a delay in NimBLEClient.cpp, after the call to NimBLEDevice::startSecurity() and before the task blocking. Though I was using release 1.4.1 of the Arduino library at the time, and IDF 4.4.6.

I'll try testing it out again on the latest commit and see if it still happens.

This was a similar issue, and where I got the idea to change portMAX_DELAY. https://www.esp32.com/viewtopic.php?t=19354

h2zero commented 1 week ago

Thanks for the details. This is quite concerning, the disconnect event callback should always be sent and unblock anything that is waiting on a response from a peer. If that is not happening then there is a more serious issue and working around it may be necessary.

I know that the call to disconnect would fail sometimes in the past but that seems to be resolved, it was an issue with the controller.

thekurtovic commented 1 week ago

It's not consistently reproducible but I did manage to observe the issue once again, though slightly different from my initial description.

Edit: Finally observed it again with NimBLE debug logs enabled.

I [ 11220][main.cpp:354] Create client, attempting to connect...
D (11229) NimBLEScan: erase device: e3:e4:5b:75:ad:36
D (11229) NimBLEClient: >> connect(e3:e4:5b:75:ad:36)
D (11500) NimBLEClient: Got Client event
I (11501) NimBLEClient: Connected event
D (11516) NimBLEClient: Got Client event
I (11516) NimBLEClient: mtu update: mtu=247
D (11517) NimBLEClientCallbacks: onMTUChange: default
D (11527) NimBLEClient: exchangeMTUCb: status=0, mtu=247
D (11527) NimBLEClient: Got Client event
I (11528) NimBLEClient: Connection established
I [ 11539][main.cpp:142] Connected to e3e45b75ad36, RSSI -42 (bond 1)
D (11541) NimBLEClient: << connect()
I [ 11544][main.cpp:371] Securing connection...
D (11546) NimBLEClient: Got Client event
D (11554) NimBLEClient: >> secureConnection()
NimBLE secureConnection Starting delay...
D (11620) NimBLEClient: Got Client event
I (11620) NimBLEClient: Connection parameters updated.
D (11659) NimBLEClient: Got Client event
D [ 11659][main.cpp:180] Auth event 1 1 1 16
D (12949) NimBLEClient: Got Client event
I (12949) NimBLEClient: disconnect; reason=531,
I [ 12949][main.cpp:157] BLE e3e45b75ad36 disconnected (531)
NimBLE secureConnection Ending delay...
E [ 45218][wdt.cpp:95] tsk0 not responding, restarting...

Snippet from application which shows what should be printed directly after secureConnection() returns.

    LOG_I("Securing connection...");
    if (!pClient->secureConnection())
    {
        LOG_E("Failed to secure, delete client");
        deleteBond(temp);
        return false;
    }
    LOG_I("Secured connection (%u)", _gotAuthEvt);

Modification made to NimBLEClient to simulate the issue.

bool NimBLEClient::secureConnection() const {
    NIMBLE_LOGD(LOG_TAG, ">> secureConnection()");
    NimBLETaskData taskData(const_cast<NimBLEClient*>(this));
    int            retryCount = 1;

    do {
        m_pTaskData = &taskData;

        if (!NimBLEDevice::startSecurity(m_connHandle)) {
            m_lastErr   = BLE_HS_ENOTCONN;
            m_pTaskData = nullptr;
            return false;
        }
        delay(5000); // Delay to let peripheral go to sleep or disconnect
        NimBLEUtils::taskWait(*m_pTaskData, BLE_NPL_TIME_FOREVER);
    } while (taskData.m_flags == (BLE_HS_ERR_HCI_BASE + BLE_ERR_PINKEY_MISSING) && retryCount--);
    ...
} // secureConnection

Majority of the time it's crashing instead, though I'm assuming that's due to the artificial delay which was added.

Guru Meditation Error: Core  1 panic'ed (StoreProhibited). Exception was unhandled.

Core  1 register dump:
PC      : 0x4201cc4e  PS      : 0x00060b30  A0      : 0x8201a32c  A1      : 0x3fcd4440
A2      : 0x00000000  A3      : 0x3fcd0f78  A4      : 0x3fcd4420  A5      : 0x3fca4650
A6      : 0x00060023  A7      : 0x00000001  A8      : 0x8201cc4e  A9      : 0x3fcd4420
A10     : 0x3fcd46d4  A11     : 0xb33fffff  A12     : 0x80387995  A13     : 0x3fca4650
A14     : 0x00060023  A15     : 0x00000001  SAR     : 0x0000000a  EXCCAUSE: 0x0000001d
EXCVADDR: 0x0000000c  LBEG    : 0x400570e8  LEND    : 0x400570f3  LCOUNT  : 0x00000000

_ZN11NimBLEUtils8taskWaitERK14NimBLETaskDatam
C:\Users\User\Projects\avx-esp/managed_components/esp-nimble-cpp/src/NimBLEUtils.cpp:47
_ZNK12NimBLEClient16secureConnectionEv
C:\Users\User\Projects\avx-esp/managed_components/esp-nimble-cpp/src/NimBLEClient.cpp:337

I'll keep the test running for the rest of the day and hopefully catch it again with debug log.

h2zero commented 1 week ago

Thanks, I see now. So the task is being released before it was blocked and since the call to taskWait will clear the notification value first it will block forever. In that case we need a way to detect this before blocking the task.

h2zero commented 1 week ago

@thekurtovic Could you test #224 with your code and let me know please?

thekurtovic commented 1 week ago

The crash no longer occurs with the delay added, but now the task gets blocked every time whereas earlier it was rare to observe. Additionally, when my supervisory code restarts the task, it appears as if the BLE stack is in a weird state and no longer functioning. Earlier when the task was restarted, BLE scans would resume and allow the process to repeat. Now the scanning never resumes, and NimBLEDevice::deinit(true) hangs once it hits nimble_port_stop().

h2zero commented 1 week ago

Thanks, any chance to see a log?

thekurtovic commented 1 week ago

Sure. I added some extra logging within NimBLEDevice::deinit() to figure out how far it was getting.

D (9093) NimBLEScan: >> stop()
I [  9104][main.cpp:223] Scan ended...
D (9104) NimBLEScan: << stop()
D (9104) NimBLEScan: erase device: e3:e4:5b:75:ad:36
I [  9096][main.cpp:356] Create client, attempting to connect...
D (9115) NimBLEClient: >> connect(e3:e4:5b:75:ad:36)
D (9725) NimBLEClient: Got Client event
I (9725) NimBLEClient: Connected event
D (9737) NimBLEClient: Got Client event
I (9737) NimBLEClient: mtu update: mtu=247
D (9737) NimBLEClientCallbacks: onMTUChange: default
D (9748) NimBLEClient: exchangeMTUCb: status=0, mtu=247
D (9748) NimBLEClient: Got Client event
I (9748) NimBLEClient: Connection established
I [  9761][main.cpp:142] Connected to e3e45b75ad36, RSSI -49 (bond 1)
D (9764) NimBLEClient: << connect()
I [  9766][main.cpp:373] Securing connection...
D (9767) NimBLEClient: Got Client event
D (9776) NimBLEClient: >> secureConnection()
E (9780) NimBLEClient: secureConnection Starting delay....
D (9841) NimBLEClient: Got Client event
I (9841) NimBLEClient: Connection parameters updated.
D (9880) NimBLEClient: Got Client event
D [  9880][main.cpp:180] Auth event 1 1 1 16
D (12745) NimBLEClient: Got Client event
I (12745) NimBLEClient: disconnect; reason=531,
I [ 12745][main.cpp:157] BLE e3e45b75ad36 disconnected (531)
E (14791) NimBLEClient: secureConnection Ending delay...
E [ 43095][wdt.cpp:95] tsk0 not responding, restarting...
W [129096][main.cpp:110] failsafe...
W [129097][main.cpp:428] Reinitializing BLE
E (129097) NimBLEDevice: trying to deinit
E [163097][wdt.cpp:95] tsk0 not responding, restarting...
h2zero commented 1 week ago

@thekurtovic Thanks, I think I have spotted the issue and just pushed a potential fix, please let me know.

thekurtovic commented 1 week ago

It looks good now. I left it running for 12 hours and didn't see a single issue from 4,200 attempts. Thank you.

thekurtovic commented 1 week ago

Caught an issue overnight. I removed the calls to explicitly start secureConnection in the application so that it would be handled during client.getService(...); before writing to a characteristic.

I [544126][main.cpp:224] BLE adv (-90 db, phy 1 0) Name: , Address: f1:01:06:9c:ce:ea
I [544147][main.cpp:240] Create client, attempting to connect...
I [544148][main.cpp:270] Scan ended...
I [546395][main.cpp:491] Scan restart...
I [546395][main.cpp:149] Connected to f101069cceea, RSSI 0 (bond 1)
D [546396][main.cpp:540] Try to get service f101069cceea
I [546463][main.cpp:224] BLE adv (-74 db, phy 1 0) Name: SHIELD, Address: 74:7d:53:5f:13:59, serviceUUID: 0x3456
E (546813) NimBLEDevice: ble_gap_security_initiate: rc=2 
D [551423][main.cpp:188] Connection params update req
I [556469][main.cpp:224] BLE adv (-88 db, phy 1 0) Name: , Address: 70:2c:c4:fd:9f:91, serviceUUID: 0xfef3
I [558383][main.cpp:172] BLE f101069cceea disconnected (531)
E [580395][wdt.cpp:95] tsk0 not responding, restarting...
h2zero commented 1 week ago

Hmm, that error code indicates that security was already started but no event came to unblock the task

thekurtovic commented 1 week ago

I think this was an issue on my end. Noticed https://github.com/h2zero/esp-nimble-cpp/pull/227 mentions deleting client on disconnect being support now. I've always been doing NimBLEDevice::deleteClient within the onDisconnect handler, which possibly was preventing NimBLEUtils::taskRelease from running at the end of NimBLEClient::handleGapEvent.

Does https://github.com/h2zero/esp-nimble-cpp/pull/227 ensure the client task data will be released before deleting? It looks like it only accounts for connection attempts but not others like the above scenario. Should the task be released always if disconnected and pClient->m_pTaskData != nullptr? I've been testing it for a bit and haven't had any issues though, other than a slight regression with scanning apparently.

h2zero commented 1 week ago

Yes, it makes a copy of the task data pointer in the event handler now so if the client is deleted the task can still be release since it's allocated in the stack.