Closed thekurtovic closed 1 week ago
Thank you for opening this issue.
I wonder what causes the issue you are experiencing. Can you share more details?
For me I think it was some sort of edge case race condition, usually during the secure connection process. If the esp32 is trying to secure right as the peripheral has gone to sleep, secureConnection()
would never return + never got the disconnect callback triggered.
I was able to reproduce it in the past by inserting a delay in NimBLEClient.cpp, after the call to NimBLEDevice::startSecurity() and before the task blocking. Though I was using release 1.4.1 of the Arduino library at the time, and IDF 4.4.6.
I'll try testing it out again on the latest commit and see if it still happens.
This was a similar issue, and where I got the idea to change portMAX_DELAY. https://www.esp32.com/viewtopic.php?t=19354
Thanks for the details. This is quite concerning, the disconnect event callback should always be sent and unblock anything that is waiting on a response from a peer. If that is not happening then there is a more serious issue and working around it may be necessary.
I know that the call to disconnect would fail sometimes in the past but that seems to be resolved, it was an issue with the controller.
It's not consistently reproducible but I did manage to observe the issue once again, though slightly different from my initial description.
Edit: Finally observed it again with NimBLE debug logs enabled.
I [ 11220][main.cpp:354] Create client, attempting to connect...
D (11229) NimBLEScan: erase device: e3:e4:5b:75:ad:36
D (11229) NimBLEClient: >> connect(e3:e4:5b:75:ad:36)
D (11500) NimBLEClient: Got Client event
I (11501) NimBLEClient: Connected event
D (11516) NimBLEClient: Got Client event
I (11516) NimBLEClient: mtu update: mtu=247
D (11517) NimBLEClientCallbacks: onMTUChange: default
D (11527) NimBLEClient: exchangeMTUCb: status=0, mtu=247
D (11527) NimBLEClient: Got Client event
I (11528) NimBLEClient: Connection established
I [ 11539][main.cpp:142] Connected to e3e45b75ad36, RSSI -42 (bond 1)
D (11541) NimBLEClient: << connect()
I [ 11544][main.cpp:371] Securing connection...
D (11546) NimBLEClient: Got Client event
D (11554) NimBLEClient: >> secureConnection()
NimBLE secureConnection Starting delay...
D (11620) NimBLEClient: Got Client event
I (11620) NimBLEClient: Connection parameters updated.
D (11659) NimBLEClient: Got Client event
D [ 11659][main.cpp:180] Auth event 1 1 1 16
D (12949) NimBLEClient: Got Client event
I (12949) NimBLEClient: disconnect; reason=531,
I [ 12949][main.cpp:157] BLE e3e45b75ad36 disconnected (531)
NimBLE secureConnection Ending delay...
E [ 45218][wdt.cpp:95] tsk0 not responding, restarting...
Snippet from application which shows what should be printed directly after secureConnection() returns.
LOG_I("Securing connection...");
if (!pClient->secureConnection())
{
LOG_E("Failed to secure, delete client");
deleteBond(temp);
return false;
}
LOG_I("Secured connection (%u)", _gotAuthEvt);
Modification made to NimBLEClient to simulate the issue.
bool NimBLEClient::secureConnection() const {
NIMBLE_LOGD(LOG_TAG, ">> secureConnection()");
NimBLETaskData taskData(const_cast<NimBLEClient*>(this));
int retryCount = 1;
do {
m_pTaskData = &taskData;
if (!NimBLEDevice::startSecurity(m_connHandle)) {
m_lastErr = BLE_HS_ENOTCONN;
m_pTaskData = nullptr;
return false;
}
delay(5000); // Delay to let peripheral go to sleep or disconnect
NimBLEUtils::taskWait(*m_pTaskData, BLE_NPL_TIME_FOREVER);
} while (taskData.m_flags == (BLE_HS_ERR_HCI_BASE + BLE_ERR_PINKEY_MISSING) && retryCount--);
...
} // secureConnection
Majority of the time it's crashing instead, though I'm assuming that's due to the artificial delay which was added.
Guru Meditation Error: Core 1 panic'ed (StoreProhibited). Exception was unhandled.
Core 1 register dump:
PC : 0x4201cc4e PS : 0x00060b30 A0 : 0x8201a32c A1 : 0x3fcd4440
A2 : 0x00000000 A3 : 0x3fcd0f78 A4 : 0x3fcd4420 A5 : 0x3fca4650
A6 : 0x00060023 A7 : 0x00000001 A8 : 0x8201cc4e A9 : 0x3fcd4420
A10 : 0x3fcd46d4 A11 : 0xb33fffff A12 : 0x80387995 A13 : 0x3fca4650
A14 : 0x00060023 A15 : 0x00000001 SAR : 0x0000000a EXCCAUSE: 0x0000001d
EXCVADDR: 0x0000000c LBEG : 0x400570e8 LEND : 0x400570f3 LCOUNT : 0x00000000
_ZN11NimBLEUtils8taskWaitERK14NimBLETaskDatam
C:\Users\User\Projects\avx-esp/managed_components/esp-nimble-cpp/src/NimBLEUtils.cpp:47
_ZNK12NimBLEClient16secureConnectionEv
C:\Users\User\Projects\avx-esp/managed_components/esp-nimble-cpp/src/NimBLEClient.cpp:337
I'll keep the test running for the rest of the day and hopefully catch it again with debug log.
Thanks, I see now. So the task is being released before it was blocked and since the call to taskWait
will clear the notification value first it will block forever. In that case we need a way to detect this before blocking the task.
@thekurtovic Could you test #224 with your code and let me know please?
The crash no longer occurs with the delay added, but now the task gets blocked every time whereas earlier it was rare to observe. Additionally, when my supervisory code restarts the task, it appears as if the BLE stack is in a weird state and no longer functioning. Earlier when the task was restarted, BLE scans would resume and allow the process to repeat. Now the scanning never resumes, and NimBLEDevice::deinit(true) hangs once it hits nimble_port_stop().
Thanks, any chance to see a log?
Sure. I added some extra logging within NimBLEDevice::deinit() to figure out how far it was getting.
D (9093) NimBLEScan: >> stop()
I [ 9104][main.cpp:223] Scan ended...
D (9104) NimBLEScan: << stop()
D (9104) NimBLEScan: erase device: e3:e4:5b:75:ad:36
I [ 9096][main.cpp:356] Create client, attempting to connect...
D (9115) NimBLEClient: >> connect(e3:e4:5b:75:ad:36)
D (9725) NimBLEClient: Got Client event
I (9725) NimBLEClient: Connected event
D (9737) NimBLEClient: Got Client event
I (9737) NimBLEClient: mtu update: mtu=247
D (9737) NimBLEClientCallbacks: onMTUChange: default
D (9748) NimBLEClient: exchangeMTUCb: status=0, mtu=247
D (9748) NimBLEClient: Got Client event
I (9748) NimBLEClient: Connection established
I [ 9761][main.cpp:142] Connected to e3e45b75ad36, RSSI -49 (bond 1)
D (9764) NimBLEClient: << connect()
I [ 9766][main.cpp:373] Securing connection...
D (9767) NimBLEClient: Got Client event
D (9776) NimBLEClient: >> secureConnection()
E (9780) NimBLEClient: secureConnection Starting delay....
D (9841) NimBLEClient: Got Client event
I (9841) NimBLEClient: Connection parameters updated.
D (9880) NimBLEClient: Got Client event
D [ 9880][main.cpp:180] Auth event 1 1 1 16
D (12745) NimBLEClient: Got Client event
I (12745) NimBLEClient: disconnect; reason=531,
I [ 12745][main.cpp:157] BLE e3e45b75ad36 disconnected (531)
E (14791) NimBLEClient: secureConnection Ending delay...
E [ 43095][wdt.cpp:95] tsk0 not responding, restarting...
W [129096][main.cpp:110] failsafe...
W [129097][main.cpp:428] Reinitializing BLE
E (129097) NimBLEDevice: trying to deinit
E [163097][wdt.cpp:95] tsk0 not responding, restarting...
@thekurtovic Thanks, I think I have spotted the issue and just pushed a potential fix, please let me know.
It looks good now. I left it running for 12 hours and didn't see a single issue from 4,200 attempts. Thank you.
Caught an issue overnight. I removed the calls to explicitly start secureConnection
in the application so that it would be handled during client.getService(...);
before writing to a characteristic.
I [544126][main.cpp:224] BLE adv (-90 db, phy 1 0) Name: , Address: f1:01:06:9c:ce:ea
I [544147][main.cpp:240] Create client, attempting to connect...
I [544148][main.cpp:270] Scan ended...
I [546395][main.cpp:491] Scan restart...
I [546395][main.cpp:149] Connected to f101069cceea, RSSI 0 (bond 1)
D [546396][main.cpp:540] Try to get service f101069cceea
I [546463][main.cpp:224] BLE adv (-74 db, phy 1 0) Name: SHIELD, Address: 74:7d:53:5f:13:59, serviceUUID: 0x3456
E (546813) NimBLEDevice: ble_gap_security_initiate: rc=2
D [551423][main.cpp:188] Connection params update req
I [556469][main.cpp:224] BLE adv (-88 db, phy 1 0) Name: , Address: 70:2c:c4:fd:9f:91, serviceUUID: 0xfef3
I [558383][main.cpp:172] BLE f101069cceea disconnected (531)
E [580395][wdt.cpp:95] tsk0 not responding, restarting...
Hmm, that error code indicates that security was already started but no event came to unblock the task
I think this was an issue on my end. Noticed https://github.com/h2zero/esp-nimble-cpp/pull/227 mentions deleting client on disconnect being support now.
I've always been doing NimBLEDevice::deleteClient
within the onDisconnect
handler, which possibly was preventing
NimBLEUtils::taskRelease
from running at the end of NimBLEClient::handleGapEvent
.
Does https://github.com/h2zero/esp-nimble-cpp/pull/227 ensure the client task data will be released before deleting? It looks like it only accounts for connection attempts but not others like the above scenario. Should the task be released always if disconnected and pClient->m_pTaskData != nullptr
? I've been testing it for a bit and haven't had any issues though, other than a slight regression with scanning apparently.
Yes, it makes a copy of the task data pointer in the event handler now so if the client is deleted the task can still be release since it's allocated in the stack.
There are instances where the library will block the calling task with BLE_NPL_TIME_FOREVER as the duration. This duration value should be configurable to allow for a more suitable value depending on what the application needs.
NimBLEUtils::taskWait(..., BLE_NPL_TIME_FOREVER);
Example https://github.com/h2zero/esp-nimble-cpp/blob/master/src/NimBLEClient.cpp#L320-L345 https://github.com/h2zero/esp-nimble-cpp/blob/master/src/NimBLEClient.cpp#L723-L753