h2zero / NimBLE-Arduino

A fork of the NimBLE library structured for compilation with Arduino, for use with ESP32, nRF5x.
https://h2zero.github.io/NimBLE-Arduino/
Apache License 2.0
710 stars 147 forks source link

Stack canary watchpoint triggered (ipc0) - arduino 1_0_5_RC6/v3.3.4-402-gcd59d107b #177

Closed btsimonh closed 2 years ago

btsimonh commented 3 years ago

hi @h2zero,

with the vanilla latest release version of arduino, I'm getting these regularly:

10:26:55.827 readOneSensor - already active reading 2
Guru Meditation Error: Core  0 panic'ed (Unhandled debug exception)
Debug exception reason: Stack canary watchpoint triggered (ipc0) 
Core 0 register dump:
PC      : 0x40094c83  PS      : 0x00060036  A0      : 0x40084a64  A1      : 0x3ffb9a20  
A2      : 0x00000000  A3      : 0x00060023  A4      : 0x00060020  A5      : 0x00000001
A6      : 0x00000020  A7      : 0x00000000  A8      : 0x801c5842  A9      : 0x3ffb9bb0  
A10     : 0x00000008  A11     : 0x40082f30  A12     : 0x3ffb61b4  A13     : 0x00000000
A14     : 0x3ffb616c  A15     : 0x00060b23  SAR     : 0x00000008  EXCCAUSE: 0x00000001  
EXCVADDR: 0x00000000  LBEG    : 0x4000c349  LEND    : 0x4000c36b  LCOUNT  : 0xffffffff

ELF file SHA256: 0000000000000000

Backtrace: 
0x40094c83:0x3ffb9a20 
0x40084a61:0x3ffb9af0 _xt_user_exit
0x40095657:0x3ffb9b10 
0x40095dae:0x3ffb9b30 multi_heap_malloc
0x40084535:0x3ffb9b50 
0x40084566:0x3ffb9b70 
0x40082f35:0x3ffb9b90 _malloc_r
0x4000beaf:0x3ffb9bb0 
0x401c583f:0x3ffb9bd0 btdm_task_post
0x401c58b9:0x3ffb9c00 
0x40181b7d:0x3ffb9c30 async_wakeup_request
0x40181e2d:0x3ffb9c50 esp_vhci_host_send_packet
0x4011eb8c:0x3ffb9c70 ble_hci_trans_hs_acl_tx_on_core_0
0x40087473:0x3ffb9da0 libesp32.a(ipc.o)
0x4009226a:0x3ffb9dc0 (port.o)

Rebooting...
ets Jun  8 2016 00:22:57

rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee

Although not directly NimBLE related, do you have any advice? (still in an older NimBLE)

br,

s

btsimonh commented 3 years ago

hmm thinking about this - Q: what NimBLE callbacks originate in the ipc0 task? - probably one of MY callback functions is using too much stack, and the canary was only found corrupt by the above. I'm doing a test with a hand-modified sdkconfig.h - I'm not sure if this will actually increase the IPC stack size or not - 1024->2048 (no experience with sdkconfig here). Time will tell. Crash is now confirmed on other devices, but we also have devices with uptime > 20 hours, where some are barely managing 5 mins :(.

Ysbrand commented 3 years ago

Keep in mind that the device with the 22 hours uptime is only listening to the BLE adverts, nothing else. My other device which queries the TRV's every minutes crashes indeed on a regular basis.

h2zero commented 3 years ago

You need commit 08be9140c313af980331d7d8967563ee8c254489 from this repo.

btsimonh commented 3 years ago

@h2zero - thankyou. Will go for the NimBLE upgrade now :). Was resisting just because we had enough issues of our own, but we're now pretty clear of them, so that time has arrived .... Expect some feedback later today.

h2zero commented 3 years ago

It's a very ugly hack but I could see no other way to maintain support for older arduino versions and the latest master. Will be removed in the future if arduino core ever does a new release.

btsimonh commented 3 years ago

ok, it's running with latest master, without my 'getResult' in NimBLEClient. only 15 mins so far. Will report.

btsimonh commented 3 years ago

looking good so far.

Q: in Tasmota, we had these modifications in place to support a unicore ESP32 (ESP32solo1): https://github.com/arendst/Tasmota/commit/b11f1b5483a9baa264adb1ef49688d471f6fea80

Looking at your new commit, I'm not sure we need them anymore? Your thoughts? br,

S

h2zero commented 3 years ago

Glad it's working for you so far. You shouldn't need those mods anymore, as it will always use the same core as NimBLE stack for commands except when using arduino rc5/6, but if you're running unicore then that shouldn't matter.

h2zero commented 3 years ago

Working on a better solution to this than the current implementation, does not require idf commit checks. promising results so far.

btsimonh commented 3 years ago

hi h2zero, just popping back to report general success... most people are experiencing good uptime. One report today:

16:10:33.996 Mode a: bat updated
16:10:34.056 MI32: scancomplete
E (7737992) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
E (7737992) task_wdt:  - IDLE0 (CPU 0)
E (7737992) task_wdt: Tasks currently running:
E (7737992) task_wdt: CPU 0: btController
E (7737992) task_wdt: CPU 1: IDLE1
E (7737992) task_wdt: Aborting.
abort() was called at PC 0x40188188 on core 0

ELF file SHA256: 0000000000000000

Backtrace: 0x40091054:0x3ffbea80 0x400912d1:0x3ffbeaa0 0x40188188:0x3ffbeac0 0x40084ac9:0x3ffbeae0 0x400158ad:0x3ffb5ef0 0x401b95cb:0x3ffb5f10 0x401b8a8b:0x3ffb5f30 0x4009226a:0x3ffb5f60

Rebooting...
ets Jun  8 2016 00:22:57

very little to go on - except that 'btController' is hogging the CPU for a long time. This could be in esp library, your code, or a BT callback :(. I take it that there is nothing to be gained from any of the backtrace numbers here - we don't get the PC of the actual looping code...?

anyway, so far, much better - thanks again for all your help.

h2zero commented 3 years ago

Glad it's working well, sadly that exception you posted is an upstream issue. I have reported it here https://github.com/espressif/esp-idf/issues/6417 and there are a couple other reports of this there as well.

Which arduino commit is this on? I have only seen it in 1.0.5-rc6 but not often enough to find a reliable way to reproduce.

btsimonh commented 3 years ago

this one: 5dc72ab10d9f928442a25ef3bdcf8a31a7e16301e089fae - 17th jan Was master on the day you told me to update :). I hear you did a release version - anything significant above this? Sorry, with RC6.

h2zero commented 3 years ago

There's a couple improvements to client in 1.1.0 that you may want.

Yep rc6 seems to have this issue instead of the disconnect bug. I haven't seen it happen yet with https://github.com/espressif/arduino-esp32/commit/cee659563d38434abbe7fb3523f186f0f308bffc.

doudar commented 3 years ago

I just started getting this randomly on scans (seems to be a random core every time too). I think it's related to increasing my scan timings to:

pBLEScan->setInterval(550);
pBLEScan->setWindow(500);

I'll try lowering them and see if it fixes the issue.

doudar commented 3 years ago

Okay, so mine probably isn't related to this.

A long callback with a bunch of if tests in it causes the crash every single time when it his the scan request.

A short callback with the scan request never crashes.

I'll just set up a non blocking flag for the scan. Weirdly it never had this issue before and I haven't changed that part of my code in a bit.

h2zero commented 3 years ago

Have you updated the arduino core sources? I'm curious what commits do this so I can try to pinpoint the location of the issue.

Thanks for the input though, I'll have to try making my callback take some time and see if it triggers.

doudar commented 3 years ago

Yes.

I need to switch to using a release.

It just asked me to sync to master a couple days ago.

doudar commented 3 years ago

I should also point out that it was within a webserver callback where I had issues.

This long callback crashes every single time as soon as it hits the scan with the latest arduino: (btw) I even tried setting a 10 second vtaskdelay (maybe clear the radio is my thought) prior to the scan and then it just waits 10 seconds before the crash). I also tried moving it before and after the html send.

server.on("/send_settings", []() { String tString; bool wasBTUpdate = false; if (!server.arg("ssid").isEmpty()) { tString = server.arg("ssid"); tString.trim(); userConfig.setSsid(tString); } if (!server.arg("password").isEmpty()) { tString = server.arg("password"); tString.trim(); userConfig.setPassword(tString); } if (!server.arg("deviceName").isEmpty()) { tString = server.arg("deviceName"); tString.trim(); userConfig.setDeviceName(tString); } if (!server.arg("shiftStep").isEmpty()) { userConfig.setShiftStep(server.arg("shiftStep").toInt()); } if (!server.arg("stepperPower").isEmpty()) { userConfig.setStepperPower(server.arg("stepperPower").toInt()); updateStepperPower(); } //checkboxes don't report off, so need to check using another parameter that's always present on that page if (!server.arg("stepperPower").isEmpty()) { if (!server.arg("autoUpdate").isEmpty()) { userConfig.setAutoUpdate(true); } else { userConfig.setAutoUpdate(false); } if (!server.arg("stealthchop").isEmpty()) { userConfig.setStealthChop(true); updateStealthchop(); } else { userConfig.setStealthChop(false); updateStealthchop(); } } if (!server.arg("inclineMultiplier").isEmpty()) { userConfig.setInclineMultiplier(server.arg("inclineMultiplier").toFloat()); } if (!server.arg("blePMDropdown").isEmpty()) { wasBTUpdate = true; if (server.arg("blePMDropdown")) { tString = server.arg("blePMDropdown"); userConfig.setConnectedPowerMeter(server.arg("blePMDropdown")); } else { userConfig.setConnectedPowerMeter("any"); } } if (!server.arg("bleHRDropdown").isEmpty()) { wasBTUpdate = true; if (server.arg("bleHRDropdown")) { tString = server.arg("bleHRDropdown"); userConfig.setConnectedHeartMonitor(server.arg("bleHRDropdown")); } else { userConfig.setConnectedHeartMonitor("any"); } }

if (!server.arg("session1HR").isEmpty()) //Needs checking for unrealistic numbers. 
{
  userPWC.session1HR = server.arg("session1HR").toInt();
}
if (!server.arg("session1Pwr").isEmpty())
{
  userPWC.session1Pwr = server.arg("session1Pwr").toInt();
}
if (!server.arg("session2HR").isEmpty())
{
  userPWC.session2HR = server.arg("session2HR").toInt();
}
if (!server.arg("session2Pwr").isEmpty())
{
  userPWC.session2Pwr = server.arg("session2Pwr").toInt();

  if (!server.arg("hr2Pwr").isEmpty())
  {
    userPWC.hr2Pwr = true;
  }
  else
  {
    userPWC.hr2Pwr = false;
  }
}

String response = "<!DOCTYPE html><html><body><h2>";
if (wasBTUpdate) //Special BT update response
{
  response += "Selections Saved!</h2></body><script> setTimeout(\"location.href = 'http://" + String(userConfig.getDeviceName()) + ".local/bluetoothscanner.html';\",1000);</script></html>";
  spinBLEClient.serverScan(true);
}
else
{ //Normal response
  response += "Network settings will be applied at next reboot. <br> Everything else is availiable immediatly.</h2></body><script> setTimeout(\"location.href = 'http://" + String(userConfig.getDeviceName()) + ".local/index.html';\",1000);</script></html>";
}
server.send(200, "text/html", response);
debugDirector("Config Updated From Web");
userConfig.saveToSPIFFS();
userConfig.printFile();
userPWC.saveToSPIFFS();
userPWC.printFile();

});

And this short callback never causes an issue:

server.on("/BLEScan", []() { debugDirector("Scanning for BLE Devices"); String response = "<!DOCTYPE html>Scanning for BLE Devices. Please wait 5 seconds."; spinBLEClient.serverScan(true); server.send(200, "text/html", response); });

I'm setting a flag now with a separate non-blocking task initiating the scan without issue.

h2zero commented 3 years ago

Thanks, looks like the issue has been pinpointed by another posted here https://github.com/espressif/esp-idf/issues/6358

Looks like you found a workaround for your use at least. I'm curious if you changed the 5 second delay to just a taskYIELD() if it would alleviate this.

btsimonh commented 3 years ago

@h2zero, next report from my testing guy....:

I'm sorry: but the test ESP (connected to serial) crashed a few hours ago (the others are still up and running)
13:13:20.047 WIF: Checking connection...
13:13:40.032 WIF: Checking connection...
13:14:00.026 WIF: Checking connection...
E (78418071) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
E (78418071) task_wdt:  - IDLE0 (CPU 0)
E (78418071) task_wdt: Tasks currently running:
E (78418071) task_wdt: CPU 0: btController
E (78418071) task_wdt: CPU 1: IDLE1
E (78418071) task_wdt: Aborting.
abort() was called at PC 0x40163394 on core 0

ELF file SHA256: 0000000000000000

Backtrace: 0x40091054:0x3ffbfde0 0x400912d1:0x3ffbfe00 0x40163394:0x3ffbfe20 0x400860c9:0x3ffbfe40 0x400158ad:0x3ffe2380 0x401c03a3:0x3ffe23a0 0x401bf867:0x3ffe23c0 0x400922d2:0x3ffe23f0

Rebooting...
ets Jun  8 2016 00:22:57

rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DOUT, clock div:2
load:0x3fff0018,len:4
load:0x3fff001c,len:1044
load:0x40078000,len:8896
load:0x40080400,len:5828
entry 0x400806ac

That is running your build, so you should have the proper map file. On the other hand, looks like this is not in your code anyway (btController)
BuildDateTime":"2021-01-21T16:46:49"

same issue.... release nimble + that arduino commit :(.....

but the reporter also said: "the others are still up and running" - about 1 1/2 days.

h2zero commented 3 years ago

E (78418071) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time: E (78418071) task_wdt: - IDLE0 (CPU 0) E (78418071) task_wdt: Tasks currently running: E (78418071) task_wdt: CPU 0: btController

That says it all lol, not much we can do until espressif issues a fix. older arduino cores don't seem to have this bug, but they have the disconnect bug that we can at least work around.

btsimonh commented 3 years ago

it could still be in your code or mine (: - 'cos they call us on that task? or not?

h2zero commented 3 years ago

It's not in our code as this has been reported by others using IDF directly (not this library). Here is the report from a IDF user, same backtrace: espressif/esp-idf#6358

btsimonh commented 3 years ago

good call :).

btsimonh commented 3 years ago

does the released 1.0.5 bring anything over rc6/rc7? s

h2zero commented 3 years ago

Possibly, looks like there might have been some controller updates, I don't know what they are though.

h2zero commented 3 years ago

Update to this; Looks like there may be a fix soon but might not be in ardruino for a while https://github.com/espressif/esp32-bt-lib/commit/950f7886b6a28734c625f979c2a73752db1dfb19

Though it says wifi, I'm hopeful that it solves the same issue as it's in the BT repo. Closed source sucks, not knowing what's actually fixed is frustrating.

doudar commented 3 years ago

Update to this; Looks like there may be a fix soon but might not be in ardruino for a while espressif/esp32-bt-lib@950f788

Though it says wifi, I'm hopeful that it solves the same issue as it's in the BT repo. Closed source sucks, not knowing what's actually fixed is frustrating.

Obviously it's the issue contained in 950f788 that was fixed.....

:grin:

h2zero commented 3 years ago

Damn, no idea how I missed that, thanks for pointing it out!

😃

h2zero commented 2 years ago

Closing as this should be resolved. Please reopen if not.