Closed ziriax closed 1 year ago
I've committed my source code and Dockerfiles to build the HEX files for both v1 and v2 here: https://github.com/Ziriax/ble-test
The website is deployed here https://ziriax.github.io/ble-test
I'm only using Windows, but it should be easy to use on macOS/Linux too.
The repo also contains a minimal react app, with code forked from https://github.com/thegecko/microbit-web-bluetooth
I have 4 micro:bit V2s, and all of them stop working after a while. I only have a single V1.5 device, but that one keeps working.
Interestingly, on my Android phone (Nokia 5.4, Android 10, Chrome 90), the V2 stops almost immediately, so that might make debugging easier.
Do you have other tips on how to investigate this issue? Maybe see if a native Bluetooth app would work fine? Maybe try Scratch Link?
We would really like to support the micro:bit in our upcoming educational reactive programming sandbox (to be released in September if all goes well), but we rely on Bluetooth working well.
I set CODAL_DEBUG=2, and every 3 seconds I'm printing device_heap_print.
I'm not sure if that is the right function to call to report memory, but I don't see any memory leaks.
Any updates on this? Since this also happens with Microsoft's Makecode, this can affect a lot of users? 😢
@Ziriax Thanks for analysing the problem. If there was a memory problem, I would expect to see panic 020 or 030.
Please see debug trace settings here: https://github.com/lancaster-university/codal-microbit-v2/pull/52#issuecomment-778195851.
If you saw the debugger arrive at app_error_fault_handler (id=1, pc=83684, info=0), that's NRF_FAULT_ID_SD_ASSERT, which would make sense, but that could be caused by the debugger. Calling device_heap_print too often is also likely to upset SoftDevice, because it disables IRQs.
I have built a similar services hex, with justworks security, and tested it with the iOS micro:bit app monitor. It failed somewhere between 25 and 75 mins. So it isn't caused by open link, or using WebBLE.
I will continue to investigate...
Thanks for this! Maybe the available RAM could be displayed on the 5x5 LED screen? This wouldn't disturb the softdevice I guess.
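As a rough illustration of the LED idea above, free RAM could be reduced to a 25-step bar graph before it is drawn. This is only a host-side sketch: ramBarGraph and litCount are hypothetical names, the free and total byte counts are assumed to come from whatever heap statistics the runtime exposes, and drawing the grid on the real display is left out.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: maps free RAM onto a 5x5 grid (row-major order).
// An entry that is true means "LED on". More free RAM lights more LEDs.
std::array<bool, 25> ramBarGraph(uint32_t freeBytes, uint32_t totalBytes) {
    std::array<bool, 25> leds{};              // all LEDs off by default
    if (totalBytes == 0) return leds;         // avoid division by zero
    if (freeBytes > totalBytes) freeBytes = totalBytes;
    // 64-bit intermediate so freeBytes * 25 cannot overflow.
    uint32_t lit = static_cast<uint32_t>(
        (static_cast<uint64_t>(freeBytes) * 25u) / totalBytes);
    for (uint32_t i = 0; i < lit; ++i) leds[i] = true;
    return leds;
}

// Hypothetical helper: counts how many LEDs are on in the grid.
uint32_t litCount(const std::array<bool, 25> &leds) {
    uint32_t n = 0;
    for (bool on : leds) n += on ? 1u : 0u;
    return n;
}
```

Updating the grid only when the lit count changes would keep the extra work small, which matters if frequent heap queries are what disturbs SoftDevice.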
That log message about the RAM start is for development, for adjusting the linker script to give SoftDevice the RAM it needs, which depends on several configurations.
Testing different services:
LED alone = OK
LED + Accel = FAIL
Accel alone = FAIL
Could you maybe provide your HEX file and/or code?
Because with my code, on my PCs, I am observing "disconnects/freezes" after a random time, could be even just a few seconds...
Random seems to be the right word. I have been running the attached for over two and a half hours. It froze as I was writing this. Although it has frozen, BLE is still connected, and I can disconnect and reconnect.
Previous tests were with debug trace and were not reconnectable after failure. The LED alone test was overnight, and did not fail. LED + Accel failed quite quickly. Accel alone lasted nearly 30mins. It failed less quickly when I increased the accelerometer period.
Thanks all. Just to clarify, what exactly is it that freezes here? Just the BLE comms or the whole microbit?
If anything, I think, BLE comms keep going, but the rest of the micro:bit freezes.
When DMESG is streaming to Tera Term, it seems like both BLE and the micro:bit freeze, and BLE disconnects. The Tera Term stream ends mid-word.
But when it fails without DMESG, at least sometimes, though the blinking LED is frozen and the iOS app is no longer getting updated accelerations or LED states, it remains connected and I can disconnect/reconnect BLE, and discover services. But the values are still frozen.
That is very odd. Can you get a debugger on it to see if the CPU is stuck somewhere? SoftDevice won't like it, but seeing as it isn't working anyway... it won't do any harm to try after the fault occurs?
It could indeed just be that we disabled interrupts during DMESG operations... It does kind of sound like SoftDevice is not returning control back to the application code (and it does run at the highest IRQ priority level).
It would be interesting to know if this happens less / not at all when the number of times we disable interrupts is reduced.
Just a thought... but for builds with BLE running, we could consider replacing the codal nrf52 target_disable_irq() / target_enable_irq() implementation with sd_nvic_critical_region_enter() / sd_nvic_critical_region_exit() primitives.
That should keep SoftDevice happier as its internal interrupts should keep firing.
I imagine this is more expensive in terms of CPU though, so we shouldn't do that unless necessary... most micro:bit builds don't run with BLE enabled.
My first thought was about too much IRQ disabling, but hopefully we haven't hit that. If we did create a codal-microbit-v2 level target_disable_irq() with critical_region_enter/exit, would that mean that all BLE code must be careful about anything that uses target_disable_irq() protection, e.g. system_timer_xyz? Maybe a target_critical_region_enter/exit() could be useful.
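To make the critical-region idea concrete, here is a minimal host-side sketch of a scoped guard. On the device, the two mock functions would be replaced by the SoftDevice calls sd_nvic_critical_region_enter(uint8_t *) and sd_nvic_critical_region_exit(uint8_t). The mocks, the CriticalRegion class, and the global state are all hypothetical stand-ins, and the sketch makes no claim about how codal implements target_disable_irq().

```cpp
#include <cstdint>

// Simulated state (assumption): nesting depth and whether application
// interrupts are currently masked.
static int g_depth = 0;
static bool g_app_irqs_masked = false;

// Host-side stand-in for sd_nvic_critical_region_enter(): reports through
// the out-parameter whether we were already inside a critical region.
static uint32_t mock_critical_region_enter(uint8_t *is_nested) {
    *is_nested = (g_depth > 0) ? 1 : 0;
    ++g_depth;
    g_app_irqs_masked = true;
    return 0;
}

// Host-side stand-in for sd_nvic_critical_region_exit(): only the
// outermost exit unmasks application interrupts again.
static uint32_t mock_critical_region_exit(uint8_t is_nested) {
    --g_depth;
    if (!is_nested) g_app_irqs_masked = false;
    return 0;
}

// Hypothetical RAII guard: enters on construction, exits on destruction,
// and carries the "nested" flag between the two calls.
class CriticalRegion {
public:
    CriticalRegion() { mock_critical_region_enter(&nested_); }
    ~CriticalRegion() { mock_critical_region_exit(nested_); }
    CriticalRegion(const CriticalRegion &) = delete;
    CriticalRegion &operator=(const CriticalRegion &) = delete;
private:
    uint8_t nested_ = 0;
};

// Demonstrates nesting: leaving the inner region must not unmask
// interrupts while the outer region is still active.
static bool masked_after_inner_exit() {
    CriticalRegion outer;
    {
        CriticalRegion inner;
    }
    return g_app_irqs_masked;
}
```

The point of carrying the nested flag is that a function such as a timer read can take the guard without knowing whether its caller already holds one.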
Adventures with the debugger... After waiting for it to lock up, I disconnected BLE and pressed the VSCode pause button. It stopped in system_timer_wait_us, with period 10. There was no call stack, but the only candidate caller I can see is in NRF52I2C::waitForStop. I just failed to set a breakpoint there, for some reason, but is that a possibility?
Thanks @martinwork.
I don't think BLE application code would need special consideration, would it? My understanding was that critical_region_enter/exit only leaves the internal SoftDevice interrupts enabled. I thought application level events were still masked. In which case it should make no difference? I may be wrong in that assumption though - SoftDevice is a bit of a black box and the documentation on what those calls actually do is a little vague.
Thanks for the debugger analysis. Having no call stack is really odd... That call will briefly disable interrupts whilst reading the current time. But seeing as it's spin waiting, the duty cycle of that could be quite high for that 10us window...
Thanks @finneyj You're absolutely right, of course (I think!). The softdevice-to-app IRQ seems to be counted as an application IRQ.
@finneyj I think it's stuck in NRF52I2C::waitForStop. I added a counter to the outer while loop, with a panic at 100 loops.
On the way to that idea, I noticed that before BLE connection, the debugger breaks at WFE(). After failure and BLE disconnection, it breaks at system_timer_wait_us(10).
Edit: I noticed the loop already had a counter to 100 and increased my counter with the same results. Further tests indicate it only sticks when BLE is connected.
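The loop-counter experiment above follows a general pattern: a spin wait with an upper bound, so that a missing hardware event becomes a reported timeout rather than a silent hang. The sketch below is a host-side illustration only. waitForEvent, WaitResult and FakeEvent are hypothetical names, and it is not the actual NRF52I2C::waitForStop code.

```cpp
#include <cstdint>
#include <functional>

enum class WaitResult { Ok, TimedOut };

// Polls eventSeen() at most maxPolls times, calling delay() between polls.
// On the device, delay() would be something like a 10 microsecond wait and
// eventSeen() would check a peripheral event register (both assumptions).
WaitResult waitForEvent(const std::function<bool()> &eventSeen,
                        const std::function<void()> &delay,
                        uint32_t maxPolls) {
    for (uint32_t i = 0; i < maxPolls; ++i) {
        if (eventSeen()) return WaitResult::Ok;
        delay();
    }
    return WaitResult::TimedOut;  // caller decides: retry, reset bus, or panic
}

// Simulated peripheral for testing: reports the event on the Nth poll.
// fireOnPoll == 0 means the event never arrives (the "stuck" case).
struct FakeEvent {
    uint32_t fireOnPoll;
    uint32_t polls = 0;
    bool operator()() {
        ++polls;
        return fireOnPoll != 0 && polls >= fireOnPoll;
    }
};
```

Returning a result instead of panicking inside the loop lets the caller choose a recovery step, which is useful when the stall only appears under specific conditions such as an active BLE connection.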
@martinwork Thanks for the HEX file, I will try it out.
Do you have some tips on how to do source debugging on the micro:bit device? I tried VS Code with this open codal-v2-samples pull request and the latest version of OpenOCD, but I really want the debugger to stop immediately at the program counter where an exception occurred, not in app_error_fault_handler after the fact.
I'm not sure what you guys are using for debugging? I'm a Windows developer myself, trying to learn about embedded software development.
Thanks a lot
I tested a program sending acceleration via serial, built with BLE services included. After pairing with iPad, it kept going for several hours while not BLE connected, but failed after about 25 minutes with BLE connected.
I think it's getting stuck waiting for NRF_TWIM_EVENT_SUSPENDED, inside NRF52I2C::waitForStop inside NRF52I2C::write: https://github.com/lancaster-university/codal-nrf52/blob/master/source/NRF52I2C.cpp#L195
I tried reversing the order of the trigger RESUME and clear SUSPENDED lines, without any luck.
I think this is fixed. Though Bluetooth seemed to help trigger it, this seems to have been a problem with I2C related to: https://github.com/lancaster-university/codal-microbit-v2/issues/102, https://github.com/lancaster-university/codal-microbit-v2/issues/141. I have had this program ble-uart-accel.zip running and connected to Bluetooth, on a V2.21 for a few hours without any problem.
I have a micro:bit V1.5, and I connect with web bluetooth services, with Chrome on Windows 10. I plot all sensor values over time, using https://github.com/thegecko/microbit-web-bluetooth. The V1.5 keeps running for hours, no problems observed so far.
However, when using the same code with the v2, it "hangs" after a while. It is not disconnecting, or at least the disconnect event handler on the micro:bit itself is not called. This can happen immediately, or after several minutes, it is random.
My codal file is:
I'm basically just using the ble_test sample. I enabled debugging just to figure out what might be causing it.
Note that I'm using an open link; it seems Chrome on Windows doesn't support "secure characteristic" yet: https://chromium-review.googlesource.com/c/chromium/src/+/2805864
I can share all my code (using Docker) and configs if that helps. I'm going to try on MacOS myself.
I used https://github.com/bsiever/microbit-webusb for receiving the serial messages over WebUSB.
The serial log before it hangs is: