lancaster-university / codal-microbit-v2

CODAL target for the micro:bit v2.x series of devices
MIT License
43 stars 52 forks source link

v2 bluetooth unstable? #98

Closed ziriax closed 1 year ago

ziriax commented 3 years ago

I have a micro:bit V1.5, and I connect with web bluetooth services, with Chrome on Windows 10. I plot all sensor values over time, using https://github.com/thegecko/microbit-web-bluetooth. The V1.5 keeps running for hours, no problems observed so far.

However, when using the same code with the v2, it "hangs" after a while. It is not disconnecting, or at least the disconnect event handler on the micro:bit itself is not called. This can happen immediately, or after several minutes, it is random.

My codal file is:

{
  "target": {
    "name": "codal-microbit-v2",
    "url": "https://github.com/lancaster-university/codal-microbit-v2",
    "branch": "master",
    "type": "git",
    "test_ignore": true,
    "dev": true
  },
  "config": {
    "SOFTDEVICE_PRESENT": 1,
    "DEVICE_BLE": 1,
    "MICROBIT_BLE_ENABLED": 1,
    "MICROBIT_BLE_PAIRING_MODE": 0,
    "MICROBIT_BLE_DFU_SERVICE": 1,
    "MICROBIT_BLE_WHITELIST": 0,
    "MICROBIT_BLE_DEVICE_INFORMATION_SERVICE": 1,
    "MICROBIT_BLE_EVENT_SERVICE": 1,
    "MICROBIT_BLE_PARTIAL_FLASHING": 0,
    "MICROBIT_BLE_SECURITY_LEVEL": "SECURITY_MODE_ENCRYPTION_OPEN_LINK",
    "MICROBIT_DMESG_LEVEL": 4,
    "DMESG_SERIAL_DEBUG": 1
  }
}

I'm basically just using the ble_test sample.

I enabled debugging just to figure out what might be causing it.

Note that I'm using an open link; it seems Chrome on Windows doesn't support "secure characteristic" yet: https://chromium-review.googlesource.com/c/chromium/src/+/2805864

I can share all my code (using Docker) and configs if that helps. I'm going to try on MacOS myself.

I used https://github.com/bsiever/microbit-webusb for receiving the serial messages over WebUSB.

The serial log before it hangs is:

Console Data: calling sd_ble_gatts_hvx( 71,lue 38
Console Data: calling sd_ble_gatts_hvx( 38, 0x20005F6C, 164, 6)
Console Data: Miccalling sd_ble_gatts_hvx( 68, 0x20006224, 38, 6)
Console Data: MicroBitBLEC 2)
Graph Event: :notifyChrValue 38 (for MicroBitBLEChar)
Graph Event: :notifyChrValue 71 (for calling sd_ble_gatts_itBLEChar)
Console Data: calling sd_ble_gatts_hvx( 71, 0xtifyChrValue 68
Console Data: calling sd_ble_gatts_hvx( 68, 0x20006224, 232 sd_ble_gatts_hvx( 38, 0x20005F6C, 184, 6)
Graph Event: :n2) (for MicroBitBLEChar)
Graph Event: :notifyChrValue 38 (for MicroBitBLEChar)
Console Data: calling sd_ble_gatts_hatts_hvx( 71, 0x2000622A, 52, 2)
Graph Event: :notifyChrVa_gatts_hvx( 38, 0x20005F6C, 172, 6) (for MicroBitBLEChar)
Graph Event: :notifyChtifyChrValue 80 (for MicroBitBLEChar)
Console Data: calling sd_ble_gatts_hvx( 80, 0x20006268, 30,sd_ble_gatts_hvx( 68, 0x20006224, 144, 6)
Graph Event: :noA, 53, 2) (for MicroBitBLEChar)
Graph Event: RECEIVED UNKNWON FRAME (for UIPM)
Graph Event: :notitBLEChar::notifyChrValue 71 (for MicroBitBLEChar)
Graph Event: :notifyChrValue 68 (for calling sd_ble_gatts_hvx( 71, 0x2croBitBLEChar)
Console Data: calling sd_ble_gatts_hvx( 68notifyChrValue 71
Graph Event: :notifyChrValue 68 (for calling sd_ble_gatts_hvx( 71, 0x2000622A, 5oBitBLEChar)
Console Data: calling sd_ble_gatts_hvx( 68, _gatts_hvx( 38, 0x20005F6C, 184, 6)
Graph Event: :notifyChle_gatts_hvx( 38, 0x20005F6C, 176, 6) (for MicroBitBLEChar)
Graph Event: :notify6) (for MicroBitBLEChar)
Graph Event: :notifyChrValue 71 (for MicroBitBLEChar)
Console Data: calling sd_ble_gatts_hue 38
Console Data: calling sd_ble_gatts_hvx( 38, 0x20005F6C, 176, 6)
Console Data: MicrnotifyChrValue 38
Console Data: calling sd_ble_gatts_hvx( 38, 0x20005F6C, 1F6C, 164, 6)
Graph Event: :notifyChrValue 68 (for MicroBitBLEChar)
Console Data: calling sd_bs_hvx( 71, 0x2000622A, 53, 2)
Graph Event: RECEIVED UNKNWON FRAME (for UIPM)
Console Data: M 38, 0x20005F6C, 176, 6)
Graph Event: :notifyChrValue 38 (for MicroBitBLEChar)
Console Data: otifyChrValue 38
Console Data: calling sd_ble_gatts_hvx( 38, 0x20005F6C, 16hvx( 38, 0x20005F6C, 180, 6)
Graph Event: :notifyChrValue , 0x2000622A, 52, 2) (for MicroBitBLEChar)
Graph Event: :notifyChrValue 68 (for MicroBitBLEChar)
Console Data: callfyChrValue 68
Console Data: calling sd_ble_gatts_hvx( 68, 0x20006224, 118, , 0x20006224, 188, 6)
Graph Event: :notifyChrValue 71 (for MicroBitBLEChar)
Console Data: cal 6)
Graph Event: :notifyChrValue 71 (for MicroBitBLEChar)
Console Data: calling sd_ble_gatts_)
Graph Event: RECEIVED UNKNWON FRAME (for UIPM)
Graph Event: :notifyChrVal( 38, 0x20005F6C, 176, 6) (for MicroBitBLEChar)
Graph Event: :notifyChrValue 68 (for MicroBitBLEChar)
Graph Event: RECEIVED UNKNWON FRAME (for UIPM)
Console Data: MicroBitBLEChg sd_ble_gatts_hvx( 68, 0x20006224, 20, 6)
Graph Event: :notifyChrValue 71 (for MicroBitBLEChar)
Console Data: calling sd_ble_gatts_hvx( 71, 0x2000622A, 5300622A, 53, 2)
Graph Event: :notifyChrValue 68 (for MicroBitBLEChar)
Console Data: calling sdvx( 68, 0x20006224, 188, 6)
Graph Event: :notifyChrValue 7notifyChrValue 38 (for MicroBitBLEChar)
Console Data: calling sd_ble_gatts_hvx( 38, 0x20005F6C, 1hvx( 68, 0x20006224, 126, 6)
Graph Event: :notifyChrValue , 6) (for MicroBitBLEChar)
Graph Event: :notifyChrValue 71 (for MicroBitBLEChar)
Console Data: calling sd_ble_gattsgatts_hvx( 38, 0x20005F6C, 172, 6)
Graph Event: :notifyChrlue 38 (for MicroBitBLEChar)
Console Data: calling sd_ble_gatts_hvx( 38, 0x20005F6C, 168, 6)
Console Data: Micalue 71
Console Data: calling sd_ble_gatts_hvx( 71, 0x2000622A, 53, 2)
Console Data: Mic 0x20005F6C, 176, 6)
Graph Event: :notifyChrValue 68 (for MicroBitBLEChar)
Console Data: call, 6)
Graph Event: :notifyChrValue 38 (for MicroBitBLEChar)
Console Data: calling sd_ble_gattsvx( 71, 0x2000622A, 53, 2)
Graph Event: :notifyChrValue 388, 0x20005F6C, 172, 6) (for MicroBitBLEChar)
Graph Event: :notifyChrValue 68 (for MicroBitBLEChar)
Console Data: cag sd_ble_gatts_hvx( 68, 0x20006224, 38, 6)
Graph Event: :ng sd_ble_gatts_hvx( 71, 0x2000622A, 53, 2) (for MicroBitBLEChar)
Graph Event: RECEIVED UNKs_hvx( 71, 0x2000622A, 53, 2) (for UIPM)
Graph Event: RECEIVED UNKNWON FRAME (for UIPM)
Console Data: Mlue 38
Console Data: calling sd_ble_gatts_hvx( 38, 0x20005F6C, 172, 6)
Console Data: Miclue 38
Console Data: calling sd_ble_gatts_hvx( 38, 0x20005F6C, 172, 6)
Graph Event: :notifyChrValue 68 (for MicroBitBLEChar)
Console Data: calling sd_ble_gatts_hvx( 68, 0x20006224, 152, 6)
Graph Event: :notifyChrValue 71 (for MicroBitBLEChar)
Console Data: calling sd_ble_gatts_hvx( 71, 0x2000622A, 54, 2)
Graph Event: :notifyChrValue 38 (for MicroBitBLEChar)
Console Data: calling sd_ble_gatts_hvx( 38, 0x20005F6C, 168, 6)
Graph Event: :notifyChrValue 68 (for MicroBitBLEChar)
Console Data: calling sd_ble_gatts_hvx( 68, 0x20006224, 108, 6)
Graph Event: :notifyChrValue 71 (for MicroBitBLEChar)
Console Data: calling sd_ble_gatts_hvx( 71, 0x2000622A, 54, 2)
ziriax commented 3 years ago

I've committed my source code and Dockerfiles to build the HEX files for both v1 and v2 here: https://github.com/Ziriax/ble-test

The website is deployed here https://ziriax.github.io/ble-test

I'm only using Windows, but should be easy to use on MacOS/Linux too.

The repo also contains a minimal react app, with code forked from https://github.com/thegecko/microbit-web-bluetooth

I have 4 micro:bit V2s, and all of them stop working after a while. I only have a single V1.5 device, but that keeps working

Interestingly, on my Android phone (Nokia 5.4, Android 10, Chrome 90), the V2 stops almost immediately, so that might be make debugging easier.

Do you have other tips on how to investigate this issue? Maybe see if a native Bluetooth app would work fine? Maybe try Scratch Link?

We would really like to support the micro:bit in our upcoming educational reactive programming sandbox (to be released in september if all goes well), but we really on bluetooth working well.

ziriax commented 3 years ago

I set CODAL_DEBUG=2, and each 3 seconds I'm printing device_heap_print.

I'm not sure if that is the right function to call to report memory, but I don't see any memory leaks.

ziriax commented 3 years ago

Any updates on this? Since this also happens with Microsoft's Makecode, this can affect a lot of users? 😢

martinwork commented 3 years ago

@Ziriax Thanks for analysing the problem. If there was a memory problem, I would expect to see panic 020 or 030.

Please see debug trace settings here: https://github.com/lancaster-university/codal-microbit-v2/pull/52#issuecomment-778195851.

If you saw the debugger arrive at app_error_fault_handler (id=1, pc=83684, info=0), that's NRF_FAULT_ID_SD_ASSERT, which would make sense, but that could be caused by the debugger. Calling device_heap_print too often is also likely to upset softdevice, because it disables IRQs.

I have built a similar services hex, with justworks security, and tested it with the iOS micro:bit app monitor. It failed somewhere between 25 and 75 mins. So it isn't caused by open link, or using WebBLE.

I will continue to investigate...

ziriax commented 3 years ago

Thanks for this! Maybe the available RAM could be displayed on the 5x5 LED screen? This wouldn't disturb the softdevice I guess.

martinwork commented 3 years ago

That log message about the RAM start is for development, for adjusting the linker script to give SoftDevice the RAM it needs, which depends on several configurations.

Testing different services: LED alone = OK LED + Accel = FAIL Accel alone = FAIL

ziriax commented 3 years ago

Could you maybe provide your HEX file and/or code?

Because with my code, on my PCs, I am observing "disconnects/freezes" after a random time, could be even just a few seconds...

martinwork commented 3 years ago

bluetooth-test.zip

Random seems to be the right word. I have been running the attached for over two and a half hours. It froze as I was writing this. Although it has frozen, BLE is still connected, and I can disconnect and reconnect.

Previous tests were with debug trace and were not reconnectable after failure. The LED alone test was overnight, and did not fail. LED + Accel failed quite quickly. Accel alone lasted nearly 30mins. It failed less quickly when I increased the accelerometer period.

finneyj commented 3 years ago

Thanks all. Just to clarify, what exactly is it that freezes here? Just the BLE comms or the whole microbit?

martinwork commented 3 years ago

If anything, I think, BLE comms keep going, but the rest of the micro:bit freezes.

When DMESG is streaming to Tera term, it seems like both BLE and micro:bit, and BLE disconnects. The Tera term stream ends mid-word.

But when it fails without DMESG, at least sometimes, though the blinking LED is frozen and the iOS app is no longer getting updated accelerations or LED states, it remains connected and I can disconnect/reconnect BLE, and discover services. But the values are still frozen.

finneyj commented 3 years ago

That is very odd. Can you get a debugger on it to see if the CPU is stuck somewhere? SoftDevice won't like it, but seeing as it isn't working anyway... it won't do any harm to try after the fault occurs?

finneyj commented 3 years ago

It could indeed just be that we disabled interrupts during DMESG operations... It does kind of sound like SoftDevice is not returning control back to the application code (and it does run at the highest IRQ priority level).

It would be interesting to know if this happens less / not at all when the number of times we disable interrupts is reduced.

finneyj commented 3 years ago

Just a thought... but for builds with BLE running, we could consider replacing the codal nrf52 target_disable_irq() / target_enable_irq() implementation with sd_nvic_critical_region_enter() / sd_nvic_critical_region_exit() primitives.

That should keep SoftDevice happier as its internal interrupts should keep firing.

I imagine this is more expensive in terms of CPU though, so we shouldn't do that unless necessary... most micro:bit builds don't run with BLE enabled.

martinwork commented 3 years ago

My first thought was about too much IRQ disabling, but hopefully we haven't hit that. If we did create a codal-microbit-v2 level target_disable_irq() with critical_region_enter/exit, would that mean that all BLE code must be careful about anything that uses target_disable_irq() protection, e.g. system_timer_xyz? Maybe a target_critical_region_enter/exit() could be useful.

Adventures with the debugger... After waiting for it to lock up, I disconnected BLE and pressed the VSCode pause button. It stopped in system_timer_wait_us, with period 10. There was no call stack, but the only candidate caller I can see is in NRF52I2C::waitForStop. I just failed to set a breakpoint there, for some reason, but is that a possibility?

finneyj commented 3 years ago

Thanks @martinwork.

I don't think BLE application code would need to special consideration would it? My understanding was that critical_region_enter/exit only leaves the internal SoftDevice interrupts enabled. I thought application level events were still masked. In which case it should make no difference? I may be wrong in that assumption though - SoftDevice is a bit of a black box and the documentation on what those calls actually do is a little vague.

Thanks for the debugger analysis. Having no call stack is really odd... That call will briefly disable interrupts whilst reading the current time. But seeing as its spin waiting, the duty cycle of that could be quite high for that 10uS window...

martinwork commented 3 years ago

Thanks @finneyj Your absolutely right, of course (I think!). The softdevice-to-app IRQ seems to be counted as an application IRQ.

martinwork commented 3 years ago

@finneyj I think it's stuck in NRF52I2C::waitForStop. I added a counter to the outer while loop, with a panic at 100 loops.

On the way to that idea, I noticed that before BLE connection, the debugger breaks at WFE(). After failure and BLE disconnection, it breaks at system_timer_wait_us(10).

Edit: I noticed the loop already had a counter to 100 and increased my counter with the same results. Further tests indicate it only sticks when BLE is connected.

ziriax commented 3 years ago

@martinwork Thanks for the HEX file, I will try it out.

Do you have some tips on how to do source debugging on the micro:bit device? I tried VSCODE with this codal-v2-samples open pull request with the latest version of OpenOCD, but I really want the debugger to stop immediately at then program counter when an exception occured, and not in the app_error_fault_handler, after the fact.

I'm not sure what you guys are using for debugging? I'm a Windows developer myself, trying to learn about embedded software development.

Thanks a lot

martinwork commented 3 years ago

I tested a program sending acceleration via serial, built with BLE services included. After pairing with iPad, it kept going for several hours while not BLE connected, but failed after about 25 minutes with BLE connected.

I think it's getting stuck waiting for NRF_TWIM_EVENT_SUSPENDED, inside NRF52I2C::waitForStop inside NRF52I2C::write: https://github.com/lancaster-university/codal-nrf52/blob/master/source/NRF52I2C.cpp#L195

I tried reversing the order of the trigger RESUME and clear SUSPENDED lines, without any luck.

martinwork commented 1 year ago

I think this is fixed. Though Bluetooth seemed to help trigger it, this seems to have been a problem with I2C related to: https://github.com/lancaster-university/codal-microbit-v2/issues/102, https://github.com/lancaster-university/codal-microbit-v2/issues/141. I have had this program ble-uart-accel.zip running and connected to Bluetooth, on a V2.21 for a few hours without any problem.