pybricks / support

Pybricks support and general discussion
MIT License
109 stars 7 forks source link

Race condition causing Bluetooth on Technic hub and City hub to lock up while waiting for SRDY signal from Bluetooth chip #306

Closed laurensvalk closed 3 years ago

laurensvalk commented 3 years ago

Describe the bug

The start/stop buttons can still get into a bad state.

Once the Technic Hub didn't freeze, but another time it did. In both cases I could still turn it off by holding the button.

To reproduce I do not have consistent reproducable example yet. Working on this now.

I was experimenting with I/O in order to prepare a remote control demo. This is part of a larger program for a Technic Truck. Not sure yet if large program is required to reproduce.

Pybricks Code v1.0.0-beta.14

Version 89.0.4389.90 (Official Build) Built on Ubuntu , running on Ubuntu 18.04 (64-bit)


from usys import stdin
from uselect import poll

p = poll()
p.register(stdin)

while True:
    if p.poll(0):
        print(stdin.read(1), end="")

Expected behavior Not freeze UI or hub.

Screenshots image

dlech commented 3 years ago

FYI, xterm parse error is likely due to pressing backspace and is doesn't seem to be related. https://github.com/xtermjs/xterm.js/issues/3050

dlech commented 3 years ago

So this seems like all of the other firmware lockup/corruption issues that have been reported recently.

dlech commented 3 years ago

I was able to reproduce using the program above after letting it run for some time while continuously pressing buttons. The hub became unresponsive (pressing the hub button seems to have stopped the program but it did not go back to advertising, long button press could still turn off the hub). Pybricks Code did not have any issues (e.g. buttons were never disabled).

dlech commented 3 years ago

I don't seem to be able to reproduce this on PrimeHub, so that points to an issue with the Bluetooth driver since PrimeHub has a different Bluetooth chip from TechnicHub.

(I have found that drawing some ASCII art is a nice way to reproduce this issue :smile: - I couldn't get this much drawn on the TechnicHub before Bluetooth stopped responding but PrimeHub is still going)

+----------------------------------------------------------------------------------------------------------------------------------------+
|\          /\          /\          /\          /\          /\          /\          /\          /\          /\          /\          /    |
| \        /  \        /  \        /  \        /  \        /  \        /  \        /  \        /  \        /  \        /  \        /     |
|  \      /    \      /    \      /    \      /    \      /    \      /    \      /    \      /    \      /    \      /    \      /      |
|   \    /      \    /      \    /      \    /      \    /      \    /      \    /      \    /      \    /      \    /      \    /       |
|    \  /        \  /        \  /        \  /        \  /        \  /        \  /        \  /        \  /        \  /        \  /        |
|     \/          \/          \/          \/          \/          \/          \/          \/          \/          \/          \/         |
|     /\          /\          /\          /\          /\          /\          /\          /\          /\          /\          /\         |
|    /  \        /  \        /  \        /  \        /  \        /  \        /  \        /  \        /  \        /  \        /  \        |
|   /    \      /    \      /    \      /    \      /    \      /    \      /    \      /    \      /    \      /    \      /    \       |
|  /      \    /      \    /      \    /      \    /      \    /      \    /      \    /      \    /      \    /      \    /      \      |
| /        \  /        \  /        \  /        \  /        \  /        \  /        \  /        \  /        \  /        \  /        \     |
|/          \/          \/          \/          \/          \/          \/          \/          \/          \/          \/          \    |
+----------------------------------------------------------------------------------------------------------------------------------------+
dlech commented 3 years ago

I've tried adding a debug assert by changing the color of the hub status light on all unhandled errors in the Technic hub Bluetooth driver and it doesn't look like we are hitting any of them.

diff --git a/lib/pbio/drv/bluetooth/bluetooth_stm32_cc2640.c b/lib/pbio/drv/bluetooth/bluetooth_stm32_cc2640.c
index 2623bada..7a175d23 100644
--- a/lib/pbio/drv/bluetooth/bluetooth_stm32_cc2640.c
+++ b/lib/pbio/drv/bluetooth/bluetooth_stm32_cc2640.c
@@ -59,6 +59,12 @@
 #define DBG(...)
 #endif

+#include <pbio/color.h>
+#include <pbio/light.h>
+#include <pbsys/light.h>
+#undef assert
+#define assert(cond) { if (!(cond)) pbio_color_light_on(pbsys_status_light, PBIO_COLOR_RED); }
+
 // name used for standard GAP device name characteristic
 #define DEV_NAME "Pybricks Hub"

@@ -361,6 +367,7 @@ retry:
     if (status == blePending) {
         goto retry;
     }
+    assert(status == bleSUCCESS);

 done:
     send->done(send);
@@ -388,6 +395,13 @@ void pbdrv_bluetooth_stm32_cc2640_spi_xfer_irq(void) {
     process_poll(&pbdrv_bluetooth_spi_process);
 }

+static bool spi_error;
+
+void pbdrv_bt_error(void) {
+    spi_error = true;
+    process_poll(&pbdrv_bluetooth_spi_process);
+}
+
 static void read_by_type_response_uuid16(uint16_t connection_handle,
     uint16_t attr_handle, uint8_t property_flags, uint16_t uuid) {
     attReadByTypeRsp_t rsp;
@@ -515,6 +529,7 @@ static void handle_event(uint8_t *packet) {
                             break;
                         default:
                             DBG("unhandled read by type req: %04X", type);
+                            assert(0);
                             break;
                     }
                 }
@@ -587,6 +602,7 @@ static void handle_event(uint8_t *packet) {
                         ATT_ReadRsp(connection_handle, &rsp);
                     } else {
                         DBG("unhandled read req: %04X", handle);
+                        assert(0);
                     }
                 }
                 break;
@@ -674,6 +690,7 @@ static void handle_event(uint8_t *packet) {
                             ATT_ErrorRsp(connection_handle, &rsp);

                             DBG("unhandled read by grp type req: %05X", group_type);
+                            assert(0);
                         }
                         break;
                     }
@@ -701,8 +718,10 @@ static void handle_event(uint8_t *packet) {
                         DBG("noti: %d", uart_tx_notify_en);
                     } else {
                         DBG("unhandled write req: %04X", char_handle);
+                        assert(0);
                     }
                     if (!command) {
+                        assert(write_xfer_size == 0);
                         ATT_WriteRsp(connection_handle);
                     }
                 }
@@ -746,16 +765,23 @@ static void handle_event(uint8_t *packet) {
                     }
                     if (status != bleSUCCESS) {
                         DBG("status: %02X %04X", status, connection_handle);
+                        assert(0);
                     }
                 }
                 break;

                 default:
                     DBG("unhandled: %04X", event_code);
+                    assert(0);
                     break;
             }
         }
         break;
+
+        default: {
+            assert(0);
+        }
+        break;
     }

     if (bluetooth_on_event) {
@@ -1146,6 +1172,10 @@ PROCESS_THREAD(pbdrv_bluetooth_spi_process, ev, data) {
     static struct etimer timer;
     static uint8_t read_xfer_size, xfer_size;

+    PROCESS_POLLHANDLER({
+        assert(!spi_error);
+    });
+
     PROCESS_EXITHANDLER({
         spi_set_mrdy(false);
         bluetooth_reset(RESET_STATE_OUT_LOW);
@@ -1220,7 +1250,7 @@ start:

         // send the write header
         spi_xfer_complete = false;
-        pdata->spi_start_xfer(write_buf, read_buf, NPI_SPI_HEADER_LEN);
+        assert(pdata->spi_start_xfer(write_buf, read_buf, NPI_SPI_HEADER_LEN) == 0);
         PROCESS_WAIT_UNTIL(spi_xfer_complete);

         // Total transfer size is biggest of read and write sizes.
@@ -1233,8 +1263,8 @@ start:

         // read the remaining message
         spi_xfer_complete = false;
-        pdata->spi_start_xfer(&write_buf[NPI_SPI_HEADER_LEN],
-            &read_buf[NPI_SPI_HEADER_LEN], xfer_size);
+        assert(pdata->spi_start_xfer(&write_buf[NPI_SPI_HEADER_LEN],
+            &read_buf[NPI_SPI_HEADER_LEN], xfer_size) == 0);
         PROCESS_WAIT_UNTIL(spi_xfer_complete);

         spi_set_mrdy(false);
diff --git a/lib/pbio/drv/bluetooth/bluetooth_stm32_cc2640.h b/lib/pbio/drv/bluetooth/bluetooth_stm32_cc2640.h
index eccaae7a..1d89f4b2 100644
--- a/lib/pbio/drv/bluetooth/bluetooth_stm32_cc2640.h
+++ b/lib/pbio/drv/bluetooth/bluetooth_stm32_cc2640.h
@@ -22,7 +22,7 @@ typedef struct {
     /** SPI initialization callback. */
     void (*spi_init)(void);
     /** SPI start transfer callback. */
-    void (*spi_start_xfer)(const uint8_t *tx_buf, uint8_t *rx_buf, uint8_t xfer_size);
+    int (*spi_start_xfer)(const uint8_t *tx_buf, uint8_t *rx_buf, uint8_t xfer_size);
 } pbdrv_bluetooth_stm32_cc2640_platform_data_t;

 /** SRDY signal change interrupt handler. */
diff --git a/lib/pbio/platform/technic_hub/platform.c b/lib/pbio/platform/technic_hub/platform.c
index aff1f32d..03bc4f4b 100644
--- a/lib/pbio/platform/technic_hub/platform.c
+++ b/lib/pbio/platform/technic_hub/platform.c
@@ -90,8 +90,8 @@ static void bt_spi_init(void) {
     HAL_SPI_Init(&bt_spi);
 }

-static void bt_spi_start_xfer(const uint8_t *tx_buf, uint8_t *rx_buf, uint8_t xfer_size) {
-    HAL_SPI_TransmitReceive_DMA(&bt_spi, (uint8_t *)tx_buf, rx_buf, xfer_size);
+static int bt_spi_start_xfer(const uint8_t *tx_buf, uint8_t *rx_buf, uint8_t xfer_size) {
+    return HAL_SPI_TransmitReceive_DMA(&bt_spi, (uint8_t *)tx_buf, rx_buf, xfer_size);
 }

 const pbdrv_bluetooth_stm32_cc2640_platform_data_t pbdrv_bluetooth_stm32_cc2640_platform_data = {
@@ -171,6 +171,11 @@ void HAL_SPI_TxRxCpltCallback(SPI_HandleTypeDef *hspi) {
     pbdrv_bluetooth_stm32_cc2640_spi_xfer_irq();
 }

+void HAL_SPI_ErrorCallback(SPI_HandleTypeDef *hspi) {
+    extern void pbdrv_bt_error(void);
+    pbdrv_bt_error();
+}
+
 // Button

 const pbdrv_button_gpio_platform_t pbdrv_button_gpio_platform[PBDRV_CONFIG_BUTTON_GPIO_NUM_BUTTON] = {
dlech commented 3 years ago

I think the next step is to solder wires to the test points in the neighborhood of the Bluetooth chip and hope the are for the SPI signals so we can see what is really going on - unless anyone else has any better ideas?

laurensvalk commented 3 years ago

unless anyone else has any better ideas?

How about this for debugging:

  1. Include a program that reproduces it as main.py
  2. Disable all download & run functionality on firmware side
  3. Verify that the problem does not occur now.
    • At this point we only have the uart characteristic?
    • We can make basic i/o with pybricksdev/bleak for testing (since Pybricks Code probably won't work with the firmware in this state.)
  4. Re-enable download & run piece by piece until the problem arises
laurensvalk commented 3 years ago

I have found that drawing some ASCII art is a nice way to reproduce this issue

By echoing? Just printing what you shared in a tight loop doesn't seem to lock up the hub for me.

But maybe that's another angle to the problem... Does this only occur when using stdin or do we have examples with only print? We could just drop stdin support for this release.

laurensvalk commented 3 years ago

Might have found something useful... :bulb:

I'm wondering if we've got a more generic memory corruption issue with stdin, which just happens to manifest itself as a lockup on Technic Hub. The following apparent memory corruption occurs on both Technic Hub and Prime Hub.

Neither hub freezes, so if this is the same issue, debugging should be easier. (Assuming it's not just a bad implementation of the color type).

EDIT: more minimal example below.

Expected output

Detected Color.YELLOW

Detected Color.NONE

Detected Color.GREEN

Actual output

Detected Color.YELLOW

Detected Color(h=0, s=0, v=0)  # Reverse lookup of color name failed, so printed HSV

Detected Color(h=120, s=100, v=100) # Reverse lookup of color name failed, so printed HSV
from pybricks.pupdevices import ColorSensor
from pybricks.parameters import Port, Color

from usys import stdin
from uselect import poll

sensor = ColorSensor(Port.B)

last_color = Color.NONE

p = poll()
p.register(stdin)

while True:

    # This prints a color name when a new color is detected. Nothing special here.
    new_color = sensor.color()
    if new_color != last_color:
        print("\nDetected", sensor.color())
        last_color = new_color

    # Echos characters. But this code block changes how colors are printed above (try commenting out).
    # The reverse color name lookup fails when this block is here.
    if p.poll(0):
        print(stdin.read(1), end="")
laurensvalk commented 3 years ago

Might have found something useful... :bulb:

I'm wondering if we've got a more generic memory corruption issue with stdin, which just happens to manifest itself as a lockup on Technic Hub. The following apparent memory corruption occurs on both Technic Hub and Prime Hub.

Neither hub freezes, so if this is the same issue, debugging should be easier. (Assuming it's not just a bad implementation of the color type).

EDIT: This is probably just the Color dict being garbage-collected, which is a separate issue. But it did lead to some additional insights with the freeze possibly coinciding with garbage collection. See posts below.

from pybricks.parameters import Color

from usys import stdin
from uselect import poll

p = poll()
p.register(stdin)

# Prints the whole color dictionary (good)
print(Color)

# Wait for one key press
while not p.poll(0):
    pass

# Prints an empty dictionary (bad)
print(Color)
BertLindeman commented 3 years ago

Hope this helps a bit.

Ran technichub with the remote.py program. With the print it locks up in no time. Removed the print and ran again while hammering the keyboard for some time longer. Normally a CTRL-C would stop the program. But not now. The stop button DID stop the program:

Traceback (most recent call last):
  File "main.py", line 14, in <module>
SystemExit: stop button pressed
laurensvalk commented 3 years ago

Even more minimal. This freezes the Technic Hub in a second or two. Prime Hub is fine with this one. No stdin or colors here. Just loading the memory.

So could things be failing when a garbage collect occurs?

count = 0
while True:
    count += 1
    a = [i for i in range(1000)]
    print(count)
BertLindeman commented 3 years ago

Last time today I interfere here 😉 Used this - no print but memory info:

try:
    import micropython
except:
    print("No micropython import")

count = 0
while True:
    count += 1
    a = [i for i in range(1000)]
    # print(count)
    micropython.mem_info()

And got:

stack: 380 out of 4856
GC: total: 48384, used: 4688, free: 43696
 No. of 1-blocks: 9, 2-blocks: 4, max blk sz: 256, max free sz: 2724
stack: 380 out of 4856
GC: total: 48384, used: 8832, free: 39552
 No. of 1-blocks: 12, 2-blocks: 4, max blk sz: 256, max free sz: 2468
stack: 380 out of 4856
GC: total: 48384, used: 12976, free: 35408
 No. of 1-blocks: 15, 2-blocks: 4, max blk sz: 256, max free sz: 2212
stack: 380 out of 4856
GC: total: 48384, used: 17120, free: 31264
 No. of 1-blocks: 18, 2-blocks: 4, max blk sz: 256, max free sz: 1954
stack: 380 out of 4856
GC: total: 48384, used: 21264, free: 27120
 No. of 1-blocks: 21, 2-blocks: 4, max blk sz: 256, max free sz: 1695
stack: 380 out of 4856
GC: total: 48384, used: 25408, free: 22976
 No. of 1-blocks: 24, 2-blocks: 4, max blk sz: 256, max free sz: 1436
stack: 380 out of 4856
GC: total: 48384, used: 29552, free: 18832
 No. of 1-blocks: 27, 2-blocks: 4, max blk sz: 256, max free sz: 1177
stack: 380 out of 4856
GC: total: 48384, used: 33696, free: 14688
 No. of 1-blocks: 30, 2-blocks: 4, max blk sz: 256, max free sz: 918
stack: 380 out of 4856
GC: total: 48384, used: 37840, free: 10544
 No. of 1-blocks: 33, 2-blocks: 4, max blk sz: 256, max free sz: 659
stack: 380 out of 4856
GC: total: 48384, used: 41984, free: 6400
 No. of 1-blocks: 36, 2-blocks: 4, max blk sz: 256, max free sz: 400
stack: 380 out of 4856
GC: total: 48384, used: 46128, free: 2256
 No. of 1-blocks: 39, 2-blocks: 4, max blk sz: 256, max free sz: 141
stack: 380 out of 4856
GC: total: 48384, used: 8816, free: 39568
 No. of 1-blocks: 11, 2-blocks: 4, max blk sz: 256, max free sz: 2326
stack: 380 out of 4856
GC: total: 48384, used: 12960, free: 35424
 No. of 1-blocks: 14, 2-blocks: 4, max blk sz: 256, max free sz: 2070
stack: 380 out of 4856
GC: total: 48384, used: 17104, free: 31280
 No. of 1-blocks: 17, 2-blocks: 4, max blk sz: 256, max free sz: 1814
stack: 380 out of 4856
GC: total: 48384, used: 21248, free: 27136
 No. of 1-blocks: 20, 2-blocks: 4, max blk sz: 256, max free sz: 1558
stack: 380 out of 4856
GC: total: 48384, used: 25392, free: 22992
 No. of 1-blocks: 23, 2-blocks: 4, max blk sz: 256, max free sz: 1299
stack: 380 out of 4856
GC: total: 48384, used: 29536, free: 18848
 No. of 1-blocks: 26, 2-blocks: 4, max blk sz: 256, max free sz: 1040
stack: 380 out of 4856
GC: total: 48384, used: 33680, free: 14704
 No. of 1-blocks: 29, 2-blocks: 4, max blk sz: 256, max free sz: 781
stack: 380 out of 4856
GC: total: 48384, used: 37824, free: 10560
 No. of 1-blocks: 32, 2-blocks: 4, max blk sz: 256, max free sz: 522
stack: 380 out of 4856
GC: total: 48384, used: 41968, free: 6416
 No. of 1-blocks: 35, 2-blocks: 4, max blk sz: 256, max free sz: 263
stack: 380 out of 4856
GC: total: 48384, used: 46112, free: 2272
 No. of 1-blocks: 38, 2-blocks: 4, max blk sz: 256, max free sz: 138
stack: 380 out of 4856
GC: total: 48384, used: 8816, free: 39568
 No. of 1-blocks: 11, 2-blocks: 4, max blk sz: 256, max free sz: 2065
stack: 380 out of 4856
GC: total: 48384, used: 12960, free: 35424
 No. of 1-blocks: 14, 2-blocks: 4, max blk sz: 256, max free sz: 1809
stack: 380 out of 4856
GC: total: 48384, used: 17104, free: 31280
 No. of 1-blocks: 17, 2-blocks: 4, max blk sz: 256, max free sz: 1553
stack: 380 out of 4856
GC: total: 48384, used: 21248, free: 27136
 No. of 1-blocks: 20, 2-blocks: 4, max blk sz: 256, max free sz: 1297
stack: 380 out of 4856
GC: total: 48384, used: 25392, free: 22992
 No. of 1-blocks: 23, 2-blocks: 4, max blk sz: 256, max free sz: 1038
stack: 380 out of 4856
GC: total: 48384, used: 29536, free: 18848
 No. of 1-blocks: 26, 2-blocks: 4, max blk sz: 256, max free sz: 779
stack: 380 out of 4856
GC: total: 48384, used: 33680, free: 14704
 No. of 1-blocks: 29, 2-blocks: 4, max blk sz: 256, max free sz: 520
stack: 380 out of 4856
GC: total: 48384, used: 37824, free: 10560
 No. of 1-blocks: 32, 2-blocks: 4, max blk sz: 256, max free sz: 399
stack: 380 out of 4856
GC: total: 48384, used: 41968, free: 6416
 No. of 1-blocks: 35, 2-blocks: 4, max blk sz: 256, max free sz: 399
stack: 380 out of 4856
GC: total: 48384, used: 46112, free: 2272
 No. of 1-blocks: 38, 2-blocks: 4, max blk sz: 256, max free sz: 142
stack: 380 out of 4856
GC: total: 48384, used: 8816, free: 39568
 No. of 1-blocks: 11, 2-blocks: 4, max blk sz: 256, max free sz: 2325
stack: 380 out of 4856
GC: total: 48384, used: 12960, free: 35424
 No. of 1-blocks: 14, 2-blocks: 4, max blk sz: 256, max free sz: 2069
stack: 380 out of 4856
GC: total: 48384, used: 17104, free: 31280
 No. of 1-blocks: 17, 2-blocks: 4, max blk sz: 256, max free sz: 1813
stack: 380 out of 4856
GC: total: 48384, used: 21248, free: 27136
 No. of 1-blocks: 20, 2-blocks: 4, max blk sz: 256, max free sz: 1557
stack: 380 out of 4856
GC: total: 48384, used: 25392, free: 22992
 No. of 1-blocks: 23, 2-blocks: 4, max blk sz: 256, max free sz: 1298
stack: 380 out of 4856
GC: total: 48384, used: 29536, free: 18848
 No. of 1-blocks: 26, 2-blocks: 4, max blk sz: 256, max free sz: 1039
stack: 380 out of 4856
GC: total: 48384, used: 33680, free: 14704
 No. of 1-blocks: 29, 2-blocks: 4, max blk sz: 256, max free sz: 780
stack: 380 out of 4856
GC: total: 48384, used: 37824, free: 10560
 No. of 1-blocks: 32, 2-blocks: 4, max blk sz: 256, max free sz: 521
stack: 380 out of 4856
GC: total: 48384, used: 41968, free: 6416
 No. of 1-blocks: 35, 2-blocks: 4, max blk sz: 256, max free sz: 262
stack: 380 out of 4856
GC: total: 48384, used: 46112, free: 2272
 No. of 1-blocks: 38, 2-blocks: 4, max blk sz: 256, max free sz: 139
stack: 380 out of 4856
GC: total: 48384, used: 8816, free: 39568
 No. of 1-blocks: 11, 2-blocks: 4, max blk sz: 256, max free sz: 2065
stack: 380 out of 4856
GC: total: 48384, used: 12960, free: 35424
 No. of 1-blocks: 14, 2-blocks: 4, max blk sz: 256, max free sz: 1809
stack: 380 out of 4856
GC: total: 48384, used: 17104, free: 31280
 No. of 1-blocks: 17, 2-blocks: 4, max blk sz: 256, max free sz: 1553
stack: 380 out of 4856
GC: total: 48384, used: 21248, free: 27136
 No. of 1-blocks: 20, 2-blocks: 4, max blk sz: 256, max free sz: 1297
stack: 380 out of 4856
GC: total: 48384, used: 25392, free: 22992
 No. of 1-blocks: 23, 2-blocks: 4, max blk sz: 256, max free sz: 1038
stack: 380 out of 4856
GC: total: 48384, used: 29536, free: 18848
 No. of 1-blocks: 26, 2-blocks: 4, max blk sz: 256, max free sz: 779
stack: 380 out of 4856
GC: total: 48384, used: 33680, free: 14704
 No. of 1-blocks: 29, 2-blocks: 4, max blk sz: 256, max free sz: 520
stack: 380 out of 4856
GC: total: 48384, used: 37824, free: 10560
 No. of 1-blocks: 32, 2-blocks: 4, max blk sz: 256, max free sz: 399
stack: 380 out of 4856
GC: total: 48384, used: 41968, free: 6416
 No. of 1-blocks: 35, 2-blocks: 4, max blk sz: 256, max free sz: 399
stack: 380 out of 4856
GC: total: 48384, used: 46112, free: 2272
 No. of 1-blocks: 38, 2-blocks: 4, max blk sz: 256, m

EDIT: and it hang.....

laurensvalk commented 3 years ago

unless anyone else has any better ideas?

How about this for debugging:

Follow up to suggested approach: MicroPython appears to continue working, so we could add some stuff to pybricks.experimental to poke at relevant settings/status values from a user script, displaying output with motors and lights, etc. Come think of it, we can even transmit numeric values to an EV3 with infrared, which should make things a lot more convenient. Let me put some code together to send generic messages...

EDIT: And perhaps we could reset the bluetooth state/chip after the program is completed so we can print out any variables logged prior to the lockup.

laurensvalk commented 3 years ago

Here's an adapted variant of the minimal case above, with an additional observation regarding Bluetooth on the Technic Hub.

Now the interesting parts:

from pybricks.pupdevices import Motor
from pybricks.parameters import Port

motor = Motor(Port.A)
motor.run(100)

count = 0
while True:
    count += 1
    a = [i for i in range(1000)]
    print(count)

image

dlech commented 3 years ago

I think we might be confusing different issues here. So far, I have been focusing on the issue where Bluetooth stops working on the Technic hub but everything else keeps working. If the hub freezes entirely, then it is a separate issue.

laurensvalk commented 3 years ago

So far, I have been focusing on the issue where Bluetooth stops working on the Technic hub but everything else keeps working

That's the one I'm focusing on as well. See e.g this post; it remains possible to start scripts and this can be used to trigger the lockup when starting to advertise.

dlech commented 3 years ago

Summarizing our skype chat, it is starting to look like this is a memory corruption issue rather than a Bluetooth issue as allocating memory in a loop appears to be a common component to this issue and several other issues recently reported.

laurensvalk commented 3 years ago

For what it's worth, this happens much less on the City Hub. It does seem to happen eventually.

laurensvalk commented 3 years ago

I followed two different debug strategies today, where 1) raised some suspicions that seem to be supported by 2):

  1. Checked where user program was being blocked, which was in print, in pbsys_bluetooth_tx, possibly in lwrb_write since I was getting PBIO_ERROR_AGAIN all the time.
  2. Go through old firmware builds with a corresponding Pybricks Code build.

This seems to have started when we added the lightweight ring buffer with corresponding pbio revisions. Technically, when we started using it in this commit.

From then onwards, this script causes a hang. Before then, it didn't hang.

count = 0
while True:
    count += 1
    a = [i for i in range(1000)]
    print(count)

This seems like really good news, since this appears to be mainly a change in an abstraction layer instead of the introduction of a fundamental feature.

:tada: :rocket:

dlech commented 3 years ago

I think I pinned down the issue with the "color light debugging" technique (I think I just coined a new term too).

diff --git a/lib/pbio/drv/bluetooth/bluetooth_stm32_cc2640.c b/lib/pbio/drv/bluetooth/bluetooth_stm32_cc2640.c
index 2623bada..4d144223 100644
--- a/lib/pbio/drv/bluetooth/bluetooth_stm32_cc2640.c
+++ b/lib/pbio/drv/bluetooth/bluetooth_stm32_cc2640.c
@@ -1141,12 +1141,14 @@ static PT_THREAD(init_task(struct pt *pt, void *context)) {

     PT_END(pt);
 }
-
+#include <pbio/light.h>
+#include <pbsys/light.h>
 PROCESS_THREAD(pbdrv_bluetooth_spi_process, ev, data) {
     static struct etimer timer;
     static uint8_t read_xfer_size, xfer_size;

     PROCESS_EXITHANDLER({
+        pbio_color_light_on(pbsys_status_light, PBIO_COLOR_MAGENTA);
         spi_set_mrdy(false);
         bluetooth_reset(RESET_STATE_OUT_LOW);
         bluetooth_ready = false;
@@ -1195,7 +1197,7 @@ start:
     while (true) {
         static uint8_t real_write_xfer_size;

-        PROCESS_WAIT_UNTIL(spi_srdy || write_xfer_size);
+        PROCESS_WAIT_UNTIL({pbio_color_light_on(pbsys_status_light, PBIO_COLOR_RED); spi_srdy || write_xfer_size;});

         spi_set_mrdy(true);

@@ -1210,7 +1212,7 @@ start:

         if (real_write_xfer_size) {
             // if we are writing only, we have to wait until SRDY is asserted
-            PROCESS_WAIT_UNTIL(spi_srdy);
+            PROCESS_WAIT_UNTIL({pbio_color_light_on(pbsys_status_light, PBIO_COLOR_ORANGE); spi_srdy;});
         } else {
             // if we are reading only, the write buffer has to be all 0s
             memset(write_buf, 0, PBIO_ARRAY_SIZE(write_buf));
@@ -1221,7 +1223,7 @@ start:
         // send the write header
         spi_xfer_complete = false;
         pdata->spi_start_xfer(write_buf, read_buf, NPI_SPI_HEADER_LEN);
-        PROCESS_WAIT_UNTIL(spi_xfer_complete);
+        PROCESS_WAIT_UNTIL({pbio_color_light_on(pbsys_status_light, PBIO_COLOR_YELLOW); spi_xfer_complete;});

         // Total transfer size is biggest of read and write sizes.
         read_xfer_size = 0;
@@ -1235,10 +1237,10 @@ start:
         spi_xfer_complete = false;
         pdata->spi_start_xfer(&write_buf[NPI_SPI_HEADER_LEN],
             &read_buf[NPI_SPI_HEADER_LEN], xfer_size);
-        PROCESS_WAIT_UNTIL(spi_xfer_complete);
+        PROCESS_WAIT_UNTIL({pbio_color_light_on(pbsys_status_light, PBIO_COLOR_GREEN); spi_xfer_complete;});

         spi_set_mrdy(false);
-        PROCESS_WAIT_UNTIL(!spi_srdy);
+        PROCESS_WAIT_UNTIL({pbio_color_light_on(pbsys_status_light, PBIO_COLOR_CYAN); !spi_srdy;});

         // set to 0 to indicate that xfer is complete
         write_xfer_size = 0;

This hangs on cyan. So, what I think is happening is that the SRDY pin on the Bluetooth chip is transitioning from high back to low faster than we can detect it. There is a timing diagram of what is supposed to happen here. There must be other interrupts or something happening so that Δt6 + Δt7 happens between spi_set_mrdy(false); and PROCESS_WAIT_UNTIL(!spi_srdy); so we end up waiting forever.

laurensvalk commented 3 years ago

Should there be another light statement with a different color right after that to be sure it hangs there, not somewhere further down below or before the next loop iteration?

Hanging on the last debug signal is always a bit suspicious :)

dlech commented 3 years ago

It is in an infinite loop, so if the loop continued, the light would go back to red and PROCESS_EXITHANDLER handles the case if we break out of the loop.

Also, issue is fixed, so that was it. :smile:

laurensvalk commented 3 years ago

I can confirm that this is fixed.