pybricks / support

Pybricks support and general discussion
MIT License
107 stars 6 forks source link

[Bug] Hub can get stuck while broadcasting #1419

Open JJHackimoto opened 7 months ago

JJHackimoto commented 7 months ago

Describe the bug Block Coding: When using Broadcast in two or more tasks running simultaneously, an error will be thrown and the program will terminate. Error below:

"OSError: This resource cannot be used in two tasks at once."

Expected behavior I expected the block coding to give me an error or prevent me from building the program this way before running the program. This is especially true since block coders may not be that experienced in coding, and may not understand the error thrown in the console. The error won't be seen if running the program off the hub without a bluetooth connection either.

I could also expect this to just work since I personally can't see the reason for this not working. It currently feels unintuitive that the only way to use Broadcasting is to have a loop that constantly sends out the chosen values instead of just sending an update when the values have actually changed (for example when just sending out Booleans once in a while).

BertLindeman commented 5 months ago

Running since 15:26 zulu. Needs a good look to see it still varies in color intensity. Diner time..

[EDIT] No stuck situation yet. Have also the same program as above with a wait(100) since 19:50 running. Same there (Local time now 20:43)

[EDIT2] 23:00 local time and the test hubs are still running OK. I will keep them running and look tomorrow if they had enough power to go on. Good night.

BertLindeman commented 5 months ago

Both hubs still running happily at 10:10 (HubA started no-wait-test, yesterday at 15:26)

So I assume they do not suffer from the fast blinking fail we test here. Simply by changing

transmitter = TechnicHub(broadcast_channel=1, observe_channels=[2, 3])

into

transmitter = TechnicHub(broadcast_channel=1)  # request of Laurens

Will keep them running until battery-low or a new test request coming in. Whichever comes first 😜

laurensvalk commented 5 months ago

Excellent, thank you!

BertLindeman commented 5 months ago

Hub A still is running. Hub B was switched off this morning. After a restart, it took some time before it flashed orange for low power. The batteries recovered a bit during the night, I think.

Status now:

('technichub', '3.5.0b1', 'v1.20.0-23-g6c633a8dd on 2024-03-21')

This is hub "HubB 3.5.0b1" as a technichub
        Bluetooth chip version "v3.02.00"
        battery voltage:         5345 mV
        battery voltage:        5.345 V

Successful test.

laurensvalk commented 5 months ago

While we've made some progress, it isn't immediately clear how we should proceed. Here's a summary of what we know:

As far as I can tell, we now know that the broadcast_task can get stuck after HCI_LE_setAdvertisingData(value->size, value->data); as it waits on PT_WAIT_UNTIL(pt, hci_command_complete);

Also, I think we now know that this only happens while the hub is also observing. Setting #define OBSERVE_RESTART_INTERVAL 100 appears to make it reproduce a bit more quickly, but still quite a few minutes.

When it gets stuck in the above broadcast_task and we raise a timeout so it doesn't wait forever there, then stop_observe_task also gets stuck, on PT_WAIT_UNTIL(pt, device_discovery_done). This is what currently prevents the user program from ending, thus preventing shutdown and getting the fast blinking lights.

The chip does not seem to lock up entirely. We can still print the exception, for instance. I think we can even successfully stop broadcasting still (I'm looking into that now).


Some ideas:

Any ideas, @dlech ?

laurensvalk commented 5 months ago

As a minimal "workaround" I suppose we could at least avoid the user program lockup by raising a RuntimeError when this occurs. Then you could restart your program with a button press because Bluetooth resets after completing a program if it is not connected.

JJHackimoto commented 5 months ago

As a minimal "workaround" I suppose we could at least avoid the user program lockup by raising a RuntimeError when this occurs. Then you could restart your program with a button press because Bluetooth resets after completing a program if it is not connected.

That would be a good start, since it wouldn't require pulling the batteries anymore. It would be appreciated until the actual problem is found :)

I'm back from Easter so let me know if there's any other tests that needs to be done btw.

dlech commented 5 months ago

As far as I can tell, we now know that the broadcast_task can get stuck after HCI_LE_setAdvertisingData(value->size, value->data); as it waits on PT_WAIT_UNTIL(pt, hci_command_complete);

Does value->data and/or value->size have any effect on the issue? For example if we always send the same data, never changing, does the problem still happen? In the same amount of time?

Do any other factors affect the time it takes to lock up, e.g. adding a longer wait between broadcast calls?

Does it help to periodically stop and restart broadcasting like we do with observing?

then stop_observe_task also gets stuck, on PT_WAIT_UNTIL(pt, device_discovery_done).

Did you try checking the returned status after hci_command_status here to see if there is an error?

  • Could the restart observe task create some kind of race condition with set advertising data when they happen to precisely coincide? Maybe this is why it takes half an hour to reproduce. Or maybe the respective responses get mixed up on our end?

Clearly the Bluetooth firmware is buggy. But it is extremely difficult to try to guess what the actual mechanism is without being able to debug that firmware. I don't think we are doing anything "wrong". As far a race conditions go, we have the task queue that should prevent any possibility of this. The only exception to this is responses that are sent directly from the event handler. But when the hub is not connected to anything, we should not be receiving any events that require a response, so I think we can rule that out.

  • This would save half of the write tasks there.

But it would make it susceptible to the less than ideal race conditions/unqueued event response issues mentioned above.

laurensvalk commented 5 months ago

I have been using this program most of the time:

from pybricks.hubs import TechnicHub
from pybricks.parameters import Axis
from urandom import choice

# Set up all devices.
hub = TechnicHub(broadcast_channel=1, observe_channels=[2, 3])

# Initialize variables.
bool2 = 0

# The main program starts here.
while True:
    bool2 = choice([True, False])
    hub.ble.broadcast([bool2, choice(['abcdefghijkl', 'ghijklabcdef'])]) # same data size and type all the time

No other hub is broadcasting, so the observe_channels aren't actually receiving any data, but having them there seems required to reproduce.

I do have a second prime hub observing on channel 1 that sets the light to red/green based on the bool value, just so I can see that the above setup is still working. When the color is steady, this means the technic hub has stopped updating data.

Do any other factors affect the time it takes to lock up, e.g. adding a longer wait between broadcast calls?

It's hard to say definitively since each run takes a very long time.

I've been using a tight loop but I think @BertLindeman reproduced it on a 100 ms broadcast loop.

It does get easier to reproduce when shortening the observe restart time from 10 000 down to 100. Then it usually takes less than 5 minutes.

  • This would save half of the write tasks there.

But it would make it susceptible to the less than ideal race conditions/unqueued event response issues mentioned above.

Good point.

Thanks for the additional ideas for checking/testing! I'll look into that.

Since the error likelihood appears inversely correlated to the observe restart interval, it seems plausible that the BT chip could get confused if the observe restart and broadcast data update are too close together. But the fact that the issue reproduces even on a 100 ms broadcast update loop contradict this hypothesis.

~Nonetheless I'm now running a test where it will await briefly before updating the advertising data if it had just then restarted observing, and vice versa. Naturally, I can't cheat by setting the restart interval to 100 ms in this experiment as that would be too close to the actual delays, so this test will take a bit longer.~ This doesn't work around it.

dlech commented 5 months ago

Something else to check: Does reverting https://github.com/pybricks/pybricks-micropython/commit/fd8306d6cc86e261fca71999eb7a0f23ce1a3574 or https://github.com/pybricks/pybricks-micropython/commit/59cf0eaffe499cc2e1e9012c0b9f671486614e04 make a difference?

BertLindeman commented 5 months ago

I've been using a tight loop but I think @BertLindeman reproduced it on a 100 ms broadcast loop.

Correct Laurens. Both tight loop and with wait(100). These ran with fixed payload 26 bytes: 3 strings

laurensvalk commented 5 months ago

@BertLindeman do you know if the same issue exists on City Hub?

BertLindeman commented 5 months ago

😉 Not building too much on my memory...

I cannot find the result in my notes, but I think the cityhub did NOT have this problem

[EDIT] Found the note about the cityhub: "no crash" just did not write down how long, but hours anyway.

Have used this source ```python /issue_1419/issue_1419_simple_3str_city.py from pybricks.hubs import CityHub from pybricks.tools import wait from pybricks.parameters import Color # from urandom import choice transmitter = CityHub(broadcast_channel=1, observe_channels=[2, 3]) # 3 strings, so len = 3 + len(strings) # max payload length is 26 bytes. string_1 = "123456789" string_2 = "blob" string_3 = "asdfghjklm" print(len(string_1) + len(string_2) + len(string_3) + 3) while True: transmitter.light.on(Color(120, 100, 50)) # green # transmitter.ble.broadcast([choice([True, False]), choice([True, False])]) transmitter.ble.broadcast([string_1, string_2, string_3]) # transmitter.ble.broadcast([False, True]) transmitter.light.on(Color(0, 0, 50)) # white # TriggeredDistributor, TableReadyForTipp = transmitter.ble.observe(3) or [0] * 2 # transmitter.light.on(Color(0, 70, 50)) # red wait(100) ```

To be sure, I will start that test again. The cityhub is standing idle :smile:

[EDIT] in 10 minutes the cityhib got power-low.. restarted 12:45 and running

laurensvalk commented 5 months ago

Another useful thing is to try out is David's suggestion to periodically turn off broadcasting.

To do that, you could add a watch = StopWatch() to your code, and then in the loop add something like:

if watch.time() > 5000:
    hub.ble.broadcast(None) 
    watch.reset() 

It will automatically turn back on the next time you transmit data, so the rest of the code won't need to be changed.

BertLindeman commented 5 months ago

Another useful thing is to try out is David's suggestion to periodically turn off broadcasting.

To do that, you could add a watch = StopWatch() to your code, and then in the loop add something like:

if watch.time() > 5000:
    hub.ble.broadcast(None) 
    watch.reset() 

It will automatically turn back on the next time you transmit data, so the rest of the code won't need to be changed.

Technic hub on 3.5.0b1. Will try the test including the wait(100) as that seems to stop sooner. Very roughly 45 minutes to an hour. Although the tight loop varies much more in elapsed time. Might do both...

BertLindeman commented 5 months ago

Ah found the note about the cityhub: "no crash" just did not write down how long, but hours anyway.

BertLindeman commented 5 months ago

The reset indeed has in impact. But not positive.

HubB runs with a wait(100) and stops in 10 minutes. Status green in below program. And rapid blue flash.

unfold the program ```python from pybricks.hubs import TechnicHub # from pybricks.parameters import Axis from pybricks.parameters import Color from urandom import choice from pybricks.tools import wait, StopWatch from pybricks import version print(version) # Set up all devices. HUB = TechnicHub(broadcast_channel=1, observe_channels=[2, 3]) print('\nThis is HUB "' + str(HUB.system.name()) + '" as a ' + version[0]) print('\tBluetooth chip version "' + str(HUB.ble.version()) + '"') print("\tbattery voltage:\t", HUB.battery.voltage(), "mV") # Initialize variables. watch = StopWatch() bool2 = 0 # The main program starts here. while True: bool2 = choice([True, False]) HUB.light.on(Color(120, 100, 50)) # green HUB.ble.broadcast([bool2, choice(['abcdefghijkl', 'ghijklabcdef'])]) # same data size and type all the time HUB.light.on(Color(0, 0, 50)) # white # reset the bluetoothchip every 5 seconds if watch.time() > 5000: HUB.light.on(Color.RED) HUB.ble.broadcast(None) HUB.light.on(Color.MAGENTA) watch.reset() wait(100) ```

HubA runs the same program without the wait and runs already 20 minutes.

The color around the reset makes it a tiny bit easier to see the hub is still running.

[EDIT] I assume you do not get notified for each update in a post, so I will update this post from time to time with status

Status

HubB with wait stalls after (in order of appearance):

HubA no-wait test started stalls after

BertLindeman commented 5 months ago

Interesting stall on hubB (with wait(100) on the program in the previous post): Stall after 28 minutes now in RED. Previous stalls were all GREEN.

laurensvalk commented 5 months ago

On the bright side, I re-confirmed that broadcasting without observe active works great. It's been running overnight and still going.

On the not so bright side, it appears that the lockup issue can happen when observe is active, even when the observe restart "process" is disabled. This means there might be little we can do about it, because this is the BLE chip, not our firmware.

I'll have a look at how reasonable a reset would be when this occurs.

laurensvalk commented 5 months ago

And thanks @BertLindeman for your additional insights.

BertLindeman commented 5 months ago

And thanks @BertLindeman for your additional insights.

Updated status above. The cityhub ran this test, including the BT reset, overnight and now gets power-low. I think this prooves the cityhub is OK, like the primehub (still running).

laurensvalk commented 5 months ago

I'll have a look at how reasonable a reset would be when this occurs.

I don't think we could realistically do this while keeping the program running without maintaining all sorts of extra state information.

So maybe the best we can do is to raise an exception to trigger a program stop and skip Bluetooth de-init when this happens.

One problem is that it can occur in several tasks. For one task, it could look like this:

diff --git a/lib/pbio/drv/bluetooth/bluetooth_stm32_cc2640.c b/lib/pbio/drv/bluetooth/bluetooth_stm32_cc2640.c
index b5771b0a7..d1c8301fd 100644
--- a/lib/pbio/drv/bluetooth/bluetooth_stm32_cc2640.c
+++ b/lib/pbio/drv/bluetooth/bluetooth_stm32_cc2640.c
@@ -1067,6 +1067,8 @@ void pbdrv_bluetooth_peripheral_disconnect(void) {
 static PT_THREAD(broadcast_task(struct pt *pt, pbio_task_t *task)) {
     pbdrv_bluetooth_value_t *value = task->context;

+    static struct etimer broadcast_timeout;
+
     PT_BEGIN(pt);

     if (value->size > B_MAX_ADV_LEN) {
@@ -1081,7 +1083,21 @@ static PT_THREAD(broadcast_task(struct pt *pt, pbio_task_t *task)) {
     // not the command status).
     PT_WAIT_WHILE(pt, write_xfer_size);
     HCI_LE_setAdvertisingData(value->size, value->data);
-    PT_WAIT_UNTIL(pt, hci_command_complete);
+
+    // In rare cases while observing, setting the advertising data never completes
+    // and the task hangs. We cannot currently recover from this state, so we
+    // turn off Bluetooth and raise an exception on the current task in order
+    // to end the user program.
+    PROCESS_CONTEXT_BEGIN(&pbdrv_bluetooth_spi_process);
+    etimer_set(&broadcast_timeout, 2000);
+    PROCESS_CONTEXT_END(&pbdrv_bluetooth_spi_process);
+    PT_WAIT_UNTIL(pt, hci_command_complete || etimer_expired(&broadcast_timeout));
+    if (etimer_expired(&broadcast_timeout)) {
+        bluetooth_ready = false;
+        pbdrv_bluetooth_power_on(false);
+        task->status = PBIO_ERROR_IO;
+        PT_EXIT(pt);
+    }

     if (!is_broadcasting) {
         PT_WAIT_WHILE(pt, write_xfer_size);
diff --git a/pybricks/common/pb_type_ble.c b/pybricks/common/pb_type_ble.c
index 8114a1e8d..36ab73e40 100644
--- a/pybricks/common/pb_type_ble.c
+++ b/pybricks/common/pb_type_ble.c
@@ -559,6 +559,11 @@ mp_obj_t pb_type_BLE_new(mp_obj_t broadcast_channel_in, mp_obj_t observe_channel
 }

 void pb_type_BLE_cleanup(void) {
+
+    if (!pbdrv_bluetooth_is_ready()) {
+        return;
+    }
+
     static pbio_task_t stop_broadcasting_task;
     static pbio_task_t stop_observing_task;
     pbdrv_bluetooth_stop_broadcasting(&stop_broadcasting_task);
diff --git a/pybricks/iodevices/pb_type_iodevices_lwp3device.c b/pybricks/iodevices/pb_type_iodevices_lwp3device.c
index 871d243a5..cfa852fc3 100644
--- a/pybricks/iodevices/pb_type_iodevices_lwp3device.c
+++ b/pybricks/iodevices/pb_type_iodevices_lwp3device.c
@@ -321,6 +321,11 @@ STATIC void pb_lwp3device_configure_remote(void) {
 }

 void pb_type_Remote_cleanup(void) {
+
+    if (!pbdrv_bluetooth_is_ready()) {
+        return;
+    }
+
     pbdrv_bluetooth_peripheral_disconnect();

     while (pbdrv_bluetooth_is_connected(PBDRV_BLUETOOTH_CONNECTION_PERIPHERAL)) {
laurensvalk commented 5 months ago

One problem is that we'd need this timeout potentially for every wait in the start/stop broadcast/observe tasks, and potentially others like write tasks if the user also uses the remote light etc.

So maybe it is more practical to just fix forced hub shutdown instead.

laurensvalk commented 5 months ago

@dlech - why does the pb_type_Remote_cleanup use MICROPY_EVENT_POLL_HOOK but pb_type_BLE_cleanup uses MICROPY_VM_HOOK_LOOP?

Would MICROPY_EVENT_POLL_HOOK not be an issue since we call pb_type_Remote_cleanup after run_user_program completes, so outside of any nlr.

I suppose the same question applies to mp_hal_stdout_tx_flush. It uses MICROPY_EVENT_POLL_HOOK but we also call it once outside of the nlr buffer.

EDIT: I replaced these three wait loops by a single loop with only MICROPY_VM_HOOK_LOOP.

BertLindeman commented 5 months ago

Would you be interested in a program that stalls in seconds? The first test RED in 1 second, second RED in 3 seconds. Third test GREEN in 10 seconds, Fouth 3 seconds RED

Laurens program plus BT stop broadcasting after each broadcast ```python """Do a reset after each broadcast and see if this runs longer or not """ from pybricks.hubs import TechnicHub # from pybricks.parameters import Axis from pybricks.parameters import Color from urandom import choice # from pybricks.tools import wait, StopWatch from pybricks import version print(version) # Set up all devices. HUB = TechnicHub(broadcast_channel=1, observe_channels=[2, 3]) print('\nThis is HUB "' + str(HUB.system.name()) + '" as a ' + version[0]) print('\tBluetooth chip version "' + str(HUB.ble.version()) + '"') print("\tbattery voltage:\t", HUB.battery.voltage(), "mV") # Initialize variables. # watch = StopWatch() bool2 = 0 # The main program starts here. while True: bool2 = choice([True, False]) HUB.light.on(Color(120, 100, 50)) # green HUB.ble.broadcast([bool2, choice(['abcdefghijkl', 'ghijklabcdef'])]) # same data size and type all the time # reset the bluetoothchip every broadcast HUB.light.on(Color.RED) HUB.ble.broadcast(None) HUB.light.on(Color.MAGENTA) ```

[EDIT} updated BT reset to BT stop broadcasting

laurensvalk commented 5 months ago

Thank you. By the way, hub.ble.broadcast(None) does not reset the chip. It just stops broadcasting.

laurensvalk commented 5 months ago

@BertLindeman @JJHackimoto - We've added some updates to make sure that the hub can at least shutdown when this happens. By pressing the button for three seconds. No need to pull any batteries.

https://github.com/pybricks/pybricks-micropython/pull/245

The CI is getting stuck for some reason, so here is the prebuilt firmware with the shutdown fix: firmware.zip

BertLindeman commented 5 months ago

Test above in this post with "stop broadcast" after each broadcast, stops (in RED this first test) and "hold button" nicely stops the hub.

Will run other tests, that will take more time..

laurensvalk commented 5 months ago

Build links in https://github.com/pybricks/pybricks-micropython/pull/245 are fixed, so firmware for all hubs is ready for testing.

Also a bonus feature we got as part of this cleanup: You can now call .disconnect() on the Remote to disconnect it.

dlech commented 5 months ago

EDIT: I replaced these three wait loops by a single loop with only MICROPY_VM_HOOK_LOOP.

Sounds reasonable for the way we currently use these functions.

BertLindeman commented 5 months ago

Tests running on build 3360. Nice is that a stuck hub now saves the program on exit by "hold hub button"

dlech commented 5 months ago

Sounds reasonable for the way we currently use these functions.

Although it would make the stdio flush ioctl non-interuptable which could be problematic if someone tries to flush a really big buffer.

laurensvalk commented 5 months ago

Thanks for the quick review of the PR! I'll update it based on your notes.

Sounds reasonable for the way we currently use these functions.

Although it would make the stdio flush ioctl non-interuptable which could be problematic if someone tries to flush a really big buffer.

I've only changed it for the deinit - the mphal flush used in programs still uses the event poll hook. Is that what you mean?

dlech commented 5 months ago

I thought you meant you changed the implementation of mp_hal_stdout_tx_flush() itself, but that does not appear to be the case.

JJHackimoto commented 4 months ago

Can confirm this still happens using the latest beta. No need to pull the batteries anymore though :)

laurensvalk commented 4 months ago

Yeah, we haven't been able to find a fix, so just patched the shutdown issue.

BertLindeman commented 4 months ago

Spikehub hangs with fast blinking button doing no broadcasts.

The hub definition only sets the channel: hub = InventorHub(broadcast_channel=1)

If the broadcast_channel is not set, the problem does not occur.

The program is a simple combination of display animation and button press.

The complete program ```python # Hub hangs after printing 1 2 3 1 2 3 # central button press makes the hub do a rapid blue blink from pybricks.hubs import InventorHub from pybricks.parameters import Button, Icon from pybricks.tools import wait hub = InventorHub(broadcast_channel=1) # hub = InventorHub() animation = [Icon.EYE_LEFT_BLINK * i / 100 for i in [0, 100, 100, 0]] while True: print(1) hub.display.animate(animation, 10) print(2) pressed = hub.buttons.pressed() print(3) if any(pressed): pass wait(100) ```

Not sure if this issue is the correct issue. Laurens, should I move this to another or make new (bug) issue?

laurensvalk commented 4 months ago

Could you be seeing https://github.com/pybricks/support/issues/1295?

When pressing a button in your program, there is no wait in the loop so it's restarting the animation way faster than it can run, which can lead to a lockup.

BertLindeman commented 4 months ago

Could be, Laurens, but . . . added a wait(500) in the if pressed and that does not make a difference.

What makes a difference is changing the hub definition from

hub = InventorHub(broadcast_channel=1)

to

hub = InventorHub()

In the last case no problem.