pybricks / support

Pybricks support and general discussion
MIT License
109 stars 6 forks source link

Move hub Bluetooth can disconnect randomly while running a program #320

Closed dlech closed 1 year ago

dlech commented 3 years ago

As reported in https://github.com/pybricks/support/issues/304#issuecomment-830472455, the Move hub Bluetooth seems to lock up occasionally somewhat similar to #306.

Symptoms are that Bluetooth becomes disconnected while running a program and pressing the hub button to stop the program works and makes the light blink (Bluetooth advertising) since the hub is now disconnected. Reconnection may be affected by #325, but that is a separate issue.

BertLindeman commented 3 years ago

I assume this item to be a better one than #304 for this post

@dlech would you share the program you run (on the movehub?) for this issue? So we can run identical programs.

On my movehub I tried the build 1020 and 1022. I did not show the watchdog info though. Both "freeze" like this:

The last time it froze at iteration 356

My current program on the movehub:

#
# source Z:\home\bert\py\pybricks\issue\issue_#304_movehub_randint.py
#
from pybricks import version
print(version)

from pybricks.pupdevices import ColorSensor
from pybricks.parameters import Port
# from urandom import randint
from pybricks.hubs import MoveHub

hub = MoveHub()

_rand = hub.battery.voltage() + hub.battery.current()  # seed
print("Battery:", str(hub.battery.voltage()) + "mV", str(hub.battery.current()) + "mA at start")

# Return a random integer N such that a <= N <= b.
def randint(a, b):
    global _rand
    _rand = 75 * _rand % 65537  # Lehmer
    return _rand * (b - a + 1) // 65537 + a

lights = ColorSensor(Port.D).lights  # MoveHub has a fixed motor on Port.B

count = 0
print("") # start at a new line
while True:
    count += 1
    print("Start iteration", count, end="")
    for i in range(50):
        lights.on([randint(0, 100) for j in range(3)])
        print(".", end="")
    print("; ended")
dlech commented 3 years ago

would you share the program you run (on the movehub?) for this issue?

I would, but it is essentially already the same as the one you have posted. 😉

  • Bluetooth drops on the code website after timeout? About 4 minutes?

Interesting, I never waited this long. So it seems like Bluetooth isn't completely stuck then.

BertLindeman commented 3 years ago

Interesting, I never waited this long. So it seems like Bluetooth isn't completely stuck then.

Retired systemsprogrammer nowadays has the opportunity to wait some time 😄 .

Do you know if there is a 4 minutes or maybe 250 seconds timeout in the bluetooth stack?

dlech commented 3 years ago

Do you know if there is a 4 minutes or maybe 250 seconds timeout in the bluetooth stack?

Probably. We don't have any timeouts like that in our code that I recall.

dlech commented 3 years ago

I've had this running for hours today and haven't had the Bluetooth quit yet. Not sure what could be different between yesterday and today. The fact that Bert was able to reproduce the problem shows there is still some problem here. But I'm stuck until we can find a more consistent way to reproduce the problem.

BertLindeman commented 3 years ago

Most of the tests I run on win10 with Microsoft Edge - Versie 90.0.818.49 (Officiële build) (64-bits). Will try tomorrow the same hub / color sensor on Linux Mint 20.1.

BertLindeman commented 3 years ago

Intermediate report. The test on Linux Mint is now running at iteration 18000 plus. I Will let it run overnight and report further about it tomorrow.

BertLindeman commented 3 years ago

The first attempt on Linux Mint with Google Chrome ended, at iteration 134,915 probably due to a power low.

So second attempt with fresh reloaded batteries:

This all seen at 9:37 Waited until 11:25 but no automatic Bluetooth disconnect occurred. Clicked the Bluetooth disconnect button and indeed the connection dropped. And the animation of the color sensor lights continues.

I had enough time to wait for a Bluetooth disconnect, as I am building the Landrover next to the Linux PC.

No idea how to qualify this behavior . Loose remark: On rare occasions, I see the printed lines miss a dot on the printed line. Both on Windows and Linux.

BertLindeman commented 3 years ago

@dlech would a hang in pybricks-micropython/tests/basic/count_forever.py on the move hub be correctly reported here? And maybe help? The printing stops rather soon: count: 148

environment Win10 Microsoft Edge Versie 90.0.818.51 ('movehub', '3.0.0b6', '7b3d1b42 on 2021-05-01')

As seen before, but really clear here, there is data missing from the print. Showing the complete print - see missing data at/near 32, 33, 69, 70

count: 0
count: 1
count: 2
count: 3
count: 4
count: 5
count: 6
count: 7
count: 8
count: 9
count: 10
count: 11
count: 12
count: 13
count: 14
count: 15
count: 16
count: 17
count: 18
count: 19
count: 20
count: 21
count: 22
count: 23
count: 24
count: 25
count: 26
count: 27
count: 28
count: 29
count: 30
count: 31
count: 3 34
count: 35
count: 36
count: 37
count: 38
count: 39
count: 40
count: 41
count: 42
count: 43
count: 44
count: 45
count: 46
count: 47
count: 48
count: 49
count: 50
count: 51
count: 52
count: 53
count: 54
count: 55
count: 56
count: 57
count: 58
count: 59
count: 60
count: 61
count: 62
count: 63
count: 64
count: 65
count: 66
count: 67
count: 68
c
 count: 71
count: 72
count: 73
count: 74
count: 75
count: 76
count: 77
count: 78
count: 79
count: 80
count: 81
count: 82
count: 83
count: 84
count: 85
count: 86
count: 87
count: 88
count: 89
count: 90
count: 91
count: 92
count: 93
count: 94
count: 95
count: 96
count: 97
count: 98
count: 99
count: 100
count: 101
count: 102
count: 103
count: 104

count 104 is the last printed data this time.

[EDIT] On Linux Mint printing stops at about 4500 and Bluetooth connection dropped at that moment. (Same program, same hub)

dlech commented 3 years ago

If it just causes the Bluetooth to hang and the hub does not reboot itself, then yes this is the right place.

The missing data is most likely caused by losses beyond our firmware. We are using GATT "notifications" which make no guarantee that the data is actually received on the other end (related: #274). We would need a Bluetooth sniffer to tell for sure though.

dlech commented 3 years ago

Is the Linux test running in a virtual machine on the same hardware as Windows?

BertLindeman commented 3 years ago

We would need a Bluetooth sniffer to tell for sure though.

Do you mean like in

Could you do a Bluetooth capture when this happens? bleak capture

The run on windows is relatively small.

To be sure I will reboot windows and re-try first.

dlech commented 3 years ago

Do you mean like in

Could you do a Bluetooth capture when this happens? bleak capture

In this case, no. It could be that the hub is sending the data over the air, but Windows is not able to receive it due to background noise or something like that.

dlech commented 3 years ago

It would be good to start a new issue for the data loss. I am curious to know how often it affects other people. I haven't really seen it myself.

BertLindeman commented 3 years ago

Is the Linux test running in a virtual machine on the same hardware as Windows?

No The Linux Mint runs as separate PC (My LEGO PC to show building instructions and such)

It would be good to start a new issue for the data loss. I am curious to know how often it affects other people. I haven't really seen it myself.

Will do.

dlech commented 3 years ago

This could be my imagination or a coincidence, but it seems that more RF interference helps to reproduce this problem. I was having trouble reproducing this problem, but by enabling advertising on 3 other hubs while the Move hub is connected seems to significantly increase the likelihood of Bluetooth stopping on the Move hub.

Also, as Bert has already observed, another distinguishing feature of this bug compared to #306 is that Bluetooth disconnects rather than just "locking up".

It is possible to press the button on the hub to stop the program and go back to advertising. The Pybricks Code is able to reconnect and read the device info characteristics, but the UART characteristics no longer seems usable until rebooting the Move hub (Pybricks Code says "timeout waiting for checksum" in the console log).

dlech commented 3 years ago

I also noticed that after this disconnection are reconnection that the Bluetooth address is sometimes set to FF:FF:FF:FF:FF:FF, so it seems that the Bluetooth chip is not being fully reset as it should be, maybe?

After reconnection, the status notifications from the Pybricks GATT characteristic are received but the UART characteristics don't seem to work.

After reconnection, pressing the hub button to start the in-flash-memory program works once. But after stopping it, it won't start again.

dlech commented 3 years ago

The reconnection issue seems to be independent of the random disconnection issue, so I have opened #325 to track the reconnection issue.

dlech commented 1 year ago

Since https://github.com/pybricks/pybricks-micropython/commit/efda5d02604c0d8e88fd50f6985f7f89d98f65c9 I am no longer able to reproduce the issue even with 4 other hubs advertising, so hopefully that means this issue has been resolved.