pybricks / support

Pybricks support and general discussion
MIT License
107 stars 6 forks source link

[Bug] Observing data can crash the Technic Hub #1454

Open laurensvalk opened 7 months ago

laurensvalk commented 7 months ago

Describe the bug It appears that the Technic Hub can crash quickly or after some time when observing data.

This may or may not be related to the most recent fix in https://github.com/pybricks/pybricks-micropython/commit/d13ca6ffcda5131acbb59a4d8643b761a2048be4.

To reproduce

The following program appears to crash. Sometimes quickly, sometimes after some time.

I've mostly been running this without Pybricks Code connected (I was mostly programming the sending hub), but when trying it while connected it would also crash, but I don't think there was an exception. It disconnected so presumably rebooted.

from pybricks.hubs import TechnicHub
from pybricks.parameters import Axis, Direction, Port, Stop
from pybricks.pupdevices import Motor
from pybricks.robotics import DriveBase
from pybricks.tools import wait

# Set up all devices.
hub = TechnicHub(top_side=Axis.Z, front_side=Axis.X, broadcast_channel=0, observe_channels=[78])
left = Motor(Port.D, Direction.CLOCKWISE)
right = Motor(Port.C, Direction.COUNTERCLOCKWISE)
turn = Motor(Port.B, Direction.COUNTERCLOCKWISE)
drive_base = DriveBase(left, right, 48, 208)

# Initialize variables.
steer = 0
drive = 0
rotate = 0

# The main program starts here.
while True:
    drive, steer, rotate = hub.ble.observe(78) or [0] * 3
    if abs(drive) > 30 or abs(steer) > 20:
        drive_base.drive(drive, steer)
    else:
        drive_base.stop()
    if abs(rotate) > 50:
        turn.run(rotate)
    else:
        turn.stop()
    wait(10)

To be complete, here is the sending program:

from pybricks.hubs import TechnicHub
from pybricks.iodevices import XboxController
from pybricks.parameters import Axis, Button, Direction, Port, Stop
from pybricks.pupdevices import Motor
from pybricks.tools import wait

# Set up all devices.
hub = TechnicHub(top_side=Axis.Z, front_side=Axis.X, broadcast_channel=78)
boom = Motor(Port.A, Direction.CLOCKWISE)
jib = Motor(Port.D, Direction.CLOCKWISE)
hoist = Motor(Port.B, Direction.CLOCKWISE)
xbox = XboxController()

# Initialize variables.
steer = 0
drive = 0
rotate = 0
extend = 0

# The main program starts here.
while True:
    drive = xbox.joystick_left()[1] * 3
    steer = xbox.joystick_left()[0] * 2
    rotate = 10 * (xbox.triggers()[1] - xbox.triggers()[0])
    hub.ble.broadcast([drive, steer, rotate])
    extend = xbox.joystick_right()[1] * 10
    if abs(extend) > 100:
        jib.run(extend)
    else:
        jib.stop()
    hoist.run(1000 if "LB" in xbox.buttons.pressed() else (-1000 if "RB" in xbox.buttons.pressed() else 0))
    boom.run(1000 if xbox.dpad() == 1 else (-1000 if xbox.dpad() == 5 else 0))
    wait(10)
laurensvalk commented 7 months ago

This may or may not be related to the most recent fix in https://github.com/pybricks/pybricks-micropython/commit/d13ca6ffcda5131acbb59a4d8643b761a2048be4.

The reason I'm not entirely sure about this, is because unlike past endurance tests, I now also have constantly changing data, such as a value from 0 to 1000, which is sometimes packed as one byte and sometimes as two.

So it would be good to try and reproduce with just either the older build with variable data size, or the newer build with constant data size, just to be sure.

laurensvalk commented 7 months ago

Here is a simpler way to crash the observer. Sender must also be Technic Hub.

The observing Technic Hub does not crash if the sender is a Prime Hub, suggesting that sending Technic Hub is sending bad data, though it would be nice if that didn't crash the receiver.

A motor is added so we can send the angle to set the hue, as a visual cue that something goes over the air.

from pybricks.hubs import TechnicHub
from pybricks.parameters import Color
from pybricks.tools import wait

# Set up all devices.
hub = TechnicHub(observe_channels=[78])

# The main program starts here.
while True:
    drive, steer, rotate = hub.ble.observe(78) or [0] * 3
    hub.light.on(Color(rotate))
    wait(10)
from pybricks.hubs import TechnicHub
from pybricks.parameters import Port
from pybricks.pupdevices import Motor
from pybricks.tools import wait
from urandom import randint

# Set up all devices.
hub = TechnicHub(broadcast_channel=78)
motor = Motor(Port.B)

# The main program starts here.
while True:
    a = randint(-1000, 1000)
    b = randint(-1000, 1000)
    c = motor.angle()
    hub.ble.broadcast([a, b, c])
    # wait(100) # observer does not crash
    # wait(10) # observer crashes fairly soon
    wait(1) # observer crashes almost instantly
laurensvalk commented 7 months ago

The observer does seem to error gracefully in these cases, not reboot.

It's just more likely to happen when disconnected, because then observing is faster.

We get errors such as:

Traceback (most recent call last):
  File "bug_observe_crash_observe.py", line 10, in <module>
RuntimeError: received bad data
Traceback (most recent call last):
  File "bug_observe_crash_observe.py", line 12, in <module>
TypeError: can't convert str to int
(379, 692, -28)
(863, -181, b'\xe4\x01\x11\x04\x11\x124\x016\x01Y\x11\x06\x8a4\x01YB\x04\x00\x00\x00\xbf\x06\x00\x00\x97\xfe')
Traceback (most recent call last):
  File "bug_observe_crash_observe.py", line 13, in <module>
TypeError: can't convert bytes to int

Which seems to confirm the bad data theory.

A SPIKE observer will also crash the same way, which further supports the bad Technic Hub sender idea.

laurensvalk commented 7 months ago

Sometimes you get bad data which, depending on how it is used, does not lead to a crash:

(679, -279, 1.709704e-36)
(679, -279, 1.709704e-36)
(679, -279, 1.709704e-36)
(679, -279, 1.709704e-36)
(679, -279, 1.709704e-36)
(679, -279, 1.709704e-36)
(679, -279, 1.709704e-36)
(679, -279, 1.709704e-36)
(679, -279, 1.709704e-36)
(679, -279, 1.709704e-36)
(679, -279, 1.709704e-36)
(679, -279, 1.709704e-36)
(679, -279, 1.709704e-36)
(-997, 470, 13)
(-997, 470, 13)
(-997, 470, 13)
(-997, 470, 13)
(682, -789, 13)
laurensvalk commented 7 months ago

It does indeed seem to correlate with data variability:

With this as the sender, the observer does not crash:

from pybricks.hubs import TechnicHub
from pybricks.parameters import Port, Color
from pybricks.pupdevices import Motor
from pybricks.tools import wait
from urandom import randint

# Set up all devices.
hub = TechnicHub(broadcast_channel=78)
motor = Motor(Port.B)

# The main program starts here.
while True:
    a = randint(0, 127)
    b = randint(0, 127)
    c = motor.angle()
    hub.light.on(Color(c))
    hub.ble.broadcast([a, b, c])
    wait(1) # no crash

Changing the randints in the sender to:

    a = randint(0, 511)
    b = randint(0, 511)

crashes the observer right away.

laurensvalk commented 6 months ago

So, fast data updates in a tight loop seem fine, as long as the message format stays the same.

If the size changes, the data gets corrupted.

Hmm.

laurensvalk commented 6 months ago

Further simplifying sender to make the raw data trivial to read.

from pybricks.hubs import ThisHub
from pybricks.parameters import Port, Color
from pybricks.pupdevices import Motor
from pybricks.tools import wait
from urandom import randint

# Set up all devices.
hub = ThisHub(broadcast_channel=78)
motor = Motor(Port.B)

# The main program starts here.
while True:
    a = bytes([i for i in range(randint(4,5))])
    hub.ble.broadcast(["ABC", a])
    wait(1)

If we log just before sending, inHCI_LE_setAdvertisingData , all seems fine:

adv len 14:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03
adv len 15:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03 04
adv len 14:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03
adv len 15:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03 04
adv len 14:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03
adv len 14:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03
adv len 14:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03
adv len 14:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03
adv len 14:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03
adv len 15:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03 04
adv len 14:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03
adv len 15:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03 04
adv len 14:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03
adv len 14:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03
adv len 15:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03 04
adv len 14:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03
adv len 14:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03
adv len 15:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03 04
adv len 15:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03 04
adv len 14:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03
adv len 14:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03
adv len 15:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03 04
adv len 15:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03 04
adv len 15:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03 04

Observer:

from pybricks.hubs import ThisHub
from pybricks.parameters import Color
from pybricks.tools import wait

# Set up all devices.
hub = ThisHub(observe_channels=[78])

# The main program starts here.
while True:
    data = hub.ble.observe(78)
    print(data)
    wait(10)

Receiving on Prime Hub. Which is now mixed up. It only crashes near the end because it doesn't consider everything invalid on unpacking, but there are multiple things wrong here.

adv len 15:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03 04
adv len 15:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03 04
adv len 15:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03 04
adv len 15:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03 04
adv len 14:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03
adv len 15:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03 04
adv len 15:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03 04
adv len 15:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03 04
adv len 14:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03
adv len 15:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03 04
adv len 14:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03
adv len 14:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03
adv len 15:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03 04
adv len 15:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03 04
adv len 14:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03
adv len 14:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03
adv len 14:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03
adv len 14:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03
adv len 14:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03
adv len 14:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03
adv len 15:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03 04
adv len 14:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03
adv len 14:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03
adv len 15:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03 04
adv len 14:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03
adv len 14:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03
adv len 14:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03
adv len 15:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03 04
adv len 15:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03 04
adv len 14:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03
adv len 14:  0d ff 97 03 4e a3 41 42 43 c4 00 01 02 03
adv len 14:  0e ff 97 03 4e a3 41 42 43 c5 00 01 02 03
adv len 14:  0e ff 97 03 4e a3 41 42 43 c4 00 01 02 03
laurensvalk commented 6 months ago

@dlech - Bug in the CC2640 or am I missing something here?

It's also possible that mixups happen all the time, but some stale data is just not crashing on unpack. (But could give bad results for any value > 1 byte if only a few bytes are stale).

laurensvalk commented 6 months ago

It's also possible that mixups happen all the time, but some stale data is just not crashing on unpack. (But could give bad results for any value > 1 byte if only a few bytes are stale).

So yes, if we randomly transmit either of

DATA1 = bytes([1,2,3,4])
DATA2 = bytes([5,6,7,8])

Then:

adv len 11:  0a ff 97 03 4e 00 c4 01 02 03 04
adv len 11:  0a ff 97 03 4e 00 c4 05 06 07 08
adv len 11:  0a ff 97 03 4e 00 c4 01 02 03 04
adv len 11:  0a ff 97 03 4e 00 c4 01 02 03 04
adv len 11:  0a ff 97 03 4e 00 c4 05 06 07 08
adv len 11:  0a ff 97 03 4e 00 c4 01 02 03 04
adv len 11:  0a ff 97 03 4e 00 c4 01 02 03 04
adv len 11:  0a ff 97 03 4e 00 c4 05 06 07 08
adv len 11:  0a ff 97 03 4e 00 c4 05 06 07 08
adv len 11:  0a ff 97 03 4e 00 c4 05 06 07 08
adv len 11:  0a ff 97 03 4e 00 c4 05 06 07 08
adv len 11:  0a ff 97 03 4e 00 c4 05 06 07 04    <--- unexpected 5 6 7 4, should be 5 6 7 8 or 1 2 3 4
laurensvalk commented 6 months ago

Could we be overriding the chip's advertising buffer while it is reading it? How might we prevent that?

dlech commented 6 months ago

We fixed something similar already in https://github.com/pybricks/pybricks-micropython/commit/462c1a6fbc21914be58065e11340cc21c21fc942.

laurensvalk commented 6 months ago

This seems to get a little deeper. The above logs were printed over uart from within the event handler:

image

dlech commented 6 months ago

Did you use a sniffer to first verify that the data going over the air is not already corrupt?

laurensvalk commented 6 months ago

Yes, I'm thinking the problem is in the sender.

Because: spike and Technic both have this problem when observing, but only if technic is the sender. But as logged above, the stuff we write to the hci still looks fine.

Haven't used a sniffer because I don't have one yet. This does seem like good starter exercise :smile:

BertLindeman commented 6 months ago

Laurens, I ran your test programs with varying data lengths above with a TechnicHub as sender and a SpikeHub as observer. Both run Firmware v3.4.0b2. Removed the instantiation of the motor, not related I think. This setup stops quite soon, seconds I think.

Then ran your second example with DATA1 and DATA2 with the same length.

issue_1454_sender_timer.py:

from pybricks.hubs import ThisHub
from pybricks.parameters import Port, Color
from pybricks.pupdevices import Motor
from pybricks.tools import wait
from urandom import randint

# Set up all devices.
hub = ThisHub(broadcast_channel=78)
# motor = Motor(Port.B)

DATA1 = bytes([1,2,3,4])
DATA2 = bytes([5,6,7,8])

# The main program starts here.
while True:
    # a = bytes([i for i in range(randint(4,5))])
    # hub.ble.broadcast(["ABC", a])
    hub.ble.broadcast(DATA1)
    wait(1)
    hub.ble.broadcast(DATA2)
    wait(1)

issue_1454_observer_timer.py:

from pybricks.hubs import ThisHub
from pybricks.parameters import Color
from pybricks.tools import wait, StopWatch

# Set up all devices.
hub = ThisHub(observe_channels=[78])

timer = StopWatch()
DATA1 = bytes([1,2,3,4])
DATA2 = bytes([5,6,7,8])

# The main program starts here.
while True:
    data = hub.ble.observe(78)
    if data in (DATA1, DATA2, None):
        print(end=".")
    else:
        print("\nWrong data:", data)
        print("run time:", timer.time(), "mSec\n")
        raise SystemExit()
    wait(10)

This test now runs for over an hour and still running. Will let it run (hope for full batteries) and report back if the run stops.

Bert

BertLindeman commented 6 months ago

As promised.... (more or less)

My PC rebooted involuntary, so no final measurement, the hubs are still sending and observing. At least six hours in this run without a problem.

laurensvalk commented 6 months ago

Thanks Bert, good to know you're seeing the same issue :slightly_smiling_face:

laurensvalk commented 6 months ago

I think we'll have to defer this to the next release, and possibly add a note to the docs to say that it is more stable if your values are meaningful in one byte increments.

JJHackimoto commented 1 week ago

Any news on this one? I would love to see broadcasting working correctly for the upcoming lego-events this fall :)

laurensvalk commented 1 week ago

It seems that the Technic Hub can occasionally send a mix of a new and previous message.

So to work around it you can:

JJHackimoto commented 1 week ago

I've actually tried just that (but using Blocks). I have a section where there's three steps to the process, and another hub sends signals, as to which step should be done next. I've then added blocks on the receiving end to check which integer was sent (0, 1 or 2), then wait for a bit and check again. If it's still the same number, then it should continue.

This should work right? It doesn't. I've had it act up and doing the steps in the wrong order, and at wrong timing, sometimes waiting almost 10 seconds before responding at all.

There's another part of the program where a sliding carriage in my machine should go all the way back before the next action can occur. I let the hub controlling the carriage send a signal once it detects that the motor has finished its rotations, and also double checked that it has returned to position 0. Still, it sometimes sends a broadcast that everything is done even though it has clearly frozen and no movement has even begun. This results in the machine physically crashing into itself.

Do you think there will ever be a fix or should I just look for other solutions? Using mindstorms hubs are so expensive but maybe it's my only way forward.

laurensvalk commented 1 week ago

I've actually tried just that (...) a section where there's three steps to the process (..) sometimes waiting almost 10 seconds before responding at all.

That isn't really what I meant. Try to avoid bidirectional communication if you can. One way broadcasting and observing a series of integers works really well, for hours on end and up to large distances.

Do you think there will ever be a fix or should I just look for other solutions?

If you can describe your setup, maybe we can think of a way to do your setup where each hub either only sends or receives.

JJHackimoto commented 1 week ago

Pinsetter schema.pdf

Sorry if I sounded upset, I appreciate all the help I can get. :)

Alright, I tried to make a detailed but "simple" explanation of my setup. It's quite complicated but I tried to boil it down to the basics. The last page describes the interactions between the hubs. It doesn't feel possible to do this without inter communication as far as I can see sadly. Ideal would probably be to run just one program and send all commands over bluetooth to the hubs like the official powered up app, but that app has so many other problems so it's way out of the question lol. Also that would probably introduce delay issues anyways.

laurensvalk commented 1 week ago

That is a nice challenge :-)

Perhaps it is possible to start with making only some hubs unidirectional? For example by moving both sensors to one hub?

The issue we're in here has the above workarounds. For the issue about getting stuck when combining observing and advertising, maybe we can introduce a way to stop observing (like you can stop advertising) so that you can consciously choose to use which to use, when. (And then you can keep it simple by having as few as possible hubs do this)

JJHackimoto commented 1 week ago

I managed to come up with a solution for one hub to only broadcast signals. The difference is HUGE. I only had one small hickup during my 10 minute testing (Before I would have had probably 5). I don't think I can optimize it more than this, but having the possibility to stop observing upon request would probably fix it entirely if I manage to integrate it into my program. Let me know if such a feature would be possible :)