Open projectgus opened 5 months ago
I think this is the bug described by @robert-hh here: https://github.com/micropython/micropython/pull/15158#issuecomment-2168123436
That's a surprisingly detailed find thanks for investigating @projectgus !
I haven't confirmed but also suspected I may have introduced a bug in tinyusb.
In my recent change I added the option to disable clearing the tx buffer on connection https://github.com/hathach/tinyusb/blob/007a8bd46dce4020a2ef972749dacb9d80428371/src/class/cdc/cdc_device.c#L298
I now see the buffer and the head/tail pointers is statically defined as part of the struct https://github.com/hathach/tinyusb/blob/007a8bd46dce4020a2ef972749dacb9d80428371/src/class/cdc/cdc_device.c#L64
This isn't initialised with any kind of = {0}
so there's a chance it's starting up with junk in the buffer that may have previously cleared by the function I've disabled above.
I haven't confirmed but also suspected I may have introduced a bug in tinyusb.
I think that part should be OK, unless I'm missing a quirk this struct is always a member in the static array _cdcd_itf
and as a static array that's going to be in .bss whether or not it's explicitly initialised to zero. I just double checked rp2 port and it's definitely .bss there.
I think all of the examples Robert posted in the thread are jumbled up parts of things printed to the console during startup and echoed back. The bit where it printed (and tried to executed) ysh import *
looks like it comes from the filesystem, but it doesn't - upysh prints a bunch of help text on import including the lines upysh is intended to be imported using:\n from upysh import *
. The race condition just happened to echo the second part of that line back to the REPL, which then tried to execute it.
I think the fix may be pretty simple here, to make cdc_connected_flush_delay apply to all flushes.
Ah, the fix is annoyingly fiddly... because we do want to write into the TinyUSB FIFO, so we don't lose any buffered data. But we don't want the FIFO to flush until a short while after the port is opened, and TinyUSB will (reasonably) try to flush the FIFO whenever it's full.
@robert-hh are you able to replicate the scenario where the device connects but repl appears to be locked up even on unplug / replug of usb?
If so, can you try hitting ctrl-d in that scenario - I'm wondering if this weird echo might be tripping the repl into raw mode which then looks silent / locked up until explicitly exited.
@robert-hh are you able to replicate the scenario where the device connects but repl appears to be locked up even on unplug / replug of usb?
It seems locked, but on a re-connect it works. Even switching the baud rate helps, because that seems to re-initialize the USB connection. Un-plug and re-plug has the same effect as hard-reset or power cycle. Edit: Neither Ctr-B, Ctrl-C or Ctr-D works in the locked state.
Ah so I had linked to robert's first comment in my comment above, but that's not quite right - I wasn't able to reproduce anything from just the code in that comment, I reproduced the other weird behaviour from the code in the later comments that imported upysh
, etc. Have edited my comment above.
@robert-hh do you still see the hang if you build with this PR? If yes, is only the test code in your first comment enough to repro for you, or do you need some of the additional code as well?
I see that behavior with the code from PR #15158. And no, it's not just the code from my first comment. Actually it does not seem to be related to a specific code, as I detailed later. The whole thing seems to be a combination of code that outputs text and delayed timing, like you found. So it made a difference if more or fewer files are in the file system, if code was imported that accesses files. I could get the issue with print statements and delays, like you did. The code that caused the issue the first time just delayed (waiting for en external device to respond) and printed (by raising an error). Without that delay (external device responds) all worked fine.
Besides that there is another USB issue I have since long with SAMD devices at an external USB hub. But then the SAMDs do not register at all or only after about 40 seconds. And they work fine when connected directly to the PC. This issue persists for a few days and then it disappears for months. It seems not related to the MP codebase.
If I revert PR #14485 8809ae716 shared/tinyusb: Buffer startup CDC data to send to host on connection
then all is fine.
To be clear: the phenomenon happens as well when using the actual master branch. The main.py is:
from upysh import *
from pye import pye
from machine import Pin
import cs1237
power=Pin(13, Pin.OUT, value=1)
cs=cs1237.CS1237(Pin(15), Pin(14))
With cs1237.py being a driver for the cs1237 ADC. If that one is NOT connected, the driver will raise an issue after about 500ms. If the driver is properly connected, all works fine. I get similar lock-up states with that set-up on a Teensy 4.1, but way rare. When locked-up, e.g. changing the USB baud rate in Picocom using Ctrl-A Ctrl-D unlocks USB.
Edit: upysh.py and pye.py are frozen into the firmware.
The main.py you've posted here is what I used to repro this initially (I didn't need to freeze anything, but that may or may not be relevant as I agree it seems timing dependent).
So, have you tried to reproduce with the fix in this PR? Is there any difference?
So, have you tried to reproduce with the fix in this PR? Is there any difference?
Um, sorry, I'm obviously tired as there is no PR and there is no fix for this yet. Oh dear!
There is another snippet of reproducer in the issue text at the top. I expect you'll be able to reproduce similar but maybe not the same problems with it, because I think it all depends on what random characters get echoed back to the board.
Not sure how best we can fix this without patching TinyUSB, but maybe I can put up a PR that does patch TinyUSB and we can at least confirm which problem(s) that makes go away.
Thanks for the feedback @robert-hh - are you using this from Linux also? If you are, it sounds to me like what you're seeing is broadly the same as what @projectgus is describing, though there might be some secondary issues too.
The repl history noise you saw could possibly be this same stdio echo / reflect bug- if that repeated bit of stuff reflected back with a \n on the end it might have been executed, erroring, echo, reflected back, re-exectuted back and forth a number of times in the brief reflection window.
@andrewleech Yes, it looks like the same issue. And once it's understood there will be a fix.
Pyserial: echo is turned off here in the _reconfigure_port(): https://github.com/pyserial/pyserial/blob/7aeea35429d15f3eefed10bbb659674638903e3a/serial/serialposix.py#L413
It looks like DTR is set here, after _reconfigure_port() https://github.com/pyserial/pyserial/blob/7aeea35429d15f3eefed10bbb659674638903e3a/serial/serialposix.py#L343
Update: Urgh, yeah Linux always asserts DTR automatically whether a serial port is opened (for obscure legacy reasons presumably). There are lots of posts about this online and the havoc it often causes to embedded devices. The only "fix" seems to be to patch / recompile the individual device drivers being used... eg. https://unix.stackexchange.com/a/543527l
Maybe we need to look at adding a new interface into tinyusb to prevent it sending anything until we say so (after our SOF based delay).
It might be worth raising this issue over at tinyusb as in theory this issue could occur even without my buffer change; all it needs is enough data being written when the port is just opened (DTR asserted) to trigger a tx flush before the host has time to disable local echo. They might be open to moving the SOF based delay before "is connected" gets flagged into tinyusb itself. SOF is already being used for "scheduling" resume behaviour on some drivers so also using it to schedule "is connected" doesn't feel like much of a stretch.
Port, board and/or hardware
rp2 port, PICO and WeACTStudio boards
MicroPython version
MicroPython v1.24.0-preview.39.g411d66586c on 2024-06-18; WeAct Studio RP2040 with RP2040 (current master)
Reproduction
Expected behaviour
Since PR #14485 merged, expected output is something like:
Observed behaviour
This is a race condition so it depends on timing, but on Linux the output sometimes shows the REPL behaving as if some random input taken from the earlier output has been sent to it, i.e.
If the output printed earlier from main.py includes valid Python statements, these might even be executed...
Additional Information
The on-the-wire behaviour can be seen in usbmon capture output where the USB captures
Output of
strace -tt mpremote a0
shows the last part of the puzzle:The above shows:
ECHO|ECHOE|ECHOK|IEXTEN|ECHOCTL|ECHOKE
).I think the necessary conditions to trigger this are:
cdc_connected_flush_delay
) that means output isn't sent to the host immediately after opening, instead it waits until after the 1ms race window is finished. However, if output is still actively being printed then the device may choose to flush early and triggers the bug.Suggested fix
I think the fix may be pretty simple here, to make
cdc_connected_flush_delay
apply to all flushes.However this behaviour is odd and hard to see, so I wanted to document it here in case it comes up again in another context.
Code of Conduct
Yes, I agree