micropython / micropython

MicroPython - a lean and efficient Python implementation for microcontrollers and constrained systems
https://micropython.org
Other
18.78k stars 7.52k forks source link

rp2: Nonsense REPL input on Linux due to tty serial input echo race #15298

Open projectgus opened 1 month ago

projectgus commented 1 month ago

Port, board and/or hardware

rp2 port, PICO and WeACTStudio boards

MicroPython version

MicroPython v1.24.0-preview.39.g411d66586c on 2024-06-18; WeAct Studio RP2040 with RP2040 (current master)

Reproduction

  1. Copy this code block to main.py on board:
from machine import Pin
import time
led = Pin(Pin.board.LED, Pin.OUT)
for i in range(100):
    print("abcdefghijklmnopqrstuvwxyz")
time.sleep_ms(100)
led(1)  # connect while LED is on to trigger bug
for i in range(100):
    print("ABC ", i)
    time.sleep_ms(50)
led(0)
  1. Connect the board to USB or hard reset it.
  2. While the LED is on, connect with "mpremote a0" or similar.

Expected behaviour

Since PR #14485 merged, expected output is something like:

[... similar lines cut from here ...]
ABC  96
ABC  97
ABC  98
ABC  99
MicroPython v1.24.0-preview.39.g411d66586c on 2024-06-18; WeAct Studio RP2040 with RP2040
Type "help()" for more information.
>>>

Observed behaviour

This is a race condition so it depends on timing, but on Linux the output sometimes shows the REPL behaving as if some random input taken from the earlier output has been sent to it, i.e.

[... similar lines cut from here ...]
ABC  96
ABC  97
ABC  98
ABC  99
MicroPython v1.24.0-preview.39.g411d66586c on 2024-06-18; WeAct Studio RP2040 with RP2040
Type "help()" for more information.
>>> stuvwxyz
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'stuvwxyz' isn't defined
>>> 
>>> abcdefg

If the output printed earlier from main.py includes valid Python statements, these might even be executed...

Additional Information

The on-the-wire behaviour can be seen in usbmon capture output where the USB captures

  1. The host sets DTR, RTS via USB-CDC control transfers.
  2. The rp2 sends all of the buffered data to the host (correctly)
  3. The host unexpectedly echoes back some of the data to the rp2, one character at a time.
  4. If step 3 includes a newline then the device might try to evaluate it in the REPL, and returns an error.

Output of strace -tt mpremote a0 shows the last part of the puzzle:

12:37:58.522034 openat(AT_FDCWD, "/dev/ttyACM0", O_RDWR|O_NOCTTY|O_NONBLOCK|O_CLOEXEC) = 3
12:37:58.522533 flock(3, LOCK_EX|LOCK_NB) = 0
12:37:58.522733 ioctl(3, TCGETS, {c_iflag=ICRNL|IXON, c_oflag=NL0|CR0|TAB0|BS0|VT0|FF0|OPOST|ONLCR, c_cflag=B9600|CS8|CREAD|HUPCL|CLOCAL, c_lflag=ISIG|ICANON|ECHO|ECHOE|ECHOK|IEXTEN|ECHOCTL|ECHOKE, ...}) = 0
12:37:58.522988 ioctl(3, TCGETS, {c_iflag=ICRNL|IXON, c_oflag=NL0|CR0|TAB0|BS0|VT0|FF0|OPOST|ONLCR, c_cflag=B9600|CS8|CREAD|HUPCL|CLOCAL, c_lflag=ISIG|ICANON|ECHO|ECHOE|ECHOK|IEXTEN|ECHOCTL|ECHOKE, ...}) = 0
12:37:58.523032 ioctl(3, SNDCTL_TMR_START or TCSETS, {c_iflag=, c_oflag=NL0|CR0|TAB0|BS0|VT0|FF0|, c_cflag=B115200|CS8|CREAD|HUPCL|CLOCAL, c_lflag=, ...}) = 0
12:37:58.523367 ioctl(3, TIOCMBIS, [TIOCM_DTR]) = 0
12:37:58.523480 ioctl(3, TIOCMBIS, [TIOCM_RTS]) = 0
12:37:58.523518 ioctl(3, TCFLSH, TCIFLUSH) = 0

The above shows:

I think the necessary conditions to trigger this are:

Suggested fix

I think the fix may be pretty simple here, to make cdc_connected_flush_delay apply to all flushes.

However this behaviour is odd and hard to see, so I wanted to document it here in case it comes up again in another context.

Code of Conduct

Yes, I agree

projectgus commented 1 month ago

I think this is the bug described by @robert-hh here: https://github.com/micropython/micropython/pull/15158#issuecomment-2168123436

andrewleech commented 1 month ago

That's a surprisingly detailed find thanks for investigating @projectgus !

I haven't confirmed but also suspected I may have introduced a bug in tinyusb.

In my recent change I added the option to disable clearing the tx buffer on connection https://github.com/hathach/tinyusb/blob/007a8bd46dce4020a2ef972749dacb9d80428371/src/class/cdc/cdc_device.c#L298

I now see the buffer and the head/tail pointers is statically defined as part of the struct https://github.com/hathach/tinyusb/blob/007a8bd46dce4020a2ef972749dacb9d80428371/src/class/cdc/cdc_device.c#L64

This isn't initialised with any kind of = {0} so there's a chance it's starting up with junk in the buffer that may have previously cleared by the function I've disabled above.

projectgus commented 1 month ago

I haven't confirmed but also suspected I may have introduced a bug in tinyusb.

I think that part should be OK, unless I'm missing a quirk this struct is always a member in the static array _cdcd_itf and as a static array that's going to be in .bss whether or not it's explicitly initialised to zero. I just double checked rp2 port and it's definitely .bss there.

I think all of the examples Robert posted in the thread are jumbled up parts of things printed to the console during startup and echoed back. The bit where it printed (and tried to executed) ysh import * looks like it comes from the filesystem, but it doesn't - upysh prints a bunch of help text on import including the lines upysh is intended to be imported using:\n from upysh import *. The race condition just happened to echo the second part of that line back to the REPL, which then tried to execute it.

projectgus commented 1 month ago

I think the fix may be pretty simple here, to make cdc_connected_flush_delay apply to all flushes.

Ah, the fix is annoyingly fiddly... because we do want to write into the TinyUSB FIFO, so we don't lose any buffered data. But we don't want the FIFO to flush until a short while after the port is opened, and TinyUSB will (reasonably) try to flush the FIFO whenever it's full.

andrewleech commented 1 month ago

@robert-hh are you able to replicate the scenario where the device connects but repl appears to be locked up even on unplug / replug of usb?

If so, can you try hitting ctrl-d in that scenario - I'm wondering if this weird echo might be tripping the repl into raw mode which then looks silent / locked up until explicitly exited.

robert-hh commented 1 month ago

@robert-hh are you able to replicate the scenario where the device connects but repl appears to be locked up even on unplug / replug of usb?

It seems locked, but on a re-connect it works. Even switching the baud rate helps, because that seems to re-initialize the USB connection. Un-plug and re-plug has the same effect as hard-reset or power cycle. Edit: Neither Ctr-B, Ctrl-C or Ctr-D works in the locked state.

projectgus commented 1 month ago

Ah so I had linked to robert's first comment in my comment above, but that's not quite right - I wasn't able to reproduce anything from just the code in that comment, I reproduced the other weird behaviour from the code in the later comments that imported upysh, etc. Have edited my comment above.

@robert-hh do you still see the hang if you build with this PR? If yes, is only the test code in your first comment enough to repro for you, or do you need some of the additional code as well?

robert-hh commented 1 month ago

I see that behavior with the code from PR #15158. And no, it's not just the code from my first comment. Actually it does not seem to be related to a specific code, as I detailed later. The whole thing seems to be a combination of code that outputs text and delayed timing, like you found. So it made a difference if more or fewer files are in the file system, if code was imported that accesses files. I could get the issue with print statements and delays, like you did. The code that caused the issue the first time just delayed (waiting for en external device to respond) and printed (by raising an error). Without that delay (external device responds) all worked fine.

Besides that there is another USB issue I have since long with SAMD devices at an external USB hub. But then the SAMDs do not register at all or only after about 40 seconds. And they work fine when connected directly to the PC. This issue persists for a few days and then it disappears for months. It seems not related to the MP codebase.

robert-hh commented 1 month ago

If I revert PR #14485 8809ae716 shared/tinyusb: Buffer startup CDC data to send to host on connection then all is fine.

robert-hh commented 1 month ago

To be clear: the phenomenon happens as well when using the actual master branch. The main.py is:

from upysh import *
from pye import pye

from machine import Pin
import cs1237
power=Pin(13, Pin.OUT, value=1)
cs=cs1237.CS1237(Pin(15), Pin(14))

With cs1237.py being a driver for the cs1237 ADC. If that one is NOT connected, the driver will raise an issue after about 500ms. If the driver is properly connected, all works fine. I get similar lock-up states with that set-up on a Teensy 4.1, but way rare. When locked-up, e.g. changing the USB baud rate in Picocom using Ctrl-A Ctrl-D unlocks USB.

Edit: upysh.py and pye.py are frozen into the firmware.

projectgus commented 1 month ago

The main.py you've posted here is what I used to repro this initially (I didn't need to freeze anything, but that may or may not be relevant as I agree it seems timing dependent).

So, have you tried to reproduce with the fix in this PR? Is there any difference?

projectgus commented 1 month ago

So, have you tried to reproduce with the fix in this PR? Is there any difference?

Um, sorry, I'm obviously tired as there is no PR and there is no fix for this yet. Oh dear!

There is another snippet of reproducer in the issue text at the top. I expect you'll be able to reproduce similar but maybe not the same problems with it, because I think it all depends on what random characters get echoed back to the board.

Not sure how best we can fix this without patching TinyUSB, but maybe I can put up a PR that does patch TinyUSB and we can at least confirm which problem(s) that makes go away.

andrewleech commented 1 month ago

Thanks for the feedback @robert-hh - are you using this from Linux also? If you are, it sounds to me like what you're seeing is broadly the same as what @projectgus is describing, though there might be some secondary issues too.

The repl history noise you saw could possibly be this same stdio echo / reflect bug- if that repeated bit of stuff reflected back with a \n on the end it might have been executed, erroring, echo, reflected back, re-exectuted back and forth a number of times in the brief reflection window.

robert-hh commented 1 month ago

@andrewleech Yes, it looks like the same issue. And once it's understood there will be a fix.

andrewleech commented 1 month ago

Pyserial: echo is turned off here in the _reconfigure_port(): https://github.com/pyserial/pyserial/blob/7aeea35429d15f3eefed10bbb659674638903e3a/serial/serialposix.py#L413

It looks like DTR is set here, after _reconfigure_port() https://github.com/pyserial/pyserial/blob/7aeea35429d15f3eefed10bbb659674638903e3a/serial/serialposix.py#L343

Update: Urgh, yeah Linux always asserts DTR automatically whether a serial port is opened (for obscure legacy reasons presumably). There are lots of posts about this online and the havoc it often causes to embedded devices. The only "fix" seems to be to patch / recompile the individual device drivers being used... eg. https://unix.stackexchange.com/a/543527l

Maybe we need to look at adding a new interface into tinyusb to prevent it sending anything until we say so (after our SOF based delay).

andrewleech commented 4 weeks ago

It might be worth raising this issue over at tinyusb as in theory this issue could occur even without my buffer change; all it needs is enough data being written when the port is just opened (DTR asserted) to trigger a tx flush before the host has time to disable local echo. They might be open to moving the SOF based delay before "is connected" gets flagged into tinyusb itself. SOF is already being used for "scheduling" resume behaviour on some drivers so also using it to schedule "is connected" doesn't feel like much of a stretch.