solokeys / solo1-cli

Solo 1 library and CLI in Python
https://pypi.org/project/solo-python
Apache License 2.0
183 stars 69 forks source link

DFU mode: improve programming robustness. #98

Closed enrikb closed 3 years ago

enrikb commented 3 years ago

dfu.get_status() might return usb.core.USBError with errno == EPIPE, when called in quick succession. This is the case when called from dfu.block_on_state(). The next call usually works.

This change tries to catch this specific error and just re-tries in case it happens.

Without this change, the error (and programming abort) might get undetected, as click catches IOerror/EPIPE silently. This has significant potential to brick a device.

This change addresses issue #97.

nickray commented 3 years ago

Thanks for thinking about this!

I did not understand the click comment, what exactly happens in case of EPIPE without your patch applied? I'm thinking here it might be good to have a maximum number of retries, but perhaps you're saying that click swallows EPIPE completely, so this patch is in any case strictly better?

enrikb commented 3 years ago

I did not understand the click comment, what exactly happens in case of EPIPE without your patch applied?

As you can see here: https://github.com/pallets/click/blob/3984f9efce5a0d15f058e1abe1ea808c6abd243a/src/click/core.py#L944, click catches all OSError having errno == EPIPE. Effectively, it maps it to an exit code 1, but no exception output will be generated. (This makes sense for typical CLI commands, as an actual EPIPE will happen often.)

(In the version I have been using, it catches IOError, which now is an alias of OSError.)

When libusb1 returns a PIPE error, it raises USBError, which is derived from IOError (a.k.a OSError) also having errno == EPIPE.

So, when the error happens, without the patch, the programming stops immediately. However, there is no obvious feedback like a stacktrace, because this specific USBError is caught by click when handling the frequent CLI command event of writing to a closed pipe. Of course there will be no success message of the whole program/verify cycle. Unfortunately I did not know what to expect during my very first DFU programming try :(

Also, the error happens so often (4-5 times during one full flash programming cycle), that it was impossible for me to do a full flash program in DFU mode. I only managed to program the bootloader without the patch, after some retries.

I'm thinking here it might be good to have a maximum number of retries, but perhaps you're saying that click swallows EPIPE completely, so this patch is in any case strictly better?

You're right of course, some limit to avoid an endless loop would be good.

How would this be handled if s.state != state is never reached in dfu.block_on_state()? Is there a timeout or something?

enrikb commented 3 years ago

I have changed the approach to try the ctrl transfer inside get_status up to three times in case of USBError/EPIPE.

nickray commented 3 years ago

Thanks for the explanations! So this is about the something | head type of situation probably.

I'm happy to merge this as-is if you're happy :)

Regarding your original problem, I've never seen or heard of such frequent errors (we only had this kind of error in our deprecated web interface, which would interrupt flashing if users tabbed out). I wonder if it could be a bad connector, or what else could cause this.

enrikb commented 3 years ago

Thanks for the explanations! So this is about the something | head type of situation probably.

Exactly.

I'm happy to merge this as-is if you're happy :)

I've just verified on Windows 10 that the change doesn't seem to do any harm, either. Even though the programming also worked reliably without the patch! Same hardware where it is unstable on Linux (Ubuntu 20.04, Kernel 5.4.something).

Regarding your original problem, I've never seen or heard of such frequent errors (we only had this kind of error in our deprecated web interface, which would interrupt flashing if users tabbed out). I wonder if it could be a bad connector, or what else could cause this.

I see the failures on two (somewhat different) Dell notebooks on Ubuntu 20.04 reliably. It's a Solo Hacker V1.2 having a USB-C plug + C-to-A adapter. I assume the adapter is passive.

From my side you can merge if you feel comfortable to do so. Just don't blame me if bunches of Solo Hackers will be bricked in the next weeks ;-)