keyboardio / Chrysalis-Firmware-Bundle

Firmware sketches for boards supported by Chrysalis
GNU General Public License v2.0
32 stars 25 forks source link

Atreus - "GET_DESCRIPTOR" calls are timing out with latest firmware (`0.92.1+94`) #40

Closed NessDan closed 6 months ago

NessDan commented 6 months ago

Hey @obra , this is Daniel from keyboard.gg!

Sorry to bug, but I got a report from a user in my Discord that upgrading his firmware of his Atreus leads to it no longer working with my Edgeguard adapter. I decided to do a deep dive to see what exactly the issue is and if I can update my code to support your changes.

image

After diving in, I can confirm that at least the latest firmware version (0.92.1+94) has changed it's behavior and my calls to GET_DESCRIPTOR are no longer returning as expected (appearing to be timeouts possibly?) Downgrading the firmware to 0.92.0+77 fixes this issue, as videod below.

(Make sure to unmute!) Part 1 πŸŽ₯ Part 2 πŸŽ₯ Part 3 πŸŽ₯ Part 4 πŸŽ₯

I noticed a changelog update,

USB protocol Fixes for both AVR-based and GD32-based keyboards that may eliminate "communications timeout" issues when talking to Chrysalis from Taylor Yu code@argon.blue

This in my mind seems low-level enough that it could be the source of what I'm seeing. I'd be happy to share how my adapter is connecting with keyboards in case there's any wonder if it's the source of this issue, but this is also the only time I've seen this behavior pop up on any of the keyboards I've tested.

Going to ping @tlyu in this thread as well in hopes to hear what could be the problem. I think this could be related to this PR.


Update as well: When connecting to Windows, I've confirmed there is a timeout error during enumeration before the keyboard even begins to start working. Unsure if there's some deeper root cause here, but the older firmware didn't have this issue.

Older firmware (no timeouts, just responses.) Newer firmware (timeouts, with a response)
image 2024-01-31 02_34_08-atreus-edgeguard-0 92 1+94 - Total Phase Data Center v7 01 000

Attaching more videos & Beagle USB captures (each time, the "a" key was pressed repeatedly before disconnecting and stopping.)

πŸŽ₯ Timeout when plugged directly into Windows πŸŽ₯ Timout when plugged into Edgeguard

atreus-0.92.0+77-vs-0.92.1+94-tdc-captures.zip

tlyu commented 6 months ago

Thank you so much for the Beagle captures!

I'm pretty sure you've come across what's probably an ancient bug in the Arduino AVR USB core. The problem firmware release included an unrelated minor protocol conformance bugfix to the boot keyboard descriptor (correctly specifying a logical maximum of 255 instead of -1). That fix coincidentally increased the descriptor's length from 63 to 64 bytes.

The bug (if I'm right) is a failure by the AVR core to send a ZLP after sending a multiple of 64 bytes when a longer length has been requested for a device-to-host control transfer. The low-level fixes to the AVR core in the problem firmware were for correctly handling receiving ZLPs from the host, which are probably unrelated to the bug you're seeing.

The PR you linked is not yet in any published firmware snapshots, to the best of my knowledge. It will also coincidentally work around this problem by having optimized keyboard report descriptors that are shorter than 64 bytes.

As a workaround, you can always specify the correct length when requesting the report descriptor, which I believe is what Linux and macOS do. (That would be the report descriptor length given in the HID class-specific descriptor.)

tlyu commented 6 months ago

After reviewing the AVR core code, I can summarize: the descriptor size change unmasked an ancient bug in the Arduino AVR USB core where it doesn't properly send ZLPs in control transfers.

NessDan commented 6 months ago

Thank you so much for the responses and workarounds for the time being!

Would this be fixable in a future firmware update to the Atreus?

Also apologies for calling out the wrong PR, thanks for looking into this!

tlyu commented 6 months ago

This could definitely be fixed in a future Atreus firmware update. I don't know yet what form it would take, or when it would happen. @obra could decide on one of the workaround options, and/or we fix the underlying bug in the AVR core in our fork.

obra commented 6 months ago

@NessDan Can you test out this Atreus firmware build to see if it resolves the issue for you? It's a build of the 0.92.1 firmware with only @tlyu's USB ZLP fix applied. Atreus-v1.99.8-661-ge6a0d4.hex.zip

tlyu commented 6 months ago

With a Beagle, I confirm the observations of Windows timeouts, multiple resets, and eventual successful operation on Model 01 with 0.92.1+94. I also observe that Model 01 with 0.92.1+94 with my AVR core patch no longer has the long timeouts or multiple resets on Windows.

NessDan commented 6 months ago

New firmware is working for me & my customer :) Thank you both! (FYI I also captured the keyboard connecting to Windows & Edgeguard with the latest firmware! May be worth double checking if everything looks right though. I noticed one error, unsure if it's important or if it's on my end but everything works still.)

image

atreus-edgeguard-and-windows-0.92.2+100-tdc-captures.zip

If all of that looks normal, I think this issue is resolved and can be closed πŸ˜„

tlyu commented 6 months ago

Thanks for the feedback!

The OUT transaction seems to be sent by the host (the Edgeguard). It's a ZLP with a DATA0 PID, so most likely intended to be a handshake for an IN from a device-to-host control transfer. However, the most recent control transfer was host-to-device, with a correct ZLP IN handshake. What's even stranger is that the OUT comes 235ms after the last control transfer. It seems that the device correctly ignores the transfer without sending ACK, which I believe is the correct behavior, because it's not expecting such a transfer in its state machine.

That's about all I can conclude without a lot more details about your implementation.