mik3y / usb-serial-for-android

Android USB host serial driver library for CDC, FTDI, Arduino and other devices.
MIT License
4.81k stars 1.58k forks source link

Error message "USB get_status request failed" after a few hours working fine #503

Open vegidio opened 1 year ago

vegidio commented 1 year ago

Hi, I'm developing an application that connects to a device through USB and then sends a command (the same command) over and over again in intervals of 30 seconds.

Everything seems to work as expected for some hours, sometimes almost 24 hours working without any problem, but then we suddenly start to get this error USB get_status request failed. I searched for this error message in your repo and I found some people that encountered the same error message, but two cases in particular seem somewhat similar to mine:

In this first issue the OP reported that everything was working initially but later it stopped after 2 hours; on this second issue the OP mentions that everything stopped working after 5 hours.

I find intriguing that in all theses cases, including mine, everything is working for a while, but then out of nowhere it stops working after some hours. I know this is a long shot, but do you have any idea what could be causing these problems?

Thanks!

kai-morich commented 1 year ago

Which serial device type are you using?

I suspect it strongly depends on serial device type and/or android vendor. In my home automation I replaced pl2303 with ftdi as these stopped working after some time and only power toggle made them working again

vegidio commented 1 year ago

Thanks for the quick answer. The serial device that I'm using is this one: https://ingenico.com/sites/default/files/resource-document/2022-10/OPEN-1500_datasheet_EN_OCT22.pdf

kai-morich commented 1 year ago

Which driver is used in this library?

Thaejaesh-S commented 1 year ago

Hi Kai,

I am having a similar issue as Veigidio, the device connects but after a random amount of time(time varies with how the device is connected, ie a hub or adapter) it drops a packet and then we get USB get_status request failed or ControlTransfer in test connection fails.

We are using an STM32L4 as a USB CDC ACM device into an Android device.

Thanks, Tj

jhlink commented 1 year ago

Hi everyone,

@kai-morich . After months of digging, I think I might have explanation for this issue and the cascading failures.

I'm running a similar issue to @vegidio and @Thaejaesh-S with similar device setups. Baudrate: 115200, Stop Bit 1, Parity None.

The root issue might be requesting a read bulk transfer (from device to host) when there's no data to be sent to the host.

image

Description

Using strace on an Android device, I've captured the above ioctl syscall that mConnection.bulkTransfer utilizes. File Descriptor 37 is the serial device communicating with USB host device, which connects to the USB port on the Android device.

Bulk Write Requests

I have no issues/errors sending a bulk write request, regardless of payload size.

Bulk Read Requests

For some reason the bulk read requests are broken up consistently, starting at the very first byte, then sending the rest of the payload in a subsequent transfer. Then, an additional request, which always times out, and almost always a control transfer request follows. Since the control transfer request generally succeeds, no IOException is thrown, but if the timeout isn't sufficiently long enough, say 2s, I wouldn't be surprised if the IOException would be thrown.

I believe I can reproduce the issue with the get_status request failed, however, the fact that these ETIMEDOUT errors are occurring for every bulk read request just seems like a code smell.

Lastly, I don't know the reason behind the segmentation, but I can account for every byte in the bulk read request.

https://github.com/mik3y/usb-serial-for-android/wiki/FAQ#when-reading-a-message-why-dont-all-bytes-arrive-at-the-same-time

An abatement would be to design around this requesting bulk read transfers of a determinate length using fixed, length-prefixed, or newline terminated messages. Submitting bulk transfer read requests if there's still data to be received. I don't know the logic behind when the bulk read requests are triggered, but at least I can avoid making needless read requests.

My questions are

  1. Is there any way to query the USB host or downstream USB devices if it has any data to send? I think a control transfer can be sent to determine if an endpoint is still processing the previous request, idling, or stalled.
  2. What triggers the bulkTransfer requests? Is it polled?

Next Steps

https://github.com/mik3y/usb-serial-for-android/blob/a9c835bcb07e6e49e346cc1bcd890fabac2161ba/usbSerialForAndroid/src/main/java/com/hoho/android/usbserial/driver/CommonUsbSerialPort.java#L189-L191

I'm going to test using a control transfer to get the status of the endpoint I'm using, then determine whether the transfer to should be made or not. Failing that, I'll re-implement the read method within CommonUsbSerialPort to read with a determinate length.

If anyone has any feedback, I'd greatly appreciate it.

kai-morich commented 1 year ago

When using the SerialInputOutputManager class, a background thread basically always waits here even if no data is currently available. With infinite timeout the read blocks until data is available or connection is lost, but also works with real timeout and SerialInputOutputManager repeats the call. So i doubt that it is an issue to perform read when no data is available. I recommend to use infinite timeout.

jhlink commented 1 year ago

Using an infinite timeout for reads would require an architectural change that I cannot commit to, and asynchronous writes through SerialInputOutputManager, I believe, are no longer possible.

Unless you're proposing to instantiate another SerialInputOutputManager instance to the same USBSerialPort in a separate thread or bypassing the manager completely and making a blocking write request to the raw serial port, I can't think of an acceptably performant solution where polled reads will not block writes, or in the latter case cause undefined behavior. I'd rather avoid creating more threads, if I can help it.

I don't have evidence to support the claim that there isn't an issue with bulk transfer reads when no data is available. I expect bulk transfer requests (BTR) to terminate the end of a complete transfer with a zero length packet or an ACK. If a BTR is made when there is no data to be read, I expect a NAK. Receiving an error code like "ETIMEDOUT" makes me think that the bulk transfer request resulted with a STALL or ERR response from the device.

@kai-morich , do you know of any command line tools where I can view the protocol level USB traffic that this library is using? I'd like to verify what ioctl is sending/receiving from the USB device. Unfortunately, I'm unable to see packet traffic using strace.

How does the library access the underlying serial port? The FAQ states that the /dev/ttyxxx ports are not used by the library.

kai-morich commented 1 year ago

Try with a direct (blocking) write request. If data is smaller than USB packet size you will not notice any blocking.

bulkTransfer returns -1 on various conditions, no additional details is available from Android USB API. To distinguish some cases I added this testConnection hack.

Without rooting an Android device I am not aware of any possibility to see the low level USB traffic.

There is no underlying serial port. It's all implemented on top of USB requests.

jhlink commented 1 year ago

Try with a direct (blocking) write request. If data is smaller than USB packet size you will not notice any blocking.

This is how my application software is currently writing to the USB device. Unfortunately, this doesn't provide a root cause to the failure mode.

To distinguish some cases I added this [testConnection](https://github.com/mik3y/usb-serial-for-android/blob/a9c835bcb07e6e49e346cc1bcd890fabac2161ba/usbSerialForAndroid/src/main/java/com/hoho/android/usbserial/driver/CommonUsbSerialPort.java#L194) hack. Okay. So to summarize, even though USB protocol states bulk transfer requests should be implemented in some way, the Android USB API simply doesn't translate/provide these error codes properly to the application layer. The same seems to be true for control transfers, even though control transfers are more robust (though slower) compared to bulk transfers.

Now I understand the preference for infinite timeout reads or using requestWait/queue. Because Android's lack of error propagation, we have no option but to poll on a separate thread and hope there aren't any other failures Android is hiding, and if any exceptions are thrown, somehow restart the UsbDeviceConnection to return to nominal operation.

This would explain why some individuals implemented their own ioctl function to support bulk transfers. This is tough. :(

Without rooting an Android device I am not aware of any possibility to see the low level USB traffic. Yeah. I've hit that wall today. Unless I can install some sort of special kernel module, which is its own rabbit hole rife with dead-ends, there's no way I can view USB traffic. I'll see about getting a USB packet analyzer. Unfortunately, a logic analyzer will not work for setup due to its parasitic load.

jhlink commented 1 year ago

I was wrong about ETIMEDOUT. According to the ancient tomes of Android hosted docs, it's a legit error code due to an arbitrary assigned timeout. The Host controller would experience/see no other error.

jhlink commented 1 year ago

Just a pulse update. I've got the USB protocol analyzer.

I've had to shift gears for a separate project, but I'll be back on this in the new few days or so.

Mohammedbinnazar commented 1 year ago

Hi @jhlink @kai-morich Any update on "USB get_status request failed" and "control transfer failed" as we still get these error in the latest library though after few hours and when keeping the device idle and initiate a command.

jhlink commented 1 year ago

@Mohammedbinnazar , I should be back on this later this week. Hopefully, today once I dial in the SW changes for unrelated dev.

vegidio commented 1 year ago

Which driver is used in this library?

@kai-morich I saw now that you posted a follow-up question, but I didn't respond. I apologize for that, but shortly after I opened this issue I had a serious family emergency that kept me away from work, and computers in general, for a couple of months.

In any case, I'm back now and it seems that there are more people facing the same type of problem that I am. So, to answer your question and give more hints to anyone following this issue and hopefully we can find a solution, the driver that I'm using is CdcAcmSerialDriver.

Thanks again for your assistance.

vegidio commented 1 year ago

Ok, now that I'm back investigating this I found something yesterday that might give some hints of what's happening, at least in my case.

First I would like to share a piece of my code:

  1. First I try to find the driver associated with the USB device that I'm using:
val driver: UsbSerialDriver by lazy {
    val allDrivers = UsbSerialProber.getDefaultProber().findAllDrivers(usbManager)
    if (allDrivers.isEmpty()) error("No serial USB devices!")
    // My USB device is from Ingenico
    allDrivers.first { it.device.manufacturerName.equals("ingenico", ignoreCase = true) }
}

and then I get the UsbDevice simply doing this:

val device get() = driver.device

The code above works great when I start using my app, but when it stops responding (I've put some code to detect that), I try to reconnect. However, this time the same code above to find the driver associated with my USB device doesn't find any result for manufacturerName = Ingenico.

The list allDrivers in the code above is not empty (I have 2 USB devices connected to my Android device), but the first time I run this code, the list has 2 devices (Ingenico and another one). However, when Ingenico stops responding, the same code returns only 1 device (not Ingenico).

So, is it safe to assume that whatever is causing the loss of connectivity with the device is not a problem in the source code? I mean, when I lose connectivity, I cannot even try to reconnect to the device because it no longer appears in the list of available drivers, but the other USB device is still there. It gives me the impression that my Android phone and my code are working as expected; it's the USB device that has entered an unrecoverable state somehow.

Mohammedbinnazar commented 1 year ago

Hi @jhlink Hopefully you must have started working on this project last week. Any updates on it ?

vegidio commented 1 year ago

Folks, I want to find a solution for this problem like everyone else in this thread, however let's not act as if this is a paid service and the people working on this own us anything.

I'm sure that whenever a solution is found an update will be posted here.

jhlink commented 1 year ago

@kai-morich Status update. I spent the day reproducing the failure with a USB protocol analyzer. I seem to have found a fix for the issue, but @vegidio it would be great if you could test this to prove that it works. (Hope you're family is doing okay. )

Overview

Reproduced the failure using the example project in the repository. It was modified to force the failure through heavy read/write throughput operations. An IO exception occurred within seconds, which indicated packet corruption due to concurrency issues. Using synchronized on the USBSerialPort instance within the serial write thread seems to have resolved the issue. Over 50GB of USB traffic data was collected using the USB protocol analyzer given the following:

After applying the modifier, this setup was run for 45+ minutes with no other connection loss or failures observed on the application, USB traffic, or on the device.

@vegidio , could you apply the synchronized flag on your UsbSerialPort instance in the thread that's writing to the serial port and observe if the failure appears for you?

Methodology

Attempt 1 - get_status request failure with Legacy code

Initially, I used the USB Protocol Analyzer (USBPA for short) on the legacy software I'm maintaining. I was able to reproduce the USB get_status_request issue, but the traffic indicated that during a bulk transfer of a 300 byte payload, a baud rate configuration control transfer was made; it contained the SET_LINE_CODING code found in setParameters. 2.5 seconds later, which is the read timeout I've specified, a series of SETUP request failures occurred before succeeding about 2ms later. This is the GET_STATUS request that's being sent in the test_connection function. The USBPA identified these packet failures as an invalid PID sequence because of the following:

  1. The device failed to send the NACK response after the SETUP status transaction instead of the SETUP DATA0 transaction.
  2. Missing the CSPLIT transaction since the device was communicating through a USB Hub.

Despite the invalid PID error above, the root cause for the unexpected SET_LINE_CODING control transfer is likely due to a bug or logical failure in the application, not in the library or the hardware layer.

Summary: Do not call setParameters during ongoing traffic, which should be a self-evident principle in any communication protocol. With respect to the library, it is odd why setParameters can only be set after the UsbSerialPort is opened. I'm guessing the reason is to ensure maximum device compatibility using a predefined, default set of values to open the device before changing it later. It would be nice if the "setParameters" can be called before "open," such that the desired baud rate and port configuration is used when the port is opened or if the configuration can be part of the open function header.

Forcing the developer to change port configs (if the default 9600 baud doesn't apply) after port opening seems to allow for a category of software bugs.

Attempt 2 - force failure with example code

In order to prove there is a software bug in the library, I've modified the example project in the library. I've removed everything UI related to test the core issue. A dedicated HandlerThread was created for writing data to the UsbSerialPort and handling the data received from OnNewData.

Without incorporating the synchronized modifier on the UsbSerialPort instance, I experienced a java.io.IOException: Error writing 19 bytes at offset 32 of total 51, which indicated that packet corruption was occurring due to concurrency issues.

After adding the modifier and locking inter-thread write access to the UsbSerialPort, I didn't experience this issue even after aggressively high, bidirectional throughput.

Summary: UsbSerialPort is not thread-safe. If read/write operations are placed in a separate thread, a mutex lock will be required on the port itself (not their buffers) to avoid undefined behavior caused by concurrency issues. Reviewing the example code for SimpleUsbTerminal, I've observed a similar usage of synchronized within Serial Reads/Writes as data is being received or sent across threads.

An argument can be made that a mutex lock may not be necessary if there is only one thread that handles serial writes. Given a bona fide RTOS, I'd agree, circumstantially. However, Android is not an RTOS, and the OS can kill/restart any thread at any time unless bound to a service or tied to the UI. Given that I, as the developer, cannot predict or guarantee thread termination/restarts, a mutex lock is required.

Conclusion

Preliminary tests indicate that the failure is due to concurrency issues in accessing UsbSerialPort, which is apparently not thread-safe. The issue isn't necessarily a software bug in Android's API, although greater detail in error propagation in the USB API would alleviate the need to buy a $3k+ tool to root cause USB failures.

If any developer is splitting read/write operations into dedicated threads, they should take care to incorporate mutex locks. Otherwise, if the developer is handling all read/write operations on the main thread, the application won't see failures due to concurrency issues, though there will likely be failures due to high load on the main thread.

Suggestions

Next Steps

I'll incorporate the synchronized change in legacy code, and after a ground-up redesign, I will test the overall solution to see if the get_status request will occur.

I'll wait to see if proposed change works for @vegidio .

kai-morich commented 1 year ago

Thanks for this comprehensive analysis.

'write' is internally serialized with mWriteBufferLock so you found issues with concurrent Connection.controlTransfer? Do these only happen for setParameters that might reconfigure the serial device or also for other methods like set DTR?

Concurrent read and write should still be possible. Did you face issues with a single write thread and another (blocking) read thread like SerialInputOutputManager?

vegidio commented 1 year ago

@jhlink Thank you very much for looking into this. I will test on Monday and come back with a feedback soon!

jhlink commented 1 year ago

Thanks for this comprehensive analysis.

Of course. @kai-morich

'write' is internally serialized with mWriteBufferLock so you found issues with concurrent Connection.controlTransfer? Do these only happen for setParameters that might reconfigure the serial device or also for other methods like set DTR?

Yes, it seems that way, though I don't know what the desired behavior of mWriteBufferLock is, assuming that you're talking about SerialInputOuputManager. Is it to assure USB requests are added sequentially? Then, yes. mWriteBufferLock would achieve this. Would it limit access to the USB Serial port? No, not if there were more than one thread. I don't see anywhere in the library source code where two dedicated threads for reads and writes are being created. In this scenario, the write to the UsbSerialPort (not writeAsync provided by SerialInputOutputManager) is handled separately in a dedicated thread external to SerialInputOutputManager.

Presently, I can see that if reads and writes were executed using SerialInputOutputManager, then access control is built into the flow of control; one can't run without the other completing as they alternate.

I haven't used methods like set DTR, as it seems these are dedicated pins outside of D+/D- pins. I would imagine this would mitigate any issues since data flow direction is known. As for setParameters, I would have to repeat the experiment to prove consistency since I've only tried it once.

Yes. It seems that sending the control transfer during a bulk transfer causes cascading failures.

I don't have the traffic logs with me, but IIRC the behavior is as follows.

SET_LINE_CODING --> SPLIT/SETUP/DATA0 Transaction - Sent to Device BULK_TRANSFER - 300 BYTES --> After Xth bytes sent --> The packet before this next packet fails with an invalid PID sequence. --> SPLIT/SETUP/DATA1 Transaction is sent to Host --> The rest of the bulk transfer continues until completion. BULK_TRANSFER -- Continue... (2.5seconds later) --> GET_STATUS failures.

Concurrent read and write should still be possible. Did you face issues with a single write thread and another (blocking) read thread like SerialInputOutputManager?

Do you have an example project that demonstrates that both are possible without mutexes on the UsbSerialPort? I can run the example code on Monday and collect/post the USB traffic logs for you here. I can also share the example code that I've written on Monday, and it behaves like you've said, granted with timeouts for both read and write. 2.5sec is a very long time, though admittedly increasing that timeout to 3 or 4 seconds works. Though I'm hesitant on making that jump without assurance that the timeout will not need to increase again in the future. Depending on the implemented communication protocol, the packet timeout for a communication spec may not allow for 3 or 4 seconds.

I'm unaware if the write thread was blocking the read thread. Thread access isn't clearly shown in the USB traffic logs. However, regarding the application code, there are problems with it that are unrelated to the library. It wasn't implemented without an understanding of multithread environments. It could be that there are two write threads or poor logic that directed reinitializing the dependencies for SerialInputOutputManager.

vegidio commented 1 year ago

@jhlink Ok, I did the change that you suggested and used synchronize in all places where UsbSerialPort reads or writes something.

Before this change, my app could run for around 16 hours before losing connection with the USB device, but there was a case where it ran for almost a full day (23 hours to be precise) before it lost connection. However, this happened only once; most of the time the connection is lost around the 16 hour mark.

Now, with the change that you suggested, my app is already running for 24 hours without losing connection, which is 1 hour more than the max amount of time it ever remained connected 🎉

I'm feeling carefully optimistic, but I think you found the source of the problem. I will let my app run for 24 hours more and I will report again tomorrow. 🙂

jhlink commented 1 year ago

That's wonderful @vegidio !

I'm feeling carefully optimistic, but I think you found the source of the problem. I will let my app run for 24 hours more and I will report again tomorrow. 🙂

Thank you, but we're not out of the woods yet. Would you be opposed to running it for 48 hours continuously instead of 24?

If we assume mean time to failure is approximately 23 hours, I'd like to try for 48 hours for at least two opportunities that the failure could occur again.

Ideally, if the test time were much longer, I'd be able to provide a high degree of confidence in reliability, say running it for a week straight, but I can settle for 48 hours if this isn't convenient.

vegidio commented 1 year ago

I can definitely leave my device on for much longer. However, I hate to write this, but less than 2 hours after my last update I lost connection with the USB device 😢

There are only two places in my code where I call UsbSerialPort read and write functions and I added synchronize to both places, but I will go through the code again to see if it benefits from adding synchronize in other areas. In any case, thank you very much for the support you have provided so far.

jhlink commented 1 year ago

@vegidio , I'll share my example code that forced the issue tomorrow.

Could you review it and see how your code differs from mine?

My guess is that your code is proprietary and can't be shared, which leaves us to replicate it through contrived example code.

vegidio commented 1 year ago

@jhlink Yes, you're right, the project that I'm working on is proprietary. However, the main class that makes the connection with the USB device and also read/writes data, comes from an open source Android library that I created to help me during the development of my projects.

You can find this class called UsbHelper here: https://github.com/vegidio/android-sak/blob/main/util/src/main/kotlin/io/vinicius/sak/util/UsbHelper.kt

You will notice that I have a variable port: UsbSerialPort in this class, and this variable reads/writes data in the functions read() and write(); I added the annotation @Synchronize to both functions.

But just to make sure that I was covering all possible edge cases, I also added @Synchronize to the functions connect() and disconnect() because even though they don't read or write any data in the USB device, they also access the port: UsbSerialPort variable.

Sorry, but this is the only part of my code that I can share. However, if you could share your example I will compare with the rest of my project and see if there's something else I'm missing.

kai-morich commented 1 year ago

Terminal style applications like the example at the library or Serial USB Terminal do write in the UI thread and read in a background thread created by SerialInputOutputManager.

With this pattern, read calls are done sequentially in the background thread and write contains synchronized (mWriteBufferLock). So far I did not face issues of both threads doing concurrent USB calls.

Your aditional lock on write should not have an additional effect. Do you read from different threads concurrently?

What is the actual read timeout your are using? With too small timeouts I got strange errors.

jhlink commented 1 year ago

What is the actual read timeout your are using? With too small timeouts I got strange errors. @kai-morich , 2500 ms is the read timeout.

Sorry. I should state the context for the reads and writes. I've been basing my conclusions on the library source code. After I post the example project in an hour or so, I think we can align on an implementation to replicate the failure.

@vegidio , connect does interact with the USB port. If you open the port, there are at least two control transfers being sent (as the library source code indicates), and I think one control transfer in SetParameters, which updates the port configuration (baud rate to 115200). I haven't looked into disconnect.

I don't have experience in Kotlin, but based on the general approach, do you know if you've type-aliased Synchronize to kotlin.jvm.Synchronized? Looks like the annotation has been deprecated. Seems like concurrency is handled natively in Kotlin, but I don't know to what degree.

[Edit] Yeah. No worries. I understand getting outside help on proprietary code is difficult. We can work around it. :)

vegidio commented 1 year ago

@jhlink You're right, @Synchronized is indeed deprecated on Kotlin, but only in the Common, JS and Native versions:

Kotlin All

However, if you unselect the other Kotlin platforms and leave only JVM, you will see that it's still supported there:

Kotlin JVM

There are actually two ways to use synchronize in Kotlin: we can use the @Synchronized annotation and synchronized block. I won't try to explain the difference between both, so I will just post a link of how ChatGPT explained the difference: https://chat.openai.com/share/94e04bb4-c19d-494e-9406-116cd172020b 😄

In my first attempt I tried to use @Synchronized (the annotation), but I will try to use the synchronized block now and see if I have a better control of how UsbSerialPort is used.

Besides that, after your suggestion I started to read more about how synchronized works with Kotlin Coroutines (which I use in my project) and I found an article mentioning that they don't always work well together: https://blog.danlew.net/2020/01/28/coroutines-and-java-synchronization-dont-mix/ - I will see if I get better results following the approach suggested in this article.

jhlink commented 1 year ago

@vegidio I see. Just wanted to make sure the compiler wasn't ignoring the Synchronized flag. I operate in the world of C, with occasional excursions into Python script land and begrudging dabbles in Java la-la land.

@kai-morich , Here's the example project. A stripped down version of the UsbSerialExample but with two handler threads for each read and write. To be clear, the reception of the read is handled in a separate thread. The SerialIOManager isn't initialized in a separate read thread.

If you have any suggestions on forcing the issue, I can run the tests and gather USB traffic.

notcaremath commented 1 year ago

I have a similar problem now, after calling the UsbDeviceConnection controlTransfer function, it keeps blocking despite setting timeout. This is the issueTracker I submitted https://issuetracker.google.com/issues/298700517

notcaremath commented 1 year ago

The linux kernel of the android device I am using now is 5.4.19, there is a problem, the usb driver code of linux is submitted, it seems to be caused by this reason, everyone take a look https://github.com/torvalds/linux/commit/ae8709b296d80c7f45aa1f35c0e7659ad69edce1 https://lore.kernel.org/all/20210903175312.GA468440@rowland.harvard.edu/

jhlink commented 12 months ago

Well. That's a very illuminating and useful share, @notcaremath .

Thank you.

The android os and kernel I'm using is v11 on kernel 5.4. The USB control transfer method uses this function, the function header of which says... that it shouldn't be used in an interrupt context or seems to imply that it shouldn't be used in a multithreaded context.

/**
--
  | * usb_control_msg - Builds a control urb, sends it off and waits for completion
  | * @dev: pointer to the usb device to send the message to
  | * @pipe: endpoint "pipe" to send the message to
  | * @request: USB message request value
  | * @requesttype: USB message request type value
  | * @value: USB message value
  | * @index: USB message index value
  | * @data: pointer to the data to send
  | * @size: length in bytes of the data to send
  | * @timeout: time in msecs to wait for the message to complete before timing
  | *   out (if 0 the wait is forever)
  | *
  | * Context: !in_interrupt ()
  | *
  | * This function sends a simple control message to a specified endpoint and
  | * waits for the message to complete, or timeout.
  | *
  | * Don't use this function from within an interrupt context. If you need
  | * an asynchronous message, or need to send a message from within interrupt
  | * context, use usb_submit_urb(). If a thread in your driver uses this call,
  | * make sure your disconnect() method can wait for it to complete. Since you
  | * don't have a handle on the URB used, you can't cancel the request.
  | *
  | * Return: If successful, the number of bytes transferred. Otherwise, a negative
  | * error number.
  | */

Speculating here, but this may explain why the synchronized qualifier seemed to resolve the issues I've seen. The bulk transfer requests are synchronous, and I believe performing an additional synchronous transfer, be it a control or bulk during an existing transfer, would result with undefined behavior.

I think this might be why developers that have opted to implement asynchronous requests using queue and requestWait don't see this issue. Speculating here, but guessing that queueing USBRequests would result with a call to proc_do_submiturb, which eventually leads to a call to usb_submit_urb, which is similar to what's being called here usbfs_start_wait_urb in Android 13, Kernel v5.15, admittedly with some 200 ms delays.

I think there are 4 paths ahead.

Options

  1. For any Android OS v12 with Kernel 5.10 or lower, use the synchronized qualifier on the UsbSerialPort instance to align with synchronous usage of control/bulk transfers per the Android Kernel source.
  2. For any Android OS v12 with Kernel 5.10 or lower, don't use the control/bulk transfer Android API methods and opt to use queue/requestWait instead.
  3. Find a different SOC that supports Android Kernel 5.15 and no action is needed. Unfortunately, this USB patch is only available for Android 13 and higher.
  4. Figure out how to upgrade/override the USB/core kernel module. Only certain Android OS versions support loadable kernel modules, but even then I'm not sure if USB Core is possible to update. It's unlikely that it's possible to update a kernel on an Android device without experiencing hardware compatibility issues, as indicated by the compatibility matrix here.

That patch was applied in kernel v5.15.x, which is used in Android 13 and higher. The highest kernel version available in Android 12 is 5.10.x. The highest kernel version available in Android 11 is 5.4.x.

@vegidio @kai-morich Could you share what android OS and kernel versions you're using?

@vegidio , if you're on Android 13 with kernel v5.10 or lower, could you update your kernel to v5.15.x and test with the same synchronized qualifier software you ran originally? If we assume that this issue occurs in any kernel version 5.10 or earlier that operates in a multi-threaded environment, stands to reason that this issue will be resolved.

kai-morich commented 12 months ago

in the library

As I use read with infinite timeout in my terminal App, there is typically only a single concurrent bulkTransfer or controlTransfer done by the foreground thread. I have to create a test program that does multiple blocking calls concurrently.

I use a pool of Android devices from 4.3 to 13, but never checked the kernel version, neither observed different behavior. As various of the devices run CyanogemMod or LineageOS they come with more plain Linux + Android and might behave differently than Android versions overoptimized by big vendors.

notcaremath commented 12 months ago

It is speculated that such a problem is more the unreasonable implementation of the USB protocol in the usb terminal, such as processing usb control commands in the hardware interrupt, and there is no timely response. The host is blocked.

jhlink commented 12 months ago

in the library

  • queue + requestWait is only used for read with infinite timeout,

Ahh... yes. Was there a reason the timeout for requestWait wasn't supplied? Since API lvl 26, requestWait has that option .

As I use read with infinite timeout in my terminal App, there is typically only a single concurrent bulkTransfer or controlTransfer done by the foreground thread. I have to create a test program that does multiple blocking calls concurrently.

Okay. Feel free to use the example project I posted previously.

I use a pool of Android devices from 4.3 to 13, but never checked the kernel version, neither observed different behavior. As various of the devices run CyanogemMod or LineageOS they come with more plain Linux + Android and might behave differently than Android versions overoptimized by big vendors.

Were these on emulators or on actual hardware? I don't think this issue would appear on the former. @kai-morich

@notcaremath I do not understand your comment. Are you referring to Kai's USB Terminal app or the USB/core implementation in the android kernel?

kai-morich commented 12 months ago

Since API lvl 26, requestWait has that option .

requestWait with timeout has other issues and as I anyway need the old implementation for older devices I did not switch or investigate if it got better with more recent Android versions: https://github.com/mik3y/usb-serial-for-android/blob/34e6d989fdf8d542981c0d1ad9c93027c9be7063/usbSerialForAndroid/src/main/java/com/hoho/android/usbserial/driver/CommonUsbSerialPort.java#L186

Were these on emulators or on actual hardware?

All real devices.

vegidio commented 12 months ago

Hey folks, sorry for the late response, but here is the info about my setup: I'm running Android 8.1.0, with kernel 4.4.95, so it seems to fall into the version range that can be problematic according to your theory?

Unfortunately, upgrading the Android version/kernel is not an option in my case. But I will see if I can test this application in a device with a higher Android version/kernel and check if I get different results.

Thanks!

jhlink commented 12 months ago

Hey folks, sorry for the late response, but here is the info about my setup: I'm running Android 8.1.0, with kernel 4.4.95, so it seems to fall into the version range that can be problematic according to your theory?

@vegidio No worries. Glad to hear you're still with us. Yes. I think that the pre-5.14.x kernel method of handling bulk/control transfers involve submitting requests that cannot be terminated by the kernel, that the request must be resolved with a response before the next request can be processed regardless of any provided timeout.

One possible avenue is testing and expanding the queue/requestWait for reads and writes.

Though, I'm puzzled how @kai-morich wasn't able to run into any of these issues when testing on real devices. Kai, could you provide more details about how your testing on these devices? Could you share the test plan/process for testing on these devices?

Mohammedbinnazar commented 12 months ago

@Everyone I'm aware of the substantial effort involved, and I genuinely appreciate your hard work. If you could kindly share an estimate for when we might anticipate the release of the updated and fixed Android USB serial port connection library, it would be tremendously valued. @jhlink @kai-morich @vegidio

jhlink commented 11 months ago

@Mohammedbinnazar The current proposal is to use synchronized on UsbSerialPort for read/writes and that extreme care should be taken when using this library in a multi-threaded context. My theory is that the bug stems from the application not from the library, as indicated in the USB traffic. Development and testing is still underway.

It seems the bug would not occur in Android 13 Kernel 5.14.x+ per the source code. I have no test data to support this, nor do I intend to gather it as it is out of scope for me.

The only likely release/change from this PR would be an update to the README/FAQ about best practices of the library in multi-threaded (Java) contexts using Synchronized (or at the very least a "Do Not Do" list) and maybe cautionary warnings for corresponding use in Kotlin, depending on what @vegidio discovers in his test.

Based on my investigations thus far, the library is sound and stable. Improvements could be made like anything else but not essential. As in C, I believe the library should be kept simple, small, backwards compatible, and fundamental. The onus is on the developer to understand then utilize the tool in any capacity, even if it means allowing the developer to self-inflict irreparable failure.

jhlink commented 11 months ago

@vegidio , can you tell me what USB speed your target device is operating? Is it Low/Full speed or High speed?

One way to check is connecting your device by USB to your Mac/PC and determining the speed from the Device Manager.

image

jhlink commented 9 months ago

@kai-morich .

An update. Sorry about the long haitus. Got pulled in a couple different directions.

Conclusion

Hardware Issue

The Android application is communicating with a USB device through a Usb 2.1 USB Hub. After analyzing the traffic, the hub would respond to the USB host with SPLIT-NACKs.

The USB specification is very clear about the cause of this. The Transaction Translator buffer in the USB Hub/Chip is busy/full and results with a SPLIT-NACK. Why the transaction translator is busy or fills so quickly remains to be seen as data traffic and speed is low at best.

I believe this is a hardware issue with the USB hub chip itself. For myself, I'm using a Genesis logic controller.

Resolution

Use a different USB Device with a Different USB Chip

I suggest sharing a warning in the Readme that certain Genesis USB chips are defective or have faulty logic for handling the transaction translator, and it's best to use a different USB Chip manufacturer.

Abatement

Unfortunately, the Android app does not provide a means to detect android USB hub devices. If the USB chip cannot be changed, then the following would probably work.

  1. Write a JNI library that sends a control transfer directly to the hub to reset/clear the transaction translator buffer.

Unfortunately, I'm not sure if this requires rooting the device.

Beyond that, there's nothing else that comes to mind that could reset the hub/device to recover from the filled buffer beyond power cycling the USB device, if that's even possible in software.

jhlink commented 6 months ago

Okay. I'm finally rounding out this thread. @kai-morich , I motion closing this issue. :) I'd like to put this issue to rest, please.

Abatement - Abstract

I wrote a C module using libusb and compiled a JNI library using NDK-build. I'm nearly there, but I don't recommend this approach. The effort is herculean, requires a deep understanding of low level USB protocol, USB Spec, (or at the very least a USB protocol analyzer to see USB traffic) and assumes root access to Android + ability to modify the boot.img partition.

The following is a solution to "recover" from a busy buffer aka 'get_status madness.' It doesn't resolve the root cause which is likely due to a non-standard USB-spec implementation on the USB IC but not limited to electrical issues.

If you experience these "usb get-status failed" issues, the easiest solution is to junk that USB device and find a different device with a battle-tested USB Logic implementation. Ideally, something like a certain TE USB interface/controller, where USB logic is handled through discrete state machine logic instead of error prone firmware.

If you can't do this and possess root access + ability to modify boot.img of the Android OS, then feel free to proceed.

Actions

After building the JNI module, I ran into access permission issues like many other libusb developers encountered years ago. The JNI module/Android app failed to connect to the USB device directly despite being able to connect to the device via standalone libusb C program installed and executed through ADB shell (as root).

Elevating the android app to "System" and changing the entire /dev/bus/usb path to System user friendly permissions aren't sufficient. Starting at some Android release, SELinux policies were enforced, denying unauthorized access from different contexts, like system Apps, regular apps, shell, root, etc. The SELinux policies had to be modified to grant the desired USB access to a desired app. This blog was instrumental in identifying existing policies and testing policy changes.

From there, the path is relatively straightforward. Now that the Android app can communicate directly with the USB device, a TT RESET Control transfer request, described in the USB Spec, can be sent to the USB device to recover the 'usb get_status` issue.

Conclusion

Thank you @kai-morich for your contributions and continued maintenance. I hope this investigation provided closure. It certainly has for me, accompanied with a brand new appreciation and heightened skepticism for USB boards.