mcuee / libusb-win32

libusb-win32 project official github repo

Using async API can re-order the arrival of packets. #54

Closed dontech closed 8 months ago

dontech commented 9 months ago

Found during the investigation of: https://github.com/mcuee/libusb-win32/issues/51

@dontech Well, I quickly modified my loopback test program to handle multiple transfers at once, but it doesn't show the issue...

The difference between this simple version and my full application is that in the loopback test each transfer has a fixed constant buffer pointer, and buffer contents are verified sequentially, since all transfer callbacks happen within the same "handling thread". Given that verifying buffers is a "long" process, I suspect that all transfers are done while I am verifying the first one; I then resubmit it, verify the second, etc., so de facto there might be only one active transfer at a time.

In my full application, what I do in the callback function is put the transfer buffer pointer into a list, quickly allocate a new buffer, set the transfer buffer pointer to it, and resubmit the transfer.
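The swap-and-resubmit pattern described above can be sketched as follows. This is a toy model in Python, not the libusb API: `Transfer` and its `submit()` are hypothetical stand-ins for a libusb-1.0 transfer object and `libusb_submit_transfer`.

```python
from collections import deque

filled = deque()  # buffers queued for (slow) verification in another thread

class Transfer:
    """Toy stand-in for a libusb-1.0 transfer object."""
    def __init__(self, size):
        self.buffer = bytearray(size)

    def submit(self):
        pass  # stand-in for libusb_submit_transfer()

def on_complete(transfer):
    # Hand the filled buffer off for verification, attach a fresh buffer,
    # and resubmit immediately so the endpoint is never left idle.
    filled.append(transfer.buffer)
    transfer.buffer = bytearray(len(transfer.buffer))
    transfer.submit()

t = Transfer(4)
t.buffer[:] = b"data"       # pretend the device filled the buffer
on_complete(t)
print(bytes(filled[0]))     # b'data'  (old buffer kept for verification)
print(bytes(t.buffer))      # b'\x00\x00\x00\x00'  (fresh buffer attached)
```

The key property is that the filled buffer is retained by reference while the transfer object immediately goes back in flight with a new buffer.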

I will try to improve the loopback test to simulate this, but it might not be easy.

In the meantime: do you have any idea how your changes could create the behavior I am observing? Thanks.

dontech commented 9 months ago

Hi @sonatique

OK, I think I know what is going on. The problem here is that you rack up 10 transfers which are all directly submitted to the underlying driver (USBD).

Since the last patch, each transfer is broken into several USBD requests (like usbk). When you submit multiple requests at the same time, there is no guarantee that the reads will be done in order.

So the first transfer might do:

1: packet 1, packet 2, packet 6

And second will do:

2: packet 3, packet 4, packet 5
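The interleaving above can be reproduced with a toy model (illustration only, not driver code): each transfer is split into per-packet sub-requests, and the host controller fills whichever outstanding sub-request it services next, so concurrent transfers each receive a non-contiguous slice of the packet stream.

```python
from collections import deque

def run_schedule(service_order, n_transfers):
    """service_order: which transfer's sub-request is serviced for each
    incoming packet. Returns the packets each transfer ends up holding."""
    stream = deque(range(1, len(service_order) + 1))  # packets 1, 2, 3, ...
    received = {t: [] for t in range(1, n_transfers + 1)}
    for t in service_order:
        received[t].append(stream.popleft())
    return received

# One legal completion order: transfer 1's sub-requests catch packets 1 and 2,
# transfer 2's catch 3-5, then transfer 1's last sub-request catches 6.
out = run_schedule([1, 1, 2, 2, 2, 1], n_transfers=2)
print(out)  # {1: [1, 2, 6], 2: [3, 4, 5]}
```

No data is lost in this model; the stream is merely split out of order across the two transfer buffers, which is exactly the distinction asked about below.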

Are you sure you are missing data, and that they are not just out of order?

Thanks,

/pedro

dontech commented 9 months ago

@dontech : thanks for the explanation. No, I am not sure at all that I am missing data vs disordered. I didn't investigate more as I wanted to reproduce with the simpler "loopback test", but to no avail so far.

I guess your explanation is absolutely correct, but then this is a big problem, at least for me (and for other people using more than one concurrent transfer)... :-(

I tested with libusbK.sys and I have no issues: I can have multiple concurrent transfers, each of 2MBytes or so...

Thanks.

Do you think that there is any chance that you could fix this while retaining the possibility of having transfer >64K?

dontech commented 9 months ago

Just verified it. This is exactly what is happening.

libusb0-sys:[transfer_complete] sequence 7: 64 bytes transmitted, maximum_packet_size=64, totalLength=128
libusb0-sys:[transfer_complete] sequence 8: 64 bytes transmitted, maximum_packet_size=64, totalLength=64
libusb0-sys:[transfer_complete] sequence 7: 0 bytes transmitted, maximum_packet_size=64, totalLength=64
libusb0-sys:[transfer_complete] sequence 7: 64 bytes transmitted, maximum_packet_size=64, totalLength=64
libusb0-sys:[transfer_complete] sequence 9: 8000 bytes transmitted, maximum_packet_size=64, totalLength=8000

We cannot fix this without serializing what goes into the driver. We could put a lock on it or something.

dontech commented 9 months ago

The basic problem is that you expect the data to arrive in the same order as your read requests.

Yes, indeed. There have been several discussions on the libusb-1.0 GitHub regarding whether this is guaranteed to happen in all cases, and the conclusion was: yes.

And in fact it is the case with latest official libusb-win32 release.

Looks like I'll have to choose between having more than one transfer and having transfers >64K... Damned.

dontech commented 9 months ago

Looks like I'll have to choose between having more than one transfer and having transfers >64K... Damned.

It's not related to 64K directly. It is related to the endpoint size. In my case I reproduced it with an endpoint size of just 64 bytes.

Why do you need all those queued requests? Do they actually help performance?

dontech commented 9 months ago

@dontech

What I meant is that I have to choose either to stay with the current official release and have more than one buffer, but none larger than 64KBytes, or to use the latest patch and get an arbitrarily large buffer size, but only one at a time...

Well, I technically do not need 16 buffers; it's an old number chosen to be sure it was large enough. My recent tests with a USB3 device producing a sustained 180MBytes/s indicate that my sweet spot is 3 buffers of 2MBytes each. With fewer buffers I cannot cope with 180MBytes/s, and smaller buffers increase CPU consumption. That is why I really hope to have libusb0.sys support my ideal configuration.

dontech commented 9 months ago

@sonatique we will get there, do not worry.

What is really needed is sort of a "mode" switch. There are 2 scenarios:

1) The user wants to use the async API. When using the async API the driver would ideally schedule multiple USBD requests for efficiency, but it should only do 1 request per USBD call, to prevent data re-ordering.

2) The user is using the sync API, and the driver can issue multiple requests serially, so there is no ordering issue.

The answer is simple:

When the driver issues the next USBD request, it should check if another request has already been issued. If it has, it should not re-request another for the same transfer, to keep the order intact.
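The per-endpoint gate described above can be sketched like this (illustration only, not the actual driver source; the real driver would use interlocked operations rather than a plain flag):

```python
class Endpoint:
    """Toy per-endpoint state: at most one USBD sub-request in flight."""
    def __init__(self):
        self.in_flight = False

    def try_issue(self):
        # Before issuing the next sub-request, check whether another one
        # is already outstanding for this endpoint.
        if self.in_flight:
            return False       # don't re-request: complete the transfer
                               # short instead, keeping packet order intact
        self.in_flight = True
        return True

    def complete(self):
        self.in_flight = False  # completion frees the endpoint

ep = Endpoint()
assert ep.try_issue() is True    # first transfer claims the endpoint
assert ep.try_issue() is False   # a concurrent request must back off
ep.complete()
assert ep.try_issue() is True    # next request may now proceed
```

The trade-off, discussed later in the thread, is that backing off means transfers can complete with less data than requested, so there are more completions overall.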

sonatique commented 9 months ago

@dontech Thanks a lot for considering finding a solution.

When you're writing about "async API" vs "sync API" which API are you referring to? The one from libusb0 or libusb-1.0?

dontech commented 9 months ago

When you're writing about "async API" vs "sync API" which API are you referring to? The one from libusb0 or libusb-1.0?

Both libraries support both modes.

1) If you are requesting 1 blocking read or write, it is SYNC.

2) If you are requesting more than one read or write (without completing them immediately), it is ASYNC.

dontech commented 9 months ago

OK I fixed the problem.

So now, if another request is queued for the same endpoint, the IRP cannot be resubmitted, as that would cause the order to go bonkers, and it would not achieve anything anyway, since another request is already picking up the response.

This does however mean that if you are submitting a lot of requests, the re-queueing of IRPs is unlikely to happen, meaning that you will be using more IOCTL requests for the transfer.

Currently this is the only way I see to fix this, while guaranteeing data integrity.

dontech commented 9 months ago

@sonatique: Please test: https://github.com/mcuee/libusb-win32/releases/tag/test_release_async_order

sonatique commented 9 months ago

@dontech : yeah! I no longer observe "corruption" with my high-level application, with either 16 x 16K buffers or 3 x 2M buffers. That is very cool, thanks a lot! I will now go back to measuring performance with my other setup.

My question about Sync and Async was related to the fact that the libusb-1.0 Sync API is implemented using its Async API plus a wait. So I was concerned that libusb0.sys could never really "understand" whether the requests are conceptually Sync or Async: whatever the higher-level code does at the libusb-1.0 level, libusb0.sys would always see Async calls.
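The sync-over-async shape referred to above can be sketched in a few lines. This is a Python illustration of the pattern, not libusb code; `submit` and `handle_events` are hypothetical stand-ins for an async submit with callback and for `libusb_handle_events`.

```python
def sync_read(submit, handle_events):
    """Blocking read built from async pieces: submit one transfer, then
    pump the event loop until our completion callback has fired."""
    done = {}

    def on_complete(data):
        done["data"] = data

    submit(on_complete)          # async submit with a completion callback
    while "data" not in done:
        handle_events()          # process completions until ours arrives
    return done["data"]

# Toy event loop standing in for the library's event handling:
pending = []
def submit(cb):
    pending.append(cb)
def handle_events():
    if pending:
        pending.pop(0)(b"payload")   # deliver a completion

print(sync_read(submit, handle_events))  # b'payload'
```

This is why, from the driver's point of view, the distinguishing signal is simply how many requests are outstanding at once, not which high-level API was called.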

Anyway, you seem to have fixed the problem, which is amazing. I am not sure I understand all the theoretical performance implications, but I will test and see what I get.

sonatique commented 9 months ago

@dontech I tested for performance with my dedicated setup and I have bad news, unfortunately: CPU consumption is now higher than it ever was for more than 1 concurrent transfer.

Here are some numbers (complementing what I posted here: https://github.com/mcuee/libusb-win32/issues/45#issuecomment-1873997468): [N: number of concurrent transfers; S: buffer size of each transfer]

V1.3.0.1 (10/20/2023) snapshot libusb0.sys:
N=16, S=64K: 180MBytes/s, 16% CPU
N=3, S=512K: 180MBytes/s, 4% CPU
N=3, S=2048K: 180MBytes/s, 1.5% CPU
N=1, S=2048K: 172MBytes/s, 1.5% CPU

libusbK0.sys:
N=16, S=64K: 180MBytes/s, 17% CPU
N=3, S=512K: 178MBytes/s, 5% CPU
N=3, S=2048K: 178MBytes/s, 1.5% CPU
N=1, S=2048K: 170MBytes/s, 1.7% CPU

This PR libusb0.sys:
N=16, S=64K: 180MBytes/s, 16% CPU
N=3, S=512K: 180MBytes/s, 17% CPU <= ouch!
N=3, S=2048K: 180MBytes/s, 18% CPU <= ouch!
N=1, S=2048K: 170MBytes/s, 1.7% CPU

These results are consistently reproducible on my machine.

Basically, the benefit of using a larger buffer size is completely nullified with regard to CPU usage when using more than one concurrent transfer ;-(
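Rough arithmetic makes the pattern in the numbers above plausible: per-completion overhead is roughly fixed, so CPU load tracks completions per second, which is throughput divided by the size each completion actually delivers (illustrative calculation, not a measured driver cost model):

```python
def completions_per_second(rate_bps, completion_bytes):
    """How many transfer completions per second a given throughput implies
    when each completion delivers `completion_bytes` of data."""
    return rate_bps / completion_bytes

rate = 180 * 1024 * 1024   # 180 MBytes/s sustained, as in the tests above

print(completions_per_second(rate, 64 * 1024))     # 2880.0 per second
print(completions_per_second(rate, 2048 * 1024))   # 90.0 per second
```

If completions are capped near 64K regardless of the submitted buffer size, the host services ~2880 completions/s instead of ~90/s, which would explain why the 512K and 2048K configurations show 64K-like CPU usage.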

EDIT: whoops: I just got a BSOD when I physically unplugged my device while running the N=3, S=2048K test with this PR's libusb0.sys. Stop code: KERNEL_MODE_HEAP_CORRUPTION. I never experienced a BSOD with libusb0.sys when unplugging before.

dontech commented 9 months ago

@sonatique thanks for testing.

1) I think the high CPU usage comes from my use of spinlocks during read/write requests. I have come up with a better idea, which I will implement instead. That should fix this.

2) Not sure what that crash (BSOD) is. Are you sure this is something new? Can you narrow it down and make a test case?

Thanks,

/pedro

dontech commented 9 months ago

@sonatique I re-fixed this and also fixed up some things that were not safe on multi-core setups.

1) I have removed the spin-lock and replaced it with some simple interlocked variables that should be a lot more efficient.

2) I have tested (wow!) that patch more intensively this time. It can now run for hours with no errors.

3) It does not do magic, however. To make sure the ordering of the packets is always correct, the system now aborts the transfer (completing OK with zero data) if there is a scheduling conflict.

I am eager to hear your feedback on this. Please post new results:

dontech commented 9 months ago

New test release here: https://github.com/mcuee/libusb-win32/releases/tag/test_release_async_order2

sonatique commented 9 months ago

@dontech : thanks a lot for your hard work. Unfortunately, your latest test release 1.3.0.1-async-order-2 seems to give even higher CPU usage than the previous one for concurrent transfers:

3 x 2MB: 180MB/s, 22% CPU
1 x 2MB: 172MB/s, 1.7% CPU

As for the BSOD: I didn't get it when unplugging during the 1 x 2MB test, but I got it immediately during the 3 x 2MB test:

Bugcheck code: 0x3B (0xC0000005, 0xFFFFF8073DE3C107, 0xFFFFAF0DCCDB0B80, 0x0)
Bugcheck name: SYSTEM_SERVICE_EXCEPTION
Driver or module in which error occurred: libusb0.sys (libusb0+0x5EA4)
012924-11296-01.dmp

dontech commented 9 months ago

@sonatique I can see that USBPcap appears in your stack trace. USBPcap is notorious for causing problems.

You probably installed it along with Wireshark, I guess. Could you uninstall USBPcap and make sure it does not attach to your driver before testing?

sonatique commented 9 months ago

@dontech : Thanks for looking at it. I indeed had USBPcap installed (an old version from 2020) along with other Pcap tools. I uninstalled everything. First, I ran another unplug test with 3 concurrent transfers before rebooting after the *PCap uninstallation (note: nothing asked me to reboot), and I immediately got a BSOD: 020224-20390-01.dmp

Then, once the computer had rebooted, I ran another test, and this time I was no longer able to get a BSOD despite trying several times to unplug while running transfers.

So you probably found the cause, thanks a lot! But I am still wondering why I never had this before (while having had USBPcap installed from the very beginning). It seems to mean that you added something that now "conflicts" somehow with USBPcap. I am concerned that some people who might use a product containing this newer libusb0.sys could have USBPcap installed and therefore suffer instabilities they didn't have before...

Regarding the high CPU usage: as you expected, I guess, there is no improvement after I removed USBPcap.

dontech commented 9 months ago

Hi @sonatique

1) There could easily be a bug in USBPcap that causes this. I have had all kinds of side effects from running with USBPcap attached. It has definitely become more stable, but I would not rule this out.

2) Could you pin down which version of libusb-win32 caused this problem to start, if it was a regression?

Thanks,

/pedro

dontech commented 9 months ago

Regarding the high CPU usage: as you expected I guess: there is no improvement after I removed USBPcap..

1) Yes, this sounds plausible.

2) Did you try running the non-debug version when doing the bandwidth test? There is an awful lot of logging going on in the debug driver, which might slow things down...

dontech commented 9 months ago

@sonatique We need to quantify the problem in a way we can both see it.

Could you re-run your tests with this app? It is a modification of your original app, so it runs directly against the libusb-win32 library.

1) With this I do not really see the high CPU usage you are seeing, but I might be doing something differently.

2) Are you sure the high CPU usage is not due to the high number of completed transactions, if you are logging/doing something for each transaction? It is important to note that the latest changes I made fix the ordering issue, but they will also cause more transactions to complete, as transactions will be shorter if there is a conflict, in order to avoid re-ordering of data.

testlibusb.zip

mcuee commented 9 months ago

@dontech

Maybe it would be good to add the testlibusb.c example as an official libusb-win32 example.

The existing benchmark.c is a bit complicated and has not been updated to catch up with libusbK's kBench. https://github.com/mcuee/libusb-win32/tree/master/libusb/examples https://github.com/mcuee/libusbk/tree/master/libusbK/src/kBench

Right now there is no intermediate example between the simple bulk.c and the more complex benchmark.c.

sonatique commented 9 months ago

@dontech

Hi @sonatique

  1. There could easily be a bug in USBPcap that causes this. I have had all kinds of side effects from running with USBPcap attached. It has definitely become more stable, but I would not rule this out.
  2. Could you pin down which version of libusb-win32 caused this problem to start, if it was a regression?

Thanks,

/pedro

Hi, thanks. My point is that the only version of libusb-win32 (libusb0.sys) I have ever experienced a BSOD with is your last experimental one. I used libusb-win32 for years (official release versions up to the last one) with USBPcap installed (the last version, since 2020). And I am pretty sure that I unplugged devices violently more than once during those years. What I mean is: yes, USBPcap is somehow responsible (since removing it seems to make the issue vanish), but there is something in your last code change that made libusb0.sys "sensitive" to USBPcap in a way it was not before. I am concerned that if this code goes into the wild, many people with USBPcap installed could suddenly start having issues they didn't have before... In summary: I would prefer to re-install USBPcap and test with it, so that you could maybe eliminate the code causing the issue...

sonatique commented 9 months ago

@dontech

  1. Did you try running the non-debug version when doing the bandwidth test? There is an awful lot of logging going on in the debug driver, which might slow things down...

I only used release version as far as I can tell. My last one was this: https://github.com/mcuee/libusb-win32/releases/download/test_release_async_order2/libusb-win32-bin-1.3.0.1.zip

sonatique commented 9 months ago

@sonatique We need to quantify the problem in a way we can both see it.

Could you re-run your tests with this app? It is a modification of your original app, so it runs directly with libusb-win32 library

  1. With this I do not really see the high CPU usage you are seeing, but I might be doing something differently.
  2. Are you sure the high CPU usage is not due to the high number of completed transactions, if you are logging/doing something for each transaction? It is important to note that the latest changes I made fix the ordering issue, but they will also cause more transactions to complete, as transactions will be shorter if there is a conflict, in order to avoid re-ordering of data.

testlibusb.zip

For sure I will test with your new app, as soon as I can.

Quick question though: when you say "high number of completed transactions", I have trouble understanding how this relates to libusb-1.0 transfers.

I mean: I submit N (=3) libusb-1.0 transfers for the same IN endpoint, and I have a single callback function for all three of them. This is my only interaction with the process of submitting and receiving data. I do not have access to the transactions done by libusb0.sys; it is a black box to me, so I cannot log or do anything with what happens below libusb-1.0. I don't know how many low-level interactions are done with the hardware.

Do you mean that I will now get several callback calls for each submitted transfer? That instead of getting one libusb-1.0 callback when my first submitted transfer is done receiving 2MB, I will now get many callback calls carrying small amounts of bytes?

dontech commented 9 months ago

Hi, thanks. My point is that the only version of libusb-win32 (libusb0.sys) I have ever experienced a BSOD with is your last experimental one. I used libusb-win32 for years (official release versions up to the last one) with USBPcap installed (the last version, since 2020). And I am pretty sure that I unplugged devices violently more than once during those years. What I mean is: yes, USBPcap is somehow responsible (since removing it seems to make the issue vanish), but there is something in your last code change that made libusb0.sys "sensitive" to USBPcap in a way it was not before. I am concerned that if this code goes into the wild, many people with USBPcap installed could suddenly start having issues they didn't have before... In summary: I would prefer to re-install USBPcap and test with it, so that you could maybe eliminate the code causing the issue...

I have moved this to a new ticket: https://github.com/mcuee/libusb-win32/issues/56

dontech commented 9 months ago

Quick question though: when you say "high number of completed transactions", I have trouble understanding how this relates to libusb-1.0 transfers.

What I mean is that to fix the ordering of packets, I have to allow transfers to complete with smaller amounts of data. You will probably notice that some of your reads complete with only 64K of data or less.

To fix this I need to fix: https://github.com/mcuee/libusb-win32/issues/52
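From the application side, short completions mean the reader has to loop and accumulate until it has the data it wanted, so the per-completion overhead multiplies. A minimal sketch of that accumulation (toy Python, not libusb; `read_chunk` is a hypothetical stand-in for one completed transfer):

```python
def read_full(read_chunk, total, max_chunk):
    """Accumulate short completions until `total` bytes are collected.
    read_chunk(n) returns up to n bytes per call (one 'completion')."""
    buf = bytearray()
    completions = 0
    while len(buf) < total:
        buf += read_chunk(min(max_chunk, total - len(buf)))
        completions += 1
    return bytes(buf), completions

# A 2 MiB request delivered in 64 KiB short completions takes 32 callbacks
# instead of 1 full-buffer callback.
_, n = read_full(lambda n: b"\x00" * min(n, 64 * 1024),
                 total=2 * 1024 * 1024, max_chunk=2 * 1024 * 1024)
print(n)  # 32
```

This matches the observation below that with multiple concurrent transfers, each read completes at 64K rather than the full 2MB buffer.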

sonatique commented 8 months ago

What I mean is that to fix the ordering of packets, I have to allow transfers to complete with smaller amounts of data. You will probably notice that some of your reads complete with only 64K of data or less.

OK, got it. I added some display output to my test app in order to see this, and indeed, all read transfers (but the very first one) complete with 64K when using more than 1 concurrent transfer. When using only 1 transfer, I see completion at the maximum value (2MB). So this explains the high CPU usage, given I see the same CPU usage as when using multiple 64K transfers.

The benefit of increasing the buffer size is thus lost when using more than one buffer (basically, all results are equal to the test with N buffers of 64K, whatever the max size I am using).

dontech commented 8 months ago

The benefit of increasing the buffer size is thus lost when using more than one buffer (basically, all results are equal to the test with N buffers of 64K, whatever the max size I am using).

Yep, and I am fixing this now. I will push a fix soon that will allow each transfer to be up to 32MB for SuperSpeed. You should see a very big boost... :-)

dontech commented 8 months ago

OK here is the new version with the fix in for the max transfer size.

https://github.com/mcuee/libusb-win32/releases/tag/test_release_async_order4

dontech commented 8 months ago

@sonatique Please check that your transfer sizes go up and if it helps on CPU usage.

sonatique commented 8 months ago

@dontech Woohoo, yes, this time I can have several concurrent transfers of 2MB or more (I tested with 32MB) and low CPU usage. I verified that all transfers now complete with their full buffer size (I only tested read transfers), hence the benefit in CPU usage (when the device delivers enough data, of course).

That is really super good, thanks!

(Side note: I experienced a BSOD while removing my unit, given I still have USBPcap 1.5.0.3 installed. Not a big deal here since this issue is addressed in #56, so I am just mentioning it.)

dontech commented 8 months ago

@sonatique What are the latest numbers? CPU usage?

Let's close this one for now, so we can focus on the remaining BSOD, which is probably the last major pain point before we get this to a release stage, so you can use it in production.

sonatique commented 8 months ago

@dontech I am at around 1.5% CPU or less (I guess the error bars are quite large for such small percentages), consistently, for 3 concurrent transfers each with a buffer size of 2MB (2048KB); the transfer rate is 180MBytes/s.

Adding more concurrent transfers doesn't change anything. Going back to 1 decreases the transfer rate to ~172MB/s. 2 transfers is similar to 3.

Using a larger buffer size doesn't significantly decrease CPU usage (and does not increase the transfer rate, given I am already saturating my device), but I guess this could be application/scenario/device dependent.

I also checked for "corruption" and there is none.

I think we're all good here!

dontech commented 8 months ago

Excellent. Another one bites the dust! :-)