kevinmehall / nusb

A new pure-Rust library for cross-platform low-level access to USB devices.
Apache License 2.0

Use RAW_IO on Windows for bulk and interrupt IN endpoints #6

Open kevinmehall opened 10 months ago

kevinmehall commented 10 months ago

Windows serializes IN transfers by default, negating the usefulness of Queue. RAW_IO disables this.

Libusb is adding an option for RAW_IO but decided not to turn it on by default, for fear of breaking existing code that makes requests that are not a multiple of maxPacketSize. nusb has no such concern: it already documents that IN transfers must be a multiple of maxPacketSize, which is a good idea anyway to avoid errors on overflow. I think RAW_IO should just be turned on unconditionally; not doing so simply fails to provide the streaming behavior that users would expect.
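
To make the streaming pattern concrete, here is a rough usage sketch against the current Queue API (bulk_in_queue / RequestBuffer / next_complete; method names are as I understand the current docs and may differ, and the device IDs, endpoint address, and packet size are placeholders):

```rust
// Rough usage sketch: keep several maxPacketSize-multiple IN transfers pending
// on a Queue. Without RAW_IO, WinUSB serializes these reads, so the extra
// pending transfers buy nothing on Windows; with RAW_IO they stream back-to-back.
use futures_lite::future::block_on;
use nusb::transfer::RequestBuffer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let di = nusb::list_devices()?
        .find(|d| d.vendor_id() == 0x1234 && d.product_id() == 0x5678) // placeholder IDs
        .ok_or("device not found")?;
    let device = di.open()?;
    let interface = device.claim_interface(0)?;

    const MAX_PACKET_SIZE: usize = 512; // high-speed bulk; read the descriptor in real code
    const TRANSFER_LEN: usize = MAX_PACKET_SIZE * 8; // must remain a multiple of maxPacketSize

    let mut queue = interface.bulk_in_queue(0x81);
    for _ in 0..4 {
        queue.submit(RequestBuffer::new(TRANSFER_LEN));
    }

    loop {
        let completion = block_on(queue.next_complete());
        completion.status?;
        println!("got {} bytes", completion.data.len()); // a short packet completes early
        // Recycle the buffer and keep the queue full.
        queue.submit(RequestBuffer::reuse(completion.data, TRANSFER_LEN));
    }
}
```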

martinling commented 10 months ago

I'm the author of the libusb RAW_IO support and have been running into this issue on different projects for many years. Everyone's instinct is always to try to abstract the issue away, but it's trickier than it seems.

I don't think your approach of requiring all inbound bulk & interrupt transfers to be a multiple of maxPacketSize is viable if you want the library to be usable for any and all devices. I think the maxPacketSize limitation should be removed from the bulk_in and interrupt_in methods.

Sometimes you will need to read a specific, small number of bytes from an endpoint, and the device just doesn't have any more than that to give you. Think simple serial protocols implemented over a pair of bulk endpoints, for instance. Making the host keep polling for more bytes than are actually expected may cause unwanted behaviour. Even if it is safe to keep polling, the maxPacketSize limitation means that the caller cannot express their wish for a transfer to return once it has N bytes, or at a given timeout. They must either set a long timeout and wait unnecessarily for more bytes even after N have arrived, or set a short one that works in most cases and then handle the case where the device is a little slower.

It would make sense for the Queue API to make use of RAW_IO, and to require multiple-of-maxPacketSize transfers, but there is no need to place that limitation on all individually submitted transfers, and doing so may make the library impractical to use for some use cases.

This raises a related issue, though: does a Queue have exclusive use of an endpoint during its lifetime? If so, it's straightforward for Queue to enable RAW_IO on the endpoint, and disable it again if the Queue is dropped.

However, the current API appears to allow for individual transfers to be submitted on an endpoint, whilst that endpoint is simultaneously in use by a Queue. That seems like it would only ever cause problems, and probably ought not to be permitted.

Personally I would have been inclined to make the API use an Endpoint type, instances of which would be obtained from an Interface, rather than sticking with the libusb pattern of simply referencing an endpoint as a u8. Then exclusive ownership of a given endpoint would be available in a straightforward and idiomatic way, which would make it possible to deal with the RAW_IO issue much more cleanly.

kevinmehall commented 10 months ago

> Sometimes you will need to read a specific, small number of bytes from an endpoint, and the device just doesn't have any more than that to give you. Think simple serial protocols implemented over a pair of bulk endpoints, for instance. Making the host keep polling for more bytes than are actually expected may cause unwanted behaviour. Even if it is safe to keep polling, the maxPacketSize limitation means that the caller cannot express their wish for a transfer to return once it has N bytes, or at a given timeout.

If you submit an IN transfer of maxPacketSize, and the device responds with a packet that is smaller than that, it's a short packet that ends the transfer and you get the data immediately, right? Or does WinUSB with RAW_IO not behave like other OSs in that regard?

It seems like the only drawback is requiring you to allocate more memory than you actually use if you know the device never sends a full packet, but we're talking about <1024 bytes per transfer here, so not a big deal on the host.

> This raises a related issue, though: does a Queue have exclusive use of an endpoint during its lifetime?

Currently, no. If you use multiple queues on an endpoint or submit transfers outside of a queue, the ordering is not defined and you probably shouldn't do this, but it won't do anything bad.

> Personally I would have been inclined to make the API use an Endpoint type, instances of which would be obtained from an Interface

This was my initial plan, basically making Queue AKA Pipe AKA Endpoint the only API and rejecting attempts to create more than one per endpoint. However when implementing it I didn't see a need to go out of my way to track and prevent creating multiple when there was no memory safety reason to do so, and worried that forcing people into the Queue pattern was too radical a change. I added the individual transfer API with TransferFuture because it was easy and seemed like what people would expect to see if they didn't want Queue. However, the TransferFuture API is full of footguns around cancellation, so I'd consider going back to only offering Queue and could make it exclusive.

martinling commented 10 months ago

> If you submit an IN transfer of maxPacketSize, and the device responds with a packet that is smaller than that, it's a short packet that ends the transfer and you get the data immediately, right? Or does WinUSB with RAW_IO not behave like other OSs in that regard?
>
> It seems like the only drawback is requiring you to allocate more memory than you actually use if you know the device never sends a full packet, but we're talking about <1024 bytes per transfer here, so not a big deal on the host.

Ah, we're thinking at different layers of abstraction. I was thinking in terms of asking for length N causing the library to make as many requests to the OS as needed to fill that length, until hitting a timeout if set.

Since you're treating a short packet as ending a transfer, and also requiring the use of a specific RequestBuffer that the user will have to copy from, rather than letting the user receive data straight into their own structures, requiring transfers to be a multiple of maxPacketSize makes more sense than I initially thought.

There are also cases where one may need to receive exactly N bytes from an endpoint, and where those N bytes may come in multiple short packets, but that can be addressed one layer above this API.

> Currently, no. If you use multiple queues on an endpoint or submit transfers outside of a queue, the ordering is not defined and you probably shouldn't do this, but it won't do anything bad.

It may not do anything bad in the sense of memory safety, but it still seems bad - is there any use case where it would do something useful?

> This was my initial plan, basically making Queue AKA Pipe AKA Endpoint the only API and rejecting attempts to create more than one per endpoint. However when implementing it I didn't see a need to go out of my way to track and prevent creating multiple when there was no memory safety reason to do so, and worried that forcing people into the Queue pattern was too radical a change. I added the individual transfer API with TransferFuture because it was easy and seemed like what people would expect to see if they didn't want Queue. However, the TransferFuture API is full of footguns around cancellation, so I'd consider going back to only offering Queue and could make it exclusive.

I think you're seeing a false dichotomy here. It's not a question of either u8 or a single "Queue AKA Pipe AKA Endpoint" type. There's no need for Endpoint and Queue to be the same type. I would make them separate.

Requesting an Endpoint from an Interface would just give you an exclusively owned handle to that endpoint, on which you could submit individual transfers. This would be useful in itself, to help manage the use of an endpoint within a program. For instance, you could have TX and RX threads, one owning a bulk IN endpoint and the other a bulk OUT one.

If you wanted to use the Queue pattern, you would construct the Queue around the Endpoint, giving it ownership. If you then wanted to return to submitting individual transfers on that endpoint, Queue could have a method that consumes the Queue and returns the original Endpoint.
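
To sketch the shape I mean (hypothetical types only, not the current nusb API):

```rust
// Hypothetical sketch of the Endpoint/Queue split described above. None of
// these types or methods exist in nusb today; the signatures are illustrative.
#[derive(Debug)]
struct ClaimError;
struct TransferFuture;

struct Interface { /* platform handle, set of claimed endpoints, ... */ }

/// Exclusively owned handle to one endpoint, obtained from an Interface.
struct Endpoint {
    address: u8,
    // per-endpoint platform state (e.g. whether RAW_IO is enabled) could live here
}

impl Interface {
    /// Fails if the endpoint is already claimed, so ownership is exclusive.
    fn endpoint(&self, address: u8) -> Result<Endpoint, ClaimError> {
        Ok(Endpoint { address })
    }
}

impl Endpoint {
    /// Individual transfers are submitted directly on the owned endpoint.
    fn submit(&mut self, _buf: Vec<u8>) -> TransferFuture {
        TransferFuture
    }
}

/// A Queue takes ownership of the Endpoint, giving it exclusive use
/// (and a natural place to enable RAW_IO, disabling it again on drop).
struct Queue {
    endpoint: Endpoint,
}

impl Queue {
    fn new(endpoint: Endpoint) -> Queue {
        Queue { endpoint }
    }

    /// Consuming the Queue hands the Endpoint back for individual transfers.
    fn into_endpoint(self) -> Endpoint {
        self.endpoint
    }
}

fn main() {
    let interface = Interface {};
    let endpoint = interface.endpoint(0x81).unwrap();
    let queue = Queue::new(endpoint);
    let mut endpoint = queue.into_endpoint();
    let _transfer = endpoint.submit(vec![0u8; 512]);
}
```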

martinling commented 6 months ago

I started sketching out what a PR for this might look like.

There are a few decisions to take. The first one is whether to require a program to specifically request RAW_IO.

Personally I would love it if everything just worked, at maximum performance, without downstream developers ever having to think specifically about this nasty little quirk of WinUSB. And unlike libusb, you've made the decision upfront to require that all transfers be a multiple of the endpoint's maximum packet size, which makes it possible for this to be done transparently.

So I'm in favour of this being fully automatic, but making that choice does impose some new complexity: nusb would have to transparently split up large transfers into smaller WinUsb_ReadPipe operations in order to satisfy the maximum transfer size requirement for RAW_IO. Personally I think that's a viable choice, but I'd like to know what you think before pursuing that approach.

If nusb does take that approach, there's a further choice: when should it actually make the necessary SetPipePolicy and GetPipePolicy calls? I think the best option is in PlatformSubmit::submit, just before the first ReadPipe call on an endpoint - and then keeping a map in WindowsInterface of which endpoints have had RAW_IO enabled, and what the maximum transfer size for them is.

Alternatively, the SetPipePolicy and GetPipePolicy calls could be done pre-emptively when an interface is claimed, for all inbound bulk or interrupt endpoints on that interface. I'm less keen on that, as it adds unnecessary calls that can go wrong, but there's an argument for it if you want to keep operations at submission time to an absolute minimum.
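
For reference, the two calls in question look roughly like this through the windows-sys bindings (a rough sketch only; the exact binding paths, required crate features, and value types are from my reading of the WinUSB documentation and may differ):

```rust
// Sketch: enable RAW_IO on an IN endpoint and query its per-transfer size limit.
// Assumes the windows-sys crate with the Win32_Devices_Usb (and Win32_Foundation)
// features; `handle` is the WINUSB_INTERFACE_HANDLE for the claimed interface and
// `endpoint` an IN endpoint address such as 0x81.
use std::ffi::c_void;
use windows_sys::Win32::Devices::Usb::{
    WinUsb_GetPipePolicy, WinUsb_SetPipePolicy, MAXIMUM_TRANSFER_SIZE, RAW_IO,
    WINUSB_INTERFACE_HANDLE,
};

unsafe fn enable_raw_io(handle: WINUSB_INTERFACE_HANDLE, endpoint: u8) -> Option<u32> {
    let enable: u8 = 1;
    let ok = WinUsb_SetPipePolicy(
        handle,
        endpoint,
        RAW_IO,
        std::mem::size_of::<u8>() as u32,
        &enable as *const u8 as *const c_void,
    );
    if ok == 0 {
        // SetPipePolicy failed: log a warning and fall back to non-RAW_IO behaviour.
        return None;
    }

    // With RAW_IO enabled, each WinUsb_ReadPipe call must not exceed this size.
    let mut max_transfer_size: u32 = 0;
    let mut len = std::mem::size_of::<u32>() as u32;
    let ok = WinUsb_GetPipePolicy(
        handle,
        endpoint,
        MAXIMUM_TRANSFER_SIZE,
        &mut len,
        &mut max_transfer_size as *mut u32 as *mut c_void,
    );
    (ok != 0).then_some(max_transfer_size)
}
```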

This is one area where I feel like including an Endpoint type in the API might have been helpful, because it would give you somewhere to do any per-endpoint setup work and store per-endpoint state. But it's not essential.

kevinmehall commented 6 months ago

> So I'm in favour of this being fully automatic, but making that choice does impose some new complexity: nusb would have to transparently split up large transfers into smaller WinUsb_ReadPipe operations in order to satisfy the maximum transfer size requirement for RAW_IO. Personally I think that's a viable choice, but I'd like to know what you think before pursuing that approach.

Yes, I think it would ideally be automatic.

What kind of per-transfer size limits are we talking about here? I remember seeing 4 MB somewhere, which is similar in magnitude to what you might want to limit yourself to on Linux given its 16MB default system wide transfer limit. Wondering if it's viable to punt that to the library user, either by documentation or by exposing the limit in a new API (though is there an API that would make sense for that cross-platform?).

My design philosophy with nusb has been to keep everything as thin of a wrapper around OS capabilities as possible and avoid hidden internal state. So I think it would ideally not split up transfers, but I could be convinced that it's worth it here if the limit is small. Libusb goes too far in trying to cover up OS differences, but it is possible to go too far in the other direction and make it hard to write portable code.

> when should it actually make the necessary SetPipePolicy and GetPipePolicy calls?

PlatformSubmit::submit seems reasonable. The one thing is that error handling is a little messy there, because I chose to make transfer submission unable to fail synchronously, so users only need one error handling path. So you have to stash the error and fake a completion on the transfer to fail asynchronously. I'm not sure how fallible SetPipePolicy actually is in cases other than bad args, though, and if it fails unexpectedly you maybe want to log something and continue without RAW_IO rather than fully breaking.

> This is one area where I feel like including an Endpoint type in the API might have been helpful, because it would give you somewhere to do any per-endpoint setup work and store per-endpoint state. But it's not essential.

Yeah, I've tentatively decided that this is a good idea for v0.2 or v1.0. I kind of want to get the library feature-complete before redesigning the API though just to make sure all constraints are considered (e.g. https://github.com/kevinmehall/nusb/issues/11, https://github.com/kevinmehall/nusb/issues/47).

martinling commented 6 months ago

> What kind of per-transfer size limits are we talking about here? I remember seeing 4 MB somewhere, which is similar in magnitude to what you might want to limit yourself to on Linux given its 16MB default system wide transfer limit. Wondering if it's viable to punt that to the library user, either by documentation or by exposing the limit in a new API (though is there an API that would make sense for that cross-platform?).

In older versions of Windows IIRC it can be as small as 64KB, minus a page or so. These days it's usually 2.1MB (0x1FEC00), but I think it can also be ~256KB for Low/Full speed devices.

The problem with trying to document it is that there is no guarantee of these numbers. There are no defined constants; you can only query the limit at runtime for a specific endpoint. We could document the behaviour seen in practice, but Windows could legitimately just decide to start using different limits.

> My design philosophy with nusb has been to keep everything as thin of a wrapper around OS capabilities as possible and avoid hidden internal state. So I think it would ideally not split up transfers, but I could be convinced that it's worth it here if the limit is small. Libusb goes too far in trying to cover up OS differences, but it is possible to go too far in the other direction and make it hard to write portable code.

I think that the options are either:

  1. Let the maximum transfer size be unlimited, and split large transfers into smaller parts at the OS level where needed.
  2. Add an API that allows querying the maximum transfer size that can be submitted for a given endpoint, and make it the caller's responsibility to split larger reads up into multiple transfers.

Personally I'm in favour of option 1.

I feel like option 2 just creates some extra needless hoops for the caller to jump through, without really providing any additional control or insight as a result. From the program's point of view, the possible outcomes for an internally-subdivided transfer are exactly the same as for one that's submitted as a unit: it either completes successfully, or it completes partially with an error.

Option 2 may also lead to surprising differences in behaviour when the same program is run on different systems. Consider a program that needs to make a very large read. It queries the maximum transfer size it can use, and splits its transfers accordingly, as required. On the developer's Windows system the program reads at say 20-40MB/s and the max transfer size is 2.1MB. They find they can use the completion of a transfer as a good opportunity to update a progress bar during a large read operation; on Windows it will update smoothly at 10-20Hz or so. Then a user runs that same program on another OS with no transfer size limit, and now the progress bar hangs until the end of the operation. Or they run it on a system with a very small transfer limit, and now it eats a lot of CPU because it's trying to redraw the UI too often.

If we choose option 1, then the behaviour from the point of view of the caller is always consistent. If you submit a large transfer, you get a single event when it completes. If you want more frequent events you submit smaller transfers. The observed behaviour from the caller's point of view is always the same, even if different things may be happening behind the scenes.

In fact it's not necessary to choose just one or the other. We could both provide an API that allows querying the maximum OS transfer size supported for an endpoint, and also give nusb the ability to split up transfers that are larger than this. Then if the caller does not want their transfers to be split, they can query the max size and act accordingly. If they don't care, they can just submit whatever size they want - provided it's a multiple of max packet size - and nusb will deal with it.
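
To illustrate what the splitting in option 1 amounts to, here is a standalone sketch of the chunking arithmetic (a plain helper, not nusb code; the limit and packet size in the example are just the numbers discussed above):

```rust
// Standalone sketch of the chunking arithmetic: split one logical IN transfer
// of `total_len` bytes into chunk sizes that respect the OS per-transfer limit
// while staying multiples of the endpoint's max packet size. Assumes `total_len`
// is itself such a multiple, as nusb already requires for IN transfers.
fn split_transfer(total_len: usize, max_transfer_size: usize, max_packet_size: usize) -> Vec<usize> {
    assert!(total_len % max_packet_size == 0);
    assert!(max_transfer_size >= max_packet_size);

    // Largest chunk that fits under the OS limit and is still packet-aligned.
    let chunk = (max_transfer_size / max_packet_size) * max_packet_size;

    let mut sizes = Vec::new();
    let mut remaining = total_len;
    while remaining > 0 {
        let this = remaining.min(chunk);
        sizes.push(this);
        remaining -= this;
    }
    sizes
}

fn main() {
    // Example: an 8 MiB read against the 2.1MB (0x1FEC00) limit with 512-byte packets
    // splits into four full 2,092,032-byte chunks plus a 20,480-byte remainder.
    println!("{:?}", split_transfer(8 * 1024 * 1024, 0x1FEC00, 512));
}
```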

> PlatformSubmit::submit seems reasonable. The one thing is that error handling is a little messy there, because I chose to make transfer submission unable to fail synchronously, so users only need one error handling path. So you have to stash the error and fake a completion on the transfer to fail asynchronously. I'm not sure how fallible SetPipePolicy actually is in cases other than bad args, though, and if it fails unexpectedly you maybe want to log something and continue without RAW_IO rather than fully breaking.

I'm not sure how fallible it is either. But I don't think there's ever a case where we'd have to signal an error there. In the case where SetPipePolicy fails, we can just log a warning and continue without RAW_IO. It's only a performance loss.

For libusb, this issue is much nastier, because they may potentially need to make it possible to disable RAW_IO again, in order to go back to submitting odd-size transfers. If SetPipePolicy fails at that point, then the program is completely stuck.

> Yeah, I've tentatively decided that this is a good idea for v0.2 or v1.0. I kind of want to get the library feature-complete before redesigning the API though just to make sure all constraints are considered (e.g. #11, #47).

That makes sense. One thing I think it would help with is reducing the amount of locking needed. Currently you may have multiple threads working on different endpoints, but they all need to share the WindowsInterface. So just to store a record of which endpoints have had RAW_IO enabled, I found I needed something like raw_io: Mutex<HashMap<u8, usize>> in that struct, and now every submit call has to take that lock.
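
A minimal sketch of that shape (hypothetical and simplified, not the actual code):

```rust
// Sketch of the per-interface state described above (hypothetical, simplified):
// the first submit on an endpoint enables RAW_IO and caches the limit WinUSB
// reports; every later submit still has to take the lock for the map lookup.
use std::collections::HashMap;
use std::sync::Mutex;

struct WindowsInterface {
    // endpoint address -> RAW_IO maximum transfer size reported by WinUSB
    raw_io: Mutex<HashMap<u8, usize>>,
    // ... WinUSB handle, etc.
}

impl WindowsInterface {
    fn raw_io_max_transfer(&self, endpoint: u8) -> usize {
        let mut map = self.raw_io.lock().unwrap();
        *map.entry(endpoint).or_insert_with(|| {
            // First transfer on this endpoint: this is where the
            // WinUsb_SetPipePolicy(RAW_IO) and
            // WinUsb_GetPipePolicy(MAXIMUM_TRANSFER_SIZE) calls would go.
            0x1FEC00 // placeholder limit for the sketch
        })
    }

    fn submit_in(&self, endpoint: u8, len: usize) {
        let max = self.raw_io_max_transfer(endpoint);
        assert!(len <= max, "transfer exceeds the RAW_IO per-transfer limit");
        // ... build the OVERLAPPED, call WinUsb_ReadPipe, etc.
    }
}

fn main() {
    let iface = WindowsInterface { raw_io: Mutex::new(HashMap::new()) };
    iface.submit_in(0x81, 16 * 1024);
}
```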

kevinmehall commented 5 months ago

> In older versions of Windows IIRC it can be as small as 64KB, minus a page or so. These days it's usually 2.1MB (0x1FEC00), but I think it can also be ~256KB for Low/Full speed devices.

Rust dropped support for Windows versions older than 10, which seems like a reasonable minimum support target, and Windows 10 is the only version I've tested nusb on.

> The problem with trying to document it is that there is no guarantee of these numbers. There are no defined constants; you can only query the limit at runtime for a specific endpoint. We could document the behaviour seen in practice, but Windows could legitimately just decide to start using different limits.

Yeah, I wouldn't be comfortable with promising a specific value either, but it seems like the trend is upwards and unlikely that Microsoft would lower the limit.

> From the program's point of view, the possible outcomes for an internally-subdivided transfer are exactly the same as for one that's submitted as a unit: it either completes successfully, or it completes partially with an error.

Does WinUSB have a way to atomically cancel subsequent IN transfers after receiving a short packet that normally terminates a transfer? Linux has USBDEVFS_URB_BULK_CONTINUATION for this, but I don't see anything similar as a WinUSB pipe policy.

> Consider a program that needs to make a very large read. It queries the maximum transfer size it can use, and splits its transfers accordingly, as required. On the developer's Windows system the program reads at say 20-40MB/s and the max transfer size is 2.1MB. They find they can use the completion of a transfer as a good opportunity to update a progress bar during a large read operation; on Windows it will update smoothly at 10-20Hz or so. Then a user runs that same program on another OS with no transfer size limit, and now the progress bar hangs until the end of the operation. Or they run it on a system with a very small transfer limit, and now it eats a lot of CPU because it's trying to redraw the UI too often.

I've been meaning to write some examples / convenience wrappers on top of Queue that could implement Read / AsyncRead / Write / AsyncWrite. If someone wants to e.g. read a big file from a device, I think they would ideally be using something higher level than expecting to submit it as a single transfer. Individual transfers are a fairly low-level API, and there are lots of different ways to abstract them depending on the nature of their use. I think it could be in scope for nusb to provide abstractions for different common patterns here, but that could be a layer on top of Transfer and Queue.
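
As a sketch of what such a wrapper might look like (hypothetical, blocking, and simplified; written against the current Queue / RequestBuffer / Completion API as I understand it, so names may differ):

```rust
// Hypothetical sketch of a blocking std::io::Read adapter over Queue;
// simplified and untested, not something nusb provides today.
use std::collections::VecDeque;
use std::io::{self, Read};

use futures_lite::future::block_on;
use nusb::transfer::{Queue, RequestBuffer};

struct BulkInReader {
    queue: Queue<RequestBuffer>,
    transfer_len: usize, // must be a multiple of maxPacketSize
    leftover: VecDeque<u8>,
}

impl BulkInReader {
    fn new(mut queue: Queue<RequestBuffer>, transfer_len: usize) -> Self {
        // Keep a few transfers in flight so the device can stream.
        for _ in 0..4 {
            queue.submit(RequestBuffer::new(transfer_len));
        }
        BulkInReader { queue, transfer_len, leftover: VecDeque::new() }
    }
}

impl Read for BulkInReader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if self.leftover.is_empty() {
            let completion = block_on(self.queue.next_complete());
            completion
                .status
                .map_err(|e| io::Error::new(io::ErrorKind::Other, format!("transfer failed: {e}")))?;
            self.leftover.extend(&completion.data);
            // Recycle the buffer to keep the queue full.
            self.queue
                .submit(RequestBuffer::reuse(completion.data, self.transfer_len));
            // Note: a zero-length packet surfaces as Ok(0), i.e. EOF, here.
        }
        let n = buf.len().min(self.leftover.len());
        for (dst, src) in buf.iter_mut().zip(self.leftover.drain(..n)) {
            *dst = src;
        }
        Ok(n)
    }
}
```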

Anyway, I'm not sure here. I think I'd prefer to start with the simple approach of not splitting transfers, and revisit if it actually becomes a practical problem for someone. But if you've already started that code, I'd be happy to take a look at it.

martinling commented 5 months ago

> Rust dropped support for Windows versions older than 10, which seems like a reasonable minimum support target, and Windows 10 is the only version I've tested nusb on.

I agree that targeting Windows 10 for support/testing is reasonable, but note that support was not completely dropped for older versions. Rust is still able to build programs for even much older Windows versions. They just no longer have Tier 1 support.

> Does WinUSB have a way to atomically cancel subsequent IN transfers after receiving a short packet that normally terminates a transfer? Linux has USBDEVFS_URB_BULK_CONTINUATION for this, but I don't see anything similar as a WinUSB pipe policy.

That's a great point, and no, not that I'm aware of. In all the cases I've encountered where I've needed RAW_IO, I've been streaming from endpoints where either I never expect a short packet at all, or where a short packet marks the end of the stream and no further data will follow if the device is polled again - or at least unless some control request is sent first.

If you had an endpoint that sends chunks of data one after the other, each delimited by a short packet, and you wanted to be sure to grab only one chunk and not poll further, then I think it's really not safe to be queuing up multiple transfers at all. You can do so if you can be certain there is a feature like USBDEVFS_URB_BULK_CONTINUATION in effect, but that seems dangerous to assume for a cross-platform API. Is there a feature like that on macOS? And does the Linux version actually guarantee that the host controller will not poll again, or can it be the case that the next operation is already queued in the hardware, even if the kernel-side queue is atomically cancelled?

> I've been meaning to write some examples / convenience wrappers on top of Queue that could implement Read / AsyncRead / Write / AsyncWrite.

That sounds good, and would certainly alleviate a lot of the hoop-jumping.

> I think I'd prefer to start with the simple approach of not splitting transfers, and revisit if it actually becomes a practical problem for someone. But if you've already started that code, I'd be happy to take a look at it.

OK. I hadn't gotten as far as implementing that. So I'll go with the simpler option and see how it looks.