Allow /post when /read is in progress

tsusanka commented 4 years ago

In this PR I would like to document the introduced changes. In #130 I would like to then provide additional motivation and documentation for our future selves. I will do that after we agree on this PR.

This PR has three commits:

At first I created 7a1d985 that introduces Force() method in the USBDevice interface in order to omit the TransferMutexLock in libusb, see https://github.com/trezor/trezord-go/commit/7a1d9858e439a841b10c66550e2e4c7571fd5a52#diff-da4fb7299e95c07f8f33934faef66777R535.
In 8e97c7d I have implemented, as discussed, a measure to make sure that the /post write does not overwrite another write. This is achieved with two Mutexes - one for read and one for write on the current session. That means that no two concurrent reads or writes are possible, however read and write can occur at the same time for given session.
In 0b6d02d I have removed most of 7a1d985, because the locks in libusb now seemed (a) redundant (b) being overwritten for /post anyway. So although they were probably correctly present in the low-level libusb code (because they are specific to the low-level implementation) it seems redundant now. At the same time, it seems to me that the higher-level locks do no harm for the other transfer implementations (HID, udp etc.), but of course this needs extensive testing.

I will of course squash before merge.

Also this does not change 95ff303445fa3ddb26e9e06d45cefbe6e17423f9, which is already in master. That commit omits the check for "other call on same session" for /post. The same thing might be introduced for /read, but since we do not need it at the moment I did not implement it.

In future, we should consider doing a bit of refactoring and make /call basically an alias for /post and then /read.

I am glad for any comments from anyone. @matejcik could maybe have a quick look?

matejcik commented 4 years ago

I don't think completely removing the transfer mutex is correct.

To summarize: the old code would allow only one pending call to lowlevel.Interrupt_Transfer, regardless of direction. AFAICT from the libusb doc, transfer direction is determined by the endpoint being used.

Comments on LibUSBDevice suggest that only one interrupt_transfer should ever be running at the same time, but the existence of debugTransferMutex suggests that different endpoints are OK for this. So calling on a read endpoint and a write endpoint simultaneously should be OK.

What worries me is that the new code enforces the "no two interrupt transfers" at a higher level. Hypothetically you could sidestep the locks by calling Core.writeDev or readDev directly, or any of the lower level methods.

A better design would be to have a per-endpoint mutex in the libusb code - perhaps modify transferMutexLock so that it takes the endpoint id as a parameter, instead of just debug: bool? And have four mutexes instead of two, or an array indexed by the endpoint.

tsusanka commented 4 years ago

This has simplified that code significantly, so I have force-pushed.

As @matejcik suggested I have added two more locks for libusb. I am passing a pointer to the appropriate mutex from the read/write functions, which seemed as the cleanest solution.

matejcik commented 4 years ago

I don't think removing the high-level locks is correct either.

IMHO the code as of 0b6d02d is correct in the practical sense: as long as nobody does any further development on trezord, it will work and lock properly. The low-level locks are a "code safety" measure, as I see it. (utACK on correctness of 9ca0991 for that)

A possible problem now is that sending a large message to Trezor is going to take multiple calls to Write (and thus readWrite), because each call writes exactly 64 bytes. With /post not being high-level locked, two writers on the same session might end up interleaving each other's packets.

To summarize:

two Interrupt_Transfers on the same endpoint should not run concurrently - otherwise presumably libusb crashes or something.
two "message writes" and "message reads" (what /post and /read does, respectively) should not run concurrently - otherwise we lose packet ordering.
Two /calls should not run concurrently - otherwise the "message writes" and "message reads" could interleave and the second caller would read the first response.

In all cases it is OK to leave the caller waiting on a mutex until the previous caller is done. In addition, there is no usecase for allowing simultaneous calls to /read (and thus /call) in the first place. The current behavior is sufficient to implement "retry until success" strategy on the client. (...except maybe something like "shut down the other caller", which ... is a usecase, but probably not one we are interested in)

(this is different for /post, which must not wait for a pending "message read", otherwise canceling messages won't work)

tsusanka commented 4 years ago

Very good points. For me the most important part is:

The low-level locks are a "code safety" measure, as I see it.

So in a summary:

two Interrupt_Transfers on the same endpoint should not run concurrently - otherwise presumably libusb crashes or something.

Is achieved (or rather improved) by https://github.com/trezor/trezord-go/pull/183/commits/9ca09919502aef29b2a1e509c7a432b9cf8acf5b.

two "message writes" and "message reads" (what /post and /read does, respectively) should not run concurrently - otherwise we lose packet ordering.

Will be done by re-introducing https://github.com/trezor/trezord-go/commit/8e97c7d4c71c82b2f3767a912e154c0adbb490ad.

Two /calls should not run concurrently - otherwise the "message writes" and "message reads" could interleave and the second caller would read the first response.

This is solved by the fact that no two /calls are allowed at the same time - see: https://github.com/trezor/trezord-go/blob/05991cea5900d18bcc6ece5ae5e319d138fc5551/core/core.go#L557

This should be enough. What do you think?

matejcik commented 4 years ago

agreed on all counts

matejcik commented 4 years ago

As discussed in a side-channel: We could also add a separate mutex for /call to strenghten the protection. We probably won't though.

For one, this would change behavior. Instead of rejecting the /call, bridge would wait until the previous caller is done. This is potentially a long time, because it is waiting for a response from Trezor. (that is also a point for rejecting simultaneous calls to /call and /read: makes it harder to write incorrect clients that don't keep track of their requests)

For two, this isn't a feature we would be interested in. Sequential /calls would allow for message batching, but (1) we don't need that for performance, because call latency isn't the bottleneck; and furthermore the messages would get stuck in the USB buffers anyway; (2) it doesn't match Trezor communication model too well, most operations are interactive; (3) if a client really wants batching, nothing is stopping them from implementing it with /post and /read.

tsusanka commented 4 years ago

Done as discussed. Please check the comment in ba94c79 especially I am semi-happy about it.

tsusanka commented 4 years ago

I will merge this after Szymon tests this (I've tested it myself, but just to make sure).

tsusanka commented 4 years ago

Merged.

trezor / trezord-go

Allow /post when /read is in progress #183