vfu_dma_read breaks if client sends message in between

tmakatos commented 3 years ago

There are cases where a libvfio-user public function (e.g. vfu_dma_read) sends a message to the client and then synchronously waits for the response. A response from the server cannot be avoided in the DMA read case as it contains the read data. The problem is that right after having sent the message to the client, the client might send its own message, which will be received by the blocking call of the server. However, the server is expecting a completely different message. vfu_dma_read is one of the functions that has this problem, effectively every public function that uses vfu_msg has the same problem, except the version negotiation ones.

vfu_irq_trigger has the same problem, since the server might send a trigger IRQ message and immediatelly receive a new command from the client, while it's expecting to receive the response for the IRQ message it just sent.

A similar situation can occur when the client is waiting for a response from the server and the server happens to send it an VFIO_USER_VM_INTERRUPT. This is a valid use case though, the client should be expecting such messages.

jlevon commented 3 years ago

There are currently only two paths that have this issue, vfu_dma_read() and vfu_dma_write() via tran->send_msg().

vfu_irq_message() (what Thanos was referring to above) has been removed for now.

awarus commented 2 years ago

is this issue still not resolved?

tmakatos commented 2 years ago

is this issue still not resolved?

Still not resolved.

mnissler-rivos commented 1 year ago

This issue has bitten me, here are some thoughts.

For context, I've been using libvfio-user to hook up PCI device models to hardware PCIe root port implementations, basically by bridging the PCIe transaction layer to VFIO-user. You can think of this as plugging a software-emulated PCIe card into a PCIe hardware design, where the hardware side is augmented with a VFIO-user client that talks to qemu as the vfio-user server.

In this scenario, I can't use memory-mapped DMA, because I need to inject DMA transactions back into PCIe hardware as memory read and write transaction layer packets. So I rely on VFIO-user's message-based DMA via VFIO_USER_DMA_{READ,WRITE} for my DMA transactions (side note: I have a qemu patch to add support for message-based DMA, happy to contribute that). However, with message-based DMA I immediately run into the problem described here: It's pretty common that MMIO traffic arrives while qemu is waiting for the reply to a DMA operation it has requested, and thus the synchronous wait-for-DMA-reply approach doesn't work. So, I'm interested in fixing this problem.

Here are a couple avenues I can think of:

While waiting for a DMA reply message, don't bail if a command arrives, but queue the command message up. Once the DMA reply comes in, process any queued commands. I have a very crude implementation of this to unblock my work, it certainly feels like a hack (see also below).
Rewrite libvfio-user to work fully asynchronously, allowing commands and replies to be received in arbitrary order. This would also require API changes - vfu_sgl_read and vfu_sgl_write would have to become asynchronous and report their result via a callback.
Tackle the root cause of the issue: Don't allow command messages in both directions over the socket. AFAICT, VFIO_USER_DMA_{READ,WRITE} are the only instances of server -> client commands (note that interrupts, which are the other server -> client interaction, already use event fds). So, it seems realistic to deprecate server -> client commands on the socket entirely (they've never worked properly in the libvfio-user implementation in the light of this bug anyways) and do message-based DMA in a different way: Introduce a 3rd flavor of VFIO_USER_DMA_MAP that passes a separate socket file descriptor to the server, which the server then uses to send VFIO_USER_DMA_{READ,WRITE} commands to the client. Note that this DMA socket would carry only carry commands in the server -> client direction. This would be a protocol-level change, but one that fits cleanly into the existing protocol architecture.

Assessing these options:

(1) is nice in that it can be done entirely inside libvfio-user, without any API changes. (For completeness, note that vfu_get_poll_fd can no longer be just the socket file descriptor, since it must indicate readiness also for queued messages, but there's a workaround using eventfd + epoll). (2) and (3) imply API changes, and (3) even requires a protocol-level change.
(2) is fundamentally the hardest to implement for clients and servers, since synchronous request-reply code can't easily be fixed. Even if we were to make libvfio-user's message handling completely asynchronous, switching DMA API functions to become asynchronous will have ripple effects. For example, qemu's DMA handling is currently synchronous and probably isn't practical to change to asynchronous, thus forcing qemu to implement a variant of (1) to hide the asynchronicity. The implementation of (1) is a bit messy since we need to buffer potentially unlimited messages in vfu_ctx, but perhaps that can be limited to a reasonable value in practice? (3) is pretty simple to implement since it is independent of the main socket.
Whatever solution we pick, there's a potential for deadlocks if both sides choose to use synchronous I/O. For example, the server might wait for a DMA operation while the client waits for an MMIO access to complete. At least one side will have to process messages "out of band", and thus implement some form of asynchronicity. IMO, (3) provides the most flexibility in that the client can choose to either be fully asynchronous and listen for the main socket as well as any DMA sockets, or it can choose to service the DMA socket(s) on a different thread. Implementations that don't require message-based DMA can choose simpler synchronous implementations. qemu can keep its synchronous DMA implementation (at the cost of the client needing to be able to handle asynchronous DMA requests, but it seems fair to pay that cost in the client if the client wants message-based DMA).

Overall, I like (3) best. I'm interested to hear other thoughts / opinions. Assuming we can reach consensus on how to move forward, I'll be willing to implement a solution.

tmakatos commented 1 year ago

I do like idea (3) the most. I even think we've entairtained this idea of having a separate socket for server -> client messages in the very early days, when we were implementing the protocol specification, but didn't have a solid use case at that point and decided to not bake it in the protocol. I don't think modifying the protocol is an issue, it's still experimental in QEMU (and we're still going to modify the protocol when we adopt migration v2). Let's hear from @jlevon and @jraman567 .

mnissler-rivos commented 1 year ago

Thanks @tmakatos. If anyone sees issues with approach (3) please let me know - I hope to have some cycles next week to start working on this.

jlevon commented 1 year ago

plugging a software-emulated PCIe card into a PCIe hardware design

Cool!

note that interrupts, which are the other server -> client interaction, already use event fds

While that's true today, that wouldn't be true if we ever wanted to do other implementations, over TCP/IP to a remote host, for example.

So I don't think it would make sense to build something specific to the DMA path.

This problem is the one of the reasons we didn't want to stabilize the library yet. I agree with the enumeration of the three possible approaches. I don't really like 1).

I'd sort of be OK with 3), though I can't speak for the qemu or vfio-user client maintainers. It feels like a bit of a workaround of inadequate libvfio-user design though, to be honest. It would also mean that async consumers of the API, still need to block during device->guest DMA transactions, which is an issue in itself. The separate socket of 3) doesn't help at all there. That is, the dma read/write APIs are fundamentally broken anyway.

2) is admittedly a lot more work - but it would then mirror what the client side does, which can handle this problem. So ideally we'd do 2) - Any thoughts on implementing that Mattias? I haven't really thought through the difficulties of implementation in terms of libvfio-user itself, but I think it shouldn't be too difficult:

->send_msg() no longer takes or waits for a reply. In fact we somehow make the support for synchronous request then reply, be only usable by samples/client.c
it takes a callback and private data pointer; we store this somewhere: probably need a list/hash for multiple outstanding requests.
is_valid_header() changes to allow replies just for DMA
handle_request() takes the reply and uses the given callback
change quiesce to ensure no DMA op is ongoing somehow
same thing for migration state
TBD if we need to allow for cancellation: I'd guess not, since we can't control if the client side is going to write into the provided buffer.

mnissler-rivos commented 1 year ago

It feels like a bit of a workaround of inadequate libvfio-user design though, to be honest. It would also mean that async consumers of the API, still need to block during device->guest DMA transactions, which is an issue in itself. The separate socket of 3) doesn't help at all there. That is, the dma read/write APIs are fundamentally broken anyway.

I see what you're saying, but for clarity we should separate the discussion into two concerns:

What kind of DMA APIs we provide: Synchronous, asynchronous or both. The decision should be informed by the needs of the consumers of the API. See below.
What the right protocol-level implementation is. Decision here is mostly determined by whether the protocol can support the APIs we need to support in a reasonable way.

So ideally we'd do 2) - Any thoughts on implementing that Mattias?

The libvfio-user changes to go fully async are relatively straightforward, I agree. The challenge is for code that consumes the API, because consuming async APIs causes ripple effects. If you look at qemu for example, it relies on synchronous pci_dma_read and pci_dma_write APIs. Switching these APIs to async would be a major effort in qemu, and it would mean touching all the hardware model implementations to make their DMA request code paths async. IMO, that's in the ain't-gonna-happen bucket, at least unless qemu has appetite for changing their DMA APIs for additional reasons.

If we assume that qemu hardware models will stick with a synchronous DMA API, then there are two paths:

libvfio-user offers both sync and async DMA APIs and lets the consumer choose what works for them. That's pretty simple to support if we have separate sockets.
libvfio-user's DMA API is async only, and qemu needs to adapt that to a synchronous API somehow. It's questionable whether this is at all possible. The only thing I can think of would probably look similar to the queuing approach I described as (1) above, but at the libvfio-user API surface level - ugh.

Bottom line: I don't really see how a libvfio-user that doesn't provide a synchronous DMA API can be consumed by qemu. This arguably causes some design warts, but a middle ground design that allows both sync/async DMA doesn't seem such a bad way forward, and it also gives API consumers an upgrade path to async.

jlevon commented 1 year ago

into two concerns

Sure, but one informs the other: right now, we can't provide an async API, because the underlying API is sync. While it's easier to build a sync API on top of an async implementation.

Switching these APIs to async

Yes, I would expect there to be both sync and async APIs, precisely for the sort of reasons you are mentioning. Of course, the former would basically need to insist on there being no interleaving still, so would still have the issue. That's true of form 3) as well, since we'd have to return back to run vfu_run_ctx() for the reply anyway.

It's possible the sync version could read then re-queue any intermediate messages, perhaps, but there's risks there in the cases of quiesce, migration, unmap, and so on.

mnissler-rivos commented 1 year ago

It's possible the sync version could read then re-queue any intermediate messages, perhaps, but there's risks there in the cases of quiesce, migration, unmap, and so on.

Plus, it's pretty heavy-weight since at least in theory it needs to queue up an unbounded number of messages of arbitrary sizes. It works though, it's approach (1) in my post above, and I have a hack that I'm currently running with.

It seems like we're all in agreement that we should have async DMA (in addition to sync) at the libvfio-user API level. Then the open question is whether we do the queuing approach, or we introduce separate socket(s) for the server->client direction.

Trying to summarize pros/cons (please call out anything I have missed):

queuing:

(+) no protocol changes required
(+) conceptually cleaner: single communication channel
(-) need to allocate unpredictable amounts of memory to hold queued messages
(-) vfio-user implementations are forcibly async with the need to match replies to commands

separate socket(s):

(+) can process any message as it's read from the socket, when handling sync DMA other commands stay on the main socket, natural back pressure to client.
(+) simplified implementations are possible that handle the sockets using send-request-read-reply loops (and can achieve async by using threads). judging from what's out there, this style is popular, see for example https://github.com/rust-vmm/vfio-user/blob/main/src/lib.rs
(-) protocol change

As is probably obvious by know, I'm leaning towards separate socket(s), mostly because I think it's simpler and makes vfio-user easier to implement, but I can totally relate to the single-socket solution feeling conceptually cleaner. In the end, I don't really feel strongly though, just hoping to reach closure so we can move forward.

jlevon commented 1 year ago

I'm not clear on how a separate socket helps at all with providing a sync API; since the reply arrives on the original socket still, you still need to somehow process other incoming messages.

mnissler-rivos commented 1 year ago

Ah, so we've been talking past each other, I'm not intending to send the DMA reply on the main socket.

Let me recap the separate socket idea again to clarify:

Client obtains a pair of sockets by calling socketpair(), let's call them dma_client_fd, and dma_server_fd, respectively.
When calling VFIO_USER_DMA_MAP command, the client passes a flag to indicate a new mode of performing DMA operations (let's call it VFIO_USER_F_DMA_REGION_SOCKET_IO), as well as dma_server_fd.
When the server wants to perform DMA for the mapping in question, it puts an VFIO_USER_DMA_READ command on dma_server_fd.
The client monitors and reads dma_client_fd. When it receives the VFIO_USER_DMA_READ command, it performs the DMA operation and sends back the reply on dma_client_fd.

This enables the DMA operation to be handled "out of band" of the main socket. Thus, a synchronous write-DMA-command-then-read-back-reply on the sockets are unproblematic (since commands on the main socket are only ever sent by the client, and commands on the DMA socket are only ever sent by the server, the condition that is the root cause of the bug never occurs). Furthermore, the server can choose to synchronously wait for the DMA operation to complete and ignore all other I/O on the main socket during that time (this is what I was referring to when I said this is simpler to implement).

jlevon commented 1 year ago

Note that this DMA socket would carry only carry commands in the server -> client direction.

I think I read "messages" here instead of "commands", hence the confusion. It's bi-directional, in fact.

So, yeah, that sounds reasonable.

jlevon commented 1 year ago

oops

jlevon commented 1 year ago

@mnissler-rivos I asked Jag about this on slack if you'd like to join

tmakatos commented 1 year ago

I've just started reading the rest of the discussion, but the following just occured to me:

There are cases where a libvfio-user public function (e.g. vfu_dma_read) sends a message to the client and then synchronously waits for the response.

What's the point of the vfio-user server requiring a response from the client for a DMA command? If the client doesn't like the command it received, the server doesn't need to know about that specific fact, the client might deal with it any way it sees fit (e.g. reset the device). So if we remove this behavior there's no problem in the first place, right?

jlevon commented 1 year ago

DMA read requires a response with the data

mnissler-rivos commented 1 year ago

What's the point of the vfio-user server requiring a response from the client for a DMA command?

For a DMA read, it wants the data from the client. There is no conceptual reason why that couldn't be delivered asynchronously though. In case of qemu, the device models often have code that perform DMA reads to get the next network packet and then process it, for example, and that code is mostly synchronous right now. So, IMHO we should definitely provide an async API, but keep a synchronous version to avoid breaking more simplistic server designs.

tmakatos commented 1 year ago

DMA read requires a response with the data

D'oh of course.

Furthermore, the server can choose to synchronously wait for the DMA operation to complete and ignore all other I/O on the main socket during that time (this is what I was referring to when I said this is simpler to implement).

What if the client never responds to the DMA read? What if it decides to reset the device, so the server should actually be paying attention to the main socket? I guess the client will have to start making assumptions that the server won't be paying attention to the main socket so it first has to fail the DMA read?

jlevon commented 1 year ago

Yep - we'll need to clearly define exact behaviour in these sorts of cases

tmakatos commented 1 year ago

I guess it boils down to what real HW would do: if it was in the process of doing DMA would it be handling MMIO in the first place?

mnissler-rivos commented 1 year ago

I don't think there's a universal answer on how hardware behaves. At the PCIe TLP level, all packets are tagged in order to match up completions, so there aren't any inherent restrictions. I'm also not aware of any requirements to that effect in the PCIe spec, but I'll ask a coworker who knows it much better than me to make sure. That said, the spec does include a completion timeout mechanism (section 2.8) to deal with situations where an expected reply never arrives.

From a VFIO software perspective, robust clients and servers should ideally not make any assumptions on communication patterns in my opinion. Having implemented a fully asynchronous client now, I am painfully aware though that this can be challenging in practice. Plus, you can't just mandate full async and ignore the practical constraints in existing software such as qemu.

Note that separating the sockets offer increased flexibility in implementing a correct client: It can spawn a thread to handle DMA separately from the main socket, and unless there is problematic cross-thread synchronization, everything will be fully async safe even without switching to an event-based design.

mnissler-rivos commented 1 year ago

FWIW, I have implemented a first draft of the separate socket approach here: https://github.com/mnissler-rivos/libvfio-user/compare/master...mnissler-rivos:libvfio-user:socket_dma

I've tested it with qemu's vfio-server implementation, and it seems to work fine. It's still lacking tests and the async API though. Wanted to share what I have though since I'll be out until Monday.

tmakatos commented 1 year ago

I don't think there's a universal answer on how hardware behaves.

Fair enough. I think the dual-socket approach makes sense.

FWIW, I have implemented a first draft of the separate socket approach

Looks sensible to me, though not sure I follow the distinction between DMA_ACCESS_MODE_MESSAGE and DMA_ACCESS_MODE_SOCKET, since they both use sockets. Are you treating DMA_ACCESS_MODE_MESSAGE as potentially something different, not a socket? If we didn't make this distinction then the patch would be simpler. Regardless, we need to update the spec in qemu-devel and document corner cases, e.g. what happens if the server sends a DMA request on the main socket while there's a specific DMA socket for that region?

Also, it would be nice to have some kind of Python-based unit test, which could also act as a sample (e.g. a server with two DMA regions, one accessible via its own socket, the other not, and the client doing DMA transfers on both regions).

mnissler-rivos commented 1 year ago

Looks sensible to me, though not sure I follow the distinction between DMA_ACCESS_MODE_MESSAGE and DMA_ACCESS_MODE_SOCKET, since they both use sockets. Are you treating DMA_ACCESS_MODE_MESSAGE as potentially something different, not a socket? If we didn't make this distinction then the patch would be simpler.

DMA_ACCESS_MODE_MESSAGE is the old behavior with the DMA command passed over the main socket. I was (perhaps incorrectly?) assuming that we'll want to keep existing behavior for backwards compatibility and enable the new way on request by the client.

Regardless, we need to update the spec in qemu-devel and document corner cases, e.g. what happens if the server sends a DMA request on the main socket while there's a specific DMA socket for that region?

Interesting point. I guess it would make sense to reject, but given that you can technically have multiple DMA mappings, each with its own mode, I'm not so sure. If we conclude that we won't support VFIO_USERDMA{READ,WRITE} on the main socket, then it makes sense to reject them obviously.

Also, it would be nice to have some kind of Python-based unit test, which could also act as a sample (e.g. a server with two DMA regions, one accessible via its own socket, the other not, and the client doing DMA transfers on both regions).

Yes, I will add that before sending an actual pull request.

tmakatos commented 1 year ago

Looks sensible to me, though not sure I follow the distinction between DMA_ACCESS_MODE_MESSAGE and DMA_ACCESS_MODE_SOCKET, since they both use sockets. Are you treating DMA_ACCESS_MODE_MESSAGE as potentially something different, not a socket? If we didn't make this distinction then the patch would be simpler.

DMA_ACCESS_MODE_MESSAGE is the old behavior with the DMA command passed over the main socket. I was (perhaps incorrectly?) assuming that we'll want to keep existing behavior for backwards compatibility and enable the new way on request by the client.

I misremembered the semantics of VFIO_USER_DMA_MAP: previously we could get an FD which meant we could mmap(2) it, now we might get an FD which is not a file but a socket, so we'd have to send(2) to it, therefore the client needs to tell the server what kind of FD it is. (I think this could be a useful explanation in the commit message?) So the distiction you make is necessary.

What I wanted to avoid is the special casing in vfu_dma_transfer(), but I see it's necessary since vfu_ctx->tran->send_msg() uses the default transport FD. I can't come up with a nicer way, perhaps add an extra fd argument to vfu_ctx->tran->send_msg() that if it's not -1 then use that instead? Not sure it's worth it TBH.

Regardless, we need to update the spec in qemu-devel and document corner cases, e.g. what happens if the server sends a DMA request on the main socket while there's a specific DMA socket for that region?

Interesting point. I guess it would make sense to reject, but given that you can technically have multiple DMA mappings, each with its own mode, I'm not so sure. If we conclude that we won't support VFIO_USERDMA{READ,WRITE} on the main socket, then it makes sense to reject them obviously.

makes sense

jraman567 commented 1 year ago

Hi, I got to think about this problem for a bit.

To step back, we have assumed that the QEMU end would be the client and the remote end would be the server. This assumption is not accurate. When the remote end initiates DMA reads, the remote takes the role of the client. As such, there is a possibility of a deadlock when both ends initiate requests simultaneously.

We should have a lock to manage concurrent access to the communication channel. Given QEMU and the remote are in different processes, it may take a lot of work to implement a locking model correctly and fairly. Please advise if any C library implements locking between different processes.

Given the above, I suggest going with two socket pairs. QEMU should exclusively initiate requests on one channel, and the remote end should do the same with the other one. Please share your thoughts.

jlevon commented 1 year ago

vfio-user, as a protocol, shouldn't be encoding stuff like processes, so any kind of locking is a non-starter. But, anyway, I don't accept that there is a deadlock here, that depends entirely on the implementation of both sides. You'd only have a deadlock in an ABBA sort of situation which is not what we have here.

Practically speaking, we have to think through the reset etc. scenarios for sure, but sounds like consensus is the separate socket thing.

mnissler-rivos commented 1 year ago

Thanks @jraman567 for sharing your thoughts, and I'm happy to hear we are aligned on using a separate socket for server->client commands and their replies.

@jraman567 what do you think about my specific proposal to optionally pass another socket in VFIO_USER_DMA_MAP request? This essentially establishes a server->client communication channel per DMA mapping. As an alternative, we could set up a single socket for reverse communication as part of negotiate. The former fits in more naturally with the current structure of the protocol IMO, the latter is probably preferable if we anticipate to add more server-initiated commands in the future.

jlevon commented 1 year ago

I'm not really a big fan of tying it to DMA_MAP. I'd rather have some negotiate step provide an fd.

mnissler-rivos commented 1 year ago

some negotiate step

Would that be passing the FD along with VFIO_USER_VERSION at negotiation time, possibly indicating the index in the fd array via a parameter in the capability json? Or rather a whole new negotiation message exchange?

jlevon commented 1 year ago

not sure I have an opinion on that

mnissler-rivos commented 1 year ago

Here's a fresh snapshot of what I have right now: https://github.com/mnissler-rivos/libvfio-user/compare/master...mnissler-rivos:libvfio-user:socket_dma

Includes test fixes and a pytest for vfusgl{read,write}.

Note that this is still passing the socket fd in DMA_MAP, since I had this almost ready when we started discussing this angle yesterday. I will investigate how a solution that sets up a context-scoped server -> client socket next.

Happy to hear any input you may have, but no pressure, I still got work to do :-)

tmakatos commented 1 year ago

@mnissler-rivos I think it's mature enough to become a a PR (so we don't have to re-review it, I've already reviewed part of it), passing the sockect fd during negotiate can be fixed up by additional commits.

mnissler-rivos commented 1 year ago

@tmakatos I hear you - but please give me a day or so to investigate the context-scoped reverse socket approach.

There's a chance that with the alternative implementation I can hide all the details of the reverse socket in the transport layer. If that succeeds, most of the changes I have on the branch won't be necessary.

jraman567 commented 1 year ago

@mnissler-rivos Setting up the reverse communication channel during negotiation sounds good to me too. Thank you!

mnissler-rivos commented 1 year ago

Here's a proof of concept for setting up a separate server->client command socket at negotiation time: https://github.com/mnissler-rivos/libvfio-user/compare/master...mnissler-rivos:libvfio-user:cmd_socket

It ended up cleaner than I had feared, and it's future-proof in the sense that it transparently moves all server->client commands over to the separate socket.

I haven't finished a suitable pytest yet, but I have test against my client and confirmed that it works.

It'd be great if you could weigh in on whether this approach is indeed what you want to go with - I'll be happy to turn this into a proper PR then.

jlevon commented 1 year ago

That is indeed a lot less than I expected! I have some comments but looks broadly fine, please open 2 PRs for the separate changes, and don't forget to update the vfio-user spec too. Thanks!

mnissler-rivos commented 1 year ago

Sorry for the delay, I was on vacation. I've just created a pull request that sets up a separate socket pair at negotiation time, complete with tests and specification updates.

@jlevon you asked for 2PRs above - not sure whether you'd like me to split this one or see a separate PR for the alternative approach of setting up a socket FD for each DMA mapping?

jlevon commented 1 year ago

@mnissler-rivos close_safely() is unrelated to these changes so should be a separate PR (with the other on top I guess).

mnissler-rivos commented 1 year ago

Ah, makes sense. Here is a separate pull request for close_safely: https://github.com/nutanix/libvfio-user/pull/763

jlevon commented 1 year ago

fixed via #762

nutanix / libvfio-user

vfu_dma_read breaks if client sends message in between #279