support for vIOMMU with mmap()

libvfio-user doesn't work with a vIOMMU that remaps GPAs to IOVAs and wants to access guest RAM via mmap() (it works fine if guest RAM is accessed via vfu_sgl_read and vfu_sgl_write).

From discussion with @jlevon and @mnissler-rivos on libvfio-user Slack (https://libvfio-user.slack.com/archives/C01AFGCSPTR/p1696411059172959):

IOMMU PT/identity DMA: works fine: incoming IOVAs are the same as GPAs

IOMMU remapping: doesn't work: IOVAs need translation somehow into the underlying GPA prior to the libvfio-user dma code

ATS on top: mostly about endpoint caching translations. would need some version of this between qemu/libvfio-user for any hope of performance

PRI on top: mostly about demand paging: same?

PASID on top: no idea!

I've spent some more time looking into IOMMU/ATS/PRI/PASID and wrote some exploratory code. With this background, I think I have a better grasp now on what would be needed to support these features:

Conceptually, the vfio-user server operates in the virtual I/O address space created by the IOMMU it is subordinate to. This means that all DMA operations are in terms of IOVA. This includes not only DMA read/write, but also MAP/UNMAP. As a result, when IOMMU mappings get created and torn down, the corresponding IOVA regions need to be propagated via VFIO_USER_DMA_{MAP,UNMAP} to the server. This is already implemented (but lacking PASID support, see below).
ATS primitives can be mapped to VFIO-user as follows:
- A translation request can be satisfied simply by checking whether the server has a DMA mapping for the requested IOVA. At the hardware level, this returns the physical address if a mapping is present, which is mainly used to generate DMA requests with pre-translated addresses. Since the VFIO-user server operates in the IOVA address space, the physical address isn't really needed AFAICT. In case I am wrong and it is actually needed, it could be supplied by the client as an additional parameter when creating the DMA mapping.
- Invalidations are represented by VFIO_USER_DMA_UNMAP. There is an interesting detail here where the invalidation completion is only OK to send once the server has made sure to finish/abort any ongoing DMA accesses to the mapping. This is probably more relevant in hardware than it is for VFIO-user, but this may manifest in race conditions in the protocol that we need to think about.
- PRI page requests are a concept that currently doesn't exist on the VFIO-user side. It will require a new server->client message. The expectation is that the client will set up DMA mappings for the requested address(es) if possible.
- In hardware, page request group response messages (i.e. page request completions) are used to signal that a bunch of page requests have been processed. Since we have explicit VFIO_USER_DMA_MAP commands, these are probably sufficient to inform the server of the fate of a previous page request, at least for successful page requests. Something may have to be done about rejected page requests, perhaps by generating VFIO_USER_DMA_MAP commands with the protection flags zeroed out to indicate that the corresponding IOVA is in fact not accessible.
PASID support is on top of all the above (although on-demand paging with PRI is typically used with PASIDs). Each PASID basically refers to a separate IOVA address space. This means that all operations dealing with IOVAs need to support optional annotation with a PASID. This obviously affects VFIO_USER_DMA_{READ,WRITE}, but also VFIO_USER_DMA_{MAP,UNMAP}. This either means breaking changes to the existing protocol message format to add the PASID field, or introducing extended versions of these messages that include the PASID field. On the libvfio-user side, the DMA controller would be extended to manage mappings not only for a single I/O address space, but one independent address space per PASID.
Additional non-trivial work to integrate the above with qemu's vfio-user server and client will be necessary, including, but not limited to:
- As far as I'm aware there is no support for page requests in qemu at the time of writing this. It would perhaps make most sense as an additional IOMMU operation, or perhaps a new function on AddressSpace for device models to call.
- Depending on the previous design decision, the vfio-user server may have to provide an IOMMU implementation after all (it currently just adds/removes memory regions in response to DMA map/unmap commands).
- I can't find PASID support in current qemu either, although it seems a patch was proposed years ago.
- I expect additional work in qemu IOMMU implementations to wire up PASID / page request support.
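
To make the ATS and PASID points above a bit more concrete, here is a minimal Python sketch of a DMA controller that keeps one independent IOVA address space per PASID. This is not libvfio-user code: `DmaController`, `NO_PASID` and `translation_request` are hypothetical names, and the protection flag values are made up for the example. It treats a translation request as a plain mapping lookup, uses unmap as the invalidation, and represents a rejected page request as a mapping with zeroed protection flags.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

PROT_READ = 0x1   # illustrative flag values, not taken from the protocol
PROT_WRITE = 0x2
NO_PASID = None   # hypothetical marker for requests that carry no PASID


@dataclass
class Mapping:
    iova: int
    size: int
    prot: int  # 0 means "known to be inaccessible" (rejected page request)

    def overlaps(self, iova: int, size: int) -> bool:
        return self.iova < iova + size and iova < self.iova + self.size

    def contains(self, iova: int, length: int) -> bool:
        return self.iova <= iova and iova + length <= self.iova + self.size


class DmaController:
    """Tracks DMA mappings, one independent IOVA space per PASID."""

    def __init__(self) -> None:
        self.spaces: Dict[Optional[int], List[Mapping]] = {}

    def dma_unmap(self, iova: int, size: int, pasid=NO_PASID) -> None:
        # Doubles as an ATS invalidation: drop every overlapping mapping.
        space = self.spaces.get(pasid, [])
        self.spaces[pasid] = [m for m in space if not m.overlaps(iova, size)]

    def dma_map(self, iova: int, size: int, prot: int, pasid=NO_PASID) -> None:
        # A new mapping replaces whatever was known about the range before.
        self.dma_unmap(iova, size, pasid)
        self.spaces[pasid].append(Mapping(iova, size, prot))

    def translation_request(self, iova: int, length: int, prot: int,
                            pasid=NO_PASID) -> bool:
        # "Translation" here only answers: may the device access this range?
        for m in self.spaces.get(pasid, []):
            if m.contains(iova, length):
                return (m.prot & prot) == prot
        return False
```

The same IOVA can then be writable under one PASID and read-only or absent under another, which is the property the per-PASID address spaces are meant to provide.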

So much for what I've been able to identify. Given the absence of device models that make use of ATS, I don't plan to implement the above in the foreseeable future, but wanted to dump it here in the hope that it'll be useful to whoever may pick this up in the future.

Thanks for the write up. Just a couple of comments:

> There is an interesting detail here where the invalidation completion is only OK to send once the server has made sure to finish/abort any ongoing DMA accesses to the mapping. This is probably more relevant in hardware than it is for VFIO-user, but this may manifest in race conditions in the protocol that we need to think about.

The server is required to handle this: https://github.com/nutanix/libvfio-user/blob/master/docs/vfio-user.rst#id5
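
As a rough illustration of that ordering requirement, the following single-threaded Python model defers the unmap reply until no DMA access to the mapping is in flight. The class and method names are hypothetical and do not correspond to libvfio-user's API; a real server would additionally need locking or would drive this from its event loop.

```python
from typing import Callable, Optional


class TrackedMapping:
    """A DMA mapping that counts in-flight accesses (hypothetical model)."""

    def __init__(self, iova: int, size: int) -> None:
        self.iova = iova
        self.size = size
        self.in_flight = 0
        self.unmapping = False
        self.reply: Optional[Callable[[], None]] = None  # pending unmap reply

    def begin_access(self) -> bool:
        # Refuse new accesses once an unmap has been requested.
        if self.unmapping:
            return False
        self.in_flight += 1
        return True

    def end_access(self) -> None:
        self.in_flight -= 1
        self._maybe_complete()

    def request_unmap(self, reply: Callable[[], None]) -> None:
        # Remember how to reply; only reply once nothing is in flight.
        self.unmapping = True
        self.reply = reply
        self._maybe_complete()

    def _maybe_complete(self) -> None:
        if self.reply is not None and self.in_flight == 0:
            reply, self.reply = self.reply, None
            reply()
```

With this shape, an unmap that arrives while a transfer is running is acknowledged only after the matching `end_access()` call, and no new transfer can start against the mapping in between.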

> PRI page requests are a concept that currently doesn't exist on the VFIO-user side. It will require a new server->client message. The expectation is that the client will set up DMA mappings for the requested address(es) if possible.

Unless I'm missing something this looks quite straightforward.

Regarding the rest, I'll try to extend this once I've fully understood it ;).

FWIW, I have uploaded my proof-of-concept code at https://github.com/mnissler-rivos/libvfio-user/tree/ats

It successfully handles PASID-enabled DMA read/write against pages that get dynamically requested. Tested with a standalone implementation of qemu's edu device model (with some extensions to add PASID support) as the server and a heavily hacked qemu client based on oracle's qemu github repo.

Just sharing what I have in case it is useful - this code isn't suitable for merging, but intended to supplement my wall of text above.

Took me a while to read this. One question: I'm not quite clear why we would/should implement something like PRI.

Since we (server) need to have mapped access to memory ahead of time anyway via DMA_MAP, why does it make sense for us to be demand-faulting in like this?

You will probably know/understand some/most of the below, but let me start at the beginning to present a coherent line of thought:

The premise of placing IOMMUs in the DMA data path is that devices actually don't have access to the entire system memory, but the OS can control what they have access to by programming the (v)IOMMU. Even then, the device/VFIO-user server can still pretend that it has full access to its I/O virtual address space: The host can perform DMA_MAP/DMA_UNMAP operations as the guest kernel manages the IOVA and reprograms the vIOMMU. As long as the device only hits accessible memory, everything will work fine (qemu actually supports this to some extent for vhost already IIRC).

The problematic point with the above approach is that we need to proactively set up IOMMU mappings for all memory that the device/VFIO-user server may potentially access. There are two issues with that:

- Device-accessible memory is effectively pinned and cannot be paged out (by the guest kernel), potentially causing memory footprint issues.
- Shared Virtual Addressing (SVA), i.e. sharing an address space between the device and a user space process, is hard if not impractical to support since you'd have to proactively map most/all user space process memory.

Both cases are addressed by demand-paging as enabled by PRI.
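
From the server's point of view, the demand-paging flow could look roughly like the Python sketch below. The `page_request` call stands in for the new server->client message discussed earlier, which does not exist in the protocol today; `Client`, `Server` and all method names are hypothetical, and the sketch only handles accesses that stay within one page.

```python
PAGE_SIZE = 4096


class Client:
    """Stands in for the vfio-user client (e.g. qemu) that owns guest memory."""

    def __init__(self, accessible_pages):
        self.accessible_pages = set(accessible_pages)  # pages the guest allows
        self.requests = []                             # page requests received

    def page_request(self, server, page_iova):
        # Hypothetical server->client message: please map this page.
        self.requests.append(page_iova)
        if page_iova in self.accessible_pages:
            server.dma_map(page_iova, bytearray(PAGE_SIZE))
        else:
            # Rejected request: map with no access so the server learns the outcome.
            server.dma_map(page_iova, None)


class Server:
    def __init__(self, client):
        self.client = client
        self.pages = {}  # page IOVA -> backing memory, or None if rejected

    def dma_map(self, page_iova, memory):
        self.pages[page_iova] = memory

    def dma_unmap(self, page_iova):
        self.pages.pop(page_iova, None)

    def dma_read(self, iova, length):
        page_iova = iova & ~(PAGE_SIZE - 1)
        if iova + length > page_iova + PAGE_SIZE:
            raise ValueError("sketch only supports single-page accesses")
        if page_iova not in self.pages:
            # Nothing mapped yet: demand-fault the page in via a page request.
            self.client.page_request(self, page_iova)
        memory = self.pages.get(page_iova)
        if memory is None:
            raise PermissionError(f"IOVA {iova:#x} is not accessible")
        offset = iova - page_iova
        return bytes(memory[offset:offset + length])
```

Only pages the device actually touches end up mapped, and the client remains free to unmap a page again, after which the next access simply triggers another page request.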

So much for what problems PRI support can solve. Whether these problems are worth addressing in VFIO-user is a separate question though. I suspect that right now, setups that employ a vIOMMU are pretty rare, and VFIO-user servers that expose devices which actually would benefit from demand-paging / PRI are largely non-existent (as a side note, my own interest is motivated by the desire to model/mimic real hardware as closely as possible, but I appreciate that this isn't a primary goal for the VFIO-user protocol or libvfio-user project). The situation might change in the future though: If/when more hardware starts adopting ATS/PRI and SVA becomes more prevalent, sooner or later folks will want to use this within their VMs, and thus PRI/ATS will become relevant for VFIO-user.

Given the above, IMO it's perfectly fair to keep this open as a future enhancement, to be picked up if/when a more substantial use case appears.

That said, there is one angle that I think may warrant consideration now: PASID support in the protocol. DMA_MAP, DMA_UNMAP, DMA_READ, DMA_WRITE operations must convey the PASID they're targeting, which will necessitate additional protocol fields. I understand the protocol is still unstable, so we might want to add PASID fields now to avoid going through an awkward protocol upgrade at a later point. Then again, I don't see PASID support in VFIO at this point, so there's also an argument to wait for PASID support to appear there.
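
To illustrate the "extended message" option, here is a small Python sketch that packs a DMA map payload with an optional PASID. The layouts, the field order and the `FLAG_HAS_PASID` bit are all invented for this example and are not the vfio-user wire format; the point is only that a flag bit plus trailing fields lets payloads without a PASID keep their current shape.

```python
import struct

# Hypothetical layouts for illustration only, NOT the vfio-user wire format.
BASE_FMT = "<IIQQ"         # argsz, flags, iova, size
PASID_TAIL_FMT = "<II"     # pasid, padding (appended in the extended form)
FLAG_HAS_PASID = 1 << 31   # hypothetical flag marking the extended form

BASE_SIZE = struct.calcsize(BASE_FMT)
EXT_SIZE = BASE_SIZE + struct.calcsize(PASID_TAIL_FMT)


def pack_dma_map(iova, size, flags=0, pasid=None):
    """Pack a DMA map payload, appending the PASID only when one is given."""
    if pasid is None:
        return struct.pack(BASE_FMT, BASE_SIZE, flags, iova, size)
    head = struct.pack(BASE_FMT, EXT_SIZE, flags | FLAG_HAS_PASID, iova, size)
    return head + struct.pack(PASID_TAIL_FMT, pasid, 0)


def unpack_dma_map(payload):
    """Parse either form; payloads without the flag report pasid=None."""
    argsz, flags, iova, size = struct.unpack_from(BASE_FMT, payload)
    pasid = None
    if flags & FLAG_HAS_PASID:
        pasid, _pad = struct.unpack_from(PASID_TAIL_FMT, payload, BASE_SIZE)
    return {"iova": iova, "size": size,
            "flags": flags & ~FLAG_HAS_PASID, "pasid": pasid}
```

Adding the field unconditionally while the protocol is still unstable would avoid the flag and the two code paths, at the cost of a breaking change to the message format.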

> setups that employ a vIOMMU are pretty rare

Specifically, setups that do anything other than identity mappings. If we have identity mappings (and of course no PASID etc.), everything already works.

> desire to model/mimic real hardware as closely as possible

This is a worthy goal - and potentially especially relevant in cases where the server side is acting as a proxy for some kind of real hardware underneath.

What's foxing me about PRI here is really about who is actually acting as the IOMMU, I think. Presumably qemu+kvm itself still has all of guest memory mapped into its process space, right? What I'm asking is whether we should do the same on the server side. That is, is the IOMMU effectively emulated in qemu or in libvfio-user - does that make sense?

This isn't arguing for or against actually having ATS etc. support, just the nature of the implementation.

> PASID support in the protocol

I would like to see that for sure.

> What's foxing me about PRI here is really about who is actually acting as the IOMMU, I think. Presumably qemu+kvm itself still has all of guest memory mapped into its process space, right? What I'm asking is whether we should do the same on the server side. That is, is the IOMMU effectively emulated in qemu or in libvfio-user - does that make sense?

Ah, thanks for drawing attention to this angle. I had actually thought about this and remote IOMMU approaches are definitely possible. I previously ended up concluding that the IOMMU probably makes more sense on the qemu side, but failed to mention any of this... Here are some thoughts:

- I think it's safe to assume that the IOMMU would be on one side only - running two in series seems like a bad idea.
- If the IOMMU is on the server side, the guest kernel must still be able to program it. So you end up running an IOMMU control interface over the VFIO-user socket?
- If you run a full IOMMU on the server side, then it would have to be a separate IOMMU from the one that exists in qemu. Doable, but different from what typical hardware topologies look like. Or perhaps you could hook up notifiers for vIOMMU operations and then proxy them over to the remote IOMMU implementation. Doable, but probably not trivial given the need for I/O fences when invalidating, for example.
- One purpose of IOMMUs is enforcing security boundaries, i.e. making sure devices can't access any memory that they haven't been granted access to. This matches up nicely with the VFIO-user process boundary if the IOMMU is on the qemu side, but not when the server process can still access all memory and just performs local access checks. You'd still have to trust the server process to not be malicious. Whereas with the IOMMU on the qemu side, only allowing mmap()ed access selectively and/or using message-based DMA, you are in a much better spot. I expect some VFIO-user protocol hardening would still be needed for the security story to make sense, but it seems plausible.
- The ATS design already includes the concept of an Address Translation Cache (ATC) on the device side to maintain a mapping of IOVA -> (G)PA to improve performance. For servers that want to access memory directly, it would probably be more natural to imitate that (DMA_MAP operations would have to indicate physical addresses in addition to IOVA), and then handle translation on the server side and access mmap-ed guest physical memory directly.
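
The ATC idea from the last point can be sketched in a few lines of Python. This is a toy model under stated assumptions: DMA_MAP is assumed to carry a guest physical address next to the IOVA (which it does not do in this form today), a `bytearray` stands in for mmap()ed guest RAM, and `AddressTranslationCache` and its methods are hypothetical names.

```python
class AddressTranslationCache:
    """Server-side IOVA -> GPA cache filled by DMA_MAP (hypothetical model)."""

    def __init__(self, guest_ram: bytearray) -> None:
        self.guest_ram = guest_ram  # stands in for mmap()ed guest physical memory
        self.entries = []           # list of (iova, size, gpa) tuples

    def dma_map(self, iova: int, size: int, gpa: int) -> None:
        # Assumes DMA_MAP supplies the GPA in addition to the IOVA.
        self.entries.append((iova, size, gpa))

    def dma_unmap(self, iova: int) -> None:
        # Acts as the invalidation: forget the cached translation.
        self.entries = [e for e in self.entries if e[0] != iova]

    def translate(self, iova: int, length: int) -> int:
        for start, size, gpa in self.entries:
            if start <= iova and iova + length <= start + size:
                return gpa + (iova - start)
        raise LookupError(f"no translation for IOVA {iova:#x}")

    def read(self, iova: int, length: int) -> bytes:
        gpa = self.translate(iova, length)
        return bytes(self.guest_ram[gpa:gpa + length])

    def write(self, iova: int, data: bytes) -> None:
        gpa = self.translate(iova, len(data))
        self.guest_ram[gpa:gpa + len(data)] = data
```

The device model keeps working purely in IOVA terms, while the translation to an offset in the mapped guest memory happens locally without a round trip to the client.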

> This is a worthy goal - and potentially especially relevant in cases where the server side is acting as a proxy for some kind of real hardware underneath.

This might be feasible in the long run, but would certainly require more plumbing as the kernel would somehow have to communicate page requests from the device to the server (as opposed to just inspecting the address space and deciding that there's nothing there). Something like a page fault handler for hardware faults.

nutanix / libvfio-user

support for vIOMMU with mmap() #787