nutanix / libvfio-user

framework for emulating devices in userspace
BSD 3-Clause "New" or "Revised" License

support for vIOMMU with mmap() #787

Open tmakatos opened 1 year ago

tmakatos commented 1 year ago

libvfio-user doesn't work with a vIOMMU that remaps GPAs to IOVAs and wants to access guest RAM via mmap() (it works fine if guest RAM is accessed via vfu_sgl_read and vfu_sgl_write).

From discussion with @jlevon and @mnissler-rivos on libvfio-user Slack (https://libvfio-user.slack.com/archives/C01AFGCSPTR/p1696411059172959):

  1. IOMMU PT/identity DMA: works fine: incoming IOVAs are the same as GPAs
  2. IOMMU remapping: doesn't work: IOVAs need translation somehow into the underlying GPA prior to the libvfio-user dma code
  3. ATS on top: mostly about endpoint caching translations. would need some version of this between qemu/libvfio-user for any hope of performance
  4. PRI on top: mostly about demand paging: same?
  5. PASID on top: no idea!
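
Case 2 above comes down to one missing step: each incoming IOVA has to be turned into a GPA before the existing DMA lookup can find the right mmap()ed region. Here is a minimal sketch of that step, assuming a flat table of contiguous ranges. `struct iommu_entry` and `iova_to_gpa` are hypothetical names for illustration, not libvfio-user API, and a real vIOMMU would walk multi-level page tables instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical translation entry: one contiguous IOVA range mapped to a
 * GPA range. Not part of libvfio-user. */
struct iommu_entry {
    uint64_t iova;
    uint64_t gpa;
    uint64_t len;
};

/* Translate iova to a GPA. Returns false if no single entry covers the
 * whole range [iova, iova + size). */
static bool iova_to_gpa(const struct iommu_entry *tbl, size_t n,
                        uint64_t iova, uint64_t size, uint64_t *gpa)
{
    assert(tbl != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const struct iommu_entry *e = &tbl[i];
        /* Containment check written so that it cannot overflow. */
        if (iova >= e->iova && size <= e->len &&
            iova - e->iova <= e->len - size) {
            *gpa = e->gpa + (iova - e->iova);
            return true;
        }
    }
    return false;
}
```

In this picture, case 1 (PT/identity) is simply a table in which every entry has `iova == gpa`, which is why it already works without any translation.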

mnissler-rivos commented 1 year ago

I've spent some more time looking into IOMMU/ATS/PRI/PASID and wrote some exploratory code. With this background, I think I have a better grasp now on what would be needed to support these features:

So much for what I've been able to identify. Given the absence of device models that make use of ATS, I don't plan to implement the above in the foreseeable future, but wanted to dump it here in the hope that it'll be useful to whoever may pick this up in the future.

tmakatos commented 1 year ago

Thanks for the write up. Just a couple of comments:

There is an interesting detail here where the invalidation completion is only OK to send once the server has made sure to finish/abort any ongoing DMA accesses to the mapping. This is probably more relevant in hardware than it is for VFIO-user, but this may manifest in race conditions in the protocol that we need to think about.

The server is required to handle this: https://github.com/nutanix/libvfio-user/blob/master/docs/vfio-user.rst#id5
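
The ordering constraint can be pictured as a counter of in-flight accesses per mapping: the completion goes out only once that counter drains to zero. This is a rough single-threaded sketch under that assumption. All names are hypothetical, it is not how libvfio-user implements unmapping, and a real server would need locking or atomics.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-mapping state. */
struct mapping_state {
    unsigned inflight;     /* DMA accesses currently using the mapping */
    bool invalidating;     /* invalidation requested, not yet completed */
};

/* Start a DMA access; refused once an invalidation is pending. */
static bool dma_begin(struct mapping_state *m)
{
    if (m->invalidating) {
        return false;
    }
    m->inflight++;
    return true;
}

/* Finish a DMA access. Returns true if this was the last access holding
 * back a pending invalidation, i.e. the completion may now be sent. */
static bool dma_end(struct mapping_state *m)
{
    assert(m->inflight > 0);
    m->inflight--;
    return m->invalidating && m->inflight == 0;
}

/* Handle an invalidation request. Returns true if the completion may be
 * sent immediately because nothing is in flight. */
static bool invalidate_request(struct mapping_state *m)
{
    m->invalidating = true;
    return m->inflight == 0;
}
```

The race to watch for is a `dma_begin()` that slips in between the request and its completion; refusing new accesses once `invalidating` is set closes that window in this sketch.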

PRI page requests are a concept that currently doesn't exist on the VFIO-user side. It will require a new server->client message. The expectation is that the client will set up DMA mappings for the requested address(es) if possible.

Unless I'm missing something this looks quite straightforward.
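
To make the discussion concrete, this is roughly what the payload of such a server->client page request could carry. It is purely a sketch: the struct, its fields and the flag names are invented here and do not exist in the VFIO-user protocol.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EXAMPLE_PR_READ   (1u << 0)
#define EXAMPLE_PR_WRITE  (1u << 1)
#define EXAMPLE_PAGE_SIZE 4096u

/* Hypothetical payload of a server->client page request. */
struct example_page_request {
    uint64_t iova;     /* page-aligned address the device wants mapped */
    uint32_t pasid;    /* address space the request refers to */
    uint32_t flags;    /* EXAMPLE_PR_READ and/or EXAMPLE_PR_WRITE */
};

/* Build a request for the page containing addr. */
static struct example_page_request
example_page_request_init(uint64_t addr, uint32_t pasid, uint32_t flags)
{
    struct example_page_request req;

    assert((flags & ~(EXAMPLE_PR_READ | EXAMPLE_PR_WRITE)) == 0);
    memset(&req, 0, sizeof(req));
    req.iova = addr & ~(uint64_t)(EXAMPLE_PAGE_SIZE - 1);
    req.pasid = pasid;
    req.flags = flags;
    return req;
}
```

On receipt, the client would be expected to answer with DMA mappings covering the requested page where possible, as described above.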

Regarding the rest, I'll try to extend this once I've fully understood it ;).

mnissler-rivos commented 1 year ago

FWIW, I have uploaded my proof-of-concept code at https://github.com/mnissler-rivos/libvfio-user/tree/ats

It successfully handles PASID-enabled DMA read/write against pages that get dynamically requested. Tested with a standalone implementation of qemu's edu device model (with some extensions to add PASID support) as the server and a heavily hacked qemu client based on oracle's qemu github repo.

Just sharing what I have in case it is useful - this code isn't suitable for merging, but intended to supplement my wall of text above.

jlevon commented 1 year ago

Took me a while to read this. One question: I'm not quite clear why we would/should implement something like PRI.

Since we (server) need to have mapped access to memory ahead of time anyway via DMA_MAP, why does it make sense for us to be demand-faulting in like this?

mnissler-rivos commented 1 year ago

You will probably know/understand some/most of the below, but let me start at the beginning to present a coherent line of thought:

The premise of placing IOMMUs in the DMA data path is that devices actually don't have access to the entire system memory, but the OS can control what they have access to by programming the (v)IOMMU. Even then, the device/VFIO-user server can still pretend that it has full access to its I/O virtual address space: The host can perform DMA_MAP/DMA_UNMAP operations as the guest kernel manages the IOVA and reprograms the vIOMMU. As long as the device only hits accessible memory, everything will work fine (qemu actually supports this to some extent for vhost already IIRC).

The problematic point with the above approach is that we need to proactively set up IOMMU mappings for all memory that the device/VFIO-user server may potentially access. There are two issues with that:

  1. Device-accessible memory is effectively pinned and cannot be paged out (by the guest kernel), potentially causing memory footprint issues
  2. Shared Virtual Addressing (SVA), i.e. sharing an address space between the device and a user space process, is hard if not impractical to support since you'd have to proactively map most/all user space process memory.

Both cases are addressed by demand-paging as enabled by PRI.
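
As a toy illustration of the difference: with demand paging, a page only needs a mapping at the moment the device touches it, so nothing has to be mapped (and pinned) up front. The sketch below models that with a bitmap and a stand-in for the client. All names are made up for this example, and it does not model a client refusing a request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_PAGE_SHIFT 12
#define EX_MAX_PAGES  64

/* Toy IOVA space: mapped[i] says whether page i is currently DMA-mapped. */
struct ex_iova_space {
    bool mapped[EX_MAX_PAGES];
    unsigned page_requests;   /* number of demand faults raised so far */
};

/* Stand-in for the client handling a page request: it maps the page. */
static void ex_client_handle_page_request(struct ex_iova_space *s, size_t page)
{
    s->page_requests++;
    s->mapped[page] = true;
}

/* Device-side access: if the page is not mapped yet, request it first.
 * Returns false only for addresses outside the toy address space. */
static bool ex_device_access(struct ex_iova_space *s, uint64_t iova)
{
    size_t page = (size_t)(iova >> EX_PAGE_SHIFT);

    if (page >= EX_MAX_PAGES) {
        return false;
    }
    if (!s->mapped[page]) {
        ex_client_handle_page_request(s, page);
    }
    assert(s->mapped[page]);
    return true;
}
```

Only the first access to a page raises a request; later accesses to the same page find the mapping already in place.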

So much for what problems PRI support can solve. Whether these problems are worth addressing in VFIO-user is a separate question though. I suspect that right now, setups that employ a vIOMMU are pretty rare, and VFIO-user servers that expose devices which actually would benefit from demand-paging / PRI are largely non-existent (as a side note, my own interest is motivated by the desire to model/mimic real hardware as closely as possible, but I appreciate that this isn't a primary goal for the VFIO-user protocol or libvfio-user project). The situation might change in the future though: If/when more hardware starts adopting ATS/PRI and SVA becomes more prevalent, sooner or later folks will want to use this within their VMs, and thus PRI/ATS will become relevant for VFIO-user.

Given the above, IMO it's perfectly fair to keep this open as a future enhancement, to be picked up if/when a more substantial use case appears.

That said, there is one angle that I think may warrant consideration now: PASID support in the protocol. DMA_MAP, DMA_UNMAP, DMA_READ, DMA_WRITE operations must convey the PASID they're targeting, which will necessitate additional protocol fields. I understand the protocol is still unstable, so we might want to add PASID fields now to avoid going through an awkward protocol upgrade at a later point. Then again, I don't see PASID support in VFIO at this point, so there's also an argument to wait for PASID support to appear there.
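
For illustration only, a PASID-carrying DMA map request might look something like the following. The struct layout, the flag bit and the helper are hypothetical and not taken from the VFIO-user specification; the only hard fact used is that PCIe PASIDs are at most 20 bits wide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_PASID_BITS      20u                  /* maximum PCIe PASID width */
#define EX_PASID_MAX       ((1u << EX_PASID_BITS) - 1)
#define EX_F_DMA_HAS_PASID (1u << 31)           /* hypothetical flag bit */

/* Hypothetical DMA map request carrying a PASID. */
struct ex_dma_map {
    uint32_t argsz;
    uint32_t flags;
    uint64_t addr;
    uint64_t size;
    uint32_t pasid;
    uint32_t reserved;   /* pads the struct to a multiple of 8 bytes */
};

/* Tag a request with a PASID; rejects values wider than 20 bits. */
static bool ex_dma_map_set_pasid(struct ex_dma_map *m, uint32_t pasid)
{
    assert(m != NULL);
    if (pasid > EX_PASID_MAX) {
        return false;
    }
    m->pasid = pasid;
    m->flags |= EX_F_DMA_HAS_PASID;
    m->argsz = (uint32_t)sizeof(*m);
    return true;
}
```

A flag bit plus a size field is one way an older peer could tell whether the PASID is present, which is the kind of compatibility question that adding the fields early would sidestep.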

jlevon commented 1 year ago

setups that employ a vIOMMU are pretty rare

Specifically setups that do other than identity mappings. If we have identity mappings (and of course no PASID etc) everything already works.

desire to model/mimic real hardware as closely as possible

This is a worthy goal - and potentially especially relevant in cases where the server side is acting as a proxy for some kind of real hardware underneath.

What's foxing me about PRI here is really about who is actually acting as the IOMMU, I think. Presumably qemu+kvm itself still has all of guest memory mapped into its process space, right? What I'm asking is whether we should do the same on the server side. That is, is the IOMMU effectively emulated in qemu or in libvfio-user - does that make sense?

This isn't arguing for or against actually having ATS etc. support, just the nature of the implementation.

PASID support in the protocol

I would like to see that for sure.

mnissler-rivos commented 1 year ago

What's foxing me about PRI here is really about who is actually acting as the IOMMU, I think. Presumably qemu+kvm itself still has all of guest memory mapped into its process space, right? What I'm asking is whether we should do the same on the server side. That is, is the IOMMU effectively emulated in qemu or in libvfio-user - does that make sense?

Ah, thanks for drawing attention to this angle. I had actually thought about this and remote IOMMU approaches are definitely possible. I previously ended up concluding that the IOMMU probably makes more sense on the qemu side, but failed to mention any of this... Here are some thoughts:

This is a worthy goal - and potentially especially relevant in cases where the server side is acting as a proxy for some kind of real hardware underneath.

This might be feasible in the long run, but would certainly require more plumbing as the kernel would somehow have to communicate page requests from the device to the server (as opposed to just inspecting the address space and deciding that there's nothing there). Something like a page fault handler for hardware faults.