Open tmakatos opened 1 year ago
I've spent some more time looking into IOMMU/ATS/PRI/PASID and wrote some exploratory code. With this background, I think I have a better grasp now on what would be needed to support these features:
So much for what I've been able to identify. Given the absence of device models that make use of ATS, I don't plan to implement the above in the foreseeable future, but wanted to dump it here in the hope that it'll be useful to whoever may pick this up in the future.
Thanks for the write-up. Just a couple of comments:
There is an interesting detail here where the invalidation completion is only OK to send once the server has made sure to finish/abort any ongoing DMA accesses to the mapping. This is probably more relevant in hardware than it is for VFIO-user, but this may manifest in race conditions in the protocol that we need to think about.
The server is required to handle this: https://github.com/nutanix/libvfio-user/blob/master/docs/vfio-user.rst#id5
PRI page requests are a concept that currently doesn't exist on the VFIO-user side. It will require a new server->client message. The expectation is that the client will set up DMA mappings for the requested address(es) if possible.
Unless I'm missing something this looks quite straightforward.
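To make the page request idea a bit more concrete, here is a rough sketch of what the payload of such a server->client message might carry. Everything in it is hypothetical: the `hypo_` names, the field layout, the flag bits, and the fixed 4 KiB page size are assumptions for illustration and are not part of the vfio-user protocol or libvfio-user.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical definitions: none of this exists in vfio-user today. */
#define HYPO_PAGE_REQ_READ  (1u << 0)
#define HYPO_PAGE_REQ_WRITE (1u << 1)
#define HYPO_PAGE_SIZE      4096u   /* assumed fixed page size */

/* Payload of an assumed server->client "page request" message. */
struct hypo_page_request {
    uint64_t iova;      /* page-aligned start of the requested range */
    uint32_t npages;    /* number of pages requested */
    uint32_t pasid;     /* address space the request targets */
    uint32_t flags;     /* HYPO_PAGE_REQ_* access bits */
    uint32_t reserved;  /* padding, must be zero */
};

/* Build a request covering every page touched by [addr, addr + len). */
static struct hypo_page_request
hypo_page_request_for(uint64_t addr, uint64_t len, uint32_t pasid, bool write)
{
    assert(len > 0);

    uint64_t mask = ~(uint64_t)(HYPO_PAGE_SIZE - 1);
    uint64_t first = addr & mask;              /* page of the first byte */
    uint64_t last = (addr + len - 1) & mask;   /* page of the last byte */

    struct hypo_page_request req = {
        .iova = first,
        .npages = (uint32_t)((last - first) / HYPO_PAGE_SIZE + 1),
        .pasid = pasid,
        .flags = write ? HYPO_PAGE_REQ_WRITE : HYPO_PAGE_REQ_READ,
        .reserved = 0,
    };
    return req;
}
```

In this sketch, a server that faults on a 16-byte write at IOVA 0x1ff8 would request two pages starting at 0x1000, and the client would answer by issuing DMA_MAP for whatever part of that range it can back.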
Regarding the rest, I'll try to extend this once I've fully understood it ;).
FWIW, I have uploaded my proof-of-concept code at https://github.com/mnissler-rivos/libvfio-user/tree/ats
It successfully handles PASID-enabled DMA read/write against pages that get dynamically requested. Tested with a standalone implementation of qemu's edu device model (with some extensions to add PASID support) as the server and a heavily hacked qemu client based on oracle's qemu github repo.
Just sharing what I have in case it is useful - this code isn't suitable for merging, but intended to supplement my wall of text above.
Took me a while to read this. One question: I'm not quite clear why we would/should implement something like PRI.
Since we (server) need to have mapped access to memory ahead of time anyway via DMA_MAP, why does it make sense for us to be demand-faulting in like this?
You will probably know/understand some/most of the below, but let me start at the beginning to present a coherent line of thought:
The premise of placing IOMMUs in the DMA data path is that devices actually don't have access to the entire system memory, but the OS can control what they have access to by programming the (v)IOMMU. Even then, the device/VFIO-user server can still pretend that it has full access to its I/O virtual address space: The host can perform DMA_MAP/DMA_UNMAP operations as the guest kernel manages the IOVA and reprograms the vIOMMU. As long as the device only hits accessible memory, everything will work fine (qemu actually supports this to some extent for vhost already IIRC).
The problematic point with the above approach is that we need to proactively set up IOMMU mappings for all memory that the device/VFIO-user server may potentially access. There are two issues with that:
Both cases are addressed by demand-paging as enabled by PRI.
So much for what problems PRI support can solve. Whether these problems are worth addressing in VFIO-user is a separate question though. I suspect that right now, setups that employ a vIOMMU are pretty rare, and VFIO-user servers that expose devices which actually would benefit from demand-paging / PRI are largely non-existent (as a side note, my own interest is motivated by the desire to model/mimic real hardware as closely as possible, but I appreciate that this isn't a primary goal for the VFIO-user protocol or libvfio-user project). The situation might change in the future though: If/when more hardware starts adopting ATS/PRI and SVA becomes more prevalent, sooner or later folks will want to use this within their VMs, and thus PRI/ATS will become relevant for VFIO-user.
Given the above, IMO it's perfectly fair to keep this open as a future enhancement, to be picked up if/when a more substantial use case appears.
That said, there is one angle that I think may warrant consideration now: PASID support in the protocol. DMA_MAP, DMA_UNMAP, DMA_READ, DMA_WRITE operations must convey the PASID they're targeting, which will necessitate additional protocol fields. I understand the protocol is still unstable, so we might want to add PASID fields now to avoid going through an awkward protocol upgrade at a later point. Then again, I don't see PASID support in VFIO at this point, so there's also an argument to wait for PASID support to appear there.
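For illustration only, this is roughly the kind of extension I have in mind for the DMA messages. The struct below is loosely modeled on a DMA map request but is entirely hypothetical: the `hypo_` names, the flag bit, and the field placement are assumptions, not the current vfio-user wire format.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag: set when the pasid field below is meaningful. */
#define HYPO_DMA_MAP_FLAG_PASID (1u << 31)

/* Assumed DMA map request carrying an optional PASID. */
struct hypo_dma_map {
    uint32_t argsz;
    uint32_t flags;
    uint64_t offset;    /* offset into the backing file descriptor */
    uint64_t addr;      /* start of the mapping in I/O virtual address space */
    uint64_t size;      /* length of the mapping in bytes */
    uint32_t pasid;     /* valid only if HYPO_DMA_MAP_FLAG_PASID is set */
    uint32_t reserved;  /* padding, must be zero */
};

/*
 * Return true if a DMA access of len bytes at iova, issued with the given
 * PASID (or with no PASID at all), falls entirely inside mapping m.
 */
static bool
hypo_dma_map_covers(const struct hypo_dma_map *m, bool has_pasid,
                    uint32_t pasid, uint64_t iova, uint64_t len)
{
    bool map_has_pasid = (m->flags & HYPO_DMA_MAP_FLAG_PASID) != 0;

    /* PASID-tagged and untagged accesses live in separate address spaces. */
    if (map_has_pasid != has_pasid) {
        return false;
    }
    if (has_pasid && m->pasid != pasid) {
        return false;
    }
    /* Containment check written to avoid overflow in iova + len. */
    if (iova < m->addr || len > m->size) {
        return false;
    }
    return iova - m->addr <= m->size - len;
}
```

The point of the sketch is that the same IOVA can resolve differently per PASID, so the server's lookup has to key on both. DMA_UNMAP, DMA_READ and DMA_WRITE would need an equivalent field.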
setups that employ a vIOMMU are pretty rare
Specifically, setups that do something other than identity mappings. If we have identity mappings (and of course no PASID etc.), everything already works.
desire to model/mimic real hardware as closely as possible
This is a worthy goal - and potentially especially relevant in cases where the server side is acting as a proxy for some kind of real hardware underneath.
What's foxing me about PRI here is really about who is actually acting as the IOMMU, I think. Presumably qemu+kvm itself still has all of guest memory mapped into its process space, right? What I'm asking is whether we should do the same on the server side. That is, is the IOMMU effectively emulated in qemu or in libvfio-user - does that make sense?
This isn't arguing for or against actually having ATS etc. support, just the nature of the implementation.
PASID support in the protocol
I would like to see that for sure.
What's foxing me about PRI here is really about who is actually acting as the IOMMU, I think. Presumably qemu+kvm itself still has all of guest memory mapped into its process space, right? What I'm asking is whether we should do the same on the server side. That is, is the IOMMU effectively emulated in qemu or in libvfio-user - does that make sense?
Ah, thanks for drawing attention to this angle. I had actually thought about this and remote IOMMU approaches are definitely possible. I previously ended up concluding that the IOMMU probably makes more sense on the qemu side, but failed to mention any of this... Here are some thoughts:
This is a worthy goal - and potentially especially relevant in cases where the server side is acting as a proxy for some kind of real hardware underneath.
This might be feasible in the long run, but would certainly require more plumbing as the kernel would somehow have to communicate page requests from the device to the server (as opposed to just inspecting the address space and deciding that there's nothing there). Something like a page fault handler for hardware faults.
libvfio-user doesn't work with a vIOMMU that remaps GPAs to IOVAs and wants to access guest RAM via mmap() (it works fine if guest RAM is accessed via vfu_sgl_read and vfu_sgl_write).
From discussion with @jlevon and @mnissler-rivos on libvfio-user Slack (https://libvfio-user.slack.com/archives/C01AFGCSPTR/p1696411059172959):