machyve / xhyve

xhyve, a lightweight OS X virtualization solution

Device Passthrough ( Most notably, GPU ) #108

Open qrpike opened 8 years ago

qrpike commented 8 years ago

I know it's currently not possible, but if/when would it be realistic to be able to pass through device access?

The main question is about the GPU, so the Linux VM can spin up/down machine learning containers.

Thanks,

xez commented 8 years ago

Probably not at the PCIe level. I don't know if there are any common paravirtualized interfaces for GPGPU?

brainstorm commented 8 years ago

A pity that PCI passthrough did not make it into xhyve from bhyve... as I gather from the docs, this was perhaps a design choice, as stated in xhyve's README.md:

(...) xhyve is equivalent to the bhyve process but gains a subset of a userspace port of the vmm kernel module. SVM, PCI passthrough and the VMX host and EPT aspects are dropped.

Which means @nvidia cannot implement GPU support for Docker on OSX on top of it.

bms commented 8 years ago

Look at XenServer's design for this.

brainstorm commented 8 years ago

@bms ... not sure what you mean by that. My goal is to be able to run GPU applications with Docker on OSX (which runs on top of xhyve). GPUs + Docker on Linux are already well supported by nvidia-docker, since there's no xhyve in between:

https://github.com/NVIDIA/nvidia-docker

So I'm not sure how Xen fits in the picture here... care to explain?

bms commented 7 years ago

My point was that Xen (specifically Citrix XenServer) already has a mature architecture for GPU virtualization, which -- correct me if I'm wrong -- is not a feature in either BHyve or XHyve yet. How Docker encapsulates a GPU virtualization approach, I have no idea.

pmj commented 7 years ago

(Stumbled across this as I'm investigating bhyve/xhyve for a project.)

I don't have any personal experience with it, but XenServer's vGPU stuff is fully Nvidia-specific. I don't know if the hypervisor/Dom0 (host) side of it is open at all.

You can do pure (non-mediated) PCI(e) passthrough with bhyve on FreeBSD and indeed Xen and KVM with Qemu on Linux though; this works via a kernel driver which claims the device on the host (vfio on Linux) and programs the IOMMU so the device's DMA can only access the VM's memory. Graphics card passthrough adds extra difficulty, but that's mostly at the firmware/initialisation level.
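
For reference, a rough and untested sketch of what that looks like on the Linux/VFIO side (the IOMMU group number, device address, and RAM size below are placeholders):

```cpp
// Minimal sketch of Linux VFIO passthrough setup (error handling omitted).
// The group number, device address, and sizes are illustrative placeholders.
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

int main() {
    // Container = one IOMMU context; group = isolation unit containing the device.
    int container = open("/dev/vfio/vfio", O_RDWR);
    int group     = open("/dev/vfio/26", O_RDWR);      // the device's IOMMU group

    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

    // Back the guest's "physical" RAM with host memory...
    size_t ram_size = 1ull << 30;
    void *guest_ram = mmap(nullptr, ram_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    // ...and program the IOMMU: guest-physical 0..1GiB -> this host memory,
    // so the device's DMA can only ever reach the VM's RAM.
    struct vfio_iommu_type1_dma_map map = {};
    map.argsz = sizeof(map);
    map.vaddr = reinterpret_cast<__u64>(guest_ram);
    map.iova  = 0;                                      // address the device will use
    map.size  = ram_size;
    map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

    // Finally, a handle to the device itself for BAR/config/interrupt access.
    int device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:01:00.0");
    (void)device;
    return 0;
}
```

The missing piece on macOS is exactly the middle part: there's no public equivalent of a "map this chosen device-visible address range to this VM memory" call.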

For basic PCIe passthrough on OSX/MacOS hosts, I guess the first place to look would be Apple's VT-d driver, which is loaded by default on Ivy Bridge and newer Macs as far as I'm aware. This controls the IOMMU. I've not dealt with this directly beyond writing (PCIe) device drivers for OSX, where DMAs need to select whether they want to use IOMMU address translation or not, but from this I have a sneaking suspicion that Apple just puts all devices in one IOMMU group and that's it. That approach wouldn't be compatible with isolating a selection of devices for assigning to a VM. I certainly don't see an API there at first glance that a passthrough host driver might be able to call. So implementing this could well require extending Apple's VT-d driver, which will probably require the expertise of someone who understands VT-d really, really well. (Or gaining that expertise; the official documentation is very daunting, however.)

Note also that if you're going to pass through one of your Mac's GPUs, the passthrough driver will need to claim it during early boot and make it completely unavailable to the host OS's graphics drivers, as WindowServer currently does not support any kind of hot-enabling/hot-disabling of IOFramebuffer devices.

Manouchehri commented 7 years ago

@pmj Apple uses VT-d domains. Loukas/snare's BruCON 0x06 Thunderbolt talk covered it. (Note: snare works for Apple now and this research was done way back in 2014, so some of the security concerns aren't applicable anymore.)

https://www.youtube.com/watch?v=epeZYO9qFbs&feature=youtu.be&t=2068

https://developer.apple.com/library/content/documentation/HardwareDrivers/Conceptual/ThunderboltDevGuide/DebuggingThunderboltDrivers/DebuggingThunderboltDrivers.html

RockNHawk commented 6 years ago

+1

westover commented 6 years ago

+1

pmj commented 6 years ago

@Manouchehri Looks like you might be right, skimming through the VTd driver source some more, it looks like a new space is created for each mapper, and it would appear that most PCI devices get their own mapper, and kexts can explicitly ask for that mapper. It's still not obvious how the connection to a device's specific mapper is made from a particular IOMemoryDescriptor/IODMACommand in the usual case of using the "system" mapper, which is the case in most device drivers. You'll notice the documentation you linked refers through to this doc where they go into code specifics. Neither of the key calls, IODMACommand::prepare and IODMACommand::gen64IOVMSegments reference the device, so it's not clear how the system works out what device you're going to give those DMA addresses to. The other question is how well all of this works together with the Hypervisor.framework's VM memory mappings. Still: I'd have to look into it in much more detail, but it does look doable. I doubt I'll ever get around to it in my spare time though.
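
To make that concrete, this is roughly what the usual driver-side DMA flow looks like with the "system" mapper (a simplified, untested sketch from memory): nothing in the prepare/segment-generation path names the device that will actually receive those addresses.

```cpp
// Sketch of a typical kext DMA setup using the default ("system") mapper.
// Simplified and untested; error handling omitted.
#include <IOKit/IODMACommand.h>
#include <IOKit/IOBufferMemoryDescriptor.h>

static void ExampleDMASetup()
{
    IOBufferMemoryDescriptor *buf = IOBufferMemoryDescriptor::inTaskWithOptions(
        kernel_task, kIODirectionInOut, 4096, 4096);

    IODMACommand *dma = IODMACommand::withSpecification(
        kIODMACommandOutputHost64,    // segment output format
        64,                           // device address bits
        0,                            // no max segment size
        IODMACommand::kMapped,        // go through the (system) IOMMU mapper
        0, 1, nullptr, nullptr);

    dma->setMemoryDescriptor(buf, false);
    dma->prepare();                   // pin the memory and create IOMMU mappings

    UInt64 offset = 0;
    IODMACommand::Segment64 seg;
    UInt32 numSegs = 1;
    dma->gen64IOVMSegments(&offset, &seg, &numSegs);
    // seg.fIOVMAddr is what the *device* must use for DMA - yet at no point
    // did we tell IODMACommand which device that is.

    dma->complete();
    dma->clearMemoryDescriptor();
    dma->release();
    buf->release();
}
```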

alexkreidler commented 6 years ago

Is this on the roadmap at all?

moniquelive commented 6 years ago

+1

pmj commented 6 years ago

Is xhyve currently even actively maintained? Is there such a thing as a roadmap? If not - is there commercial interest in supporting/developing xhyve further?

rickard-von-essen commented 6 years ago

@pmj see #124

pmj commented 6 years ago

I've recently taken a deep dive on the current macOS VT-d code, and it seems that @Manouchehri's assertion is correct - each PCI device does indeed end up with its own VT-d mapper and thus domain/space. The bhyve/FreeBSD PCI passthrough code is reasonably straightforward as far as MMIO, interrupts, etc. are concerned. One thing I've yet to investigate is whether there are any dragons lurking in getting the entire VM-physical to host-physical mapping table, which would need to be fed to the IOMMU. In theory this shouldn't be a problem, but who knows, Hypervisor.framework might expect VM memory to be host-pageable or some other assumption.
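
For context, on the Hypervisor.framework side the VMM does know the guest-physical layout, because it creates it; a minimal Intel-Mac sketch (error handling omitted) of the call that establishes those mappings:

```cpp
// Sketch: how an Intel-Mac VMM (the API xhyve builds on) wires guest-physical
// RAM to host memory. A passthrough implementation would have to translate
// exactly these ranges into guest-physical -> host-physical IOMMU mappings,
// which also implies pinning the backing pages.
#include <Hypervisor/hv.h>
#include <sys/mman.h>

int main() {
    hv_vm_create(HV_VM_DEFAULT);

    const size_t ram_size = 512ull << 20;
    void *host_ram = mmap(nullptr, ram_size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    // Guest-physical 0 .. ram_size now refers to this host allocation.
    hv_vm_map(host_ram, 0 /* guest-physical address */, ram_size,
              HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC);

    // ... create vCPUs, load the guest, run ...

    hv_vm_unmap(0, ram_size);
    hv_vm_destroy();
    return 0;
}
```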

As people have mentioned GPU passthrough: FreeBSD/bhyve documentation declares that GPU passthrough is not supported. I don't know what black magic would be required to make this happen. Is anyone even interested in non-GPU passthrough?

Manouchehri commented 6 years ago

@pmj My use case is passing through a ConnectX-3 Pro card (40/56GbE fiber).

lterfloth commented 5 years ago

Any information on the progress? Does it work yet? 👍

pmj commented 5 years ago

Any information on the progress? Does it work yet? 👍

@lterfloth If you're asking me personally: I haven't worked on this beyond the initial research convincing myself that it's possible with Apple's VT-d implementation. It's a fairly large chunk of work -- probably a few weeks of the initial burst of development, followed by who knows how much time debugging and tweaking with real devices and guest OSes/drivers for all the various edge cases etc. This is certainly more than I have spare time for (or can blow off paid contracts for), so at least for me personally, it's not happening unless someone sponsors it or I end up in a situation where I need the feature desperately enough to invest in it. I obviously can't speak for anyone else!

marcj commented 4 years ago

unless someone sponsors it

@pmj, so how much do you need? :)

davesque commented 4 years ago

As an interesting clarification or note to anyone reading this thread, it appears that, if xhyve did support pci passthrough for GPUs, then it would be one of the only (maybe the only) way(s) to do machine learning on macOS with NVidia GPUs. NVidia officially dropped support for CUDA on macOS last year:

So the most reasonable alternative seems to be Docker for Mac using nvidia-docker and a VM with PCI passthrough (unless the host OS drivers are still somehow required for this).

So then the three main VM platforms for docker that I'm aware of for macOS are:

Virtualbox recently dropped support for PCI passthrough altogether:

VMWare Fusion might support passthrough, but I'm not sure if it's available for Linux virtualization or only for Windows (which wouldn't make it very useful because why not just use bootcamp?).

I feel like this makes it more likely that this work could be funded somehow. Unfortunately, I can't point to any specific source of funding. But there are countless ML shops out there that have developers with macbooks. Seems like there's a lot of pressure behind this particular issue and desire for it to be fixed.

marcj commented 4 years ago

There's definitely big demand for that, as developing with Docker is the de facto standard in professional environments and macOS has a 30% share among developers (even more in professional environments). Once xhyve supports passthrough, you could also support AMD cards. Considering that the eGPU market gets bigger and bigger, device passthrough for macOS + Docker is very much needed.

@davesque nvidia-docker is not necessary anymore and is deprecated. Since 19.03, Docker natively supports GPU assignment (hard-coded for the moment to NVIDIA & Intel).

Once @pmj decides to work on it and tells a little about the costs, I'd make sure to finance this party by providing money and finding supporters/investors.

marcj commented 4 years ago

As an interesting clarification or note to anyone reading this thread, it appears that, if xhyve did support pci passthrough for GPUs, then it would be one of the only (maybe the only) way(s) to do machine learning on macOS with NVidia GPUs. NVIDIA officially dropped support for CUDA on macOS last year:

@davesque indeed, NVIDIA dropped macOS support for CUDA. There are workarounds to do machine learning on macOS using PlaidML (which uses OpenCL/Metal under the hood), but that is of course not as fast as CUDA and again doesn't work with Docker. Note, however, that there are no official NVIDIA drivers anymore for new macOS versions, especially for eGPU support. You can find workarounds for that, though. So AMD would be favourable in this scenario (which is great, because they are cheaper) until Apple decides to support NVIDIA again.

If xhyve supported GPU passthrough (for internal GPUs and eGPUs) so it could be used with Docker, then it would definitely be the only solution for doing ML on macOS. No other macOS virtualization engine supports that (not even paid VMWare). I bet @pmj could even create a commercial version of xhyve which people would pay for. However, I think we could find a way to keep it open-source by finding sponsors.

pmj commented 4 years ago

So the most reasonable alternative seems to be Docker for Mac using nvidia-docker and a VM with PCI passthrough (unless the host OS drivers are still somehow required for this).

Host drivers, if any were present, would need to be explicitly prevented from grabbing the GPU to make passthrough work. Not a problem as such, just saying you don't want them there; only one OS at a time can drive the card.

VMWare Fusion might support passthrough, but I'm not sure if it's available for Linux virtualization or only for Windows (which wouldn't make it very useful because why not just use bootcamp?).

I don't believe Fusion supports or ever supported passthrough. I think the only VMWare product to support it is ESXi, but I'm certainly not a VMWare expert.

I feel like this makes it more likely that this work could be funded somehow. Unfortunately, I can't point to any specific source of funding. But there are countless ML shops out there that have developers with macbooks. Seems like there's a lot of pressure behind this particular issue and desire for it to be fixed.

I'll throw a risk factor out there: Apple is aggressively moving away from 3rd party kexts, and possibly/probably moving to ARM based CPUs. Who knows whether kexts and Thunderbolt will survive that transition? On the other hand they've just released the Mac Pro for the video editing crowd, who will presumably kick up a stink if their workstation the price of a car stops being supported. (And video editing software is notoriously slow to be ported - still some dragging their 32-bit heels I believe.)

The other thing I'll point out is you can just ssh into a Linux box with a decent GPU and work with that, surely?

pmj commented 4 years ago

@davesque indeed, NVIDIA dropped macOS support for CUDA. There are workarounds to do machine learning on macOS using PlaidML (which uses OpenCL/Metal under the hood), but that is of course not as fast as CUDA and again doesn't work with Docker. Note, however, that there are no official NVIDIA drivers anymore for new macOS versions, especially for eGPU support. You can find workarounds for that, though. So AMD would be favourable in this scenario (which is great, because they are cheaper) until Apple decides to support NVIDIA again.

I'll point out that macOS support for the card, whether AMD or NVIDIA, isn't needed, if you're just going to be passing it through to a VM.

pmj commented 4 years ago

Once @pmj decides to work on it and tells a little about the costs, I'd make sure to finance this party by providing money and finding supporters/investors.

Costs are effectively impossible to estimate because it's entirely possible that I end up sinking massive amounts of efforts into this and still don't succeed - there's a ton of unknown unknowns. There aren't many milestones along the way that produce something intrinsically useful by itself - aside from getting to the point of passing through simpler PCIe devices than a GPU, and that's already quite far along.

marcj commented 4 years ago

The other thing I'll point out is you can just ssh into a Linux box with a decent GPU and work with that, surely?

You could, but with machine learning this is highly unpleasant. Not only do you need to upload your source code for every change made (sure, IDEs support that, and some special people out there prefer working in VIM through SSH), but you also have to upload your data sets (which might be several gigabytes). You are then basically forced to work via SSH only. Analysing the results of ML experiments suffers the same problem: you now need to download the data in order to inspect it locally. You also need to figure out how to back up all the stuff on that server so nothing is lost when you stop/kill the server. Overall this introduces additional management complexity which shouldn't be necessary when you could just use your local GPU with Docker.

Considering that many MacBooks already have a dedicated AMD GPU and support eGPUs, it's only natural, and the path of least friction, to use them in de facto standard workflows using Docker. eGPUs are getting cheaper and can be used (in Windows Bootcamp) for gaming/video editing. However, even the AMD GPU chips already integrated in MacBooks are a lot faster than the CPU for ML (up to several hundred times).

I'll point out that macOS support for the card, whether AMD or NVIDIA, isn't needed, if you're just going to be passing it through to a VM.

Is this also the case for eGPU via thunderbolt? AFAIK one of the biggest issues right now with NVIDIA eGPUs for macOS is that the OS crashes when disconnecting it while in use, which doesn't happen for AMD. I'm not an expert in this field though.

Costs are effectively impossible to estimate because it's entirely possible that I end up sinking massive amounts of efforts into this and still don't succeed

That's fine. I don't think anybody expected you to give a concrete total $ value. It's common to tell people how much support you need (monthly via patreon.com, for example) in order to work on a product/feature in a way where people can expect something meaningful (and if the result is that it's inherently impossible because Apple sucks, then that is meaningful and valuable as well).

pmj commented 4 years ago

The other thing I'll point out is you can just ssh into a Linux box with a decent GPU and work with that, surely?

You could, but with machine learning this is highly unpleasant.

Fair enough! I'm not really plugged into the ML scene but your explanation makes sense.

However, even the AMD GPU chips already integrated in MacBooks are a lot faster than the CPU for ML (up to several hundred times).

I have no idea how macOS reacts to being denied access to the built-in dGPU on a 15"/16" MacBook Pro. It might not like it. 😅 eGPU should be relatively safe though. eGPUs are likely the only option on iMacs even if the built-in GPU is fairly beefy. The IGPU tends to be disabled in those. (or entirely missing by using "F" series Intel CPUs)

I'll point out that macOS support for the card, whether AMD or NVIDIA, isn't needed, if you're just going to be passing it through to a VM.

Is this also the case for eGPU via thunderbolt? AFAIK one of the biggest issues right now with NVIDIA eGPUs for macOS is that the OS crashes when disconnecting it while in use, which doesn't happen for AMD. I'm not an expert in this field though.

Most likely the driver just isn't expecting the hardware to just disappear, so it reaches a bad state. With passthrough, the driver would essentially be the special passthrough driver, which can be coded to support hotplug. Supporting unplugging while the VM is running is probably going to be a whole new world of pain, but shutting down the VM before unplugging should be easy enough to support once the hard part of getting passthrough working at all is done.

Costs are effectively impossible to estimate because it's entirely possible that I end up sinking massive amounts of efforts into this and still don't succeed

That's fine. I don't think anybody expected you to give a concrete total $ value. It's common to tell people how much support you need (monthly via patreon.com, for example) in order to work on a product/feature in a way where people can expect something meaningful (and if the result is that it's inherently impossible because Apple sucks, then that is meaningful and valuable as well).

Fair enough. I'm in the middle of a project right now, plus with the pandemic things are generally a bit disrupted, but that also means things might get pretty quiet work wise in a few weeks.

davesque commented 4 years ago

I have no idea how macOS reacts to being denied access to the built-in dGPU on a 15"/16" MacBook Pro. It might not like it. 😅 eGPU should be relatively safe though.

@pmj Yep, that's what I had in mind actually. I've got an eGPU that I'd like to use. :)

roolebo commented 4 years ago

macOS 10.15.4 is getting an API for user-space PCI drivers, available with the latest Xcode. Perhaps it can be considered for PCI passthrough:

$ pwd
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/DriverKit19.0.sdk/System/DriverKit/System/Library/Frameworks/PCIDriverKit.framework
$ cat PCIDriverKit.tbd | c++filt
--- !tapi-tbd-v3
archs:           [ x86_64 ]
uuids:           [ 'x86_64: F6C70D67-7C18-3885-98F8-FF28D50B0EA8' ]
platform:        driverkit
install-name:    '/System/DriverKit/System/Library/Frameworks/PCIDriverKit.framework/PCIDriverKit'
exports:
  - archs:           [ x86_64 ]
    symbols:         [ _IOPCIDevice_Class, IOPCIDevice::MemoryRead8(unsigned char, unsigned long long, unsigned char*), IOPCIDevice::MemoryRead16(unsigned char, unsigned long long, unsigned short*),
                       IOPCIDevice::MemoryRead32(unsigned char, unsigned long long, unsigned int*), IOPCIDevice::MemoryRead64(unsigned char, unsigned long long, unsigned long long*),
                       IOPCIDevice::MemoryWrite8(unsigned char, unsigned long long, unsigned char), IOPCIDevice::MemoryWrite16(unsigned char, unsigned long long, unsigned short),
                       IOPCIDevice::MemoryWrite32(unsigned char, unsigned long long, unsigned int), IOPCIDevice::MemoryWrite64(unsigned char, unsigned long long, unsigned long long),
                       IOPCIDevice::_MemoryAccess(unsigned long long, unsigned long long, unsigned long long, unsigned long long*, IOService*, unsigned int, int (*)(OSMetaClassBase*, IORPC)),
                       IOPCIDevice::_ManageSession(IOService*, bool, unsigned int, int (*)(OSMetaClassBase*, IORPC)),
                       IOPCIDevice::FindPCICapability(unsigned int, unsigned long long, unsigned long long*, int (*)(OSMetaClassBase*, IORPC)),
                       IOPCIDevice::ConfigurationRead8(unsigned long long, unsigned char*), IOPCIDevice::ConfigurationRead16(unsigned long long, unsigned short*),
                       IOPCIDevice::ConfigurationRead32(unsigned long long, unsigned int*), IOPCIDevice::ConfigurationWrite8(unsigned long long, unsigned char),
                       IOPCIDevice::ConfigurationWrite16(unsigned long long, unsigned short), IOPCIDevice::ConfigurationWrite32(unsigned long long, unsigned int),
                       IOPCIDevice::GetBusDeviceFunction(unsigned char*, unsigned char*, unsigned char*, int (*)(OSMetaClassBase*, IORPC)),
                       IOPCIDevice::_MemoryAccess_Invoke(IORPC, OSMetaClassBase*, int (*)(OSMetaClassBase*, unsigned long long, unsigned long long, unsigned long long, unsigned long long*, IOService*, unsigned int)),
                       IOPCIDevice::HasPCIPowerManagement(unsigned long long, int (*)(OSMetaClassBase*, IORPC)),
                       IOPCIDevice::_ManageSession_Invoke(IORPC, OSMetaClassBase*, int (*)(OSMetaClassBase*, IOService*, bool, unsigned int)),
                       IOPCIDevice::EnablePCIPowerManagement(unsigned long long, int (*)(OSMetaClassBase*, IORPC)),
                       IOPCIDevice::FindPCICapability_Invoke(IORPC, OSMetaClassBase*, int (*)(OSMetaClassBase*, unsigned int, unsigned long long, unsigned long long*)),
                       IOPCIDevice::_CopyDeviceMemoryWithIndex(unsigned long long, IOMemoryDescriptor**, IOService*, int (*)(OSMetaClassBase*, IORPC)),
                       IOPCIDevice::GetBusDeviceFunction_Invoke(IORPC, OSMetaClassBase*, int (*)(OSMetaClassBase*, unsigned char*, unsigned char*, unsigned char*)),
                       IOPCIDevice::HasPCIPowerManagement_Invoke(IORPC, OSMetaClassBase*, int (*)(OSMetaClassBase*, unsigned long long)),
                       IOPCIDevice::EnablePCIPowerManagement_Invoke(IORPC, OSMetaClassBase*, int (*)(OSMetaClassBase*, unsigned long long)),
                       IOPCIDevice::_CopyDeviceMemoryWithIndex_Invoke(IORPC, OSMetaClassBase*, int (*)(OSMetaClassBase*, unsigned long long, IOMemoryDescriptor**, IOService*)),
                       IOPCIDevice::Open(IOService*, unsigned int), IOPCIDevice::free(),
                       IOPCIDevice::init(), IOPCIDevice::Close(IOService*, unsigned int),
                       IOPCIDevice::Dispatch(IORPC), IOPCIDevice::_Dispatch(IOPCIDevice*, IORPC),
                       IOPCIDeviceMetaClass::New(OSObject*), IOPCIDeviceMetaClass::Dispatch(IORPC),
                       vtable for IOPCIDevice, vtable for IOPCIDeviceMetaClass, non-virtual thunk to IOPCIDevice::free(),
                       non-virtual thunk to IOPCIDevice::init(), _gIOPCIDeviceMetaClass ]
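
For illustration, a rough and untested sketch of how a dext would use these calls (function name, BAR index, and offsets are made up; a real dext declares its class in a .iig file and receives the IOPCIDevice as its provider in Start()). It gives config-space and BAR access, but nothing here controls IOMMU mappings:

```cpp
// Rough sketch of using the PCIDriverKit calls exported above from a dext.
// Untested; PokeDevice is a hypothetical helper, not part of any real API.
#include <DriverKit/IOService.h>
#include <PCIDriverKit/PCIDriverKit.h>

static kern_return_t PokeDevice(IOService *me, IOPCIDevice *pci)
{
    // Open a session with the device (the _ManageSession plumbing above).
    pci->Open(me, 0);

    // Config-space and BAR (MMIO) access is available...
    uint16_t vendorID = 0;
    pci->ConfigurationRead16(0x00 /* vendor ID offset */, &vendorID);

    uint32_t reg = 0;
    pci->MemoryRead32(0 /* BAR index */, 0x0 /* register offset */, &reg);

    pci->Close(me, 0);

    // ...but there is no call here for choosing IOMMU/DMA mapping addresses,
    // which is what passthrough would need.
    return kIOReturnSuccess;
}
```
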
roolebo commented 4 years ago

It seems that Apple requires a specific PCI Vendor ID for PCIDriverKit entitlement request. I'm not sure how it might work for generic PCI passthrough of any vendor (that's what vfio-pci can do in Linux). Does it mean that users of PCI passthrough drivers always have to run with SIP disabled both for kext and system extension, huh?

pmj commented 4 years ago

As it stands right now, the DriverKit IOPCIDevice API is not suitable for PCI passthrough - for that, you need access to the IOMMU (VT-d). Right now, that only seems available to kexts. (with the caveat that the API available to kexts may still turn out to be insufficient)

At this point (10.15.4), as long as the kext is appropriately signed, you don't need to disable SIP.

Apple is definitely trying to gain a monopoly on implementing kernel side security vulnerabilities though, and certain USB kexts now produce a warning and will no longer load "by default" in future OS versions. I don't know what the method for turning off this default will be - it could well be turning off SIP. With DriverKit gaining some PCI capabilities, PCI kexts are probably next. And I could envisage kexts being completely blocked, with no way to turn off SIP, in ARM Macs. No inside information here BTW, just speculation.

zhiyuanzhai commented 3 years ago

Actually, what we need is just a TB3 passthrough feature (it's definitely impossible to pass through the dGPU for Docker).

akohlsmith commented 2 years ago

Reviving this stale thread/issue, trying to understand the current state of PCIe passthrough to a Linux guest -- NOT VIDEO RELATED -- I am interested in passing through PCIe devices (through a Thunderbolt adapter) so that I may work on a Linux PCIe device driver for the device. As an experiment, I can see a coral.ai device showing up in OSX's PCI device list when plugged into a TB3-NVME adapter. I have thus far been unsuccessful in passing this through to a Linux guest using VMWare Fusion 11, Parallels Desktop Pro 17, or libvirt+qemu (from homebrew). None of these seems to be able to pass through a PCI device to a guest, although libvirt seemed the most promising. xhyve looks like it could also work, but this issue thread seems to put a damper on that idea.

What is the state of xhyve being able to pass through a NON GPU PCIe device connected via Thunderbolt to a Linux guest?

pmj commented 2 years ago

What is the state of xhyve being able to pass through a NON GPU PCIe device connected via Thunderbolt to a Linux guest?

None of the virtualisation solutions support PCIe passthrough on macOS yet, to my knowledge. Qemu of course contains the user space bits for the Linux kernel's VFIO, and xhyve's ancestor bhyve supports FreeBSD's kernel API for this purpose. macOS's kernel (xnu) does not provide any PCIe passthrough API out of the box.

Previously, the solution would have been to implement a kext which drives the PCIe device. (with some special matching logic so the user can select the device, unlike a "real" driver which typically matches a specific hard coded set of devices.) The main challenge would be to talk to the IOMMU to select the remapped memory addresses, as the guest OS will typically be expecting to select these, whereas when writing a driver on macOS, you just let the OS get on with it.
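
As a purely hypothetical sketch of that matching logic (not how any existing kext does it; the "PassthruSelector" property is an invented convention), the idea would be to match IOPCIDevice generically and only claim the one device the user selected:

```cpp
// Hypothetical passthrough-kext probe/claim logic (untested sketch).
// "PassthruSelector" is an invented personality property, not a real convention.
#include <IOKit/IOService.h>
#include <IOKit/pci/IOPCIDevice.h>
#include <libkern/libkern.h>

class com_example_PCIPassthru : public IOService
{
    OSDeclareDefaultStructors(com_example_PCIPassthru)
    // (The matching OSDefineMetaClassAndStructors in the .cpp is omitted here.)

public:
    IOService *probe(IOService *provider, SInt32 *score) override
    {
        IOPCIDevice *pci = OSDynamicCast(IOPCIDevice, provider);
        if (pci == nullptr)
            return nullptr;

        // Only claim the device the user selected, e.g. "2:0:0".
        OSString *sel = OSDynamicCast(OSString, getProperty("PassthruSelector"));
        char bdf[16];
        snprintf(bdf, sizeof(bdf), "%u:%u:%u",
                 (unsigned)pci->getBusNumber(),
                 (unsigned)pci->getDeviceNumber(),
                 (unsigned)pci->getFunctionNumber());
        if (sel == nullptr || !sel->isEqualTo(bdf))
            return nullptr;

        *score = 100000;  // outbid the device's regular driver
        return this;
    }

    bool start(IOService *provider) override
    {
        if (!IOService::start(provider))
            return false;
        // From here: map BARs, forward interrupts, and - the hard part -
        // program the IOMMU so the guest-physical addresses the guest's driver
        // hands to the device actually work for its DMA.
        return true;
    }
};
```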

Apple has made loading kexts a lot harder in recent macOS versions, and deprecated large swathes of kernel APIs for public use, including the PCIFamily. This means that a lot of PCI kexts will no longer load on macOS 12+. The replacement is DriverKit, which runs each driver in a sandboxed environment. There's no direct access to the IOMMU from this sandbox, so I don't see a way to implement what's required for passthrough from DriverKit itself. (There's also the challenge that Apple issues DriverKit entitlements for specific PCI vendor IDs, which doesn't really fit with the passthrough model where you want to be able to pass through devices by any vendor.)

So the only currently viable solution continues to be a kext for this purpose, hacked such that the OS treats it as a "legacy" kext which falls outside the deprecation rules. I don't know for how long this would be viable. Additionally, I'm not sure the IOMMU APIs required for selecting remap addresses are available to kexts at all on ARM-based Macs.

Once Apple cracks down further on kexts, you will likely only be able to load them on Macs with SIP disabled altogether.

akohlsmith commented 2 years ago

Thank you for such a detailed and thorough response! I don't particularly like what you said, but I absolutely appreciate and am grateful that you took the time to do so. :-)

If I understand you correctly, the (Intel) 10.15.7 kernel seems to support the idea (sysctl kern.hv_support reports 1) and the CPU also supports it (VMX is present in the machdep.cpu.features output string), but between DriverKit entitlements being specific for a PCI VID:PID and DriverKit not having any API to interface with the IOMMU directly anyway, it doesn't appear that Apple has a way to allow PCI passthrough that isn't a kludge and likely to disappear in the future.
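
(For reference, a small sketch of checking the same capability bits programmatically via sysctlbyname; nothing here is xhyve-specific:)

```cpp
// Query the hypervisor/VMX capability bits mentioned above (Intel Macs).
#include <sys/sysctl.h>
#include <cstdio>
#include <cstring>

int main() {
    int hv_support = 0;
    size_t len = sizeof(hv_support);
    sysctlbyname("kern.hv_support", &hv_support, &len, nullptr, 0);
    printf("Hypervisor.framework supported: %d\n", hv_support);

    char features[1024] = {};
    len = sizeof(features);
    sysctlbyname("machdep.cpu.features", features, &len, nullptr, 0);
    printf("VMX present: %s\n", strstr(features, "VMX") ? "yes" : "no");
    return 0;
}
```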

Pity. It seems that the only "real" way to use OSX for hardware development is to virtualize it entirely so it's not in charge of hardware resources at all.

pmj commented 2 years ago

VMX is general virtualisation capability in the CPU. This is what xhyve/hyperkit, Parallels, VMWare, etc. already use for booting VMs. PCIe passthrough is predicated on VT-d. (what Intel call their IOMMU implementation) This is also present and active on all vaguely modern Macs. (~2015+) However, macOS purely uses it for security purposes, not for virtualisation. I think the kernel API for interfacing with the macOS driver for it should be good enough to implement passthrough, but yeah, it's mainly Apple's desire to lock down their platform that's making it less and less viable to actually use it for that purpose.

gilmarwsr commented 2 years ago

First, I would like to thank you all for the excellent quality of comments. I'm on the same journey as you guys: I'm trying to work with CUDA on macOS but, as you know, Apple doesn't provide NVIDIA drivers anymore. After reading all the newest comments, I think this article could help: https://developer.apple.com/documentation/kernel/hardware_families/pci/implementing_a_pcie_kext_for_a_thunderbolt_device

"Thunderbolt devices access system memory through a system-provided I/O memory management unit (IOMMU). On Intel-based Mac computers, the system provides a single IOMMU and gives all devices a shared view of system memory. On Macs with Apple silicon, the system gives each device its own IOMMU. Always implement your driver as if it has its own IOMMU, and never assume you have a shared view of system memory."

pmj commented 2 years ago

After reading all the newest comments, I think this article could help

Unfortunately, it does not. This is again about kexts, which seem to have a very limited future; besides, the IODMACommand API assumes the driver does not care what the mapping address in IOMMU/device address space is. This assumption is not true for passthrough, otherwise PCIDriverKit would be fine. The reason the assumption does not hold is that you essentially have to map the guest VM's entire physical system memory for the device's IOMMU domain, and the addresses have to match. You don't control the guest OS's code, so when it tells your device "use the buffer at 0x10000-0x20000 for DMA" you don't get to intercept that at the VMM level, so the memory in that range from the device's point of view must be the same as from the guest OS's point of view. When you create a mapping using IODMACommand, you don't get to choose the mapping addresses, however.
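
To spell the mismatch out in code: passthrough effectively needs the loop below, where the host gets to dictate the device-visible address (IOVA) so it can equal the guest-physical address. The mapper interface named here is purely hypothetical - nothing public in xnu exposes it - whereas IODMACommand only hands back whatever address the system mapper picked.

```cpp
// Purely illustrative pseudo-interface - NOT a real xnu/IOKit API.
// It exists only to show the constraint: for passthrough, the IOVA must equal
// the guest-physical address, so the host must be able to *choose* the IOVA.
#include <cstdint>
#include <cstddef>

struct GuestRamRegion {
    uint64_t guestPhysical;   // what the guest's driver will program the device with
    uint64_t hostPhysical;    // pinned host page(s) backing that guest RAM
    uint64_t size;
};

struct HypotheticalIommuDomain {
    // Map `size` bytes so device accesses to `iova` reach host-physical `hpa`.
    // Stubbed out: no such call exists in public macOS APIs.
    bool mapFixed(uint64_t iova, uint64_t hpa, uint64_t size) {
        (void)iova; (void)hpa; (void)size;
        return false;
    }
};

static bool mapGuestRamForDevice(HypotheticalIommuDomain &domain,
                                 const GuestRamRegion *regions, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        // IOVA == guest-physical: we never get to intercept or rewrite the
        // addresses the guest hands to the device, so they must just work.
        if (!domain.mapFixed(regions[i].guestPhysical,
                             regions[i].hostPhysical,
                             regions[i].size))
            return false;
    }
    return true;
}
```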

The main solution I can see is to directly call into the AppleVTD driver's methods. I don't think these are technically public APIs, but as far as I'm aware, they're accessible to kexts. But crucially, not to dexts, and "pure" PCI kexts already won't load on Monterey, although you could probably persuade it to load them by making bogus dependencies on non-deprecated KPIs. Who knows what happens for macOS 13+ though.

The only workaround I could see is some kind of partial cooperation between guest OS and host VMM passthrough code, by emulating an IOMMU in the guest whose driver "coincidentally" always creates mappings in the address ranges that are returned by IODMACommand on the host. I don't know if that maps to Linux's high-level idea of how IOMMUs work. If device drivers expect to choose mapping addresses themselves, this won't work. (Plus it means writing a guest driver for the emulated IOMMU of course.) That leaves the issues of the PCIDriverKit entitlements being issued on a per-vendor-id basis, and configuring the matched devices at runtime.

(FWIW, I've been writing macOS device drivers for a living for over 10 years and also have done some development work on VMMs, specifically Qemu and uXen. Apple mostly only documents beginner-level stuff, which really doesn't apply here I'm afraid.)

IComplainInComments commented 1 year ago

After reading all the newest comments, I think this article could help

Unfortunately, it does not. This is again about kexts, which seem to have a very limited future; besides, the IODMACommand API assumes the driver does not care what the mapping address in IOMMU/device address space is. This assumption is not true for passthrough, otherwise PCIDriverKit would be fine. The reason the assumption does not hold is that you essentially have to map the guest VM's entire physical system memory for the device's IOMMU domain, and the addresses have to match. You don't control the guest OS's code, so when it tells your device "use the buffer at 0x10000-0x20000 for DMA" you don't get to intercept that at the VMM level, so the memory in that range from the device's point of view must be the same as from the guest OS's point of view. When you create a mapping using IODMACommand, you don't get to choose the mapping addresses, however.

I've been following this for a while, and have been looking into this myself. I think I have some ideas that may allow this to possibly happen. Though I don't have your years of experience, so please excuse me if I'm talking nonsense.

The idea I was thinking of is this: instead of having macOS control the device itself, why don't we allow the guest to control the device fully? As in, give the hypervisor the ability to simply take the device and get full control of it in the guest. Since what we are trying to achieve is simply using the device inside of a guest, we don't really care how macOS sees the device -- we just want the device to work in the guest. If we did it like this, can't we completely skip the whole IOMMU problem, since we aren't mapping to the host at all, and thus allowing the guest to control the hardware? This was the best solution I could think of at the current time.

EDIT: I actually submitted a Code-Level Support ticket to Apple's software engineering team with a link to this thread, as I'm getting VERY tired of this problem. This is something that needs to be implemented if Apple intends to pursue a virtualization-based solution to multi-platform development. As NO-ONE on multiple projects has any idea how to do this. So only Apple's software engineers would ever have the knowledge or ability, it seems... what a mess.

pmj commented 1 year ago

The idea I was thinking of is this: instead of having macOS control the device itself, why don't we allow the guest to control the device fully? As in, give the hypervisor the ability to simply take the device and get full control of it in the guest.

That is exactly what we'd be doing with IOMMU based assignment.

In the absence of an IOMMU in a system, the device would "see" physical system memory addresses when it performs DMA. These buffer addresses are provided by the driver, and ultimately the kernel's virtual memory system, which must pin the memory buffers to physical ranges for the duration of the DMA, and translate the virtual addresses to physical addresses. So far, this is entirely without virtualisation. If we tried to naively map the PCI device's BARs into a range accessible by a VM guest, and let the guest drive the device, the guest OS's drivers would attempt to get the device to perform DMA on guest-physical addresses. The guest's "physical" address space is entirely independent from the bare metal physical address space, and interpreting the guest addresses as host-physical will cause reads and overwrites of effectively random bits of the host's memory.

This really isn't what you want, hence we have IOMMUs. (There's a few more reasons: IOMMUs additionally protect against malicious hardware, and to some extent, against buggy drivers.)

On macOS, the question of IOMMU: yes/no is entirely hypothetical anyway; any Mac made in the last 10 years has the IOMMU enabled by default all the time. Apple Silicon Macs use an IOMMU by default as well.

So it comes down to mediating between the VM and the IOMMU so that when a guest VM's device driver initiates DMA, that transfer ends up using the correct memory. Ideally, do so in a way that's as efficient as possible and avoids frequent VM exits.

EDIT: I actually submitted a Code-Level Support ticket to Apple's software engineering team with a link to this thread, as I'm getting VERY tired of this problem.

It'd definitely be interesting to see what Apple have to say about this and whether they identify another way to do it.

As NO-ONE on multiple projects has any idea how to do this.

Personally at least, this isn't a case of having no idea of how to do this, just a question of funding/resource allocation.

IComplainInComments commented 1 year ago

The idea I was thinking of is this: instead of having macOS control the device itself, why don't we allow the guest to control the device fully? As in, give the hypervisor the ability to simply take the device and get full control of it in the guest.

That is exactly what we'd be doing with IOMMU based assignment.

In the absence of an IOMMU in a system, the device would "see" physical system memory addresses when it performs DMA. These buffer addresses are provided by the driver, and ultimately the kernel's virtual memory system, which must pin the memory buffers to physical ranges for the duration of the DMA, and translate the virtual addresses to physical addresses. So far, this is entirely without virtualisation. If we tried to naively map the PCI device's BARs into a range accessible by a VM guest, and let the guest drive the device, the guest OS's drivers would attempt to get the device to perform DMA on guest-physical addresses. The guest's "physical" address space is entirely independent from the bare metal physical address space, and interpreting the guest addresses as host-physical will cause reads and overwrites of effectively random bits of the host's memory.

5 minutes after I typed that I had a feeling this is what we were talking about all along lol.

This really isn't what you want, hence we have IOMMUs. (There's a few more reasons: IOMMUs additionally protect against malicious hardware, and to some extent, against buggy drivers.)

On macOS, the question of IOMMU: yes/no is entirely hypothetical anyway; any Mac made in the last 10 years has the IOMMU enabled by default all the time. Apple Silicon Macs use an IOMMU by default as well.

So it comes down to mediating between the VM and the IOMMU so that when a guest VM's device driver initiates DMA, that transfer ends up using the correct memory. Ideally, do so in a way that's as efficient as possible and avoids frequent VM exits.

This flat out sounds like this is something the kernel on the host should handle. Since the host kernel is the one ultimately in charge of memory allocation and mappings. Honestly, this is something that Apple themselves would have to enable support for on macOS. I just don't know why they actually thought USB/serial support was enough, especially when PCI/Thunderbolt devices are ubiquitous and Thunderbolt was a HUGE marketing thing for Apple.

EDIT: I actually submitted a Code-Level Support ticket to Apple's software engineering team with a link to this thread, as I'm getting VERY tired of this problem.

It'd definitely be interesting to see what Apple have to say about this and whether they identify another way to do it.

Well, this is something the OS should have support for anyway. Linux and FreeBSD have supported it for a LONG time now, Windows (I think) has support for it as well; macOS is the only one without it, and that's honestly not okay -- especially since its UNIX space is based off of BSD, and container-based development is becoming an industry standard. I mean, GPUs aside, there are a lot of PCI/Thunderbolt devices that are more than just block devices.

As NO-ONE on multiple projects has any idea how to do this.

Personally at least, this isn't a case of having no idea of how to do this, just a question of funding/resource allocation.

Well, even on projects like UTM, for example, the head developer is basically on the same page and has even pointed out that QEMU already has the user-space things ready. It's just that macOS has never got the kernel side of things ready to go. Especially with the new AS Mac Pro, it should have been a feature of the OS release in the first place.

pmj commented 1 year ago

This flat out sounds like this is something the kernel on the host should handle.

I'm fairly confident it can be handled via a kext.

Convincing Apple they should implement it themselves is obviously worth a shot, but having worked in that ecosystem for almost 15 years now, I'm not going to hold my breath until they come around to that point of view. 😅 (Virtualization.framework doesn't even support USB passthrough - note that any USB passthrough supported by Parallels, VMWare, Qemu/UTM, etc. is built on top of the normal macOS USB stack, there's no special passthrough feature in the OS. This can be done for USB because USB doesn't do device-controlled DMA.)

IComplainInComments commented 1 year ago

This flat out sounds like this is something the kernel on the host should handle.

I'm fairly confident it can be handled via a kext.

Convincing Apple they should implement it themselves is obviously worth a shot, but having worked in that ecosystem for almost 15 years now, I'm not going to hold my breath until they come around to that point of view. 😅 (Virtualization.framework doesn't even support USB passthrough - note that any USB passthrough supported by Parallels, VMWare, Qemu/UTM, etc. is built on top of the normal macOS USB stack, there's no special passthrough feature in the OS. This can be done for USB because USB doesn't do device-controlled DMA.)

Welp... Apple was no help. Because the email is under a non-disclosure, I won't share the email, but I will state this: Apple has concluded there is no supported way to do PCI passthrough with Ventura.

johnothwolo commented 1 year ago

the email is under a non-disclosure, I won't share the email

They made you sign an NDA? 😬

I'll take your word for it but not Apple's. They literally just made an AR headset/operating system, essentially bringing fictional computer interfaces to life. Therefore I take their response as, there's no supported way... without compromise (business, ecosystem or whatever).

IComplainInComments commented 12 months ago

the email is under a non-disclosure, I won't share the email

They made you sign an NDA? 😬

Bottom of the email disclaimer, so I took it with a grain of salt.

I'll take your word for it but not Apple's. They literally just made an AR headset/operating system, essentially bringing fictional computer interfaces to life. Therefore I take their response as, there's no supported way... without compromise (business, ecosystem or whatever).

Yeah, that's my conclusion as well. Apple is following the NOOKs design to a painful level, imho. Though I personally think it's from their SE team merging iOS/macOS so close to each other that they forget that Darwin is also used by power users -- by 'power user' I go by the industry definition and not Apple's AWFUL definition. Honestly, if there was a UNIX that wasn't such a cluster f*ck for the desktop I would have moved by now, and not deal with this garbage.