microsoft / Windows-Containers

Exposing individual GPUs to a container using LocationPath erroneously exposes all GPUs #333

Open adamrehn opened 1 year ago

adamrehn commented 1 year ago

Describe the bug:

When using the vpci-location-path://PATH device string syntax with containerd v1.7.0 to expose an individual GPU to a process-isolated Windows container based on its PCIe location path, the container will see all GPUs that are present on the host system. This contradicts not only the intended behaviour, but also the behaviour observed when exposing other types of devices to containers using their PCIe location path.

Steps to reproduce the behaviour:

  1. Provision a machine with Windows Server 2022 and at least two GPUs.

  2. Ensure the appropriate device drivers are installed for the GPUs.

  3. Install containerd version 1.7.0. For example, using Markus Lippert's containerd Windows installer you would run the following command:

    containerd-installer.exe --containerd-version 1.7.0 --cni-plugin-version 0.3.0

  4. Download and extract the binaries for the latest release of the Kubernetes Device Plugins for DirectX.

  5. Pull a test image that will enumerate the DirectX devices that are visible to the container:

    ctr images pull "index.docker.io/tensorworks/example-device-discovery:0.0.1"

  6. Run the test-device-discovery-cpp.exe executable directly on the host system. You should see details listed for all GPUs that are present on the host system.

  7. Run a container using the test image, replacing <PATH> with the PCIe location path for one of the GPUs, as reported in the output of the previous step:

    ctr run --rm --device "vpci-location-path://<PATH>" "index.docker.io/tensorworks/example-device-discovery:0.0.1" testing

  8. You will see that the container lists details for all of the GPUs present on the host system, instead of just the individual GPU that was exposed to the container.

Expected behaviour:

When enumerating DirectX devices, the container should only see the individual device that was exposed to it.

Configuration:

Additional context:

When containerd parses the IDType://ID device string syntax, it places the values in the corresponding fields of the WindowsDevice structure from the OCI runtime specification. hcsshim then acts on these fields in hcsoci.parseAssignedDevices(), where it translates the IDType value to a member of the HCS schema DeviceType enum and assigns the value to the Type field of the HCS schema Device structure. In the case of an IDType value of vpci-location-path (which is translated to the DeviceInstance enum value), the code then assigns the ID value to the LocationPath field. As per the documentation for the HCS schema, the purpose of this field is to specify the PCIe location path of an individual device instance that should be exposed to the container.
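
For illustration, here is a minimal sketch of that translation, paraphrased in C++ (the real implementation is Go code in hcsshim, and the struct below is a simplified stand-in for the HCS schema Device structure rather than its actual definition):

    #include <string>

    // Simplified stand-in for the HCS schema Device structure; the real
    // definition lives in hcsshim's Go source and the HCS schema documentation.
    struct HcsDevice {
        std::string Type;               // e.g. "DeviceInstance" or "ClassGuid"
        std::string LocationPath;       // PCIe location path of ONE device instance
        std::string InterfaceClassGuid; // used for class-based assignment instead
    };

    // Paraphrase of the translation performed by hcsoci.parseAssignedDevices()
    // once containerd has parsed the "IDType://ID" device string.
    HcsDevice TranslateAssignedDevice(const std::string& idType, const std::string& id) {
        HcsDevice device;
        if (idType == "vpci-location-path") {
            device.Type = "DeviceInstance";
            device.LocationPath = id; // per the HCS schema docs, only the device
                                      // at this PCIe location path should be exposed
        } else if (idType == "class") { // class-GUID assignment, for comparison
            device.Type = "ClassGuid";
            device.InterfaceClassGuid = id;
        }
        return device;
    }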

It is important to note that the LocationPath field functions correctly when exposing individual instances of other device types, such as COM ports. Below is the output of the chgport command when I ran it directly on a Windows Server 2022 host system that had multiple COM ports:

> chgport

AUX = \DosDevices\COM1
COM1 = \Device\Serial0
COM3 = \Device\Serial1
COM4 = \Device\Serial2

When I exposed an individual COM port to a process-isolated Windows container using the vpci-location-path://PATH device string syntax, the chgport command running inside the container only saw the individual COM port that was exposed, as expected:

> ctr run --rm --device "vpci-location-path://ACPI(_SB_)#ACPI(PCI0)#ACPI(ISA_)#ACPI(COM1)" "mcr.microsoft.com/windows/servercore:ltsc2022" testing chgport

AUX = \DosDevices\COM1
COM1 = \Device\Serial0

The behaviour is entirely different when the allocated device is a GPU, and this appears to be a bug in either the HCS or the Windows kernel itself. Although I can only speculate as to the underlying cause, it is notable that the DirectX Graphics Kernel (located at \Device\DxgKrnl in the Object Manager namespace) is mounted into each container that accesses one or more GPUs. My understanding is that this component is ultimately responsible for DirectX device enumeration, so it is possible that there is a bug in the enumeration logic whereby it simply fails to filter devices based on the caller's Silo.
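
To make the failure mode concrete, the sketch below shows the kind of DXGI enumeration loop that a tool like test-device-discovery-cpp.exe might use (an illustrative stand-in, not the tool's actual source; compile with MSVC and link against dxgi.lib). Run inside an affected container, it prints every GPU on the host rather than just the assigned one:

    #include <windows.h>
    #include <dxgi.h>
    #include <wrl/client.h>
    #include <iostream>

    #pragma comment(lib, "dxgi.lib")

    using Microsoft::WRL::ComPtr;

    int main() {
        ComPtr<IDXGIFactory1> factory;
        if (FAILED(CreateDXGIFactory1(IID_PPV_ARGS(&factory)))) {
            std::wcerr << L"Failed to create DXGI factory" << std::endl;
            return 1;
        }
        // In a container affected by this bug, the loop below walks every GPU
        // on the host, instead of only the device assigned via vpci-location-path.
        ComPtr<IDXGIAdapter1> adapter;
        for (UINT i = 0; factory->EnumAdapters1(i, &adapter) != DXGI_ERROR_NOT_FOUND; ++i) {
            DXGI_ADAPTER_DESC1 desc = {};
            if (FAILED(adapter->GetDesc1(&desc))) {
                continue;
            }
            const bool software = (desc.Flags & DXGI_ADAPTER_FLAG_SOFTWARE) != 0;
            std::wcout << L"Adapter " << i << L": " << desc.Description
                       << (software ? L" [software/WARP]" : L" [hardware]") << std::endl;
        }
        return 0;
    }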

TBBle commented 1 year ago

I'm curious, is it just enumeration that's incorrect? Are all the visible devices usable when only one is listed in vpci-location-path, or do all but the explicitly-exposed-device fail later?

doctorpangloss commented 1 year ago

All devices are usable regardless of which is mounted / specified here.

In some environments, like AWS, this means that "Microsoft Basic Display Adapter" is sometimes selected as the rendering GPU for graphics applications.

adamrehn commented 1 year ago

@TBBle: Are all the visible devices usable when only one is listed in vpci-location-path, or do all but the explicitly-exposed-device fail later?

I've just run some tests with an Unreal Engine application, and I can confirm that it is indeed able to render with all of the devices that it enumerates. On an AWS g4dn.12xlarge EKS worker node with 4 GPUs, an Unreal Engine project running inside a container with only 1 GPU allocated was able to successfully render screenshots with all four GPUs. (I ran the project multiple times back-to-back within the same container to ensure that the allocated GPU remained constant, and incremented the graphics adapter index for each run using the -graphicsadapter=<INDEX> flag.)

@doctorpangloss: In some environments, like AWS, this means that "Microsoft Basic Display Adapter" is sometimes selected as the rendering GPU for graphics applications.

The Microsoft Basic Display Adapter device is actually the Windows Advanced Rasterization Platform (WARP) software renderer. It is always present in Windows containers that include the DirectX DLL files, even when no GPUs are exposed to the container from the host system. Its presence is not related to this bug.
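
For applications whose device selection logic can be modified, skipping software adapters during enumeration avoids accidentally rendering on WARP. A minimal sketch, reusing the includes and ComPtr setup from the enumeration example above (note that this only sidesteps WARP; it does not address the bug, since every hardware GPU remains visible):

    // Return the first non-software adapter, skipping WARP (which reports
    // DXGI_ADAPTER_FLAG_SOFTWARE). A null result means only WARP is present,
    // e.g. in a container with no GPUs exposed from the host.
    ComPtr<IDXGIAdapter1> PickFirstHardwareAdapter(IDXGIFactory1* factory) {
        ComPtr<IDXGIAdapter1> adapter;
        for (UINT i = 0; factory->EnumAdapters1(i, &adapter) != DXGI_ERROR_NOT_FOUND; ++i) {
            DXGI_ADAPTER_DESC1 desc = {};
            if (SUCCEEDED(adapter->GetDesc1(&desc)) &&
                (desc.Flags & DXGI_ADAPTER_FLAG_SOFTWARE) == 0) {
                return adapter;
            }
        }
        return nullptr;
    }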

microsoft-github-policy-service[bot] commented 1 year ago

This issue has been open for 30 days with no updates. There are no assignees; please provide an update or close this issue.

doctorpangloss commented 1 year ago

Are all the visible devices usable when only one is listed in vpci-location-path

Yes, despite the above. On machines with multiple proper GPUs, the setting is ignored: the application in the container can successfully create a DirectX device for any GPU on the system. This is the case on my fully updated, bare-metal Windows Server 2022 machines running containerd 1.7.0.

michbern-ms commented 1 year ago

I had a good conversation with the Windows Graphics team about this. @adamrehn's guess was spot-on:

"... it is possible that there is a bug in the enumeration logic whereby it simply fails to filter devices based on the caller's Silo."

The graphics kernel just isn't Silo aware. We're starting discussions about designing such a feature, but it will take time to design and the work is not committed yet.

It would help to prioritize the work if we understood the impact. @doctorpangloss commented that all the GPUs are usable. So, what is the impact of not having the filtering? I am very supportive of having correct and predictable programming models, but I'm wondering if there is more to it than that. Is anyone blocked on using GPUs in Containers as a result of this?

Thanks very much, Michael

doctorpangloss commented 1 year ago

So, what is the impact of not having the filtering?

Every application will need to be aware of which GPU it is actually assigned. This is tricky when the developer uses a library like libwebrtc, which is hard to configure. It may choose the wrong GPU for video encoding.

fady-azmy-msft commented 1 year ago

@doctorpangloss What is the impact to you/your app/your business if you choose the wrong GPU for encoding?

Any information you can provide on the cost/impact to you would help us capture this need appropriately.

michbern-ms commented 1 year ago

Benjamin Berman was kind enough to send me some details over email, which I'm re-posting here for general benefit:

Some impacts:

  1. For applications that do support selecting graphics devices, the host abstraction leaks completely through. Every GPU-equipped platform I've tested (bare metal, AWS and Azure) presents a different enumeration of graphics devices. So the application needs a specific behavior for every platform; or, a "decorator" application has to read the device-plugin-specific environment variable containing e.g. an assigned GPU and generate a configuration for the graphics-using application (see the sketch after this list).
  2. For applications that do not support selecting a graphics device, Microsoft Basic Display Adapter is sometimes first in the enumeration, breaking applications that do not provide an easy way to select the graphics device they intend to use. This is surprising.
  3. Everybody who tries to use DirectX in a container will test on their development machine, which basically never has Microsoft Basic Display Adapter and hence never sends it to the container. Then they'll deploy to Azure, and 100% of containers there will receive Microsoft Basic Display Adapter enumerated as the first device, including on hosts with dedicated graphics. So 100% of developers will experience a show-stopping, unanticipated and arcane surprise.
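
As a concrete illustration of the "decorator" approach from point 1, here is a minimal launcher sketch that reads an environment variable and turns it into an application-specific flag. Both the variable name ASSIGNED_GPU_INDEX and the application name MyRenderApp.exe are hypothetical placeholders rather than names any particular device plugin actually uses; the -graphicsadapter flag is the Unreal Engine flag mentioned earlier in this thread:

    #include <cstdlib>
    #include <iostream>
    #include <string>

    int main() {
        // Hypothetical variable name: stands in for whatever mechanism the
        // device plugin uses to communicate the assigned GPU to the container.
        const char* assigned = std::getenv("ASSIGNED_GPU_INDEX");
        const std::string index = assigned ? assigned : "0"; // fall back to adapter 0

        // MyRenderApp.exe is a placeholder for the real graphics application.
        const std::string command = "MyRenderApp.exe -graphicsadapter=" + index;
        std::cout << "Launching: " << command << std::endl;
        return std::system(command.c_str());
    }
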
TBBle commented 1 year ago

It doesn't currently affect me, but the use case for which I was intending my work on this was parallel, non-realtime (independent) rendering jobs on a node, with each Pod assigned a unique GPU via Kubernetes and the Device Plugin system. If this doesn't work as expected, then I would need an extra mechanism to ensure that multiple simultaneous containers don't try to share a GPU, overloading that GPU while leaving another idle.

adamrehn commented 1 year ago

@michbern-ms thanks for following up on this internally! The motivation behind this bug report is the way in which the inability to assign individual devices to containers impacts Kubernetes workloads that consume GPUs through a device plugin such as the Kubernetes Device Plugins for DirectX. At the moment, Kubernetes clusters are effectively limited to using worker nodes with a single GPU each, since most unmodified applications will simply use the first available hardware device.

As @TBBle mentioned, when multiple containers are running on a worker node with multiple GPUs, you can end up in a situation where all of the application instances pile onto a single GPU (the one that gets enumerated first) and the other GPUs on the node are left completely unused. The only available workaround is to implement additional configuration mechanisms (such as a wrapper application, as @doctorpangloss mentioned), but this solution is awkward at best and would only work for 100% of applications if you were to use a library like Detours to interpose DXGI device enumeration API calls and then filter the results that are returned to the calling application.
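
For illustration, such an interposition hook might look like the sketch below (untested, and covering only the IDXGIFactory1::EnumAdapters1 entry point; a complete solution would also need to cover EnumAdapters and the other enumeration APIs). It would be built into a DLL injected into the application, with InstallHook called from DllMain:

    #include <windows.h>
    #include <dxgi.h>
    #include <detours.h>

    #pragma comment(lib, "dxgi.lib")

    // Index of the one adapter the application is allowed to see. A real
    // implementation would derive this from the device plugin's assignment.
    static const UINT kAllowedAdapter = 0;

    typedef HRESULT (STDMETHODCALLTYPE* EnumAdapters1Fn)(IDXGIFactory1*, UINT, IDXGIAdapter1**);
    static EnumAdapters1Fn RealEnumAdapters1 = nullptr;

    // Hook: remap the caller's index 0 onto the allowed adapter, hide the rest.
    static HRESULT STDMETHODCALLTYPE HookedEnumAdapters1(IDXGIFactory1* self, UINT index, IDXGIAdapter1** adapter) {
        if (index > 0) {
            return DXGI_ERROR_NOT_FOUND; // pretend that only one adapter exists
        }
        return RealEnumAdapters1(self, kAllowedAdapter, adapter);
    }

    static void InstallHook() {
        IDXGIFactory1* factory = nullptr;
        if (FAILED(CreateDXGIFactory1(IID_PPV_ARGS(&factory)))) {
            return;
        }
        // EnumAdapters1 occupies slot 12 of the IDXGIFactory1 vtable (after the
        // 3 IUnknown, 4 IDXGIObject and 5 IDXGIFactory methods that precede it).
        void** vtable = *reinterpret_cast<void***>(factory);
        RealEnumAdapters1 = reinterpret_cast<EnumAdapters1Fn>(vtable[12]);
        DetourTransactionBegin();
        DetourUpdateThread(GetCurrentThread());
        DetourAttach(reinterpret_cast<PVOID*>(&RealEnumAdapters1), reinterpret_cast<PVOID>(HookedEnumAdapters1));
        DetourTransactionCommit();
        factory->Release();
    }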

This doesn't outright prevent Windows containers from consuming GPUs in Kubernetes of course, but it does prevent users from enjoying the deployment density benefits that come from scheduling multiple GPU accelerated containers to each worker node. In the cloud, this largely eliminates one of the primary benefits of using containers rather than just running workloads directly in VMs. For bare metal on-premises Kubernetes clusters, this limits the physical deployment density of the hardware, and likely renders the use of GPU accelerated Windows containers a non-starter for most users.

If you're interested in additional context, I discuss the rationale behind individual device allocation in the blog post Bringing full GPU support to Windows containers in Kubernetes, which introduces the Kubernetes Device Plugins for DirectX and aims to provide readers with a comprehensive understanding of how devices are exposed to Windows containers.

fady-azmy-msft commented 1 year ago

I'm super pleased to see the quality of insights here. Thank you for sharing this. I will take a look at the blog post you referenced, @adamrehn.

My (I think) final question for the folks here is: what sort of apps are you trying to run on Windows containers that would need access to individual GPUs? @adamrehn @TBBle @doctorpangloss

TBBle commented 1 year ago

I can't go into too much detail, but my use case involved existing Windows-based 3D rendering software and video encoding of the output; so, two GPU uses that have different methods of dealing with multi-GPU environments.

adamrehn commented 1 year ago

@fady-azmy-msft some of the use cases that I was targeting when I created the Kubernetes Device Plugins for DirectX include:

For most of my use cases, the Unreal Engine is the primary application that will be using the GPU. My goal is to make it possible for GPU accelerated Windows containers to run all of the workloads that are supported by GPU accelerated Linux containers today. If you're interested in the various use cases for the Unreal Engine in containers, you can find more information in the whitepaper published by Epic Games.

microsoft-github-policy-service[bot] commented 1 year ago

This issue has been open for 90 days with no updates. @spronovo, please provide an update or close this issue.

fady-azmy-msft commented 1 year ago

I'm going to close this issue since this is more an AKS feature than a Windows container one.

Please see this AKS roadmap item to track the progress of this feature and share any input you have on the AKS issue.

adamrehn commented 1 year ago

I'm going to close this issue since this is more an AKS feature than a Windows container one.

@fady-azmy-msft it is worth noting that this issue has a far broader reach than just one managed Kubernetes service such as AKS. It impacts developers looking to run Windows GPU workloads under Kubernetes on any cloud platform that supports Windows containers, as well as in on-premises clusters and hybrid cloud scenarios. That being said, enhancements to the Windows graphics kernel to make it Silo aware for the purposes of facilitating an AKS feature will equally benefit all other impacted scenarios, so if you believe the AKS issue is the most appropriate channel for tracking this work then I'll continue the discussion there.

NAWhitehead commented 6 months ago

@adamrehn

Just to let you know, AKS has just released Windows GPU support in public preview. Please take a look at our documentation to test it out! If you have any feedback, please let us know.

tzifudzi commented 1 week ago

After reading through this thread and encountering the same issue, I think this issue may have been erroneously closed. Could you kindly re-open it? I want to emphasize the earlier comment from @adamrehn:

"...it is worth noting that this issue has far a broader reach than just one managed Kubernetes service such as AKS..."

Since the multi-GPU use case with containers can be considered a Windows containers capability and not just an AKS one, this issue should probably be treated as something that needs to be implemented as part of the Windows container product. Per my understanding, this is what this repository was created for. Per the README.md:

"This repository...is dedicated to tracking features and issues related to Windows containers."

While I acknowledge that the limitation described in this thread doesn't block GPU usage outright, the multi-GPU use case requires workarounds, and these hinder the adoption of multi-GPU containers on-premises and in cloud environments other than AKS. Given that it's a broader Windows containers issue, AKS having released its GPU capability is not a sufficient reason to close this issue.

I attempted to re-open the issue myself, but I realized that permissions to manage issues are restricted.

thecloudtaylor commented 1 week ago

Re-opening...

While AKS does have GPU support, it's the same support as any K8s environment: if you have multiple GPU devices, they are all exposed, and the app in the container must enumerate and select the GPU device to render with. We do have a feature request in our backlog to filter devices at container creation time (i.e. only the GPU device(s) in the container spec would be shown), but this is not yet a committed feature.