google / gvisor

Application Kernel for Containers
https://gvisor.dev
Apache License 2.0

3090, 3090ti, & 4090 support #10624

Closed: 16x3b closed this issue 1 week ago

16x3b commented 2 weeks ago

Description

I and many home enthusiasts would greatly appreciate it if you supported these consumer cards. If no official support is planned, perhaps you could guide me on how to potentially add support.

Is this feature related to a specific bug?

No response

Do you have a specific solution in mind?

Please add support for the 3 cards I mentioned.

EtiennePerot commented 2 weeks ago

Hi there @16x3b! I assume you are opening this issue because the gVisor page on GPU support asks you to open a bug for other GPU types, so fair enough.

Let's start with the bad news: these GPUs are not listed as supported on that page because they are not available on the hosting provider used for gVisor's continuous integration testing, which is Google Cloud. This isn't a Google-Cloud-specific issue. The RTX 3090/4090 are gaming/consumer-focused GPUs, so a lot of their die space is dedicated to graphics and video processing rather than CUDA-type compute or VRAM. That makes them a poor fit for a datacenter: they would not be worth it in terms of TCO and power efficiency relative to compute-optimized GPUs like the A100/H100/etc. So even if gVisor were to claim to support these gaming GPUs, it would be a hollow guarantee, because there would not be any automated testing to ensure that this is and remains the case.

The good news is that I have these exact GPU models, and gVisor works well on these GPUs as it is. I can confirm that software such as Ollama (even with large models like llama3), OpenVoice, MeloTTS, and Comfy-UI all work well. I use them often, and as a gVisor contributor I care that they work on these GPUs, so if they were to break I would probably submit a PR to unbreak them.

The only problem you may run into is a driver version incompatibility. gVisor's driver version check is very strict (matching on the exact version number) because it cannot assume ABI compatibility between even minor driver versions. In practice, however, most driver version changes are ABI-compatible, so you can use the --nvproxy-driver-version flag to force runsc to use a specific supported ABI version. For example, my GPU rig is running NVIDIA driver 550.76, but gVisor only supports 550.54.15 and 550.90.07, so I pass --nvproxy-driver-version=550.90.07 to runsc and it works fine.
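For concreteness, here is a minimal sketch of how that flag can be wired into a Docker runsc runtime. It assumes runsc is installed at /usr/local/bin/runsc and that nvproxy is enabled via --nvproxy=true; the paths and the pinned driver version are just examples, and the heredoc replaces any existing daemon.json, so merge by hand if you already have one.

```sh
# Sketch only: register a runsc runtime for Docker with nvproxy enabled and a
# pinned, supported driver ABI. Paths and versions are examples; this heredoc
# REPLACES /etc/docker/daemon.json, so merge manually if you already have one.
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "runtimes": {
    "runsc": {
      "path": "/usr/local/bin/runsc",
      "runtimeArgs": [
        "--nvproxy=true",
        "--nvproxy-driver-version=550.90.07"
      ]
    }
  }
}
EOF
sudo systemctl restart docker
```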

As proof of support for these GPUs, here's a picture of a gVisor-like logo generated on my 4090 in Comfy-UI running in gVisor :)

[Image: ComfyUI_gVisor, a gVisor-like logo generated with Comfy-UI on the 4090 running under gVisor]

If you do encounter an unsupported workload on your 3090/4090, chances are that it would also be broken on one of the supported GPU types, because these GPUs share the same underlying architectures. Namely, the 3090 and 3090 Ti are both Ampere-based while the 4090 is Ada Lovelace-based, the same architectures as the supported A100 and L4 respectively.

Therefore, creating reproducer tests for these workloads, like the ones here, would be helpful: they can then be run against A100 and L4 GPUs, and once a workload is supported on the A100 and L4, it will most likely work on the 3090 and 4090 as well.
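As a quick sanity check before writing a full reproducer, a sketch like the following exercises the GPU path end to end under runsc. It assumes Docker, the NVIDIA container toolkit, and a runsc runtime configured as in the earlier snippet; the image tag and command are illustrative only.

```sh
# Illustrative smoke test: run nvidia-smi in a CUDA base image under runsc.
# Assumes the runsc runtime from the daemon.json sketch above and an installed
# nvidia-container-toolkit; the image tag is just an example.
docker run --rm --runtime=runsc --gpus=all \
  nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```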

Another approach you can look into is adding compatibility yourself. @AC-Dap recently submitted a new tool to the repository called ioctl_sniffer, which is meant to help detect what functionality gVisor is missing in order to support new CUDA workloads. Run the workload outside of gVisor (unsandboxed) under the sniffer, and from its output you can see which CUDA ioctl calls would be missing for it to work in gVisor. From there, adding support for them is a matter of following how other CUDA ioctls are implemented in the codebase. By the way @AC-Dap, it would be nice to expand the README with some end-to-end documentation on how to add support for an ioctl once the tool finds that a workload is calling something unsupported.
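While ioctl_sniffer is the purpose-built tool, a rough manual approximation (plain strace, not the sniffer itself) can already show which ioctl request codes an unsandboxed workload issues against the NVIDIA device nodes; the workload name below is a placeholder.

```sh
# Rough manual alternative to ioctl_sniffer: record the ioctls the workload
# issues against /dev/nvidia* while running unsandboxed. -f follows child
# processes, -y resolves fd arguments to paths. "./your_cuda_workload" is a
# placeholder for whatever you are trying to run.
strace -f -y -e trace=ioctl -o ioctls.log ./your_cuda_workload
grep '/dev/nvidia' ioctls.log | sort | uniq -c | sort -rn | head
```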

Hope this helps!

16x3b commented 2 weeks ago

Thank you @EtiennePerot for the really detailed response!! I am sure I and many others will make good use of this over the next few months!

I did not try it initially because I did not see the GPUs listed and assumed they were unsupported, but I should have given it a shot first. I will follow your lead and use the --nvproxy-driver-version approach you described.

If I need another driver version, I will potentially try to add support myself in the future, using the method you mentioned to debug and get better visibility into the missing ioctl calls.

@AC-Dap, documentation that outlines a method towards adding support would be appreciated! I'm sure I could learn a thing or two.