kasmtech / workspaces-images


Vulkan Not Present or Passed Through in RetroArch Container #58

Closed Zanathoz closed 1 year ago

Zanathoz commented 1 year ago

I'm setting up a RetroArch workspace and Vulkan support is not being passed through to my workspace container. If I select Vulkan as the video driver in RetroArch and restart it from the menu, it enters an infinite restart loop until I change the driver back to "gl" in the local config file.
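For anyone stuck in the same loop, the quickest way I found to recover is to edit the driver line in RetroArch's config by hand; something like this works (the config path inside the Kasm image is my assumption and may differ):

sed -i 's/^video_driver = .*/video_driver = "gl"/' ~/.config/retroarch/retroarch.cfg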

I do have the Nvidia drivers, the Nvidia Container Runtime, and the Vulkan libraries installed on my Ubuntu 22 host, and the GPU is passed through to the container; I know games on the host are utilizing it. I can run the nvidia-smi command from within a RetroArch container once it's spun up as a workspace, but the vulkaninfo command is not available inside the container:

default:~$ vulkaninfo
bash: vulkaninfo: command not found
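For reference, vulkaninfo ships in the vulkan-tools package, so I added it inside the container with something along these lines:

apt update && apt install -y vulkan-tools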

The Vulkan library is passed through to the container properly from the host:

default:~$ ls /usr/share/vulkan/icd.d
dzn_icd.x86_64.json  intel_hasvk_icd.x86_64.json  intel_icd.x86_64.json  lvp_icd.x86_64.json  nvidia_icd.json  radeon_icd.x86_64.json  virtio_icd.x86_64.json

Installing the vulkan-tools package is not enough to get Vulkan working:

default:~$ vulkaninfo
WARNING: [Loader Message] Code 0 : terminator_CreateInstance: Failed to CreateInstance in ICD 0.  Skipping ICD.
WARNING: [Loader Message] Code 0 : terminator_CreateInstance: Failed to CreateInstance in ICD 1.  Skipping ICD.
error: XDG_RUNTIME_DIR not set in the environment.
error: XDG_RUNTIME_DIR not set in the environment.
error: XDG_RUNTIME_DIR not set in the environment.
error: XDG_RUNTIME_DIR not set in the environment.
error: XDG_RUNTIME_DIR not set in the environment.
/build/vulkan-tools-KEbD_A/vulkan-tools-1.2.131.1+dfsg1/vulkaninfo/vulkaninfo.h:926: failed with ERROR_UNKNOWN

From what I've read, the Nvidia drivers also need to be installed inside the container, but I'm having an issue installing them because the Nvidia driver is already provided via the host passthrough:

default:~$ apt install libnvidia-gl-525
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following package was automatically installed and is no longer required:
  libllvm12
Use 'apt autoremove' to remove it.
The following NEW packages will be installed:
  libnvidia-gl-525
0 upgraded, 1 newly installed, 0 to remove and 8 not upgraded.
Need to get 188 MB of archives.
After this operation, 457 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal-updates/restricted amd64 libnvidia-gl-525 amd64 525.89.02-0ubuntu0.20.04.1 [188 MB]
Fetched 188 MB in 9s (20.7 MB/s)
debconf: delaying package configuration, since apt-utils is not installed
(Reading database ... 75722 files and directories currently installed.)
Preparing to unpack .../libnvidia-gl-525_525.89.02-0ubuntu0.20.04.1_amd64.deb ...
Unpacking libnvidia-gl-525:amd64 (525.89.02-0ubuntu0.20.04.1) ...
dpkg: error processing archive /var/cache/apt/archives/libnvidia-gl-525_525.89.02-0ubuntu0.20.04.1_amd64.deb (--unpack):
 unable to make backup link of './usr/share/glvnd/egl_vendor.d/10_nvidia.json' before installing new version: Invalid cross-device link
Errors were encountered while processing:
 /var/cache/apt/archives/libnvidia-gl-525_525.89.02-0ubuntu0.20.04.1_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)

The nvidia-smi command is working:

default:~$ nvidia-smi
Wed Apr  5 00:27:42 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:0B:00.0 N/A |                  N/A |
| 50%   47C    P0    N/A /  N/A |    113MiB /  2002MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
default:~$

I will try to build my own RetroArch image off your core image, but that would be a new venture for me and I will likely fumble around with it for a while. If someone can create a new dev image for me to test, I'd appreciate it!

If I do get an image built and tested I will post results.

j-travis commented 1 year ago

Hi, I'm far from an expert on this, but I don't think it's strictly necessary to have vulkaninfo available inside the container. One of the purposes of the NVIDIA container runtime/toolkit is to map the drivers in for you automatically. For what it's worth, in this Steam example you'll see we mapped in the Vulkan configs, relaxed security settings on the container, and asked NVIDIA to expose more capabilities to the container. We were able to get Tomb Raider running under Vulkan:

https://www.reddit.com/r/kasmweb/comments/zvee3q/nvidia_gpu_with_steam_workspace/
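Roughly, that approach boils down to a docker run along these lines (just a sketch; the exact flags and host paths are in the linked post and may differ on your system):

docker run --rm -it --gpus all \
   -e NVIDIA_DRIVER_CAPABILITIES=all \
   --security-opt seccomp=unconfined \
   -v /usr/share/vulkan/icd.d/nvidia_icd.json:/etc/vulkan/icd.d/nvidia_icd.json:ro \
   -v /usr/share/glvnd/egl_vendor.d/10_nvidia.json:/usr/share/glvnd/egl_vendor.d/10_nvidia.json:ro \
   <your-image> bash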

Here is another example that may be helpful. This was a test to get the beta version of Godot running, which required Vulkan. In this case the Vulkan SDK was needed, so we built a custom image based on the official Vulkan images. https://github.com/kasmtech/workspaces-issues/issues/264#issuecomment-1259476362

Consider linking this on Reddit to see if others have more to contribute to the conversation.

Zanathoz commented 1 year ago

I can post to Reddit, but this issue is easily reproducible with your own RetroArch image if a compatible card is available. If you change the RetroArch video driver to Vulkan in the configuration, it goes into an infinite loading loop and never actually loads until you change the configuration back to "gl". As I confirmed in my example above, the Nvidia driver is already presented to the container, I have the required Vulkan dependencies installed on the host, and I confirmed the GT 710 is Vulkan compatible: https://www.khronos.org/conformance/adopters/conformant-products#vulkan


I did find an issue with Vulkan passthrough to my containers that is now resolved, but the issue with the RetroArch container still remains.

Vulkan Error:

kasm@dj-kasmws:~$ sudo docker run --gpus all \
   -e NVIDIA_DISABLE_REQUIRE=1 \
   -e NVIDIA_DRIVER_CAPABILITIES=all --device /dev/dri \
   -v /etc/vulkan/icd.d/nvidia_icd.json:/etc/vulkan/icd.d/nvidia_icd.json \
   -v /etc/vulkan/implicit_layer.d/nvidia_layers.json:/etc/vulkan/implicit_layer.d/nvidia_layers.json \
   -v /usr/share/glvnd/egl_vendor.d/10_nvidia.json:/usr/share/glvnd/egl_vendor.d/10_nvidia.json \
   -it nvidia/vulkan:1.3-470 \
    bash

root@ff8e3fd9b902:/# vulkaninfo
Cannot create Vulkan instance.
This problem is often caused by a faulty installation of the Vulkan driver or attempting to use a GPU that does not support Vulkan.
ERROR at /vulkan-sdk/1.3.204.1/source/Vulkan-Tools/vulkaninfo/vulkaninfo.h:649:vkCreateInstance failed with ERROR_INCOMPATIBLE_DRIVER

The fix was found here: https://forums.developer.nvidia.com/t/vulkan-not-working-solved/220255. I had to remove a directory mistakenly created by the driver install and replace it with a file containing the following contents, then change its permissions with +x afterwards:

{
    "file_format_version" : "1.0.0",
    "ICD": {
        "library_path": "libGLX_nvidia.so.0",
        "api_version" : "1.3.204"
    }
}
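In shell terms, the fix looked roughly like this on my host (assuming the offending path is /usr/share/vulkan/icd.d/nvidia_icd.json, as in my setup; adjust to whatever the forum post describes for yours):

# the driver install had left this as a directory instead of a file
sudo rm -r /usr/share/vulkan/icd.d/nvidia_icd.json
# recreate it as a regular file with the contents above
sudo tee /usr/share/vulkan/icd.d/nvidia_icd.json > /dev/null <<'EOF'
{
    "file_format_version" : "1.0.0",
    "ICD": {
        "library_path": "libGLX_nvidia.so.0",
        "api_version" : "1.3.204"
    }
}
EOF
sudo chmod +x /usr/share/vulkan/icd.d/nvidia_icd.json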

After fixing this, I can run the Nvidia Vulkan container and get the correct vulkaninfo output:

sudo docker run --gpus all \
   -e NVIDIA_DISABLE_REQUIRE=1 \
   -v $HOME/.Xauthority:/root/.Xauthority \
   -e DISPLAY -e NVIDIA_DRIVER_CAPABILITIES=all --device /dev/dri --net host \
   -v /etc/vulkan/icd.d/nvidia_icd.json:/etc/vulkan/icd.d/nvidia_icd.json \
   -v /etc/vulkan/implicit_layer.d/nvidia_layers.json:/etc/vulkan/implicit_layer.d/nvidia_layers.json \
   -v /usr/share/glvnd/egl_vendor.d/10_nvidia.json:/usr/share/glvnd/egl_vendor.d/10_nvidia.json \
   -it nvidia/vulkan:1.3-470 \ 
   bash

root@50fa2a13315c:/# vulkaninfo
'DISPLAY' environment variable not set... skipping surface info
error: XDG_RUNTIME_DIR not set in the environment.
==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.204

Instance Extensions: count = 18
===============================
(extra lines omitted)

If I install vulkan-tools and run vulkaninfo inside the RetroArch container, vulkaninfo gives the following error, and I'm not sure why as my google-fu has reached its limits this morning. I think it is due to a potential driver issue, as I see the container is using a Mesa driver for video and not Nvidia, but nvidia-smi is still working in the container:

default:~$ vulkaninfo
WARNING: [Loader Message] Code 0 : terminator_CreateInstance: Failed to CreateInstance in ICD 1.  Skipping ICD.
WARNING: [Loader Message] Code 0 : terminator_CreateInstance: Failed to CreateInstance in ICD 2.  Skipping ICD.
error: XDG_RUNTIME_DIR not set in the environment.
error: XDG_RUNTIME_DIR not set in the environment.
error: XDG_RUNTIME_DIR not set in the environment.
error: XDG_RUNTIME_DIR not set in the environment.
error: XDG_RUNTIME_DIR not set in the environment.
/build/vulkan-tools-KEbD_A/vulkan-tools-1.2.131.1+dfsg1/vulkaninfo/vulkaninfo.h:926: failed with ERROR_UNKNOWN
default:~$ nvidia-smi
Wed Apr  5 14:32:06 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:0B:00.0 N/A |                  N/A |
| 50%   46C    P0    N/A /  N/A |    113MiB /  2002MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
default:~$ glxinfo -B
name of display: :1
display: :1  screen: 0
direct rendering: Yes
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: Mesa/X.org (0xffffffff)
    Device: llvmpipe (LLVM 15.0.7, 256 bits) (0xffffffff)
    Version: 22.3.7
    Accelerated: no
    Video memory: 11967MB
    Unified memory: yes
    Preferred profile: core (0x1)
    Max core profile version: 4.5
    Max compat profile version: 4.5
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
OpenGL vendor string: Mesa/X.org
OpenGL renderer string: llvmpipe (LLVM 15.0.7, 256 bits)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 22.3.7 - kisak-mesa PPA
OpenGL core profile shading language version string: 4.50
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 4.5 (Compatibility Profile) Mesa 22.3.7 - kisak-mesa PPA
OpenGL shading language version string: 4.50
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile

OpenGL ES profile version string: OpenGL ES 3.2 Mesa 22.3.7 - kisak-mesa PPA
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20

default:~$

I was initially thinking a display needed to be passed through to the container, but vulkaninfo on the official Vulkan container from Nvidia shows the same XDG_RUNTIME_DIR error at the top of its output.

I think perhaps there is another variable that needs to be passed to the containers on creation, but I'm not sure what else to try here.

Here is another post I found with a similar issue for Vulkan, claiming a display is not present for the container, although again I don't think this is an issue: https://github.com/NVIDIA/nvidia-container-toolkit/issues/140

Zanathoz commented 1 year ago

I am also tracking this issue on Reddit here - https://www.reddit.com/r/kasmweb/comments/12cifwe/comment/jf28x34/?context=3

I was able to get my container to recognize my video card and render with it properly by re-deploying a desktop distribution of Ubuntu 22, but selecting the Vulkan driver in RetroArch still leads to an endless boot loop with the same errors I've posted above from within the container.

Zanathoz commented 1 year ago

Thanks to Justin over on the subreddit, this is resolved. Adding these items to the Workspace got the Vulkan driver working.

Volume Mapping:

{
  "/usr/share/vulkan/icd.d/nvidia_icd.json": {
    "bind": "/etc/vulkan/icd.d/nvidia_icd.json",
    "mode": "ro",
    "uid": 1000,
    "gid": 1000,
    "required": true,
    "skip_check": true
  },
  "/usr/share/vulkan/implicit_layer.d/nvidia_layers.json": {
    "bind": "/etc/vulkan/implicit_layer.d/nvidia_layers.json",
    "mode": "ro",
    "uid": 1000,
    "gid": 1000,
    "required": true,
    "skip_check": true
  },
  "/usr/share/glvnd/egl_vendor.d/10_nvidia.json": {
    "bind": "/usr/share/glvnd/egl_vendor.d/10_nvidia.json",
    "mode": "ro",
    "uid": 1000,
    "gid": 1000,
    "required": true,
    "skip_check": true
  }
}

Docker Run Config Override

{
  "shm_size": "1gb",
  "security_opt": [
    "seccomp=unconfined"
  ],
  "privileged": true,
  "environment": {
    "NVIDIA_DISABLE_REQUIRE": "1",
    "NVIDIA_DRIVER_CAPABILITIES": "all"
  }
}
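For anyone testing outside of Kasm, these settings correspond roughly to a plain docker run like the following (a sketch; substitute whatever RetroArch image you deploy, and adjust the host paths to match the volume mappings above):

sudo docker run --rm -it --gpus all \
   --shm-size 1g \
   --security-opt seccomp=unconfined \
   --privileged \
   -e NVIDIA_DISABLE_REQUIRE=1 \
   -e NVIDIA_DRIVER_CAPABILITIES=all \
   -v /usr/share/vulkan/icd.d/nvidia_icd.json:/etc/vulkan/icd.d/nvidia_icd.json:ro \
   -v /usr/share/vulkan/implicit_layer.d/nvidia_layers.json:/etc/vulkan/implicit_layer.d/nvidia_layers.json:ro \
   -v /usr/share/glvnd/egl_vendor.d/10_nvidia.json:/usr/share/glvnd/egl_vendor.d/10_nvidia.json:ro \
   <your-retroarch-image> bash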