kasmtech / workspaces-issues

[Bug] - NVIDIA UnRaid - Unable to recognize GPU #535

Open mfoti opened 3 months ago

mfoti commented 3 months ago

**Describe the bug**
In the UnRaid installation wizard, the NVIDIA GPU is not recognized.

I've updated the installation template to include:

 - in extra parameters: `--runtime=nvidia`
 - as variable: `NVIDIA_VISIBLE_DEVICES = all`
 - as variable: `NVIDIA_DRIVER_CAPABILITIES = all`

and `nvidia-smi` works:

# docker exec kasm nvidia-smi

Wed Mar 20 11:59:36 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.40.07              Driver Version: 550.40.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro P400                    Off |   00000000:AF:00.0 Off |                  N/A |
| 56%   54C    P0             N/A /  N/A  |       0MiB /   2048MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

but drm_info doesn't include /dev/dri/card1, my GPU device:

# docker exec kasm ./gpuinfo.sh

{"/dev/dri/card0":"MGA G200 SE"}

I tried to force this card during the installation process (with a hardcoded modification of this script so that it outputs {"/dev/dri/card1":"NVIDIA P400"} or {"/dev/dri/card1":"Quadro P400"}), but once the installation finished I was unable to start any workspace. I get this error:

error gathering device information while adding custom device "/dev/dri/renderD129": no such file or directory

Full log:

Error during Create request for Server(a89aa3ec-ede1-4152-8a43-1dc99cb1950b) : (Exception creating Kasm: Traceback (most recent call last):
  File "docker/api/client.py", line 268, in _raise_for_status
  File "requests/models.py", line 1021, in raise_for_status
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localhost/v1.44/containers/f683e85b8fb6c257831f3a664eac0adc36d1ccfcd8f63075d69f732c88c9765f/start

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "__init__.py", line 573, in post
  File "provision.py", line 1871, in provision
  File "provision.py", line 1863, in provision
  File "docker/models/containers.py", line 818, in run
  File "docker/models/containers.py", line 404, in start
  File "docker/utils/decorators.py", line 19, in wrapped
  File "docker/api/container.py", line 1111, in start
  File "docker/api/client.py", line 270, in _raise_for_status
  File "docker/errors.py", line 31, in create_api_error_from_http_exception
docker.errors.APIError: 500 Server Error for http+docker://localhost/v1.44/containers/f683e85b8fb6c257831f3a664eac0adc36d1ccfcd8f63075d69f732c88c9765f/start: Internal Server Error ("error gathering device information while adding custom device "/dev/dri/renderD129": no such file or directory")
)

The device node is not present in the kasm_agent container:

# docker exec kasm_agent ls /dev/dri/card1

ls: cannot access '/dev/dri/card1': No such file or directory

# docker exec kasm_agent ls /dev/dri/renderD129

ls: cannot access '/dev/dri/renderD129': No such file or directory

But I can find it in proc:

# docker exec kasm_agent cat /proc/driver/nvidia/gpus/0000\:af\:00.0/information

Model:       Quadro P400
IRQ:         304
GPU UUID:    GPU-226266ed-48f0-0e03-4d64-780bc2e08ccb
Video BIOS:      86.07.8f.00.02
Bus Type:    PCIe
DMA Size:    47 bits
DMA Mask:    0x7fffffffffff
Bus Location:    0000:af:00.0
Device Minor:    0
GPU Excluded:    No
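
For anyone hitting the same mismatch, it may help to check which kernel driver sits behind each DRM node before picking `card1` or a `renderD` number. The following is only a sketch: it assumes the standard sysfs layout, where `/sys/class/drm/<node>/device/driver` is a symlink to the bound driver, and the `root` argument is a hypothetical prefix added purely so the function can be tried against a fake tree. Card and render node indices do not have to line up (a later comment in this thread pairs `card1` with `renderD128`), so listing them is safer than guessing.

```shell
#!/bin/sh
# List DRM nodes together with the kernel driver bound to the underlying device.
# $1 is an optional root prefix (hypothetical, for testing); empty means the real /sys.
list_drm_nodes() {
    root="${1:-}"
    for node in "$root"/sys/class/drm/card[0-9]* "$root"/sys/class/drm/renderD[0-9]*; do
        [ -e "$node" ] || continue          # the glob matched nothing
        name=$(basename "$node")
        case "$name" in
            *-*) continue ;;                # skip connector entries such as card0-HDMI-A-1
        esac
        if [ -L "$node/device/driver" ]; then
            driver=$(basename "$(readlink "$node/device/driver")")
        else
            driver="unknown"
        fi
        printf '/dev/dri/%s %s\n' "$name" "$driver"
    done
}

list_drm_nodes
```

Run on the UnRaid host and again inside the container that launches workspaces: if a node shows up on the host but not in the container, the problem is device passthrough rather than the driver.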

**To Reproduce**
Steps to reproduce the behavior:

  1. Add kasm App from UnRaid installation
  2. Open the select for GPU
  3. You will not find any NVIDIA Card

**Expected behavior**
Be able to use the NVIDIA card with Kasm on UnRaid.

**Workspaces Version**
Version 1.15

**Workspaces Installation Method**
UnRaid

Workspace Server Information (please provide the output of the following commands):

Server:
 Containers: 9
  Running: 8
  Paused: 0
  Stopped: 1
 Images: 9
 Server Version: 25.0.4
 Storage Driver: fuse-overlayfs
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.1.74-Unraid
 Operating System: Ubuntu 22.04.2 LTS (containerized)
 OSType: linux
 Architecture: x86_64
 CPUs: 88
 Total Memory: 251.5GiB
 Name: fe5d658a8112
 ID: d62537f3-97b0-482e-a489-4e00a573cd4c
 Docker Root Dir: /opt/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false


 - `sudo docker ps | grep kasm`

4feeb4b6b2cf  kasmweb/nginx:1.25.3      "/docker-entrypoint.…"  15 hours ago  Up 14 hours                         80/tcp, 0.0.0.0:6333->6333/tcp  kasm_proxy
261d67c5ccc3  kasmweb/agent:1.15.0      "/bin/sh -c '/usr/bi…"  15 hours ago  Up 14 hours (healthy)               4444/tcp                        kasm_agent
ad3e62cd7871  kasmweb/share:1.15.0      "/bin/sh -c '/usr/bi…"  15 hours ago  Up 14 hours (healthy)               8182/tcp                        kasm_share
b1f718129357  kasmweb/kasm-guac:1.15.0  "/dockerentrypoint.sh"  15 hours ago  Up 16 seconds (health: starting)                                    kasm_guac
6150582c13bb  kasmweb/api:1.15.0        "/bin/sh -c '/usr/bi…"  15 hours ago  Up 14 hours (healthy)               8080/tcp                        kasm_api
a95638e0e39a  kasmweb/manager:1.15.0    "/bin/sh -c '/usr/bi…"  15 hours ago  Up 14 hours (healthy)               8181/tcp                        kasm_manager
bdfc0ef3df36  redis:5-alpine            "docker-entrypoint.s…"  15 hours ago  Up 14 hours                         6379/tcp                        kasm_redis
8436c39024bc  postgres:12-alpine        "docker-entrypoint.s…"  15 hours ago  Up 14 hours (healthy)               5432/tcp                        kasm_db


**Additional context**
I'd like to try to add this to my boot modprobe config:

cat /boot/config/modprobe.d/nvidia.conf
options nvidia-drm modeset=1
options nvidia-drm fbdev=1

but that requires shutting down the server, which is not something I can do easily.

mfoti commented 3 months ago

I fixed the device error by running this:

docker exec -ti kasm nvidia-ctk runtime configure --runtime=docker
docker restart kasm

and updating the Chrome workspace's "Docker Run Config Override (JSON)" field with this configuration:

{
  "device_requests": [
    {
      "capabilities": [
        [
          "gpu"
        ]
      ],
      "count": -1,
      "device_ids": null,
      "driver": "",
      "options": {}
    }
  ],
  "devices": [
    "/dev/dri/card1:/dev/dri/card1:rwm",
    "/dev/dri/renderD128:/dev/dri/renderD128:rwm"
  ],
  "environment": {
    "KASM_EGL_CARD": "/dev/dri/card1",
    "KASM_RENDERD": "/dev/dri/renderD128"
  },
  "hostname": "kasm"
}

But I have a black screen and at least Chrome doesn't start.
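
Since the earlier start failure was a missing device node (`/dev/dri/renderD129`), it may be worth confirming that the host side of every `devices` entry exists before launching a workspace. A rough sketch follows, assuming the override uses the `host:container:permissions` string form shown above and that `grep -o` is available. The `root` argument is a hypothetical prefix used only for testing, and the check should run on whichever system hosts the Docker daemon that starts the workspace.

```shell
#!/bin/sh
# Report whether the host side of each "host:container:perms" device mapping exists.
# Reads the override JSON on stdin; $1 is an optional root prefix (hypothetical, for testing).
check_override_devices() {
    root="${1:-}"
    # Extract quoted strings shaped like "/dev/...:...", which skips plain
    # environment values such as "/dev/dri/card1" that contain no colon.
    grep -o '"/dev/[^":]*:[^"]*"' | tr -d '"' | while IFS=: read -r host _rest; do
        if [ -e "$root$host" ]; then
            printf 'ok      %s\n' "$host"
        else
            printf 'missing %s\n' "$host"
        fi
    done
}

# Example input mirroring the device list from the configuration above.
check_override_devices <<'EOF'
{ "devices": ["/dev/dri/card1:/dev/dri/card1:rwm", "/dev/dri/renderD128:/dev/dri/renderD128:rwm"] }
EOF
```

This is a text match rather than a real JSON parse, so it is only suitable as a quick sanity check.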

tknz commented 6 days ago

> But I have a black screen and at least Chrome doesn't start.

Replace your Docker run config override with:

{ "environment": { "NVIDIA_DRIVER_CAPABILITIES": "all" } }

I think you nearly had it. I had to scrounge around to figure out what the issues were, but step 1 is:

Add the variables to the container:

Variables:

 - `NVIDIA_DRIVER_CAPABILITIES=all`
 - `NVIDIA_VISIBLE_DEVICES=all` (or the GPU ID in place of `all`)

Argument: --runtime=nvidia

Command: `docker exec -ti kasm nvidia-ctk runtime configure --runtime=docker` (this assumes the container is named kasm; run it from the CLI of the host, or alternatively run `nvidia-ctk runtime configure --runtime=docker` inside the container).

Set the docker json to:

{ "environment": { "NVIDIA_DRIVER_CAPABILITIES": "all" } }
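
To confirm that the `nvidia-ctk runtime configure` step took effect, one option is to look for an `nvidia` entry in the Docker daemon configuration inside the container. This sketch assumes the default config path `/etc/docker/daemon.json` and uses a plain text match rather than a JSON parser, so treat a positive result as a hint and not as proof. The function name is made up for illustration.

```shell
#!/bin/sh
# Succeed if the given daemon.json mentions an "nvidia" runtime entry.
# $1 is the config path; it defaults to the usual Docker location (an assumption).
has_nvidia_runtime() {
    daemon_json="${1:-/etc/docker/daemon.json}"
    [ -f "$daemon_json" ] && grep -q '"nvidia"' "$daemon_json"
}

if has_nvidia_runtime; then
    echo "nvidia runtime appears to be registered"
else
    echo "nvidia runtime not found in daemon.json"
fi
```

The daemon only picks up the new runtime after a restart, which matches the `docker restart kasm` step earlier in the thread. After that, `nvidia` should also appear in the `Runtimes:` line of `docker info`, which in the original report listed only `io.containerd.runc.v2 runc`.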