rapidsai / node

GPU-accelerated data science and visualization in node
https://rapidsai.github.io/node/
Apache License 2.0

[Error (docker)]: response from daemon: Unknown runtime specified nvidia AND could not select device driver "" with capabilities: [[gpu]]. #324

Open Luxcium opened 2 years ago

Luxcium commented 2 years ago

Docker Error

I am unable to troubleshoot this issue. Can you let me know what information would be helpful?

docker: Error response from daemon: Unknown runtime specified nvidia.

❯ REPO=ghcr.io/rapidsai/node
VERSIONS="21.12.00-runtime-node16.10.0-cudagl11.4.2-ubuntu20.04"

# Be sure to pass either the `--runtime=nvidia` or `--gpus` flag!
docker run --rm \
    --runtime=nvidia \
    -e "DISPLAY=$DISPLAY" \
    -v "/etc/fonts:/etc/fonts:ro" \
    -v "/tmp/.X11-unix:/tmp/.X11-unix:rw" \
    -v "/usr/share/fonts:/usr/share/fonts:ro" \
    -v "/usr/share/icons:/usr/share/icons:ro" \
    $REPO:$VERSIONS-demo-amd64 \
    npx @rapidsai/demo-graph
docker: Error response from daemon: Unknown runtime specified nvidia.
See 'docker run --help'.
❯ echo $DISPLAY
:0

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

❯ REPO=ghcr.io/rapidsai/node
VERSIONS="21.12.00-runtime-node16.10.0-cuda11.4.2-ubuntu20.04"

# Be sure to pass either the `--runtime=nvidia` or `--gpus` flag!
docker run --rm --gpus=0 $REPO:$VERSIONS-cudf-amd64 \
    -p "const {Series, DataFrame} = require('@rapidsai/cudf');\
        new DataFrame({ a: Series.new([0, 1, 2]) }).toString()"
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

AjayThorve commented 2 years ago

Hey @Luxcium do you have nvidia-docker2 installed on your system?

Might be related to that!

If you do, maybe this discussion will help.
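One way to answer the question above is to ask Docker which runtimes it has registered and to query the package database. The following is a minimal sketch, not an official diagnostic: the helper name `has_nvidia_runtime` is hypothetical, and the `rpm` query assumes an RPM-based system such as Fedora.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the runtimes JSON on stdin
# contains an entry named "nvidia".
has_nvidia_runtime() {
    grep -q '"nvidia"'
}

if command -v docker >/dev/null 2>&1; then
    # `docker info --format '{{json .Runtimes}}'` prints the registered runtimes as JSON.
    if docker info --format '{{json .Runtimes}}' 2>/dev/null | has_nvidia_runtime; then
        echo "nvidia runtime: registered"
    else
        echo "nvidia runtime: not registered (is nvidia-docker2 installed, and was Docker restarted?)"
    fi
else
    echo "docker not found on PATH"
fi

# Assumption: an RPM-based system, so the package database can be queried with rpm.
if command -v rpm >/dev/null 2>&1; then
    rpm -q nvidia-docker2 || true
fi
```

If the runtime is not registered, `--runtime=nvidia` would be expected to fail with the "Unknown runtime specified nvidia" message shown in the first comment.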

Luxcium commented 2 years ago

❯ sudo lspci | grep -i nvidia
17:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1)
65:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1)
### [...] audio deleted from the capture

❯ uname -m && cat /etc/*release

x86_64
Fedora release 34 (Thirty Four)
NAME=Fedora
VERSION="34 (KDE Plasma)"
ID=fedora
VERSION_ID=34
VERSION_CODENAME=""
PLATFORM_ID="platform:f34"
PRETTY_NAME="Fedora 34 (KDE Plasma)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:34"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f34/system-administrators-guide/"
SUPPORT_URL="https://fedoraproject.org/wiki/Communicating_and_getting_help"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=34
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=34
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="KDE Plasma"
VARIANT_ID=kde
Fedora release 34 (Thirty Four)
Fedora release 34 (Thirty Four)

❯ uname -r
5.14.13-200.fc34.x86_64
❯ uname -a
Linux corsairone-neb401-com 5.14.13-200.fc34.x86_64 #1 SMP Mon Oct 18 12:39:31 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

❯ dnf list  'kernel*headers'
Last metadata expiration check: 0:40:39 ago on Fri 22 Oct 2021 18:35:31.
Installed Packages
kernel-headers.x86_64                                      5.14.9-200.fc34                                @updates
Available Packages
kernel-cross-headers.x86_64                                5.14.9-200.fc34                                updates 

❯ gcc --version
gcc (GCC) 11.2.1 20210728 (Red Hat 11.2.1-1)
Copyright © 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
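The system details gathered by hand above can also be collected in one pass. This is a sketch only: the function name `os_id_version` is hypothetical, and it assumes the distribution provides an os-release style file with `ID` and `VERSION_ID` fields, as the Fedora output above does.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<ID> <VERSION_ID>" from an os-release style
# file (default: /etc/os-release).
os_id_version() {
    local file="${1:-/etc/os-release}"
    # Source the file in a subshell so its variables do not leak into this shell.
    ( . "$file" 2>/dev/null && echo "${ID:-unknown} ${VERSION_ID:-unknown}" )
}

echo "distro: $(os_id_version || echo unknown)"
echo "kernel: $(uname -r)"
echo "arch:   $(uname -m)"

# List NVIDIA PCI devices if lspci is available.
if command -v lspci >/dev/null 2>&1; then
    lspci | grep -i nvidia || echo "no NVIDIA device listed by lspci"
fi
```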
Luxcium commented 2 years ago

I think Fedora team hates people using NVIDIA or NVIDIA team hates people using Fedora

Luxcium commented 2 years ago

> Hey @Luxcium do you have nvidia-docker2 installed on your system?
>
> Might be related to that!
>
> If you do, maybe this discussion will help.

Thanks @AjayThorve. Do you know if I can get it from anywhere other than https://rpms.if-not-true-then-false.com/inttf.repo (link to the blog post)?

I use Fedora release 34 (Thirty Four) as shown in the hidden post above ...

Luxcium commented 2 years ago


I am doing it then...

Luxcium commented 2 years ago

Using nvidia-docker2

I have a new error message now

nvidia-container-cli: container error: cgroup subsystem devices not found: unknown

❯ REPO=ghcr.io/rapidsai/node
VERSIONS="21.12.00-runtime-node16.10.0-cudagl11.4.2-ubuntu20.04"

docker run --rm --runtime=nvidia -e "DISPLAY=$DISPLAY" -v "/etc/fonts:/etc/fonts:ro" \
              -v "/tmp/.X11-unix:/tmp/.X11-unix:rw" -v "/usr/share/fonts:/usr/share/fonts:ro" \
              -v "/usr/share/icons:/usr/share/icons:ro" $REPO:$VERSIONS-demo-amd64 npx @rapidsai/demo-graph
docker: Error response from daemon: 
OCI runtime create failed: 
container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: 
Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: 
container error: cgroup subsystem devices not found: unknown.
❯ REPO=ghcr.io/rapidsai/node
VERSIONS="21.12.00-runtime-node16.10.0-cuda11.4.2-ubuntu20.04"

docker run --rm --gpus=0 $REPO:$VERSIONS-cudf-amd64 -p \
        "const {Series, DataFrame} = require('@rapidsai/cudf');\
        new DataFrame({ a: Series.new([0, 1, 2]) }).toString()"
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: 
starting container process caused: process_linux.go:545: container init caused: 
Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: 
container error: cgroup subsystem devices not found: unknown.

Luxcium commented 2 years ago

https://github.com/NVIDIA/nvidia-docker/issues/1447#issuecomment-767103492 ― @klueska said: I was under the impression this issue was related to adding cgroup v2 support.

The systemd cgroup layout issue was resolved in: https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/49
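Since the quoted comment points at cgroup v2, it may help to confirm which cgroup layout the host is using. The sketch below relies on the filesystem type mounted at /sys/fs/cgroup (`cgroup2fs` for the unified v2 hierarchy, `tmpfs` for the v1 layout). The function name `cgroup_mode` is hypothetical, and `stat -fc %T` assumes GNU coreutils.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the filesystem type mounted at /sys/fs/cgroup
# to a cgroup layout.
cgroup_mode() {
    case "$1" in
        cgroup2fs) echo "v2" ;;       # unified hierarchy
        tmpfs)     echo "v1" ;;       # legacy or hybrid: per-controller mounts under a tmpfs
        *)         echo "unknown" ;;
    esac
}

# Assumption: GNU stat, where -f reports on the filesystem and %T is its type name.
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo "unavailable")
echo "cgroup layout: $(cgroup_mode "$fstype") (filesystem type: $fstype)"
```

A result of `v2` would be consistent with the "cgroup subsystem devices not found" error reported above, although it does not prove that this is the cause.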

Luxcium commented 2 years ago

After an hour of googling and trying to find a solution, I must admit that I will wait to see if someone can help me here. I was looking into the "container error: cgroup subsystem devices not found: unknown" message, but maybe I am becoming blind to the solution. If you know the solution, just let me know, or please ask me for more details about my system or configuration.

trxcllnt commented 2 years ago

@Luxcium not entirely sure what you've tried, but generally the 3 things you will need (in addition to the driver) are:

I know it's possible to use GPUs in Docker on RHEL, because we publish RHEL (CentOS) images for the core RAPIDS libraries. Let me know if it still doesn't work after installing the above. I don't have a CentOS box right now, but I could install it on one of my spare machines to test if I need to.

klueska commented 2 years ago

Please see my comment here about the "container error: cgroup subsystem devices not found: unknown" error, which relates to the lack of cgroup v2 support.
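The three error messages seen in this thread, and the causes the replies suggest for them, can be collected into a small lookup. This is only a sketch: the function name `explain_docker_gpu_error` is hypothetical, and the hints restate what the commenters suggested rather than confirmed root causes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a Docker error message from this thread to the
# cause suggested in the replies.
explain_docker_gpu_error() {
    case "$1" in
        *"Unknown runtime specified nvidia"*)
            echo "nvidia runtime not registered with Docker; check that nvidia-docker2 is installed" ;;
        *"could not select device driver"*)
            echo "no GPU device driver available for --gpus; check that nvidia-docker2 is installed" ;;
        *"cgroup subsystem devices not found"*)
            echo "related to missing cgroup v2 support in nvidia-container-cli" ;;
        *)
            echo "unrecognized error" ;;
    esac
}

explain_docker_gpu_error "docker: Error response from daemon: Unknown runtime specified nvidia."
```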

trxcllnt commented 2 years ago

@Luxcium does this work for you? https://github.com/NVIDIA/nvidia-docker/issues/706#issuecomment-851816502