mlcommons / inference_results_v4.0

This repository contains the results and code for the MLPerf™ Inference v4.0 benchmark.
https://mlcommons.org/benchmarks/inference-datacenter/
Apache License 2.0

docker: Error response from daemon: unknown or invalid runtime name: nvidia #7

Closed (mahmoodn closed this issue 1 week ago)

mahmoodn commented 1 week ago

With CUDA 11.8 on Ubuntu 22.04 and an RTX 3080, I tried to run make prebuild for inference_v4 in the NVIDIA folder. After about two hours of building, it failed with an error message I don't understand:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

$ nvidia-smi 
Mon Jun 24 13:46:50 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+

$ docker images
REPOSITORY                               TAG                                                       IMAGE ID       CREATED         SIZE
mlperf-inference                         mahmood-x86_64                                           3ef82bce4e33   4 minutes ago   46.5GB
mlperf-inference                         mahmood-x86_64-latest                                    6539c6892f0d   5 minutes ago   46.5GB
nvcr.io/nvidia/mlperf/mlperf-inference   mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-public   34b056f25fae   4 months ago    14.5GB

$ dpkg -l | grep docker
ii  docker                                     1.5-2                                   all          transitional package
ii  docker-buildx                              0.12.1-0ubuntu1~22.04.1                 amd64        Docker CLI plugin for extended build capabilities with BuildKit
ii  docker.io                                  24.0.7-0ubuntu2~22.04.1                 amd64        Linux container runtime
ii  wmdocker                                   1.5-2                                   amd64        System tray for KDE3/GNOME2 docklet applications

$ dpkg -l | grep nvidia
ii  libnvidia-container-tools                  1.15.0-1                                amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64                 1.15.0-1                                amd64        NVIDIA container runtime library
ii  nvidia-container-toolkit                   1.15.0-1                                amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base              1.15.0-1                                amd64        NVIDIA Container Toolkit Base

The output is:

 => [37/40] RUN mkdir -p /opt/fp8/faster-transformer-bert-fp8-weights-scales/     && tar -zxvf /tmp/faster-transformer-bert-f  8.4s 
 => [38/40] RUN apt install -y libgl1-mesa-glx                                                                                 6.6s 
 => [39/40] RUN apt install -y python3.8-venv                                                                                  1.7s 
 => [40/40] WORKDIR /work                                                                                                      0.0s 
 => exporting to image                                                                                                        46.7s 
 => => exporting layers                                                                                                       46.7s
 => => writing image sha256:6539c6892f0d0f4bd49b4234de1fe16cf03fcfdc22ea40f04dbea4509f0c61a7                                   0.0s
 => => naming to docker.io/library/mlperf-inference:mahmood-x86_64-latest                                                     0.0s
make[1]: Leaving directory '/disk1/mahmood/inference_results_v4.0/closed/NVIDIA'
make[1]: Entering directory '/disk1/mahmood/inference_results_v4.0/closed/NVIDIA'
make[2]: Entering directory '/disk1/mahmood/inference_results_v4.0/closed/NVIDIA'
Adding user account into image
DOCKER_BUILDKIT=1 docker build -t mlperf-inference:mahmood-x86_64 --network host \
    --build-arg BASE_IMAGE=mlperf-inference:mahmood-x86_64-latest \
    --build-arg GID=1000 --build-arg UID=1000 --build-arg GROUP=mahmood --build-arg USER=mahmood \
    - < docker/Dockerfile.user
[+] Building 0.4s (6/6) FINISHED                                                                                     docker:default
 => [internal] load .dockerignore                                                                                              0.0s
 => => transferring context: 2B                                                                                                0.0s
 => [internal] load build definition from Dockerfile                                                                           0.0s
 => => transferring dockerfile: 1.05kB                                                                                         0.0s
 => [internal] load metadata for docker.io/library/mlperf-inference:mahmood-x86_64-latest                                     0.0s
 => [1/2] FROM docker.io/library/mlperf-inference:mahmood-x86_64-latest                                                       0.2s
 => [2/2] RUN echo root:root | chpasswd  && groupadd -f -g 1000 mahmood  && useradd -G sudo -g 1000 -u 1000 -m mahmood  &&   0.1s
 => exporting to image                                                                                                         0.0s
 => => exporting layers                                                                                                        0.0s
 => => writing image sha256:3ef82bce4e3303961c0d1b896e1c84228f241881c697965a584946443283b6e7                                   0.0s
 => => naming to docker.io/library/mlperf-inference:mahmood-x86_64                                                            0.0s
make[2]: Leaving directory '/disk1/mahmood/inference_results_v4.0/closed/NVIDIA'
make[2]: Entering directory '/disk1/mahmood/inference_results_v4.0/closed/NVIDIA'
/bin/bash: line 1: [: ==: unary operator expected
/bin/bash: line 1: [: !=: unary operator expected
docker run --gpus=all --runtime=nvidia --rm -it -w /work \
    -v /disk1/mahmood/inference_results_v4.0/closed/NVIDIA:/work -v /home/mahmood:/mnt//home/mahmood \
    --cap-add SYS_ADMIN --cap-add SYS_TIME \
    -e NVIDIA_VISIBLE_DEVICES=all \
    -e HISTFILE=/mnt//home/mahmood/.mlperf_bash_history \
    --shm-size=32gb \
    --ulimit memlock=-1 \
    -v /etc/timezone:/etc/timezone:ro -v /etc/localtime:/etc/localtime:ro \
    --security-opt apparmor=unconfined --security-opt seccomp=unconfined \
    --name mlperf-inference-mahmood-x86_64-28459 -h mlperf-inference-mahmood-x86-64-28459 --add-host mlperf-inference-mahmood-x86_64-28459:127.0.0.1 \
    --cpuset-cpus 0-15 \
    --user 1000 --net host --device /dev/fuse \
    -v /disk1/scratch_v4:/disk1/scratch_v4  \
    -e MLPERF_SCRATCH_PATH=/disk1/scratch_v4 \
    -e HOST_HOSTNAME=rtx3080 \
     \
    mlperf-inference:mahmood-x86_64 
docker: Error response from daemon: unknown or invalid runtime name: nvidia.
See 'docker run --help'.
make[2]: *** [Makefile.docker:311: launch_docker] Error 125
make[2]: Leaving directory '/disk1/mahmood/inference_results_v4.0/closed/NVIDIA'
make[1]: *** [Makefile.docker:299: attach_docker] Error 2
make[1]: Leaving directory '/disk1/mahmood/inference_results_v4.0/closed/NVIDIA'

Any idea how to fix this?

mahmoodn commented 1 week ago

With the following commands, I was able to define and enable the nvidia runtime in Docker.

$ sudo apt install nvidia-container-toolkit nvidia-container-runtime
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker    # Restart the daemon so it picks up the new runtime
$ docker info                      # Verify that nvidia is listed in the Runtimes section
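For anyone hitting the same error: nvidia-ctk runtime configure --runtime=docker works by registering the runtime with the Docker daemon in /etc/docker/daemon.json. After it runs (and the daemon is restarted), that file should contain an entry along these lines (exact contents may vary with toolkit version):

```json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

The original "unknown or invalid runtime name: nvidia" error from docker run --runtime=nvidia means no runtime with that name was registered when the daemon started, which is why configuring the toolkit and restarting Docker resolves it.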