qcr / benchbot

BenchBot is a tool for seamlessly testing & evaluating semantic scene understanding tools in both realistic 3D simulation & on real robots
BSD 3-Clause "New" or "Revised" License

Unable to install benchbot #13

Closed: gmuraleekrishna closed this issue 3 years ago

gmuraleekrishna commented 3 years ago

When I run benchbot_run --robot carter --env miniroom:1 --task semantic_slam:active:ground_truth, I get the following error:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

The nvidia-smi output is as follows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     On   | 00000000:1A:00.0 Off |                  Off |
| 33%   35C    P8    17W / 260W |      6MiB / 48601MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     On   | 00000000:67:00.0  On |                  Off |
| 34%   45C    P8    22W / 260W |    360MiB / 48598MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
btalb commented 3 years ago

Thanks for reporting this to us @gmuraleekrishna.

This error seems to be coming from the NVIDIA Container Toolkit, which passes the GPU from your host OS to Docker containers.
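
A couple of quick, generic checks that the toolkit is actually installed and visible on the host (just my usual sanity checks, nothing BenchBot-specific) are:

nvidia-container-cli --version
docker info | grep -i runtime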

It may be caused by the two GPUs, or simply by a bad state. I'd recommend the following:

  1. Restarting the service (and then the computer if the error still occurs):
    sudo systemctl restart docker.service
  2. Checking the following command works:
    docker run --rm --gpus all -it benchbot/backend:base
  3. Trying some of the examples here to confirm GPU sharing is working (a typical one is shown after this list)
  4. Posting the full output of benchbot_install so I can try and get a deeper understanding of what's going on:
    benchbot_install
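
For step 3, a typical GPU-sharing sanity check (assuming you're happy to pull one of Nvidia's public CUDA images) looks something like the command below, and should print the same nvidia-smi table you see on the host:

docker run --rm --gpus all nvidia/cuda:11.1-base nvidia-smi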

Let me know how you go.

gmuraleekrishna commented 3 years ago

@btalb Thanks for the reply. Steps 1-3 didn't work; we still get the error. The output for step 4 is as follows.

$ benchbot_install

################################################################################
###################### CHECKING BENCHBOT SCRIPTS VERSION #######################
################################################################################

Fetching latest hash for Benchbot scripts ... 
        25459b5a9ae2bf8d5372cce9d1c3802c5e80cb16.

BenchBot scripts are up-to-date.

################################################################################
######################## PART 1: EXAMINING SYSTEM STATE ########################
################################################################################

Core host system checks:
    Ubuntu version >= 18.04:                                  Passed (18.04)

Running Nvidia related system checks:
    Nvidia GPU available:                     Found card of type '10de:1e30'
    Nvidia driver is running:                                          Found
    Nvidia driver version valid:                           Valid (460.32.03)
    Nvidia driver from a standard PPA:                          PPA is valid
    CUDA drivers installed:                                    Drivers found
    CUDA drivers version valid:                          Valid (460.32.03-1)
    CUDA drivers from the Nvidia PPA:                           PPA is valid
    CUDA is installed:                                            CUDA found
    CUDA version valid:                                         Valid (11.2)
    CUDA is from the Nvidia PPA:                                PPA is valid
    Isaac SDK archive in 'isaac' folder:                       Found archive

Running Docker related system checks:
    Docker is available:                                               Found
    Docker version valid:                                    Valid (20.10.4)
    Nvidia Container Toolkit installed:                        Found (1.3.3)
    Docker runs without root:                                         Passed

Running checks of filesystem used for Docker:
    /var/lib/docker on ext4 filesystem:                 Yes (/dev/nvme0n1p6)
    /var/lib/docker supports suid:                                   Enabled
    /var/lib/docker driver space check:               Sufficient space (84G)

Miscellaneous requirements:
    Pip python package manager available:                     Found (21.0.1)
    Tkinter for Python installed:                                      Found

All requirements & dependencies fulfilled. Docker containers for the BenchBot
software stack will now be built (which may take anywhere from a few seconds
to many hours). Would you like to proceed (y/N)? y

Proceeding with install ... 

################################################################################
################ PART 2: FETCHING LATEST BENCHBOT VERSION INFO #################
################################################################################

Fetching latest hash for BenchBot Simulator ... 
        d98690360b0c55d1b25c367d8b7d6ede1c618b0f.
Fetching latest hash for BenchBot Robot Controller ... 
        41f176ead1f972d54939b2120720d2fba8a401a0.
Fetching latest hash for BenchBot Supervisor ... 
        9e1cd82a3951feb204af1bfb134704d779a295ab.
Fetching latest hash for BenchBot API ... 
        785740ae3fc5dcfab1ec69fd5bf7eb5b741f995d.
Fetching latest hash for BenchBot ROS Messages ... 
        54487de28fae52c0d54a207b0c776736550fbd93.

################################################################################
######################## PART 3: BUILDING DOCKER IMAGES ########################
################################################################################

BUILDING BENCHBOT CORE DOCKER IMAGE:
Sending build context to Docker daemon  173.7MB
Step 1/12 : FROM ubuntu:bionic
 ---> c090eaba6b94
Step 2/12 : SHELL ["/bin/bash", "-c"]
 ---> Using cache
 ---> f9776749195f
Step 3/12 : ARG TZ
 ---> Using cache
 ---> 69a597014262
Step 4/12 : RUN echo "$TZ" > /etc/timezone && ln -s /usr/share/zoneinfo/"$TZ"     /etc/localtime && apt update && apt -y install tzdata
 ---> Using cache
 ---> 5c8e4ae48d17
Step 5/12 : RUN apt update && apt install -yq wget gnupg2 software-properties-common git     vim ipython3 tmux iputils-ping
 ---> Using cache
 ---> 21354ae08f05
Step 6/12 : ARG NVIDIA_DRIVER_VERSION
 ---> Using cache
 ---> 695adac5d748
Step 7/12 : ARG CUDA_DRIVERS_VERSION
 ---> Using cache
 ---> a724f0c9f331
Step 8/12 : ARG CUDA_VERSION
 ---> Using cache
 ---> 50fdf2a543ef
Step 9/12 : ENV NVIDIA_VISIBLE_DEVICES="all"
 ---> Using cache
 ---> e33abcd7cf5d
Step 10/12 : ENV NVIDIA_DRIVER_CAPABILITIES="compute,display,graphics,utility"
 ---> Using cache
 ---> 548a8d84a163
Step 11/12 : RUN add-apt-repository ppa:graphics-drivers &&     wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin &&     mv -v cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600 &&     apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub &&     add-apt-repository -n "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /" &&     apt update
 ---> Using cache
 ---> a90c9654a281
Step 12/12 : RUN CUDA_NAME="cuda-$(echo "${CUDA_VERSION}" |     sed 's/\([0-9]*\)\.\([0-9]*\).*/\1\.\2/; s/\./-/')" &&     NVIDIA_NAME="nvidia-driver-$(echo "${NVIDIA_DRIVER_VERSION}" |     sed 's/\(^[0-9]*\).*/\1/')" &&     NVIDIA_DEPS="$(apt depends "$NVIDIA_NAME=$NVIDIA_DRIVER_VERSION" 2>/dev/null |     grep '^ *Depends:' | sed 's/.*Depends: \([^ ]*\) (.\?= \([^)]*\))/\1 \2/' |     while read d; do read a b <<< "$d"; v=$(apt policy "$a" 2>/dev/null |     grep "$b" | grep -vE "(Installed|Candidate)" | sed "s/.*\($b[^ ]*\).*/\1/");     echo "$a=$v"; done)" &&     CUDA_DRIVERS_DEPS="$(apt depends "cuda-drivers=$CUDA_DRIVERS_VERSION" 2>/dev/null |     grep '^ *Depends:' | sed 's/.*Depends: \([^ ]*\) (.\?= \([^)]*\))/\1 \2/' |     while read d; do read a b <<< "$d"; v=$(apt policy "$a" 2>/dev/null |     grep "$b" | grep -vE "(Installed|Candidate)" | sed "s/.*\($b[^ ]*\).*/\1/");     echo "$a=$v"; done)" &&     CUDA_DEPS="$(apt depends "$CUDA_NAME=$CUDA_VERSION" 2>/dev/null |     grep '^ *Depends:' | sed 's/.*Depends: \([^ ]*\) (.\?= \([^)]*\))/\1 \2/' |     while read d; do read a b <<< "$d"; v=$(apt policy "$a" 2>/dev/null |     grep "$b" | grep -vE "(Installed|Candidate)" | sed "s/.*\($b[^ ]*\).*/\1/");     echo "$a=$v"; done)" &&     TARGETS="$(echo "$NVIDIA_DEPS $NVIDIA_NAME=$NVIDIA_DRIVER_VERSION"     "$CUDA_DRIVERS_DEPS cuda-drivers=$CUDA_DRIVERS_VERSION"     "$CUDA_DEPS $CUDA_NAME=$CUDA_VERSION" |     tr '\n' ' ')" &&     DEBIAN_FRONTEND=noninteractive apt install -yq $TARGETS
 ---> Using cache
 ---> a7e3bdfeb4ab
Successfully built a7e3bdfeb4ab
Successfully tagged benchbot/core:base

BUILDING BENCHBOT BACKEND DOCKER IMAGE:
Sending build context to Docker daemon  173.7MB
Step 1/36 : FROM benchbot/core:base
 ---> a7e3bdfeb4ab
Step 2/36 : ENV ROS_WS_PATH="/benchbot/ros_ws"
 ---> Using cache
 ---> ca8a10b43fc9
Step 3/36 : RUN echo "deb http://packages.ros.org/ros/ubuntu bionic main" >     /etc/apt/sources.list.d/ros-latest.list &&     apt-key adv --keyserver 'hkp://keyserver.ubuntu.com:80' --recv-key     C1CF6E31E6BADE8868B172B4F42ED6FBAB17C654 &&     apt update && apt install -y ros-melodic-desktop-full python-rosdep     python-rosinstall python-rosinstall-generator python-wstool     python-catkin-tools python-pip build-essential
 ---> Using cache
 ---> 271ba6b926e1
Step 4/36 : RUN wget -qO - http://packages.lunarg.com/lunarg-signing-key-pub.asc |     apt-key add - && wget -qO /etc/apt/sources.list.d/lunarg-vulkan-bionic.list     http://packages.lunarg.com/vulkan/lunarg-vulkan-bionic.list &&     apt update && DEBIAN_FRONTEND=noninteractive apt install -yq vulkan-sdk
 ---> Using cache
 ---> 450a78a7197d
Step 5/36 : RUN useradd --create-home --password "" benchbot && passwd -d benchbot &&     apt update && apt install -yq sudo && usermod -aG sudo benchbot &&     usermod -aG root benchbot && mkdir /benchbot &&     chown benchbot:benchbot /benchbot
 ---> Using cache
 ---> 5eddfc751153
Step 6/36 : USER benchbot
 ---> Using cache
 ---> 7af7b354b5d6
Step 7/36 : WORKDIR /benchbot
 ---> Using cache
 ---> 5adf59d019af
Step 8/36 : RUN sudo rosdep init && rosdep update &&     mkdir -p ros_ws/src && source /opt/ros/melodic/setup.bash &&     pushd ros_ws && catkin_make && source devel/setup.bash && popd
 ---> Using cache
 ---> c732faeb1ee4
Step 9/36 : ARG SIMULATORS
 ---> Using cache
 ---> 505630edaf0e
Step 10/36 : ARG ISAAC_SDK_DIR
 ---> Using cache
 ---> 93880f7fb3e6
Step 11/36 : ARG ISAAC_SDK_TGZ
 ---> Using cache
 ---> a60213a63064
Step 12/36 : ENV ISAAC_SDK_SRCS="/isaac_srcs"
 ---> Using cache
 ---> 04fb0e505eb3
Step 13/36 : COPY --chown=benchbot:benchbot ${ISAAC_SDK_DIR} ${ISAAC_SDK_SRCS}
 ---> Using cache
 ---> c2bb22c0e1ef
Step 14/36 : ENV ISAAC_SDK_PATH="/benchbot/isaac_sdk"
 ---> Using cache
 ---> 63adcd0b4893
Step 15/36 : RUN [ -z "$SIMULATORS" ] && exit 0 || mkdir "$ISAAC_SDK_PATH" &&     tar -xf "$ISAAC_SDK_SRCS/$ISAAC_SDK_TGZ" -C "$ISAAC_SDK_PATH" &&     pushd "$ISAAC_SDK_PATH" && engine/build/scripts/install_dependencies.sh
 ---> Using cache
 ---> b105a45ce0ec
Step 16/36 : ARG BENCHBOT_MSGS_GIT
 ---> Using cache
 ---> fceca57ddea1
Step 17/36 : ARG BENCHBOT_MSGS_HASH
 ---> Using cache
 ---> 19fa4ea6e09d
Step 18/36 : ENV BENCHBOT_MSGS_HASH="$BENCHBOT_MSGS_HASH"
 ---> Using cache
 ---> 14e8981de35d
Step 19/36 : ENV BENCHBOT_MSGS_PATH="/benchbot/benchbot_msgs"
 ---> Using cache
 ---> af6d04bc11b4
Step 20/36 : RUN git clone $BENCHBOT_MSGS_GIT $BENCHBOT_MSGS_PATH &&     pushd $BENCHBOT_MSGS_PATH && git checkout $BENCHBOT_MSGS_HASH &&     pip install -r requirements.txt && pushd $ROS_WS_PATH &&     ln -sv $BENCHBOT_MSGS_PATH src/ && source devel/setup.bash && catkin_make
 ---> Using cache
 ---> 54835d4338e8
Step 21/36 : ARG BENCHBOT_SIMULATOR_GIT
 ---> Using cache
 ---> 791e6b09c0da
Step 22/36 : ARG BENCHBOT_SIMULATOR_HASH
 ---> Using cache
 ---> f4d03e20d4de
Step 23/36 : ENV BENCHBOT_SIMULATOR_PATH="/benchbot/benchbot_simulator"
 ---> Using cache
 ---> b98a6dac0979
Step 24/36 : RUN [ -z "$SIMULATORS" ] && exit 0 ||     git clone $BENCHBOT_SIMULATOR_GIT $BENCHBOT_SIMULATOR_PATH &&     pushd $BENCHBOT_SIMULATOR_PATH && git checkout $BENCHBOT_SIMULATOR_HASH &&     .isaac_patches/apply_patches && source $ROS_WS_PATH/devel/setup.bash &&     ./bazelros build //apps/benchbot_simulator &&     pip install -r requirements.txt
 ---> Using cache
 ---> 0dee5400d73d
Step 25/36 : ARG BENCHBOT_SUPERVISOR_GIT
 ---> Using cache
 ---> c5cd589b2f76
Step 26/36 : ARG BENCHBOT_SUPERVISOR_HASH
 ---> Using cache
 ---> 164df2d0b8ea
Step 27/36 : ENV BENCHBOT_SUPERVISOR_PATH="/benchbot/benchbot_supervisor"
 ---> Using cache
 ---> c3d0a98f8474
Step 28/36 : RUN git clone $BENCHBOT_SUPERVISOR_GIT $BENCHBOT_SUPERVISOR_PATH &&     pushd $BENCHBOT_SUPERVISOR_PATH && git checkout $BENCHBOT_SUPERVISOR_HASH &&     pip3 install .
 ---> Using cache
 ---> 7ea168466cbe
Step 29/36 : ARG BENCHBOT_CONTROLLER_GIT
 ---> Using cache
 ---> 3a2396412ae6
Step 30/36 : ARG BENCHBOT_CONTROLLER_HASH
 ---> Using cache
 ---> e4c2da225bda
Step 31/36 : ENV BENCHBOT_CONTROLLER_PATH="/benchbot/benchbot_robot_controller"
 ---> Using cache
 ---> c5533dc6c289
Step 32/36 : RUN git clone $BENCHBOT_CONTROLLER_GIT $BENCHBOT_CONTROLLER_PATH &&     pushd $BENCHBOT_CONTROLLER_PATH && git checkout $BENCHBOT_CONTROLLER_HASH &&     pip install -r requirements.txt && pushd $ROS_WS_PATH &&     pushd src && git clone https://github.com/eric-wieser/ros_numpy.git && popd &&     ln -sv $BENCHBOT_CONTROLLER_PATH src/ && source devel/setup.bash && catkin_make
 ---> Using cache
 ---> 7ffbb3359917
Step 33/36 : ARG ADDONS_PATH
 ---> Using cache
 ---> 54f1e2ad31ca
Step 34/36 : ENV BENCHBOT_ADDONS_PATH=$ADDONS_PATH
 ---> Using cache
 ---> f193e875a6d2
Step 35/36 : RUN mkdir -p $BENCHBOT_ADDONS_PATH && pip3 install pyyaml
 ---> Using cache
 ---> bafb0e8b0754
Step 36/36 : ENV BENCHBOT_SIMULATORS="${SIMULATORS}"
 ---> Using cache
 ---> 20d4b1e92bc4
Successfully built 20d4b1e92bc4
Successfully tagged benchbot/backend:base

BUILDING BENCHBOT SUBMISSION DOCKER IMAGE:
Sending build context to Docker daemon  173.7MB
Step 1/7 : FROM benchbot/core:base
 ---> a7e3bdfeb4ab
Step 2/7 : RUN apt update && apt install -y libsm6 libxext6 libxrender-dev python3     python3-pip python3-tk python-pip python-tk
 ---> Using cache
 ---> 797320adc160
Step 3/7 : RUN pip3 install --upgrade pip
 ---> Using cache
 ---> a0b632b93b08
Step 4/7 : ARG BENCHBOT_API_GIT
 ---> Using cache
 ---> 560399873d71
Step 5/7 : ARG BENCHBOT_API_HASH
 ---> Using cache
 ---> 98438610f0c2
Step 6/7 : RUN git clone $BENCHBOT_API_GIT && pushd benchbot_api &&     git checkout $BENCHBOT_API_HASH && pip3 install .
 ---> Using cache
 ---> ef77a5329a1f
Step 7/7 : WORKDIR /benchbot_submission
 ---> Using cache
 ---> 87e3524e8021
Successfully built 87e3524e8021
Successfully tagged benchbot/submission:base

CLEANING UP OUTDATED BENCHBOT REMNANTS:

Deleted the following containers:
Deleted Containers:
be898e05fa3a63871627ca5e54caef3808d11a09377123b3b5db841c2b98f123

Deleted Networks:
benchbot_network

Deleted Images:
deleted: sha256:d8e173bb19dbfab0c7f4d3b4f4a45bb7d3c50e50eeac1e99e4c1201ed4ebcfb0
deleted: sha256:e33a4c0b9ea2ad45966c8be44cf5f25aaa8cb48861f0185592863537ed77b07e

Total reclaimed space: 67.5MB

Finished cleaning!

################################################################################
#################### PART 4: RUNNING POST-BUILD HOST CHECKS ####################
################################################################################

Validating the build against the host system:
    CUDA / Nvidia versions match:                                    Matches

Validating BenchBot libraries on the host system:
    BenchBot Add-ons Manager cloned:                                     Yes
    BenchBot Add-ons Manager up-to-date:                          Up-to-date
    BenchBot Add-ons Manager installed:                            Available
    BenchBot API cloned:                                                 Yes
    BenchBot API up-to-date:                                      Up-to-date
    BenchBot API installed:                                        Available
    BenchBot evaluation cloned:                                          Yes
    BenchBot evaluation up-to-date:                               Up-to-date
    BenchBot evaluation installed:                                 Available

Integrating BenchBot with the host system:
    BenchBot hosts available:                                          Found
    BenchBot symlinks available:                                       Found

################################################################################
##################### PART 5: INSTALLING BENCHBOT ADD-ONS ######################
################################################################################

Installing add-ons based on the request string 'benchbot-addons/ssu':

Installing addon 'benchbot-addons/ssu' in '/home/ecu/anaconda3/lib/python3.7/site-packages/benchbot_addons-2.0.0-py3.7.egg/benchbot_addons':
    Found install path './benchbot-addons/ssu'.
    No action - latest already installed.
Installing addon 'benchbot-addons/tasks_ssu' in '/home/ecu/anaconda3/lib/python3.7/site-packages/benchbot_addons-2.0.0-py3.7.egg/benchbot_addons':
    Found install path './benchbot-addons/tasks_ssu'.
    No action - latest already installed.
Installing addon 'benchbot-addons/formats_object_map' in '/home/ecu/anaconda3/lib/python3.7/site-packages/benchbot_addons-2.0.0-py3.7.egg/benchbot_addons':
    Found install path './benchbot-addons/formats_object_map'.
    No action - latest already installed.
Installing addon 'benchbot-addons/envs_isaac_develop' in '/home/ecu/anaconda3/lib/python3.7/site-packages/benchbot_addons-2.0.0-py3.7.egg/benchbot_addons':
    Found install path './benchbot-addons/envs_isaac_develop'.
    No action - latest already installed.
    Found remote content to install to 'environments': https://cloudstor.aarnet.edu.au/plus/s/jKWBTkrj2Bppr6q/download
    No action - remote content is already installed.
Installing addon 'benchbot-addons/robots_isaac' in '/home/ecu/anaconda3/lib/python3.7/site-packages/benchbot_addons-2.0.0-py3.7.egg/benchbot_addons':
    Found install path './benchbot-addons/robots_isaac'.
    No action - latest already installed.
Installing addon 'benchbot-addons/envs_isaac_challenge' in '/home/ecu/anaconda3/lib/python3.7/site-packages/benchbot_addons-2.0.0-py3.7.egg/benchbot_addons':
    Found install path './benchbot-addons/envs_isaac_challenge'.
    No action - latest already installed.
    Found remote content to install to 'environments': https://cloudstor.aarnet.edu.au/plus/s/dbfz0ol7fuWKDVP/download
    No action - remote content is already installed.
Installing addon 'benchbot-addons/robots_isaac' in '/home/ecu/anaconda3/lib/python3.7/site-packages/benchbot_addons-2.0.0-py3.7.egg/benchbot_addons':
    Found install path './benchbot-addons/robots_isaac'.
    No action - latest already installed.
Installing addon 'benchbot-addons/robots_isaac' in '/home/ecu/anaconda3/lib/python3.7/site-packages/benchbot_addons-2.0.0-py3.7.egg/benchbot_addons':
    Found install path './benchbot-addons/robots_isaac'.
    No action - latest already installed.
Installing addon 'benchbot-addons/ground_truths_isaac_develop' in '/home/ecu/anaconda3/lib/python3.7/site-packages/benchbot_addons-2.0.0-py3.7.egg/benchbot_addons':
    Found install path './benchbot-addons/ground_truths_isaac_develop'.
    No action - latest already installed.
Installing addon 'benchbot-addons/batches_isaac' in '/home/ecu/anaconda3/lib/python3.7/site-packages/benchbot_addons-2.0.0-py3.7.egg/benchbot_addons':
    Found install path './benchbot-addons/batches_isaac'.
    No action - latest already installed.
Installing addon 'benchbot-addons/eval_omq' in '/home/ecu/anaconda3/lib/python3.7/site-packages/benchbot_addons-2.0.0-py3.7.egg/benchbot_addons':
    Found install path './benchbot-addons/eval_omq'.
    No action - latest already installed.
Installing addon 'benchbot-addons/formats_object_map' in '/home/ecu/anaconda3/lib/python3.7/site-packages/benchbot_addons-2.0.0-py3.7.egg/benchbot_addons':
    Found install path './benchbot-addons/formats_object_map'.
    No action - latest already installed.
Installing addon 'benchbot-addons/examples_ssu' in '/home/ecu/anaconda3/lib/python3.7/site-packages/benchbot_addons-2.0.0-py3.7.egg/benchbot_addons':
    Found install path './benchbot-addons/examples_ssu'.
    No action - latest already installed.
Installing addon 'benchbot-addons/examples_base' in '/home/ecu/anaconda3/lib/python3.7/site-packages/benchbot_addons-2.0.0-py3.7.egg/benchbot_addons':
    Found install path './benchbot-addons/examples_base'.
    No action - latest already installed.

Installing external add-on dependencies:

Running the following pip install command:
    pip3 install shapely scipy matplotlib numpy

Requirement already satisfied: shapely in /home/ecu/anaconda3/lib/python3.7/site-packages (1.7.1)
Requirement already satisfied: scipy in /home/ecu/anaconda3/lib/python3.7/site-packages (1.4.1)
Requirement already satisfied: matplotlib in /home/ecu/anaconda3/lib/python3.7/site-packages (3.1.3)
Requirement already satisfied: numpy in /home/ecu/anaconda3/lib/python3.7/site-packages (1.19.5)
Requirement already satisfied: cycler>=0.10 in /home/ecu/anaconda3/lib/python3.7/site-packages (from matplotlib) (0.10.0)
Requirement already satisfied: python-dateutil>=2.1 in /home/ecu/anaconda3/lib/python3.7/site-packages (from matplotlib) (2.8.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/ecu/anaconda3/lib/python3.7/site-packages (from matplotlib) (1.1.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /home/ecu/anaconda3/lib/python3.7/site-packages (from matplotlib) (2.4.6)
Requirement already satisfied: six in /home/ecu/anaconda3/lib/python3.7/site-packages (from cycler>=0.10->matplotlib) (1.15.0)
Requirement already satisfied: setuptools in /home/ecu/anaconda3/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib) (54.0.0)

    Done.

Baking external add-on dependencies into the Docker backend:

5058f8d47b2baf74549413a5c48215c42ace97be379bbd3bb4143807e0d30080
WARNING: pip is being invoked by an old script wrapper. This will fail in a future version of pip.
Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Defaulting to user installation because normal site-packages is not writeable
Collecting shapely
  Downloading Shapely-1.7.1-cp36-cp36m-manylinux1_x86_64.whl (1.0 MB)
     |################################| 1.0 MB 10.5 MB/s 
Collecting matplotlib
  Downloading matplotlib-3.3.4-cp36-cp36m-manylinux1_x86_64.whl (11.5 MB)
     |################################| 11.5 MB 27.9 MB/s 
Requirement already satisfied: scipy in /home/benchbot/.local/lib/python3.6/site-packages (1.5.4)
Requirement already satisfied: numpy in /home/benchbot/.local/lib/python3.6/site-packages (1.19.5)
Collecting cycler>=0.10
  Downloading cycler-0.10.0-py2.py3-none-any.whl (6.5 kB)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in /home/benchbot/.local/lib/python3.6/site-packages (from matplotlib) (2.4.7)
Collecting python-dateutil>=2.1
  Using cached python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
Collecting pillow>=6.2.0
  Downloading Pillow-8.1.1-cp36-cp36m-manylinux1_x86_64.whl (2.2 MB)
     |################################| 2.2 MB 29.4 MB/s 
Collecting kiwisolver>=1.0.1
  Downloading kiwisolver-1.3.1-cp36-cp36m-manylinux1_x86_64.whl (1.1 MB)
     |################################| 1.1 MB 47.1 MB/s 
Requirement already satisfied: six in /usr/lib/python3/dist-packages (from cycler>=0.10->matplotlib) (1.11.0)
Installing collected packages: python-dateutil, pillow, kiwisolver, cycler, shapely, matplotlib
Successfully installed cycler-0.10.0 kiwisolver-1.3.1 matplotlib-3.3.4 pillow-8.1.1 python-dateutil-2.8.1 shapely-1.7.1
sha256:2a076daf715a30d3428864e13b51bd5ddced4b1896e313f263a162c11d8db4db
tmp

Finished!
btalb commented 3 years ago

Thanks for providing the logs @gmuraleekrishna. Sorry I can't provide an immediate solution; your configuration of 2x Quadro RTX 8000s isn't something we have available to test on in the lab.

The next debugging step is to narrow the scope as much as possible. The command below isolates whether this is an issue with BenchBot or with the Nvidia Container Toolkit:

docker run --gpus all nvidia/cuda:11.1-base nvidia-smi

Can you show me the results of that command please? If it returns something like the output below, it's a BenchBot issue; otherwise, we need to figure out what's going on with Nvidia's Container Toolkit:

ben@pc:~$ docker run --gpus all nvidia/cuda:11.1-base nvidia-smi
Unable to find image 'nvidia/cuda:11.1-base' locally
11.1-base: Pulling from nvidia/cuda
da7391352a9b: Pull complete 
14428a6d4bcd: Pull complete 
2c2d948710f2: Pull complete 
0ebd322634c1: Pull complete 
36520dd466ac: Pull complete 
fe6ccac2e64b: Pull complete 
Digest: sha256:c6bb47a62ad020638aeaf66443de9c53c6dc8a0376e97b2d053ac774560bd0fa
Status: Downloaded newer image for nvidia/cuda:11.1-base
Wed Mar  3 23:56:19 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 00000000:01:00.0 Off |                  N/A |
| 27%   32C    P8     9W / 180W |    492MiB /  8111MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
hagianga21 commented 3 years ago

Hi @btalb, I am on his team. Thanks for your help. When running the command, I got:

Unable to find image 'nvidia/cuda:11.1-base' locally
11.1-base: Pulling from nvidia/cuda
da7391352a9b: Pull complete 
14428a6d4bcd: Pull complete 
2c2d948710f2: Pull complete 
0ebd322634c1: Pull complete 
36520dd466ac: Pull complete 
fe6ccac2e64b: Pull complete 
Digest: sha256:c6bb47a62ad020638aeaf66443de9c53c6dc8a0376e97b2d053ac774560bd0fa
Status: Downloaded newer image for nvidia/cuda:11.1-base
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0010] error waiting for container: context canceled

By the way, the CUDA version we have on the machine is 11.2; could that cause any problems?

btalb commented 3 years ago

Thanks @hagianga21. That information is helpful; it confirms that the error is being caused by the NVIDIA Container Toolkit (example bug).

I'm trying to work through this, but I can't reproduce the error on our side.

  1. Can you double-check that you have restarted the machine since installing the toolkit?
  2. Did you install everything through the BenchBot installer? Or were some things installed manually?
  3. You can try the previous command again with nvidia/cuda:11.2-base if you like, but I suspect it will also fail.

Otherwise, let's start digging into the NVIDIA Container Toolkit. Can you post the output of:

ben@pc:~$ nvidia-container-cli info
NVRM version:   455.45.01
CUDA version:   11.1

Device Index:   0
Device Minor:   0
Model:          GeForce GTX 1080
Brand:          GeForce
GPU UUID:       GPU-1ec26850-dd95-3255-3b1d-b2a944d1e50e
Bus Location:   00000000:01:00.0
Architecture:   6.1

Then if that works:

ben@pc:~$ nvidia-container-cli -d nvidia.log list
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/usr/bin/nvidia-smi
/usr/bin/nvidia-debugdump
/usr/bin/nvidia-persistenced
/usr/bin/nvidia-cuda-mps-control
/usr/bin/nvidia-cuda-mps-server
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.455.45.01
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.455.45.01
/usr/lib/x86_64-linux-gnu/libcuda.so.455.45.01
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.455.45.01
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.455.45.01
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.455.45.01
/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.455.45.01
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.455.45.01
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.455.45.01
/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.455.45.01
/usr/lib/x86_64-linux-gnu/libnvcuvid.so.455.45.01
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.455.45.01
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.455.45.01
/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.455.45.01
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.455.45.01
/usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.455.45.01
/usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.455.45.01
/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.455.45.01
/usr/lib/x86_64-linux-gnu/libnvoptix.so.455.45.01
/usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.455.45.01
/usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.455.45.01
/usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.455.45.01
/usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.455.45.01
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.455.45.01
/usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.455.45.01
/run/nvidia-persistenced/socket

And lastly paste the contents of the created nvidia.log file back here.

hagianga21 commented 3 years ago
  1. Yes, we restarted the machine several times.
  2. Yes, I installed everything (including Docker) through the BenchBot installer.
  3. 
    (base) ecu@lab18:~/Projects/SSCD/benchbot$ nvidia-container-cli info
    NVRM version:   460.32.03
    CUDA version:   11.2

Device Index:   0
Device Minor:   0
Model:          Quadro RTX 8000
Brand:          Quadro
GPU UUID:       GPU-ac3f4f1a-a44f-f914-2b53-a4d9520f060a
Bus Location:   00000000:1a:00.0
Architecture:   7.5

Device Index:   1
Device Minor:   1
Model:          Quadro RTX 8000
Brand:          Quadro
GPU UUID:       GPU-638d034f-175e-324d-d52b-20d48f304932
Bus Location:   00000000:67:00.0
Architecture:   7.5

(base) ecu@lab18:~/Projects/SSCD/benchbot$ nvidia-container-cli -d nvidia.log list
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/dev/nvidia1
/usr/bin/nvidia-smi
/usr/bin/nvidia-debugdump
/usr/bin/nvidia-persistenced
/usr/bin/nvidia-cuda-mps-control
/usr/bin/nvidia-cuda-mps-server
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.32.03
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.32.03
/usr/lib/x86_64-linux-gnu/libcuda.so.460.32.03
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.32.03
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.32.03
/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.32.03
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.460.32.03
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.460.32.03
/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.460.32.03
/usr/lib/x86_64-linux-gnu/libnvcuvid.so.460.32.03
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.460.32.03
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.460.32.03
/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.460.32.03
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.460.32.03
/usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.460.32.03
/usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.460.32.03
/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.460.32.03
/usr/lib/x86_64-linux-gnu/libnvoptix.so.460.32.03
/usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.460.32.03
/usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.460.32.03
/usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.460.32.03
/usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.460.32.03
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.460.32.03
/usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.460.32.03
/run/nvidia-persistenced/socket


The content of the nvidia.log file:

-- WARNING, the following logs are for debugging purposes only --

I0304 07:05:27.649697 32889 nvc.c:372] initializing library context (version=1.3.3, build=bd9fc3f2b642345301cb2e23de07ec5386232317)
I0304 07:05:27.649782 32889 nvc.c:346] using root /
I0304 07:05:27.649800 32889 nvc.c:347] using ldcache /etc/ld.so.cache
I0304 07:05:27.649815 32889 nvc.c:348] using unprivileged user 1000:1000
I0304 07:05:27.649860 32889 nvc.c:389] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0304 07:05:27.650289 32889 nvc.c:391] dxcore initialization failed, continuing assuming a non-WSL environment
I0304 07:05:27.650637 32891 driver.c:101] starting driver service
I0304 07:05:27.656392 32889 nvc_info.c:680] requesting driver information with ''
I0304 07:05:27.659339 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.460.32.03
I0304 07:05:27.659500 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.460.32.03
I0304 07:05:27.659591 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.460.32.03
I0304 07:05:27.659690 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.32.03
I0304 07:05:27.659819 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.460.32.03
I0304 07:05:27.659961 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.32.03
I0304 07:05:27.660055 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.460.32.03
I0304 07:05:27.660146 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.32.03
I0304 07:05:27.660273 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.460.32.03
I0304 07:05:27.660403 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.460.32.03
I0304 07:05:27.660491 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.460.32.03
I0304 07:05:27.660582 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.460.32.03
I0304 07:05:27.660668 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.460.32.03
I0304 07:05:27.660800 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.460.32.03
I0304 07:05:27.660928 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.460.32.03
I0304 07:05:27.661014 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.32.03
I0304 07:05:27.661102 32889 nvc_info.c:171] skipping /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.450.80.02
I0304 07:05:27.661193 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.32.03
I0304 07:05:27.661326 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.460.32.03
I0304 07:05:27.661419 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.32.03
I0304 07:05:27.661554 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.460.32.03
I0304 07:05:27.662460 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.460.32.03
I0304 07:05:27.662957 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.460.32.03
I0304 07:05:27.663057 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.460.32.03
I0304 07:05:27.663156 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.460.32.03
I0304 07:05:27.663254 32889 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.460.32.03
W0304 07:05:27.663362 32889 nvc_info.c:350] missing library libnvidia-fatbinaryloader.so
W0304 07:05:27.663380 32889 nvc_info.c:350] missing library libvdpau_nvidia.so
W0304 07:05:27.663396 32889 nvc_info.c:354] missing compat32 library libnvidia-ml.so
W0304 07:05:27.663412 32889 nvc_info.c:354] missing compat32 library libnvidia-cfg.so
W0304 07:05:27.663427 32889 nvc_info.c:354] missing compat32 library libcuda.so
W0304 07:05:27.663443 32889 nvc_info.c:354] missing compat32 library libnvidia-opencl.so
W0304 07:05:27.663472 32889 nvc_info.c:354] missing compat32 library libnvidia-ptxjitcompiler.so
W0304 07:05:27.663488 32889 nvc_info.c:354] missing compat32 library libnvidia-fatbinaryloader.so
W0304 07:05:27.663504 32889 nvc_info.c:354] missing compat32 library libnvidia-allocator.so
W0304 07:05:27.663519 32889 nvc_info.c:354] missing compat32 library libnvidia-compiler.so
W0304 07:05:27.663535 32889 nvc_info.c:354] missing compat32 library libnvidia-ngx.so
W0304 07:05:27.663550 32889 nvc_info.c:354] missing compat32 library libvdpau_nvidia.so
W0304 07:05:27.663566 32889 nvc_info.c:354] missing compat32 library libnvidia-encode.so
W0304 07:05:27.663581 32889 nvc_info.c:354] missing compat32 library libnvidia-opticalflow.so
W0304 07:05:27.663597 32889 nvc_info.c:354] missing compat32 library libnvcuvid.so
W0304 07:05:27.663612 32889 nvc_info.c:354] missing compat32 library libnvidia-eglcore.so
W0304 07:05:27.663628 32889 nvc_info.c:354] missing compat32 library libnvidia-glcore.so
W0304 07:05:27.663644 32889 nvc_info.c:354] missing compat32 library libnvidia-tls.so
W0304 07:05:27.663659 32889 nvc_info.c:354] missing compat32 library libnvidia-glsi.so
W0304 07:05:27.663675 32889 nvc_info.c:354] missing compat32 library libnvidia-fbc.so
W0304 07:05:27.663690 32889 nvc_info.c:354] missing compat32 library libnvidia-ifr.so
W0304 07:05:27.663706 32889 nvc_info.c:354] missing compat32 library libnvidia-rtcore.so
W0304 07:05:27.663721 32889 nvc_info.c:354] missing compat32 library libnvoptix.so
W0304 07:05:27.663737 32889 nvc_info.c:354] missing compat32 library libGLX_nvidia.so
W0304 07:05:27.663753 32889 nvc_info.c:354] missing compat32 library libEGL_nvidia.so
W0304 07:05:27.663768 32889 nvc_info.c:354] missing compat32 library libGLESv2_nvidia.so
W0304 07:05:27.663784 32889 nvc_info.c:354] missing compat32 library libGLESv1_CM_nvidia.so
W0304 07:05:27.663799 32889 nvc_info.c:354] missing compat32 library libnvidia-glvkspirv.so
W0304 07:05:27.663815 32889 nvc_info.c:354] missing compat32 library libnvidia-cbl.so
I0304 07:05:27.665036 32889 nvc_info.c:276] selecting /usr/bin/nvidia-smi
I0304 07:05:27.665118 32889 nvc_info.c:276] selecting /usr/bin/nvidia-debugdump
I0304 07:05:27.665168 32889 nvc_info.c:276] selecting /usr/bin/nvidia-persistenced
I0304 07:05:27.665216 32889 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-control
I0304 07:05:27.665265 32889 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-server
I0304 07:05:27.665328 32889 nvc_info.c:438] listing device /dev/nvidiactl
I0304 07:05:27.665346 32889 nvc_info.c:438] listing device /dev/nvidia-uvm
I0304 07:05:27.665363 32889 nvc_info.c:438] listing device /dev/nvidia-uvm-tools
I0304 07:05:27.665380 32889 nvc_info.c:438] listing device /dev/nvidia-modeset
I0304 07:05:27.665448 32889 nvc_info.c:317] listing ipc /run/nvidia-persistenced/socket
W0304 07:05:27.665488 32889 nvc_info.c:321] missing ipc /tmp/nvidia-mps
I0304 07:05:27.665507 32889 nvc_info.c:745] requesting device information with ''
I0304 07:05:27.672996 32889 nvc_info.c:628] listing device /dev/nvidia0 (GPU-ac3f4f1a-a44f-f914-2b53-a4d9520f060a at 00000000:1a:00.0)
I0304 07:05:27.680118 32889 nvc_info.c:628] listing device /dev/nvidia1 (GPU-638d034f-175e-324d-d52b-20d48f304932 at 00000000:67:00.0)
I0304 07:05:27.680385 32889 nvc.c:427] shutting down library context
I0304 07:05:27.681761 32891 driver.c:156] terminating driver service
I0304 07:05:27.682570 32889 driver.c:196] driver service terminated successfully

btalb commented 3 years ago

Sorry about the delay in sorting this. I'm having trouble finding a cause in any of the logs you've provided me. It definitely appears to be an issue with the Nvidia Container Toolkit. I've been scouring their documentation trying to find some hints, but haven't had any luck.

I've got a few possible things to try, but they're more hopeful guesses than anything based on intuition:

  1. Can I confirm that at least one of your Quadros is connected to the display server? Can I see the full output of nvidia-smi on your host?

  2. Confirm the BenchBot image definitely works without passing a GPU:

    docker run --rm -it benchbot/backend:base
  3. Try messing with the GPU argument to see if you can get something to work:

    docker run --rm --gpus 0 nvidia/cuda:11.1-base nvidia-smi
    docker run --rm --gpus 1 nvidia/cuda:11.1-base nvidia-smi
    docker run --rm --gpus 0,1 nvidia/cuda:11.1-base nvidia-smi
    docker run --rm --gpus 2 nvidia/cuda:11.1-base nvidia-smi
    docker run --rm --gpus 3 nvidia/cuda:11.1-base nvidia-smi
  4. Turn on debugging in /etc/nvidia-container-runtime/config.toml, restart the Docker service, and see if you can get any meaningful errors to appear in the log (see https://github.com/NVIDIA/nvidia-docker/wiki/Troubleshooting). A rough sketch of the relevant config lines is shown after this list.

  5. Install the entire NVIDIA Docker stack even though it shouldn't be needed:

    sudo apt install nvidia-docker2
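
For step 4, from memory the debug settings in /etc/nvidia-container-runtime/config.toml look something like the snippet below (your file may differ slightly); uncomment them, restart the Docker service, and re-run one of the failing commands:

[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"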

Sorry your first experiences with BenchBot have been this troublesome. Unfortunately, it seems to be caused by a dependency we have no control over.

My next step is to file an issue over at NVIDIA/nvidia-docker, but I'll need those debugging logs from step 4 before I can.

hagianga21 commented 3 years ago

1.

nvidia-smi
Fri Mar  5 15:54:34 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     On   | 00000000:1A:00.0 Off |                  Off |
| 33%   34C    P8    16W / 260W |      6MiB / 48601MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     On   | 00000000:67:00.0  On |                  Off |
| 34%   45C    P8    33W / 260W |    226MiB / 48598MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1769      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1769      G   /usr/lib/xorg/Xorg                162MiB |
|    1   N/A  N/A      3107      G   /usr/bin/compiz                    31MiB |
|    1   N/A  N/A      3157      G   ...mviewer/tv_bin/TeamViewer       15MiB |
|    1   N/A  N/A      5073      G   /usr/lib/firefox/firefox            3MiB |
|    1   N/A  N/A      5311      G   /usr/lib/firefox/firefox            3MiB |
|    1   N/A  N/A     13389      G   /usr/lib/firefox/firefox            3MiB |
|    1   N/A  N/A     13672      G   /usr/lib/firefox/firefox            3MiB |
+-----------------------------------------------------------------------------+
  2. Yes, it works without a GPU:
    docker run --rm -it benchbot/backend:base
    benchbot@8aa96d76dbd0:/benchbot$
  3. All of the commands give the same error:
    docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
  4. I uncommented the debug line, but after restarting Docker and trying the commands in step 3 again, no log file appears. Do I have to restart the machine?
  5. Nothing new happens after running this line.
btalb commented 3 years ago

I've opened the issue above to try and get to the bottom of this. Sorry it's been anything but straightforward.

I've struggled with finding clarity in what's openly available describing the NVIDIA Docker stack.

Feel free to participate / add any extra info you think may be helpful directly to that issue.

hagianga21 commented 3 years ago

Hi, thanks so much for your support, I appreciate it.

btalb commented 3 years ago

Any further progress on this, @hagianga21?

Someone in the lab had exactly the same error come up yesterday. They were using Docker for something completely unrelated to BenchBot, but were able to fix it by installing nvidia-container-toolkit and rebooting (rough commands below).
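
If you want to try the same thing on your machine, it would be something along these lines (reinstalling, given your install checks already found the toolkit present):

sudo apt install --reinstall nvidia-container-toolkit
sudo systemctl restart docker.service
sudo reboot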

Can I also double-check what packages you have installed and where they were installed from? We're having real trouble reproducing this error:

apt list --installed | grep nvidia
apt policy nvidia-container-toolkit

Here's the output of those commands on one of our working machines for reference:

ben@pc:~$ apt list --installed | grep nvidia                       

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libnvidia-cfg1-455/unknown,unknown,now 455.45.01-0ubuntu1 amd64 [installed,automatic]
libnvidia-common-455/unknown,unknown,now 455.45.01-0ubuntu1 all [installed,automatic]
libnvidia-compute-455/unknown,unknown,now 455.45.01-0ubuntu1 amd64 [installed,automatic]
libnvidia-container-tools/bionic,now 1.3.3-1 amd64 [installed,automatic]
libnvidia-container1/bionic,now 1.3.3-1 amd64 [installed,automatic]                             
libnvidia-decode-455/unknown,unknown,now 455.45.01-0ubuntu1 amd64 [installed,automatic]
libnvidia-encode-455/unknown,unknown,now 455.45.01-0ubuntu1 amd64 [installed,automatic]
libnvidia-extra-455/unknown,unknown,now 455.45.01-0ubuntu1 amd64 [installed,automatic]          
libnvidia-fbc1-455/unknown,unknown,now 455.45.01-0ubuntu1 amd64 [installed,automatic]
libnvidia-gl-455/unknown,unknown,now 455.45.01-0ubuntu1 amd64 [installed,automatic]             
libnvidia-ifr1-455/unknown,unknown,now 455.45.01-0ubuntu1 amd64 [installed,automatic]
nvidia-compute-utils-455/unknown,unknown,now 455.45.01-0ubuntu1 amd64 [installed,automatic]     
nvidia-container-toolkit/bionic,now 1.4.2-1 amd64 [installed]
nvidia-dkms-455/unknown,unknown,now 455.45.01-0ubuntu1 amd64 [installed,automatic]              
nvidia-driver-455/unknown,unknown,now 455.45.01-0ubuntu1 amd64 [installed]
nvidia-kernel-common-455/unknown,unknown,now 455.45.01-0ubuntu1 amd64 [installed,automatic]     
nvidia-kernel-source-455/unknown,unknown,now 455.45.01-0ubuntu1 amd64 [installed,automatic]
nvidia-modprobe/unknown,now 460.32.03-0ubuntu1 amd64 [installed,automatic]                      
nvidia-prime/now 0.8.15.3~0.20.04.1 all [installed,upgradable to: 0.8.16~0.20.04.1]
nvidia-settings/unknown,now 460.32.03-0ubuntu1 amd64 [installed,automatic]                      
nvidia-utils-455/unknown,unknown,now 455.45.01-0ubuntu1 amd64 [installed,automatic]
xserver-xorg-video-nvidia-455/unknown,unknown,now 455.45.01-0ubuntu1 amd64 [installed,automatic]

ben@pc:~$ apt policy nvidia-container-toolkit 
nvidia-container-toolkit:
  Installed: 1.4.2-1
  Candidate: 1.4.2-1
  Version table:
 *** 1.4.2-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
        100 /var/lib/dpkg/status
     1.4.1-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.4.0-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.3.0-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.2.1-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.2.0-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.1.2-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.1.1-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.1.0-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.0.5-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.0.4-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.0.3-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.0.2-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
hagianga21 commented 3 years ago

Hi, sorry for the delay. We've already solved that, but now another problem has appeared: the simulation window appears and then disappears immediately.

Here is the output of the double-check commands:

ecu@lab18:~$ apt list --installed | grep nvidia

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libnvidia-cfg1-460/unknown,now 460.32.03-0ubuntu1 amd64 [installed,automatic]
libnvidia-common-455/unknown,unknown,now 455.45.01-0ubuntu1 all [installed,auto-removable]
libnvidia-common-460/unknown,now 460.32.03-0ubuntu1 all [installed,automatic]
libnvidia-compute-460/unknown,now 460.32.03-0ubuntu1 amd64 [installed,automatic]
libnvidia-container-tools/bionic,now 1.3.3-1 amd64 [installed,automatic]
libnvidia-container1/bionic,now 1.3.3-1 amd64 [installed,automatic]
libnvidia-decode-460/unknown,now 460.32.03-0ubuntu1 amd64 [installed,automatic]
libnvidia-encode-460/unknown,now 460.32.03-0ubuntu1 amd64 [installed,automatic]
libnvidia-extra-460/unknown,now 460.32.03-0ubuntu1 amd64 [installed,automatic]
libnvidia-fbc1-460/unknown,now 460.32.03-0ubuntu1 amd64 [installed,automatic]
libnvidia-gl-460/unknown,now 460.32.03-0ubuntu1 amd64 [installed,automatic]
libnvidia-ifr1-460/unknown,now 460.32.03-0ubuntu1 amd64 [installed,automatic]
nvidia-compute-utils-460/unknown,now 460.32.03-0ubuntu1 amd64 [installed,automatic]
nvidia-container-runtime/bionic,now 3.4.2-1 amd64 [installed]
nvidia-container-toolkit/bionic,now 1.4.2-1 amd64 [installed]
nvidia-dkms-460/unknown,now 460.32.03-0ubuntu1 amd64 [installed,automatic]
nvidia-driver-460/unknown,now 460.32.03-0ubuntu1 amd64 [installed,automatic]
nvidia-kernel-common-460/unknown,now 460.32.03-0ubuntu1 amd64 [installed,automatic]
nvidia-kernel-source-460/unknown,now 460.32.03-0ubuntu1 amd64 [installed,automatic]
nvidia-modprobe/unknown,now 460.32.03-0ubuntu1 amd64 [installed,automatic]
nvidia-prime/bionic-updates,bionic-updates,now 0.8.8.2 all [installed,automatic]
nvidia-settings/unknown,now 460.32.03-0ubuntu1 amd64 [installed,automatic]
nvidia-utils-460/unknown,now 460.32.03-0ubuntu1 amd64 [installed,automatic]
xserver-xorg-video-nvidia-460/unknown,now 460.32.03-0ubuntu1 amd64 [installed,automatic]
 ecu@lab18:~$ apt policy nvidia-container-toolkit
nvidia-container-toolkit:
  Installed: 1.4.2-1
  Candidate: 1.4.2-1
  Version table:
 *** 1.4.2-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
        100 /var/lib/dpkg/status
     1.4.1-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.4.0-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.3.0-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.2.1-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.2.0-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.1.2-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.1.1-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.1.0-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.0.5-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.0.4-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.0.3-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.0.2-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
btalb commented 3 years ago

Thanks @hagianga21, I'm going to close this issue.

But feel free to open another issue describing your setup if that simulator problem continues.