stark-t / PAI

Pollination_Artificial_Intelligence

Cluster issues - "Unable to determine the device handle for GPU 0000:43:00.0: Unknown Error" #48

Closed: valentinitnelav closed this issue 2 years ago

valentinitnelav commented 2 years ago

I am opening an issue here to track what happens. I have contacted the cluster support team and hope we get some help.

I noticed that Clara cluster jobs have recently been failing often because one of the GPUs on the requested node is unavailable. For example, when I request a test node and then run nvidia-smi, I get this error message:

Unable to determine the device handle for GPU 0000:43:00.0: Unknown Error

This in turn triggers the following error in a training job:

AssertionError: batch-size X not multiple of GPU count X/n-1

Example:

AssertionError: batch-size 64 not multiple of GPU count 7
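
For context, this assertion comes from the multi-GPU training setup: the batch is split evenly across the GPUs that PyTorch can see, so the batch size must be divisible by the device count. A minimal sketch of that kind of check (illustrative only; this is not the exact PAI/YOLO training code):

import torch

def check_batch_size(batch_size: int) -> None:
    # Fail early if the batch cannot be split evenly across visible GPUs.
    n_gpus = torch.cuda.device_count()  # reports 7 instead of 8 when one GPU drops out
    assert n_gpus <= 1 or batch_size % n_gpus == 0, (
        f"batch-size {batch_size} not multiple of GPU count {n_gpus}"
    )

check_batch_size(64)  # passes with 8 GPUs (64 / 8 = 8), fails with 7

With 8 healthy GPUs, 64 % 8 == 0, but once the faulty GPU at 0000:43:00.0 disappears, only 7 devices are visible and 64 % 7 != 0, which produces exactly the error above.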

This is a test shell script that requests a node with 8 GPUs and then runs nvidia-smi:

#!/bin/bash
#SBATCH --job-name=nvidia_smi_check # name for the job;
#SBATCH --partition=clara-job # Request for the Clara cluster;
#SBATCH --nodes=1 # Number of nodes;
#SBATCH --cpus-per-task=32 # Number of CPUs;
#SBATCH --gres=gpu:rtx2080ti:8 # Type and number of GPUs;
#SBATCH --mem-per-gpu=11G # RAM per GPU;
#SBATCH --time=0:01:00 # requested time in d-hh:mm:ss
#SBATCH --output=/home/sc.uni-leipzig.de/sv127qyji/PAI/detectors/logs_train_jobs/%j.log # path for job-id.log file;
#SBATCH --error=/home/sc.uni-leipzig.de/sv127qyji/PAI/detectors/logs_train_jobs/%j.err # path for job-id.err file;
#SBATCH --mail-type=BEGIN,TIME_LIMIT,END # email options;

nvidia-smi
# Send script to slurm job manager like this:
# sbatch ~/PAI/scripts/cluster/nvidia_smi_check.sh

The expected output should look something like this:

Thu Jul 21 08:54:11 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:05:00.0 Off |                  N/A |
|  0%   25C    P8     5W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:06:00.0 Off |                  N/A |
|  0%   27C    P8     1W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:28:00.0 Off |                  N/A |
|  0%   28C    P8    15W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  On   | 00000000:29:00.0 Off |                  N/A |
|  0%   24C    P8     1W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  On   | 00000000:43:00.0 Off |                  N/A |
|  0%   22C    P8     1W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  On   | 00000000:44:00.0 Off |                  N/A |
|  0%   23C    P8     1W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  On   | 00000000:63:00.0 Off |                  N/A |
|  0%   26C    P8     1W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  On   | 00000000:64:00.0 Off |                  N/A |
|  0%   27C    P8    13W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
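
Until the faulty GPU is fixed, a possible guard (just a sketch, not tested on Clara, and assuming SLURM exports SLURM_GPUS_ON_NODE on this cluster) is to check inside the job that the number of GPUs actually visible matches what was requested via --gres, and abort before training starts:

import os
import torch

# GPUs requested from SLURM; fall back to the 8 requested in the script above
# if the environment variable is not set on this cluster.
requested = int(os.environ.get("SLURM_GPUS_ON_NODE", 8))
visible = torch.cuda.device_count()

if visible != requested:
    raise SystemExit(
        f"Requested {requested} GPUs but only {visible} are usable - "
        "the node likely has a faulty GPU; resubmit the job."
    )

This fails fast with a clear message instead of letting the training job crash later on the batch-size assertion.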
valentinitnelav commented 2 years ago

Current answer from the support team:

The two affected nodes will be rebooted.