I am opening an issue here to keep track of what happens. I have contacted the cluster support team and hope we get some help.
I noticed that jobs on the Clara cluster have recently been failing often because a GPU is not assigned to the requested node.
For example, when I request a test node and then run nvidia-smi, I get this error message:
Unable to determine the device handle for GPU 0000:43:00.0: Unknown Error
This in turn triggers the following assertion in a training job, apparently because only n-1 of the requested n GPUs are visible:
AssertionError: batch-size X not multiple of GPU count n-1
Example:
AssertionError: batch-size 64 not multiple of GPU count 7
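The GPU count of 7 in that assertion presumably comes from the number of devices actually visible on the node, so one quick way to confirm the mismatch interactively is to count them (a minimal sketch; 8 matches the number of GPUs requested in the script below):
# List the GPUs nvidia-smi can see and count them; a healthy node from this
# request should print 8, while the broken node presumably shows fewer
# (or nvidia-smi fails outright with the error above).
nvidia-smi --list-gpus
nvidia-smi --list-gpus | wc -l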
This is a test shell script that requests a node with 8 GPUs and then runs nvidia-smi:
#!/bin/bash
#SBATCH --job-name=nvidia_smi_check # name for the job;
#SBATCH --partition=clara-job # Request for the Clara cluster;
#SBATCH --nodes=1 # Number of nodes;
#SBATCH --cpus-per-task=32 # Number of CPUs;
#SBATCH --gres=gpu:rtx2080ti:8 # Type and number of GPUs;
#SBATCH --mem-per-gpu=11G # RAM per GPU;
#SBATCH --time=0:01:00 # requested wall time in hh:mm:ss;
#SBATCH --output=/home/sc.uni-leipzig.de/sv127qyji/PAI/detectors/logs_train_jobs/%j.log # path for job-id.log file;
#SBATCH --error=/home/sc.uni-leipzig.de/sv127qyji/PAI/detectors/logs_train_jobs/%j.err # path for job-id.err file;
#SBATCH --mail-type=BEGIN,TIME_LIMIT,END # email options;
nvidia-smi
# Submit the script to the Slurm job manager like this:
# sbatch ~/PAI/scripts/cluster/nvidia_smi_check.sh
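If this keeps happening, the same check could be extended so a job aborts before training starts whenever fewer GPUs are visible than requested. This is only a sketch, assuming the same request of 8 GPUs as above:
# Compare the number of visible GPUs against the requested count and
# abort early instead of letting the training job hit the assertion.
REQUESTED_GPUS=8
VISIBLE_GPUS=$(nvidia-smi --list-gpus | wc -l)
if [ "$VISIBLE_GPUS" -ne "$REQUESTED_GPUS" ]; then
    echo "ERROR: only $VISIBLE_GPUS of $REQUESTED_GPUS GPUs are visible on $(hostname)" >&2
    exit 1
fi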