montefiore-institute / alan-cluster

Documentation and guidelines for the Alan GPU cluster at the University of Liège.
BSD 3-Clause "New" or "Revised" License

GPU allocation #24

Closed digirak closed 5 years ago

digirak commented 5 years ago

Describe the issue: I have written the following script to submit a job to two RTX 2080 Ti GPUs. I get this error:

1020391 gpu2080ti pythonjo rnath PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)

Can you check if the script is OK? The files are all in order and the Python code is fine. The script:

#!/bin/bash
#SBATCH --job-name=pythonjob
#SBATCH --time=00:10:00 # hh:mm:ss
#SBATCH --output=output_val.txt
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mem-per-cpu=10240 # 10GB
#SBATCH --mail-user=rakesh.nath@uliege.be
#SBATCH --mail-type=ALL
#SBATCH --partition=gpu2080ti
#SBATCH --comment=DeepLSpectra

python /home/rnath/Code/DeepNet.py

JoeriHermans commented 5 years ago

This script is actually fine.

The message "(Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)" is the reason the job is pending (PD) rather than a script error. It basically says that the gpu2080ti partition is fully allocated, and that you have to wait until the resources are released. :) You might have to wait a bit longer since you requested 2 GPUs. It might be quicker if you use the default queue, which also includes the GTX 1080 Ti GPUs.
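
To see for yourself how busy the partition is, something like the following may help. It is only a sketch: it assumes the standard SLURM client tools (`sinfo`, `squeue`) are on your path on the login node, and it reuses the partition name and job ID from the output above. The `command -v` guard is there so the snippet does nothing harmful on a machine without SLURM.

```shell
#!/bin/bash
# Partition and job ID taken from the squeue line quoted in this issue.
partition="gpu2080ti"
job_id=1020391

if command -v squeue >/dev/null 2>&1 && command -v sinfo >/dev/null 2>&1; then
    # Node states in the partition (idle, mixed, allocated, drained, down).
    sinfo --partition="$partition"
    # Jobs currently running or waiting in the partition.
    squeue --partition="$partition"
    # The scheduler's estimated start time for the pending job, if it has one.
    squeue --jobs="$job_id" --start
else
    echo "SLURM client tools not found; run this on the cluster login node."
fi
```

To try the default queue instead, leaving out the `#SBATCH --partition=gpu2080ti` line from the script should be enough, assuming the cluster's default partition is the one that includes both GPU types.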

I'll close this issue for now, but feel free to reopen it.

digirak commented 5 years ago

Ah ok that makes sense. Thanks

JoeriHermans commented 5 years ago

One more tip: I do not know what kind of framework you are using, but if you are using asynchronous data loaders it might be better to allocate more CPU cores. You are using just a single one now, which is not much if you need to feed 2 GPUs. I would recommend something like 3 or 4 asynchronous loaders per GPU. Keep in mind that you have set #SBATCH --mem-per-cpu=10240 # 10GB, so if you allocate 8 CPU cores, you will actually allocate 80 GB of RAM. To be fair to the other users, allocate only as much as you need.
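
The memory arithmetic can be sketched as below. The 8 cores and the 10240 MB per core come from the numbers above; requesting the cores with `--cpus-per-task=8` is an assumption about how you would do it, since the original script does not set that option. `SLURM_CPUS_PER_TASK` and `SLURM_MEM_PER_CPU` are environment variables SLURM exports inside a job when the matching options are given; the defaults only let the snippet run outside a job.

```shell
#!/bin/bash
# Sketch only: --cpus-per-task=8 is an assumed way to request the 8 cores
# mentioned above; it is not part of the original script.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:2
#SBATCH --mem-per-cpu=10240 # 10GB per core, not per job

# Fall back to the values from this thread when not running under SLURM.
cpus="${SLURM_CPUS_PER_TASK:-8}"
mem_per_cpu_mb="${SLURM_MEM_PER_CPU:-10240}"

# Total memory scales with the number of cores requested.
total_mb=$((cpus * mem_per_cpu_mb))
total_gb=$((total_mb / 1024))
echo "Requested ${cpus} cores x ${mem_per_cpu_mb} MB = ${total_mb} MB (${total_gb} GB)"
```

If the job needs the cores but not that much RAM, lowering `--mem-per-cpu` is one way to keep the total down; the right value depends on your data and model.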

digirak commented 5 years ago

Oh I see, so when I say 1 CPU it just uses one core? I was under the impression it would use the whole node. Now I understand what you mean.