radical-cybertools / radical.saga

A Light-Weight Access Layer for Distributed Computing Infrastructure and Reference Implementation of the SAGA Python Language Bindings.
http://radical-cybertools.github.io/saga-python/

Fix bridges GPU gres flag #792

Closed iparask closed 4 years ago

iparask commented 4 years ago

Bridges requires the GPUs per node to be defined in the gres flag. So when we request p100, the gres flag should look like --gres=gpu:p100:2 regardless of the total number of GPUs requested.

This is a hotfix, since we are not supporting any other GPU type on Bridges.
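A minimal sketch of the per-node gres construction this fix is about; the function name and signature are illustrative, not the adaptor's actual code:

def slurm_gres_directive(total_gpus, nodes, gpu_type='p100'):
    # Bridges wants the *per-node* GPU count in the gres flag, so the total
    # request is spread over the nodes. P100 nodes carry 2 GPUs each, which
    # is why the directive ends up as gpu:p100:2.
    gpus_per_node = max(1, total_gpus // nodes)
    return '#SBATCH --gres=gpu:%s:%d' % (gpu_type, gpus_per_node)

print(slurm_gres_directive(total_gpus=8, nodes=4))  # #SBATCH --gres=gpu:p100:2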

andre-merzky commented 4 years ago

I am not sure I understand this one: is the idea to prevent the user from requesting an unavailable number of GPUs? Because from the code it looks like nothing changes if the user requests, say, 2 nodes and 4 GPUs?

iparask commented 4 years ago

You can request as many GPU nodes as you want. For example, an interact command to start a GPU job on 4 P100 nodes for 30 minutes is

interact -p GPU --gres=gpu:p100:2 -N 4 -t 30:00

and the user will get 8 GPUs. Based on their documentation: --gres=gpu:type:n specifies the type and number of GPUs requested. 'type' is one of: volta32, volta16, p100 or k80. For the GPU, GPU-shared and GPU-small partitions, type is either k80 or p100; the default is k80. For the GPU-AI partition, type is either volta16 or volta32. 'n' is the number of GPUs. Valid choices are 1-4 when type=k80, 1-2 when type=p100, 1-8 when type=volta16, and 1-16 when type=volta32.
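The per-type limits quoted above can be summarized as a small validation table; a sketch only, with names that are illustrative rather than adaptor code:

GPU_LIMITS = {
    'k80':     4,   # valid counts: 1-4
    'p100':    2,   # valid counts: 1-2
    'volta16': 8,   # valid counts: 1-8
    'volta32': 16,  # valid counts: 1-16
}

def check_gres(gpu_type, n):
    # Reject GPU types and counts that Bridges does not offer.
    if gpu_type not in GPU_LIMITS:
        raise ValueError('unknown GPU type: %s' % gpu_type)
    if not 1 <= n <= GPU_LIMITS[gpu_type]:
        raise ValueError('invalid GPU count %d for type %s' % (n, gpu_type))
    return '--gres=gpu:%s:%d' % (gpu_type, n)

print(check_gres('p100', 2))   # --gres=gpu:p100:2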

iparask commented 4 years ago

Check whether, when taking a shared node, the user can utilize all cores.

iparask commented 4 years ago

The P100 GPU nodes have 32 cores and 2 GPUs, and the K80 nodes have 28 cores and 4 GPUs. This is half a P100 GPU node:

[paraskev@login005 pilot.0000]$ interact -p GPU-shared --gres=gpu:p100:1

A command prompt will appear when your session begins
"Ctrl+d" or "exit" will end your session

srun: job 9915045 queued and waiting for resources
srun: job 9915045 has been allocated resources
[paraskev@gpu042 pilot.0000]$ echo $SLURM_CPUS_ON_NODE
16
[paraskev@gpu042 pilot.0000]$

I also asked for 1 K80 GPU and got 7 cores.
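These numbers are consistent with cores being split proportionally across a node's GPUs; a quick sketch of that arithmetic (node specs taken from the comments above, the helper itself is illustrative):

NODE_SPECS = {
    'p100': {'cores': 32, 'gpus': 2},   # 16 cores per GPU
    'k80':  {'cores': 28, 'gpus': 4},   #  7 cores per GPU
}

def shared_cores(gpu_type, gpus_requested=1):
    # Cores granted on GPU-shared scale with the fraction of GPUs requested.
    spec = NODE_SPECS[gpu_type]
    return spec['cores'] * gpus_requested // spec['gpus']

print(shared_cores('p100', 1))  # 16, matches SLURM_CPUS_ON_NODE above
print(shared_cores('k80',  1))  # 7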

iparask commented 4 years ago

I just finished testing this fix. Based on the job description's cpu_architecture attribute, we can select both k80 and p100 GPUs on Bridges.
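Roughly how this looks from the user's side, assuming the cpu_architecture attribute carries the GPU type; the gres line shown is what the adaptor is expected to emit, not code taken from it:

import radical.saga as rs

jd = rs.job.Description()
jd.executable       = '/bin/date'
jd.cpu_architecture = 'p100'             # select P100 rather than K80 nodes

gpu_type = jd.cpu_architecture or 'k80'  # Bridges' default GPU type is k80
gres     = '--gres=gpu:%s:2' % gpu_type  # per-node count, as in the hotfix
print(gres)                              # --gres=gpu:p100:2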