The current way TPS assigns a GPU to each MPI rank is (see here):
```
device_id = mpi_rank % numGpusPerRank
```
where `numGpusPerRank` is set from the `.ini` file.
The default value of this variable is 1; see here. None of the `*.ini` input files in our test suite changes the default value, so I am assuming all local jobs run on a single GPU.
This makes TPS hard to port across different clusters and local machines. Some schedulers (e.g., those on TACC) make all GPUs on a node visible to all tasks on that node, while others (e.g., flux) restrict which GPUs are visible to each task (e.g., through the environment variables `ROCR_VISIBLE_DEVICES`, `HIP_VISIBLE_DEVICES`, `NVIDIA_VISIBLE_DEVICES`, or `CUDA_VISIBLE_DEVICES`).
I propose a more flexible way to handle this by introducing the command-line argument `--gpu-affinity` (short-hand `-ga`).
Three affinity policies will be available:

- `default`: set the deviceID to 0. This is perfect for local resources with a single GPU or when the scheduler restricts which devices are visible to a task (as flux does).
- `direct` (the default policy): set the deviceID equal to the MPI rank. This is perfect on a single node (local or on a cluster) when the number of MPI tasks is less than or equal to the number of GPUs.
- `env-localid`: the device id is set through an environment variable named with `--localid-varname`. Many schedulers set an environment variable that provides a local numbering of the tasks running on a specific node: in slurm this variable is called `SLURM_LOCALID`, in flux `FLUX_TASK_LOCAL_ID`. See also: https://docs.nersc.gov/jobs/affinity/#gpus