radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

On Polaris, different ranks will use overlapping GPUs #3022

Closed GKNB closed 1 year ago

GKNB commented 1 year ago

I have the following test code in Python, which prints the GPU ID for each rank on each node:

import socket, os

hostname = socket.gethostname()

# Rank information from the launcher environment:
#   PMI_SIZE = total ranks, PMI_RANK = global rank,
#   PMI_LOCAL_RANK = rank index within this node, PALS_RANKID = PALS global rank id
size       = int(os.getenv("PMI_SIZE"))
rank       = int(os.getenv("PMI_RANK"))
local_rank = int(os.getenv("PMI_LOCAL_RANK"))

# GPU(s) this rank is allowed to see
gpu_id = os.getenv("CUDA_VISIBLE_DEVICES")

rank_2 = int(os.getenv("PALS_RANKID"))

print("PMI_SIZE = {}, PMI_RANK = {}, PMI_LOCAL_RANK = {}, PALS_RANKID = {}, Hostname = {}, gpu_id = {}".format(size, rank, local_rank, rank_2, hostname, gpu_id))

and I use the following EnTK script to launch the job with 8 processes on 2 nodes:

import radical.pilot as rp
import radical.entk  as re

t = re.Task({
    'executable'    : 'python',
    'arguments'     : ['/home/twang3/myWork/rct-unit-test/mpiexec-ppn/main.py'],
    'cpu_reqs'      : {'cpu_processes'  : 8,
                       'cpu_threads'    : 8,
                       'cpu_thread_type': rp.OpenMP},
    'gpu_reqs'      : {'gpu_processes'  : 1,
                       'gpu_process_type': rp.CUDA},
    })

s = re.Stage()
s.add_tasks([t])
p = re.Pipeline()
p.add_stages(s)

amgr = re.AppManager()
n_nodes = 2
amgr.resource_desc = {
        'resource'  :   'anl.polaris',
        'project'   :   'RECUP',
        'queue'     :   'debug',
        'cpus'      :   n_nodes * 32,
        'gpus'      :   n_nodes * 4,
        'walltime'  :   5
        }
amgr.workflow = [p]
amgr.run()
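
For reference, here is a quick sanity check that this request should fill both nodes exactly, assuming the per-node figures used in the resource description above (32 cores and 4 GPUs per node); this is only an illustrative sketch, not RCT's scheduling logic:

# Sanity check of the resource request above
# (assumed per node: 32 cores, 4 GPUs, as in the resource description).
n_nodes        = 2
cores_per_node = 32
gpus_per_node  = 4

cpu_processes, cpu_threads, gpus_per_process = 8, 8, 1

assert cpu_processes * cpu_threads      == n_nodes * cores_per_node  # 64 cores
assert cpu_processes * gpus_per_process == n_nodes * gpus_per_node   # 8 GPUs

# Expected placement: 4 ranks per node, each pinned to its own GPU.
print("expected ranks per node:", cpu_processes // n_nodes)          # 4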

Running this, I got the following results:

PMI_SIZE = 8, PMI_RANK = 3, PMI_LOCAL_RANK = 1, Hostname = x3006c0s19b0n0, gpu_id = 3
PMI_SIZE = 8, PMI_RANK = 7, PMI_LOCAL_RANK = 3, Hostname = x3006c0s19b0n0, gpu_id = 3
PMI_SIZE = 8, PMI_RANK = 1, PMI_LOCAL_RANK = 0, Hostname = x3006c0s19b0n0, gpu_id = 1
PMI_SIZE = 8, PMI_RANK = 5, PMI_LOCAL_RANK = 2, Hostname = x3006c0s19b0n0, gpu_id = 1
PMI_SIZE = 8, PMI_RANK = 4, PMI_LOCAL_RANK = 2, Hostname = x3006c0s13b1n0, gpu_id = 0
PMI_SIZE = 8, PMI_RANK = 6, PMI_LOCAL_RANK = 3, Hostname = x3006c0s13b1n0, gpu_id = 2
PMI_SIZE = 8, PMI_RANK = 0, PMI_LOCAL_RANK = 0, Hostname = x3006c0s13b1n0, gpu_id = 0
PMI_SIZE = 8, PMI_RANK = 2, PMI_LOCAL_RANK = 1, Hostname = x3006c0s13b1n0, gpu_id = 2

We can see that on node x3006c0s19b0n0 only GPUs 1 and 3 are used, each by two ranks, and the same happens on the other node. What we actually want is to use all GPUs on both nodes, with one GPU per process and no overlap.
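
In other words, on every node the four node-local ranks should see GPUs 0 through 3 with no repetition, i.e. the GPU index should equal the node-local rank. A minimal sketch of that expectation, assuming 4 GPUs per node (illustrative only, not what RCT actually computes):

# Desired (non-overlapping) assignment: GPU index == node-local rank.
gpus_per_node = 4  # assumption for a Polaris node

def expected_gpu(local_rank):
    # one distinct GPU per process on a node
    return local_rank % gpus_per_node

for local_rank in range(gpus_per_node):
    print("local rank {} -> CUDA_VISIBLE_DEVICES={}".format(local_rank, expected_gpu(local_rank)))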

I found that this issue arises mainly because mpiexec on Polaris uses a different logic to map PMI_RANK to PMI_LOCAL_RANK and node index than RCT assumes when it assigns the GPU id (CUDA_VISIBLE_DEVICES) based on a rank index. In fact, if I add the --ppn flag to mpiexec, the issue is solved. The two runs below show how mpiexec maps PMI_RANK to PMI_LOCAL_RANK and node index with and without --ppn (ignore gpu_id here and focus on PMI_RANK!):

twang3@x3004c0s7b0n0:~/myWork/rct-unit-test/mpiexec-ppn> mpiexec -n 8 --ppn 4 --hostfile $PBS_NODEFILE python main.py
PMI_SIZE = 8, PMI_RANK = 5, PMI_LOCAL_RANK = 1, PALS_RANKID = 5, Hostname = x3004c0s7b1n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 7, PMI_LOCAL_RANK = 3, PALS_RANKID = 7, Hostname = x3004c0s7b1n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 4, PMI_LOCAL_RANK = 0, PALS_RANKID = 4, Hostname = x3004c0s7b1n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 6, PMI_LOCAL_RANK = 2, PALS_RANKID = 6, Hostname = x3004c0s7b1n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 2, PMI_LOCAL_RANK = 2, PALS_RANKID = 2, Hostname = x3004c0s7b0n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 0, PMI_LOCAL_RANK = 0, PALS_RANKID = 0, Hostname = x3004c0s7b0n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 3, PMI_LOCAL_RANK = 3, PALS_RANKID = 3, Hostname = x3004c0s7b0n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 1, PMI_LOCAL_RANK = 1, PALS_RANKID = 1, Hostname = x3004c0s7b0n0, gpu_id = None
twang3@x3004c0s7b0n0:~/myWork/rct-unit-test/mpiexec-ppn> mpiexec -n 8 --hostfile $PBS_NODEFILE python main.py
PMI_SIZE = 8, PMI_RANK = 1, PMI_LOCAL_RANK = 0, PALS_RANKID = 1, Hostname = x3004c0s7b1n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 5, PMI_LOCAL_RANK = 2, PALS_RANKID = 5, Hostname = x3004c0s7b1n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 3, PMI_LOCAL_RANK = 1, PALS_RANKID = 3, Hostname = x3004c0s7b1n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 7, PMI_LOCAL_RANK = 3, PALS_RANKID = 7, Hostname = x3004c0s7b1n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 0, PMI_LOCAL_RANK = 0, PALS_RANKID = 0, Hostname = x3004c0s7b0n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 2, PMI_LOCAL_RANK = 1, PALS_RANKID = 2, Hostname = x3004c0s7b0n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 4, PMI_LOCAL_RANK = 2, PALS_RANKID = 4, Hostname = x3004c0s7b0n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 6, PMI_LOCAL_RANK = 3, PALS_RANKID = 6, Hostname = x3004c0s7b0n0, gpu_id = None

As we can see, with the --ppn flag PMI_RANK is assigned in a "block" (greedy) way: consecutive PMI_RANK indices go to ranks on the same node. Without the --ppn flag, PMI_RANK is assigned in a round-robin way: consecutive PMI_RANK indices go to ranks with the same local rank index on different nodes.
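
To illustrate the two schemes, and why the mismatch produces overlapping GPUs, here is a small sketch; the mapping functions below are illustrative, not mpiexec's or RCT's actual code:

# Illustrative contrast of the two placement schemes on 2 nodes with 4 ranks each.
n_nodes, ppn = 2, 4
n_ranks      = n_nodes * ppn

def block_placement(rank):
    # with --ppn: consecutive ranks fill one node after the other ("greedy"/block)
    return rank // ppn, rank % ppn            # (node index, local rank)

def round_robin_placement(rank):
    # without --ppn: consecutive ranks go to consecutive nodes
    return rank % n_nodes, rank // n_nodes    # (node index, local rank)

for rank in range(n_ranks):
    b_node, b_local = block_placement(rank)
    r_node, r_local = round_robin_placement(rank)
    # If CUDA_VISIBLE_DEVICES is derived as if placement were block (gpu = rank % ppn)
    # while the launcher actually places ranks round-robin, two ranks on the same node
    # end up with the same GPU -- the overlap shown in the first output above.
    gpu_if_block_assumed = rank % ppn
    print("rank {}: block=(node {}, local {})  round-robin=(node {}, local {})  "
          "gpu-if-block-assumed={}".format(rank, b_node, b_local, r_node, r_local,
                                           gpu_if_block_assumed))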