radical-cybertools / radical.pilot


Extend PALS MPIEXEC LM with `ppn` option #3014

Closed mtitov closed 11 months ago

GKNB commented 1 year ago

On Polaris, different ranks use overlapping GPUs

I have the following test code in Python, which prints the GPU ID for each rank on each node:

import socket, os

hostname = socket.gethostname()
size = int(os.getenv("PMI_SIZE"))
rank = int(os.getenv("PMI_RANK"))
local_rank = int(os.getenv("PMI_LOCAL_RANK"))
gpu_id = os.getenv("CUDA_VISIBLE_DEVICES")
rank_2 = int(os.getenv("PALS_RANKID"))

print("PMI_SIZE = {}, PMI_RANK = {}, PMI_LOCAL_RANK = {}, PALS_RANKID = {}, Hostname = {}, gpu_id = {}".format(size, rank, local_rank, rank_2, hostname, gpu_id))

and I use the following EnTK script to launch the job with 8 processes across 2 nodes:

import radical.pilot as rp
import radical.entk  as re

t = re.Task({
    'executable'    : 'python',
    'arguments'     : ['/home/twang3/myWork/rct-unit-test/mpiexec-ppn/main.py'],
    'cpu_reqs'      : {'cpu_processes'  : 8,
                       'cpu_threads'    : 8,
                       'cpu_thread_type': rp.OpenMP},
    'gpu_reqs'      : {'gpu_processes'  : 1,
                       'gpu_process_type': rp.CUDA},
    })

s = re.Stage()
s.add_tasks([t])
p = re.Pipeline()
p.add_stages(s)

amgr = re.AppManager()
n_nodes = 2
amgr.resource_desc = {
        'resource'  :   'anl.polaris',
        'project'   :   'RECUP',
        'queue'     :   'debug',
        'cpus'      :   n_nodes * 32,
        'gpus'      :   n_nodes * 4,
        'walltime'  :   5
        }
amgr.workflow = [p]
amgr.run()

and I got the following results:

PMI_SIZE = 8, PMI_RANK = 3, PMI_LOCAL_RANK = 1, Hostname = x3006c0s19b0n0, gpu_id = 3
PMI_SIZE = 8, PMI_RANK = 7, PMI_LOCAL_RANK = 3, Hostname = x3006c0s19b0n0, gpu_id = 3
PMI_SIZE = 8, PMI_RANK = 1, PMI_LOCAL_RANK = 0, Hostname = x3006c0s19b0n0, gpu_id = 1
PMI_SIZE = 8, PMI_RANK = 5, PMI_LOCAL_RANK = 2, Hostname = x3006c0s19b0n0, gpu_id = 1
PMI_SIZE = 8, PMI_RANK = 4, PMI_LOCAL_RANK = 2, Hostname = x3006c0s13b1n0, gpu_id = 0
PMI_SIZE = 8, PMI_RANK = 6, PMI_LOCAL_RANK = 3, Hostname = x3006c0s13b1n0, gpu_id = 2
PMI_SIZE = 8, PMI_RANK = 0, PMI_LOCAL_RANK = 0, Hostname = x3006c0s13b1n0, gpu_id = 0
PMI_SIZE = 8, PMI_RANK = 2, PMI_LOCAL_RANK = 1, Hostname = x3006c0s13b1n0, gpu_id = 2

Note that on node x3006c0s19b0n0, only GPUs 1 and 3 are used, each by two ranks; the same happens on the other node. What we actually want is to use all GPUs on both nodes, with one GPU per process and no overlap.

I found that this issue arises mainly because mpiexec on Polaris uses a different logic to map PMI_RANK to PMI_LOCAL_RANK and node index, while RCT assigns GPU IDs (CUDA_VISIBLE_DEVICES) based on its own rank index. If I add the --ppn flag to mpiexec, the issue is solved. The runs below show the difference in how mpiexec maps PMI_RANK to PMI_LOCAL_RANK and node index with and without --ppn (ignore gpu_id and focus on PMI_RANK!):

twang3@x3004c0s7b0n0:~/myWork/rct-unit-test/mpiexec-ppn> mpiexec -n 8 --ppn 4 --hostfile $PBS_NODEFILE python main.py
PMI_SIZE = 8, PMI_RANK = 5, PMI_LOCAL_RANK = 1, PALS_RANKID = 5, Hostname = x3004c0s7b1n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 7, PMI_LOCAL_RANK = 3, PALS_RANKID = 7, Hostname = x3004c0s7b1n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 4, PMI_LOCAL_RANK = 0, PALS_RANKID = 4, Hostname = x3004c0s7b1n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 6, PMI_LOCAL_RANK = 2, PALS_RANKID = 6, Hostname = x3004c0s7b1n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 2, PMI_LOCAL_RANK = 2, PALS_RANKID = 2, Hostname = x3004c0s7b0n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 0, PMI_LOCAL_RANK = 0, PALS_RANKID = 0, Hostname = x3004c0s7b0n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 3, PMI_LOCAL_RANK = 3, PALS_RANKID = 3, Hostname = x3004c0s7b0n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 1, PMI_LOCAL_RANK = 1, PALS_RANKID = 1, Hostname = x3004c0s7b0n0, gpu_id = None
twang3@x3004c0s7b0n0:~/myWork/rct-unit-test/mpiexec-ppn> mpiexec -n 8 --hostfile $PBS_NODEFILE python main.py
PMI_SIZE = 8, PMI_RANK = 1, PMI_LOCAL_RANK = 0, PALS_RANKID = 1, Hostname = x3004c0s7b1n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 5, PMI_LOCAL_RANK = 2, PALS_RANKID = 5, Hostname = x3004c0s7b1n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 3, PMI_LOCAL_RANK = 1, PALS_RANKID = 3, Hostname = x3004c0s7b1n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 7, PMI_LOCAL_RANK = 3, PALS_RANKID = 7, Hostname = x3004c0s7b1n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 0, PMI_LOCAL_RANK = 0, PALS_RANKID = 0, Hostname = x3004c0s7b0n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 2, PMI_LOCAL_RANK = 1, PALS_RANKID = 2, Hostname = x3004c0s7b0n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 4, PMI_LOCAL_RANK = 2, PALS_RANKID = 4, Hostname = x3004c0s7b0n0, gpu_id = None
PMI_SIZE = 8, PMI_RANK = 6, PMI_LOCAL_RANK = 3, PALS_RANKID = 6, Hostname = x3004c0s7b0n0, gpu_id = None

As we can see, with the --ppn flag, PMI_RANK is assigned in a "greedy" way: consecutive PMI_RANK indices go to ranks on the same node. Without the --ppn flag, PMI_RANK is assigned in a "round-robin" way: consecutive PMI_RANK indices go to ranks with the same local rank index on different nodes.
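
For illustration, here is a small Python sketch (not part of RCT; it assumes 8 ranks over 2 nodes with 4 ranks per node) that reproduces the two rank-to-node mappings observed above:

# Sketch: how a global rank maps to (node, local_rank) under the two schemes.
N_RANKS, N_NODES, PPN = 8, 2, 4

for rank in range(N_RANKS):
    # with --ppn: "greedy" / node-depth -- fill node 0 first, then node 1
    node_greedy, local_greedy = divmod(rank, PPN)
    # without --ppn: round-robin / core-depth -- alternate nodes per rank
    node_rr, local_rr = rank % N_NODES, rank // N_NODES
    print(f"rank {rank}: with --ppn -> node {node_greedy}, local {local_greedy}; "
          f"without --ppn -> node {node_rr}, local {local_rr}")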

mtitov commented 1 year ago

@GKNB Hi Tianle, Andre and I discussed adding the --ppn option, but we couldn't find a way to introduce it for the general case (i.e., it would only work for specific use cases). During that discussion we did find a quick and robust workaround: the goal is not to enforce a particular task placement with respect to core assignment, but to set the correct GPUs per rank, so we can reconfigure the GPU assignment and do it "manually".

First, don't set 'gpu_process_type': rp.CUDA, since it triggers RP to set CUDA_VISIBLE_DEVICES. Second, reconfigure the Polaris resource configuration (following the ALCF guidance regarding GPUs):

mkdir -p ~/.radical/pilot/configs
cat > ~/.radical/pilot/configs/resource_anl.json <<EOF
{
    "polaris": {
        "task_pre_exec" : [
            "export CUDA_VISIBLE_DEVICES=\$((3 - \$PMI_LOCAL_RANK % 4))"
        ]
    }
}
EOF
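
As a quick sanity check of the expression in that pre_exec (a trivial sketch, not RP code), the formula maps local ranks to GPUs in reverse order:

# local rank 0 -> GPU 3, 1 -> GPU 2, 2 -> GPU 1, 3 -> GPU 0
for local_rank in range(4):
    print(local_rank, 3 - local_rank % 4)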

If that works for you, then we can move it into the default configuration for Polaris within RP.

(*) This approach ensures that CUDA_VISIBLE_DEVICES is set for all tasks; alternatively, you can remove this configuration and add the same export ("export CUDA_VISIBLE_DEVICES=$((3 - $PMI_LOCAL_RANK % 4))") to the pre_exec of each task that needs a GPU.
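
A per-task variant could look like the following sketch; it reuses the task from the example above, and the essential parts are the pre_exec entry and the absence of 'gpu_process_type': rp.CUDA (the exact gpu_reqs here are an assumption):

import radical.pilot as rp
import radical.entk  as re

t = re.Task({
    'executable': 'python',
    'arguments' : ['/home/twang3/myWork/rct-unit-test/mpiexec-ppn/main.py'],
    # set the GPU per rank manually, as in the config snippet above
    'pre_exec'  : ['export CUDA_VISIBLE_DEVICES=$((3 - $PMI_LOCAL_RANK % 4))'],
    'cpu_reqs'  : {'cpu_processes'  : 8,
                   'cpu_threads'    : 8,
                   'cpu_thread_type': rp.OpenMP},
    # no 'gpu_process_type': rp.CUDA here, so RP leaves CUDA_VISIBLE_DEVICES alone
    'gpu_reqs'  : {'gpu_processes'  : 1},
    })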

GKNB commented 1 year ago

Issue with ppn flag

I have submitted a ticket to the Polaris help desk, and it seems that the -ppn flag is important and relatively easy to use. Let me summarize their reply:

1). The -ppn flag controls "node-depth" versus "core-depth" process placement. If we run the mpiexec command with the "-ppn" option, it places processes within the same node first. If we do not give the "-ppn" option, it places processes on the available nodes one at a time, cycling round-robin through the nodes until all ranks are placed.

2). According to https://bluewaters.ncsa.illinois.edu/topology-considerations, there is an environment variable called MPICH_RANK_REORDER_METHOD which allows the user to control process placement. However, I tested it on Polaris and it seems to have no effect.

3). What if the number of ranks is not an integer multiple of the number of nodes in the nodefile? If you add "-ppn", mpiexec places "ppn" ranks on node 0, then on node 1, and so forth through the node pool. If you run out of nodes, the launch fails; if you run out of ranks, the last node is underutilized. I tested this on Polaris and the results are consistent with this description.

4). (Important!) If the -ppn flag is not given while we provide a cpu-bind list, the behavior is as follows: assuming we have n nodes, every node places its local rank 0 according to the first item in the list, its local rank 1 according to the second item, and so on. In other words, if we have m ranks and n nodes, only the first m/n items of the cpu-bind list are actually used! See the example below:

twang3@x3102c0s7b0n0:~/myWork/rct-unit-test/mpiexec-ppn> mpiexec -n 6 --cpu-bind=verbose,list:0-7:8-15:16-23:24-31:0-7:8-15 python main.py
cpubind:list x3102c0s7b1n0 pid 4165 rank 1 0: mask 0x000000ff
cpubind:list x3102c0s7b1n0 pid 4166 rank 3 1: mask 0x0000ff00
cpubind:list x3102c0s7b1n0 pid 4167 rank 5 2: mask 0x00ff0000
cpubind:list x3102c0s7b0n0 pid 9598 rank 0 0: mask 0x000000ff
cpubind:list x3102c0s7b0n0 pid 9599 rank 2 1: mask 0x0000ff00
cpubind:list x3102c0s7b0n0 pid 9600 rank 4 2: mask 0x00ff0000

In this example, we have 6 ranks and 2 nodes and no -ppn flag. Each node uses cores 0-7 for its local rank 0 (that is, ranks 0 and 1), cores 8-15 for local rank 1 (ranks 2 and 3), and cores 16-23 for local rank 2 (ranks 4 and 5). The remaining 3 items in the cpu-bind list are ignored, as the masks show.

However, if we add the -ppn flag and place 4 ranks on the first node, we see the desired behavior:

twang3@x3102c0s7b0n0:~/myWork/rct-unit-test/mpiexec-ppn> mpiexec -n 6 --ppn 4 --cpu-bind=verbose,list:0-7:8-15:16-23:24-31:0-7:8-15 python main.py
cpubind:list x3102c0s7b0n0 pid 9565 rank 2 2: mask 0x00ff0000
cpubind:list x3102c0s7b0n0 pid 9563 rank 0 0: mask 0x000000ff
cpubind:list x3102c0s7b0n0 pid 9564 rank 1 1: mask 0x0000ff00
cpubind:list x3102c0s7b0n0 pid 9566 rank 3 3: mask 0xff000000
cpubind:list x3102c0s7b1n0 pid 4136 rank 5 1: mask 0x0000ff00
cpubind:list x3102c0s7b1n0 pid 4135 rank 4 0: mask 0x000000ff

In this case, the first 4 ranks, on the first node, are bound to cores according to the first four items in the cpu-bind list, and the last 2 ranks, on the second node, according to the last two items. We can also see that the -ppn flag works as expected here: with -ppn 4, mpiexec first places 4 ranks on the first node, and since only 2 ranks remain, they go to the second node.
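
To summarize the indexing in the two runs above, here is a small sketch (an illustration, not mpiexec code) of which cpu-bind list entry each rank ends up with:

bind_list = ["0-7", "8-15", "16-23", "24-31", "0-7", "8-15"]
n_ranks, n_nodes, ppn = 6, 2, 4

for rank in range(n_ranks):
    # without --ppn: each node applies the list from its start, so the entry
    # index is the local rank (rank // n_nodes under round-robin placement)
    idx_no_ppn = rank // n_nodes
    # with --ppn 4: placement is node-depth and the list is consumed globally,
    # so the entry index is simply the global rank
    idx_ppn = rank
    print(f"rank {rank}: without --ppn -> cores {bind_list[idx_no_ppn]}, "
          f"with --ppn -> cores {bind_list[idx_ppn]}")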

Issue with your temporary solution above

I think this solution could be useful in the exalearn project. However, there are two issues which make this fix less robust:

1). As mentioned above, we must add the -ppn flag to our mpiexec command on Polaris. This is because we currently use a cpu-bind list, and without the -ppn flag the current logic for applying that cpu-bind list is actually incorrect!

2). This fix always assigns exactly one GPU per rank. However, a). some applications use more than one GPU per rank (some HPC applications use multiple GPUs per rank for linear algebra, e.g. via the cublasXt library), b). some applications are CPU-only but would still get CUDA_VISIBLE_DEVICES assigned, and c). some applications prefer to see all GPUs on the node and assign them internally, e.g. with cudaSetDevice(local_rank). With this temporary fix we cannot handle those scenarios, whereas without it we would simply a). set gpu_processes to more than 1, b). not set any GPU-related attributes, and c). not set gpu_process_type to rp.CUDA, respectively.
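
As an illustration of scenario c), a minimal sketch of a task that sees all GPUs on the node and selects one internally from its local rank (assuming CuPy is available and CUDA_VISIBLE_DEVICES is left unset):

import os
import cupy

local_rank = int(os.getenv("PMI_LOCAL_RANK", "0"))
n_gpus     = cupy.cuda.runtime.getDeviceCount()
# select one of the node's GPUs based on the local rank
cupy.cuda.Device(local_rank % n_gpus).use()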

mtitov commented 1 year ago

@GKNB Hi Tianle,

Thank you for the clarification and explanation regarding --ppn (useful and clear); we'll add this option later next week.

Regarding GPU IDs, I agree that it only fits certain use cases and shouldn't be the general approach (I rushed that idea). I'll add it to the documentation (supported platforms section, for Polaris) as a tip, so that users who want control over GPU ID assignment can provide the command above in task.pre_exec.

mtitov commented 1 year ago

@GKNB Tianle, can you please try the branch from this PR https://github.com/radical-cybertools/radical.pilot/pull/3035

mturilli commented 12 months ago

@GKNB I know we have a deadline this week (Oct 2); maybe you can look at it next week (Oct 9)?

mturilli commented 11 months ago

@GKNB ping