intel / mpi-benchmarks


Proper process distribution for parallel transfer benchmarks? #28

Closed: sjpb closed this issue 3 years ago

sjpb commented 3 years ago

I must be missing something here:

I'm running, e.g., uniband via Slurm + Open MPI. I have 2x 32-core nodes, so I want to run 64 processes with the first half on node #1 and the second half on node #2, so that the pair-wise transfers go across the network.

Setting the sbatch options --ntasks=64 and --ntasks-per-node=32 and running:

srun --distribution=block IMB-MPI1 uniband

does the right thing for the 64-process case, with ranks 0-31 on node #1 and 32-63 on node #2. However, uniband also generates results for 2, 4, 8, 16, and 32 processes, which seems helpful, except that all of that communication stays within node #1, so it isn't really measuring what I want.

Is this the intended usage and behavior? If so, is there a way of disabling the runs on fewer than the full number of processes, so that I can control placement properly?
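
For reference, a minimal batch script matching the setup described above might look like the sketch below (the shebang and script layout are assumptions; only the task counts and the srun line come from this thread):

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=32

# Block distribution places ranks 0-31 on the first node and 32-63 on the second
srun --distribution=block IMB-MPI1 uniband
```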

sjpb commented 3 years ago

Ah, I found the -npmin option, but -map seems a better option here. Any suggestions on how to use it for the above case?

VinnitskiV commented 3 years ago

Hi @sjpb, you can find more information about the -map option using: IMB-MPI1 -help map. In your case, I assume you need to use an option like: IMB-MPI1 uniband -map COUNT_RANKS_PER_NODExCOUNT_NODES
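
For the setup described above (32 ranks per node on 2 nodes), that suggestion would presumably translate to something like the following (a sketch; the srun options are carried over from the original post):

```bash
srun --distribution=block IMB-MPI1 uniband -map 32x2
```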

a-v-medvedev commented 3 years ago

> Ah, I found the -npmin option, but -map seems a better option here. Any suggestions on how to use it for the above case?

Just leaving a note on -map usage. To better understand the idea behind -map and the way it can be used to measure inter-node communication only, see the examples in the "-map PxQ Option" section here: https://software.intel.com/content/www/us/en/develop/documentation/imb-user-guide/top/benchmark-methodology/command-line-control.html

Also, -npmin is typically used in every IMB run to eliminate these np=2,4,... executions.
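
Combining both suggestions for the 2x 32-core case above, a run restricted to the full 64-process, inter-node configuration might look like this (a sketch based on the thread; the exact flag values are assumptions for this particular node layout):

```bash
# -npmin 64 skips the np=2,4,...,32 sub-runs; -map 32x2 renumbers ranks so
# that communicating pairs land on different nodes
srun --distribution=block IMB-MPI1 uniband -npmin 64 -map 32x2
```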