pmodels / mpich

Official MPICH Repository
http://www.mpich.org
Other
551 stars 281 forks source link

coll: count and displacement calculations MPIR_Reduce_redscat_gather #2637

Open raffenet opened 7 years ago

raffenet commented 7 years ago

Hello,

In the current implementation of Rabenseifner’s algorithm (MPIR_Reduce_redscat_gather) for MPI_Reduce there is a load imbalance across the processes. The reason is non-uniform distribution of count elements across pof2 processes on the reduce-scatter and the gather steps. The last process pof2-1 is loaded more than others:

for (i=0; i<(pof2-1); i++) cnts[i] = count/pof2; cnts[pof2-1] = count - (count/pof2)*(pof2-1);

For example, commsize = 13, count=13; pof2=8, cnts[0..6] = 1, cnts[7] = 6. commsize = 1600, count=100000; pof2=1024, cnts[0..1022] = 97, cnts[1023] = 769.

In addition, initial calculation of the cnts[i] and disps[i], before reduce-scatter step, takes O(log(p)) time. Summation of the cnts[i] on each step of the reduce-scatter also takes O(log(p)) time. Storing the cnts[i] and displs[i] requires O(p) bytes of memory.

I implemented a new version of MPIR_Reduce_redscat_gather following the description in the original papers [1, 2]. Input vectors are halving uniformly, without using of arrays cnts[i] and disps[i]. Thus, the memory consumption and computational complexity is reduced.

[1] Rajeev Thakur, Rolf Rabenseifner and William Gropp. Optimization of Collective Communication Operations in MPICH // The Int. Journal of High Performance Computing Applications. Vol 19, Issue 1, pp. 49--66. [2] http://www.hlrs.de/mpi/myreduce.html.

With best regards, Mikhail Kurnosov -- Computer Systems Department Siberian State University of Telecommunications and Information Sciences Address: 630102, 86 Kirova str., Novosibirsk, Russia WWW: www.mkurnosov.net

https://gist.github.com/raffenet/ca9692db8b24bbdc2932a186430c29b2

mkurnosov commented 7 years ago

It is required a special processing for the case of comm_size = 1. I will prepare a fix.

raffenet commented 7 years ago

@Bayyapu there is a second part to this ticket that involved the construction of the cnts and displs buffers. Can you look into that part?

mkurnosov commented 7 years ago

Hi, summation of the cnts[i] on each step of the reduce-scatter takes O(log(p)) time. This algorithm can be implemented without using of this arrays. See original paper and implementation: R. Rabenseifner -- https://fs.hlrs.de/projects/par/mpi/myreduce.html

Bayyapu commented 7 years ago

The present implementation is also based on Rabenseifner’s algorithm. In my opinion the current implementation is good as it is and does not cause any performance degradation or increases the memory usage.

pavanbalaji commented 7 years ago

@mkurnosov We use O(log(P)) arrays, not O(P) arrays for storing counts. I'm not sure I understand the space concern you are raising. Can you point to the exact code location?

mkurnosov commented 7 years ago

@pavanbalaji MPIR_CHKLMEM_MALLOC(cnts, int *, pof2*sizeof(int), mpi_errno, "counts"); pof2 ~ P

pavanbalaji commented 7 years ago

No, it’s proportional to log(P)

mkurnosov commented 7 years ago

@pavanbalaji IMHO, in worst case: pof2 == P == commsize /* find nearest power-of-two less than or equal to comm_size */ pof2 = 1; while (pof2 <= comm_size) pof2 <<= 1; pof2 >>=1;

pavanbalaji commented 7 years ago

@mkurnosov Oh, you are right. I was interpreting pof2 as x in 2^x == comm_size, but pof2 is actually 2^x not x, which is O(P). Nevermind. We'll look into it.

akhillanger commented 6 years ago

I think PR #2639 addressed this issue, but the code from that PR has not been merged even though that PR was closed.