raffenet opened this issue 7 years ago
Special processing is required for the case comm_size = 1. I will prepare a fix.
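For context, a minimal hedged sketch of the kind of guard this could mean (hypothetical names, not the actual fix): with a single process the reduction degenerates to copying sendbuf into recvbuf, or a no-op when the caller passed MPI_IN_PLACE. A typed copy via MPI_Sendrecv to self on MPI_COMM_SELF is used here to avoid assuming MPICH internals.

    #include <mpi.h>

    /* Hypothetical guard, not the actual MPICH patch: with one process the
     * recursive-halving machinery can be skipped and the reduction becomes a
     * local copy (or a no-op when MPI_IN_PLACE was passed). */
    static int reduce_comm_size_one(const void *sendbuf, void *recvbuf,
                                    int count, MPI_Datatype datatype)
    {
        if (sendbuf != MPI_IN_PLACE) {
            /* MPI_Sendrecv to self is a portable way to copy a typed buffer. */
            return MPI_Sendrecv(sendbuf, count, datatype, 0, 0,
                                recvbuf, count, datatype, 0, 0,
                                MPI_COMM_SELF, MPI_STATUS_IGNORE);
        }
        return MPI_SUCCESS;
    }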
@Bayyapu there is a second part to this ticket that involves the construction of the cnts and displs buffers. Can you look into that part?
Hi, the summation of cnts[i] at each step of the reduce-scatter takes O(log(p)) time. This algorithm can be implemented without using these arrays. See the original paper and implementation by R. Rabenseifner: https://fs.hlrs.de/projects/par/mpi/myreduce.html
The present implementation is also based on Rabenseifner’s algorithm. In my opinion the current implementation is good as it is and does not cause any performance degradation or increase memory usage.
@mkurnosov We use O(log(P)) arrays, not O(P) arrays for storing counts. I'm not sure I understand the space concern you are raising. Can you point to the exact code location?
@pavanbalaji
    MPIR_CHKLMEM_MALLOC(cnts, int *, pof2*sizeof(int), mpi_errno, "counts");
Here pof2 ~ P.
No, it’s proportional to log(P)
@pavanbalaji IMHO, in the worst case pof2 == P == commsize:
    /* find nearest power-of-two less than or equal to comm_size */
    pof2 = 1;
    while (pof2 <= comm_size)
        pof2 <<= 1;
    pof2 >>= 1;
@mkurnosov Oh, you are right. I was interpreting pof2 as x in 2^x == comm_size, but pof2 is actually 2^x, not x, which is O(P). Nevermind. We'll look into it.
I think PR #2639 addressed this issue, but the code from that PR has not been merged even though that PR was closed.
Hello,
In the current implementation of Rabenseifner’s algorithm (MPIR_Reduce_redscat_gather) for MPI_Reduce there is a load imbalance across the processes. The reason is the non-uniform distribution of the count elements across the pof2 processes in the reduce-scatter and gather steps. The last process, pof2-1, is loaded more heavily than the others:
    for (i = 0; i < (pof2 - 1); i++)
        cnts[i] = count / pof2;
    cnts[pof2 - 1] = count - (count / pof2) * (pof2 - 1);
For example: commsize = 13, count = 13 gives pof2 = 8, cnts[0..6] = 1, cnts[7] = 6; commsize = 1600, count = 100000 gives pof2 = 1024, cnts[0..1022] = 97, cnts[1023] = 769.
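To make the imbalance concrete, here is a small standalone sketch (not the MPICH code path itself) that recomputes pof2 and the cnts[] distribution for the examples above; the variable names and output format are mine.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int comm_size = 13, count = 13;   /* also try comm_size = 1600, count = 100000 */

        /* nearest power of two less than or equal to comm_size,
         * as in the current implementation */
        int pof2 = 1;
        while (pof2 <= comm_size)
            pof2 <<= 1;
        pof2 >>= 1;

        /* block sizes used by the reduce-scatter and gather steps */
        int *cnts = malloc(pof2 * sizeof(int));
        for (int i = 0; i < pof2 - 1; i++)
            cnts[i] = count / pof2;
        cnts[pof2 - 1] = count - (count / pof2) * (pof2 - 1);

        printf("pof2 = %d, cnts[0..%d] = %d, cnts[%d] = %d\n",
               pof2, pof2 - 2, cnts[0], pof2 - 1, cnts[pof2 - 1]);
        free(cnts);
        return 0;
    }

For commsize = 13, count = 13 this prints pof2 = 8, cnts[0..6] = 1, cnts[7] = 6, i.e. the last process handles six times as many elements as the others.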
In addition, the initial calculation of cnts[i] and disps[i] before the reduce-scatter step takes O(log(p)) time. The summation of cnts[i] at each step of the reduce-scatter also takes O(log(p)) time. Storing cnts[i] and disps[i] requires O(p) bytes of memory.
I implemented a new version of MPIR_Reduce_redscat_gather following the description in the original papers [1, 2]. Input vectors are halved uniformly, without using the cnts[i] and disps[i] arrays. Thus, memory consumption and computational complexity are reduced.
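For intuition only, a minimal hypothetical sketch of the uniform recursive-halving idea (not the code from the PR): each of the pof2 participating ranks derives its block boundaries on the fly by splitting the current index range roughly in half at every step, so no cnts[]/disps[] arrays are needed. The exchange partners, send/recv calls, and the reduction itself are omitted, and the function and variable names are mine.

    #include <stdio.h>

    /* Hypothetical illustration: derive a rank's block at each recursive-halving
     * step by splitting the current range in half, instead of consulting
     * precomputed cnts[]/disps[] arrays.  rank is assumed to be one of the
     * pof2 participating ranks; communication and reduction are omitted. */
    static void trace_halving(int rank, int pof2, int count)
    {
        int low = 0, high = count;              /* range owned by this rank's group */
        for (int mask = pof2 >> 1; mask > 0; mask >>= 1) {
            int mid = low + (high - low) / 2;   /* split the current range in half */
            if (rank & mask)
                low = mid;                      /* upper half of the group keeps the upper part */
            else
                high = mid;                     /* lower half keeps the lower part */
        }
        printf("rank %d owns [%d, %d): %d element(s)\n", rank, low, high, high - low);
    }

    int main(void)
    {
        int pof2 = 8, count = 13;               /* example from the message above */
        for (int r = 0; r < pof2; r++)
            trace_halving(r, pof2, count);      /* blocks end up floor(count/pof2) or ceil(count/pof2) */
        return 0;
    }

With commsize = 13, count = 13 this assigns every rank 1 or 2 elements, instead of the 6-element block on the last process in the current cnts[] scheme.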
[1] Rajeev Thakur, Rolf Rabenseifner, and William Gropp. Optimization of Collective Communication Operations in MPICH. The International Journal of High Performance Computing Applications, Vol. 19, Issue 1, pp. 49-66.
[2] http://www.hlrs.de/mpi/myreduce.html
With best regards,
Mikhail Kurnosov
Computer Systems Department
Siberian State University of Telecommunications and Information Sciences
Address: 630102, 86 Kirova str., Novosibirsk, Russia
WWW: www.mkurnosov.net
https://gist.github.com/raffenet/ca9692db8b24bbdc2932a186430c29b2