Closed mpichbot closed 7 years ago
Originally by thakur on 2014-05-20 13:10:46 -0500
Also see #1952
Originally by thakur on 2014-05-20 14:03:59 -0500
Also need to check scalability of MPI_Comm_dup.
Originally by jczhang on 2014-05-21 14:27:28 -0500
From the email thread, it seems people blame quicksort because it hits its worst-case O(N^2) complexity on presorted inputs, which are common in MPI_Comm_split. I find it depends on the qsort implementation: the worst case only happens if qsort picks the leftmost or rightmost element as the pivot. If it picks the median, it achieves O(N log N).
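To illustrate the point, here is a standalone sketch (not MPICH's code, and not any libc's qsort): a Lomuto-partition quicksort with a switchable pivot rule and a comparison counter, showing why presorted input is pathological for a leftmost-pivot quicksort but not for median-of-three.

```c
/* Counts key comparisons made by the partition step. */
static long comparisons;

static void swap_int(int *x, int *y) { int t = *x; *x = *y; *y = t; }

static void quicksort(int *a, int lo, int hi, int use_median)
{
    while (lo < hi) {
        int p = lo;                           /* leftmost pivot */
        if (use_median) {                     /* median-of-three pivot */
            int mid = lo + (hi - lo) / 2;
            if ((a[lo] < a[mid]) != (a[lo] < a[hi]))       p = lo;
            else if ((a[mid] < a[lo]) != (a[mid] < a[hi])) p = mid;
            else                                           p = hi;
        }
        swap_int(&a[p], &a[hi]);              /* move pivot out of the way */
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++) {       /* Lomuto partition */
            comparisons++;
            if (a[j] < pivot) swap_int(&a[i++], &a[j]);
        }
        swap_int(&a[i], &a[hi]);              /* pivot into final position */
        /* Recurse into the smaller side and loop on the larger one, so the
         * stack stays O(log N) even in the quadratic case. */
        if (i - lo < hi - i) { quicksort(a, lo, i - 1, use_median); lo = i + 1; }
        else                 { quicksort(a, i + 1, hi, use_median); hi = i - 1; }
    }
}
```

On a presorted array of N distinct keys, the leftmost-pivot variant performs N(N-1)/2 comparisons, while median-of-three stays near N log2 N.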
Originally by thakur on 2014-05-21 14:33:26 -0500
The code calls the C library function qsort(). I don't know whether it is randomized, or whether there is a way to randomize it, to avoid the worst-case complexity.
Originally by apenya on 2014-05-21 14:42:27 -0500
Note that "qsort" is not necessarily implemented as a quicksort; the specs don't require it:
http://www.gnu.org/software/libc/manual/html_node/Array-Sort-Function.html
I seem to remember that current implementations use a merge sort, but I can't find the reference now.
Originally by thakur on 2014-05-21 14:44:51 -0500
It could be compiler dependent as well. The bug report is from BG/Q, probably using the XL compilers. The bug report also mentions the K computer, which might even run a different MPI implementation.
Originally by jczhang on 2014-05-21 14:55:59 -0500
mergesort and heapsort are not in the Standard C library, though I found them on my Mac. Why do we bother with these things? I'd rather copy a quicksort or merge sort implementation into MPICH. See http://www.jbox.dk/sanos/source/lib/qsort.c.html, which is less than 100 lines.
Originally by thakur on 2014-05-21 15:07:04 -0500
The library function was used to avoid introducing additional bugs, and for potentially higher performance if the library writer had optimized it. It is fine to implement a sort there from scratch. Just make sure the sort is stable, as a stable sort is required: the relative order of keys with equal values must be preserved.
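As a sketch of what a hand-rolled stable sort might look like (the `splitrec` record and function names are hypothetical, not MPICH's actual types), here is a bottom-up merge sort. The `<=` in the merge is what makes it stable, so records with equal keys keep their original order, as MPI_Comm_split requires for equal key values.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-process record, loosely modeled on what
 * MPI_Comm_split must order: sort by key, and rely on stability
 * so that equal keys keep their original (rank) order. */
typedef struct { int key; int rank; } splitrec;

/* Stable bottom-up merge sort on an array of n records.
 * Returns 0 on success, -1 on allocation failure. */
static int merge_sort(splitrec *a, int n)
{
    splitrec *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return -1;
    for (int width = 1; width < n; width *= 2) {
        for (int lo = 0; lo < n; lo += 2 * width) {
            int mid = lo + width < n ? lo + width : n;
            int hi = lo + 2 * width < n ? lo + 2 * width : n;
            int i = lo, j = mid, k = lo;
            while (i < mid && j < hi)
                /* <= keeps the left-side element first on ties: stability */
                tmp[k++] = (a[i].key <= a[j].key) ? a[i++] : a[j++];
            while (i < mid) tmp[k++] = a[i++];
            while (j < hi)  tmp[k++] = a[j++];
        }
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
    return 0;
}
```

Unlike quicksort, this is O(N log N) regardless of input order, so presorted input (the common case in Comm_split) costs nothing extra.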
Originally by jczhang on 2014-05-22 07:33:16 -0500
I don't understand why MPI_Comm_dup is also a problem. I looked at the code but did not find any sorting in it.
Originally by thakur on 2014-05-22 07:55:59 -0500
As Sam said in a separate mail that I forwarded: "Btw, it's worth examining mpi_comm_dup as well, as on the K machine, where it was explicitly timed, its time scales the same as comm_split."
We need to test it to see whether there is a problem or not. The problem, if there is one, may not be sorting related.
Originally by thakur on 2014-05-22 13:39:56 -0500
I asked Sam if he had a test program. His reply is below:
"You can use my hpgmg-fv benchmark. https://bitbucket.org/hpgmg/hpgmg/ my FV code is in the ./finite-volume/source subdirectory.
I recently added some timers to measure comm_split/dup directly. These show the maximum time for any comm_split/dup at any level of the v-cycle.
On Intel, you can use the make file or directly compile with cc -O3 -fopenmp level.c operators.7pt.c mg.c solvers.c hpgmg.c timer.omp.c -DUSE_MPI -DUSE_SUBCOMM -DUSE_FCYCLES -DUSE_CHEBY -DUSE_BICGSTAB -o run.avx -DSTENCIL_FUSE_BC
On Edison, I run with:
export OMP_NUM_THREADS=12
aprun -n ### -N 16 -S 8 -ss -cc numa_node ./hpgmg-fv 7 1
where ### is some integer cubed, i.e. 1, 27, 64, 512, ..., 4096, ..., 64000.
You can also run with flat MPI (be sure to set OMP_NUM_THREADS=1) to get more processes per node and hit the potential bottleneck earlier. You might run with ./hpgmg-fv 6 1 to make it a little faster."
Originally by jczhang on 2014-05-22 13:56:19 -0500
Thanks, Rajeev. That is exactly what I wanted to ask for. I have implemented a merge sort for comm_split and have an account on Edison (I am also in the process of getting my crypto card for Mira). I hope I can at least fix the comm_split problem in MPICH 3.1.1 (next Monday?).
Originally by jczhang on 2014-08-19 15:00:22 -0500
We found the problem lies in an IBM PAMI library call, PAMI_Geometry_create_taskrange(), which is unscalable and is called by MPIR_Comm_commit at the very end of Comm_split/dup. We have reported this issue to IBM and hope they will find the root cause. A workaround is to set the environment variable PAMID_COLLECTIVES=0.
Through profiling, we also found that the qsort() called in the MPICH code actually uses the merge sort algorithm in the libc on Mira. So qsort is irrelevant to the problem.
See more discussion in the thread http://lists.mpich.org/pipermail/devel/2014-July/000427.html. See even more discussion and profiling results in the archive of the ibm@lists.mpich.org mailing list, under the subject "MPI_Comm_split/dup scales badly on Mira".
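For reference, the workaround above is applied from the job environment before launching the application; PAMID_COLLECTIVES is the variable named in this thread, and the comment is only illustrative:

```shell
# Disable PAMI-optimized collectives so communicator creation skips the
# unscalable PAMI_Geometry_create_taskrange() path (workaround from above).
export PAMID_COLLECTIVES=0
```

This trades the optimized PAMI collectives for MPICH's generic ones, so it may cost collective performance elsewhere in the run.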
Originally by thakur on 2014-05-19 20:11:04 -0500
See http://lists.mpich.org/pipermail/devel/2014-May/000391.html