
change quick sort to merge/heap sort in MPI_Comm_split #2093

Closed by mpichbot 7 years ago

mpichbot commented 7 years ago

Originally by thakur on 2014-05-19 20:11:04 -0500


See http://lists.mpich.org/pipermail/devel/2014-May/000391.html

mpichbot commented 7 years ago

Originally by thakur on 2014-05-20 13:10:46 -0500


Also see #1952

mpichbot commented 7 years ago

Originally by thakur on 2014-05-20 14:03:59 -0500


Also need to check scalability of MPI_Comm_dup.

mpichbot commented 7 years ago

Originally by jczhang on 2014-05-21 14:27:28 -0500


From the email thread, it seems people blame quick sort because it hits its worst-case O(N^2) complexity on presorted input, which is the common case in MPI_Comm_split. I find it depends on the qsort implementation: the worst case only happens if qsort picks the leftmost or rightmost element as the pivot. If it uses the median, it reaches O(NlogN).
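To illustrate the pivot sensitivity, here is a minimal sketch (not the MPICH code) of a quicksort with median-of-three pivot selection; a version that always takes the first or last element as the pivot degenerates to O(N^2) on presorted keys, while the median of the first, middle, and last elements keeps that case at O(NlogN):

```c
#include <stddef.h>

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Sort arr[lo], arr[mid], arr[hi] among themselves, then move the
 * median to arr[hi] so the partition below uses it as the pivot. */
static void median_of_three(int *arr, size_t lo, size_t hi)
{
    size_t mid = lo + (hi - lo) / 2;
    if (arr[mid] < arr[lo]) swap_int(&arr[mid], &arr[lo]);
    if (arr[hi]  < arr[lo]) swap_int(&arr[hi],  &arr[lo]);
    if (arr[hi]  < arr[mid]) swap_int(&arr[hi], &arr[mid]);
    swap_int(&arr[mid], &arr[hi]);   /* median is now the pivot */
}

/* Quicksort over the inclusive range [lo, hi]; call with (arr, 0, n-1)
 * for n >= 1. On presorted input the median-of-three pivot splits the
 * range roughly in half, avoiding the O(N^2) degeneration. */
static void quicksort(int *arr, size_t lo, size_t hi)
{
    if (lo >= hi) return;
    median_of_three(arr, lo, hi);
    int pivot = arr[hi];
    size_t i = lo;
    for (size_t j = lo; j < hi; j++)
        if (arr[j] < pivot) swap_int(&arr[i++], &arr[j]);
    swap_int(&arr[i], &arr[hi]);
    if (i > lo) quicksort(arr, lo, i - 1);  /* guard size_t underflow */
    quicksort(arr, i + 1, hi);
}
```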

mpichbot commented 7 years ago

Originally by thakur on 2014-05-21 14:33:26 -0500


The code calls the C library function qsort(). I don't know whether it is randomized, or whether there is a way to randomize it, to avoid the worst-case complexity.
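One generic way to randomize it (a sketch for illustration; the MPICH code does not do this) is to shuffle the input before calling qsort(), which makes a presorted or adversarial input pattern unlikely to hit a fixed-pivot worst case:

```c
#include <stdlib.h>

/* Fisher-Yates shuffle: randomizing the input order before qsort()
 * defeats the presorted worst case of a fixed-pivot quicksort.
 * Note this alone destroys stability, so equal keys would still need
 * a tie-breaker (e.g. the original rank) in the comparator. */
static void shuffle(int *a, size_t n)
{
    if (n < 2) return;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}
```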

mpichbot commented 7 years ago

Originally by apenya on 2014-05-21 14:42:27 -0500


Note that "qsort" is not actually implemented as a quick sort currently. The specs don't force it:

http://www.gnu.org/software/libc/manual/html_node/Array-Sort-Function.html

I seem to remember that current implementations use a merge sort, but I can't find the reference now.

mpichbot commented 7 years ago

Originally by thakur on 2014-05-21 14:44:51 -0500


It could be compiler dependent as well. The bug report is from BG/Q, probably using the XL compilers. The bug report also mentions the K computer, which might even be running a different MPI implementation.

mpichbot commented 7 years ago

Originally by jczhang on 2014-05-21 14:55:59 -0500


mergesort and heapsort are not in the standard C library, though I found them on my Mac. Why do we bother with these things? I'd rather copy & paste a quick sort or merge sort implementation into MPICH. See http://www.jbox.dk/sanos/source/lib/qsort.c.html, which is less than 100 lines.

mpichbot commented 7 years ago

Originally by thakur on 2014-05-21 15:07:04 -0500


The library function was used to avoid introducing additional bugs and for potentially higher performance, since the function may have been optimized by the library writer. It is OK to implement a sort there from scratch. We only need to make sure the sort is stable, since a stable sort is required: ordering must be maintained for keys of the same value.
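A minimal sketch of what such a stable sort could look like (hypothetical names, not the eventual MPICH patch): a top-down merge sort over (key, rank) pairs that prefers the left half on equal keys, so ranks with the same key keep their relative order:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical pair type: the key a process passed to MPI_Comm_split
 * plus its rank in the parent communicator. */
typedef struct {
    int key;
    int rank;
} splitpair;

/* Stable top-down merge sort; tmp must hold at least n elements.
 * The merge takes from the right half only on a strictly smaller key,
 * so ties are taken from the left half first, preserving the original
 * rank order for equal keys as MPI_Comm_split requires. */
static void merge_sort(splitpair *a, splitpair *tmp, size_t n)
{
    if (n < 2) return;
    size_t half = n / 2;
    merge_sort(a, tmp, half);
    merge_sort(a + half, tmp, n - half);

    size_t i = 0, j = half, k = 0;
    while (i < half && j < n)
        tmp[k++] = (a[j].key < a[i].key) ? a[j++] : a[i++];
    while (i < half) tmp[k++] = a[i++];
    while (j < n)    tmp[k++] = a[j++];
    memcpy(a, tmp, n * sizeof(*a));
}
```

Unlike quicksort, this is O(NlogN) regardless of the input order, at the cost of the O(N) scratch buffer.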

mpichbot commented 7 years ago

Originally by jczhang on 2014-05-22 07:33:16 -0500


I don't understand why MPI_Comm_dup is also a problem. I looked at the code but did not find any sorting in it.

mpichbot commented 7 years ago

Originally by thakur on 2014-05-22 07:55:59 -0500


As Sam said in a separate mail that I forwarded: "Btw, it's worth examining mpi_comm_dup as well, as on the K machine, where it was explicitly timed, its time scales the same as comm_split."

We need to test it to see whether there is a problem or not. The problem, if there is one, may not be sorting related.
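A minimal standalone timer along these lines (a sketch, not Sam's benchmark; the color choice and names are illustrative) would time both calls at increasing process counts and report the per-rank maximum:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Comm dup_comm, split_comm;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm);
    double t_dup = MPI_Wtime() - t0;

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    /* key = rank gives the presorted input that is the common case
     * in MPI_Comm_split and tickles a naive quicksort's worst case */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &split_comm);
    double t_split = MPI_Wtime() - t0;

    double max_dup, max_split;
    MPI_Reduce(&t_dup, &max_dup, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t_split, &max_split, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("np=%d  max dup=%.6fs  max split=%.6fs\n",
               size, max_dup, max_split);

    MPI_Comm_free(&dup_comm);
    MPI_Comm_free(&split_comm);
    MPI_Finalize();
    return 0;
}
```

If either maximum grows superlinearly with the process count, there is a scalability problem, whether or not it is sorting related.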

mpichbot commented 7 years ago

Originally by thakur on 2014-05-22 13:39:56 -0500


I asked Sam if he had a test program. His reply is below:


"You can use my hpgmg-fv benchmark. https://bitbucket.org/hpgmg/hpgmg/ my FV code is in the ./finite-volume/source subdirectory.

I recently added some timers to measure comm_split/dup directly. These show the maximum time for any comm_split/dup at any level of the v-cycle.

On Intel, you can use the makefile or compile directly with:

```
cc -O3 -fopenmp level.c operators.7pt.c mg.c solvers.c hpgmg.c timer.omp.c \
   -DUSE_MPI -DUSE_SUBCOMM -DUSE_FCYCLES -DUSE_CHEBY -DUSE_BICGSTAB \
   -o run.avx -DSTENCIL_FUSE_BC
```

On Edison, I run with:

```
export OMP_NUM_THREADS=12
aprun -n ### -N 16 -S 8 -ss -cc numa_node ./hpgmg-fv 7 1
```

where ### is some integer cubed, i.e. 1, 27, 64, 512, ..., 4096, ..., 64000.

You can also run with flat MPI (be sure to set OMP_NUM_THREADS=1) to get more processes per node and hit the potential bottleneck earlier. You might run with ./hpgmg-fv 6 1 to make it a little faster."

mpichbot commented 7 years ago

Originally by jczhang on 2014-05-22 13:56:19 -0500


Thanks, Rajeev. That is exactly what I wanted to ask for. I have implemented a merge sort for comm_split and have an account on Edison (I am also in the process of getting my crypto card for Mira). I hope I can at least fix the comm_split problem in MPICH 3.1.1 (next Monday?).

mpichbot commented 7 years ago

Originally by jczhang on 2014-08-19 15:00:22 -0500


We found that the problem lies in an IBM PAMI library call, PAMI_Geometry_create_taskrange(), which is unscalable and is called by MPIR_Comm_commit at the very end of Comm_split/Comm_dup. We have already told IBM about this issue and hope they will find the cause. A workaround is to set the environment variable PAMID_COLLECTIVES=0.
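For reference, a sketch of applying the workaround at job launch; the exact flags are assumptions based on typical Cobalt qsub and BG/Q runjob usage, not something prescribed in this thread:

```
# Disable the PAMI-optimized collectives so MPICH falls back to its own
# implementations, avoiding the unscalable geometry creation.
qsub -n 512 -t 30 --env PAMID_COLLECTIVES=0 ./a.out
# or, with runjob directly:
runjob --np 8192 --envs PAMID_COLLECTIVES=0 : ./a.out
```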

Through profiling, we also found that the qsort() called in the MPICH code actually uses the merge sort algorithm in the libc on Mira, so qsort is irrelevant to the problem.

See more discussion in the thread http://lists.mpich.org/pipermail/devel/2014-July/000427.html. See even more discussion and profiling results in the archive of the mailing list ibm@lists.mpich.org, under the subject "MPI_Comm_split/dup scales badly on Mira".