nerscadmin / IPM

Integrated Performance Monitoring for High Performance Computing
http://ipm-hpc.org
GNU Lesser General Public License v2.1

Huge overhead of IPM #29

Open dkuzmin72 opened 6 years ago

dkuzmin72 commented 6 years ago

Profiling an application running on 1024 processes with IPM 2.0.6, we see roughly 5x overhead. Analysis showed that about 50% of the application's time was spent in PMPI_Group_compare(), called from PMPI_Comm_compare(). The application issues a large number of MPI_Isend() calls on a communicator created by MPI_Cart_create(). Even though the new communicator has the same size and the same process placement as MPI_COMM_WORLD, MPI_Comm_compare() does not return MPI_IDENT, so a lot of compute time is burned on the comparison (possibly the PMPI_Group_compare() algorithm is not optimal). Since we still need to call PMPI_Group_translate_ranks() anyway, we could simply compare the communicator handle with MPI_COMM_WORLD instead. Something like this (mod_mpi.h):
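
A sketch of the change (the exact diff is in a later comment; the do/while wrapper and the translate-ranks fallback in the else branch are assumed to stay as they are in the existing macro):

#define IPM_MPI_MAP_RANK(rank_out_, rank_in_, comm_) \
  do { \
    /* Fast path: same handle as MPI_COMM_WORLD, or a wildcard source, \
       means no rank translation is needed at all. */ \
    if (comm_ == MPI_COMM_WORLD || rank_in_ == MPI_ANY_SOURCE) { \
      rank_out_ = rank_in_; \
    } else { \
      /* Otherwise fall back to the existing group translation via \
         PMPI_Group_translate_ranks(). */ \
      MPI_Group group_; \
      ... \
    } \
  } while (0)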

This modification significantly reduces the overhead, but it is still large:

wallclock time with IPM 2.0.2: 2470 s
wallclock time with IPM 2.0.6 (with the modification above): 3400 s

Regards! ---Dmitry

cdaley commented 6 years ago

Thanks for reporting the issue Dmitry.

If possible, can you please send a diff of your fix so we can be sure we understand your changes correctly?

Also, it would be really helpful if you can point us to the simplest application or benchmark you have which reproduces the slow performance.

Thanks, Chris.

dkuzmin72 commented 6 years ago

Hi Chris! The diff is below. Unfortunately we don't have a simple application: it needs to create communicators, and the communicators have to be quite big. I also got complaints from colleagues, for example: "I've noticed NAMD having the same issue. With one of the datasets that I'm running, without IPM it finished in 21 minutes, while with IPM I'm now estimating at least 9 hours to complete." NAMD uses Charm++, which exploits MPI_Isend a lot.

NAMD: http://www.ks.uiuc.edu/Research/namd/development.html
example1: http://www.ks.uiuc.edu/Research/namd/utilities/stmv.tar.gz
example2: http://www.ks.uiuc.edu/Research/namd/utilities/apoa1.tar.gz

I understand that it takes a lot of time to build and run the NAMD examples. It may be easier to create a simple example that creates a communicator with MPI_Cart_create and issues a lot of MPI_Isend calls. We used OpenMPI for testing.

$git diff
diff --git a/include/mod_mpi.h b/include/mod_mpi.h
index 135a558..a03b676 100755
--- a/include/mod_mpi.h
+++ b/include/mod_mpi.h
@@ -27,9 +27,7 @@ extern MPI_Group ipm_world_group;

 #define IPM_MPI_MAP_RANK(rank_out_, rank_in_, comm_) \
   do { \
-    int comm_cmp_; \
-    PMPI_Comm_compare(MPI_COMM_WORLD, comm_, &comm_cmp_); \
-    if (comm_cmp_ == MPI_IDENT || rank_in_ == MPI_ANY_SOURCE) { \
+    if (comm_ == MPI_COMM_WORLD || rank_in_ == MPI_ANY_SOURCE) { \
       rank_out_=rank_in_; \
     } else { \
       MPI_Group group_; \
cdaley commented 6 years ago

Thanks. I'm happy to build and run NAMD. It usually works better to use the production application rather than creating a synthetic benchmark without first referring to the production application. Can you give me the smallest and shortest running NAMD test problem which has high IPM performance overhead (even if it is not scientifically meaningful)? I'm also a little puzzled because I thought NAMD uses Charm++ and not MPI for communication. Perhaps we can move our conversation to email?

Chris

nerscadmin commented 6 years ago

Sorry to chime in late. Sounds a little like NAMD woes of years gone by.

Especially where multiple COMMs are involved, you can overrun the hash table (or send it into a woeful state) if the call pressure becomes too high.

Especially for an MPI call you don't care about, e.g. one that isn't pushing data, one "out" is simply to de-stub it and use the underlying call directly, i.e. don't name-shift the problem call.
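
To illustrate the idea (my sketch, not IPM's actual generated wrapper code), de-stubbing a call would mean providing a plain passthrough so the name-shifted MPI_* entry point does no IPM bookkeeping at all:

#include <mpi.h>

/* Hypothetical passthrough wrapper for a call we choose not to profile:
   no hashing, no rank mapping, just forward to the PMPI entry point. */
int MPI_Comm_rank(MPI_Comm comm, int *rank)
{
    return PMPI_Comm_rank(comm, rank);
}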

Best,

David

dkuzmin72 commented 6 years ago

Hi, I created a small test case that shows the difference in performance: comm.txt. I ran it on 29 nodes with 900 processes (you need a large core count to see the overhead of MPI_Comm_compare). With the original code I got a wallclock of around 1.4-1.5 s, while with the modified code I got 1.15-1.20 s. How to reproduce:

  1. rename comm.txt into comm.c
  2. compile: mpicc -o comm comm.c
  3. run: LD_PRELOAD=libipm.so mpirun -np 900 ./comm

You can measure the time for MPI_Isend() and MPI_Comm_compare() separately to understand the overhead.

Again, it is hardly possible to see anything with a small core count.
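
(The attached comm.txt is not reproduced in this thread; the following is only a sketch of the pattern described above, with the loop count, message size, and neighbor exchange chosen arbitrarily for illustration.)

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* 1-D Cartesian communicator with the same size and process
       placement as MPI_COMM_WORLD (reorder = 0); only the handle differs. */
    MPI_Comm cart;
    int dims[1] = { size }, periods[1] = { 1 };
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &cart);

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    /* Many small non-blocking sends on the Cartesian communicator;
       every MPI_Isend goes through IPM's rank-mapping macro. */
    for (int i = 0; i < 1000; i++) {
        int sbuf = rank, rbuf;
        MPI_Request sreq, rreq;
        MPI_Irecv(&rbuf, 1, MPI_INT, left, 0, cart, &rreq);
        MPI_Isend(&sbuf, 1, MPI_INT, right, 0, cart, &sreq);
        MPI_Wait(&sreq, MPI_STATUS_IGNORE);
        MPI_Wait(&rreq, MPI_STATUS_IGNORE);
    }

    MPI_Barrier(cart);
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}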

Regards! ---Dmitry

cdaley commented 6 years ago

Thanks Dmitry,

I ran your comm.c application on Intel KNL nodes of the Cori supercomputer (my only customization was to add an additional timer between MPI_Init and MPI_Finalize to measure run time with and without IPM). Cori has cray-mpich-7.6.2. I used 15 nodes with 68 MPI ranks per node to give a total of 1020 MPI ranks. I found minimal overhead added by IPM in this configuration:
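
The extra timer isn't shown in the thread; a minimal sketch of the idea, assuming MPI_Wtime immediately after MPI_Init and immediately before MPI_Finalize:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();

    /* ... the body of comm.c runs here ... */

    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("time between MPI_Init and MPI_Finalize: %.2f s\n", t1 - t0);

    MPI_Finalize();
    return 0;
}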

Without IPM: time between MPI_Init and MPI_Finalize = 0.24 seconds
With IPM: time between MPI_Init and MPI_Finalize = 0.11 seconds; IPM wallclock = 0.53 seconds
(The IPM wallclock is higher than my custom timer because the IPM time includes the time that IPM spends in MPI_Finalize.)

I then built OpenMPI-3.0.0 on Cori. There is now a definite slowdown when using IPM:

Run 1: Without IPM: time between MPI_Init and MPI_Finalize = 0.70 seconds. With IPM: time between MPI_Init and MPI_Finalize = 5.30 seconds; IPM wallclock = 26.05 seconds.
Run 2: Without IPM: time between MPI_Init and MPI_Finalize = 0.64 seconds. With IPM: time between MPI_Init and MPI_Finalize = 5.72 seconds; IPM wallclock = 26.51 seconds.

I will investigate further. Which version of MPI did you use? There is no monitored MPI_Comm_compare call in either of my configurations.

See OpenMPI results:

#
# command   : /global/cscratch1/sd/csdaley/ipm-overhead/openmpi/./comm.ipm 
# start     : Tue Mar 06 11:08:37 2018   host      : nid12452        
# stop      : Tue Mar 06 11:09:03 2018   wallclock : 26.05
# mpi_tasks : 1020 on 15 nodes           %comm     : 20.01
# mem [GB]  : 31.66                      gflop/sec : 0.00
#
#           :       [total]        <avg>          min          max
# wallclock :      26517.39        26.00        25.96        26.05 
# MPI       :       5307.36         5.20         0.20         5.25 
# %wall     :
#   MPI     :                      20.01         0.77        20.17 
# #calls    :
#   MPI     :          9178            8            8         1026
# mem [GB]  :         31.66         0.03         0.03         0.03 
#
#                             [time]        [count]        <%wall>
# MPI_Wait                   2683.75           1019          10.12
# MPI_Barrier                2622.88           1020           9.89
# MPI_Irecv                     0.44           1019           0.00
# MPI_Isend                     0.20           1019           0.00
# MPI_Comm_free                 0.09           1020           0.00
# MPI_Comm_rank                 0.00           1020           0.00
# MPI_Comm_size                 0.00           1020           0.00
# MPI_Waitall                   0.00              1           0.00
# MPI_Init                      0.00           1020           0.00
# MPI_Finalize                  0.00           1020           0.00
#
###################################################################
dkuzmin72 commented 6 years ago

Hi Chris,

I haven't tried Intel MPI yet; we used OpenMPI. They may have different algorithms for MPI_Comm_compare.

Each IPM_MPI_* function has this macro: IPM_MPI_RANK_DEST_C(irank), where IPM_MPI_RANK_DEST_C is:

#define IPM_MPI_RANK_DEST_C(rank) IPM_MPI_MAP_RANK(rank, dest, comm_in);

IPM_MPI_MAP_RANK is:

#define IPM_MPI_MAP_RANK(rank_out_, rank_in_, comm_) \
  do { \
    int comm_cmp_; \
    PMPI_Comm_compare(MPI_COMM_WORLD, comm_, &comm_cmp_);

Calling PMPI_Comm_compare for every intercepted call leads to huge overhead.
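
For context (my own illustration, not from the thread): MPI_Comm_compare returns MPI_IDENT only when both handles refer to the same communicator object, so a Cartesian communicator that merely matches MPI_COMM_WORLD in size and rank order compares as MPI_CONGRUENT. The MPI_IDENT fast path therefore never triggers for it, and the comparison itself (via PMPI_Group_compare) is expensive at scale. A small check:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Comm cart;
    int dims[1] = { size }, periods[1] = { 0 };
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &cart);

    int result;
    MPI_Comm_compare(MPI_COMM_WORLD, cart, &result);

    /* Expect MPI_CONGRUENT here, not MPI_IDENT: same group and rank
       order, but a different communicator object. */
    if (result == MPI_IDENT)
        printf("MPI_IDENT\n");
    else if (result == MPI_CONGRUENT)
        printf("MPI_CONGRUENT\n");
    else
        printf("other (%d)\n", result);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}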

Yes, we can move the rest of the conversation to email. I have made my email address public.

Regards! ---Dmitry

paklui commented 5 years ago

Hi Dmitry and Chris, I am also seeing very high overhead when using IPM to profile. Dmitry's suggested fix is working for me too. It seems like it has been some time since this project was last updated; is there a chance of merging Dmitry's fix? Thanks

lcebaman commented 3 years ago

I see this issue has been open for a while now. What is the status of this fix?