intel / mpi-benchmarks

Performance regression using SGI MPT library #17

Closed: adrianjhpc closed this issue 3 years ago

adrianjhpc commented 5 years ago

I'm seeing a performance regression in some of the benchmarks between commits c3ef058515b0f1c4d1d26d031243cade7f174bf1 and ebb564671ce52fc208b591f291714798daa35447.

When running benchmarks between two processes on the same node but on different sockets, like this:

mpiexec_mpt -ppn 2 -n 2 omplace -nt 32 ./IMB-MPI1 biband -npmin 2

I get this performance for the old source code:

#---------------------------------------------------
# Benchmarking Biband
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions   Mbytes/sec      Msg/sec
            0         1000         0.00      4988798
            1         1000         6.37      6373589
            2         1000        13.62      6810416
            4         1000        25.55      6388573
            8         1000        50.81      6351491
           16         1000       101.49      6342847
           32         1000       203.61      6362683
           64         1000       378.28      5910609
          128         1000       293.17      2290367
          256         1000       624.56      2439692
          512         1000      1212.22      2367611
         1024         1000      2078.85      2030125
         2048         1000      8515.61      4158012
         4096         1000     15032.28      3669991
         8192         1000     25693.00      3136352
        16384         1000     25555.42      1559779
        32768         1000     25122.75       766685
        65536          640     34725.89       529875
       131072          320     31540.43       240634
       262144          160     18311.99        69855
       524288           80     15432.22        29435
      1048576           40     14296.57        13634
      2097152           20     14836.32         7075
      4194304           10     14914.15         3556

And this performance for the newer version of IMB:

#---------------------------------------------------
# Benchmarking Biband
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions   Mbytes/sec      Msg/sec
            0         1000         0.00      5147985
            1         1000         6.29      6290787
            2         1000        14.39      7193543
            4         1000        26.45      6613563
            8         1000        53.03      6629307
           16         1000       102.32      6395217
           32         1000       207.49      6483996
           64         1000       440.25      6878981
          128         1000       312.46      2441079
          256         1000       608.54      2377097
          512         1000      1200.30      2344330
         1024         1000      1927.01      1881845
         2048         1000      8097.04      3953630
         4096         1000      4591.55      1120983
         8192         1000      5532.52       675356
        16384         1000      5822.15       355356
        32768         1000      5488.27       167489
        65536          640      5508.57        84054
       131072          320      4271.12        32586
       262144          160      4208.83        16055
       524288           80      4139.70         7896
      1048576           40      4085.87         3897
      2097152           20      4054.54         1933
      4194304           10      4042.60          964

I'm using the same compilers for both source trees and the network setup is unchanged. Is this expected with the latest IMB?

thanks

SergeyGubanov commented 5 years ago

Hi @adrianjhpc,

Thank you for the question. This is most likely because the latest IMB allocates its message buffers with MPI_Alloc_mem / MPI_Free_mem instead of the system's malloc / free, and SGI MPT's implementation of MPI_Alloc_mem / MPI_Free_mem may have peculiarities that account for the difference.
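
For illustration, here is a minimal sketch of the two allocation paths (not the actual IMB code; the buffer size and the benchmark loop are placeholders):

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Aint size = 4 * 1024 * 1024;  /* example 4 MiB message buffer */
    void *buf = NULL;

    /* Older IMB versions obtained buffers from the system allocator. */
    buf = malloc((size_t)size);
    /* ... run benchmark iterations with buf ... */
    free(buf);

    /* Newer IMB versions request buffers from the MPI library instead,
       which lets the implementation return specially prepared memory
       (e.g. pre-registered for RDMA). How MPT satisfies this request may
       differ from plain malloc, which could explain the bandwidth gap. */
    MPI_Alloc_mem(size, MPI_INFO_NULL, &buf);
    /* ... run benchmark iterations with buf ... */
    MPI_Free_mem(buf);

    MPI_Finalize();
    return 0;
}
```

One way to confirm this is to rebuild the newer IMB with the buffer allocation switched back to malloc / free and see whether the old numbers come back.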