open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

AVX-based MPI_OP performance regression #8334

Open rajachan opened 3 years ago

rajachan commented 3 years ago

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.0

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From source tarball, default configuration built with GCC 4.8.5.

Please describe the system on which you are running


Details of the problem

We noticed an OS-specific regression with LAMMPS (in.chute.scaled case) with 4.1.0. Bisecting through the commits, this seems to have been introduced with the AVX-based MPI_OP changes that got backported into this series. Specifically, the commit which moved to the unaligned SSE memory access primitives for reduce OPs seems to be causing it: https://github.com/open-mpi/ompi/pull/7957

This was added to address the Accumulate issue, so it is a necessary correctness fix (https://github.com/open-mpi/ompi/issues/7954)

The actual PR which introduced the SSE-based MPI_OP in the first place was backported from master: https://github.com/open-mpi/ompi/pull/7935

Broadly, allreduce performance seems to have taken a hit in 4.1.0 compared to 4.0.5 in this environment because of these changes. We do not see this with Amazon Linux 2 (which has a 7.x series GCC) or Ubuntu 18, for instance.

Tried with https://github.com/open-mpi/ompi/pull/8322 just in case, that does not help either.

@bosilca does anything obvious stand out to you?

ggouaillardet commented 3 years ago

@rajachan how much performance degradation did you measure?

My understanding is that on recent processors, unaligned load/store are as efficient as aligned load/store when the data is aligned.

Is it fair to say that the performance hit is caused by the combined use of unaligned load/store and gcc 4.8.5? (e.g. no performance hit when running the very same code on the very same hardware with GCC >= 7)
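For illustration only (a sketch, not OMPI's actual op/avx code, and compiled with -mavx), the change under discussion is essentially the choice between aligned and unaligned vector load/store intrinsics, shown here with the 256-bit AVX variants; on recent x86 parts the unaligned forms typically cost the same when the pointer happens to be aligned:

#include <immintrin.h>

/* Sum n doubles from a into b, 4 at a time (n assumed to be a multiple of 4). */
static void sum_unaligned(const double *a, double *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);   /* tolerates any alignment */
        __m256d vb = _mm256_loadu_pd(b + i);
        _mm256_storeu_pd(b + i, _mm256_add_pd(va, vb));
    }
}

static void sum_aligned(const double *a, double *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m256d va = _mm256_load_pd(a + i);    /* requires 32-byte alignment */
        __m256d vb = _mm256_load_pd(b + i);
        _mm256_store_pd(b + i, _mm256_add_pd(va, vb));
    }
}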

shijin-aws commented 3 years ago

@ggouaillardet it's a ~10% degradation. The average loop time of lammps (in seconds) jumps from 7.7 to 8.6.

ggouaillardet commented 3 years ago

@shijin-aws thanks,

Well, 10% of the user app (i.e., not 10% of the MPI_Reduce() performance) looks like a pretty massive hit.

what if you disable the op/avx component (e.g. mpirun --mca op ^avx ...) ?

shijin-aws commented 3 years ago

@ggouaillardet Using mpirun --mca op ^avx brings the performance back to the level of Open MPI 4.0.5 (7.7 seconds).

ggouaillardet commented 3 years ago

@shijin-aws Thanks

Can you please confirm GCC 4.8.5 is to blame here? In that case, should we simply not build the op/avx component with GCC < 7?

shijin-aws commented 3 years ago

@ggouaillardet Sure, @rajachan suggested the same thing. I am trying to build it with gcc7 on my machine to confirm the root cause is gcc 4.8.5.

bosilca commented 3 years ago

you can always use the reduce_local on the test/datatype with the type and op of your liking to see if there is any performance degradation in the AVX part of the reduction operation.

On a skylake machine I see a difference of about 15% between the double SUM local reduction operation compiled with gcc 4.8.5 and gcc 10.2.0. Looking a little deeper into this it seems that gcc 4.8.5 does not understand the -march=skylake-avx512 option, so AVX512 is always turned off.
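For anyone who wants to poke at just the local reduction outside the full test, a minimal standalone timing loop in the same spirit might look like the sketch below (an illustration, not test/datatype/reduce_local.c; the count and repeats values are arbitrary). Running it once normally and once with --mca op ^avx shows whether the op itself regressed.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int count = 524288, repeats = 400;
    double *in  = malloc(count * sizeof(double));
    double *out = malloc(count * sizeof(double));

    MPI_Init(&argc, &argv);
    for (int i = 0; i < count; i++) { in[i] = (double)i; out[i] = 1.0; }

    double t0 = MPI_Wtime();
    for (int r = 0; r < repeats; r++)
        MPI_Reduce_local(in, out, count, MPI_DOUBLE, MPI_SUM);  /* double SUM, as above */
    double t1 = MPI_Wtime();

    printf("MPI_SUM MPI_DOUBLE count %d: %g s per reduction\n", count, (t1 - t0) / repeats);
    MPI_Finalize();
    free(in);
    free(out);
    return 0;
}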

ggouaillardet commented 3 years ago

@bosilca I am afraid this is a different issue.

The reported issue is op/avx is slower than the default op component on skylake with gcc 4.8.5

I applied the patch below to the reduce_local test: move MPI_Wtime(), add -d <alignments> (typically reduce_local -d 1 to only use aligned data), and implement -r <repeats> to make the test longer.

Then I found op/avx is a bit faster than the default component (and yes, op/avx is only using AVX2 because gcc 4.8.5 does not know skylake)

diff --git a/test/datatype/reduce_local.c b/test/datatype/reduce_local.c
index 97890f9..9a69b1b 100644
--- a/test/datatype/reduce_local.c
+++ b/test/datatype/reduce_local.c
@@ -115,11 +115,12 @@ do { \
     const TYPE *_p1 = ((TYPE*)(INBUF)), *_p3 = ((TYPE*)(CHECK_BUF)); \
     TYPE *_p2 = ((TYPE*)(INOUT_BUF)); \
     skip_op_type = 0; \
-    for(int _k = 0; _k < min((COUNT), 4); +_k++ ) { \
-        memcpy(_p2, _p3, sizeof(TYPE) * (COUNT)); \
-        tstart = MPI_Wtime(); \
-        MPI_Reduce_local(_p1+_k, _p2+_k, (COUNT)-_k, (MPITYPE), (MPIOP)); \
-        tend = MPI_Wtime(); \
+    tstart = MPI_Wtime(); \
+    for(int _k = 0; _k < min((COUNT), d); +_k++ ) { \
+        for(int _r = 0; _r < repeats; _r++) { \
+            memcpy(_p2, _p3, sizeof(TYPE) * (COUNT)); \
+            MPI_Reduce_local(_p1+_k, _p2+_k, (COUNT)-_k, (MPITYPE), (MPIOP)); \
+        } \
         if( check ) { \
             for( i = 0; i < (COUNT)-_k; i++ ) { \
                 if(((_p2+_k)[i]) == (((_p1+_k)[i]) OPNAME ((_p3+_k)[i]))) \
@@ -131,6 +132,7 @@ do { \
             } \
         } \
     } \
+    tend = MPI_Wtime(); \
     goto check_and_continue; \
 } while (0)

@@ -163,15 +165,21 @@ int main(int argc, char **argv)
 {
     static void *in_buf = NULL, *inout_buf = NULL, *inout_check_buf = NULL;
     int count, type_size = 8, rank, size, provided, correctness = 1;
-    int repeats = 1, i, c;
+    int repeats = 1, i, c, d = 4;
     double tstart, tend;
     bool check = true;
     char type[5] = "uifd", *op = "sum", *mpi_type;
     int lower = 1, upper = 1000000, skip_op_type;
     MPI_Op mpi_op;

-    while( -1 != (c = getopt(argc, argv, "l:u:t:o:s:n:vfh")) ) {
+    while( -1 != (c = getopt(argc, argv, "d:l:u:t:o:s:n:vr:fh")) ) {
         switch(c) {
+        case 'd':
+            d = atoi(optarg);
+            if( d < 1 ) {
+                fprintf(stderr, "Disalignment must be greater than zero\n");
+                exit(-1);
+            }
         case 'l':
             lower = atoi(optarg);
             if( lower <= 0 ) {
shijin-aws commented 3 years ago

@ggouaillardet @rajachan I rebuilt the application with gcc/g++ 7.2.1 on the same OS (alinux1), but the performance does not go back to the level of Open MPI 4.0.5.

rajachan commented 3 years ago

Here's what I am seeing with vanilla reduce_local and 32-bit float sums:

count     gcc482 (seconds)   gcc721 (seconds)
1         0.000017           0.000011
2         0                  0
4         0                  0
8         0                  0
16        0                  0
32        0                  0
64        0                  0
128       0                  0
256       0                  0
512       0.000001           0
1024      0.000001           0.000001
2048      0.000002           0.000001
4096      0.000003           0.000002
8192      0.000007           0.000004
16384     0.000013           0.000008
32768     0.000025           0.000015
65536     0.000052           0.000032
131072    0.000109           0.000074
262144    0.000221           0.000161
524288    0.000457           0.00034

jsquyres commented 3 years ago

So are we coming down to determining that this is a compiler issue? I.e., certain versions of gcc give terrible performance?

If so, is there a way we can detect this in configure and react appropriately?

rajachan commented 3 years ago

That's what it is looking like to me. I'm going to try @shijin-aws's test with the actual application again to make sure he wasn't inadvertently running with the older compiler.

rajachan commented 3 years ago

I've reproduced @shijin-aws's observation. With the LAMMPS application, even with the newer gcc (7.2.1), runs using op/avx perform poorer than the ones without.

$ /shared/ompi/install/bin/mpirun --mca op ^avx -n 1152 -N 36 -hostfile /shared/ompi/hfile /shared/lammps/bin/lmp -in /shared/lammps/bin/in.chute.scaled -var x 90 -var y 90
Loop time of 7.94407 on 1152 procs for 100 steps with 259200000 atoms

$ /shared/ompi/install/bin/mpirun --mca op avx -n 1152 -N 36 -hostfile /shared/ompi/hfile /shared/lammps/bin/lmp -in /shared/lammps/bin/in.chute.scaled -var x 90 -var y 90
Loop time of 8.95102 on 1152 procs for 100 steps with 259200000 atoms

$ gcc --version
gcc (GCC) 7.2.1 20170915 (Red Hat 7.2.1-2)

From OMPI config log:

MCA_BUILD_OP_AVX2_FLAGS='-mavx2'
MCA_BUILD_OP_AVX512_FLAGS='-march=skylake-avx512'
MCA_BUILD_OP_AVX_FLAGS='-mavx'
MCA_BUILD_ompi_op_avx_DSO_FALSE='#'
MCA_BUILD_ompi_op_avx_DSO_TRUE=''
MCA_BUILD_ompi_op_has_avx2_support_FALSE='#'
MCA_BUILD_ompi_op_has_avx2_support_TRUE=''
MCA_BUILD_ompi_op_has_avx512_support_FALSE='#'
MCA_BUILD_ompi_op_has_avx512_support_TRUE=''
MCA_BUILD_ompi_op_has_avx_support_FALSE='#'
MCA_BUILD_ompi_op_has_avx_support_TRUE=''
MCA_ompi_op_ALL_COMPONENTS=' avx'
MCA_ompi_op_ALL_SUBDIRS=' mca/op/avx'
MCA_ompi_op_DSO_COMPONENTS=' avx'
MCA_ompi_op_DSO_SUBDIRS=' mca/op/avx'
#define OMPI_MCA_OP_HAVE_AVX512 1
#define OMPI_MCA_OP_HAVE_AVX2 1
#define OMPI_MCA_OP_HAVE_AVX 1

Looks like there's more to it than the compiler versions and their AVX support.

ggouaillardet commented 3 years ago

@rajachan thanks for confirming there is more to it than the gcc version.

Would you be able to reproduce this issue with a smaller config? (ideally one node and 24 MPI tasks)

rajachan commented 3 years ago

Yup, it is more evident with a single-node run.

with op/avx ( --mca op avx -n 24 -N 24): Loop time of 373.581 on 24 procs for 100 steps with 259200000 atoms

without op/avx ( --mca op ^avx -n 24 -N 24): Loop time of 312.945 on 24 procs for 100 steps with 259200000 atoms

Times are in seconds. Will run it through a profiler.

rajachan commented 3 years ago

Stating the obvious with some pretty charts, but the mpiP profile from the run without op/avx shows the aggregate AllReduce cost across ranks: [chart attached]

And the run with op/avx: [chart attached]

Here are the mpiP profiles from the two runs and some more charts in case you want to look it over. I'll take a closer look too. lammps-avx.zip

bosilca commented 3 years ago

This is totally puzzling. Assuming we are pointing toward the AVX support as the culprit behind this performance regression, I went ahead and tested just the MPI_OP and I am unable to replicate it anywhere. I've tried skylake with gcc 4.8.5, 7.0.2 and 10.2.0. Again, I have not looked at the performance of the MPI_Allreduce collective, but specifically at the performance of the MPI_OP.

As it was not clear from the discussion which particular MPI_Allreduce introduced the issue, a quick grep in the lammps code highlights 2 operations that stand out: sum and max on doubles. I also modified the reduce_local test to be able to test specific shifts or misalignments of the buffers to see if that could be the issue. Unfortunately, all these efforts were in vain; nothing unusual popped up, and performance usually looks 15-20% better when AVX is turned on, for both sum and max, and for all of the compilers mentioned above.

It would be great if you could run the same tests on your setup. You will need to patch your code with 20be3fc25713ac (from #8322), and run mpirun -np 1 ./reduce_local -o max,sum -t d -r 400 -l 131072 -u 524288 -i 4. You should get a list of 4 timings (because of the -i 4), each one for a different shift in the first position of the input/output buffers. You should also enable or disable the avx component to see the difference.
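The shift test is roughly the idea sketched below (an illustration of the concept, not the patched test itself; it assumes double buffers and a repeats count set up as in a reduce_local-style harness): offsetting both buffers by k elements changes their alignment, so the vectorized code path has to handle start addresses that are not 32-byte aligned.

#include <mpi.h>
#include <stdio.h>

/* Sketch only: time the same local reduction at several buffer shifts. */
static void time_shifted_reductions(double *in, double *out, int count, int repeats)
{
    for (int k = 0; k < 4; k++) {                   /* four shifts, as with -i 4 */
        double t0 = MPI_Wtime();
        for (int r = 0; r < repeats; r++)
            MPI_Reduce_local(in + k, out + k, count - k, MPI_DOUBLE, MPI_SUM);
        double t1 = MPI_Wtime();
        printf("shift %d: %g s per reduction\n", k, (t1 - t0) / repeats);
    }
}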

ggouaillardet commented 3 years ago

same here, with the enhanced test and pinning the process, op/avx is faster than the base component on a skylake processor with gcc 4.8.5 (that is only AVX2 capable).

make sure the -bind-to core option is passed to the mpirun command line (or simply taskset -c 0 ./reduce_local if running in singleton mode)

bosilca commented 3 years ago

My previous tests were looking at the performance of a single MPI_OP running undisturbed on the machine, so I thought maybe the issue is not in the MPI_OP itself but comes from running multiple of these MPI_OPs simultaneously. So I ran the OSU allreduce test on all the skylake cores I had access to, and it reflected the same finding as above: the AVX version is 5.7% faster than the non-AVX one (2092.11 us vs. 2170.22 us) for the code compiled with gcc 4.8.5.

rajachan commented 3 years ago

With 20be3fc from #8322 cherry-picked on v4.1.x and GCC 4.8.5:

op/avx excluded:

MPI_MAX    MPI_DOUBLE 8           count  131072      time (seconds / shifts) 0.00033371 0.00033379 0.00033351 0.00033370
MPI_MAX    MPI_DOUBLE 8           count  262144      time (seconds / shifts) 0.00067200 0.00067364 0.00067183 0.00067191
MPI_MAX    MPI_DOUBLE 8           count  524288      time (seconds / shifts) 0.00134448 0.00134368 0.00134335 0.00134373

MPI_SUM    MPI_DOUBLE 8           count  131072      time (seconds / shifts) 0.00027357 0.00027368 0.00027386 0.00027367
MPI_SUM    MPI_DOUBLE 8           count  262144      time (seconds / shifts) 0.00055431 0.00055400 0.00055398 0.00055392
MPI_SUM    MPI_DOUBLE 8           count  524288      time (seconds / shifts) 0.00110281 0.00110347 0.00110247 0.00110312

op/avx included:

MPI_MAX    MPI_DOUBLE 8           count  131072      time (seconds / shifts) 0.00020832 0.00022310 0.00022270 0.00022301
MPI_MAX    MPI_DOUBLE 8           count  262144      time (seconds / shifts) 0.00043216 0.00045851 0.00045999 0.00045923
MPI_MAX    MPI_DOUBLE 8           count  524288      time (seconds / shifts) 0.00086542 0.00091259 0.00091225 0.00091096

MPI_SUM    MPI_DOUBLE 8           count  131072      time (seconds / shifts) 0.00021096 0.00022439 0.00022389 0.00022375
MPI_SUM    MPI_DOUBLE 8           count  262144      time (seconds / shifts) 0.00044044 0.00046373 0.00046331 0.00046293
MPI_SUM    MPI_DOUBLE 8           count  524288      time (seconds / shifts) 0.00085983 0.00091085 0.00091748 0.00091429

The reproduction seems limited to the application's pattern. I will take a closer look at LAMMPS usage of the collective today.

rajachan commented 3 years ago

I've been looking at this for a while now, and I still do not have a silver bullet here. Just the use of AVX for the op seems to be slowing things down. Poking around literature, I see several references to frequency scaling caused by heavy use of AVX on multiple cores simultaneously, and that causing slowdowns. Is this something you are aware of, and could that be a probable cause?

https://dl.acm.org/doi/10.1145/3409963.3410488 https://arxiv.org/pdf/1901.04982.pdf

I am just running the benchmark case that comes with lammps, in case you want to give it a try on your end: https://github.com/lammps/lammps/blob/stable_12Dec2018/bench/in.chain.scaled

Like I mentioned earlier, I can reproduce this with newer versions of GCC and a single compute instance.

ggouaillardet commented 3 years ago

@rajachan thanks for the report.

Frequency scaling is indeed a documented drawback of AVX that, in the worst case, slows things down, especially on a loaded system. I guess we could add a runtime parameter to set the max AVX flavor to be used. For example, SSE is likely faster than the default implementation, and on some systems, AVX2 might be faster than SSE but slower than AVX512 under load.

bosilca commented 3 years ago

Let's not jump to conclusions yet. If I read the graphs posted by @rajachan correctly, we are looking at a 10x performance decrease for the allreduce between the AVX and the non-AVX version, while the papers talk about a 10% decrease in a similar workload (a mix of AVX and non-AVX operations).

We already have an MCA parameter, op_avx_support, to control how much of the hardware AVX support is allowed by the user: 0 means no AVX/SSE, 31 means just AVX, 53 AVX and AVX2, and leaving it unchanged allows everything possible/available.

rhc54 commented 3 years ago

@wesbland Have you seen anything like this? Can you perhaps connect us to someone over there who can help us figure out the right path forward?

rajachan commented 3 years ago

@bosilca With LAMMPS running on 24 ranks on a single compute node, here's what I see with the various avx levels:

OMPI_MCA_op_avx_support=0 Loop time of 35.2146 on 24 procs for 100 steps with 28800000 atoms

OMPI_MCA_op_avx_support=31 Loop time of 34.8507 on 24 procs for 100 steps with 28800000 atoms

OMPI_MCA_op_avx_support=53 Loop time of 42.3577 on 24 procs for 100 steps with 28800000 atoms

Default: Loop time of 42.3886 on 24 procs for 100 steps with 28800000 atoms

I am running this on a Skylake system with the following capabilities:

Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
bosilca commented 3 years ago

Based on these numbers it seems we should leave the avx component enabled and with high priority on x86, but restrict it to use only AVX instructions.

wesbland commented 3 years ago

@wesbland Have you seen anything like this? Can you perhaps connect us to someone over there who can help us figure out the right path forward?

@rhc54 I haven't seen anything like this personally, but I can ask around with some of the other folks.

rajachan commented 3 years ago

Based on these numbers it seems we should leave the avx component enabled and with high priority on x86, but restrict it to use only AVX instructions.

Agreed, this sounds the safest without having to change priority.

bosilca commented 3 years ago

I just discovered that the Intel compiler does not define the AVX* macros without a specific -m option. Kudos to icc folks, way to go. I have a patch; I will restrict the AVX512 as well.

ggouaillardet commented 3 years ago

@rajachan did you bind MPI tasks to a single core (e.g. mpirun -bind-to core ...)?

I suspect AVX512 frequency scaling might cause unnecessary task migration that could severely impact performance.

rajachan commented 3 years ago

I used the default binding policy in that last run, but there's a degradation after pinning to cores as well, just not as pronounced:

OMPI_MCA_op_avx_support=0 Loop time of 42.2022 on 24 procs for 100 steps with 28800000 atoms

OMPI_MCA_op_avx_support=31 Loop time of 42.7109 on 24 procs for 100 steps with 28800000 atoms

OMPI_MCA_op_avx_support=53 Loop time of 45.9294 on 24 procs for 100 steps with 28800000 atoms

default: Loop time of 46.1744 on 24 procs for 100 steps with 28800000 atoms

ggouaillardet commented 3 years ago

@rajachan thanks for the interesting numbers. The degradation is indeed not as pronounced, but the absolute performance is just worse.

without the op/avx component, loop time increased from 35 up to 42 seconds after pinning to a core (!)

rajachan commented 3 years ago

without the op/avx component, loop time increased from 35 up to 42 seconds after pinning to a core (!)

Yes, that's a bit puzzling too, and should be looked at separately in addition to the AVX512 issue.

bosilca commented 3 years ago

We are trying to replicate these results on a single, skylake-based node, but so far we are unable to highlight any performance regression with AVX2 or AVX512 turned on. @dong0321 will post the result soon.

Meanwhile, I will amend #8372 and #8373 to remove the part where I alter the flags of the AVX component, such that we can pull in the fix for icc, but without reducing [yet] the capabilities of the AVX component.

dong0321 commented 3 years ago

I did the same experiments as @rajachan described.

Experiment environment: OMPI master @9ff011728c16dcd642b429f8208ce90602c22adb, single node, 24 processes, --bind-to core.
Model name: Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz
Flags: ssse3 sse4_1 sse4_2 avx avx2 avx512f avx512dq avx512cd avx512bw avx512vl avx512_vnni
The processor capability bits combine as follows: SSE 0x01, SSE2 0x02, SSE3 0x04, SSE4.1 0x08, AVX 0x010, AVX2 0x020, AVX512F 0x100, AVX512BW 0x200.
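For readers decoding the op_avx_support masks in the command lines below, the bits listed above compose like this (plain C defines for illustration only; these names are not OMPI source):

/* Capability bits as quoted above (illustration only). */
#define CAP_SSE      0x001
#define CAP_SSE2     0x002
#define CAP_SSE3     0x004
#define CAP_SSE41    0x008
#define CAP_AVX      0x010
#define CAP_AVX2     0x020
#define CAP_AVX512F  0x100
#define CAP_AVX512BW 0x200

/* Masks used in the runs below:
 *   0x010 = AVX only           0x020 = AVX2 only
 *   0x030 = AVX | AVX2         0x100 = AVX512F only
 *   0x120 = AVX512F | AVX2     0x130 = AVX512F | AVX2 | AVX
 */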

Here are the cmd lines and results:

/home/zhongdong/opt/git/george_branch/bin/mpirun  -mca op ^avx  -bind-to core  -np 24
/home/zhongdong/Downloads/git_from_2020/lammps/src/lmp_mpi -in 
/home/zhongdong/Downloads/git_from_2020/lammps/bench/in.chute.scaled -var x 90 -var y 90 &>1.txt
1.txt
174:Loop time of 703.492 on 24 procs for 100 steps with 259200000 atoms

/home/zhongdong/opt/git/george_branch/bin/mpirun --mca op avx  -bind-to core  -np 24
  /home/zhongdong/Downloads/git_from_2020/lammps/src/lmp_mpi -in
 /home/zhongdong/Downloads/git_from_2020/lammps/bench/in.chute.scaled -var x 90 -var y 90 &>2.txt
2.txt
173:Loop time of 603.9 on 24 procs for 100 steps with 259200000 atoms

/home/zhongdong/opt/git/george_branch/bin/mpirun --mca op avx --mca op_avx_support 0x010  -bind-to core  -np 24 
 /home/zhongdong/Downloads/git_from_2020/lammps/src/lmp_mpi -in 
/home/zhongdong/Downloads/git_from_2020/lammps/bench/in.chute.scaled -var x 90 -var y 90 &>3.txt
3.txt
173:Loop time of 601.464 on 24 procs for 100 steps with 259200000 atoms

/home/zhongdong/opt/git/george_branch/bin/mpirun --mca op avx --mca op_avx_support 0x020  -bind-to core  -np 24  
/home/zhongdong/Downloads/git_from_2020/lammps/src/lmp_mpi -in 
/home/zhongdong/Downloads/git_from_2020/lammps/bench/in.chute.scaled -var x 90 -var y 90 &>4.txt
4.txt
173:Loop time of 596.886 on 24 procs for 100 steps with 259200000 atoms

/home/zhongdong/opt/git/george_branch/bin/mpirun --mca op avx --mca op_avx_support 0x030  -bind-to core  -np 24 
 /home/zhongdong/Downloads/git_from_2020/lammps/src/lmp_mpi -in 
/home/zhongdong/Downloads/git_from_2020/lammps/bench/in.chute.scaled -var x 90 -var y 90 &>5.txt
5.txt
173:Loop time of 581.822 on 24 procs for 100 steps with 259200000 atoms

/home/zhongdong/opt/git/george_branch/bin/mpirun --mca op avx --mca op_avx_support 0x100  -bind-to core  -np 24 
 /home/zhongdong/Downloads/git_from_2020/lammps/src/lmp_mpi -in 
/home/zhongdong/Downloads/git_from_2020/lammps/bench/in.chute.scaled -var x 90 -var y 90 &>6.txt
6.txt
173:Loop time of 569.93 on 24 procs for 100 steps with 259200000 atoms

/home/zhongdong/opt/git/george_branch/bin/mpirun --mca op avx --mca op_avx_support 0x130  -bind-to core  -np 24
  /home/zhongdong/Downloads/git_from_2020/lammps/src/lmp_mpi -in 
/home/zhongdong/Downloads/git_from_2020/lammps/bench/in.chute.scaled -var x 90 -var y 90 &>7.txt
7.txt
173:Loop time of 513.994 on 24 procs for 100 steps with 259200000 atoms

/home/zhongdong/opt/git/george_branch/bin/mpirun --mca op avx --mca op_avx_support 0x120  -bind-to core  -np 24
  /home/zhongdong/Downloads/git_from_2020/lammps/src/lmp_mpi -in 
/home/zhongdong/Downloads/git_from_2020/lammps/bench/in.chute.scaled -var x 90 -var y 90 &>8.txt
8.txt
173:Loop time of 456.201 on 24 procs for 100 steps with 259200000 atoms


The results show a different story: without op/avx the performance is the worst. With avx enabled (AVX alone, AVX2, AVX512, or a mix of those), it shows a speedup of 14%~35%.

rajachan commented 3 years ago

I'm talking to George offline about this. I am setting up a test cluster for @dong0321 to check out the differences between our two runs. We will report back with findings.

rajachan commented 3 years ago

I had a vanilla build of OMPI, but @dong0321 had CFLAGS=-march=skylake-avx512 set in his configure, which is what was causing the difference. He has since reproduced my results with a vanilla build. I'll let him post his results.

dong0321 commented 3 years ago

I reproduced Raja's results on Skylake (Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz). The results show that AVX512 is decreasing performance; AVX2 is not.

~/opt/ompi/4.1.x/bin/mpirun --mca op avx --mca op_avx_support 0xfff --bind-to core -np 24 path/lmp_mpi -in /path/in.chute.scaled -var x 30 -var y 30
Loop time of 40.3812 on 24 procs for 100 steps with 28800000 atoms

~/opt/ompi/4.1.x/bin/mpirun --mca op avx --mca op_avx_support 0x3f --bind-to core -np 24 path/lmp_mpi -in /path/in.chute.scaled -var x 30 -var y 30
Loop time of 33.4608 on 24 procs for 100 steps with 28800000 atoms

~/opt/ompi/4.1.x/bin/mpirun --mca op avx --mca op_avx_support 0x1f --bind-to core -np 24 path/lmp_mpi -in /path/in.chute.scaled -var x 30 -var y 30
Loop time of 33.563 on 24 procs for 100 steps with 28800000 atoms

I also tested on cascade lake Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz, which shows no performance decrease with AVX512 or AVX2.

dong0321 commented 3 years ago

We tried another approach that takes reduce_local and replaces the local reduce with an allreduce on MPI_COMM_WORLD, with unaligned and aligned data. It shows no performance decrease. This is on Cascade Lake (Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz).
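The substitution is essentially the one-liner below (a sketch of the change described, reusing the variable names from the reduce_local macro quoted earlier in this thread; not the exact patch):

/* Sketch (not the exact patch).  The original local reduction: */
MPI_Reduce_local(_p1 + _k, _p2 + _k, (COUNT) - _k, (MPITYPE), (MPIOP));
/* becomes an in-place allreduce across MPI_COMM_WORLD, so that all 24
 * ranks run the vectorized op at the same time: */
MPI_Allreduce(MPI_IN_PLACE, _p2 + _k, (COUNT) - _k, (MPITYPE), (MPIOP), MPI_COMM_WORLD);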

Test with allreduce aligned and unaligned. The result shows AVX is still faster than non-AVX. This is the output from rank 0 with 24 processes.

$ /home/zhongdong/opt/git/george_branch/bin/mpirun -np 24 reduce_local ...
$ tail avx_rank0_shift.txt

MPI_SUM    MPI_INT8_T 8          [success]  count  2048        time (seconds / shifts) 0.00002342 0.00001618 0.00002749 0.00001567
MPI_SUM    MPI_INT8_T 8          [success]  count  4096        time (seconds / shifts) 0.00012998 0.00009841 0.00005475 0.00003839
MPI_SUM    MPI_INT8_T 8          [success]  count  8192        time (seconds / shifts) 0.00007279 0.00007643 0.00006695 0.00007870
MPI_SUM    MPI_INT8_T 8          [success]  count  16384       time (seconds / shifts) 0.00008852 0.00012386 0.00009632 0.00008198
MPI_SUM    MPI_INT8_T 8          [success]  count  32768       time (seconds / shifts) 0.00014307 0.00012712 0.00012602 0.00012312
MPI_SUM    MPI_INT8_T 8          [success]  count  65536       time (seconds / shifts) 0.00024698 0.00019608 0.00020291 0.00020624
MPI_SUM    MPI_INT8_T 8          [success]  count  131072      time (seconds / shifts) 0.00041439 0.00032205 0.00034034 0.00033766
MPI_SUM    MPI_INT8_T 8          [success]  count  262144      time (seconds / shifts) 0.00104947 0.00074683 0.00064079 0.00057179
MPI_SUM    MPI_INT8_T 8          [success]  count  524288      time (seconds / shifts) 0.00299003 0.00144670 0.00124236 0.00107849
MPI_SUM    MPI_INT8_T 8          [success]  count  1048576     time (seconds / shifts) 0.00455701 0.00289851 0.00209786 0.00216357

$ /home/zhongdong/opt/git/george_branch/bin/mpirun -mca op ^avx -np 24 reduce_local ...
$ tail noavx_rank0_shift.txt

MPI_SUM    MPI_INT8_T 8          [success]  count  2048        time (seconds / shifts) 0.00003396 0.00003482 0.00003206 0.00002900
MPI_SUM    MPI_INT8_T 8          [success]  count  4096        time (seconds / shifts) 0.00016633 0.00012850 0.00009502 0.00007309
MPI_SUM    MPI_INT8_T 8          [success]  count  8192        time (seconds / shifts) 0.00009685 0.00008488 0.00008030 0.00010761
MPI_SUM    MPI_INT8_T 8          [success]  count  16384       time (seconds / shifts) 0.00012680 0.00013737 0.00013265 0.00011907
MPI_SUM    MPI_INT8_T 8          [success]  count  32768       time (seconds / shifts) 0.00021818 0.00020476 0.00019764 0.00019272
MPI_SUM    MPI_INT8_T 8          [success]  count  65536       time (seconds / shifts) 0.00041932 0.00032535 0.00033718 0.00032707
MPI_SUM    MPI_INT8_T 8          [success]  count  131072      time (seconds / shifts) 0.00074386 0.00072927 0.00060821 0.00060789
MPI_SUM    MPI_INT8_T 8          [success]  count  262144      time (seconds / shifts) 0.00214421 0.00108592 0.00109788 0.00110618
MPI_SUM    MPI_INT8_T 8          [success]  count  524288      time (seconds / shifts) 0.00320920 0.00245157 0.00207727 0.00207326
MPI_SUM    MPI_INT8_T 8          [success]  count  1048576     time (seconds / shifts) 0.00500448 0.00463977 0.00395080 0.00379880
rajachan commented 3 years ago

@dong0321 Can you repeat this test on the Skylake system where we reproduced the impact on LAMMPS?

dong0321 commented 3 years ago

allreduce results on skylake Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz AVX is faster than non-AVX. This is the output from rank 0 with 24 processes.

$ opt/ompi/4.1.x/bin/mpirun -mca op ^avx -np 24 reduce_local ...
[ec2-user@ip-172-31-55-224 datatype]$ tail noavx_rank0_shift.txt

MPI_SUM    MPI_INT8_T 8          [success]  count  2048        time (seconds / shifts) 0.00003220 0.00002618 0.00002693 0.00002493
MPI_SUM    MPI_INT8_T 8          [success]  count  4096        time (seconds / shifts) 0.00009739 0.00007799 0.00006968 0.00007193
MPI_SUM    MPI_INT8_T 8          [success]  count  8192        time (seconds / shifts) 0.00007880 0.00006648 0.00006358 0.00007358
MPI_SUM    MPI_INT8_T 8          [success]  count  16384       time (seconds / shifts) 0.00009920 0.00010670 0.00011290 0.00009291
MPI_SUM    MPI_INT8_T 8          [success]  count  32768       time (seconds / shifts) 0.00015118 0.00014591 0.00013938 0.00014321
MPI_SUM    MPI_INT8_T 8          [success]  count  65536       time (seconds / shifts) 0.00027290 0.00024369 0.00024039 0.00023854
MPI_SUM    MPI_INT8_T 8          [success]  count  131072      time (seconds / shifts) 0.00054232 0.00049116 0.00045581 0.00046936
MPI_SUM    MPI_INT8_T 8          [success]  count  262144      time (seconds / shifts) 0.00097585 0.00096952 0.00092836 0.00093997
MPI_SUM    MPI_INT8_T 8          [success]  count  524288      time (seconds / shifts) 0.00192393 0.00201303 0.00186113 0.00185925
MPI_SUM    MPI_INT8_T 8          [success]  count  1048576     time (seconds / shifts) 0.00386278 0.00411523 0.00387891 0.00387315

$ opt/ompi/4.1.x/bin/mpirun -np 24 reduce_local ...
[ec2-user@ip-172-31-55-224 datatype]$ tail avx_rank0_shift.txt

MPI_SUM    MPI_INT8_T 8          [success]  count  2048        time (seconds / shifts) 0.00002041 0.00001467 0.00001553 0.00001319
MPI_SUM    MPI_INT8_T 8          [success]  count  4096        time (seconds / shifts) 0.00009799 0.00005957 0.00004662 0.00004025
MPI_SUM    MPI_INT8_T 8          [success]  count  8192        time (seconds / shifts) 0.00006859 0.00006459 0.00006412 0.00006866
MPI_SUM    MPI_INT8_T 8          [success]  count  16384       time (seconds / shifts) 0.00010162 0.00009352 0.00010067 0.00008038
MPI_SUM    MPI_INT8_T 8          [success]  count  32768       time (seconds / shifts) 0.00013829 0.00013241 0.00012451 0.00012239
MPI_SUM    MPI_INT8_T 8          [success]  count  65536       time (seconds / shifts) 0.00024440 0.00021226 0.00021521 0.00020219
MPI_SUM    MPI_INT8_T 8          [success]  count  131072      time (seconds / shifts) 0.00044169 0.00037666 0.00038682 0.00036369
MPI_SUM    MPI_INT8_T 8          [success]  count  262144      time (seconds / shifts) 0.00086758 0.00078537 0.00072895 0.00072250
MPI_SUM    MPI_INT8_T 8          [success]  count  524288      time (seconds / shifts) 0.00157091 0.00165382 0.00155314 0.00148457
MPI_SUM    MPI_INT8_T 8          [success]  count  1048576     time (seconds / shifts) 0.00347956 0.00367711 0.00335126 0.00350388
rajachan commented 3 years ago

This regression does not happen when compiling OMPI with icc. The issue seems confined to the use of gcc (tested with multiple versions up to a v11 candidate built from source). LAMMPS developers have confirmed they are not making explicit use of AVX512 here. @bosilca I propose updating https://github.com/open-mpi/ompi/pull/8376 to conditionally use those combinations only when the Intel compilers are used.

bosilca commented 3 years ago

Allow me to summarize the situation as I understand it. We have a performance regression on one application, on a particular processor, when compiled with a particular compiler (many versions of the same compiler). Analysis of the regression in the application context pinpoints the performance issue on an MPI_Allreduce, but we are unable to replicate it (even using the same set of conditions) in any standalone benchmark. In addition, we have not been able to reproduce the performance regression on other applications, even on the exact same setup.

So, I'm not sure I understand the proposal here. Allow AVX only when OMPI is compiled with icc? When the application is compiled with icc? Both of these are possible but unnecessarily restrictive. At this point we have no common denominator and no understanding of the root cause. I would advocate we do nothing, add some wording to the FAQ, and, while we can leave this ticket open for future inquiries, move forward and remove the blocking label.

rajachan commented 3 years ago

Given this is a performance optimization we are talking about, and given it was just introduced in this series, yes, that is exactly why I am proposing we be conservative. We have one application that we know of and we do not have a full understanding of the problem, so we cannot say that no other application is impacted (we don't know what we don't know). We learned from the LAMMPS developers that there should be nothing special about their use of Allreduce. I am just repeating myself at this point, but the fact that we need more investigation is enough to say we should not make this the default for everyone.

In my tests with the Intel compiler, I'd just compiled OMPI with icc and not the app.

We have a few different drivers for a 4.1.x bugfix release and I don't want to hold that up any further, so if you want to take the FAQ route I'm fine with that.