rajachan opened this issue 3 years ago
@rajachan how much performance degradation did you measure?
My understanding is that on recent processors, unaligned load/store are as efficient as aligned load/store when the data is aligned.
Is it fair to say that the performance hit is caused by the combined use of unaligned load/store and gcc 4.8.5? (e.g. no performance hit when running the very same code on the very same hardware with GCC >= 7)
@ggouaillardet it's a ~10% degradation. The average loop time of lammps (in seconds) jumps from 7.7 to 8.6.
@shijin-aws thanks. Well, 10% of the user app (i.e. not 10% of the MPI_Reduce() performance) looks like a pretty massive hit. What if you disable the op/avx component (e.g. `mpirun --mca op ^avx ...`)?
@ggouaillardet Using `mpirun --mca op ^avx` brings the performance back to the level of Open MPI 4.0.5 (7.7 seconds).
@shijin-aws Thanks. Can you please confirm GCC 4.8.5 is to blame here? In that case, should we simply not build the op/avx component with GCC < 7?
@ggouaillardet Sure, @rajachan suggested the same thing. I am trying to build it with gcc7 on my machine to confirm the root cause is gcc 4.8.5.
You can always use the reduce_local test in test/datatype with the type and op of your liking to see if there is any performance degradation in the AVX part of the reduction operation.
On a skylake machine I see a difference of about 15% between the double SUM local reduction operation compiled with gcc 4.8.5 and gcc 10.2.0. Looking a little deeper into this it seems that gcc 4.8.5 does not understand the -march=skylake-avx512
option, so AVX512 is always turned off.
@bosilca I am afraid this is a different issue. The reported issue is that op/avx is slower than the default op component on Skylake with gcc 4.8.5.
I applied the patch below to the reduce_local test (move MPI_Wtime(), add `-d <alignments>`, typically `reduce_local -d 1` to only use aligned data, and implement `-r <repeats>` to make the test longer). Then I found op/avx is a bit faster than the default component (and yes, op/avx is only using AVX2 because gcc 4.8.5 does not know skylake).
```diff
diff --git a/test/datatype/reduce_local.c b/test/datatype/reduce_local.c
index 97890f9..9a69b1b 100644
--- a/test/datatype/reduce_local.c
+++ b/test/datatype/reduce_local.c
@@ -115,11 +115,12 @@ do { \
     const TYPE *_p1 = ((TYPE*)(INBUF)), *_p3 = ((TYPE*)(CHECK_BUF)); \
     TYPE *_p2 = ((TYPE*)(INOUT_BUF)); \
     skip_op_type = 0; \
-    for(int _k = 0; _k < min((COUNT), 4); _k++ ) { \
-        memcpy(_p2, _p3, sizeof(TYPE) * (COUNT)); \
-        tstart = MPI_Wtime(); \
-        MPI_Reduce_local(_p1+_k, _p2+_k, (COUNT)-_k, (MPITYPE), (MPIOP)); \
-        tend = MPI_Wtime(); \
+    tstart = MPI_Wtime(); \
+    for(int _k = 0; _k < min((COUNT), d); _k++ ) { \
+        for(int _r = 0; _r < repeats; _r++) { \
+            memcpy(_p2, _p3, sizeof(TYPE) * (COUNT)); \
+            MPI_Reduce_local(_p1+_k, _p2+_k, (COUNT)-_k, (MPITYPE), (MPIOP)); \
+        } \
         if( check ) { \
             for( i = 0; i < (COUNT)-_k; i++ ) { \
                 if(((_p2+_k)[i]) == (((_p1+_k)[i]) OPNAME ((_p3+_k)[i]))) \
@@ -131,6 +132,7 @@ \
             } \
         } \
     } \
+    tend = MPI_Wtime(); \
     goto check_and_continue; \
 } while (0)
@@ -163,15 +165,22 @@ int main(int argc, char **argv)
 {
     static void *in_buf = NULL, *inout_buf = NULL, *inout_check_buf = NULL;
     int count, type_size = 8, rank, size, provided, correctness = 1;
-    int repeats = 1, i, c;
+    int repeats = 1, i, c, d = 4;
     double tstart, tend;
     bool check = true;
     char type[5] = "uifd", *op = "sum", *mpi_type;
     int lower = 1, upper = 1000000, skip_op_type;
     MPI_Op mpi_op;
-    while( -1 != (c = getopt(argc, argv, "l:u:t:o:s:n:vfh")) ) {
+    while( -1 != (c = getopt(argc, argv, "d:l:u:t:o:s:n:vr:fh")) ) {
         switch(c) {
+        case 'd':
+            d = atoi(optarg);
+            if( d < 1 ) {
+                fprintf(stderr, "Disalignment must be greater than zero\n");
+                exit(-1);
+            }
+            break;
         case 'l':
             lower = atoi(optarg);
             if( lower <= 0 ) {
```
@ggouaillardet @rajachan I rebuilt the application with gcc/g++ 7.2.1 on the same os (alinux1), but the performance does not go back to the level of open mpi 4.0.5.
Here's what I am seeing with vanilla reduce_local and 32-bit float sums:
count | gcc482 (seconds) | gcc721 (seconds)
---|---|---
1 | 0.000017 | 0.000011
2 | 0 | 0
4 | 0 | 0
8 | 0 | 0
16 | 0 | 0
32 | 0 | 0
64 | 0 | 0
128 | 0 | 0
256 | 0 | 0
512 | 0.000001 | 0
1024 | 0.000001 | 0.000001
2048 | 0.000002 | 0.000001
4096 | 0.000003 | 0.000002
8192 | 0.000007 | 0.000004
16384 | 0.000013 | 0.000008
32768 | 0.000025 | 0.000015
65536 | 0.000052 | 0.000032
131072 | 0.000109 | 0.000074
262144 | 0.000221 | 0.000161
524288 | 0.000457 | 0.00034
So are we coming down to determining that this is a compiler issue? I.e., certain versions of gcc give terrible performance?
If so, is there a way we can detect this in configure and react appropriately?
That's what it is looking like to me. I'm going to try @shijin-aws's test with the actual application again to make sure he wasn't inadvertently running with the older compiler.
I've reproduced @shijin-aws's observation. With the LAMMPS application, even with the newer gcc (7.2.1), runs using op/avx perform poorer than the ones without.
$ /shared/ompi/install/bin/mpirun --mca op ^avx -n 1152 -N 36 -hostfile /shared/ompi/hfile /shared/lammps/bin/lmp -in /shared/lammps/bin/in.chute.scaled -var x 90 -var y 90
Loop time of 7.94407 on 1152 procs for 100 steps with 259200000 atoms
$ /shared/ompi/install/bin/mpirun --mca op avx -n 1152 -N 36 -hostfile /shared/ompi/hfile /shared/lammps/bin/lmp -in /shared/lammps/bin/in.chute.scaled -var x 90 -var y 90
Loop time of 8.95102 on 1152 procs for 100 steps with 259200000 atoms
$ gcc --version
gcc (GCC) 7.2.1 20170915 (Red Hat 7.2.1-2)
From OMPI config log:
MCA_BUILD_OP_AVX2_FLAGS='-mavx2'
MCA_BUILD_OP_AVX512_FLAGS='-march=skylake-avx512'
MCA_BUILD_OP_AVX_FLAGS='-mavx'
MCA_BUILD_ompi_op_avx_DSO_FALSE='#'
MCA_BUILD_ompi_op_avx_DSO_TRUE=''
MCA_BUILD_ompi_op_has_avx2_support_FALSE='#'
MCA_BUILD_ompi_op_has_avx2_support_TRUE=''
MCA_BUILD_ompi_op_has_avx512_support_FALSE='#'
MCA_BUILD_ompi_op_has_avx512_support_TRUE=''
MCA_BUILD_ompi_op_has_avx_support_FALSE='#'
MCA_BUILD_ompi_op_has_avx_support_TRUE=''
MCA_ompi_op_ALL_COMPONENTS=' avx'
MCA_ompi_op_ALL_SUBDIRS=' mca/op/avx'
MCA_ompi_op_DSO_COMPONENTS=' avx'
MCA_ompi_op_DSO_SUBDIRS=' mca/op/avx'
#define OMPI_MCA_OP_HAVE_AVX512 1
#define OMPI_MCA_OP_HAVE_AVX2 1
#define OMPI_MCA_OP_HAVE_AVX 1
Looks like there's more to it than the compiler versions and their AVX support.
@rajachan thanks for confirming there is more to it than the gcc version. Would you be able to reproduce this issue with a smaller config? (Ideally one node and 24 MPI tasks.)
Yup, it is more evident with a single-node run.
with op/avx ( --mca op avx -n 24 -N 24): Loop time of 373.581 on 24 procs for 100 steps with 259200000 atoms
without op/avx ( --mca op ^avx -n 24 -N 24): Loop time of 312.945 on 24 procs for 100 steps with 259200000 atoms
Times are in seconds. Will run it through a profiler.
Stating the obvious with some pretty charts, but the mpiP profile from the run without op/avx shows the aggregate AllReduce cost across ranks:
And the run with op/avx:
Here are the mpiP profiles from the two runs and some more charts in case you want to look it over. I'll take a closer look too. lammps-avx.zip
This is totally puzzling. Assuming we are pointing toward the AVX support as the culprit behind this performance regression, I went ahead and tested just the MPI_OP and I am unable to replicate it anywhere. I've tried skylake with gcc 4.8.5, 7.0.2 and 10.2.0. Again, I have not looked at the performance of the MPI_Allreduce collective, but specifically at the performance of the MPI_OP.
As it was not clear from the discussion which particular MPI_Allreduce introduced the issue, a quick grep in the lammps code highlights 2 operations that stand out: sum and max on doubles. I also modified the reduce_local test to be able to test specific shifts or misalignments of the buffers, to see if that could be the issue. Unfortunately, all these efforts were in vain; nothing unusual popped up, and performance usually looks 15-20% better when AVX is turned on, for both sum and max, and for all of the compilers mentioned above.
It would be great if you can run the same tests on your setup. You will need to patch your code with 20be3fc25713ac (from #8322), and run `mpirun -np 1 ./reduce_local -o max,sum -t d -r 400 -l 131072 -u 524288 -i 4`. You should get a list of 4 timings (because of the -i 4), each one for a different shift in the first position of the input/output buffers. You should also enable or disable the avx component to see the difference.
Same here: with the enhanced test and pinning the process, op/avx is faster than the base component on a Skylake processor with gcc 4.8.5 (which is only AVX2 capable). Make sure the -bind-to core option is passed to the mpirun command line (or simply `taskset -c 0 ./reduce_local` if running in singleton mode).
My previous tests were looking at the performance of a single MPI_OP running undisturbed on the machine, so I thought maybe the issue is not coming from the MPI_OP itself but from running multiple of these MPI_OPs simultaneously. So I ran the OSU allreduce test on all the skylake cores I had access to, and it reflected the same finding as above: the AVX version is 5.7% faster than the non-AVX one (2092.11 us vs. 2170.22 us) for the code compiled with gcc 4.8.5.
With 20be3fc from #8322 cherry-picked on v4.1.x and GCC 4.8.5:
op/avx excluded:
MPI_MAX MPI_DOUBLE 8 count 131072 time (seconds / shifts) 0.00033371 0.00033379 0.00033351 0.00033370
MPI_MAX MPI_DOUBLE 8 count 262144 time (seconds / shifts) 0.00067200 0.00067364 0.00067183 0.00067191
MPI_MAX MPI_DOUBLE 8 count 524288 time (seconds / shifts) 0.00134448 0.00134368 0.00134335 0.00134373
MPI_SUM MPI_DOUBLE 8 count 131072 time (seconds / shifts) 0.00027357 0.00027368 0.00027386 0.00027367
MPI_SUM MPI_DOUBLE 8 count 262144 time (seconds / shifts) 0.00055431 0.00055400 0.00055398 0.00055392
MPI_SUM MPI_DOUBLE 8 count 524288 time (seconds / shifts) 0.00110281 0.00110347 0.00110247 0.00110312
op/avx included:
MPI_MAX MPI_DOUBLE 8 count 131072 time (seconds / shifts) 0.00020832 0.00022310 0.00022270 0.00022301
MPI_MAX MPI_DOUBLE 8 count 262144 time (seconds / shifts) 0.00043216 0.00045851 0.00045999 0.00045923
MPI_MAX MPI_DOUBLE 8 count 524288 time (seconds / shifts) 0.00086542 0.00091259 0.00091225 0.00091096
MPI_SUM MPI_DOUBLE 8 count 131072 time (seconds / shifts) 0.00021096 0.00022439 0.00022389 0.00022375
MPI_SUM MPI_DOUBLE 8 count 262144 time (seconds / shifts) 0.00044044 0.00046373 0.00046331 0.00046293
MPI_SUM MPI_DOUBLE 8 count 524288 time (seconds / shifts) 0.00085983 0.00091085 0.00091748 0.00091429
The reproduction seems limited to the application's pattern. I will take a closer look at LAMMPS usage of the collective today.
I've been looking at this for a while now, and I still do not have a silver bullet here. Just the use of AVX for the op seems to be slowing things down. Poking around literature, I see several references to frequency scaling caused by heavy use of AVX on multiple cores simultaneously, and that causing slowdowns. Is this something you are aware of, and could that be a probable cause?
https://dl.acm.org/doi/10.1145/3409963.3410488 https://arxiv.org/pdf/1901.04982.pdf
I am just running the benchmark case that comes with lammps, in case you want to give it a try on your end: https://github.com/lammps/lammps/blob/stable_12Dec2018/bench/in.chain.scaled
Like I mentioned earlier, I can reproduce this with newer versions of GCC and a single compute instance.
@rajachan thanks for the report.
Frequency scaling is indeed a documented drawback of AVX that, in the worst case, slows things down, especially on a loaded system. I guess we could add a runtime parameter to set the max AVX flavor to be used. For example, SSE is likely faster than the default implementation, and on some systems, AVX2 might be faster than SSE but slower than AVX512 under load.
Let's not jump to conclusions yet. If I correctly read the graphs posted by @rajachan we are looking at a factor 10x of performance decrease for the allreduce between the AVX and the non-AVX version, while the papers talk about a 10% decrease in a similar workload (suite of AVX and non-AVX operations).
We already have an MCA parameter to control how much of the hardware AVX support is allowed by the user: op_avx_support. 0 means no AVX/SSE, 31 means just AVX, 53 means AVX and AVX2, and leaving it unchanged allows everything possible/available.
@wesbland Have you seen anything like this? Can you perhaps connect us to someone over there who can help us figure out the right path forward?
@bosilca With LAMMPS running on 24 ranks on a single compute node, here's what I see with the various avx levels:
OMPI_MCA_op_avx_support=0 Loop time of 35.2146 on 24 procs for 100 steps with 28800000 atoms
OMPI_MCA_op_avx_support=31 Loop time of 34.8507 on 24 procs for 100 steps with 28800000 atoms
OMPI_MCA_op_avx_support=53 Loop time of 42.3577 on 24 procs for 100 steps with 28800000 atoms
Default: Loop time of 42.3886 on 24 procs for 100 steps with 28800000 atoms
I am running this on a Skylake system with the following capabilities:
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
Based on these numbers it seems we should leave the avx component enabled and with high priority on x86, but restrict it to use only AVX instructions.
@wesbland Have you seen anything like this? Can you perhaps connect us to someone over there who can help us figure out the right path forward?
@rhc54 I haven't seen anything like this personally, but I can ask around with some of the other folks.
Based on these numbers it seems we should leave the avx component enabled and with high priority on x86, but restrict it to use only AVX instructions.
Agreed, this sounds the safest without having to change priority.
I just discovered that the Intel compiler does not define the AVX* macros without a specific -m option. Kudos to icc folks, way to go. I have a patch; I will restrict the AVX512 as well.
@rajachan did you bind MPI tasks to a single core (e.g. `mpirun -bind-to core ...`)? I suspect AVX512 frequency scaling might cause unnecessary task migration that could severely impact performance.
I used the default binding policy in that last run, but there's a degradation after pinning to cores as well, just not as pronounced:
OMPI_MCA_op_avx_support=0 Loop time of 42.2022 on 24 procs for 100 steps with 28800000 atoms
OMPI_MCA_op_avx_support=31 Loop time of 42.7109 on 24 procs for 100 steps with 28800000 atoms
OMPI_MCA_op_avx_support=53 Loop time of 45.9294 on 24 procs for 100 steps with 28800000 atoms
default: Loop time of 46.1744 on 24 procs for 100 steps with 28800000 atoms
@rajachan thanks for the interesting numbers. The degradation is indeed not as pronounced, but the absolute performance is just worse.
without the op/avx component, loop time increased from 35 up to 42 seconds after pinning to a core (!)
without the op/avx component, loop time increased from 35 up to 42 seconds after pinning to a core (!)
Yes, that's a bit puzzling too, and should be looked at separately in addition to the AVX512 issue.
We are trying to replicate these results on a single, skylake-based node, but so far we are unable to highlight any performance regression with AVX2 or AVX512 turned on. @dong0321 will post the result soon.
Meanwhile, I will amend #8372 and #8373 to remove the part where I alter the flags of the AVX component, such that we can pull in the fix for icc, but without reducing [yet] the capabilities of the AVX component.
I did the same experiments as @rajachan described.
Experiment environment: OMPI master @9ff011728c16dcd642b429f8208ce90602c22adb, single node, 24 processes, --bind-to core.
Model name: Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz
Flags: ssse3 sse4_1 sse4_2 avx avx2 avx512f avx512dq avx512cd avx512bw avx512vl avx512_vnni
The op_avx_support bits map to processor capabilities as follows: SSE 0x01, SSE2 0x02, SSE3 0x04, SSE4.1 0x08, AVX 0x010, AVX2 0x020, AVX512F 0x100, AVX512BW 0x200.
Here are the cmd lines and results:
/home/zhongdong/opt/git/george_branch/bin/mpirun -mca op ^avx -bind-to core -np 24
/home/zhongdong/Downloads/git_from_2020/lammps/src/lmp_mpi -in
/home/zhongdong/Downloads/git_from_2020/lammps/bench/in.chute.scaled -var x 90 -var y 90 &>1.txt
1.txt
174:Loop time of 703.492 on 24 procs for 100 steps with 259200000 atoms
/home/zhongdong/opt/git/george_branch/bin/mpirun --mca op avx -bind-to core -np 24
/home/zhongdong/Downloads/git_from_2020/lammps/src/lmp_mpi -in
/home/zhongdong/Downloads/git_from_2020/lammps/bench/in.chute.scaled -var x 90 -var y 90 &>2.txt
2.txt
173:Loop time of 603.9 on 24 procs for 100 steps with 259200000 atoms
/home/zhongdong/opt/git/george_branch/bin/mpirun --mca op avx --mca op_avx_support 0x010 -bind-to core -np 24
/home/zhongdong/Downloads/git_from_2020/lammps/src/lmp_mpi -in
/home/zhongdong/Downloads/git_from_2020/lammps/bench/in.chute.scaled -var x 90 -var y 90 &>3.txt
3.txt
173:Loop time of 601.464 on 24 procs for 100 steps with 259200000 atoms
/home/zhongdong/opt/git/george_branch/bin/mpirun --mca op avx --mca op_avx_support 0x020 -bind-to core -np 24
/home/zhongdong/Downloads/git_from_2020/lammps/src/lmp_mpi -in
/home/zhongdong/Downloads/git_from_2020/lammps/bench/in.chute.scaled -var x 90 -var y 90 &>4.txt
4.txt
173:Loop time of 596.886 on 24 procs for 100 steps with 259200000 atoms
/home/zhongdong/opt/git/george_branch/bin/mpirun --mca op avx --mca op_avx_support 0x030 -bind-to core -np 24
/home/zhongdong/Downloads/git_from_2020/lammps/src/lmp_mpi -in
/home/zhongdong/Downloads/git_from_2020/lammps/bench/in.chute.scaled -var x 90 -var y 90 &>5.txt
5.txt
173:Loop time of 581.822 on 24 procs for 100 steps with 259200000 atoms
/home/zhongdong/opt/git/george_branch/bin/mpirun --mca op avx --mca op_avx_support 0x100 -bind-to core -np 24
/home/zhongdong/Downloads/git_from_2020/lammps/src/lmp_mpi -in
/home/zhongdong/Downloads/git_from_2020/lammps/bench/in.chute.scaled -var x 90 -var y 90 &>6.txt
6.txt
173:Loop time of 569.93 on 24 procs for 100 steps with 259200000 atoms
/home/zhongdong/opt/git/george_branch/bin/mpirun --mca op avx --mca op_avx_support 0x130 -bind-to core -np 24
/home/zhongdong/Downloads/git_from_2020/lammps/src/lmp_mpi -in
/home/zhongdong/Downloads/git_from_2020/lammps/bench/in.chute.scaled -var x 90 -var y 90 &>7.txt
7.txt
173:Loop time of 513.994 on 24 procs for 100 steps with 259200000 atoms
/home/zhongdong/opt/git/george_branch/bin/mpirun --mca op avx --mca op_avx_support 0x120 -bind-to core -np 24
/home/zhongdong/Downloads/git_from_2020/lammps/src/lmp_mpi -in
/home/zhongdong/Downloads/git_from_2020/lammps/bench/in.chute.scaled -var x 90 -var y 90 &>8.txt
8.txt
173:Loop time of 456.201 on 24 procs for 100 steps with 259200000 atoms
The results show a different story: without op/avx the performance is the worst. With avx enabled (single avx, avx2, avx512, or a mix of those), it shows a speedup of 14% to 35%.
I'm talking to George offline about this. I am setting up a test cluster for @dong0321 to check out the differences between our two runs. We will report back with findings.
I had a vanilla build of OMPI, but @dong0321 had CFLAGS=-march=skylake-avx512 set in his configure, which is what was causing the difference. He has since reproduced my results with a vanilla build. I'll let him post his results.
I reproduced Raja's results on Skylake (Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz). This result shows that AVX512 is decreasing performance; AVX2 is not.
~/opt/ompi/4.1.x/bin/mpirun --mca op avx --mca op_avx_support 0xfff --bind-to core -np 24 path/lmp_mpi -in /path/in.chute.scaled -var x 30 -var y 30
Loop time of 40.3812 on 24 procs for 100 steps with 28800000 atoms
~/opt/ompi/4.1.x/bin/mpirun --mca op avx --mca op_avx_support 0x3f --bind-to core -np 24 path/lmp_mpi -in /path/in.chute.scaled -var x 30 -var y 30
Loop time of 33.4608 on 24 procs for 100 steps with 28800000 atoms
~/opt/ompi/4.1.x/bin/mpirun --mca op avx --mca op_avx_support 0x1f --bind-to core -np 24 path/lmp_mpi -in /path/in.chute.scaled -var x 30 -var y 30
Loop time of 33.563 on 24 procs for 100 steps with 28800000 atoms
I also tested on cascade lake Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz, which shows no performance decrease with AVX512 or AVX2.
We tried another approach: take the reduce_local test and replace the local reduce with an allreduce on MPI_COMM_WORLD, with unaligned and aligned data. It shows no performance decrease. This is on Cascade Lake (Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz).
Test with allreduce aligned and unaligned. The result shows AVX is still faster than non-AVX. This is the output from rank 0 with 24 processes.
$ /home/zhongdong/opt/git/george_branch/bin/mpirun -np 24 reduce_local ...
$ tail avx_rank0_shift.txt
MPI_SUM MPI_INT8_T 8 [success] count 2048 time (seconds / shifts) 0.00002342 0.00001618 0.00002749 0.00001567
MPI_SUM MPI_INT8_T 8 [success] count 4096 time (seconds / shifts) 0.00012998 0.00009841 0.00005475 0.00003839
MPI_SUM MPI_INT8_T 8 [success] count 8192 time (seconds / shifts) 0.00007279 0.00007643 0.00006695 0.00007870
MPI_SUM MPI_INT8_T 8 [success] count 16384 time (seconds / shifts) 0.00008852 0.00012386 0.00009632 0.00008198
MPI_SUM MPI_INT8_T 8 [success] count 32768 time (seconds / shifts) 0.00014307 0.00012712 0.00012602 0.00012312
MPI_SUM MPI_INT8_T 8 [success] count 65536 time (seconds / shifts) 0.00024698 0.00019608 0.00020291 0.00020624
MPI_SUM MPI_INT8_T 8 [success] count 131072 time (seconds / shifts) 0.00041439 0.00032205 0.00034034 0.00033766
MPI_SUM MPI_INT8_T 8 [success] count 262144 time (seconds / shifts) 0.00104947 0.00074683 0.00064079 0.00057179
MPI_SUM MPI_INT8_T 8 [success] count 524288 time (seconds / shifts) 0.00299003 0.00144670 0.00124236 0.00107849
MPI_SUM MPI_INT8_T 8 [success] count 1048576 time (seconds / shifts) 0.00455701 0.00289851 0.00209786 0.00216357
$ /home/zhongdong/opt/git/george_branch/bin/mpirun -mca op ^avx -np 24 reduce_local ...
$ tail noavx_rank0_shift.txt
MPI_SUM MPI_INT8_T 8 [success] count 2048 time (seconds / shifts) 0.00003396 0.00003482 0.00003206 0.00002900
MPI_SUM MPI_INT8_T 8 [success] count 4096 time (seconds / shifts) 0.00016633 0.00012850 0.00009502 0.00007309
MPI_SUM MPI_INT8_T 8 [success] count 8192 time (seconds / shifts) 0.00009685 0.00008488 0.00008030 0.00010761
MPI_SUM MPI_INT8_T 8 [success] count 16384 time (seconds / shifts) 0.00012680 0.00013737 0.00013265 0.00011907
MPI_SUM MPI_INT8_T 8 [success] count 32768 time (seconds / shifts) 0.00021818 0.00020476 0.00019764 0.00019272
MPI_SUM MPI_INT8_T 8 [success] count 65536 time (seconds / shifts) 0.00041932 0.00032535 0.00033718 0.00032707
MPI_SUM MPI_INT8_T 8 [success] count 131072 time (seconds / shifts) 0.00074386 0.00072927 0.00060821 0.00060789
MPI_SUM MPI_INT8_T 8 [success] count 262144 time (seconds / shifts) 0.00214421 0.00108592 0.00109788 0.00110618
MPI_SUM MPI_INT8_T 8 [success] count 524288 time (seconds / shifts) 0.00320920 0.00245157 0.00207727 0.00207326
MPI_SUM MPI_INT8_T 8 [success] count 1048576 time (seconds / shifts) 0.00500448 0.00463977 0.00395080 0.00379880
@dong0321 Can you repeat this test on the Skylake system where we reproduced the impact on LAMMPS?
allreduce results on skylake Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz AVX is faster than non-AVX. This is the output from rank 0 with 24 processes.
$ opt/ompi/4.1.x/bin/mpirun -mca op ^avx -np 24 reduce_local ...
$ tail noavx_rank0_shift.txt
MPI_SUM MPI_INT8_T 8 [success] count 2048 time (seconds / shifts) 0.00003220 0.00002618 0.00002693 0.00002493
MPI_SUM MPI_INT8_T 8 [success] count 4096 time (seconds / shifts) 0.00009739 0.00007799 0.00006968 0.00007193
MPI_SUM MPI_INT8_T 8 [success] count 8192 time (seconds / shifts) 0.00007880 0.00006648 0.00006358 0.00007358
MPI_SUM MPI_INT8_T 8 [success] count 16384 time (seconds / shifts) 0.00009920 0.00010670 0.00011290 0.00009291
MPI_SUM MPI_INT8_T 8 [success] count 32768 time (seconds / shifts) 0.00015118 0.00014591 0.00013938 0.00014321
MPI_SUM MPI_INT8_T 8 [success] count 65536 time (seconds / shifts) 0.00027290 0.00024369 0.00024039 0.00023854
MPI_SUM MPI_INT8_T 8 [success] count 131072 time (seconds / shifts) 0.00054232 0.00049116 0.00045581 0.00046936
MPI_SUM MPI_INT8_T 8 [success] count 262144 time (seconds / shifts) 0.00097585 0.00096952 0.00092836 0.00093997
MPI_SUM MPI_INT8_T 8 [success] count 524288 time (seconds / shifts) 0.00192393 0.00201303 0.00186113 0.00185925
MPI_SUM MPI_INT8_T 8 [success] count 1048576 time (seconds / shifts) 0.00386278 0.00411523 0.00387891 0.00387315
$ opt/ompi/4.1.x/bin/mpirun -np 24 reduce_local ...
$ tail avx_rank0_shift.txt
MPI_SUM MPI_INT8_T 8 [success] count 2048 time (seconds / shifts) 0.00002041 0.00001467 0.00001553 0.00001319
MPI_SUM MPI_INT8_T 8 [success] count 4096 time (seconds / shifts) 0.00009799 0.00005957 0.00004662 0.00004025
MPI_SUM MPI_INT8_T 8 [success] count 8192 time (seconds / shifts) 0.00006859 0.00006459 0.00006412 0.00006866
MPI_SUM MPI_INT8_T 8 [success] count 16384 time (seconds / shifts) 0.00010162 0.00009352 0.00010067 0.00008038
MPI_SUM MPI_INT8_T 8 [success] count 32768 time (seconds / shifts) 0.00013829 0.00013241 0.00012451 0.00012239
MPI_SUM MPI_INT8_T 8 [success] count 65536 time (seconds / shifts) 0.00024440 0.00021226 0.00021521 0.00020219
MPI_SUM MPI_INT8_T 8 [success] count 131072 time (seconds / shifts) 0.00044169 0.00037666 0.00038682 0.00036369
MPI_SUM MPI_INT8_T 8 [success] count 262144 time (seconds / shifts) 0.00086758 0.00078537 0.00072895 0.00072250
MPI_SUM MPI_INT8_T 8 [success] count 524288 time (seconds / shifts) 0.00157091 0.00165382 0.00155314 0.00148457
MPI_SUM MPI_INT8_T 8 [success] count 1048576 time (seconds / shifts) 0.00347956 0.00367711 0.00335126 0.00350388
This regression does not happen when compiling OMPI with icc. The issue seems contained to the use of gcc (tested with multiple versions up until v11 candidate built from source). LAMMPS developers have confirmed they are not making explicit use of AVX512 here. @bosilca I propose updating https://github.com/open-mpi/ompi/pull/8376 to conditionally use those combinations only when the Intel compilers are used.
Allow me to summarize the situation as I understand it. We have a performance regression on one application, on a particular processor, when compiled with a particular compiler (many versions of the same compiler). Analysis of the regression in the application context, pinpoints the performance issue on an MPI_Allreduce, but we are unable to replicate (even using the same set of conditions) in any stand alone benchmark. In addition, we have not been able to reproduce the performance regression on other applications, even on the exact same setup.
So, I'm not sure I understand the proposal here. Allow AVX only when OMPI is compiled with icc? When the application is compiled with icc? Both of these are possible but unnecessarily restrictive. At this point we have no common denominator, and no understanding of the root cause. I would advocate we do nothing, add some wording to the FAQ, and, while we can leave this ticket open for future inquiries, move forward and remove the blocking label.
Given this is a performance optimization we are talking about and given this was just introduced in this series, yes, that is exactly why I am proposing we be conservative. We have one application that we know of and we don't have full understanding of the problem, so we can not say no other application is impacted (we don't know what we don't know). We learned from the LAMMPS developers that there should be nothing special about their use of Allreduce. I am just repeating myself at this point, but the fact that we need more investigation is enough to say we should not make this the default for everyone.
In my tests with the Intel compiler, I'd just compiled OMPI with icc and not the app.
We have a few different drivers for a 4.1.x bugfix release and I don't want to hold that up any further, so if you want to take the FAQ route I'm fine with that.
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.1.0
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From source tarball, default configuration built with GCC 4.8.5.
Please describe the system on which you are running
Details of the problem
We noticed an OS-specific regression with LAMMPS (in.chute.scaled case) with 4.1.0. Bisecting through the commits, this seems to have been introduced with the AVX-based MPI_OP changes that got backported into this series. Specifically, the commit which moved to the unaligned SSE memory access primitives for reduce OPs seems to be causing it: https://github.com/open-mpi/ompi/pull/7957
This was added to address the Accumulate issue, so it is a necessary correctness fix (https://github.com/open-mpi/ompi/issues/7954)
The actual PR which introduced the SSE-based MPI_OP in the first place was backported from master: https://github.com/open-mpi/ompi/pull/7935
Broadly, allreduce performance seems to have taken a hit in 4.1.0 compared to 4.0.5 in this environment because of these changes. We do not see this with Amazon Linux 2 (which has a 7.x series GCC) or Ubuntu 18, for instance.
Tried with https://github.com/open-mpi/ompi/pull/8322 just in case, that does not help either.
@bosilca does anything obvious stand out to you?