sourceryinstitute / OpenCoarrays

A parallel application binary interface for Fortran 2018 compilers.
http://www.opencoarrays.org
BSD 3-Clause "New" or "Revised" License

Defect: Performance issue on multiple nodes #560

Open gutmann opened 6 years ago

gutmann commented 6 years ago

Defect/Bug Report

Performance issue on multiple nodes. I thought I'd add a comment here on a performance issue I'm seeing at the moment when using coarrays in derived types. I haven't chased down anything more specific, but with the coarray-icar test-ideal case and a small (200 x 200 x 20) problem size, I'm seeing huge slowdowns across multiple nodes. This was not present in OpenCoarrays 1.9.1 (with GCC 6.3), and it is not present with Intel.

I initially thought this could be related to #556, but I now think it is completely separate, since this involves internode communication and therefore requires MPI calls.

Coarray_icar uses non-blocking sends on allocatable coarrays embedded within derived types. Each image first processes its halo grid cells, then sends them, then processes the internal grid cells, and finally syncs with its neighbors before reading the halos that were sent to it.
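
To make the pattern concrete, here is a minimal Fortran sketch of that sequence. It is not the actual coarray_icar source: component names mirror the real ones, but only the north-facing half of the exchange is shown, halo_size is a fixed parameter, and it assumes the north neighbor performs the mirror-image put.

module exchangeable_sketch
  implicit none

  integer, parameter :: halo_size = 1

  type :: exchangeable_t
    real, allocatable :: local(:,:,:)
    ! allocatable coarray components embedded in a derived type
    real, allocatable :: halo_south_in(:,:,:)[:]
    real, allocatable :: halo_north_in(:,:,:)[:]
  end type

contains

  subroutine exchange_with_north(this, north_neighbor)
    class(exchangeable_t), intent(inout) :: this
    integer, intent(in) :: north_neighbor
    integer :: n
    n = ubound(this%local, 3)

    ! 1) one-sided put of this image's northern boundary cells into the
    !    north neighbor's receive buffer (non-blocking in the real code,
    !    which marks the assignment with "!dir$ pgas defer_sync")
    this%halo_south_in(:,:,1:halo_size)[north_neighbor] = &
        this%local(:,:,n-halo_size*2+1:n-halo_size)

    ! 2) ...compute on the internal grid cells while the put is in flight...

    ! 3) sync with the neighbor, then read the halo it put into this image
    !    (assumes the neighbor executed the mirror-image southward put)
    sync images (north_neighbor)
    this%local(:,:,n-halo_size+1:n) = this%halo_north_in(:,:,1:halo_size)
  end subroutine exchange_with_north

end module exchangeable_sketch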

Observed Behavior

Performance gets worse when multiple nodes are used

Expected Behavior

Performance gets better when multiple nodes are used

Steps to Reproduce

git clone https://github.com/gutmann/coarray_icar
cd coarray_icar/src/tests
make MODE=fast

# create the input-parameters.txt namelist file
cat > input-parameters.txt <<EOF
&grid nx=200,ny=200,nz=20 /
EOF

# get access to multiple nodes
for (( i=36; i<=144; i+=36 )); do
    cafrun -n $i ./test-ideal | grep "Model run time"
done

Example results for OpenCoarrays 2.1, OpenCoarrays 1.9.1, and Intel (more details on each configuration below):

Images   OpenCoarrays 2.1   OpenCoarrays 1.9.1   Intel 18.0.1
    36               14.7                 16.3           11.3
    72              105                    8.9            5.8
   144              140                    4.6            3.2
   720              170                    1.4            0.94

All times are in seconds. This is just the core runtime of the numerics, excluding initialization time.

OpenCoarrays 2.1

gfortran/gcc 8.1 (built via opencoarrays install.sh)
mpich 3.2.1 (built via opencoarrays install.sh)
opencoarrays 2.1.0-31-gc0e3ffb (with fprintf statements commented out in mpi/mpi.c)

OpenCoarrays 1.9.1

gfortran/gcc 6.3
MPT 2.15f
opencoarrays 1.9.1

Intel 18.0.1

ifort 18.0.1*
iMPI 2018.1.163

gutmann commented 6 years ago

Note: I have logged into the remote nodes while the code is running and I can see the processes correctly running on all of them, so it isn't doing anything odd like trying to run all images on one node.
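
(For reference, the check was roughly the following, run against each node in the job; the node name is a placeholder and the exact login mechanism depends on the system:)

ssh <compute-node> 'ps -ef | grep test-ideal'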

afanfa commented 6 years ago

Hi Ethan, please make sure to use -O0 while compiling with both Intel and GFortran.

zbeekman commented 6 years ago

please make sure to use -O0 while compiling with Intel and GFortran.

@afanfa just out of curiosity, why?

gutmann commented 6 years ago

Thanks @afanfa. make MODE=fast compiles with: caf -fopenmp -lgomp -Ofast -cpp -DUSE_ASSERTIONS=.false. In the tests above I am setting OMP_NUM_THREADS=1 to disable OpenMP threading, though I suppose it could still be doing something behind the scenes. I am now testing with OpenMP disabled entirely (and -O0 instead of -Ofast).
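
For reference, the timing runs above are invoked roughly like this (a sketch; only the image count changes between runs):

export OMP_NUM_THREADS=1
cafrun -n 72 ./test-ideal | grep "Model run time"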

gutmann commented 6 years ago

With -O0 and no OpenMP directives:

Images   Runtime (s)
    36            26
    72           116

afanfa commented 6 years ago

Intel is much better than GFortran at optimizing code (I'm referring to computation, not communication). If the code has even a small amount of computation, Intel will almost always produce better results when the same optimization level is applied (unless it's -O0).


gutmann commented 6 years ago

Yes, I don't care too much about the relative performance of Intel vs GFortran; I'm more interested in how the two scale (with Intel, using 2x more cores takes roughly half the time, but with GFortran + OpenCoarrays it takes more than 5x longer here).

zbeekman commented 6 years ago

@afanfa don't worry, I suspect the performance regression happened in PR #427 or other "recent" changes, not in code you wrote 😉

zbeekman commented 6 years ago

@gutmann I fixed your first table and deleted the (now extraneous) comment with the corrected formatting; the issue was that you need a blank line before a table.

afanfa commented 6 years ago

Thanks Zaak. ;-)

t-bltg commented 6 years ago

Sorry, I cannot test scalability with this many cores.

I'm pretty sure this is NOT it, but can you try the following patch? (It just stages each array section through a contiguous local temporary before the coarray assignment, in case strided transfers are the issue.)

--- exchangeable_implementation_.f90    2018-07-03 21:14:56.723202212 +0200
+++ exchangeable_implementation.f90 2018-07-03 21:00:28.527662543 +0200
@@ -138,6 +138,7 @@

   module subroutine put_north(this)
       class(exchangeable_t), intent(inout) :: this
+      real, allocatable :: buf(:, :, :)
       integer :: n, nx
       n = ubound(this%local,3)
       nx = size(this%local,1)
@@ -150,12 +151,13 @@
       end if

       !dir$ pgas defer_sync
-      this%halo_south_in(1:nx,:,1:halo_size)[north_neighbor] = this%local(:,:,n-halo_size*2+1:n-halo_size)
+      buf = this%local(:,:,n-halo_size*2+1:n-halo_size)
+      this%halo_south_in(1:nx,:,1:halo_size)[north_neighbor] = buf
   end subroutine

   module subroutine put_south(this)
       class(exchangeable_t), intent(inout) :: this
-
+      real, allocatable :: buf(:, :, :)
       integer :: start, nx
       start = lbound(this%local,3)
       nx = size(this%local,1)
@@ -167,31 +169,35 @@
                      "put_south: conformable halo_north_in and local " )
       end if
       !dir$ pgas defer_sync
-      this%halo_north_in(1:nx,:,1:halo_size)[south_neighbor] = this%local(:,:,start+halo_size:start+halo_size*2-1)
+      buf = this%local(:,:,start+halo_size:start+halo_size*2-1)
+      this%halo_north_in(1:nx,:,1:halo_size)[south_neighbor] = buf
   end subroutine

   module subroutine retrieve_north_halo(this)
       class(exchangeable_t), intent(inout) :: this
-
+      real, allocatable :: buf(:, :, :)
       integer :: n, nx
       n = ubound(this%local,3)
       nx = size(this%local,1)

-      this%local(:,:,n-halo_size+1:n) = this%halo_north_in(:nx,:,1:halo_size)
+      buf = this%halo_north_in(:nx,:,1:halo_size)
+      this%local(:,:,n-halo_size+1:n) = buf
   end subroutine

   module subroutine retrieve_south_halo(this)
       class(exchangeable_t), intent(inout) :: this
-
+      real, allocatable :: buf(:, :, :)
       integer :: start, nx
       start = lbound(this%local,3)
       nx = size(this%local,1)

-      this%local(:,:,start:start+halo_size-1) = this%halo_south_in(:nx,:,1:halo_size)
+      buf = this%halo_south_in(:nx,:,1:halo_size)
+      this%local(:,:,start:start+halo_size-1) = buf
   end subroutine

   module subroutine put_east(this)
       class(exchangeable_t), intent(inout) :: this
+      real, allocatable :: buf(:, :, :)
       integer :: n, ny
       n = ubound(this%local,1)
       ny = size(this%local,3)
@@ -204,12 +210,13 @@
       end if

       !dir$ pgas defer_sync
-      this%halo_west_in(1:halo_size,:,1:ny)[east_neighbor] = this%local(n-halo_size*2+1:n-halo_size,:,:)
+      buf = this%local(n-halo_size*2+1:n-halo_size,:,:)
+      this%halo_west_in(1:halo_size,:,1:ny)[east_neighbor] = buf
   end subroutine

   module subroutine put_west(this)
       class(exchangeable_t), intent(inout) :: this
-
+      real, allocatable :: buf(:, :, :)
       integer :: start, ny
       start = lbound(this%local,1)
       ny = size(this%local,3)
@@ -222,27 +229,30 @@
       end if

       !dir$ pgas defer_sync
-      this%halo_east_in(1:halo_size,:,1:ny)[west_neighbor] = this%local(start+halo_size:start+halo_size*2-1,:,:)
+      buf = this%local(start+halo_size:start+halo_size*2-1,:,:)
+      this%halo_east_in(1:halo_size,:,1:ny)[west_neighbor] = buf
   end subroutine

   module subroutine retrieve_east_halo(this)
       class(exchangeable_t), intent(inout) :: this
-
+      real, allocatable :: buf(:, :, :)
       integer :: n, ny
       n = ubound(this%local,1)
       ny = size(this%local,3)

-      this%local(n-halo_size+1:n,:,:) = this%halo_east_in(1:halo_size,:,1:ny)
+      buf = this%halo_east_in(1:halo_size,:,1:ny)
+      this%local(n-halo_size+1:n,:,:) = buf
   end subroutine

   module subroutine retrieve_west_halo(this)
       class(exchangeable_t), intent(inout) :: this
-
+      real, allocatable :: buf(:, :, :)
       integer :: start, ny
       start = lbound(this%local,1)
       ny = size(this%local,3)

-      this%local(start:start+halo_size-1,:,:) = this%halo_west_in(1:halo_size,:,1:ny)
+      buf = this%halo_west_in(1:halo_size,:,1:ny)
+      this%local(start:start+halo_size-1,:,:) = buf
   end subroutine

On a machine with 16 physical / 32 logical cores (so the 36-image timing is not meaningful here):

Images   Duration
     9    37.150s
    18    30.003s
    27    20.916s
    36    25.165s

gutmann commented 6 years ago

Yes, don't use 36 there; 8 and 16 would be of interest, but I think the problem only appears when MPI calls are invoked across multiple nodes.

With the patch above applied (and -Ofast):

Images   Runtime (new)   Runtime (original)
    36            15.2                 14.7
    72            99.4                105

I suspect the differences are not statistically significant.

gutmann commented 6 years ago

@zbeekman have you tried reproducing this on another machine? I can dig around with different MPI implementations on cheyenne if it turns out this is somehow specific to my installation there, but I haven't had this kind of speed issue with past installations on that machine.

zbeekman commented 6 years ago

Sorry, no bandwidth to spare at the moment. Hope to look later this week.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

zbeekman commented 5 years ago

Be gone stalebot!