sourceryinstitute / OpenCoarrays

A parallel application binary interface for Fortran 2018 compilers.
http://www.opencoarrays.org
BSD 3-Clause "New" or "Revised" License

Defect: Performance of derived type coarrays #556

Open t-bltg opened 6 years ago

t-bltg commented 6 years ago

Defect/Bug Report

Observed Behavior

I've observed a massive slowdown of my code when copying a coarray locally (no [remote] references).

Expected Behavior

A memcpy should be used; if not, we at least expect no MPI communication!

Steps to Reproduce

issue.f90

! caf -o issue issue.f90
! cafrun -np 2 ./issue
module co_obj
   implicit none
   type co
      real(8), allocatable :: a(:, :, :, :)[:]
   end type
end module

program main
   use co_obj
   use mpi
   implicit none

   type(co) :: lhs, rhs
   real(8) :: t0
   real(8), allocatable :: buf(:, :, :, :)
   integer :: ni, nj, nk, nl, i, j, k, l

   ni = 8
   nj = 8
   nk = 8
   nl = 8

   if (num_images() /= 2) error stop 1

   allocate( &
      lhs % a(ni, nj, nk, nl)[*], &
      rhs % a(ni, nj, nk, nl)[*], &
      buf(ni, nj, nk, nl))

   sync all

   print *, '==> START <=='
   t0 = mpi_wtime()
   buf(:, :, :, :) = rhs % a
   lhs % a = buf
   print *, 't1=', mpi_wtime() - t0

   sync all

   t0 = mpi_wtime()
   lhs % a = rhs % a ! implicit MPI transfer, where there should NOT be !
   print *, 't2=', mpi_wtime() - t0

   sync all
   t0 = mpi_wtime()
   do l = 1, nl
      do k = 1, nk
         do j = 1, nj
            do i = 1, ni
               lhs % a(i, j, k, l) = rhs % a(i, j, k, l)
            end do
         end do
      end do
   end do
   print *, 't3=', mpi_wtime() - t0

   sync all
   print *, '==> STOP <=='

end program

output (decimals truncated)

 ==> START <==
 ==> START <==
 t1=   1.05E-004
 t1=   9.32E-005
 t2=   8.96     # <== yes this is clearly a bottleneck
 t2=   9.11   
 t3=   2.69E-005
 t3=   3.43E-005
 ==> STOP <==
 ==> STOP <==

Tracking down the source of this unwanted caf_send in the gfortran sources:

gcc/fortran/trans-expr.c around l. 10240

  else if (flag_coarray == GFC_FCOARRAY_LIB
       && lhs_caf_attr.codimension && rhs_caf_attr.codimension
       && ((lhs_caf_attr.allocatable && lhs_refs_comp)
           || (rhs_caf_attr.allocatable && rhs_refs_comp)))
    {
      /* Only detour to caf_send[get][_by_ref] () when the lhs or rhs is an
     allocatable component, because those need to be accessed via the
     caf-runtime.  No need to check for coindexes here, because resolve
     has rewritten those already.  */
      gfc_code code;
      gfc_actual_arglist a1, a2;
      /* Clear the structures to prevent accessing garbage.  */
      memset (&code, '\0', sizeof (gfc_code));
      memset (&a1, '\0', sizeof (gfc_actual_arglist));
      memset (&a2, '\0', sizeof (gfc_actual_arglist));
      a1.expr = expr1;
      a1.next = &a2;
      a2.expr = expr2;
      a2.next = NULL;
      code.ext.actual = &a1;
      code.resolved_isym = gfc_intrinsic_subroutine_by_id (GFC_ISYM_CAF_SEND);
      tmp = gfc_conv_intrinsic_subroutine (&code);
    }

So this is strange: gfortran delegates the assignment to the underlying coarray library even though no explicit remote reference (arr(...)[...]) is made. "Those need to be accessed via the caf-runtime" => why? The documentation clearly states that caf_send is meant to send data to a remote image, not to copy locally ...
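
For contrast, a minimal sketch (untested; written for this discussion, not part of the original report): the same local assignment on a "regular" allocatable coarray, i.e. one that is not a derived-type component, does not satisfy the lhs_refs_comp/rhs_refs_comp condition above and is therefore expected to remain a plain local copy.

! regular_coarray.f90 (hypothetical contrast case)
program regular_coarray
   implicit none
   real(8), allocatable :: lhs(:, :, :, :)[:], rhs(:, :, :, :)[:]

   allocate(lhs(8, 8, 8, 8)[*], rhs(8, 8, 8, 8)[*])
   rhs = 1.0d0

   ! No derived-type component is involved, so the branch in trans-expr.c
   ! quoted above should not be taken and no caf_send should be emitted.
   lhs = rhs

   sync all
end program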

Question

Let's assume that the assignment really does need to be handled by the CAF library: shouldn't we at least use memcpy when we detect that remote_img == this_image?

If someone could clarify the strategy: should I 1) patch gfortran so that the assignment does not emit a caf_send, OR 2) patch OpenCoarrays to try to avoid the MPI communication?
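
A possible workaround while this is unresolved, as a minimal sketch (untested; module and subroutine names are illustrative, not part of the original report): route the copy through non-coarray dummy arguments, so that inside the helper neither side carries the codimension attribute and the compiler should emit a plain local assignment instead of a caf_send.

! local_copy.f90 (hypothetical helper)
module local_copy_mod
   implicit none
contains
   subroutine local_copy(dst, src)
      real(8), intent(out) :: dst(:, :, :, :)
      real(8), intent(in)  :: src(:, :, :, :)
      ! Both dummies are non-coarray assumed-shape arrays, so this
      ! assignment is a purely local copy.
      dst = src
   end subroutine
end module

! usage (the module provides the explicit interface the assumed-shape
! dummies require):
!    use local_copy_mod
!    call local_copy(lhs % a, rhs % a)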

scrasmussen commented 6 years ago

@neok-m4700 thanks for reporting this and for all you've done to look into the issue! Just thinking out loud, but I would think that when a component of a derived type is an allocatable coarray it needs a slave_token, which is handled here in OpenCoarrays. So even though it appears the gfortran side of things should handle local memory movement, since the slave_token handle lives in the caf-runtime, that's probably where it needs to be fixed. I'll need to look at this issue more closely, but won't be able to until next week.

t-bltg commented 6 years ago

Thanks for the ideas; I've run some tests in https://github.com/neok-m4700/OpenCoarrays/commits/perf.

However, it does seem that the overhead is due to repeated calls to send_for_ref. I do not (yet!) see how to simplify the logic and bypass the costly recursive calls.

gutmann commented 6 years ago

I thought I'd add a comment here on a performance issue I'm seeing at the moment using coarrays in derived types. I haven't chased down anything more specific, but using the coarray-icar test-ideal case with a small (200 x 200 x 20) problem size, I'm seeing huge slowdowns across multiple nodes. This was not present in OpenCoarrays 1.9.1 (with gcc 6.3), and it is not present with Intel. I don't know if this is related to the issue noted above or completely separate, since it involves internode communication and thus will require MPI calls.

Cheyenne (https://www2.cisl.ucar.edu/resources/computational-systems/cheyenne), 36 cores / node

Images   OpenCoarrays 2.1   OpenCoarrays 1.9.1   Intel 18.0.1
36       14.7               16.3                 11.3
72       105                8.9                  5.8
144      140                4.6                  3.2
720      170                1.4                  0.94

All times are in seconds. This is just the core runtime of the numerics, not any of the initialization time.

OpenCoarrays 2.1

gfortran/gcc 8.1 (built via opencoarrays install.sh)
mpich 3.2.1 (built via opencoarrays install.sh)
opencoarrays 2.1.0-31-gc0e3ffb (with fprintf statements commented out in mpi/mpi.c)

OpenCoarrays 1.9.1

gfortran/gcc 6.3
MPT 2.15f
opencoarrays 1.9.1

Intel 18.0.1

ifort 18.0.1*
iMPI 2018.1.163
zbeekman commented 6 years ago

Uh oh, this is very problematic. Thanks for bringing this to our attention. The GFortran side was refactored for GCC 8 with some very substantial changes. The execution time should NOT be increasing with the number of nodes! This is an EXTREME performance regression.

CC: @rouson

@neok-m4700 do you have a decent idea of which code regions are responsible for the slowdown? I can try running @gutmann's example code on a similar SGI/HPX machine with TAU if we need to localize this better.

gutmann commented 6 years ago

thanks @zbeekman, I'd like to see someone else reproduce this just to be sure that there isn't something broken with my installation.

Would it make sense to break this off into a separate issue? The more I think about it the more I suspect this is unrelated to the initial issue reported here.

zbeekman commented 6 years ago

Sure, create a new issue but please mention #556 somewhere to connect it to this one for context.

t-bltg commented 6 years ago

I believe that regression is different from this issue, since you seem to use only what I call regular coarrays in your code, not coarrays inside derived types.

Yep, better to open a new issue, and maybe try to reduce the problem to an MWE?

Sure, when strong scaling has the wrong slope sign, something is definitely wrong :open_mouth:

I'm trying to run it on our cluster ...

zbeekman commented 6 years ago

I mean, after a point with LOOOOOOOOOTS of cores maybe you get the wrong slope, but not with that few; that's bananas.

gutmann commented 6 years ago

discussion moved to #560

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
