@neok-m4700 thanks for reporting this and all you've done to look into the issue! Just thinking out loud, but I would think that when a component of a derived type is an allocatable coarray it needs a `slave_token`, which is handled in OpenCoarrays. So even though it appears the `gfortran` side of things should handle local memory movement, since the `slave_token` handle lives in the caf runtime, that's probably where it needs to be fixed. I'll need to look at this issue closer, but won't be able to until next week.
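For concreteness, the situation I mean is something like this (an illustrative sketch, not the reporter's code): a derived type with an allocatable coarray component, i.e. the component that needs the `slave_token`, being assigned purely locally.

```fortran
! Illustrative sketch (not code from this thread): a derived type whose
! component is an allocatable coarray, i.e. the case that needs a slave_token.
module field_mod
  implicit none
  type :: field_t
     real, allocatable :: halo(:)[:]   ! allocatable coarray component
  end type field_t
end module field_mod

program component_demo
  use field_mod
  implicit none
  type(field_t) :: f

  allocate(f%halo(1000)[*])            ! collective allocation across all images
  f%halo(:) = real(this_image())       ! purely local assignment, no [..] coindex
  sync all
end program component_demo
```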
Thanks for the ideas, I've made some tests in https://github.com/neok-m4700/OpenCoarrays/commits/perf.
However, it does seem that the overhead is due to repeated calls to `send_for_ref`. I do not (yet!) see how to simplify the logic and bypass the costly recursive calls.
I thought I'd add a comment here on a performance issue I'm seeing at the moment using coarrays in derived types. I haven't chased down anything more specific, but using the coarray-icar test-ideal, and a small (200 x 200 x 20) problem size, I'm seeing huge slow downs across multiple nodes. This was not present in opencoarrays 1.9.1 (with gcc 6.3) and it is not present with intel. I don't know if this could be related to the issue noted above or if this is completely separate since it is internode communication and thus will require MPI calls.
Machine: https://www2.cisl.ucar.edu/resources/computational-systems/cheyenne (36 cores/node)
Images | OpenCoarrays 2.1 | OpenCoarrays 1.9.1 | Intel 18.0.1 |
---|---|---|---|
36 | 14.7 | 16.3 | 11.3 |
72 | 105 | 8.9 | 5.8 |
144 | 140 | 4.6 | 3.2 |
720 | 170 | 1.4 | 0.94 |
All times are in seconds. This is just the core runtime of the numerics, not any of the initialization time.
OpenCoarrays 2.1:
- gfortran/gcc 8.1 (built via the OpenCoarrays install.sh)
- mpich 3.2.1 (built via the OpenCoarrays install.sh)
- opencoarrays 2.1.0-31-gc0e3ffb (with fprintf statements commented out in mpi/mpi.c)

OpenCoarrays 1.9.1:
- gfortran/gcc 6.3
- MPT 2.15f
- opencoarrays 1.9.1

Intel 18.0.1:
- ifort 18.0.1*
- iMPI 2018.1.163
Uh oh, this is very problematic. Thanks for bringing this to our attention. The GFortran side was re-factored for GCC 8 with some very substantial changes. The execution time should NOT be increasing with the number of nodes! This is an EXTREME performance regression.
CC: @rouson
@neok-m4700 do you have a decent idea of which code regions are responsible for the slowdown? I can try running @gutmann's example code on a similar SGI/HPX machine with TAU if we need to localize this better.
thanks @zbeekman, I'd like to see someone else reproduce this just to be sure that there isn't something broken with my installation.
Would it make sense to break this off into a separate issue? The more I think about it the more I suspect this is unrelated to the initial issue reported here.
Sure, create a new issue but please mention #556 somewhere to connect it to this one for context.
I believe that regression is different from this issue, since you seem to only use what I'd call regular coarrays in your code, not coarrays in derived types.

Yep, better to open a new issue, and maybe try to reduce the problem to a MWE?

Sure, when the strong scaling has the wrong slope sign, something is definitely wrong :open_mouth:
I'm trying to run it on our cluster ...
I mean, after a point with LOOOOOOOOOTS of cores maybe you get the wrong slope, but not with that few; that's bananas.
discussion moved to #560
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Defect/Bug Report

- OpenCoarrays Coarray Fortran Compiler Wrapper (caf version 2.1.0)
- gfortran 8.1.0 + patches
- gcc 8.1.0
- `uname -a`: Linux <...> 4.15.0-23-generic #25-Ubuntu SMP Wed May 23 18:02:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
- mpich 3.2.1
- CMake 3.11.4
Observed Behavior
I've observed a massive slowdown of my code when copying a coarray locally (no remote references).

Expected Behavior

memcpy should be used; if not, we at least expect no MPI communication!
Steps to Reproduce
issue.f90
output (decimals truncated)
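Roughly, the pattern being timed looks like this (a simplified sketch, not the full issue.f90; names and sizes are illustrative): a purely local, non-coindexed copy involving a coarray component, compared against the same copy between ordinary arrays.

```fortran
! Simplified sketch (not the attached issue.f90): time a purely local copy
! involving a coarray component and compare it with a plain array copy.
program local_copy_timing
  implicit none
  integer, parameter :: n = 10**6
  type :: box_t
     real, allocatable :: a(:)[:]      ! allocatable coarray component
  end type box_t
  type(box_t) :: src, dst
  real, allocatable :: psrc(:), pdst(:)
  integer :: c0, c1, crate

  allocate(src%a(n)[*], dst%a(n)[*])   ! collective allocations
  allocate(psrc(n), pdst(n))
  src%a = 1.0
  psrc  = 1.0
  call system_clock(count_rate=crate)

  call system_clock(c0)
  dst%a(:) = src%a(:)                  ! local copy, no [..] coindex anywhere
  call system_clock(c1)
  if (this_image() == 1) print *, 'coarray-component copy:', real(c1 - c0)/crate, 's'

  call system_clock(c0)
  pdst(:) = psrc(:)                    ! reference: ordinary array copy
  call system_clock(c1)
  if (this_image() == 1) print *, 'plain array copy:      ', real(c1 - c0)/crate, 's'
end program local_copy_timing
```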
Tracking down the source of this unwanted `caf_send` in the gfortran sources: `gcc/fortran/trans-expr.c`, around l. 10240.

So this is strange: `gfortran` is delegating the assignment to the underlying coarray lib even though no explicit remote reference (`arr(..)[]`, which is what needs to be accessed via the caf runtime) is made! Why? The documentation clearly states that `caf_send` is to be used to send data to a remote process, not locally ...
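To be explicit about what I mean by a remote reference (an illustrative snippet, not from issue.f90): only the coindexed access in the last assignment should need the caf runtime; the first two are purely local.

```fortran
program local_vs_remote
  implicit none
  real    :: arr(100)[*]              ! a "regular" coarray
  real    :: x
  integer :: left

  left = merge(num_images(), this_image() - 1, this_image() == 1)

  arr(1) = real(this_image())         ! local definition: should be a plain store
  x      = arr(1)                     ! local read: should be a plain load
  sync all                            ! make neighbours' data visible
  x      = arr(1)[left]               ! coindexed read: legitimately the caf runtime's job
  print *, 'image', this_image(), 'read', x, 'from image', left
end program local_vs_remote
```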
Question

Let's assume that the assignment does need to be handled by the caf lib: shouldn't we try to use `memcpy` if we detect that `remote_img == this_image`?

If someone could clarify the strategy, should I 1) patch `gfortran` so that the assignment does not emit a `caf_send`, or 2) patch OpenCoarrays to avoid the MPI comms?