sourceryinstitute / OpenCoarrays

A parallel application binary interface for Fortran 2018 compilers.
http://www.opencoarrays.org
BSD 3-Clause "New" or "Revised" License

The number of images creates an UNKNOWN error #637

Closed singleterry closed 5 years ago

singleterry commented 5 years ago

Problem: once the number of images gets beyond 64, CAF errors with a blank error.
CAF Error.zip

I have enclosed a sample. This is based off of my real mini-app that uses a 2D Coarray (could be simplified of course). For this example, the 2D's are square (meaning the number of indexes is the same). The directory structure contains the program (prog.f90), a template PBS script (coa.pbs), and a controlling compile/execute command (go.csh). The output is contained in sub-directories labeled 4-11. Each directory contains the submitted PBS script (coa.pbs) and the output log (output.log). For 4-8, the output is as expected - it finishes successfully. For 9-11, it just blows up. No error given.

Same executable with just differing input.

Not to rub it in, but these work perfectly for all input under Intel 2019.1.144.

rouson commented 5 years ago

Thanks for the report, @singleterry. I suspect the issue is related to the line

pid = this_image(FLUX)

I don't think OpenCoarrays actually supports passing an argument to the this_image intrinsic function. We have unfortunately accepted some code contributions in the past where the contributor knew certain functionality was broken, said so in a comment in the code, but didn't bother to generate a useful error message for the user. I'll see if we can at least generate a more useful runtime message if I can figure out how to detect when the argument is being passed. Unfortunately, it will have to be a runtime error message unless a gfortran developer is willing to modify the compiler, which is beyond my skill set and available time currently.

For more details, see line 1096 in the OpenCoarrays MPI layer, where the argument to this_image is marked unused. The comment above that line indicates that the code is non-standard:

  /* TODO: This is interface is violating the F2015 standard, but not the gfortran
   * API. Fix it (the fortran API). */
  int
  PREFIX(this_image) (int distance __attribute__((unused)))
  {
    return caf_this_image;
  }

It's very frustrating that this was introduced into OpenCoarrays, but it hasn't bitten us so far because it seems to be much more common for users to invoke this_image() with no argument.

singleterry commented 5 years ago

Hello,

That is key functionality for what I am doing. A lot of index manipulation would be necessary to go from a single-dimension image number to multi-dimension image numbers. Intel handles it fine. I’ll come up with an algorithm and include it in the CAF version and run it again. It strikes me that this may not be the issue. I am getting an array of indexes back from this_image into pid from that line?

I am also seeing HUGE differences in runtimes between Intel (faster) and CAF (slower). I am not sure that the compile arguments are being used properly for -march. I am halfway through editing the paper to include Intel 2019 and OpenCoarrays 2.3.1 results (wanted to get the initial issue resolved first). Because of all your help, your name is on it and before we publish, I’ll let you see why there are such huge differences. Plus, I’ll have a better handle on what is going on once I resolve this initial problem.

Thanks for the quick response!

Robert

singleterry commented 5 years ago

Hello,

Sorry. Here is an output from Intel and OpenCoarrays:

Program:

  program t
    integer, codimension[1:10,1:*] :: a
    integer, dimension(2) :: pid
    pid = this_image(a)
    print *, 'Hello World, ', pid(1), pid(2)
  end program t

Compile for Intel:

  % ifort -coarray -o t -traceback -coarray-num-images=20 t.f90

Output for Intel executable:

  % t
  MPI startup(): I_MPI_JOB_CONTEXT environment variable is not supported.
  MPI startup(): I_MPI_SCALABLE_OPTIMIZATION environment variable is not supported.
  MPI startup(): I_MPI_DEVICE environment variable is not supported.
  MPI startup(): I_MPI_FALLBACK environment variable is not supported.
  MPI startup(): I_MPI_CAF_RUNTIME environment variable is not supported.
  MPI startup(): To check the list of supported variables, use the impi_info utility or refer to https://software.intel.com/en-us/mpi-library/documentation/get-started.
  Hello World, 5 1
  Hello World, 7 1
  Hello World, 8 1
  Hello World, 9 1
  Hello World, 10 1
  Hello World, 1 2
  Hello World, 2 2
  Hello World, 3 2
  Hello World, 5 2
  Hello World, 6 2
  Hello World, 7 2
  Hello World, 8 2
  Hello World, 9 2
  Hello World, 10 2
  Hello World, 1 1
  Hello World, 2 1
  Hello World, 3 1
  Hello World, 4 1
  Hello World, 6 1
  Hello World, 4 2

Not sure what the initial messages are for; I am checking with my SAs.

Compile for OpenCoarrays:

  % caf -o t -fbacktrace t.f90

Output for OpenCoarrays executable:

  % cafrun -np 20 t
  Hello World, 1 1
  Hello World, 2 1
  Hello World, 3 1
  Hello World, 4 1
  Hello World, 5 1
  Hello World, 6 1
  Hello World, 7 1
  Hello World, 9 1
  Hello World, 1 2
  Hello World, 2 2
  Hello World, 3 2
  Hello World, 4 2
  Hello World, 5 2
  Hello World, 6 2
  Hello World, 7 2
  Hello World, 8 2
  Hello World, 9 2
  Hello World, 8 1
  Hello World, 10 2
  Hello World, 10 1

Everything seems to work fine!?!?

Robert

rouson commented 5 years ago

Wow. Yes, it appears to be working fine even though the header file clearly shows this_image returning a single integer. I'm guessing gfortran is substituting its own definition of this_image rather than calling the one in OpenCoarrays. I will investigate it further.

Once upon a time, gfortran/OpenCoarrays were generally at least as fast as, and most often faster than, Intel for coarray code, but Intel made some significant improvements to their coarray performance in recent releases. Hopefully this is a good sign for Intel. We'll need to study what's going on with gfortran.

Thanks for including me on the paper. I'm glad my contribution was helpful. I'll also be happy to contribute to the writing.

afanfa commented 5 years ago

Hi @singleterry, as you have noted, this_image works just fine thanks to a manipulation happening in the Fortran front-end of GFortran. In other words, the result returned by int PREFIX(this_image) gets contextualized by GFortran, based on the rank, shape, and extent of the codimensions.
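To make that contextualization concrete, here is a minimal C sketch of the arithmetic that turns a single image number into cosubscripts. It assumes the column-major image ordering used for coarrays (the first codimension varies fastest). The function name image_to_cosubscripts and its parameters are hypothetical; this is not the gfortran front-end code, only an illustration of the kind of mapping involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of OpenCoarrays or gfortran).
 * Converts a 1-based image number into cosubscripts, column-major.
 *   image     : value as returned by this_image() with no argument
 *   corank    : number of codimensions (must be >= 1)
 *   lcobound  : lower cobound of each codimension
 *   coextent  : extent of each codimension; the last entry is ignored
 *               because the final codimension is declared with "*"
 *   cosub     : output array of length corank
 */
void image_to_cosubscripts(int image, size_t corank,
                           const int *lcobound, const int *coextent,
                           int *cosub)
{
    assert(corank >= 1);
    assert(image >= 1);

    int rest = image - 1;  /* zero-based linear index */

    /* Peel off one codimension at a time, fastest-varying first. */
    for (size_t i = 0; i + 1 < corank; ++i) {
        assert(coextent[i] > 0);
        cosub[i] = lcobound[i] + rest % coextent[i];
        rest /= coextent[i];
    }

    /* Whatever remains indexes the final, open-ended codimension. */
    cosub[corank - 1] = lcobound[corank - 1] + rest;
}
```

Under these assumptions, for the test program in this thread (codimension[1:10,1:*] with 20 images), image 11 would map to cosubscripts (1, 2) and image 20 to (10, 2), which is consistent with the pairs printed by both compilers.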

afanfa commented 5 years ago

@singleterry I am running your code on NCAR's supercomputer without any issues. I am using the OpenCoarrays version currently on master, MPICH-3.2 and GFortran-6.3.0 (there is no MPICH-3.2 module available with newer compilers). I have also tested it with GNU-8.1 and MPT-2.19 and I haven't found issues.

I am running it like this:

mpirun -np 144 ./prog 12

Is that correct?

singleterry commented 5 years ago

Hello,

Yes, except I use cafrun instead of mpirun.

My versions are (of the packages that I can remember):

1) OpenCoarrays 2.3.1

a. runtime

   i. GCC 8.2.0 (ok)
   ii. MPICH 3.2 (ok)

b. build

   i. CMAKE 2.8.4 (should be 3.4.0 – may have been during build)
   ii. BISON 3.0.4 (ok)
   iii. WGET 1.4 (should be 1.16.3 – may have been during build)
   iv. FLEX 2.5.37 (should be 2.6.0 – may have been during build)
   v. MAKE 3.82 (should be 4.1 – may have been during build)
   vi. M4 1.4.16 (should be 1.4.17 – may have been during build)
   vii. PKG-CONFIG 0.27.1 (should be 0.28 – may have been during build)

These were the versions downloaded and installed by OpenCoarrays 2.3.1 using the ./install.sh script. Maybe there is an issue with MPICH 3.2? Our default build version of GCC is gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28) as we did not build with bootstrapping. This is CentOS 7.6 (I think). Maybe we need a newer version of GCC or to bootstrap the build?

When I ran my jobs with mpirun/mpiexec instead of cafrun, I get the same results.

This seems like a version issue with the packages used by OpenCoarrays. Can you do a clean build of 2.3.1 and try it? If I need to go to another version of a package or to OpenCoarrays 2.5.0, it is not a big deal.

Thanks Robert



zbeekman commented 5 years ago

Hi Robert,

It sounds like this issue is resolved, is that correct? If I am mistaken please let me know and we can reopen this issue.

FYI, if you install via the install.sh script, it verifies (and builds) the tooling that is needed to perform the installation (but only installs the toolchain into a local directory).

This means that the versions of software you are listing may be the versions available by default on your system, but they are most certainly NOT the versions used to build OpenCoarrays. (For example, the minimum CMake version allowed to actually build OpenCoarrays is 3.10, and the build system will throw an error if you install with anything less.)
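As a side note on reading such version requirements: versions compare component by component as numbers, so 3.4.0 is older than 3.10 even though "3.4" looks larger as a decimal. The C sketch below illustrates that comparison. The function name version_at_least is hypothetical, and this is only an illustration, not how CMake or install.sh perform the check.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if version string `have` is at least
 * `need`, comparing up to three dot-separated numeric components.
 * Missing components are treated as 0 (so "3.10" means 3.10.0). */
int version_at_least(const char *have, const char *need)
{
    int h[3] = {0, 0, 0};
    int n[3] = {0, 0, 0};

    /* sscanf stops at the first component it cannot parse; the
     * remaining entries keep their initial value of 0. */
    sscanf(have, "%d.%d.%d", &h[0], &h[1], &h[2]);
    sscanf(need, "%d.%d.%d", &n[0], &n[1], &n[2]);

    for (int i = 0; i < 3; ++i) {
        if (h[i] != n[i]) {
            return h[i] > n[i];  /* first differing component decides */
        }
    }
    return 1;  /* all components equal */
}
```

With this comparison, neither the system CMake 2.8.4 nor a 3.4.0 would satisfy a 3.10 minimum, which is consistent with the point that the build must have used a newer, locally installed CMake.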

If you install via install.sh please include the full command line invocation you used to perform the build.

singleterry commented 5 years ago

Hello,

This issue is resolved! Yay! I get all the images I should get!

I may have mis-typed my list, but yes, the tools are downloaded and installed locally. I was then able to have my SA just copy my directory structure and repoint the directory headings to the proper place in the setup scripts and all worked perfectly!

Thanks again! Robert



zbeekman commented 5 years ago

I'm happy to hear it's working!
