sourceryinstitute / OpenCoarrays

A parallel application binary interface for Fortran 2018 compilers.
http://www.opencoarrays.org
BSD 3-Clause "New" or "Revised" License

homebrew-installed opencoarrays produces segfaults with simple coarray accesses #626

Open rouson opened 5 years ago

rouson commented 5 years ago

Defect/Bug Report

Observed Behavior

$ cat main.f90 
  type Array_Type
      real, allocatable :: values(:)
  end type
  type(Array_Type) array[*]

  allocate(array%values(2),source=0.)
  array%values = this_image()
  sync all
  print *, array%values
end
$ caf main.f90 
$ cafrun -n 4 ./a.out
   4.00000000       4.00000000    
   1.00000000       1.00000000    
   2.00000000       2.00000000    
   3.00000000       3.00000000    
[localhost:73816] *** An error occurred in MPI_Win_detach
[localhost:73816] *** reported by process [4040687617,1]
[localhost:73816] *** on win rdma window 5
[localhost:73816] *** MPI_ERR_OTHER: known error not in list
[localhost:73816] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[localhost:73816] ***    and potentially your MPI job)
[localhost:73814] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[localhost:73814] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Error: Command:
   `/usr/local/bin/mpiexec -n 4 ./a.out`
failed to run.
$ caf --version

OpenCoarrays Coarray Fortran Compiler Wrapper (caf version 2.3.1)
...

The error occurs intermittently (non-deterministically).

Installing with the OpenCoarrays installer script eliminates the problem -- presumably because the installer builds against MPICH instead of Open MPI.
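
For anyone who wants to test that hypothesis without the bundled installer, here is a rough sketch of a source build pointed explicitly at MPICH. The Homebrew formula name and wrapper paths are assumptions about a typical macOS setup, not something verified on this machine:

$ brew install mpich
$ git clone https://github.com/sourceryinstitute/OpenCoarrays && cd OpenCoarrays
$ mkdir build && cd build
$ FC=gfortran CC=gcc cmake .. \
    -DMPI_Fortran_COMPILER="$(brew --prefix mpich)/bin/mpifort" \
    -DMPI_C_COMPILER="$(brew --prefix mpich)/bin/mpicc"
$ make && make install          # may need -DCMAKE_INSTALL_PREFIX pointing somewhere writable
$ caf main.f90 && cafrun -n 4 ./a.out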

zbeekman commented 5 years ago

I'm seeing the following failures with OpenMPI and OC 2.5.0:

96% tests passed, 3 tests failed out of 78

Total Test time (real) =  44.76 sec

The following tests FAILED:
         14 - alloc_comp_get_convert_nums (Failed)
         23 - alloc_comp_send_convert_nums (Failed)
         69 - issue-515-mimic-mpi-gatherv (Failed)
Errors while running CTest

Hopefully it's the same problem we're seeing here... we'll see what happens with the "bottling" of the latest 2.5.0 release of OpenCoarrays.

zbeekman commented 5 years ago

@rouson I can't reproduce this with a fresh install of OpenCoarrays from Homebrew, so I'm going to close this. If the issue persists, you can re-open it or we can investigate together.

cprich01 commented 1 month ago

It seems the original Open MPI install was the problem: since the coarrays install couldn't see Open MPI, it loaded the default. MPICH is installed alongside coarrays and passes through to Open MPI if it exists on the system; Open MPI in turn wraps gfortran via mpifort. Any link in that chain can break the process, and every layer has its own settings and flags for tuning the program to the task at hand. Since MPICH is the default, most people don't realize that MPICH will run successfully in series on its own. I'm new to Open MPI, and the last time I did Fortran programming it was on a terminal that submitted batch jobs to a mainframe: you got punch cards, sorted and verified the order of each line of code, and then fed them to the beast.
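
One quick way to see which links are actually in that chain is to ask each wrapper to report itself. A minimal sketch; the -show and --showme spellings are the MPICH and Open MPI conventions respectively, so adjust for whichever wrapper is first on PATH:

$ caf --version            # which OpenCoarrays wrapper is being picked up
$ mpifort -show            # MPICH-style wrappers print the underlying gfortran command line
$ mpifort --showme         # Open MPI's equivalent flag
$ mpiexec --version        # which launcher cafrun will ultimately invoke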

cprich01 commented 1 month ago

My problem is a simple one, but searching through the documentation for the needle in the haystack is getting the best of me. I have two ethernet ports on this computer, so I think that if I link them together I can at least tell whether osc ucx is working.
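
If the goal is only to confirm that Open MPI's UCX one-sided component is present and selectable, that can usually be checked without any special network setup. A rough sketch, assuming an Open MPI build; the ucx components only appear if Open MPI was configured against UCX:

$ ompi_info | grep -i ucx                   # lists osc/pml ucx components if they were built in
$ mpiexec --mca osc ucx -n 2 ./a.out        # force the UCX one-sided component for this run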

zbeekman commented 1 month ago

@cprich01 it's not clear to me what the actual issue is that you are facing. I suggest opening a new bug report unless you are facing exactly the same problem on macOS as described above.