pmodels / casper

Process-based Asynchronous Progress Model for MPI Communication
https://pmodels.github.io/casper-www/

Open-MPI win_allocate issue #35

Open jeffhammond opened 4 years ago

jeffhammond commented 4 years ago

We should root-cause this. My money is on the Travis CI environment or an Open-MPI bug, rather than Casper.

testing mpiexec=mpiexec --oversubscribe -np 4 CSP_NG=0 win_allocate ...
CASPER Configuration:
    RMA_ERR_CHECK    (enabled) 
    CSP_VERBOSE      = err|conf_g|warn|conf_win|conf_comm|info
    CSP_NG           = 0
    CSP_ASYNC_CONFIG = on
    CSP_TOPO         = machine
    CSP_ASYNC_MODE   = rma|pt2pt
PT2PT Offloading Options:
    CSP_OFFLOAD_MIN_MSGSZ   = 8192 bytes
    CSP_OFFLOAD_SHMQ_NCELLS = 64 (total 13 Kbytes)
                              cell size = 208 bytes, cell size(aligned) = 256 bytes
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.
  Local host:  travis-job-daea1d28-3e8e-48bf-b0db-e18066dffe74
  System call: open(2) 
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------
[travis-job-daea1d28-3e8e-48bf-b0db-e18066dffe74:19159] *** An error occurred in MPI_Win_allocate
[travis-job-daea1d28-3e8e-48bf-b0db-e18066dffe74:19159] *** reported by process [893255681,2]
[travis-job-daea1d28-3e8e-48bf-b0db-e18066dffe74:19159] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
[travis-job-daea1d28-3e8e-48bf-b0db-e18066dffe74:19159] *** MPI_ERR_WIN: invalid window
[travis-job-daea1d28-3e8e-48bf-b0db-e18066dffe74:19159] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[travis-job-daea1d28-3e8e-48bf-b0db-e18066dffe74:19159] ***    and potentially your MPI job)
[travis-job-daea1d28-3e8e-48bf-b0db-e18066dffe74:19152] PMIX ERROR: UNREACHABLE in file ../../../../../../../opal/mca/pmix/pmix3x/pmix/src/server/pmix_server.c at line 2147
[travis-job-daea1d28-3e8e-48bf-b0db-e18066dffe74:19152] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[travis-job-daea1d28-3e8e-48bf-b0db-e18066dffe74:19152] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
test failed ! mpiexec --oversubscribe -np 4 /home/travis/build/pmodels/casper/test/win_allocate

minsii commented 3 years ago

@jeffhammond Do you know whether we still hit this issue with a newer version of OpenMPI? I plan to make a production release of Casper, since it has been stable for the past few years.

jeffhammond commented 3 years ago

I don't know. I stopped paying attention to Travis CI failures in many of my projects.

minsii commented 3 years ago

Me too. We moved to GitHub Actions in our other projects. Let me migrate Casper's CI too.