ulfm-devel / ompi

Open MPI main development repository
https://www.open-mpi.org

parallel I/O fails after a process failure #48

Closed abouteiller closed 5 years ago

abouteiller commented 5 years ago

Original report by Kai Keller (Bitbucket: kellekai, GitHub: kellekai).


When I use parallel I/O with MPI (MPI_File_open) after a process failure, the execution terminates with the following error:

[xxxxxxxxx:02050] mca_sharedfp_sm_file_open: Error, unable to open file for mmap: /tmp/ompi.xxxxxxxxx.1001/pid.2033/1/file2.mpi_cid-3-2050.sm

I have observed that the job session directory ‘ompi_process_info.job_session_dir’ no longer exists, and because of that the file cannot be created (in file ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c).
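
For illustration only, the failure mode described above can be sketched in plain POSIX C without any MPI. This is not the code from sharedfp_sm_file_open.c; the helper names dir_exists and try_create_backing_file are hypothetical, and the sketch only assumes that the backing file is created with open() inside the session directory. When the parent directory has been removed, open() with O_CREAT fails with ENOENT, because it creates the file but not any missing directories:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if path is an existing directory, else 0. */
static int dir_exists(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/*
 * Hypothetical helper: try to create a backing file named `name` inside `dir`,
 * roughly the way a shared-memory backing file would be created before mmap.
 * Returns 0 on success, or the errno value of the failure.
 */
static int try_create_backing_file(const char *dir, const char *name)
{
    char path[4096];
    int n = snprintf(path, sizeof(path), "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof(path)) {
        return ENAMETOOLONG;
    }

    /* O_CREAT creates the file itself, but never a missing parent directory. */
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        int err = errno;
        fprintf(stderr, "unable to open file for mmap: %s (%s)\n",
                path, strerror(err));
        return err;
    }

    close(fd);
    unlink(path); /* clean up the file created by this sketch */
    return 0;
}
```

Calling try_create_backing_file() with a directory for which dir_exists() returns 0 yields ENOENT, which matches the symptom of the session directory having disappeared after the process failure.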

I have attached a simple example that hopefully reproduces the error.

I have configured ULFM as:

./configure --with-ft --prefix=/xxxx/opt/ULFM/2.1 --enable-mpi-cxx --enable-cxx-exceptions=yes --enable-debug --enable-mpi-fortran=no

I am on commit: 6c76e287178d42d7dfd1e50e6be4ba18a86a06a1

abouteiller commented 5 years ago

Original comment by Nuria Losada (Bitbucket: nuriallv, GitHub: nuriallv).


Hi Kai,

Commit b54585d should fix your problem. Can you try it and tell us?

abouteiller commented 5 years ago

Original comment by Kai Keller (Bitbucket: kellekai, GitHub: kellekai).


Hi Nuria,

Yes, it seems to work now.