angainor opened this issue 4 years ago
Just to add some info, I found that the segfault happens when the file is located on the Lustre file system. The same code works fine if the file is stored on a local disk, or on a BeeGFS share.
Can this have something to do with Lustre integration / version / etc? Does anyone have suggestions on how to debug this?
I'm afraid I don't know much about HDF. @edgargabriel any insight into this?
Does the same problem happen with OMPIO?
@jsquyres Perfect! Thanks, `-mca io ompio` works :) I'm not up to date here. Is there a substantial difference between that and `romio321`?
ROMIO is an import of MPI-IO functionality from MPICH. Originally, ROMIO was a standalone MPI-IO library written at Argonne (back in the early days of MPI-2 when MPI-IO was new). It eventually got slurped up into MPICH itself. But ever since it was created, ROMIO was slurped up into other MPI implementations too -- such as Open MPI. We've continued to import newer versions of ROMIO from MPICH over the years. I don't remember offhand which version of ROMIO we have, but perhaps it's got a bug in this case.
OMPIO is our own, native MPI-IO implementation -- wholly separate from ROMIO. It was spearheaded by Dr. Edgar Gabriel at U. Houston (i.e., @edgargabriel). OMPIO is Open MPI's default MPI-IO these days, except in a few cases (I don't remember which cases offhand, sorry!).
Put simply: OMPIO vs. ROMIO is just another run-time plugin/component decision in Open MPI, just like all the others. 😄 We tend to prefer OMPIO 😉, but we keep ROMIO because of its age, maturity, and simply because some people/apps have a preference for it and/or established/verified compatibility with it.
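For anyone finding this later, a quick sketch of how that selection works at run time. These are standard Open MPI commands; the `-np 2 ./a.out` invocation is just the reproducer from this issue:

```
# List the MPI-IO components this Open MPI build provides (ompio, romio321, ...)
ompi_info | grep "MCA io"

# Force the native OMPIO implementation for a single run
mpirun --mca io ompio -np 2 ./a.out

# Or force the ROMIO-based component instead
mpirun --mca io romio321 -np 2 ./a.out
```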
@jsquyres Thanks a lot, that's good to know! I run 4.0.3, which seems to use `romio321` by default. At least on our system (maybe because of Lustre?):

```
[login-2.betzy.sigma2.no:05806] io:base:file_select: component available: ompio, priority: 1
[login-2.betzy.sigma2.no:05806] io:base:file_select: component available: romio321, priority: 10
```

I guess I will simply change that in `openmpi-mca-params.conf` if you say it should actually be the default.
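In case it is useful to someone else: the default can indeed be set once in the MCA parameter file instead of per `mpirun` invocation. A minimal sketch, assuming the system-wide `$prefix/etc/openmpi-mca-params.conf` (a per-user `~/.openmpi/mca-params.conf` works the same way):

```
# Always prefer the native OMPIO implementation over romio321
io = ompio
```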
@jsquyres And some more info: I checked OpenMPI 3.1.4 with `romio314`, and that works. So it seems it is something with the newer version.
I am not sure I have much to contribute to this discussion; I haven't seen this bug yet with `romio321`.
Generally speaking, romio is used by default on Lustre file systems (that's why it has a higher priority in this case), and ompio basically everywhere else. That said, ompio does have support for Lustre as well, and we are working on some interesting features that, if they work out the way we hope, will let us switch to ompio on Lustre too.
This didn't get fixed in 4.0.4.
This sounds like a bug we fixed in ROMIO at some point in the last four years, but I haven't waded through the history to find what might be the fix. I would love to see a `romio-332` -- it is only one year old.
For reference: I just encountered the same error on a system with a GPFS filesystem as well, with OpenMPI 4.1.0-rc1.
Ran into what seems like this bug when trying to build hdf5 1.12.0 on a system with OpenMPI 4.0.5. The system has Lustre (2.12.4.1_cray_139_g0763d21) and the backtrace is exactly like above from H5PB_write on. This happens during `make check` with the testpar/testphdf5 unit test of hdf5; using

```
../libtool --mode=execute mpirun -mca io ompio -n 6 ./testphdf5
```

the test finishes successfully.
Unfortunately I lack the information to make a debug build of OpenMPI on that system that would exactly match the system version, but I'll try to get more information out of the person who installed that package.
@tjahns Not sure whether it is relevant for your work or not, but note that ompio is now the default even on Lustre file systems, starting from the 4.1.x release. The romio component in Open MPI will also be updated to resolve these issues, but I am not 100% sure what the status of that effort is.
While OMPIO resolves this issue, we haven't finished the ROMIO update yet because we're waiting for some fixes from upstream. See #8371.
Just one more item to resolve: https://github.com/pmodels/mpich/pull/5101. Thanks for your patience!
Not going to update ROMIO in v4.0.x. Removing label.
Moved the milestone to 5.0 and removed the 4.1.x label; given that OMPIO works for this use case, we're not going to backport significant ROMIO changes into 4.1.x at this point.
I am looking at OpenMPI 4.0.3 and HDF5 1.10.6 compiled against it. A user reported a segfault in `ADIOI_Flatten()` when using a chunked dataset, i.e., when the following line is executed:

A simple FORTRAN reproducer is attached (compile with `h5pfc ioerror.F90`, run with `mpirun -np 2 ./a.out`). The same code works with Intel MPI. Here is the stack:

Could that be an OpenMPI problem, or do you think it is HDF5 that's causing it? I'd appreciate any help! Thanks!

ioerror.zip
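Since the attached ioerror.F90 is not shown inline, here is a minimal, hypothetical sketch of the kind of code being described (parallel HDF5, chunked dataset, collective write), for readers without access to the attachment. The file name, dataset name, and sizes are illustrative assumptions, not the contents of the attachment:

```fortran
! Hypothetical sketch -- not the attached ioerror.F90; names and sizes are illustrative.
program chunked_write_sketch
  use mpi
  use hdf5
  implicit none
  integer :: ierr, rank, nprocs
  integer(hid_t) :: fapl, file_id, dspace, memspace, dset, dcpl, dxpl
  integer(hsize_t) :: dims(1), chunk(1), cnt(1), offs(1)
  real, allocatable :: buf(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  call h5open_f(ierr)

  ! Create the file collectively through the MPI-IO driver
  call h5pcreate_f(H5P_FILE_ACCESS_F, fapl, ierr)
  call h5pset_fapl_mpio_f(fapl, MPI_COMM_WORLD, MPI_INFO_NULL, ierr)
  call h5fcreate_f("ioerror_test.h5", H5F_ACC_TRUNC_F, file_id, ierr, access_prp=fapl)

  ! Chunked dataset spanning all ranks
  dims(1)  = int(1024, hsize_t) * nprocs
  chunk(1) = 128
  call h5screate_simple_f(1, dims, dspace, ierr)
  call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl, ierr)
  call h5pset_chunk_f(dcpl, 1, chunk, ierr)
  call h5dcreate_f(file_id, "data", H5T_NATIVE_REAL, dspace, dset, ierr, dcpl_id=dcpl)

  ! Each rank writes its own contiguous slab with a collective transfer
  cnt(1)  = 1024
  offs(1) = int(rank, hsize_t) * 1024
  allocate(buf(1024)); buf = real(rank)
  call h5screate_simple_f(1, cnt, memspace, ierr)
  call h5sselect_hyperslab_f(dspace, H5S_SELECT_SET_F, offs, cnt, ierr)
  call h5pcreate_f(H5P_DATASET_XFER_F, dxpl, ierr)
  call h5pset_dxpl_mpio_f(dxpl, H5FD_MPIO_COLLECTIVE_F, ierr)
  call h5dwrite_f(dset, H5T_NATIVE_REAL, buf, cnt, ierr, &
                  mem_space_id=memspace, file_space_id=dspace, xfer_prp=dxpl)

  ! Clean up
  call h5pclose_f(dxpl, ierr); call h5pclose_f(dcpl, ierr); call h5pclose_f(fapl, ierr)
  call h5sclose_f(memspace, ierr); call h5sclose_f(dspace, ierr)
  call h5dclose_f(dset, ierr); call h5fclose_f(file_id, ierr)
  call h5close_f(ierr)
  call MPI_Finalize(ierr)
end program chunked_write_sketch
```

Compiled with `h5pfc` and run under `mpirun -np 2`, a collective write to a chunked dataset like this is the kind of access pattern that typically ends up in `ADIOI_Flatten()` when the ROMIO component is selected.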