Closed: @jsquyres closed this issue 9 years ago
Trying to wrap my head around this stack trace; it should not be possible. The fast-box check reads from a pointer in a shared-memory segment owned by the sender, and the sender initializes its fast-box data pointer before anything is sent to the receiver. This rules out the possibility that the sending process ran out of room in its shared-memory segment.
The only thing I can think of is memory corruption that, by luck, doesn't trigger with sm.
Even more suspicious: the line number of the crash is not the first read from either the fast box or the endpoint. This is not likely vader's fault.
@opoplawski Can you post the program here that caused the problem?
Note that the stack trace is from one of the hang conditions where each process spins in opal_condition_wait(). Other times it will trigger an MPI error. I'm afraid the test program is fairly involved. It's the nc_test4/tst_nc4perf.c (https://github.com/Unidata/netcdf-c/blob/master/nc_test4/tst_nc4perf.c) test program in netcdf 4.3.3.1, which uses HDF5 I/O - in this case HDF5 1.8.14. This is running on Fedora Rawhide, which uses gcc 5.0.0. So far I've only seen it on i686.
I should also note that I'm not entirely sure if this is a regression or not (or when it started). I've seen odd behavior for quite a while with netcdf's MPI tests.
Ah, ok. If the trace is not a crash that makes more sense. I will take a look and see if I can figure out why that test is getting stuck.
Can't reproduce on master (same vader revision) on SLES11 with btl_vader_single_copy_mechanism set to either xpmem or none. tst_nc4perf runs to completion with 2, 4, and 10 ranks running on a single node. I used netcdf master with hdf5 1.8.14, gcc 4.8.2.
This could be a romio bug. The version in 1.8.4 lags behind trunk. Can you try running with -mca io ompio?
Another alternative would be to run with master.
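For reference, the suggestion above selects the I/O framework component on the mpirun command line. A sketch of the two invocations (the `romio` component name is assumed from the MCA naming scheme; `tst_nc4perf` stands in for the actual test binary):

```
mpirun -np 4 -mca io ompio ./tst_nc4perf   # force ompio
mpirun -np 4 -mca io romio ./tst_nc4perf   # force romio (the 1.8 default)
```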
That's worse:
The paste interleaves the glibc stack-smashing backtrace with Open MPI's signal-handler backtrace; the two are separated here (truncated frames left as-is):

```
$ mpirun -np 4 -mca io ompio ./openmpi/nc_test4/tst_nc4perf
*** Testing parallel IO for NASA...
num_proc MPI mode access cache (MB) grid size chunks avg. write time(s) avg. write bandwidth(MB/s) num_tries
*** stack smashing detected ***: /builddir/build/BUILD/netcdf-4.3.3.1/openmpi/nc_test4/.libs/tst_nc4perf terminated
======= Backtrace: =========
/lib/libc.so.6(+0x6e183)[0xf3ebd183]
/lib/libc.so.6(__fortify_fail+0x37)[0xf3f5dee7]
/lib/libc.so.6(+0x10eea8)[0xf3f5dea8]
/usr/lib/openmpi/lib/openmpi/mca_io_ompio.so(_fini+0x0)[0xf2a32654]
/usr/lib/openmpi/lib/openmpi/mca_io_ompio.so(+0x91a4)[0xf2a301a4]
/usr/lib/openmpi/lib/libmpi.so.1(PMPI_File_set_size+0xa7)[0xf40c53a7]
/usr/lib/openmpi/lib/libhdf5.so.9(+0xd4b5f)[0xf4340b5f]
/usr/lib/openmpi/lib/libhdf5.so.9(H5FD_truncate+0x40)[0xf43374d0]
/usr/lib/openmpi/lib/libhdf5.so.9(H5F_dest+0x37a)[0xf432445a]
/usr/lib/openmpi/lib/libhdf5.so.9(H5F_try_close+0x193)[0xf43255a3]
/usr/lib/openmpi/lib/libhdf5.so.9(H5F_close+0x3c)[0xf432590c]
/usr/lib/openmpi/lib/libhdf5.so.9(H5I_dec_ref+0xb9)[0xf43aa299]
/usr/lib/openmpi/lib/libhdf5.so.9(H5I_dec_app_ref
/builddir/build/BUILD/netcdf-4.3.3.1/openmpi/liblib/.libs/libnetcdf.so.7(+0x8ede7)
/builddir/build/BUILD/netcdf-4.3.3.1/openmpi/liblib/.libs/libnetcdf.so.7(nc_close+0x42)[0x
[mock1:09190] *** Process received signal ***
[mock1:09190] Signal: Aborted (6)
[mock1:09190] Signal code: (-6)
[mock1:09190] [ 0] linux-gate.so.1(__kernel_rt_sigreturn+0x0)[0xf7777cc0]
[mock1:09190] [ 1] linux-gate.so.1(__kernel_vsyscall+0x10)[0xf7777ce0]
[mock1:09190] [ 2] /lib/libc.so.6(gsignal+0x46)[0xf3e86c36]
[mock1:09190] [ 3] /lib/libc.so.6(abort+0x145)[0xf3e884b5]
[mock1:09190] [ 4] /lib/libc.so.6(+0x6e188)[0xf3ec6188]
[mock1:09190] [ 5] /lib/libc.so.6(__fortify_fail+0x37)[0xf3f66ee7]
[mock1:09190] [ 6] /lib/libc.so.6(+0x10eea8)[0xf3f66ea8]
[mock1:09190] [ 7] /usr/lib/openmpi/lib/openmpi/mca_io_ompio.so(_fini+0x0)[0xf2a39654]
[mock1:09190] [ 8] /usr/lib/openmpi/lib/openmpi/mca_io_ompio.so(+0x91a4)[0xf2a371a4]
[mock1:09190] [ 9]
e9752000-e976e000 r-xp 00000000 08:03 5394869 /usr/lib/libgcc_s-5.0.0-20150226.so.1
```
```
(gdb) list *0x91a4
0x91a4 is at io_ompio_file_open.c:455.
450     }
451
452         ret = data->ompio_fh.f_fs->fs_file_set_size (&data->ompio_fh, size);
453
454         return ret;
455     }
456
457     int
458     mca_io_ompio_file_get_size (ompi_file_t *fh,
459                                 OMPI_MPI_OFFSET_TYPE *size)
```
Is this test publicly available? I can have a look at it to see what is going on in ompio. I'm assuming this is master?
Thanks Edgar
If this was the 1.8 series of Open MPI: the ompio module there does not have all of the fixes necessary to pass the HDF5 test suite. That work was done last summer, and it includes too many changes to ompio for backporting it to the 1.8 series to be feasible.
@edgargabriel See earlier comments for what test and openmpi version this is.
Some more details:
The HDF5 call seems to be from ./hdf5-1.8.14/src/H5FDmpio.c:1091:
```c
if (MPI_SUCCESS != (mpi_code = MPI_File_set_size(fh, (MPI_Offset)0)))
```
Okay, so probably not worth worrying about the 1.8 ompio failure. Looks like the original issue may be fixed in master, but we have no idea what the fix may have been?
Can you verify that it indeed works for you with master? Just because I can't reproduce doesn't mean it is fixed :). Once we know whether it is fixed we can start the discussion about whether we should back-port the romio fixes on master to 1.8. The fix will likely be among those changes.
I'm trying to build the Fedora package with Open MPI master, but I'm running into issue #475.
Any luck with master?
Still compiling deps. Ran into issue #478 as well - disabled tests for now.
I can still reproduce the hang with the current dev snapshot. This may be triggered by gcc 5.0.0 as well.
That is my guess. I am trying to install a gcc 5 snapshot build to test this theory. Keep in mind that gcc 5.0 is still technically in beta so there is a good chance this is a gcc bug.
I've found another package that enters a hang loop on Fedora Rawhide i686: elpa. I haven't looked at it in detail yet, but this does seem to be a problem with more than just netcdf.
Yup, looks like the same cycle in opal_progress().
I've updated to 1.8.4-134-g9ad2aa8 and applied the atomic patch. It does not appear to affect this problem. I'm also seeing a (probably) similar failure on armv7hl.
Found the problem. Vader assumes 64-bit load/store in the fast box code. With gcc 4.8 this doesn't seem to cause any issues but with gcc-5.0 there is a data race between the process setting the fast box header and the process reading it. This causes the receiver to read an incomplete message leading to a hang or a crash.
I am putting together a fix for the data race for master and will PR it to 1.8.
@opoplawski #503 should fix the issue. I was able to reproduce crashes/hangs with ubuntu 14.04 i386 with gcc 5.0-032215.
It does indeed appear to fix it for me, thanks! Now I just need to track down the armv7hl issue...
Closing, since @opoplawski confirms that it's fixed.
Per the thread starting here: http://www.open-mpi.org/community/lists/devel/2015/03/17131.php
@opoplawski is seeing crashes in the Open MPI test suite in openmpi-1.8.4-99-20150228 (Feb 28 nightly tarball) with the vader BTL. If he disables the vader BTL, the crashes go away:
@hjelmn Can you have a look?