Closed lsawade closed 1 year ago
You mentioned in the title that it occurs with both MPI and non-MPI IO? Does that mean you can reproduce it without the MPI parts? Is it only when built with the MPI compiler, or is it totally independent of MPI?
Hi @takluyver, I meant it is run without the MPI parts, same compiler. Are you suggesting that it may be an MPI compiler issue?
When I run `h5py.run_tests()`, a single run fails with the same segfault.
I'm just trying to establish what conditions it shows up in. It's more likely an issue in h5py or HDF5 that something has made visible than an issue with the compiler itself.
Have you tried building from source with a non-MPI compiler? We do have some PPC64LE testing, but not a lot, so it's possible there's something else different about your setup.
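To narrow down which conditions apply, it can help to record how the installed h5py was built. This is a minimal sketch, assuming h5py is importable in the environment being tested; `describe_h5py_build` is a hypothetical helper name, not part of h5py.

```python
def describe_h5py_build():
    """Return basic build facts for the installed h5py, or None if it is missing."""
    try:
        import h5py
    except ImportError:
        return None
    return {
        "h5py": h5py.version.version,
        "hdf5": h5py.version.hdf5_version,
        # True only when h5py was compiled against a parallel (MPI) HDF5.
        "mpi": bool(h5py.get_config().mpi),
    }


if __name__ == "__main__":
    print(describe_h5py_build())
```

Comparing this output between the MPI and non-MPI builds shows at a glance which HDF5 each one actually picked up.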
This is how I configure HDF5
./configure --enable-shared --enable-parallel \
--enable-fortran --enable-fortran2003 \
--prefix=$HDF5_DESTDIR CC=$MPICC FC=$MPIF90
where `openmpi` is version 4.1.1 and the config looks like this:
gcc -I/usr/local/openmpi/4.1.1/gcc/ppc64le/include -pthread -I/usr/local/include -L/usr/lib64 -L/usr/local/lib64 -L/usr/local/lib64/openmpi -Wl,-rpath -Wl,/usr/lib64 -Wl,-rpath -Wl,/usr/local/lib64 -Wl,-rpath -Wl,/usr/local/lib64/openmpi -Wl,-rpath -Wl,/usr/local/openmpi/4.1.1/gcc/ppc64le/lib64 -Wl,--enable-new-dtags -L/usr/local/openmpi/4.1.1/gcc/ppc64le/lib64 -lmpi
This is the HDF5 version I'm using
HDF5_LINK="https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.12/hdf5-1.12.2/src/hdf5-1.12.2.tar.gz"
For h5py, I have tried building both from the GitHub repo, like so:
export CC=mpicc
export HDF5_MPI="ON"
export HDF5_DIR="/path/to/parallel/hdf5" # If this isn't found by default
python setup.py install
and through PyPI:
export CC=mpicc
export HDF5_MPI="ON"
export HDF5_DIR="/path/to/parallel/hdf5" # If this isn't found by default
pip install --no-binary=h5py h5py
I'm gonna try to use a previous version of openmpi to see whether that makes a difference.
@takluyver, Is there something fundamentally wrong with what I'm doing above? I'm pretty sure I had LZF compression working before without going through the hoops of going into h5py and compiling it from scratch. I feel like I'm making a dumb mistake.
Putting the `liblzf_filter.so` into the `/lib/plugin/` folder doesn't seem to work either. I really feel like I'm missing a piece of info.
I don't see anything that you're doing wrong, but I don't use MPI myself, nor ppc64le, so I'm not a great person to check it.
> I'm pretty sure I had LZF compression working before without going through the hoops of going into h5py and compiling it from scratch.
If you're using h5py with MPI, you'll have either compiled it from scratch or used a third-party package (e.g. from Red Hat), because we don't provide any pre-built packages for ppc64le, or pre-built packages with MPI support.
> Putting the liblzf_filter.so into the /lib/plugin/ folder doesn't seem to work either.
I think it has compiled and found the lzf filter code already. There's an `lzf_filter` function in the backtrace you showed, and if it hadn't found the filter I'd expect some kind of 'filter not available' message rather than a segfault. I think something is going wrong inside the filter code, but I don't have any great idea what that might be.
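One way to confirm that the filter is registered, independent of MPI, is to ask HDF5 directly and then round-trip a small LZF-compressed dataset in serial. This is a sketch under assumptions, not the script from the report: `check_lzf`, the file name, and the dataset name are made up for illustration, and the file exists only in memory.

```python
def check_lzf():
    """Return True if LZF is registered and a serial round trip succeeds.

    Returns None when h5py or numpy is not installed.
    """
    try:
        import numpy as np
        import h5py
    except ImportError:
        return None

    # Ask the HDF5 library whether the LZF filter is registered.
    if not h5py.h5z.filter_avail(h5py.h5z.FILTER_LZF):
        return False

    data = np.arange(1000, dtype="f8").reshape(100, 10)
    # The "core" driver with backing_store=False keeps the file in memory.
    with h5py.File("lzf_check.h5", "w", driver="core", backing_store=False) as f:
        dset = f.create_dataset("x", data=data, chunks=(10, 10), compression="lzf")
        return dset.compression == "lzf" and bool(np.array_equal(dset[...], data))


if __name__ == "__main__":
    print(check_lzf())
```

If this passes in serial but the MPI run still crashes, that points at the parallel build rather than at a missing filter.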
Ok! That's helpful though. Thank you!
Then, I may just have to compile everything from scratch, including LZF and the HDF5 plugin suite. Maybe that helps. I'll update when I know more!
Thanks. I'd also be interested whether you see the same with a non-MPI compiler, if you haven't already checked that. We have some ppc64le CI, but there's an issue which mostly prevents it running at the moment.
oh boi, if I build hdf5 and h5py without mpi it's working just fine!
@takluyver since the code works in serial, I changed the title, because something is not being installed correctly. Do you think there is a way I can help with identifying the issue?
I see in the lzf filter code there's a compile-time option for debugging messages:
If you can figure out how to turn that on (I guess you can `#define` it at the top of that file, though there's probably a smarter way), it might help pin down where the failure is. There aren't a lot of those debug messages, though - you could also add more to try to zoom in on it.
Alright, first I'm trying the installation on non-ppc64 machines to see whether I can isolate the problem to ppc64.
@takluyver So, I got it working on my local machine (macOS, just making sure that I'm not making a really dumb mistake). I'll update this again tomorrow. So, it really looks like this is ppc64le related, or MPI compiler related. I made a little test repo to check this, so tomorrow I'm going to test this again on PPC with a bunch of different options/compilers and check where it's failing.
So, I made this https://github.com/lsawade/ph5py-testing to test the installation anywhere. And on most machines that I have tested so far, it works without problems. One thing came up that I realized I don't fully get: Where is `liblzf_filter.so` saved, or how is it used when building? Because I don't know where it is, `h5dump` also does not know where it is.
I cannot reproduce the problem. I do not know what's happening. I just recompiled everything on `ppc64le` and it's running using the installation scripts in the repo. I'm using the same compilers, the same HDF5 version as 2 days ago and what I believe is the same installation procedure. I am officially going crazy.
I think this can be closed. It is working right now across multiple machines. I truly have no idea what was happening before...
Good to hear it's all good now, and many thanks for the extensive testing.
> Where is liblzf_filter.so saved or how is it used when building?
I think normally in h5py, it's not built as a separate library, but linked into our h5z extension module, by this line:
https://github.com/h5py/h5py/blob/ef1820a8eb0276532d90fe13a0645796015c2575/setup_build.py#L44
You can compile the lzf filter separately if you want to use it as an HDF5 plugin outside h5py, but that's not what building h5py does (as far as I know).
I'm not quite sure what the problem is. `gzip` works completely fine. I compiled `HDF5` 'by hand' using `/usr/local/openmpi/4.1.1/gcc/ppc64le/bin/mpicc`. I tried installing `h5py` both from PyPI and from source. Same error.
Maybe the shipped `lzf.c` filter file isn't compiled as expected?
Compiled using the following setup:
`PPC64LE`, Linux, RedHat 8
h5py 3.7.0
HDF5 1.12.0
Python 3.10.8
Full Version info
```
Summary of the h5py configuration
---------------------------------

h5py    3.7.0
HDF5    1.12.0
Python  3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:17:04) [GCC 10.4.0]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.23.5
cython (built with) 0.29.32
numpy (built against) 1.21.6
HDF5 (built against) 1.12.0
```
Reproducible script:
Error backtrace
```
[traverse:983687:0:983687] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[traverse:983688:0:983688] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x158000000)
[traverse:983689:0:983689] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x114000000)
[traverse:983686:0:983686] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x120000000)
 0 /lib64/libucs.so.0(ucs_handle_error+0x384) [0x200406975df4]
 1 /lib64/libucs.so.0(+0x35f98) [0x200406975f98]
 2 /lib64/libucs.so.0(+0x362e0) [0x2004069762e0]
 3 linux-vdso64.so.1(__kernel_sigtramp_rt64+0) [0x2000000604d8]
 4 /lib64/glibc-hwcaps/power9/libc.so.6(cfree+0x6c) [0x20000030643c]
 5 /home/lsawade/.conda/envs/gf/lib/python3.10/site-packages/h5py/h5z.cpython-310-powerpc64le-linux-gnu.so(lzf_filter+0x120) [0x20041ff69970]
 6 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5Z_pipeline+0x8f4) [0x2004373d9710]
 7 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5D__chunk_allocate+0x650) [0x200436f9c2cc]
 8 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x17c4ac) [0x200436fbc4ac]
 9 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5D__alloc_storage+0x2f8) [0x200436fc3c7c]
10 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5D__layout_oh_create+0x394) [0x200436fce658]
11 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x17adbc) [0x200436fbadbc]
12 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5D__create+0xcb8) [0x200436fbe3e0]
13 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x1919c0) [0x200436fd19c0]
14 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5O_obj_create+0x278) [0x2004371936f0]
15 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x2e4b34) [0x200437124b34]
16 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x2869a8) [0x2004370c69a8]
17 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5G_traverse+0x114) [0x2004370c78d0]
18 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x2dbbf8) [0x20043711bbf8]
19 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5L_link_object+0xa0) [0x200437127174]
20 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5D__create_named+0x13c) [0x200436fbd534]
21 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5VL__native_dataset_create+0x21c) [0x2004373c2934]
22 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x555de8) [0x200437395de8]
23 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5VL_dataset_create+0x130) [0x2004373a2c70]
24 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5Dcreate2+0x240) [0x200436f78848]
25 /home/lsawade/.conda/envs/gf/lib/python3.10/site-packages/h5py/defs.cpython-310-powerpc64le-linux-gnu.so(+0x2b14c) [0x20041fb8b14c]
26 /home/lsawade/.conda/envs/gf/lib/python3.10/site-packages/h5py/h5d.cpython-310-powerpc64le-linux-gnu.so(+0x139f8) [0x2004375c39f8]
27 python(+0x2e2a00) [0x107732a00]
28 /home/lsawade/.conda/envs/gf/lib/python3.10/site-packages/h5py/_objects.cpython-310-powerpc64le-linux-gnu.so(+0x17d1c) [0x20041fcf7d1c]
29 /home/lsawade/.conda/envs/gf/lib/python3.10/site-packages/numpy/random/bit_generator.cpython-310-powerpc64le-linux-gnu.so(+0xc2d0) [0x2004030bc2d0]
30 python(_PyObject_MakeTpCall+0xdc) [0x1074ce4ec]
31 python(_PyEval_EvalFrameDefault+0xa3e8) [0x1074b6108]
32 python(+0x169794) [0x1075b9794]
33 python(_PyFunction_Vectorcall+0x60) [0x1074ce240]
34 python(PyVectorcall_Call+0x114) [0x1074cde04]
35 python(_PyObject_Call+0x168) [0x1074ce088]
36 python(PyObject_Call+0x44) [0x1074ce144]
37 python(_PyEval_EvalFrameDefault+0x50a0) [0x1074b0dc0]
38 python(+0x169794) [0x1075b9794]
39 python(_PyFunction_Vectorcall+0x60) [0x1074ce240]
40 python(+0x2bd0dc) [0x10770d0dc]
41 python(_PyEval_EvalFrameDefault+0x81f4) [0x1074b3f14]
42 python(+0x169794) [0x1075b9794]
43 python(PyEval_EvalCode+0xb8) [0x1075b9a48]
44 python(+0x1c3110) [0x107613110]
45 python(+0x1c34fc) [0x1076134fc]
46 python(+0x1c36c8) [0x1076136c8]
47 python(_PyRun_SimpleFileObject+0x1bc) [0x107616d5c]
48 python(_PyRun_AnyFileObject+0x88) [0x10a727608]
49 python(Py_RunMain+0xacc) [0x10a5cb95c]
50 python(+0x6c174) [0x10a5cc174]
51 python(Py_BytesMain+0x44) [0x10a5cc334]
52 python(+0x5b188) [0x10a5bb188]
53 /lib64/glibc-hwcaps/power9/libc.so.6(+0x29de8) [0x200000289de8]
54 /lib64/glibc-hwcaps/power9/libc.so.6(__libc_start_main+0xb4) [0x200000289fd4]
=================================
[traverse:983689] *** Process received signal ***
[traverse:983689] Signal: Segmentation fault (11)
[traverse:983689] Signal code: (-6)
[traverse:983689] Failing at address: 0x1f77b000f0289
[traverse:983689] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[traverse:983689] [ 1] /lib64/glibc-hwcaps/power9/libc.so.6(cfree+0x6c)[0x20000030643c]
[traverse:983689] [ 2] /home/lsawade/.conda/envs/gf/lib/python3.10/site-packages/h5py/h5z.cpython-310-powerpc64le-linux-gnu.so(lzf_filter+0x120)[0x200437939970]
[traverse:983689] [ 3] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5Z_pipeline+0x8f4)[0x2004373d9710]
[traverse:983689] [ 4] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5D__chunk_allocate+0x650)[0x200436f9c2cc]
[traverse:983689] [ 5] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x17c4ac)[0x200436fbc4ac]
[traverse:983689] [ 6] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5D__alloc_storage+0x2f8)[0x200436fc3c7c]
[traverse:983689] [ 7] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5D__layout_oh_create+0x394)[0x200436fce658]
[traverse:983689] [ 8] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x17adbc)[0x200436fbadbc]
[traverse:983689] [ 9] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5D__create+0xcb8)[0x200436fbe3e0]
[traverse:983689] [10] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x1919c0)[0x200436fd19c0]
[traverse:983689] [11] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5O_obj_create+0x278)[0x2004371936f0]
[traverse:983689] [12] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x2e4b34)[0x200437124b34]
[traverse:983689] [13] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x2869a8)[0x2004370c69a8]
[traverse:983689] [14] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5G_traverse+0x114)[0x2004370c78d0]
[traverse:983689] [15] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x2dbbf8)[0x20043711bbf8]
[traverse:983689] [16] FileObject+0x88) [0x13d497608]
[traverse:983689] [17] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5D__create_named+0x13c)[0x200436fbd534]
[traverse:983689] [18] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5VL__native_dataset_create+0x21c)[0x2004373c2934]
[traverse:983689] [19] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x555de8)[0x200437395de8]
[traverse:983689] [20] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5VL_dataset_create+0x130)[0x2004373a2c70]
[traverse:983689] [21] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5Dcreate2+0x240)[0x200436f78848]
[traverse:983689] [22] /home/lsawade/.conda/envs/gf/lib/python3.10/site-packages/h5py/defs.cpython-310-powerpc64le-linux-gnu.so(+0x2b14c)[0x2004375ab14c]
[traverse:983689] [23] /home/lsawade/.conda/envs/gf/lib/python3.10/site-packages/h5py/h5d.cpython-310-powerpc64le-linux-gnu.so(+0x139f8)[0x2004379f39f8]
[traverse:983689] [24] python(+0x2e2a00)[0x10a842a00]
[traverse:983689] [25] /home/lsawade/.conda/envs/gf/lib/python3.10/site-packages/h5py/_objects.cpython-310-powerpc64le-linux-gnu.so(+0x17d1c)[0x20041ffb7d1c]
[traverse:983689] [26] /home/lsawade/.conda/envs/gf/lib/python3.10/site-packages/numpy/random/bit_generator.cpython-310-powerpc64le-linux-gnu.so(+0xc2d0)[0x2004030bc2d0]
[traverse:983689] [27] python(_PyObject_MakeTpCall+0xdc)[0x10a5de4ec]
[traverse:983689] [28] python(_PyEval_EvalFrameDefault+0xa3e8)[0x10a5c6108]
[traverse:983689] [29] python(+0x169794)[0x10a6c9794]
[traverse:983689] *** End of error message ***
```