h5py / h5py

HDF5 for Python -- The h5py package is a Pythonic interface to the HDF5 binary data format.
http://www.h5py.org
BSD 3-Clause "New" or "Revised" License

Installation of Parallel HDF5 with h5py and LibLZF on PPC64LE #2182

Closed lsawade closed 1 year ago

lsawade commented 1 year ago

I'm not quite sure what the problem is. gzip works completely fine. I compiled HDF5 'by hand' using /usr/local/openmpi/4.1.1/gcc/ppc64le/bin/mpicc.

I tried installing h5py both from PyPI and from source, and got the same error.

Maybe the shipped lzf.c filter file isn't being compiled as expected?

Compiled using the following setup:

Full Version info

```
Summary of the h5py configuration
---------------------------------

h5py    3.7.0
HDF5    1.12.0
Python  3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:17:04) [GCC 10.4.0]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.23.5
cython (built with) 0.29.32
numpy (built against) 1.21.6
HDF5 (built against) 1.12.0
```
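The summary above is the text stored in `h5py.version.info`. As a minimal sketch, the same information can be captured programmatically, together with the HDF5 version found at runtime and the one h5py was built against, which makes it easier to compare two installations. The helper name `hdf5_versions` is made up for illustration, and the function returns `None` when h5py cannot be imported.

```python
def hdf5_versions():
    """Collect h5py's version summary plus runtime and build-time HDF5 versions.

    Returns None when h5py is not installed in the current environment.
    """
    try:
        import h5py
    except ImportError:
        return None

    v = h5py.version
    return {
        # Same text as the "Summary of the h5py configuration" block.
        "summary": v.info,
        # HDF5 library version that is actually loaded at runtime.
        "runtime": tuple(v.hdf5_version_tuple),
        # HDF5 version h5py was compiled against (attribute may be absent
        # in older h5py releases, hence the getattr with a default).
        "built_against": getattr(v, "hdf5_built_version_tuple", None),
    }


if __name__ == "__main__":
    versions = hdf5_versions()
    if versions is None:
        print("h5py is not importable in this environment")
    else:
        print(versions["summary"])
        print("runtime HDF5:      ", versions["runtime"])
        print("built against HDF5:", versions["built_against"])
```

If the two version tuples differ, that is worth noting in a bug report, although it does not by itself prove that anything is wrong.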

Reproducible script:

import numpy as np
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

with h5py.File('parallel_test.hdf5', 'w', driver='mpio', comm=MPI.COMM_WORLD) as f:

    dset = f.create_dataset('test', (4, 1000), dtype='i',
                            chunks=(1, 1000), compression="lzf")

    with dset.collective:
        dset[rank] = np.full(1000, rank)
Error backtrace

```
[traverse:983687:0:983687] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[traverse:983688:0:983688] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x158000000)
[traverse:983689:0:983689] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x114000000)
[traverse:983686:0:983686] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x120000000)
0 /lib64/libucs.so.0(ucs_handle_error+0x384) [0x200406975df4]
1 /lib64/libucs.so.0(+0x35f98) [0x200406975f98]
2 /lib64/libucs.so.0(+0x362e0) [0x2004069762e0]
3 linux-vdso64.so.1(__kernel_sigtramp_rt64+0) [0x2000000604d8]
4 /lib64/glibc-hwcaps/power9/libc.so.6(cfree+0x6c) [0x20000030643c]
5 /home/lsawade/.conda/envs/gf/lib/python3.10/site-packages/h5py/h5z.cpython-310-powerpc64le-linux-gnu.so(lzf_filter+0x120) [0x20041ff69970]
6 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5Z_pipeline+0x8f4) [0x2004373d9710]
7 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5D__chunk_allocate+0x650) [0x200436f9c2cc]
8 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x17c4ac) [0x200436fbc4ac]
9 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5D__alloc_storage+0x2f8) [0x200436fc3c7c]
10 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5D__layout_oh_create+0x394) [0x200436fce658]
11 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x17adbc) [0x200436fbadbc]
12 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5D__create+0xcb8) [0x200436fbe3e0]
13 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x1919c0) [0x200436fd19c0]
14 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5O_obj_create+0x278) [0x2004371936f0]
15 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x2e4b34) [0x200437124b34]
16 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x2869a8) [0x2004370c69a8]
17 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5G_traverse+0x114) [0x2004370c78d0]
18 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x2dbbf8) [0x20043711bbf8]
19 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5L_link_object+0xa0) [0x200437127174]
20 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5D__create_named+0x13c) [0x200436fbd534]
21 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5VL__native_dataset_create+0x21c) [0x2004373c2934]
22 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x555de8) [0x200437395de8]
23 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5VL_dataset_create+0x130) [0x2004373a2c70]
24 /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5Dcreate2+0x240) [0x200436f78848]
25 /home/lsawade/.conda/envs/gf/lib/python3.10/site-packages/h5py/defs.cpython-310-powerpc64le-linux-gnu.so(+0x2b14c) [0x20041fb8b14c]
26 /home/lsawade/.conda/envs/gf/lib/python3.10/site-packages/h5py/h5d.cpython-310-powerpc64le-linux-gnu.so(+0x139f8) [0x2004375c39f8]
27 python(+0x2e2a00) [0x107732a00]
28 /home/lsawade/.conda/envs/gf/lib/python3.10/site-packages/h5py/_objects.cpython-310-powerpc64le-linux-gnu.so(+0x17d1c) [0x20041fcf7d1c]
29 /home/lsawade/.conda/envs/gf/lib/python3.10/site-packages/numpy/random/bit_generator.cpython-310-powerpc64le-linux-gnu.so(+0xc2d0) [0x2004030bc2d0]
30 python(_PyObject_MakeTpCall+0xdc) [0x1074ce4ec]
31 python(_PyEval_EvalFrameDefault+0xa3e8) [0x1074b6108]
32 python(+0x169794) [0x1075b9794]
33 python(_PyFunction_Vectorcall+0x60) [0x1074ce240]
34 python(PyVectorcall_Call+0x114) [0x1074cde04]
35 python(_PyObject_Call+0x168) [0x1074ce088]
36 python(PyObject_Call+0x44) [0x1074ce144]
37 python(_PyEval_EvalFrameDefault+0x50a0) [0x1074b0dc0]
38 python(+0x169794) [0x1075b9794]
39 python(_PyFunction_Vectorcall+0x60) [0x1074ce240]
40 python(+0x2bd0dc) [0x10770d0dc]
41 python(_PyEval_EvalFrameDefault+0x81f4) [0x1074b3f14]
42 python(+0x169794) [0x1075b9794]
43 python(PyEval_EvalCode+0xb8) [0x1075b9a48]
44 python(+0x1c3110) [0x107613110]
45 python(+0x1c34fc) [0x1076134fc]
46 python(+0x1c36c8) [0x1076136c8]
47 python(_PyRun_SimpleFileObject+0x1bc) [0x107616d5c]
48 python(_PyRun_AnyFileObject+0x88) [0x10a727608]
49 python(Py_RunMain+0xacc) [0x10a5cb95c]
50 python(+0x6c174) [0x10a5cc174]
51 python(Py_BytesMain+0x44) [0x10a5cc334]
52 python(+0x5b188) [0x10a5bb188]
53 /lib64/glibc-hwcaps/power9/libc.so.6(+0x29de8) [0x200000289de8]
54 /lib64/glibc-hwcaps/power9/libc.so.6(__libc_start_main+0xb4) [0x200000289fd4]
=================================
[traverse:983689] *** Process received signal ***
[traverse:983689] Signal: Segmentation fault (11)
[traverse:983689] Signal code: (-6)
[traverse:983689] Failing at address: 0x1f77b000f0289
[traverse:983689] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[traverse:983689] [ 1] /lib64/glibc-hwcaps/power9/libc.so.6(cfree+0x6c)[0x20000030643c]
[traverse:983689] [ 2] /home/lsawade/.conda/envs/gf/lib/python3.10/site-packages/h5py/h5z.cpython-310-powerpc64le-linux-gnu.so(lzf_filter+0x120)[0x200437939970]
[traverse:983689] [ 3] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5Z_pipeline+0x8f4)[0x2004373d9710]
[traverse:983689] [ 4] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5D__chunk_allocate+0x650)[0x200436f9c2cc]
[traverse:983689] [ 5] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x17c4ac)[0x200436fbc4ac]
[traverse:983689] [ 6] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5D__alloc_storage+0x2f8)[0x200436fc3c7c]
[traverse:983689] [ 7] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5D__layout_oh_create+0x394)[0x200436fce658]
[traverse:983689] [ 8] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x17adbc)[0x200436fbadbc]
[traverse:983689] [ 9] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5D__create+0xcb8)[0x200436fbe3e0]
[traverse:983689] [10] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x1919c0)[0x200436fd19c0]
[traverse:983689] [11] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5O_obj_create+0x278)[0x2004371936f0]
[traverse:983689] [12] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x2e4b34)[0x200437124b34]
[traverse:983689] [13] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x2869a8)[0x2004370c69a8]
[traverse:983689] [14] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5G_traverse+0x114)[0x2004370c78d0]
[traverse:983689] [15] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x2dbbf8)[0x20043711bbf8]
[traverse:983689] [16] FileObject+0x88) [0x13d497608]
[traverse:983689] [17] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5D__create_named+0x13c)[0x200436fbd534]
[traverse:983689] [18] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5VL__native_dataset_create+0x21c)[0x2004373c2934]
[traverse:983689] [19] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(+0x555de8)[0x200437395de8]
[traverse:983689] [20] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5VL_dataset_create+0x130)[0x2004373a2c70]
[traverse:983689] [21] /scratch/gpfs/lsawade/SpecfemMagic/packages/hdf5/build/lib/libhdf5.so.200(H5Dcreate2+0x240)[0x200436f78848]
[traverse:983689] [22] /home/lsawade/.conda/envs/gf/lib/python3.10/site-packages/h5py/defs.cpython-310-powerpc64le-linux-gnu.so(+0x2b14c)[0x2004375ab14c]
[traverse:983689] [23] /home/lsawade/.conda/envs/gf/lib/python3.10/site-packages/h5py/h5d.cpython-310-powerpc64le-linux-gnu.so(+0x139f8)[0x2004379f39f8]
[traverse:983689] [24] python(+0x2e2a00)[0x10a842a00]
[traverse:983689] [25] /home/lsawade/.conda/envs/gf/lib/python3.10/site-packages/h5py/_objects.cpython-310-powerpc64le-linux-gnu.so(+0x17d1c)[0x20041ffb7d1c]
[traverse:983689] [26] /home/lsawade/.conda/envs/gf/lib/python3.10/site-packages/numpy/random/bit_generator.cpython-310-powerpc64le-linux-gnu.so(+0xc2d0)[0x2004030bc2d0]
[traverse:983689] [27] python(_PyObject_MakeTpCall+0xdc)[0x10a5de4ec]
[traverse:983689] [28] python(_PyEval_EvalFrameDefault+0xa3e8)[0x10a5c6108]
[traverse:983689] [29] python(+0x169794)[0x10a6c9794]
[traverse:983689] *** End of error message ***
```

takluyver commented 1 year ago

You mentioned in the title that it occurs with both MPI and non-MPI IO? Does that mean you can reproduce it without the MPI parts? Is it only when built with the MPI compiler, or is it totally independent of MPI?

lsawade commented 1 year ago

Hi @takluyver, I meant that it also fails when run without the MPI parts, using the same compiler. Are you suggesting that it may be an MPI compiler issue?

Extra test

When I do h5py.run_tests() a single run fails with the same seg fault.

takluyver commented 1 year ago

I'm just trying to establish what conditions it shows up in. It's more likely an issue in h5py or HDF5 that something has made visible than an issue with the compiler itself.

Have you tried building from source with a non-MPI compiler? We do have some PPC64LE testing, but not a lot, so it's possible there's something else different about your setup.
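One way to separate the MPI question from the LZF question is a serial variant of the reproducer above that opens the file with the default driver and never imports mpi4py. The following is only a sketch: the function name `serial_lzf_roundtrip` and the file name are made up for illustration, and the function returns `None` when h5py or NumPy is not installed. Because serial HDF5 may defer chunk allocation until data is written, the sketch writes every row so that the filter is actually exercised.

```python
import os
import tempfile


def serial_lzf_roundtrip(path):
    """Write and read back an LZF-compressed dataset without any MPI calls.

    Returns True if the data survives the round trip, False if it does not,
    and None when h5py or NumPy cannot be imported.
    """
    try:
        import numpy as np
        import h5py
    except ImportError:
        return None

    # Same shape, dtype, and chunking as the parallel reproducer.
    data = np.arange(4000, dtype="i").reshape(4, 1000)

    # Default (serial) driver: no driver="mpio", no communicator.
    with h5py.File(path, "w") as f:
        dset = f.create_dataset("test", (4, 1000), dtype="i",
                                chunks=(1, 1000), compression="lzf")
        dset[...] = data  # forces each chunk through the LZF filter

    with h5py.File(path, "r") as f:
        return bool(np.array_equal(f["test"][...], data))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        result = serial_lzf_roundtrip(os.path.join(tmp, "serial_test.hdf5"))
        print("LZF round trip OK:", result)
```

Running this once against the MPI-enabled build and once against a serial build would show whether the crash depends on how h5py and HDF5 were compiled rather than on the MPI calls in the script.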

lsawade commented 1 year ago

This is how I configure HDF5

./configure --enable-shared --enable-parallel \
    --enable-fortran --enable-fortran2003 \
    --prefix=$HDF5_DESTDIR CC=$MPICC FC=$MPIF90

where openmpi is version 4.1.1 and the config looks like this:

gcc -I/usr/local/openmpi/4.1.1/gcc/ppc64le/include -pthread -I/usr/local/include -L/usr/lib64 -L/usr/local/lib64 -L/usr/local/lib64/openmpi -Wl,-rpath -Wl,/usr/lib64 -Wl,-rpath -Wl,/usr/local/lib64 -Wl,-rpath -Wl,/usr/local/lib64/openmpi -Wl,-rpath -Wl,/usr/local/openmpi/4.1.1/gcc/ppc64le/lib64 -Wl,--enable-new-dtags -L/usr/local/openmpi/4.1.1/gcc/ppc64le/lib64 -lmpi

This is the HDF5 version I'm using

HDF5_LINK="https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.12/hdf5-1.12.2/src/hdf5-1.12.2.tar.gz"

As for h5py, I have tried building it both from the GitHub repo, like so:

export CC=mpicc
export HDF5_MPI="ON"
export HDF5_DIR="/path/to/parallel/hdf5"  # If this isn't found by default

python setup.py install

and through PyPI:

export CC=mpicc
export HDF5_MPI="ON"
export HDF5_DIR="/path/to/parallel/hdf5"  # If this isn't found by default

pip install --no-binary=h5py h5py

I'm gonna try to use a previous version of openmpi to see whether that makes a difference.

lsawade commented 1 year ago

@takluyver, Is there something fundamentally wrong with what I'm doing above? I'm pretty sure I had LZF compression working before without going through the hoops of going into h5py and compiling it from scratch. I feel like I'm making a dumb mistake.

Putting the liblzf_filter.so into the /lib/plugin/ folder doesn't seem to work either. I really feel like I'm missing a piece of info.

takluyver commented 1 year ago

I don't see anything that you're doing wrong, but I don't use MPI myself, nor ppc64le, so I'm not a great person to check it.

> I'm pretty sure I had LZF compression working before without going through the hoops of going into h5py and compiling it from scratch.

If you're using h5py with MPI, you'll have either compiled it from scratch or used a third-party package (e.g. from Red Hat), because we don't provide any pre-built packages for ppc64le, or pre-built packages with MPI support.

> Putting the liblzf_filter.so into the /lib/plugin/ folder doesn't seem to work either.

I think it has compiled and found the lzf filter code already. There's an lzf_filter function in the backtrace you showed, and if it hadn't found the filter I'd expect some kind of 'filter not available' message rather than a segfault. I think something is going wrong inside the filter code, but I don't have any great idea what that might be.
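The claim that the filter has been found can be checked directly from Python, since h5py exposes HDF5's filter registry through `h5py.h5z`. The snippet below is a minimal sketch; the helper name `lzf_filter_status` is hypothetical, and it returns `None` when h5py is not installed. A result of `lzf_registered: True` would be consistent with the reading above: the filter is present, and the failure happens inside it.

```python
def lzf_filter_status():
    """Report whether the LZF filter is registered with the loaded HDF5 library.

    Returns None when h5py cannot be imported.
    """
    try:
        import h5py
    except ImportError:
        return None

    return {
        "h5py": h5py.version.version,
        "hdf5": h5py.version.hdf5_version,
        # True if h5py was built with MPI (parallel HDF5) support.
        "mpi_build": bool(h5py.get_config().mpi),
        # Asks HDF5 whether a filter with the LZF filter ID is available.
        "lzf_registered": bool(h5py.h5z.filter_avail(h5py.h5z.FILTER_LZF)),
    }


if __name__ == "__main__":
    print(lzf_filter_status())
```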

lsawade commented 1 year ago

Ok! That's helpful though. Thank you!

Then, I may just have to compile everything from scratch, including LZF and the HDF5 plugin suite. Maybe that helps. I'll update when I know more!

takluyver commented 1 year ago

Thanks. I'd also be interested whether you see the same with a non-MPI compiler, if you haven't already checked that. We have some ppc64le CI, but there's an issue which mostly prevents it running at the moment.

lsawade commented 1 year ago

oh boi, if I build hdf5 and h5py without mpi it's working just fine!

lsawade commented 1 year ago

@takluyver since the code works in serial, I changed the title, because it looks like something is not being installed correctly. Do you think there is a way I can help with identifying the issue?

takluyver commented 1 year ago

I see in the lzf filter code there's a compile-time option for debugging messages:

https://github.com/h5py/h5py/blob/6cc4c912a97cff4f3a6c133943d573118d795025/lzf/lzf_filter.c#L211-L213

If you can figure out how to turn that on (I guess you can #define it at the top of that file, though there's probably a smarter way), it might help pin down where the failure is. There aren't a lot of those debug messages, though - you could also add more to try to zoom in on it.
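A possible alternative to editing the file is to pass the define through `CFLAGS` when rebuilding h5py from source. The sketch below only prepares the environment and the pip command; it does not run anything by itself. It assumes that the macro guarding the debug messages in the linked lines is called `H5PY_LZF_DEBUG` (please verify against the source) and that the build honours `CFLAGS`. The helper name `debug_build_env` is made up for illustration, and the remaining variables mirror the build commands earlier in this thread.

```python
import os
import sys

# Assumption: name of the compile-time switch in lzf/lzf_filter.c.
# Check the linked source lines before relying on it.
DEBUG_MACRO = "H5PY_LZF_DEBUG"


def debug_build_env(base_env=None, hdf5_dir=None):
    """Return a copy of the environment with the LZF debug define in CFLAGS."""
    env = dict(os.environ if base_env is None else base_env)

    define = f"-D{DEBUG_MACRO}"
    cflags = env.get("CFLAGS", "")
    if define not in cflags.split():
        cflags = f"{cflags} {define}".strip()
    env["CFLAGS"] = cflags

    # Same settings as the MPI build shown earlier in the thread.
    env.setdefault("CC", "mpicc")
    env["HDF5_MPI"] = "ON"
    if hdf5_dir is not None:
        env["HDF5_DIR"] = hdf5_dir
    return env


# Rebuild from source so the compiler flags actually take effect.
PIP_COMMAND = [
    sys.executable, "-m", "pip", "install",
    "--no-binary=h5py", "--no-cache-dir", "--force-reinstall", "h5py",
]

# To run the build (not executed here):
#   import subprocess
#   subprocess.run(PIP_COMMAND,
#                  env=debug_build_env(hdf5_dir="/path/to/parallel/hdf5"),
#                  check=True)
```

The debug messages should then be printed by the filter while the reproducer runs, which may help narrow down how far it gets before the crash.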

lsawade commented 1 year ago

Alright, first I'm trying the installation on non-ppc64 machines to see whether I can isolate the problem to ppc64.

lsawade commented 1 year ago

@takluyver So, I got it working on my local machine (macOS, just making sure that I'm not making a really dumb mistake). So, it really looks like this is ppc64le related, or MPI compiler related. I made a little test repo to check this, so tomorrow I'm going to test this again on PPC with a bunch of different options/compilers and check where it's failing. I'll update this again then.

lsawade commented 1 year ago

So, I made this https://github.com/lsawade/ph5py-testing to test the installation anywhere. On most machines that I have tested so far, it works without problems. One thing came up that I realized I don't fully understand: where is liblzf_filter.so saved, and how is it used when building? Because I don't know where it is, h5dump also does not know where it is.

lsawade commented 1 year ago

I cannot reproduce the problem. I do not know what's happening. I just recompiled everything on ppc64le and it's running using the installation scripts in the repo. I'm using the same compilers, the same HDF5 version as 2 days ago and what I believe is the same installation procedure. I am officially going crazy.

lsawade commented 1 year ago

I think this can be closed. It is working right now across multiple machines. I truly have no idea what was happening before...

ajelenak commented 1 year ago

Good to hear it's all good now, and many thanks for the extensive testing.

takluyver commented 1 year ago

> Where is liblzf_filter.so saved or how is it used when building?

I think normally in h5py, it's not built as a separate library, but linked into our h5z extension module, by this line:

https://github.com/h5py/h5py/blob/ef1820a8eb0276532d90fe13a0645796015c2575/setup_build.py#L44

You can compile the lzf filter separately if you want to use it as an HDF5 plugin outside h5py, but that's not what building h5py does (as far as I know).
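To see this on a given installation, one can look at where the `h5z` extension module lives and whether any separate LZF library was installed next to it. The following is a rough sketch with a hypothetical helper name, `locate_lzf_code`; it returns `None` when h5py is not installed. It also reports `HDF5_PLUGIN_PATH`, the environment variable HDF5 consults when searching for dynamically loaded filter plugins, which is the mechanism a tool such as h5dump would need in order to use a separately compiled filter.

```python
import os
from pathlib import Path


def locate_lzf_code():
    """Show where h5py's filter code lives in the current installation.

    Returns None when h5py cannot be imported.
    """
    try:
        import h5py
    except ImportError:
        return None

    pkg_dir = Path(h5py.__file__).parent
    shared_lib_suffixes = {".so", ".dylib", ".dll"}
    return {
        # Extension module that appears in the backtrace as h5z.cpython-...so.
        "h5z_module": h5py.h5z.__file__,
        # Any standalone LZF shared libraries shipped inside the package;
        # an empty list suggests the filter is linked into h5z itself.
        "separate_lzf_libraries": sorted(
            p.name for p in pkg_dir.rglob("*lzf*")
            if p.suffix in shared_lib_suffixes
        ),
        # Search path for HDF5 filter plugins outside h5py (None if unset).
        "HDF5_PLUGIN_PATH": os.environ.get("HDF5_PLUGIN_PATH"),
    }


if __name__ == "__main__":
    print(locate_lzf_code())
```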