rabernat / zarr_hdf_benchmarks

Scripts for benchmarking performance of Zarr and HDF5 in different contexts

Building mpi4py and h5py against system libraries #2

Open andersy005 opened 5 years ago

andersy005 commented 5 years ago

As @rabernat pointed out here, there's a performance hit when using pre-built libraries from conda on Cheyenne.

I started looking into this. My plan was as follows:

abanihi@cheyenne2: ~/devel/hdf5-1.10.4/release_docs $ ml

Currently Loaded Modules:
  1) conda/4   2) nano/2.7.4   3) git/2.10.2   4) ncarenv/1.2   5) intel/17.0.1   6) impi/2017.1.132
checking for thread safe support... configure: error: The thread-safe library is incompatible with the high-level library. --disable-hl can be used to prevent building the high-level library (recommended). Alternatively, --enable-unsupported will allow building the high-level library, though this configuration is not supported by The HDF Group.

As a result, I had to use --enable-unsupported to allow building the high-level library:

$ CC=`which mpicc` ./configure --enable-unsupported --enable-parallel --enable-threadsafe --with-pthread=/glade/u/apps/ch/os/lib64/ --prefix=$CONDA_PREFIX
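After configure succeeds, the build follows the standard autotools sequence (a generic sketch, not specific to this issue; the job count is arbitrary):

```shell
# Standard autotools build/install after ./configure
make -j 8
make check      # optional; the parallel test suite can be slow on a login node
make install    # installs into $CONDA_PREFIX, as set by --prefix above
```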
abanihi@r3i1n25:~/devel/h5py> python setup.py configure --mpi --hdf5=$CONDA_PREFIX
running configure
Autodetected HDF5 1.10.4
********************************************************************************
                       Summary of the h5py configuration

    Path to HDF5: '/glade/work/abanihi/softwares/miniconda3/envs/hdf5_zarr'
    HDF5 Version: '1.10.4'
     MPI Enabled: True
Rebuild Required: False

********************************************************************************
abanihi@r3i1n25:~/devel/h5py> python setup.py install

But then when importing h5py, I get this warning:

/glade/work/abanihi/softwares/miniconda3/envs/hdf5_zarr/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: mpi4py.MPI.Win size changed, may indicate binary incompatibility. Expected 32 from C header, got 40 from PyObject
  return f(*args, **kwds)
--------------------------------------------------------------------------
A process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          [[29994,1],0] (PID 56477)

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
-------------------------------------------------------------------------

I am not sure about the ramifications of this when using h5py.
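If the fork() warning proves benign for a given workload, Open MPI lets you silence it exactly as the message suggests, via the mpi_warn_on_fork MCA parameter (`benchmark.py` below is a placeholder script name):

```shell
# Silence Open MPI's fork() warning, per the message's own suggestion.
# Only do this after confirming fork() is actually harmless for your job.
export OMPI_MCA_mpi_warn_on_fork=0

# ...or per invocation:
mpiexec --mca mpi_warn_on_fork 0 -n 4 python benchmark.py
```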

Ccing @kmpaul, @jukent, @jhamman

rabernat commented 5 years ago

This is amazingly fast progress @andersy005! Great job!

FYI, I got the same warning from h5py with my conda-derived setup.

rabernat commented 5 years ago
python setup.py configure --mpi --hdf5=$CONDA_PREFIX

Are you sure about this? Shouldn't --hdf5 point to the hdf5 library you built in step 1, not the conda environment?

andersy005 commented 5 years ago

Are you sure about this? Shouldn't --hdf5 point to the hdf5 library you built in step 1, not the conda environment?

Not 100% sure. However, since I specified the installation prefix as $CONDA_PREFIX in

$ CC=`which mpicc` ./configure --enable-unsupported --enable-parallel --enable-threadsafe --with-pthread=/glade/u/apps/ch/os/lib64/ --prefix=$CONDA_PREFIX

the built library ends up in $CONDA_PREFIX/lib, and I assumed that this message:

abanihi@r3i1n25:~/devel/h5py> python setup.py configure --mpi --hdf5=$CONDA_PREFIX
running configure
Autodetected HDF5 1.10.4
********************************************************************************
                       Summary of the h5py configuration

    Path to HDF5: '/glade/work/abanihi/softwares/miniconda3/envs/hdf5_zarr'
    HDF5 Version: '1.10.4'
     MPI Enabled: True
Rebuild Required: False

was a sign that the h5py install script was able to figure out the location of the built library.

The single-node benchmark script just terminated with errors.

Here's the output:

```console
abanihi@cheyenne2: ~/devel/zarr_hdf_benchmarks $ cat hdfzarr_single.o4502057
Restoring modules from user's hdf5_zarr, for system: "ch"
[mpiexec@r13i7n24] control_cb (../../pm/pmiserv/pmiserv_cb.c:781): connection to proxy 0 at host r13i7n24.ib0.cheyenne.ucar.edu failed
[mpiexec@r13i7n24] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@r13i7n24] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@r13i7n24] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion
[r13i7n24:06233] *** Process received signal ***
[r13i7n24:06233] Signal: Segmentation fault (11)
[r13i7n24:06233] Signal code: (128)
[r13i7n24:06233] Failing at address: (nil)
[r13i7n24:06233] [ 0] /glade/u/apps/ch/os/lib64/libpthread.so.0(+0xf870)[0x2aaaaacdd870]
[r13i7n24:06233] [ 1] /glade/u/apps/ch/os/lib64/libc.so.6(strchrnul+0x23)[0x2aaaaaf771a3]
[r13i7n24:06233] [ 2] /glade/u/apps/ch/os/lib64/libc.so.6(_IO_vfprintf+0x92)[0x2aaaaaf31262]
[r13i7n24:06233] [ 3] /glade/u/apps/ch/os/lib64/libc.so.6(vasprintf+0xa3)[0x2aaaaaf5c703]
[r13i7n24:06233] [ 4] /glade/work/abanihi/softwares/miniconda3/envs/hdf5_zarr/lib/libhdf5.so.103(H5E_printf_stack+0xee)[0x2aaac1687e6e]
[r13i7n24:06233] [ 5] /glade/work/abanihi/softwares/miniconda3/envs/hdf5_zarr/lib/libhdf5.so.103(H5I_inc_ref+0xcc)[0x2aaac17323dc]
... (frames 6-29 cycle through H5E__push_stack, H5E_printf_stack, and H5I_inc_ref in libhdf5.so.103)
[r13i7n24:06233] *** End of error message ***
[mpiexec@r13i7n24] control_cb (../../pm/pmiserv/pmiserv_cb.c:781): connection to proxy 0 at host r13i7n24.ib0.cheyenne.ucar.edu failed
[mpiexec@r13i7n24] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@r13i7n24] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@r13i7n24] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion
[r13i7n24:06506] *** Process received signal ***
[r13i7n24:06506] Signal: Bus error (7)
[r13i7n24:06506] Signal code: Non-existant physical address (2)
[r13i7n24:06506] Failing at address: 0x2aaaaaf771a3
[r13i7n24:06506] [ 0] /glade/u/apps/ch/os/lib64/libpthread.so.0(+0xf870)[0x2aaaaacdd870]
[r13i7n24:06506] [ 1] /glade/u/apps/ch/os/lib64/libc.so.6(strchrnul+0x23)[0x2aaaaaf771a3]
[r13i7n24:06506] [ 2] /glade/u/apps/ch/os/lib64/libc.so.6(_IO_vfprintf+0x92)[0x2aaaaaf31262]
[r13i7n24:06506] [ 3] /glade/u/apps/ch/os/lib64/libc.so.6(vasprintf+0xa3)[0x2aaaaaf5c703]
[r13i7n24:06506] [ 4] /glade/work/abanihi/softwares/miniconda3/envs/hdf5_zarr/lib/libhdf5.so.103(H5E_printf_stack+0xee)[0x2aaac1687e6e]
... (frames 5-29 cycle through H5I_inc_ref, H5E__push_stack, and H5E_printf_stack in libhdf5.so.103)
[r13i7n24:06506] *** End of error message ***
```

I am not sure how to proceed.

rabernat commented 5 years ago

Yeah so this is hard because you have to make sure that all your MPI libraries are compatible throughout your stack.

Maybe a red herring, but it looks from this line (../../ui/mpich/mpiexec.c:1147) that you are invoking mpiexec from MPICH, yet you are using impi in your environment. You need to make sure the same MPI (ideally the one recommended for Cheyenne) is used everywhere.
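A few quick sanity checks can confirm which MPI implementation each layer resolves to before rebuilding anything (paths and output will vary by site):

```shell
# Confirm one MPI implementation is used end to end
which mpicc mpiexec          # both should come from the same module tree
mpiexec --version            # should match the loaded module (impi here, not MPICH)
mpicc -show                  # MPICH/Intel MPI syntax; Open MPI uses `mpicc --showme`
python -c "import mpi4py; print(mpi4py.get_config())"   # compiler mpi4py was built with
```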

Surely there is someone at NCAR who can help sort this out, no?

andersy005 commented 5 years ago

../../ui/mpich/mpiexec.c:1147

I am glad you caught this one.

Surely there is someone at NCAR who can help sort this out, no?

I will see if I can get input from CISL Help desk. I am now going through the documentation to see if there's any mention of recommended compilers and/or tips on how to keep MPI libraries compatible throughout one's software stack.

andersy005 commented 5 years ago

@rabernat, I finally got hdf5 and h5py to build correctly with the right set of compilers. The results can be found here: https://github.com/andersy005/zarr_hdf_benchmarks/blob/master/plot_all_results-build-from-source.ipynb

I did not observe much difference from your results with conda-based libraries. Let me know if there are other hypotheses worth testing and I will test them.

Sometime in March I will look into adding dask parallelism and comparing performance.

kmpaul commented 5 years ago

We might want to look at building the necessary tools using the system libraries, not conda installs. It is unclear to me whether the conda MPI packages are built with MPI-IO support, which is needed to take advantage of the parallel filesystem. My intuition tells me that without MPI-IO support in the MPI library, you will not see scaling across multiple nodes.
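One way to probe whether a given MPI build exposes MPI-IO at all is to open a file collectively through mpi4py's MPI.File interface, which wraps ROMIO (a minimal sketch; the filename is arbitrary):

```shell
# Quick MPI-IO smoke test: collectively create and close a file via ROMIO.
# Fails with an MPI error if the MPI build lacks MPI-IO support.
mpiexec -n 2 python -c "
from mpi4py import MPI
amode = MPI.MODE_WRONLY | MPI.MODE_CREATE
fh = MPI.File.Open(MPI.COMM_WORLD, 'mpiio_test.bin', amode)
fh.Close()
print('MPI-IO OK on rank', MPI.COMM_WORLD.Get_rank())
"
```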

andersy005 commented 5 years ago

We might want to look at building the necessary tools using the system libraries, not conda installs.

For this experiment, I did not use conda libraries. I built MPI-enabled hdf5, h5py, and mpi4py against system libraries. Are there other tools I should be aware of that need to be built from source against system libraries?

kmpaul commented 5 years ago

Ah! I missed the non-use of conda. Sorry.

Still, I would say that you should NOT be building hdf5 yourself. There is a pre-built hdf5 on the system that can be loaded with the hdf5-mpi module. I would try building with that, and try it with different system compilers (e.g., module load intel, module load gnu, module load pgi).
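A hypothetical sketch of that approach, building h5py against the system's parallel HDF5 instead of a hand-built one (module names follow the suggestion above; deriving the prefix from h5pcc's location is an assumption about how the module lays out its install):

```shell
# Build h5py against the system parallel HDF5 (sketch)
module purge
module load ncarenv intel impi hdf5-mpi    # swap intel for gnu or pgi to compare
HDF5_PREFIX=$(dirname $(dirname $(which h5pcc)))   # assumes h5pcc lives in <prefix>/bin
CC=$(which mpicc) python setup.py configure --mpi --hdf5=$HDF5_PREFIX
python setup.py install
```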

And I'm not sure the h5py package really implements the parallel-hdf5 layer properly. I've tested it in the past and not seen it scale. So we should verify that the C library actually scales, perhaps first, just so we know that the underlying layers are actually working!
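HDF5 ships its own parallel I/O benchmark, h5perf, which exercises the C library and MPI-IO directly with no h5py in the loop, so it can isolate whether the lower layers scale (the process count and flags below are illustrative, not a tuned configuration):

```shell
# Benchmark the parallel HDF5 C layer and raw MPI-IO, bypassing Python entirely
mpiexec -n 36 h5perf -A phdf5,mpiio -e 256m -B 4m -x 1m -X 16m -i 3
```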

rabernat commented 5 years ago

I did not observe that much difference from your results with conda-based libraries. Let me know if there are other hypothesis worth testing and I will test them.

That's not completely true! Look at the read and write performance for 72 cores and 4,000,000-byte chunks:

| operation  | your version (native) | my version (conda) |
|------------|----------------------:|-------------------:|
| hdf-read   |                  2112 |               3025 |
| zarr-read  |                  3531 |                221 |
| hdf-write  |                  5279 |               3273 |
| zarr-write |                  3163 |                234 |

The zarr performance increased by more than 10x.

However, it is still the case that we don't observe good scaling with the number of cores.

(scaling plots of read and write throughput vs. number of cores were attached here)

At this point, I would definitely ask someone from Cheyenne for some feedback.

andersy005 commented 5 years ago

The zarr performance increased by more than 10x.

@rabernat, can you speculate on what could be the reason for Zarr's performance increase? For zarr, I used a conda install.

At this point, I would definitely ask someone from Cheyenne for some feedback.

@kmpaul, could you advise whom to contact for some feedback?

rabernat commented 5 years ago

It must have something to do with mpi4py working better when built against the system MPI. I added an MPI barrier at the end of each read block:

https://github.com/rabernat/zarr_hdf_benchmarks/blob/ce31d41616ce5714349f453e2709c1ff5a3b09ef/parallel_read_write.py#L94-L97

This causes execution to pause until all ranks have reached that point. This might happen faster using the native MPI.

jakirkham commented 5 years ago

Hi all, sorry to wander in here uninvited.

I'm just curious: have you raised any issues in conda-forge about your findings? I expect there are other people making use of MPI on different clusters who would be interested to hear what you learned, and willing to work with you to address any performance problems you have identified.

Support for MPI in the h5py build is still very new, so I wouldn't be surprised if there are a few things that need to be ironed out.