marvel-nccr / ansible-role-abinit

A role for the abinit code

Issue with parallel I/O (netcdf) #14

Open · chrisjsewell opened this issue 3 years ago

chrisjsewell commented 3 years ago

Taken from email chain:

@giovannipizzi:

Hi Samuel, probably you're already looking into this, but could you help Chris debug and figure out how to solve the issue of abinit on Quantum Mobile? https://github.com/aiidateam/aiida-common-workflows/issues/159

The only blocking thing, before submission of the common-workflows paper, is to make sure we can reproduce the results of the paper in Quantum Mobile; once abinit (and CP2K, that Chris is addressing separately) are fixed, we're ready to submit.

Let me know if you discover that the issue is much more complex to fix than expected, so we look into a plan B (but I hope that, being a library issue, this can be sorted out?)

Samuel Poncé:

Yes, I've been following it, but I'm not sure how to fix this. In fact, I often struggle linking/compiling Abinit on various HPC machines.

I'm adding Jean-Michel in c.c., since with Matteo, he probably has the most experience on this.

In brief, Chris is trying to make a Docker image of the Quantum Mobile which now includes Abinit. However, Abinit does not run properly, because the Abinit build system detects a NetCDF library as having MPI-IO support but then compiles it without MPI-IO support: https://github.com/aiidateam/aiida-common-workflows/issues/159

@chrisjsewell:

Just to clarify: Docker images are in general OS-agnostic (that's in essence their raison d'être); certainly for software, e.g. the build library locations do not change between macOS and Ubuntu, except… for this subtle compilation-optimisation business (that I was not aware of before) which, only if requested, will look at the host's hardware, so the result will depend on which host the image is built on (see https://stackoverflow.com/a/54163496/5033292)
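
(As a concrete illustration of that host dependence, here is a quick check, just a sketch, that prints what -march=native resolves to on the build host; an image built with it on one machine may use CPU instructions absent on another:)

# show the CPU-specific flags that -march=native expands to on *this* host
gcc -march=native -Q --help=target | grep -- '-march='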

Matteo Giantomassi:

From what I can see in the output of abinit -b, both MPI-IO and parallel netcdf+hdf5 are activated in the build (activated means that the corresponding CPP preprocessing options are activated in the Abinit source and calls to MPI-IO and netcdf4+hdf5 are exposed to the compiler). The linker is happy, hence these external IO routines are present in the external libraries, yet the code aborts inside the MPI library due to a stack smashing error. This means that the ubuntu library you are using was compiled with this kind of runtime check.

To avoid hardcoding paths, you can use the nc-config (C lib) and nf-config (Fortran) executables provided by netcdf:

with_netcdf=$(nc-config --prefix)
with_hdf5=$(nc-config --prefix)
with_netcdf_fortran=$(nf-config --prefix)
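
(For context, a minimal sketch of how these values might be passed to Abinit's configure; the exact --with-* option spellings are an assumption to verify against ./configure --help:)

# hypothetical configure invocation; verify option names with ./configure --help
./configure \
  --with-netcdf="$(nc-config --prefix)" \
  --with-hdf5="$(nc-config --prefix)" \
  --with-netcdf-fortran="$(nf-config --prefix)"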

PS: it would be useful to have the output of:

nc-config --all
nf-config --all

to get further insight into the https://github.com/aiidateam/aiida-common-workflows/issues/159 issue

cc also @sphuber

chrisjsewell commented 3 years ago

Using the test Docker image (on OSX) from #13, i.e. with libpnetcdf-dev installed:

abinit -b

root@instance:/# abinit -b
 DATA TYPE INFORMATION: 
 REAL:      Data type name: REAL(DP) 
            Kind value:      8
            Precision:      15
            Smallest nonnegligible quantity relative to 1: 0.22204460E-015
            Smallest positive number:                      0.22250739E-307
            Largest representable number:                  0.17976931E+309
 INTEGER:   Data type name: INTEGER(default) 
            Kind value: 4
            Bit size:   32
            Largest representable number: 2147483647
 LOGICAL:   Data type name: LOGICAL 
            Kind value: 4
 CHARACTER: Data type name: CHARACTER             Kind value: 1
  ==== Using MPI-2 specifications ==== 
  MPI-IO support is ON
  xmpi_tag_ub ................   2147483647
  xmpi_bsize_ch ..............            1
  xmpi_bsize_int .............            4
  xmpi_bsize_sp ..............            4
  xmpi_bsize_dp ..............            8
  xmpi_bsize_spc .............            8
  xmpi_bsize_dpc .............           16
  xmpio_bsize_frm ............            4
  xmpi_address_kind ..........            8
  xmpi_offset_kind ...........            8
  MPI_WTICK ..................    1.0000000000000001E-009

 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 CPP options activated during the build:

                    CC_GNU                   CXX_GNU                    FC_GNU

 HAVE_FC_ALLOCATABLE_DT...             HAVE_FC_ASYNC         HAVE_FC_BACKTRACE

  HAVE_FC_COMMAND_ARGUMENT      HAVE_FC_COMMAND_LINE        HAVE_FC_CONTIGUOUS

           HAVE_FC_CPUTIME              HAVE_FC_EXIT             HAVE_FC_FLUSH

             HAVE_FC_GAMMA            HAVE_FC_GETENV   HAVE_FC_IEEE_ARITHMETIC

   HAVE_FC_IEEE_EXCEPTIONS          HAVE_FC_INT_QUAD             HAVE_FC_IOMSG

     HAVE_FC_ISO_C_BINDING  HAVE_FC_ISO_FORTRAN_2008        HAVE_FC_LONG_LINES

        HAVE_FC_MOVE_ALLOC  HAVE_FC_ON_THE_FLY_SHAPE           HAVE_FC_PRIVATE

         HAVE_FC_PROTECTED           HAVE_FC_SHIFTLR         HAVE_FC_STREAM_IO

            HAVE_FC_SYSTEM          HAVE_FORTRAN2003                 HAVE_HDF5

             HAVE_HDF5_MPI        HAVE_LIBPAW_ABINIT      HAVE_LIBTETRA_ABINIT

                HAVE_LIBXC                  HAVE_MPI                 HAVE_MPI2

       HAVE_MPI_IALLGATHER       HAVE_MPI_IALLREDUCE        HAVE_MPI_IALLTOALL

       HAVE_MPI_IALLTOALLV           HAVE_MPI_IBCAST         HAVE_MPI_IGATHERV

        HAVE_MPI_INTEGER16               HAVE_MPI_IO HAVE_MPI_TYPE_CREATE_S...

               HAVE_NETCDF       HAVE_NETCDF_FORTRAN   HAVE_NETCDF_FORTRAN_MPI

           HAVE_NETCDF_MPI             HAVE_OS_LINUX         HAVE_TIMER_ABINIT

 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

 === Build Information === 
  Version       : 9.2.1
  Build target  : x86_64_linux_gnu7.5
  Build date    : 20210413

 === Compiler Suite === 
  C compiler       : gnu7.5
  C++ compiler     : gnu7.5
  Fortran compiler : gnu7.5
  CFLAGS           : -g -O2
  CXXFLAGS         : -g -O2
  FCFLAGS          : -g -ffree-line-length-none
  FC_LDFLAGS       : 

 === Optimizations === 
  Debug level        : @abi_debug_flavor@
  Optimization level : @abi_optim_flavor@
  Architecture       : unknown_unknown

 === Multicore === 
  Parallel build : yes
  Parallel I/O   : yes
  openMP support : 
  GPU support    : 

 === Connectors / Fallbacks === 
  LINALG flavor  : netlib
  FFT flavor     : goedecker
  HDF5           : yes
  NetCDF         : yes
  NetCDF Fortran : yes
  LibXC          : yes
  Wannier90      : no

 === Experimental features === 
  Exports             : 
  GW double-precision : 

 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 Default optimizations:
   -O2

 Optimizations for 43_ptgroups:
   -O0

 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

nc-config --prefix

/usr

nc-config --all

root@instance:/# nc-config --all

This netCDF 4.6.0 has been built with the following features: 

  --cc        -> /usr/bin/cc
  --cflags    -> -I/usr/include -I/usr/include/hdf5/serial
  --libs      -> -L/usr/lib/x86_64-linux-gnu -L/usr/lib/x86_64-linux-gnu/hdf5/serial -lnetcdf -lhdf5_hl -lhdf5 -lpthread -lsz -lz -ldl -lm -lcurl

  --has-c++   -> no
  --cxx       -> 

  --has-c++4  -> no
  --cxx4      -> 

  --has-fortran-> yes
  --fc        -> gfortran
  --fflags    -> -I/usr/include
  --flibs     -> -L/usr/lib -lnetcdff -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -lnetcdf -lnetcdf
  --has-f90   -> no
  --has-f03   -> yes

  --has-dap   -> yes
  --has-dap2  -> yes
  --has-dap4  -> yes
  --has-nc2   -> yes
  --has-nc4   -> yes
  --has-hdf5  -> yes
  --has-hdf4  -> no
  --has-logging-> no
  --has-pnetcdf-> no
  --has-szlib -> no
  --has-cdf5 -> no
  --has-parallel-> no

  --prefix    -> /usr
  --includedir-> /usr/include
  --libdir    -> /usr/lib/x86_64-linux-gnu
  --version   -> netCDF 4.6.0

nf-config --all

root@instance:/# nf-config --all

This netCDF-Fortran 4.4.4 has been built with the following features: 

  --cc        -> gcc
  --cflags    ->  -I/usr/include -Wdate-time -D_FORTIFY_SOURCE=2

  --fc        -> gfortran
  --fflags    -> -I/usr/include
  --flibs     -> -L/usr/lib -lnetcdff -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -lnetcdf -lnetcdf 
  --has-f90   -> no
  --has-f03   -> yes

  --has-nc2   -> yes
  --has-nc4   -> yes

  --prefix    -> /usr
  --includedir-> /usr/include
  --version   -> netCDF-Fortran 4.4.4

pnetcdf-config --all

root@instance:/# pnetcdf-config --all

This parallel-netcdf 1.9.0 has been built with the following features: 

  --cc                        -> /usr/bin/mpicc
  --cflags                    -> -g -O2 -fdebug-prefix-map=/build/pnetcdf-iIyKNf/pnetcdf-1.9.0=. -fstack-protector-strong -Wformat -Werror=format-security
  --cppflags                  -> -Wdate-time -D_FORTIFY_SOURCE=2
  --ldflags                   -> -Wl,-Bsymbolic-functions -Wl,-z,relro
  --libs                      -> 

  --has-c++                   -> yes
  --cxx                       -> /usr/bin/mpicxx
  --cxxflags                  -> -g -O2 -fdebug-prefix-map=/build/pnetcdf-iIyKNf/pnetcdf-1.9.0=. -fstack-protector-strong -Wformat -Werror=format-security

  --has-fortran               -> yes
  --f77                       -> /usr/bin/mpif77
  --fflags                    -> -g -O2 -fdebug-prefix-map=/build/pnetcdf-iIyKNf/pnetcdf-1.9.0=. -fstack-protector-strong

  --fc                        -> /usr/bin/mpif90
  --fcflags                   -> -g -O2 -fdebug-prefix-map=/build/pnetcdf-iIyKNf/pnetcdf-1.9.0=. -fstack-protector-strong

  --relax-coord-bound         -> disabled
  --in-place-swap             -> enabled
  --erange-fill               -> enabled
  --subfiling                 -> disabled
  --large-req                 -> disabled
  --null-byte-header-padding  -> disabled
  --debug                     -> disabled

  --prefix                    -> /usr
  --includedir                -> /usr/include
  --libdir                    -> /usr/lib/x86_64-linux-gnu
  --version                   -> parallel-netcdf 1.9.0
  --release-date              -> 19 Dec 2017
  --config-date               -> 

chrisjsewell commented 3 years ago

Eurgh I give up with this rubbish lol:

https://docs.abinit.org/INSTALL_Ubuntu/ says you can simply install libpnetcdf-dev; well, that certainly does not seem to be the case.

Including this in the apt install still gives you --has-pnetcdf -> no and --has-parallel -> no for netcdf. Note it also says netcdf is linked to /usr/lib/x86_64-linux-gnu/hdf5/serial, even though we have installed libhdf5-openmpi-dev and so /usr/lib/x86_64-linux-gnu/hdf5/openmpi is available.
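
(A quick way to see which hdf5 flavours are actually present in the image; just a sketch:)

# list the hdf5 variants installed by apt, and the related dev packages
ls /usr/lib/x86_64-linux-gnu/hdf5/
dpkg -l | grep -E 'libhdf5|libnetcdf|libpnetcdf'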

Does this mean that we also have to build netcdf from source? If so, do we actually need pnetcdf, via --enable-pnetcdf (see https://parallel-netcdf.github.io/), or can we just link against the openmpi hdf5?
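
(For reference, a rough sketch of what building netcdf-c from source against the OpenMPI flavour of hdf5 might look like; the paths follow the Ubuntu layout noted above, and the whole thing is an untested assumption:)

# sketch: configure netcdf-c (from an unpacked source tree) against parallel hdf5
export CC=mpicc
export CPPFLAGS="-I/usr/include/hdf5/openmpi"
export LDFLAGS="-L/usr/lib/x86_64-linux-gnu/hdf5/openmpi"
./configure --prefix=/usr/local
make -j"$(nproc)" && make install
/usr/local/bin/nc-config --has-parallel   # should now report yes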

gmatteo commented 3 years ago

Let's try to simplify things a bit and mainly focus on the hard requirements, i.e. the libs that allow users to run standard GS/DFPT calculations in parallel with MPI and produce (small) netcdf files that can be used by python tools such as AbiPy for visualization purposes (e.g. band structure plots).

The first question is: what happens if you try to run the input file that, in the previous build, was aborting with a stack smashing error when calling MPI_FILE_OPEN?

Do you still have the same error?

If this first test completes successfully, I would say that the fact that your netcdf library does not support parallel IO (--has-parallel -> no) is not a big deal. Basic MPI-IO capabilities provided by the MPI library are enough for standard calculations. In other words, Abinit will be able to write/read Fortran binary files in parallel using MPI-IO and stream IO (no netcdf/hdf5 stuff is required here).
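
(A standalone way to exercise exactly this code path, independent of Abinit; a minimal sketch, not from the thread. If this aborts with a stack smashing error, the problem lies in the MPI stack itself:)

# write, compile and run a minimal MPI-IO smoke test mirroring MPI_FILE_OPEN
cat > mpiio_smoke.f90 <<'EOF'
program mpiio_smoke
  use mpi
  implicit none
  integer :: ierr, fh
  call MPI_Init(ierr)
  ! the same call that was reported to trigger the stack smashing abort
  call MPI_File_open(MPI_COMM_WORLD, 'test.bin', &
       MPI_MODE_CREATE + MPI_MODE_WRONLY, MPI_INFO_NULL, fh, ierr)
  call MPI_File_close(fh, ierr)
  call MPI_Finalize(ierr)
end program mpiio_smoke
EOF
mpif90 mpiio_smoke.f90 -o mpiio_smoke && mpirun -np 2 ./mpiio_smoke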

If the error persists, we have a serious problem. As I explained in the previous post, some of these ubuntu libraries are compiled with -fstack-protector and/or -D_FORTIFY_SOURCE=2. For instance, I see:

root@instance:/# nf-config --all

This netCDF-Fortran 4.4.4 has been built with the following features: 

  --cc        -> gcc
  --cflags    -> -I/usr/include -Wdate-time -D_FORTIFY_SOURCE=2

root@instance:/# pnetcdf-config --all

  --cflags                    -> -g -O2 -fdebug-prefix-map=/build/pnetcdf-iIyKNf/pnetcdf-1.9.0=. -fstack-protector-strong -Wformat -Werror=format-security
  --cppflags                  -> -Wdate-time -D_FORTIFY_SOURCE=2

so I assume that the MPI library was also compiled with similar options.
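
(Debian/Ubuntu packages inherit these hardening flags from the distribution's default build flags; a quick sketch of how to inspect them:)

# the distro-wide defaults that apt-built libraries are compiled with
dpkg-buildflags --get CFLAGS     # typically includes -fstack-protector-strong
dpkg-buildflags --get CPPFLAGS   # typically includes -D_FORTIFY_SOURCE=2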

From this man page:

With _FORTIFY_SOURCE set to 2, some more checking is added, but some conforming programs might fail.

In this case, "the program" should be understood as the MPI/netcdf/hdf5 library provided by apt, so the stack smashing issue should be reported to the maintainers of these packages; Abinit is just a client of these libs, and there's no way to disable these checks on our side.

As mentioned here

_FORTIFY_SOURCE level 2 is more secure, but is a slightly riskier compilation strategy; if you use it, make sure you have very strong regression tests for your compiled code to prove the compiler hasn't introduced any unexpected behaviour.

If the GS calculation seems to work in parallel, I would say we are on the right track, and we only need to check whether the other basic capabilities work as expected. At this point, you may want to use the runtests.py script to execute additional parts of the Test Suite, just to improve the coverage a bit:

cd ~abinit/tests
./runtests.py v1 v3 -j2  # run tests in the v1, v3 directories with 2 python threads (fast)
./runtests.py mpiio -n4  # run tests in the mpiio dir with 4 MPI procs (this will take more time)

If the tests are OK, I would say that the basic stuff works as expected. Running all the tests (~2000) will take much longer (~40 minutes with 6 cores), so you may want to skip this part.

PS:

Note that having an hdf5 library that supports MPI-IO (--has-parallel -> yes) is not required by Abinit. Besides, parallel-netcdf (--has-pnetcdf -> yes) refers to (yet another) implementation of parallel netcdf IO that is still around for legacy reasons, so I don't think you need it to build Abinit.

We (optionally) require an hdf5 library compiled with MPI-IO support, but in this case the compilation/linking process becomes more complicated because the full software stack (netcdf Fortran/C, hdf5-C) must be compiled with the same MPI library used to compile Abinit. That's the reason why our build system provides a shell script to compile the different libs from source using mpif90 and mpicc, for when the HPC center does not provide pre-installed modules that work out of the box.

The reason is that MPI is not just an API but also an implementation-dependent ABI, so it's not possible to mix libs compiled with different compilers/MPI implementations. It's not abinit that is complicated to build (although we always welcome comments and suggestions to facilitate the build process); it's the MPI software stack that is tricky, and things become even more complicated when you have a library that depends on MPI. That's the reason why I'm suggesting to ignore the problem with hdf5+MPI-IO and just focus on having an MPI library that does not crash when one tries to create a file.
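
(A quick way to check which MPI implementation, if any, a given shared library was linked against; a sketch using the Ubuntu library path noted earlier:)

# no MPI entries in the output means the library is a serial build
ldd /usr/lib/x86_64-linux-gnu/libnetcdf.so | grep -i mpi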

chrisjsewell commented 3 years ago

Thanks for the reply @gmatteo

The first question is: what happens if you try to run the input file that, in the previous build, was aborting with a stack smashing error when calling MPI_FILE_OPEN? Do you still have the same error?

I'm unclear why you think this would have changed relative to the previous build, given that the only difference is the installation of libpnetcdf-dev, which (as noted above) does not appear to change anything.

At this point, you may want to use the runtests.py script to execute additional parts of the Test Suite, just to improve a bit the coverage:

This is already run in https://github.com/marvel-nccr/ansible-role-abinit/blob/master/tasks/tests.yml, and does not surface the stack smashing error.

In this case, "the program" should be understood as the MPI/netcdf/hdf5 library provided by apt, so the stack smashing issue should be reported to the maintainers of these packages; Abinit is just a client of these libs, and there's no way to disable these checks on our side.

If you think there is an issue with the apt libraries, fair enough; you are certainly more knowledgeable in this area than me. But then this should be made clear in https://docs.abinit.org/INSTALL_Ubuntu/, which specifically recommends using these apt libraries.

It's not abinit that is complicated to build ... it's the MPI software stack ... That's the reason why I'm suggesting to ignore the problem with hdf5+MPI-IO and just focus on having an MPI library that does not crash when one tries to create a file.

Again, I would note here that this is not an issue for any of the other simulation codes built against exactly the same MPI libraries.

chrisjsewell commented 3 years ago

Anyhow, I don't see a way forward on this install route into Ubuntu, so I will pivot to look at the Conda install route.

giovannipizzi commented 3 years ago

Just a quick comment/question to avoid possible misunderstandings: @chrisjsewell, after installing libpnetcdf-dev, did you also re-run the configure/make part of abinit from scratch, or did you just install the library with APT?

I think (@gmatteo correct me if I'm wrong) that installing that package makes it possible for the configure system to detect the library, and therefore compile abinit with the right support. However, just installing the library without recompiling abinit should not change the behaviour of the code (I think).

chrisjsewell commented 3 years ago

after installing libpnetcdf-dev, did you also re-run the configure/make part of abinit from scratch

I didn't just install libpnetcdf-dev: I added it to the ansible role (#13) and then ran tox converge to recreate the entire Docker container from scratch, including the apt install and compilation.

ltalirz commented 3 years ago

I haven't read through the thread; just wanted to provide a link to the build.sh used to build the abinit conda package on conda-forge, in case it helps: https://github.com/conda-forge/abinit-feedstock/blob/master/recipe/build.sh

chrisjsewell commented 3 years ago

Thanks, although I'd say the make command isn't actually the salient point of the recipe (it's basically the same here); the salient point is that the netcdf packages linked against are ones that have been compiled against the MPI library: https://github.com/conda-forge/abinit-feedstock/blob/master/recipe/meta.yaml#L57

Basically, I believe that to get parallel I/O here, we would also have to compile the netcdf libraries ourselves, rather than just installing them from apt.
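
(For the record, conda-forge distinguishes serial and MPI variants of these libs via build strings; a hypothetical sketch of requesting the MPI-enabled builds, where the exact build-string patterns are an assumption to verify against the feedstocks:)

# request the OpenMPI variants of hdf5/libnetcdf from conda-forge
conda create -n abinit-mpi -c conda-forge "hdf5=*=mpi_openmpi*" "libnetcdf=*=mpi_openmpi*"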

chrisjsewell commented 3 years ago

Needless to say, this introduces yet more complexity and build time to Quantum Mobile (for which abinit is already one of the longest-running build components), so if we are planning to move to Conda anyway, I would rather spend my time on that than on trying to add the netcdf compilation.

ltalirz commented 3 years ago

https://github.com/conda-forge/abinit-feedstock/issues/32

chrisjsewell commented 3 years ago

To also link to the conda effort: https://github.com/marvel-nccr/ansible-role-conda-codes/pull/1