ghb24 / NECI_STABLE

Standalone NECI codebase designed for FCIQMC and other stochastic quantum chemistry methods.
GNU General Public License v3.0

Running NECI fails #13

Open fstein93 opened 10 months ago

fstein93 commented 10 months ago

Dear developers,

I am currently trying to compile and run NECI on my notebook. The code compiles, but any calculation I start with NECI fails with an unhelpful error message.

Error message

Backtrace for this error:

0 0x7f07aecf151f in ???

    at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0

1 0x0 in ???

The last few lines of the output file of the directory NECI_STABLE/test_suite/neci/parallel/N_FCIMCPar

Setting integer bit-length of determinants as bit-strings to: 64
SYMMETRY MULTIPLICATION TABLE
No Symmetry table found.
21 Symmetry PAIRS
8 DISTINCT ORBITAL PAIR PRODUCT SYMS
Symmetry and spin of orbitals correctly set up for excitation generators.
Simply transferring this into a spin orbital representation.
Not storing the H matrix.

My setup:

OS: Ubuntu 22 hosted on Windows Subsystem for Linux (WSL)
Compilers: GCC 10 (the oldest compiler still available on the system) and GCC 13
MPI: OpenMPI 4.1.5
CPU: AMD Ryzen5 5600H
Memory: 16 GB RAM

I tried to compile with and without HDF5 1.12.2. I tried to compile the code with the standard optimization level (-O3) and the debug one (-Og), always with the same result. I tried both a serial run and a parallel run.

Do you have any advice on how to compile and run the code? What are the memory requirements of the code apart from the replicated ERIs?

jphaupt commented 10 months ago

Unfortunately I am having a hard time recreating your problem. I suppose you are running the test in test_suite/neci/parallel/N_FCIMCPar (presumably while in the directory, so that NECI can see the FCIDUMP)? If not, what is your input? Are you able to run the unit tests?

You could try something like the following (start in root dir):

mkdir build && cd build # or wherever you prefer to build
cmake -DENABLE_HDF5=OFF -DCMAKE_BUILD_TYPE=Debug -DCMAKE_Fortran_COMPILER=mpifort -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx  ..
cmake --build . -j -v --
ctest -j # you may wish to end this early if they are all passing so far; it is rather slow with Debug mode
cd ../test_suite/neci/parallel/N_FCIMCPar/
../../../../build/bin/neci neci.inp

At which step (if any) do you get an error, and what is it?
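To make it easy to report which step fails, the commands above can be wrapped in a small helper. This is only a sketch: `run_step` is a hypothetical name that does not exist in NECI, and the commented usage simply repeats the commands from this comment.

```shell
# run_step NAME CMD [ARGS...]
# Runs one command, prints its name, and reports the exit status if it fails.
# (Hypothetical helper for illustration; not part of NECI.)
run_step() {
  name=$1
  shift
  echo "==> $name: $*"
  if "$@"; then
    echo "OK: $name"
  else
    status=$?   # exit status of the command that just failed
    echo "FAILED at step '$name' (exit status $status)" >&2
    return "$status"
  fi
}

# Usage, from an empty build directory inside the repository root:
#   run_step configure cmake -DENABLE_HDF5=OFF -DCMAKE_BUILD_TYPE=Debug \
#       -DCMAKE_Fortran_COMPILER=mpifort -DCMAKE_C_COMPILER=mpicc \
#       -DCMAKE_CXX_COMPILER=mpicxx .. &&
#   run_step build cmake --build . -j -v &&
#   run_step unit-tests ctest -j
```

Chaining the calls with `&&` stops at the first failing step, so the last line printed names the step to report.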

fstein93 commented 10 months ago

Currently, I have a number of compilation issues, which I attribute to several routines being incorrectly declared as pure:

  1. Starting from src/lib/error_handling_neci.F90, line 8: the routine implementing the interface is definitely NOT pure, which is not allowed.
  2. GCC apparently reports a false positive in src/matmul.F90, lines 24/29, assuming that the variable I may be 1 in the second branch.

fstein93 commented 10 months ago

I finished the tests. A few executables are not created (including neci). The test summary is

58% tests passed, 40 tests failed out of 96

Total Test time (real) = 5.06 sec

The following tests FAILED:
1 - test_neci_countbits (Not Run)
5 - test_kdneci_countbits (Not Run)
7 - test_neci_real_space_hubbard (Not Run)
8 - test_neci_lattice_mod (Not Run)
12 - test_kdneci_lattice_mod (Not Run)
14 - test_neci_molecular_tc (Not Run)
15 - test_neci_tc_freeze (Not Run)
16 - test_neci_back_spawn (Not Run)
20 - test_kdneci_back_spawn (Not Run)
22 - test_neci_back_spawn_excit_gen (Not Run)
23 - test_kneci_back_spawn_excit_gen (Failed)
24 - test_dneci_back_spawn_excit_gen (Failed)
25 - test_mneci_back_spawn_excit_gen (Failed)
26 - test_kdneci_back_spawn_excit_gen (Not Run)
27 - test_kmneci_back_spawn_excit_gen (Failed)
32 - test_kdneci_ueg_excit_gens (Not Run)
36 - test_neci_gasci_util (Not Run)
37 - test_neci_gasci_on_the_fly_heat_bath (SEGFAULT)
38 - test_neci_gasci_disconnected (SEGFAULT)
39 - test_neci_gasci_discarding (SEGFAULT)
40 - test_neci_gasci_supergroup_index (Not Run)
41 - test_neci_gasci_pchb_rhf_hermitian (Not Run)
42 - test_neci_gasci_pchb_rhf_nonhermitian (Not Run)
43 - test_neci_gasci_pchb_uhf_hermitian (Not Run)
44 - test_neci_gasci_pchb_uhf_nonhermitian (Not Run)
50 - test_kdneci_lattice_models_utils (Not Run)
56 - test_kdneci_cc_amplitudes (Not Run)
62 - test_kdneci_cepa_shifts (Not Run)
68 - test_kdneci_guga (Not Run)
70 - test_kmneci_impurity_excit_gen (SEGFAULT)
71 - test_neci_pcpp_excitgen (SEGFAULT)
72 - test_neci_pchb_excitgen_rhf_hermitian (Not Run)
73 - test_neci_pchb_excitgen_rhf_nonhermitian (Not Run)
74 - test_neci_pchb_excitgen_uhf_hermitian (Not Run)
75 - test_neci_pchb_excitgen_uhf_nonhermitian (Not Run)
77 - test_mneci_loop (SEGFAULT)
82 - test_kdneci_guga_pchb_excitgen (Not Run)
93 - test_neci_aliasTables (Not Run)
94 - test_neci_CDF_sampler (Not Run)
96 - test_neci_sltcnd (SEGFAULT)

The full test output is here. The extended output is here.
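A long failure list like this is easier to read when grouped by status, because "Not Run" entries usually mean the test executable was never built, while "Failed" and "SEGFAULT" are runtime problems. The following is a minimal sketch that tallies the parenthesised status of each failed-test line; `summarize_failures` is a hypothetical helper name, and it assumes one test per line in the form `N - name (Status)`, as ctest prints it.

```shell
# summarize_failures: read ctest output on stdin and count failed tests
# by the status shown in parentheses at the end of each line.
# (Hypothetical helper for illustration; not part of NECI or CTest.)
summarize_failures() {
  awk '/^[ \t]*[0-9]+ - .*\(.*\)[ \t]*$/ {
         s = $0
         sub(/^.*\(/, "", s)        # drop everything up to the last "("
         sub(/\)[ \t]*$/, "", s)    # drop the closing ")"
         count[s]++
       }
       END { for (s in count) printf "%s: %d\n", s, count[s] }' | sort
}

# Usage, from the build directory:
#   ctest -j 2>&1 | summarize_failures
```

Once the build problems are resolved, `ctest --rerun-failed --output-on-failure` re-runs only the tests that failed previously and prints their output.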

jphaupt commented 10 months ago

Sorry, could you please clarify how you were able to run it before? I was under the impression from your first post that it compiles but crashes at the start of the run (after some print out). I suppose you must have compiled differently; do you have those commands and/or your CMakeCache?

The stop_all routine is written that way with an interface specifically as a trick that allows us to use it as a pure function. That shouldn't be the problem. What specifically is the compilation error you are getting?

fstein93 commented 9 months ago

I did more tests. Apparently, CMake sometimes picked up the wrong compilers.
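One way to check which compilers a build directory is actually configured with is to look at the entries CMake stored in CMakeCache.txt. A minimal sketch, assuming the default cache file name; `show_cached_compilers` is a hypothetical helper name:

```shell
# show_cached_compilers [CACHE_FILE]
# Prints the C, C++ and Fortran compiler entries from a CMake cache.
# (Hypothetical helper for illustration; defaults to ./CMakeCache.txt.)
show_cached_compilers() {
  cache=${1:-CMakeCache.txt}
  # The colon right after COMPILER excludes entries such as
  # CMAKE_Fortran_COMPILER_AR.
  grep -E '^CMAKE_(C|CXX|Fortran)_COMPILER:' "$cache"
}

# Usage, from the build directory:
#   show_cached_compilers
```

If the listed paths are MPI wrappers, the underlying compiler can be checked with `mpifort --version`, or with `mpifort --showme` on Open MPI. Starting from an empty build directory avoids reusing a stale cache.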

In a clean setup, I started with the Intel compilers (Intel oneapi, Intel HPC kit, version 2021.10.0) which compiles perfectly and most tests pass.

Then, I compiled with GCC 13.2+OpenMPI 4.1.5 and the code does not compile because of

  1. A false positive in src/matmul.F90, lines 24/29, where GCC assumes that the variable I may be 1 in the second branch. This is fixed by unrolling the first step of the loop.
  2. The interface of the routine hidden_stop_all, defined in src/lib/error_handling_neci.F90, does not match its implementation in src/lib/error_handling_neci_impls.F90; the interface of a submodule procedure must match the one provided in its parent (sub)module. Here, the interface defined in error_handling_neci.F90 is PURE, whereas its implementation is not declared as PURE. In general, this may lead to unintended behaviour if the compiler decides to reorder calls to an apparently pure procedure whose actual implementation is not pure. I guess that some compilers do not check the interfaces. Locally, I work around the issue by commenting out the respective code (not a good workaround).
  3. The compilation flag -Werror also turns the -Wsurprising warning into an error (as with -Werror=surprising), which breaks the build in build/fypp/libmneci/sltcnd.F90, line 25, with the message Type specified for intrinsic function ‘size’ at (1) is ignored (pointing to use excitation_types). I fix it by running sed -i "s/-Werror/-Werror -Wno-error=surprising/g" src/*/*/flags.make unit_tests/*/*/*/flags.make from the build directory.
  4. An internal compiler error (which could also be an actual bug in the code) while compiling src/CMakeFiles/libdneci.dir/real_space_hubbard.F90.o.
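The sed workaround from item 3 appends the extra flag again each time it is run. A slightly more defensive sketch is shown below, assuming GNU sed (as on Ubuntu) and flags.make files that contain a bare -Werror; `relax_surprising` is a hypothetical helper name.

```shell
# relax_surprising FILE...
# Adds -Wno-error=surprising after -Werror in each given file, skipping
# files that were already patched so repeated runs do not duplicate the flag.
# (Hypothetical helper for illustration; requires GNU sed for "-i".)
relax_surprising() {
  for f in "$@"; do
    [ -f "$f" ] || continue
    grep -q -- '-Wno-error=surprising' "$f" && continue
    sed -i 's/-Werror/-Werror -Wno-error=surprising/g' "$f"
  done
}

# Usage, from the build directory (paths as in item 3):
#   relax_surprising src/*/*/flags.make unit_tests/*/*/*/flags.make
```

Because flags.make files are generated by CMake, the patch is lost whenever CMake regenerates the build system and has to be applied again.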

EDIT: It is sad if only the intel compiler works as there is enough non-Intel hardware outside (clusters, notebooks, desktop machines, 8 out of the top 10 of the TOP500 list, 7 out of the top 10 of the GREEN500 list) on which the Intel compiler is not expected to provide efficient code or not even available.

What compiler do you commonly use to build the code?

mcocdawc commented 9 months ago

Dear @fstein93 ,

Thank you for all these comments and sorry for our late answer. I was gone for a longer time.

It is sad if only the intel compiler works as there is enough non-Intel hardware outside

Fully agree. We also use AMD hardware ourselves and strive to support as many compilers as possible. In our test suite we test with gcc 7.5, but on my private machine I have also compiled and run the code with gcc 10. Indeed, we should get our hands on a newer gcc.

Regarding your second comment: yes, this faked interface is wrong, but we know it and did it consciously. Note that error stop is allowed by the Fortran standard in pure procedures. The only reason why the routine is not pure in our case is that MPI_Abort is not pure, and in a parallel environment we prefer MPI_Abort over error stop. So far this faked interface has never been a problem; if it really is one, we might replace it with error stop.

fstein93 commented 9 months ago

I agree. GCC 7 is no longer widely available (or even supported) on most systems (my OS starts with GCC 10).

I still double-checked the standard. You are right regarding error stop (and of course MPI). But even then, the write(stdout,*) statements are definitely not possible in pure procedures, only internal write statements to variables are allowed.

Unrelated to that, you might also consider to switch to the modern mpi_f08 standard which solves several issues of the old interfaces.

fstein93 commented 9 months ago

Another aspect you might consider is OpenMP parallelization. If I understand it correctly, your code is only MPI parallelized. As such, you have to replicate ERIs between all ranks which becomes your memory bottleneck. With a hybrid MPI/OpenMP approach, you may reduce the memory requirements by using (ideally) only a single (a few) rank(s) per node and allowing the OpenMP threads to share the ERIs (they do not change anyways).

mcocdawc commented 9 months ago

Can you compile with gcc10?

Unrelated to that, you might also consider to switch to the modern mpi_f08 standard which solves several issues of the old interfaces.

That is already done, in our private repo. We will soon update this public one.

Regarding OpenMP: What is an ERI? I don't know this abbreviation. We already use hybrid parallelization and shared memory. Large allocations that can be shared between processes on one node, e.g. two-electron integrals or PCHB alias tables, use shared arrays. We use MPI shared memory for this. In addition, the information about walkers differs between processes and would consist of private variables in an OpenMP parallelization anyway.

fstein93 commented 9 months ago

ERI=Electron Repulsion Integral

Compiling with GCC 10 did not work for me either, with issues similar to those with GCC 13.

mcocdawc commented 9 months ago

ERI=Electron Repulsion Integral

Ah, good to know. As I wrote above, they already reside in node-shared memory.

Regarding GCC 10: I will leave the issue open, and we aim to update our compilers sooner rather than later.

fstein93 commented 9 months ago

I have just given the recompilation with GCC another try. I have figured out that the code does indeed compile with GCC 13.1 in release mode. As soon as I compile NECI in debug mode, the compilation fails with the error(s) described above.

In release mode, the test suite fails with the same unspecific error (a segmentation fault without any reasonable backtrace). Running NECI under the GNU debugger points to /home/fstein/NECI/NECI_STABLE/build/fypp/libneci/excitation_types.F90, line 1496 (I am aware that this file is generated by FYPP). I have observed the same issue in my own projects, and I know that it is a bug in gfortran related to the assignment of a polymorphic variable. It is fixed by using sourced allocation (compare here). The same issue appears in /home/fstein/NECI/NECI_STABLE/build/fypp/libkmneci/excitation_types.F90, line 1536. Fixing both lines results in executables which pass the test suite.

mcocdawc commented 8 months ago

OK, this is good news. If the problems in DEBUG mode only result from escalating warnings to errors and you still want a DEBUG build, then you can set

set( ${PROJECT_NAME}_Fortran_WARN_ERROR_FLAG "")

instead of

set( ${PROJECT_NAME}_Fortran_WARN_ERROR_FLAG "-Werror")

in cmake/compiler_flags/GNU_Fortran.cmake.

Oh yes, I found a similar bug in gfortran here. It is fixed in our private repo (using a subroutine as a factory function); we will soon update the public one.

PS: If you need access because you would like to use NECI for actual production runs I can ask if we can give you access to the private repo. It is just that we don't want unpublished implementations of new algorithms out in the wild.