fstein93 opened this issue 1 year ago
Unfortunately I am having a hard time recreating your problem. I suppose you are running the test in test_suite/neci/parallel/N_FCIMCPar
(presumably while in the directory, so that NECI can see the FCIDUMP)? If not, what is your input? Are you able to run the unit tests?
You could try something like the following (start in root dir):
mkdir build && cd build # or wherever you prefer to build
cmake -DENABLE_HDF5=OFF -DCMAKE_BUILD_TYPE=Debug -DCMAKE_Fortran_COMPILER=mpifort -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx ..
cmake --build . -j -v --
ctest -j # you may wish to end this early if they are all passing so far; it is rather slow with Debug mode
cd ../test_suite/neci/parallel/N_FCIMCPar/
../../../../build/bin/neci neci.inp
At which step (if any) do you get an error, and what is it?
Currently, I have a bunch of compilation issues due to several routines being incorrectly declared as pure: (1) starting from src/lib/error_handling_neci.F90, line 8, the routine implementing the interface is definitely NOT pure, which is not allowed; (2) GCC apparently reports a false positive in src/matmul.F90, lines 24/29, assuming that the variable `I` may be 1 in the second branch.
I finished the tests. A few executables are not created (including neci). The test summary is: 58% tests passed, 40 tests failed out of 96.
Total Test time (real) = 5.06 sec
The following tests FAILED:
1 - test_neci_countbits (Not Run)
5 - test_kdneci_countbits (Not Run)
7 - test_neci_real_space_hubbard (Not Run)
8 - test_neci_lattice_mod (Not Run)
12 - test_kdneci_lattice_mod (Not Run)
14 - test_neci_molecular_tc (Not Run)
15 - test_neci_tc_freeze (Not Run)
16 - test_neci_back_spawn (Not Run)
20 - test_kdneci_back_spawn (Not Run)
22 - test_neci_back_spawn_excit_gen (Not Run)
23 - test_kneci_back_spawn_excit_gen (Failed)
24 - test_dneci_back_spawn_excit_gen (Failed)
25 - test_mneci_back_spawn_excit_gen (Failed)
26 - test_kdneci_back_spawn_excit_gen (Not Run)
27 - test_kmneci_back_spawn_excit_gen (Failed)
32 - test_kdneci_ueg_excit_gens (Not Run)
36 - test_neci_gasci_util (Not Run)
37 - test_neci_gasci_on_the_fly_heat_bath (SEGFAULT)
38 - test_neci_gasci_disconnected (SEGFAULT)
39 - test_neci_gasci_discarding (SEGFAULT)
40 - test_neci_gasci_supergroup_index (Not Run)
41 - test_neci_gasci_pchb_rhf_hermitian (Not Run)
42 - test_neci_gasci_pchb_rhf_nonhermitian (Not Run)
43 - test_neci_gasci_pchb_uhf_hermitian (Not Run)
44 - test_neci_gasci_pchb_uhf_nonhermitian (Not Run)
50 - test_kdneci_lattice_models_utils (Not Run)
56 - test_kdneci_cc_amplitudes (Not Run)
62 - test_kdneci_cepa_shifts (Not Run)
68 - test_kdneci_guga (Not Run)
70 - test_kmneci_impurity_excit_gen (SEGFAULT)
71 - test_neci_pcpp_excitgen (SEGFAULT)
72 - test_neci_pchb_excitgen_rhf_hermitian (Not Run)
73 - test_neci_pchb_excitgen_rhf_nonhermitian (Not Run)
74 - test_neci_pchb_excitgen_uhf_hermitian (Not Run)
75 - test_neci_pchb_excitgen_uhf_nonhermitian (Not Run)
77 - test_mneci_loop (SEGFAULT)
82 - test_kdneci_guga_pchb_excitgen (Not Run)
93 - test_neci_aliasTables (Not Run)
94 - test_neci_CDF_sampler (Not Run)
96 - test_neci_sltcnd (SEGFAULT)
Sorry, could you please clarify how you were able to run it before? I was under the impression from your first post that it compiles but crashes at the start of the run (after some printout). I suppose you must have compiled differently; do you have those commands and/or your CMakeCache?
The stop_all routine is written that way with an interface specifically as a trick that allows us to use it as a pure function. That shouldn't be the problem. What specifically is the compilation error you are getting?
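For readers following along, the trick presumably looks something like the following minimal sketch (illustrative code, not the actual NECI source): the caller only sees a pure interface block for an external procedure, so the compiler cannot cross-check it against the impure implementation in another file.

```fortran
! caller side: a "fake pure" interface; names are illustrative
module error_handling
    implicit none
    interface
        ! The interface promises purity, so stop_all may be called
        ! from pure procedures ...
        pure subroutine stop_all(sub_name, msg)
            character(*), intent(in) :: sub_name, msg
        end subroutine
    end interface
end module

! ... while the external implementation, compiled separately,
! is in fact not pure: it performs I/O and aborts the MPI run.
subroutine stop_all(sub_name, msg)
    use mpi
    implicit none
    character(*), intent(in) :: sub_name, msg
    integer :: ierr
    write(*, '(4A)') 'Error in ', sub_name, ': ', msg
    call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
end subroutine
```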
I did more tests. Apparently, CMake sometimes picked up the wrong compilers.
In a clean setup, I started with the Intel compilers (Intel oneAPI HPC Toolkit, version 2021.10.0); the code compiles perfectly and most tests pass.
Then I compiled with GCC 13.2 + OpenMPI 4.1.5, and the code does not compile because GCC claims that `I` may be 1 in the second branch. This is fixed by unrolling the first step of the loop (see the sketch at the end of this comment). `-Werror` effectively implies `-Werror=surprising`, which raises an issue in build/fypp/libmneci/sltcnd.F90, line 25, stating "Type specified for intrinsic function ‘size’ at (1) is ignored" (pointing to `use excitation_types`). I fix it by running

sed -i "s/-Werror/-Werror -Wno-error=surprising/g" src/*/*/flags.make unit_tests/*/*/*/flags.make

from the build directory.

EDIT: It is sad if only the Intel compiler works, as there is plenty of non-Intel hardware out there (clusters, notebooks, desktop machines, 8 of the top 10 on the TOP500 list, 7 of the top 10 on the GREEN500 list) on which the Intel compiler is not expected to produce efficient code, or is not even available.
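For illustration, the unrolling fix mentioned above follows roughly this pattern (a hypothetical reconstruction with made-up names, not the actual matmul.F90 code):

```fortran
! Hypothetical reconstruction of the false positive; the names
! (pair_sum, a, prev, cur) are illustrative, not from matmul.F90.
pure function pair_sum(a) result(total)
    real, intent(in) :: a(:)
    real :: total, prev, cur
    integer :: i
    total = 0.0
    if (size(a) < 2) return
    ! Original pattern: gfortran cannot see that the i > 1 branch is
    ! unreachable in the first iteration and warns (with -Werror,
    ! errors) that prev may be used uninitialized:
    !
    ! do i = 1, size(a)
    !     cur = a(i)
    !     if (i > 1) total = total + cur*prev
    !     prev = cur
    ! end do
    !
    ! Fix: peel off the first iteration so prev is always initialized.
    prev = a(1)
    do i = 2, size(a)
        cur = a(i)
        total = total + cur*prev
        prev = cur
    end do
end function
```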
What compiler do you commonly use to build the code?
Dear @fstein93 ,
Thank you for all these comments, and sorry for our late answer; I was away for a longer period.
> It is sad if only the Intel compiler works, as there is plenty of non-Intel hardware out there
Fully agree; we also use AMD hardware ourselves and strive to support as many compilers as possible. In our test suite we test with GCC 7.5, but on my private machine I have also compiled and run with GCC 10. Indeed, we should get our hands on a newer GCC.
Regarding your second comment.
Yes, this faked interface is wrong, but we know it and did it consciously. Note that an `error stop` is also allowed by the Fortran standard in pure procedures. The only reason why it is not pure in our case is that `MPI_Abort` is not pure, and in a parallel environment we prefer it over `error stop`. So far this faked interface has never been a problem; if it really becomes one, we might replace it with `error stop`.
I agree, GCC 7 is not widely available (or even supported) anymore on most hardware (my OS starts with GCC 10).
I still double-checked the standard. You are right regarding `error stop` (and of course MPI). But even then, the `write(stdout,*)` statements are definitely not possible in pure procedures; only internal `write` statements to variables are allowed.
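As a minimal illustration of those rules (illustrative names, not NECI code):

```fortran
pure subroutine check_positive(x, msg)
    real, intent(in) :: x
    character(len=64), intent(out) :: msg
    ! Internal write to a variable: allowed in pure procedures
    write(msg, '(A, ES12.4)') 'checked: ', x
    ! write(*,*) 'checked: ', x   ! external I/O: NOT allowed here
    ! error stop: allowed in pure procedures since Fortran 2018
    if (x < 0.0) error stop 'check_positive: negative input'
end subroutine
```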
Unrelated to that, you might also consider switching to the modern `mpi_f08` bindings, which solve several issues of the old interfaces.
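For reference, a generic `mpi_f08` hello-world (not NECI code) showing two of the improvements: typed handles instead of bare integers, and optional `ierror` arguments:

```fortran
program f08_demo
    use mpi_f08   ! typed handles instead of bare integers
    implicit none
    type(MPI_Comm) :: comm
    integer :: rank, nproc

    call MPI_Init()            ! the ierror argument is optional in mpi_f08
    comm = MPI_COMM_WORLD
    call MPI_Comm_rank(comm, rank)
    call MPI_Comm_size(comm, nproc)
    ! Passing, say, a communicator where a window is expected is now a
    ! compile-time type error instead of a silent integer mix-up.
    print '(A, I0, A, I0)', 'rank ', rank, ' of ', nproc
    call MPI_Finalize()
end program
```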
Another aspect you might consider is OpenMP parallelization. If I understand correctly, your code is only MPI-parallelized. As such, you have to replicate the ERIs across all ranks, which becomes your memory bottleneck. With a hybrid MPI/OpenMP approach, you could reduce the memory requirements by using ideally only a single rank (or a few ranks) per node and letting the OpenMP threads share the ERIs (they do not change anyway).
Can you compile with gcc10?
> Unrelated to that, you might also consider switching to the modern `mpi_f08` bindings, which solve several issues of the old interfaces.
That is already done, in our private repo. We will soon update this public one.
Regarding OpenMP: what is an ERI? I don't know this abbreviation. We already use hybrid parallelization and shared memory. Large allocations that can be shared between processes on one node use shared arrays, e.g. two-electron integrals or PCHB alias tables; we use the shared memory of MPI for this. In addition, information about the walkers differs between the processes and would be private variables in an OpenMP parallelization anyway.
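For readers unfamiliar with the approach the developers describe, MPI-3 shared-memory windows look roughly like this (a generic sketch with illustrative names, not the NECI implementation):

```fortran
program shm_demo
    use mpi_f08
    use iso_c_binding, only: c_ptr, c_f_pointer
    implicit none
    integer(MPI_ADDRESS_KIND), parameter :: n = 1000000_MPI_ADDRESS_KIND
    type(MPI_Comm) :: node_comm
    type(MPI_Win)  :: win
    type(c_ptr)    :: baseptr
    integer(MPI_ADDRESS_KIND) :: winsize
    integer :: node_rank, disp_unit
    real(8), pointer :: eri(:)

    call MPI_Init()
    ! Group the ranks that share physical memory on one node
    call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                             MPI_INFO_NULL, node_comm)
    call MPI_Comm_rank(node_comm, node_rank)

    ! Only rank 0 of each node provides the actual memory
    winsize = merge(8_MPI_ADDRESS_KIND * n, 0_MPI_ADDRESS_KIND, node_rank == 0)
    call MPI_Win_allocate_shared(winsize, 8, MPI_INFO_NULL, node_comm, &
                                 baseptr, win)
    ! All ranks map rank 0's segment onto a Fortran array
    call MPI_Win_shared_query(win, 0, winsize, disp_unit, baseptr)
    call c_f_pointer(baseptr, eri, [n])
    ! ... rank 0 fills eri once; every rank on the node reads it ...

    call MPI_Win_free(win)
    call MPI_Finalize()
end program
```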
ERI = Electron Repulsion Integral
GCC 10 was not possible for me, with issues similar to those with GCC 13.
> ERI = Electron Repulsion Integral
Ah, good to know. As I wrote above, they already reside in node-shared memory.
Regarding GCC 10, I leave the issue open, and we aim to update our compilers sooner rather than later.
I have just given the recompilation with GCC another try and figured out that the code does indeed compile with GCC 13.1 in release mode. As soon as I compile NECI in debug mode, the compilation fails with the error(s) described above.
In release mode, the test suite fails with the same unspecific error (segmentation fault without any useful backtrace). Attaching NECI to the GNU debugger reveals the offending line (/home/fstein/NECI/NECI_STABLE/build/fypp/libneci/excitation_types.F90, line 1496; I am aware that this file is produced by Fypp). I have observed the same issue in my own projects and know that this is a gfortran bug related to the assignment of a polymorphic variable. It is fixed by using sourced allocation (compare here). The same issue appears in /home/fstein/NECI/NECI_STABLE/build/fypp/libkmneci/excitation_types.F90, line 1536. Fixing both lines results in executables which pass the test suite.
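The workaround pattern looks like this (a generic sketch; shape_t and circle_t are toy types standing in for the ones in excitation_types.F90):

```fortran
program sourced_alloc_demo
    implicit none
    type :: shape_t
    end type
    type, extends(shape_t) :: circle_t
        real :: radius
    end type
    class(shape_t), allocatable :: lhs
    type(circle_t) :: rhs

    rhs = circle_t(radius=1.5)

    ! Intrinsic polymorphic assignment, miscompiled by some gfortran
    ! versions (it must implicitly (re)allocate lhs to the dynamic
    ! type of the right-hand side):
    ! lhs = rhs

    ! Workaround: spell the reallocation out with sourced allocation
    if (allocated(lhs)) deallocate(lhs)
    allocate(lhs, source=rhs)

    select type (lhs)
    type is (circle_t)
        print *, 'radius =', lhs%radius
    end select
end program
```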
OK, this is good news. If the problems in Debug mode only result from escalating warnings to errors and you still want a Debug build, then you can set
set( ${PROJECT_NAME}_Fortran_WARN_ERROR_FLAG "")
instead of
set( ${PROJECT_NAME}_Fortran_WARN_ERROR_FLAG "-Werror")
in cmake/compiler_flags/GNU_Fortran.cmake.
Oh yes, I found a similar bug in gfortran here. It is fixed in our private repo (using a subroutine as a factory function); we will soon update the public one.
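The shape of such a factory fix is roughly the following (again a sketch reusing the toy types from above; the private-repo code will differ):

```fortran
module shape_factory
    implicit none
    type :: shape_t
    end type
    type, extends(shape_t) :: circle_t
        real :: radius
    end type
contains
    ! Instead of a function with a polymorphic allocatable result
    ! (which trips the gfortran bug), use a factory subroutine with
    ! an allocatable intent(out) argument:
    subroutine make_shape(kind, s)
        character(*), intent(in) :: kind
        class(shape_t), allocatable, intent(out) :: s
        select case (kind)
        case ('circle')
            allocate(circle_t :: s)
        case default
            error stop 'make_shape: unknown kind'
        end select
    end subroutine
end module
```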
PS: If you need access because you would like to use NECI for actual production runs I can ask if we can give you access to the private repo. It is just that we don't want unpublished implementations of new algorithms out in the wild.
Dear developers,
I am currently trying to compile and run NECI on my notebook. I am able to compile the code, but starting any kind of calculation with NECI fails with an unhelpful error message.
Error message:

Backtrace for this error:
#0 0x7f07aecf151f in ???
#1 0x0 in ???
The last few lines of the output file in the directory NECI_STABLE/test_suite/neci/parallel/N_FCIMCPar:

Setting integer bit-length of determinants as bit-strings to: 64
SYMMETRY MULTIPLICATION TABLE
No Symmetry table found.
21 Symmetry PAIRS
8 DISTINCT ORBITAL PAIR PRODUCT SYMS
Symmetry and spin of orbitals correctly set up for excitation generators.
Simply transferring this into a spin orbital representation.
Not storing the H matrix.
My setup:
Ubuntu 22 hosted on the Windows Subsystem for Linux
Compilers: GCC 10 (the oldest compiler still available on the system) and GCC 13
MPI: OpenMPI 4.1.5
AMD Ryzen 5 5600H, 16 GB RAM
I tried to compile with and without HDF5 1.12.2. I tried to compile the code with the standard optimization level (-O3) and the debug one (-Og), always with the same result. I tried a serial run and a parallel run.
Do you have any advice on how to compile and run the code? What are the memory requirements of the code apart from the replicated ERIs?