times-software / OCEAN

BSE code for core spectroscopy

segfault in diamond example #290

Closed max-radin closed 1 year ago

max-radin commented 2 years ago

Hi, I'm getting a segfault on the CNBSE step of the diamond example. I'm running OCEAN 3.0.0 in an Ubuntu container with 16 GB of memory. I tried decreasing ecut and other parameters but got the same result. Any advice?

Here is the last part of CNBSE/ocean.log:

 Initialization complete           0
   mult =  0.44089472E-02
   1.0000000000000000              100

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0xffffbc4ad08b in ???
#1  0xffffbc4ac047 in ???
#2  0xffffbcd3c7bf in ???
#3  0xffffbc2a7c40 in ???
#4  0xffffba1bfd17 in ???
#5  0xffffba1d251f in ???
#6  0xffffba1d264f in ???
#7  0xffffba148193 in ???
#8  0xffffba19f307 in ???
#9  0xffffba19f55b in ???
#10  0xffffba1ad3f3 in ???
#11  0xffffba1adb8b in ???
#12  0xffffba1aec43 in ???
#13  0xffffba0467fb in MPI_Reduce
#14  0xffffbc66fc97 in mpi_reduce__
#15  0xaaaad31a737b in __ocean_psi_MOD_ocean_psi_send_buffer
#16  0xaaaad31c02c7 in __ocean_action_MOD_ocean_xact.constprop.0
#17  0xaaaad31aac7b in MAIN__
#18  0xaaaad31a38e3 in main

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 7124 RUNNING AT 743548cd03b8
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

jtv3 commented 2 years ago

  1. Is this unique to the diamond example?
  2. What compilers are you using (including version), and what MPI library?
  3. Please recompile with some debugging flags (on gfortran this would be "-g -Og -fbacktrace").

max-radin commented 2 years ago

The backtrace with debugging flags is below. I'm using gfortran 11.2.0 and MPICH.

Backtrace for this error:
#0  0xffff8572d08b in ???
#1  0xffff8572c047 in ???
#2  0xffff85fbe7bf in ???
#3  0xffff85527c40 in ???
#4  0xffff8343fd17 in ???
#5  0xffff8345251f in ???
#6  0xffff8345264f in ???
#7  0xffff833c8193 in ???
#8  0xffff8341f307 in ???
#9  0xffff8341f55b in ???
#10  0xffff8342d3f3 in ???
#11  0xffff8342db8b in ???
#12  0xffff8342ec43 in ???
#13  0xffff832c67fb in ???
#14  0xffff858efc97 in ???
#15  0xaaaacef574b3 in core_reduce_send
    at /app/OCEAN/OCEAN2/OCEAN_psi.f90:825
#16  0xaaaacef574b3 in __ocean_psi_MOD_ocean_psi_send_buffer
    at /app/OCEAN/OCEAN2/OCEAN_psi.f90:465
#17  0xaaaacef574b3 in __ocean_psi_MOD_ocean_psi_send_buffer
    at /app/OCEAN/OCEAN2/OCEAN_psi.f90:434
#18  0xaaaacef71307 in __ocean_action_MOD_ocean_xact.constprop.0
    at /app/OCEAN/OCEAN2/OCEAN_action.f90:514
#19  0xaaaacef5acbb in ocean_haydock_herm_do
    at /app/OCEAN/OCEAN2/OCEAN_haydock.f90:258
#20  0xaaaacef5acbb in __ocean_haydock_MOD_ocean_haydock_do
    at /app/OCEAN/OCEAN2/OCEAN_haydock.f90:186
#21  0xaaaacef5acbb in __ocean_driver_MOD_ocean_driver_run
    at /app/OCEAN/OCEAN2/OCEAN_driver.f90:40
#22  0xaaaacef5acbb in ocean
    at /app/OCEAN/OCEAN2/OCEAN.f90:65
#23  0xaaaacef53923 in main
    at /app/OCEAN/OCEAN2/OCEAN.f90:9

I've found that the diamond example does work in my environment when the number of MPI processes is reduced to 1. So for the time being I plan to continue my project using just a single process.

I haven't gotten any other example to work yet, although the errors seem unrelated. For example, I get the following error from the STO example:

mpirun -n 4 /app/bin/pw.x  -npool 4 -inp nscf.in > nscf.out 2>&1
QE62!!
5.55651  8.30299  6.92975
0.204198  0.305129  0.254664
Fractional nuber of electrons, but fixed occupations
DFT Stage Failed

max-radin commented 2 years ago

OK, it looks like the failure in the STO example happens because the QE parser does not expect a + in the exponent of the nelec tag. If no one is already working on this, I can make a PR to handle this case.
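To illustrate the kind of parsing problem described above, here is a minimal Python sketch. It is not OCEAN's actual parser: the function name `parse_nelec`, the regular expression, and the sample `<nelec>` strings are all assumptions made for illustration. It only shows how a pattern that allows an optional sign in the exponent accepts values such as `2.4e+01`.

```python
import re

# Hypothetical pattern (not taken from OCEAN): a decimal number whose
# exponent may carry an explicit '+' or '-' sign, e.g. 2.4e+01 or 1.2E-1.
FLOAT_PATTERN = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eEdD][-+]?\d+)?"

# Assumed tag layout: <nelec>VALUE</nelec>, with optional surrounding whitespace.
NELEC_RE = re.compile(r"<nelec>\s*(" + FLOAT_PATTERN + r")\s*</nelec>")


def parse_nelec(text):
    """Return the number inside a <nelec> tag as a float, or None if not found."""
    match = NELEC_RE.search(text)
    if match is None:
        return None
    # Fortran-style 'D' exponents are normalised to 'E' so float() accepts them.
    value = match.group(1).replace("D", "E").replace("d", "e")
    return float(value)


if __name__ == "__main__":
    # Illustrative inputs only; the real file contents may differ.
    print(parse_nelec("<nelec>2.400000000000000e+01</nelec>"))
    print(parse_nelec("<nelec>8.000000000000000e0</nelec>"))
```

A pattern that omits the `[-+]?` after the exponent letter would stop matching at the `e` in `2.4e+01`, which is one way a parser could end up with a wrong or fractional electron count.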

jtv3 commented 2 years ago

go ahead and open a PR against the develop branch