marekandreas / elpa

A scalable eigensolver for dense, symmetric (hermitian) matrices (fork of https://gitlab.mpcdf.mpg.de/elpa/elpa.git)

deadlock when using H2O-RPA-32 #14

Closed gunter-roeth closed 1 year ago

gunter-roeth commented 2 years ago

Hi, I am getting a deadlock when running the H2O-32-RI-dRPA-TZ.inp case on 16 MPI processes.

I start it like this:

```
MPI_PER_GPU=2 mpirun --bind-to none -n 16 binder.sh ../../../exe/local_cuda/cp2k.psmp -i H2O-32-RI-dRPA-TZ.inp
```

Very quickly the program hangs after:

```
p coordinates    3  0.000  0.000  0.6
p buffer         3  0.000  0.000  0.6
p layout         3  0.000  0.000  0.2
p allocation     2  0.000  0.000  0.0
p init           2  0.000  0.000  0.1
```

From the gdb stacks you can see that all 16 processes are calling:

```
gdb_1431574.out:#12 0x00007f0d96a1a74f in __elpa2_impl_MOD_elpa_solve_evp_real_2stage_double_impl () from /opt/elpa/lib/libelpa_openmp.so.15
```

You can see that 4 of the stacks go into `elpa2_compute_MOD_bandred_real_double` and from there into an MPI reduction:

```
$ grep elpa2_compute_MOD_bandred_real_double *.out
gdb_1431574.out:#11 0x00007f0d969f8e7b in elpa2_compute_MOD_bandred_real_double ()
gdb_1431575.out:#14 0x00007fe14a904325 in elpa2_compute_MOD_bandred_real_double ()
gdb_1431577.out:#10 0x00007f9bbc664325 in elpa2_compute_MOD_bandred_real_double ()
gdb_1431581.out:#14 0x00007f2884ddd325 in
```

The remaining processes call `__mod_check_for_gpu_MOD_check_for_gpu` directly and end up here:

```
#10 0x00007f40d09dcbbd in ompi_allreduce_f (sendbuf=0x7ffe6a2269d8 "\001",
    recvbuf=0x7ffe6a2265ec "\001", count=0x7f40fb7a9d00,
    datatype=<optimized out>, op=0x7f40fb7a9d00, comm=<optimized out>,
    ierr=0x7ffe6a2265e8) at pallreduce_f.c:87
#11 0x00007f40fb724503 in __mod_check_for_gpu_MOD_check_for_gpu ()
    from /opt/elpa/lib/libelpa_openmp.so.15
#12 0x00007f40fb7419f7 in __elpa2_impl_MOD_elpa_solve_evp_real_2stage_double_impl ()
    from /opt/elpa/lib/libelpa_openmp.so.15
#13 0x00007f40fb6a06f7 in __elpa_impl_MOD_elpa_eigenvectors_d ()

To summarize: 12 processes are already in `PMPI_Allreduce` while the other 4 are still doing something else.
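For readers less familiar with MPI, this is the classic collective-mismatch failure mode: an `MPI_Allreduce` only completes once every rank of the communicator has entered it, so if some ranks take a code path that skips (or delays) the collective, the ranks that did enter it block forever. Below is a minimal, hedged sketch of that pattern — it is not ELPA code; it simulates "ranks" with Python threads and stands in for the collective with a `threading.Barrier`, using a timeout to make the hang observable:

```python
import threading

NUM_RANKS = 4  # stand-in for the MPI communicator size

def run(ranks_entering_collective):
    """Simulate ranks; only those in the given set enter the 'collective'.

    The Barrier plays the role of MPI_Allreduce: it completes only when
    all NUM_RANKS participants arrive. A timeout substitutes for the
    real-world symptom (processes stuck forever in PMPI_Allreduce).
    """
    barrier = threading.Barrier(NUM_RANKS)
    results = {}

    def rank(i):
        if i in ranks_entering_collective:
            try:
                barrier.wait(timeout=0.5)  # enter the collective
                results[i] = "done"
            except threading.BrokenBarrierError:
                results[i] = "deadlocked"  # never released: a peer skipped the call
        else:
            results[i] = "skipped"  # rank took a different code path

    threads = [threading.Thread(target=rank, args=(i,)) for i in range(NUM_RANKS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# All ranks enter the collective: everyone completes.
print(run({0, 1, 2, 3}))
# Rank 3 diverges (e.g. a different branch in the GPU check):
# the other three block until the timeout breaks the barrier.
print(run({0, 1, 2}))
```

In the real run there is no timeout, which is why the job simply hangs: 12 ranks sit in `PMPI_Allreduce` waiting for the 4 that went down a different branch.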

I hope this gives you some guidance for solving this bug. Please do not hesitate to contact me directly (Gunter Roth, gunterr@nvidia.com); it would be a complete pleasure to provide any missing information. Thanks again for all your ELPA efforts. Gunter

I am also attaching my summary file:

debug_H2O-32-RI-dRPA-TZ.txt

marekandreas commented 2 years ago

Hello, not being from the CP2K community I do not have any insight into what 'H2O-RPA-32' means with respect to an eigenvalue problem. Can you tell me whether this is a real or complex eigenvalue problem, and how large the matrix passed to ELPA is in this setup? Secondly, which version of ELPA are you using? And as a last question, does this happen immediately on the first call to ELPA, or only after several (SCF?) iterations? If it happens after several iterations, does the matrix size perhaps change?

LStuber commented 2 years ago

This is a consequence of https://github.com/marekandreas/elpa/issues/17. H2O-RPA-32 is a CP2K benchmark which happens to trigger the deadlock.

marekandreas commented 1 year ago

This bug has been fixed in the master_pre_stage branch and will be available soon in rc2 of ELPA 2021.11.001.