m-a-d-n-e-s-s / madness

Multiresolution Adaptive Numerical Environment for Scientific Simulation
GNU General Public License v2.0

moldft with 2 processes aborts #524

Closed fbischoff closed 9 months ago

fbischoff commented 9 months ago

Running moldft with 2 processes fails on Mac and Unix since revision 87715d98a244bff5cbff0bd2c644a8a00d882989. I'm running: mpirun -np 2 moldft

The error messages on Mac are:

MPI_ERR_TRUNCATE: message truncated
MADNESS: fatal error: caught an MPI exception
[peconic:69437] Process received signal
[peconic:69437] Signal: Abort trap: 6 (6)
[peconic:69437] Signal code: (0)
[peconic:69437] [ 0] 0 libsystem_platform.dylib 0x000000018843da24 _sigtramp + 56
[peconic:69437] [ 1] 0 libsystem_pthread.dylib 0x000000018840dcc0 pthread_kill + 288
[peconic:69437] [ 2] 0 libsystem_c.dylib 0x0000000188319a40 abort + 180
[peconic:69437] [ 3] 0 moldft 0x0000000102e20c4c _ZN7madness5errorEPKc + 232
[peconic:69437] [ 4] 0 moldft 0x00000001023c4720 main + 4192
[peconic:69437] [ 5] 0 dyld 0x000000018808d0e0 start + 2360
[peconic:69437] End of error message

prterun noticed that process rank 0 with PID 69437 on node peconic exited on signal 6 (Abort trap: 6).

and on Unix:

Message truncated, error stack:
PMPI_Test(186)..............: MPI_Test(request=0x7ffc7ada8a20, flag=0x7ffc7ada88d0, status=0x1) failed
MPIR_Test(79)...............:
MPIDIG_handle_unexpected(50): Message from rank 1 and tag 1536 truncated; 0 bytes received but buffer size is 8
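For readers less familiar with MPI: both traces report an MPI truncation error. In MPI, a receive fails with this error class when the matched incoming message is longer than the buffer posted for it. The toy model below is plain Python with no MPI involved; `ToyMailbox`, `TruncationError`, and `demo` are hypothetical names for illustration, not MADNESS or MPI APIs, and the scenario is an assumed example rather than a diagnosis of this issue. It shows how a sender and a receiver that disagree on message order or size can surface as "message truncated".

```python
class TruncationError(Exception):
    """Stands in for MPI_ERR_TRUNCATE in this toy model."""


class ToyMailbox:
    """Hypothetical model of one rank's incoming message queue."""

    def __init__(self):
        self.pending = []  # (source, tag, payload) tuples in arrival order

    def send(self, source, tag, payload):
        self.pending.append((source, tag, bytes(payload)))

    def recv(self, source, tag, capacity):
        """Match the oldest message with this (source, tag), as MPI matching does."""
        for i, (src, t, payload) in enumerate(self.pending):
            if src == source and t == tag:
                del self.pending[i]
                if len(payload) > capacity:
                    raise TruncationError(
                        f"message from rank {src}, tag {t}: "
                        f"{len(payload)} bytes exceed the {capacity}-byte buffer"
                    )
                return payload
        raise LookupError("no matching message has arrived")


def demo():
    """Receiver expects an 8-byte message first, but a 16-byte one arrives first."""
    box = ToyMailbox()
    box.send(source=1, tag=1536, payload=b"x" * 16)
    box.send(source=1, tag=1536, payload=b"h" * 8)
    try:
        box.recv(source=1, tag=1536, capacity=8)
    except TruncationError as err:
        return str(err)
    return "ok"
```

A message shorter than the buffer is accepted in this model, as in MPI; only the oversized match raises.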

fbischoff commented 9 months ago

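I have traced this down, and this is what I found:

  • A fresh checkout on my Mac, on our local cluster, or on sn-mem shows the error; it seems to be a race condition.
  • The actual error occurs in safempi.h:416 during the initial load balancing in moldft.
  • The problem disappears when reverting the concat0 method in worldgop.h (around line 900) to revision 2628cb9: https://github.com/m-a-d-n-e-s-s/madness/commit/2628cb9124b70b5b89f450fce54eb659c9225d90
  • I would be happy to fix it, but I'm not familiar enough with MPI to avoid breaking something.

This is how I compiled it on sn-mem:

module load intel/compiler
module load intel/tbb
module load intel/mkl
module load intel/mpi
module load gmp
module load mpfr/3.1.5
module load gcc/11.2.0
module load python
module load cmake
cd /path/to/build
cmake /path/to/madness
make -j moldft
cd src/apps/moldft
mpirun -np 2 ./moldft --geometry=h2o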

robertjharrison commented 9 months ago

Thanks ... I was just about to jump in to look for this. I know that Ed was messing with the global sum buffer size. I will look at it right now.

On Wed, Feb 14, 2024 at 7:52 AM Florian Bischoff @.***> wrote:

I have traced this down, and this is what I found:

  • A fresh checkout on my Mac, on our local cluster, or on sn-mem shows the error; it seems to be a race condition.
  • The actual error occurs in safempi.h:416 during the initial load balancing in moldft.
  • The problem disappears when reverting the concat0 method in worldgop.h (around line 900) to revision 2628cb9: https://github.com/m-a-d-n-e-s-s/madness/commit/2628cb9124b70b5b89f450fce54eb659c9225d90
  • I would be happy to fix it, but I'm not familiar enough with MPI to avoid breaking something.

This is how I compiled it on sn-mem:

module load intel/compiler
module load intel/tbb
module load intel/mkl
module load intel/mpi
module load gmp
module load mpfr/3.1.5
module load gcc/11.2.0
module load python
module load cmake
cd /path/to/build
cmake /path/to/madness
make -j moldft
cd src/apps/moldft
mpirun -np 2 ./moldft --geometry=h2o


robertjharrison commented 9 months ago

Seems to be overwriting. I've printed out the send sizes and the receive buffers and they look OK, except that perhaps the loop termination logic is flawed. I have to go to class, but I can look more upon my return.
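As a rough mental model of why send sizes and receive buffers matter in a concatenate-to-rank-0 operation, the sketch below gathers per-rank data to rank 0 over a binary tree, in plain Python with no MPI. The heap-style rank numbering, the function names, and the assumption that the operation is tree-shaped are illustrative only; the actual concat0 in worldgop.h may be organized differently. What it illustrates is that a parent receives its child's whole subtree, so a receive has to be sized for that, not for a single rank's contribution.

```python
def children(rank, nproc):
    """Children of `rank` in a binary tree rooted at 0 (heap numbering, assumed)."""
    return [c for c in (2 * rank + 1, 2 * rank + 2) if c < nproc]


def concat_to_root(local_data, rank=0):
    """Return the data `rank` would hold after gathering its whole subtree."""
    gathered = list(local_data[rank])
    for child in children(rank, len(local_data)):
        # In real MPI code this step is a receive from `child`. Its size is
        # that of the child's entire subtree, which the parent must know
        # (or probe for) before posting the receive.
        gathered.extend(concat_to_root(local_data, child))
    return gathered


def message_sizes(local_data):
    """Number of items each non-root rank sends up to its parent."""
    return {
        rank: len(concat_to_root(local_data, rank))
        for rank in range(1, len(local_data))
    }
```

With four ranks holding `[["a0"], ["b0", "b1"], ["c0"], ["d0"]]`, rank 1 forwards three items (its own two plus one from rank 3), while ranks 2 and 3 forward one each, so the sizes a parent must expect change with the process count.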


evaleev commented 9 months ago

@fbischoff try #526

fbischoff commented 9 months ago

It works for 2 processes, but not for 3 or more, neither on our cluster nor on my Mac.
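Since the failure depends on the process count, it can help to sweep over several counts when testing a candidate fix. The helper below is a minimal sketch that wraps the `mpirun -np N ./moldft --geometry=h2o` invocation used earlier in this thread; the function names, the default upper bound of 4 processes, and the assumption that it runs from the directory containing `moldft` are hypothetical.

```python
import shutil
import subprocess


def repro_command(nproc, binary="./moldft", geometry="h2o"):
    """Build the mpirun command line used in this thread for `nproc` processes."""
    return ["mpirun", "-np", str(nproc), binary, f"--geometry={geometry}"]


def run_sweep(max_procs=4):
    """Run the reproducer for 2..max_procs processes and collect exit codes."""
    if shutil.which("mpirun") is None:
        raise RuntimeError("mpirun not found on PATH")
    results = {}
    for nproc in range(2, max_procs + 1):
        completed = subprocess.run(repro_command(nproc), check=False)
        results[nproc] = completed.returncode
    return results
```

Calling `run_sweep()` returns a dictionary mapping each process count to the launcher's exit code; a nonzero code usually indicates that the run aborted.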

evaleev commented 9 months ago

@fbischoff please try https://github.com/m-a-d-n-e-s-s/madness/pull/529