pfhopkins / gizmo-public

Public version of the GIZMO multi-physics massively-parallel code
http://www.tapir.caltech.edu/~phopkins/Site/GIZMO.html

Error when running on SDSC Expanse: "[exp-8-32:1263294:0:1263294] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1567b35556c0)" #4

Closed goddarjk closed 2 months ago

goddarjk commented 4 months ago

I have been successfully running simulations with GIZMO on my home university's compute cluster for about two years. However, when I attempt to run on the Expanse CPU partition at SDSC, I repeatedly hit the same crash and error message (given in the title), which I have not seen before. I have tried various combinations of the compilers available on Expanse, but I run into the same error no matter which I use. The crash consistently happens during the gravity_tree() routine, but not always at the same point in the simulation. I have been in contact with Expanse support, but we are stumped. Below I include the module sets I have loaded and how I run the code. What can I try to resolve this problem? Many thanks, Julianne

modules:

module load slurm cpu/0.15.4 intel/19.1.1.217 intel-mpi/2019.8.254 fftw/3.3.8 gsl/2.5 hdf5/1.10.6

module load slurm cpu/0.15.4 gcc/10.2.0 openmpi/4.0.4 fftw/3.3.8 gsl/2.5 hdf5/1.10.6

module load slurm cpu/0.17.3b gcc/10.2.0/npcyll4 gsl/2.7/wtlsmyy openmpi/4.1.3/oq3qvsv hdf5/1.10.7/5o4oibc

module load slurm cpu/0.15.4 gcc/10.2.0 openmpi/4.0.4-openib fftw/3.3.8 gsl/2.5 hdf5/1.10.6

to run:

module purge
source modules
ulimit -s unlimited
mpirun -v -x LD_LIBRARY_PATH ./GIZMO params.param
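(For context, the commands above would normally sit inside a Slurm batch script on Expanse. The sketch below is a hypothetical illustration, not taken from the original post: the account name, partition, and resource counts are placeholders that must be adapted to the user's allocation.)

```shell
#!/bin/bash
#SBATCH --job-name=gizmo-run
#SBATCH --partition=compute        # placeholder: Expanse CPU partition name
#SBATCH --account=abc123           # placeholder: allocation account
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128
#SBATCH --time=24:00:00

module purge
source modules                     # file containing one of the "module load" lines above
ulimit -s unlimited                # avoid stack-size segfaults in deep tree walks
mpirun -v -x LD_LIBRARY_PATH ./GIZMO params.param
```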

mikegrudic commented 4 months ago

Hi @goddarjk, could you share your Config.sh too?

goddarjk commented 4 months ago

I have attached my latest Config.sh. I have experimented with turning USE_MPI_IN_PLACE and/or NO_ISEND_IRECV_IN_DOMAIN on and off. When I comment out NO_ISEND_IRECV_IN_DOMAIN, the code no longer crashes but instead gets 'stuck': there are no more outputs or progress, yet the job continues without any error messages until the walltime set in the job script expires. I also adjust the FFTW flags as needed depending on which version/modules I load. I have also tried turning off all of the additional output and debugging flags, but this has not made a difference in the result. Config.txt
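(For readers unfamiliar with GIZMO's build configuration: the flags discussed in this thread are lines in Config.sh that are enabled or disabled by commenting. A minimal sketch of the relevant portion, with the toggles as described above, might look like the following; the surrounding physics flags are omitted.)

```shell
# Config.sh (excerpt) -- enable a flag by uncommenting its line
USE_MPI_IN_PLACE            # use MPI_IN_PLACE in collective operations
#NO_ISEND_IRECV_IN_DOMAIN   # replace non-blocking sends/receives in domain exchange
USE_FFTW3                   # link against FFTW version 3 instead of 2.x
#NOTYPEPREFIX_FFTW          # use FFTW libraries built without type prefixes
```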

mikegrudic commented 4 months ago

OK, I noticed that in the Makefile case for Expanse, the comments recommend using the flag NOTYPEPREFIX_FFTW, but with no explanation as to why. This flag affects the behavior of the particle-mesh gravity solver, so I was wondering whether you have tried that one?

goddarjk commented 4 months ago

Thank you. Yes, I do turn on NOTYPEPREFIX_FFTW when I use the fftw module available on Expanse. I only have it commented out here because I installed my own build of FFTW 2.1.5 in my home directory, with both double- and single-precision libraries, to see if that would help after I ran into trouble with the preinstalled FFTW libraries. However, I still get the same error whether I use my own FFTW with this flag off or their libraries with this flag on.

calebchoban commented 4 months ago

Hi @goddarjk, I believe I was the one who made the Makefile example for Expanse at SDSC. I no longer use Expanse, but from what I remember you must use FFTW version 3; there is some issue with version 2 that SDSC IT could not figure out. Try compiling and running against FFTW 3 with USE_FFTW3 set in the Config file and see if that fixes your issue. If that doesn't work, I have colleagues who run GIZMO on Expanse that I can reach out to.

goddarjk commented 4 months ago

Hi @calebchoban, thank you so much for your reply. I have tried using FFTW 3 with USE_FFTW3 set and unfortunately get the same result. I would very much appreciate it if you could put me in touch with someone who is successfully running GIZMO on Expanse!

goddarjk commented 2 months ago

I think I have been able to resolve this by restructuring my jobs to make more memory available per task; I have now been running for a couple of weeks without error. Thanks to everyone for your advice!
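(The fix described above, more memory per MPI task, is typically achieved in Slurm by requesting fewer tasks per node so each task gets a larger share of the node's memory. The sketch below is a hypothetical illustration of that idea, not the poster's actual script; the partition, account, and core counts are placeholders.)

```shell
#SBATCH --partition=compute        # placeholder
#SBATCH --account=abc123           # placeholder
#SBATCH --nodes=4                  # more nodes...
#SBATCH --ntasks-per-node=64       # ...but half the tasks per node,
                                   # so each task gets roughly twice the memory
#SBATCH --mem=0                    # request all memory on each allocated node
```

The total MPI task count stays the same (here 4 x 64 = 256 versus 2 x 128 = 256), but the per-task memory footprint roughly doubles, which can cure segfaults caused by memory exhaustion inside routines such as gravity_tree().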