pelahi / VELOCIraptor-STF

Galaxy/(sub)Halo finder for N-body simulations
MIT License

Segmentation fault when running on more than 1000 MPI tasks #120

Open cullanhowlett opened 4 months ago

cullanhowlett commented 4 months ago

Hi Pascal,

I'm trying to run a large Velociraptor job using 1280 MPI tasks, but I get the following error when it tries to set up the initial domain decomposition:

[dave106:577331:0:577331] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 577331) ====
 0 0x0000000000054db0 __GI___sigaction()  :0
 1 0x00000000004f93d8 MPIInitialDomainDecompositionWithMesh()  ???:0
 2 0x00000000005a3581 MPINumInDomainGadget()  ???:0
 3 0x00000000004e6f85 MPINumInDomain()  ???:0
 4 0x0000000000432cc0 main()  ???:0
 5 0x000000000003feb0 __libc_start_call_main()  ???:0
 6 0x000000000003ff60 __libc_start_main_alias_2()  :0
 7 0x0000000000435195 _start()  ???:0
=================================
[dave106:577331] *** Process received signal ***
[dave106:577331] Signal: Segmentation fault (11)
[dave106:577331] Signal code:  (-6)
[dave106:577331] Failing at address: 0x29430008cf33
[dave106:577331] [ 0] /lib64/libc.so.6(+0x54db0)[0x1497439f2db0]
[dave106:577331] [ 1] /home/chowlett/VELOCIraptor-STF/build/stf(_Z37MPIInitialDomainDecompositionWithMeshR7Options+0x378)[0x4f93d8]
[dave106:577331] [ 2] /home/chowlett/VELOCIraptor-STF/build/stf(_Z20MPINumInDomainGadgetR7Options+0x161)[0x5a3581]
[dave106:577331] [ 3] /home/chowlett/VELOCIraptor-STF/build/stf(_Z14MPINumInDomainR7Options+0x295)[0x4e6f85]
[dave106:577331] [ 4] /home/chowlett/VELOCIraptor-STF/build/stf(main+0x1a90)[0x432cc0]
[dave106:577331] [ 5] /lib64/libc.so.6(+0x3feb0)[0x1497439ddeb0]
[dave106:577331] [ 6] /lib64/libc.so.6(__libc_start_main+0x80)[0x1497439ddf60]
[dave106:577331] [ 7] /home/chowlett/VELOCIraptor-STF/build/stf(_start+0x25)[0x435195]
[dave106:577331] *** End of error message ***
slurmstepd: error:  mpi/pmix_v4: _errhandler: dave106 [0]: pmixp_client_v2.c:211: Error handler invoked: status = -61, source = [slurm.pmix.54696313.1:0]
srun: error: dave106: task 0: Segmentation fault (core dumped)

I've tried some workarounds, for instance using 640 MPI tasks with 2 CPUs per task, but this runs into out-of-memory errors on my machine (NT @ Swinburne). A 1/8 scaled-down version of the job works just fine on the equivalent number of tasks.

Any thoughts?

pelahi commented 4 months ago

Hi @cullanhowlett, can you try the development branch? It has a lot of fixes, and I need to make development the default branch instead of main. Also, I might be a bit slow to respond over the next two days as I have to finish marking assignments for an HPC course.

pelahi commented 4 months ago

Hi @cullanhowlett, can you try setting MPI_zcurve_mesh_decomposition_num_cells_per_dim=256 in the configuration file? That sets the mesh resolution. I'll likely need to add something to scale the mesh resolution a little better and also cap it at 1024 cells per dimension.
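
As a rough illustration of the scaling-plus-cap idea mentioned above, here is a minimal sketch. It is not VELOCIraptor code: the helper name ChooseCellsPerDim, the min_cells_per_task parameter, and the power-of-two rounding are all assumptions made for the example. Only the cap of 1024 cells per dimension comes from the comment above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of VELOCIraptor): pick a mesh resolution,
// in cells per dimension, that grows with the number of MPI tasks.
// Assumptions: each task should get at least min_cells_per_task cells,
// and the resolution is rounded up to a power of two.
inline std::uint32_t ChooseCellsPerDim(std::uint64_t num_tasks,
                                       std::uint64_t min_cells_per_task = 8,
                                       std::uint32_t max_cells_per_dim = 1024)
{
    assert(num_tasks > 0);
    const std::uint64_t target_cells = num_tasks * min_cells_per_task;
    std::uint32_t n = 1;
    // Double n until the n^3 mesh covers the target or the cap is reached.
    while (n < max_cells_per_dim &&
           static_cast<std::uint64_t>(n) * n * n < target_cells) {
        n *= 2;
    }
    // Clamp in case the cap itself is not a power of two.
    return n < max_cells_per_dim ? n : max_cells_per_dim;
}
```

In a scheme like this, an explicit MPI_zcurve_mesh_decomposition_num_cells_per_dim value in the configuration file would presumably still take precedence, with the helper only supplying a default.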

pelahi commented 3 months ago

Hi @cullanhowlett, I haven't heard back, but did the development branch help?