Segmentation fault in BuildPGList() if VR_NO_MASS=ON

jchelly commented 4 years ago

I'm running Velociraptor on an Eagle 25Mpc dmonly box in Gadget HDF5 format. I'm using the sample_swiftdm_3dfof_subhalo.cfg parameter file from the repository (with the HDF5 name convention changed) and running with 2 MPI ranks and 8 threads per rank using the latest master (commit ff3ed785d3fe0d6ae8f75ac3e04d56c12a8d3b36).

Velociraptor aborts at line 40 of buildandsortarrays.cxx. I think this happens because the index pid in the expression pglist[pid] is out of range. pglist only has numgroups+1 elements but we're trying to access element pid and pid > numgroups here (numgroups=48 and pid=132 in the case I'm looking at).
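
To illustrate the failure mode described above, here is a minimal sketch of grouping particles into per-group lists with an explicit range check. It is not the actual BuildPGList() code: the function name build_group_lists, the GroupList type, and the group_id array are hypothetical stand-ins. The only detail taken from the report is that the list array has numgroups+1 elements, so any index above numgroups (such as 132 when numgroups=48) is out of range.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-group particle lists (not VELOCIraptor's real types).
using GroupList = std::vector<std::vector<long>>;

// Collect particle indices by group ID. group_id[i] is the group of particle i,
// with 0 assumed to mean "not in any group". Throws instead of indexing out of range.
GroupList build_group_lists(long numgroups, const std::vector<long>& group_id) {
    assert(numgroups >= 0);
    // numgroups + 1 elements, so the valid indices are 0..numgroups inclusive.
    GroupList pglist(static_cast<std::size_t>(numgroups) + 1);
    for (std::size_t i = 0; i < group_id.size(); ++i) {
        const long gid = group_id[i];
        if (gid < 0 || gid > numgroups) {
            // Without this check, pglist[gid] would be undefined behaviour,
            // which can show up as a segmentation fault.
            throw std::out_of_range("group id " + std::to_string(gid) +
                                    " is outside 0.." + std::to_string(numgroups));
        }
        pglist[static_cast<std::size_t>(gid)].push_back(static_cast<long>(i));
    }
    return pglist;
}
```

Calling build_group_lists(48, {132}) throws std::out_of_range rather than crashing, which makes a stale or mismatched group ID easier to trace back to the step that produced it.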

jchelly commented 4 years ago

From a bit more testing, it looks like this only happens if you have VR_NO_MASS=ON.

pelahi commented 4 years ago

I haven't run into this issue, but I will try this particular run with the same MPI+OpenMP decomposition. I thought I had NO_MASS fully working, but it is evidently not yet stable.

pelahi commented 4 years ago

I have tried the latest master, which includes a fix for https://github.com/pelahi/VELOCIraptor-STF/issues/60, and have not encountered the error. It might have been caused by the HDF5 reading bug fixed there. I will also test an earlier master branch.

pelahi commented 4 years ago

I have also tested the earlier master branch on cosma with the snapshot you suggested and have also not run into the same error. I will try reproducing the error with the exact same submission script.

jchelly commented 4 years ago

Thanks for looking into this. I still get the problem with the latest master (d8ed9597423f04c4ab1fa2f02d236746ed288ff2). Here's what I'm doing in a bit more detail:

Configure velociraptor:

module purge
module load intel_comp/2018 intel_mpi/2018 fftw/3.3.7
module load parallel_hdf5/1.10.3 gsl/2.4 parmetis/4.0.3
module load gsl/2.4
module load cmake
cmake .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CXX_FLAGS_RELEASE="-O3 -xAVX -DNDEBUG" \
    -DCMAKE_C_FLAGS_RELEASE="-O3 -xAVX -DNDEBUG" \
    -DCMAKE_C_COMPILER=icc \
    -DCMAKE_CXX_COMPILER=icpc \
    -DVR_NO_MASS=ON
make

Then I run it with

export OMP_NUM_THREADS=8
mpirun -np 2 ./VELOCIraptor-STF/build/stf \
  -i /cosma5/data/Eagle/ScienceRuns/Planck1/L0025N0376/PE/DMONLY/data/snapshot_028_z000p000/snap_028_z000p000 \
  -I 2 -s 16 \
  -C ./tmp/login7a.pri.cosma7.alces.network.327086.cfg \
  -o ./output/snapshot_028/snapshot_028 -Z 2

where the temporary config file is just sample_swiftdm_3dfof_subhalo.cfg with the hdf5 name convention set to eagle format, the Snapshot_value parameter substituted in, and the number of MPI ranks writing collectively set to a large value. I get this error message:

...
0: finished calculation in 18.28822708
TIME::0 took 161.1767371 to search 53157376 with 16
Searching subset
0 Beginning substructure search 

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 4353 RUNNING AT m5134
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

If you have access to cosma you can find the exact scripts in /cosma5/data/jch/TreeTest/L0025N0376-VR/. I submit the job with "sbatch --array=28 ./slurm_batch.sh"

jchelly commented 4 years ago

I think the difference between your tests and mine is that I have substructure remerging enabled (as it is in the dmonly examples in the velociraptor repository). I get crashes if I use VR_NO_MASS=ON and have these parameters in the .cfg file:

Apply_phase_merge_to_host=1
Structure_phase_merge_dist=0.25

Setting these parameters to zero makes the crash go away.

pelahi commented 4 years ago

I have fixed the bug that occurred with VR_NO_MASS=ON when substructure merging was enabled. The fix has been merged into master.

pelahi / VELOCIraptor-STF

Segmentation fault in BuildPGList() if VR_NO_MASS=ON #59