Closed by jchelly 4 years ago
From a bit more testing, it looks like this only happens if you have VR_NO_MASS=ON.
I haven't run into the issue, but I will try that particular run and MPI+OpenMP decomposition. I thought I had the NO_MASS option fully working, but evidently it is not yet fully stable.
I have tried with the latest master (which has a fix for https://github.com/pelahi/VELOCIraptor-STF/issues/60) and did not encounter the error, so it might have been caused by the HDF5 reading bug fixed there. I will also test an earlier version of master.
I have also tested the earlier master branch on cosma with the snapshot you suggested and have also not run into the same error. I will try reproducing the error with the exact same submission script.
Thanks for looking into this. I still get the problem with the latest master (d8ed9597423f04c4ab1fa2f02d236746ed288ff2). Here's what I'm doing in a bit more detail:
Configure velociraptor:
module purge
module load intel_comp/2018 intel_mpi/2018 fftw/3.3.7
module load parallel_hdf5/1.10.3 gsl/2.4 parmetis/4.0.3
module load gsl/2.4
module load cmake
cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CXX_FLAGS_RELEASE="-O3 -xAVX -DNDEBUG" \
-DCMAKE_C_FLAGS_RELEASE="-O3 -xAVX -DNDEBUG" \
-DCMAKE_C_COMPILER=icc \
-DCMAKE_CXX_COMPILER=icpc \
-DVR_NO_MASS=ON
make
Then I run it with
export OMP_NUM_THREADS=8
mpirun -np 2 ./VELOCIraptor-STF/build/stf \
-i /cosma5/data/Eagle/ScienceRuns/Planck1/L0025N0376/PE/DMONLY/data/snapshot_028_z000p000/snap_028_z000p000 \
-I 2 -s 16 \
-C ./tmp/login7a.pri.cosma7.alces.network.327086.cfg \
-o ./output/snapshot_028/snapshot_028 -Z 2
where the temporary config file is just sample_swiftdm_3dfof_subhalo.cfg with the hdf5 name convention set to eagle format, the Snapshot_value parameter substituted in, and the number of MPI ranks writing collectively set to a large value. I get this error message:
...
0: finished calculation in 18.28822708
TIME::0 took 161.1767371 to search 53157376 with 16
Searching subset
0 Beginning substructure search
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 4353 RUNNING AT m5134
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
If you have access to cosma you can find the exact scripts in /cosma5/data/jch/TreeTest/L0025N0376-VR/. I submit the job with "sbatch --array=28 ./slurm_batch.sh"
I think the difference between your tests and mine is that I have substructure remerging enabled (as it is in the dmonly examples in the velociraptor repository). I get crashes if I use VR_NO_MASS=ON and have these parameters in the .cfg file:
Apply_phase_merge_to_host=1
Structure_phase_merge_dist=0.25
Setting these parameters to zero makes the crash go away.
I have fixed the bug that occurred with VR_NO_MASS=ON when substructure merging is enabled. The fix has been merged into master.
I'm running Velociraptor on an Eagle 25Mpc dmonly box in Gadget HDF5 format. I'm using the sample_swiftdm_3dfof_subhalo.cfg parameter file from the repository (with the HDF5 name convention changed) and running with 2 MPI ranks and 8 threads per rank using the latest master (commit ff3ed785d3fe0d6ae8f75ac3e04d56c12a8d3b36).
Velociraptor aborts at line 40 of buildandsortarrays.cxx. I think this happens because the index pid in the expression pglist[pid] is out of range. pglist only has numgroups+1 elements but we're trying to access element pid and pid > numgroups here (numgroups=48 and pid=132 in the case I'm looking at).
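To illustrate the failure mode described above, here is a minimal C++ sketch of a bounds-checked lookup. It is not taken from buildandsortarrays.cxx: the helper name `checked_entry` and the `report_case_is_out_of_range` function are hypothetical, and the array is simply assumed to hold numgroups+1 entries as stated in the report. With numgroups=48 and an index of 132, the check fails instead of reading past the end of the array.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of VELOCIraptor): returns a reference to
// entries[index], or throws std::out_of_range if index is not a valid
// position in the vector.
template <typename T>
T &checked_entry(std::vector<T> &entries, long long index)
{
    if (index < 0 || static_cast<std::size_t>(index) >= entries.size()) {
        throw std::out_of_range(
            "index " + std::to_string(index) + " outside [0, " +
            std::to_string(entries.size()) + ")");
    }
    return entries[static_cast<std::size_t>(index)];
}

// Mirrors the numbers from the report: an array sized numgroups + 1 with
// numgroups = 48, accessed at index 132. Returns true if the access is
// rejected as out of range.
inline bool report_case_is_out_of_range()
{
    const long long numgroups = 48;
    std::vector<int> per_group(static_cast<std::size_t>(numgroups + 1), 0);
    try {
        checked_entry(per_group, 132);
    } catch (const std::out_of_range &) {
        return true;
    }
    return false;
}
```

A check of this kind (or `std::vector::at`, which also throws on an invalid index) turns the silent out-of-range access into an error that names the offending index, which can make it easier to find where the bad value comes from.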