pelahi / VELOCIraptor-STF

Galaxy/(sub)Halo finder for N-body simulations
MIT License
19 stars 26 forks source link

Segmentation fault in MPIGetCellListInSearchUsingMesh in EAGLE-XL DMONLY runs #66

Closed jchelly closed 4 years ago

jchelly commented 4 years ago

I'm trying to run a 300Mpc DMONLY box using Swift with on-the-fly Velociraptor. I'm using revision 5c22b31faa189beddd639d24a94998fa34e41cd0 of Velociraptor with the sample_swiftdm_3dfof_subhalo.cfg configuration from the repository. I've configured it with

cmake .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CXX_FLAGS_RELEASE="-O3 -xAVX -DNDEBUG" \
    -DCMAKE_C_FLAGS_RELEASE="-O3 -xAVX -DNDEBUG" \
    -DCMAKE_POSITION_INDEPENDENT_CODE=ON \
    -DCMAKE_C_COMPILER=icc \
    -DCMAKE_CXX_COMPILER=icpc \
    -DVR_USE_SWIFT_INTERFACE=ON \
    -DVR_USE_GAS=ON \
    -DVR_NO_MASS=ON

It crashes with a segmentation fault after this message:

Finding number of particles to export to other MPI domains...

The problem occurs in the function MPIGetCellListInSearchUsingMesh() on this line:

                if (ignorelocalcells && opt.cellnodeids[index]==ThisTask) continue;

The expression opt.cellnodeids[index] evaluates to an invalid address, which causes a segmentation fault. The variable index is zero when it crashes. This first happened on the 300Mpc box running on Irene but I can reproduce it on a 75Mpc box on Cosma.

pelahi commented 4 years ago

Hi John, how are the mpi domains being set in swift? Is it using metis? I haven't encountered the same issue but have just used swift with a max of 16 mpi domains in current tests. If there is an issue with opt.cellnodeids it suggests that there is something amiss with how they are set in swift and passed to vr.

jchelly commented 4 years ago

I'm using parmetis for these runs. I have it running in ddt now so I'll try to see where this bad value of cellnodeids is coming from.

jchelly commented 4 years ago

I think I've found the problem. When I run with VR_NO_MASS the mass element is omitted from Velociraptor's struct swift_vel_part in SwiftParticle.h. But on the Swift side there's no #ifdef and we don't know about VR_NO_MASS anyway so Swift's struct swift_vel_part in swift_velociraptor_part.h still contains a mass entry.

jchelly commented 4 years ago

Could we just always leave the mass entry in the struct whether it's used or not? Or should swift know about velociraptors configuration somehow?

MatthieuSchaller commented 4 years ago

Happy either way. We can discuss it with Pascal tomorrow morning.

jchelly commented 4 years ago

I'll just comment out the #ifdef for now and see if that gets the 300Mpc box going again.

jchelly commented 4 years ago

Looks like there are some fixes needed on the Swift side too. The siminfo struct doesn't match between swift and velociraptor in the latest master branches and I think the particle mass is not set anywhere with VR_NO_MASS=ON because the mass_uniform_box entry in siminfo doesn't exist in swift.

jchelly commented 4 years ago

It seems to get further with the following changes:

I'll make a pull request with a tidied up version of these fixes. Probably no need to modify Swift after all.

MatthieuSchaller commented 4 years ago

I have pushed the branch VR_detection in SWIFT to fix the issue. The latest VR master exposes functions in the libary depending on how it was compiled that the SWIFT configure script uses to decide whether to compile the SWIFT-VR interface in NOMASS mode or not.

jchelly commented 4 years ago

For Matthieu's fix to work we need this change too - #68.

jchelly commented 4 years ago

This seems to be fixed now.