joezuntz / cosmosis

Segmentation Fault when running polychord sampler with MPI #97

Closed by MinhMPA 1 year ago

MinhMPA commented 1 year ago

I keep getting a segmentation fault when I run the DES Y3 likelihood with the polychord sampler under MPI. CosmoSIS runs without any issue if I either (a) use another sampler or (b) run without MPI, so the problem seems specific to the combination of polychord and MPI, but I cannot pinpoint exactly where it arises.

The output from a run with the --segfaults flag is attached below. The error message suggests that the MPI check in the initialise_mpi subroutine of PolyChord fails.

Thank you for your help!

Setting up module 2pt_like
---------------------------
Doing point-mass marginalization: True
Using sigma_crit_inv factors in pm-marg: True
Doing small-scale marginalization: False
Using a single grade of parameter speeds in polychord.
Polychord num_repeats = 60  (from parameter file)
PolyChord: MPI is already initilised, not initialising, and will not finalize
##################################################################################

Your program crashed with an error signal: 11

This the trace of C functions being called
(the first one or two may be part of the error handling):
##################################################################################

/nfs/turbo/lsa-nguyenmn/cosmosis/cosmosis/datablock/cosmosis_py/../libcosmosis.so(cosmosis_segfault_handler+0x1f)[0x151f686491df]
/lib64/libc.so.6(+0x4eb20)[0x151f74da2b20]
/home/nguyenmn/.conda/envs/cosmosis-gnu/lib/python3.9/site-packages/mpi4py/../../../libmpi.so.40(PMPI_Comm_rank+0x37)[0x151f6767ead7]
/sw/pkgs/arc/intel/2022.1.2/mpi/2021.5.1/lib/libmpifort.so.12(mpi_comm_rank_+0xb)[0x151f431d1d4b]
/nfs/turbo/lsa-nguyenmn/cosmosis/cosmosis/samplers/polychord/polychord_src/libchord_mpi.so(__random_module_MOD_initialise_random+0x3e)[0x151f43ee356e]
/nfs/turbo/lsa-nguyenmn/cosmosis/cosmosis/samplers/polychord/polychord_src/libchord_mpi.so(__interfaces_module_MOD_run_polychord_full+0x5c7)[0x151f43f18067]
/nfs/turbo/lsa-nguyenmn/cosmosis/cosmosis/samplers/polychord/polychord_src/libchord_mpi.so(polychord_c_interface+0x83f)[0x151f43f188af]
/home/nguyenmn/.conda/envs/cosmosis-gnu/lib/python3.9/lib-dynload/../../libffi.so.7(+0x69ed)[0x151f6f52e9ed]
/home/nguyenmn/.conda/envs/cosmosis-gnu/lib/python3.9/lib-dynload/../../libffi.so.7(+0x6077)[0x151f6f52e077]
/home/nguyenmn/.conda/envs/cosmosis-gnu/lib/python3.9/lib-dynload/_ctypes.cpython-39-x86_64-linux-gnu.so(+0x13df7)[0x151f6f547df7]
/home/nguyenmn/.conda/envs/cosmosis-gnu/lib/python3.9/lib-dynload/_ctypes.cpython-39-x86_64-linux-gnu.so(+0x1437c)[0x151f6f54837c]
/home/nguyenmn/.conda/envs/cosmosis-gnu/bin/python3.9(_PyObject_MakeTpCall+0x316)[0x55690b8a6ba6]
/home/nguyenmn/.conda/envs/cosmosis-gnu/bin/python3.9(_PyEval_EvalFrameDefault+0x535b)[0x55690b944dbb]
/home/nguyenmn/.conda/envs/cosmosis-gnu/bin/python3.9(_PyFunction_Vectorcall+0x19a)[0x55690b90088a]
/home/nguyenmn/.conda/envs/cosmosis-gnu/bin/python3.9(_PyEval_EvalFrameDefault+0x609)[0x55690b940069]
/home/nguyenmn/.conda/envs/cosmosis-gnu/bin/python3.9(_PyFunction_Vectorcall+0x19a)[0x55690b90088a]
/home/nguyenmn/.conda/envs/cosmosis-gnu/bin/python3.9(_PyEval_EvalFrameDefault+0x609)[0x55690b940069]
/home/nguyenmn/.conda/envs/cosmosis-gnu/bin/python3.9(_PyFunction_Vectorcall+0x19a)[0x55690b90088a]
/home/nguyenmn/.conda/envs/cosmosis-gnu/bin/python3.9(_PyEval_EvalFrameDefault+0x3bc)[0x55690b93fe1c]
/home/nguyenmn/.conda/envs/cosmosis-gnu/bin/python3.9(+0x138550)[0x55690b899550]
/home/nguyenmn/.conda/envs/cosmosis-gnu/bin/python3.9(_PyFunction_Vectorcall+0x336)[0x55690b900a26]
/home/nguyenmn/.conda/envs/cosmosis-gnu/bin/python3.9(_PyEval_EvalFrameDefault+0x11e7)[0x55690b940c47]
/home/nguyenmn/.conda/envs/cosmosis-gnu/bin/python3.9(_PyFunction_Vectorcall+0x19a)[0x55690b90088a]
/home/nguyenmn/.conda/envs/cosmosis-gnu/bin/python3.9(_PyEval_EvalFrameDefault+0x4c84)[0x55690b9446e4]
/home/nguyenmn/.conda/envs/cosmosis-gnu/bin/python3.9(+0x138550)[0x55690b899550]
/home/nguyenmn/.conda/envs/cosmosis-gnu/bin/python3.9(_PyEval_EvalCodeWithName+0x47)[0x55690b980047]
/home/nguyenmn/.conda/envs/cosmosis-gnu/bin/python3.9(PyEval_EvalCodeEx+0x39)[0x55690b980089]
/home/nguyenmn/.conda/envs/cosmosis-gnu/bin/python3.9(PyEval_EvalCode+0x1b)[0x55690b9800ab]
/home/nguyenmn/.conda/envs/cosmosis-gnu/bin/python3.9(+0x251909)[0x55690b9b2909]
/home/nguyenmn/.conda/envs/cosmosis-gnu/bin/python3.9(+0x28c3a4)[0x55690b9ed3a4]
/home/nguyenmn/.conda/envs/cosmosis-gnu/bin/python3.9(+0x118d33)[0x55690b879d33]
/home/nguyenmn/.conda/envs/cosmosis-gnu/bin/python3.9(PyRun_SimpleFileExFlags+0x19c)[0x55690b9f783c]
##################################################################################

And here is the python faulthandler report and trace:

Fatal Python error: Segmentation fault

Current thread 0x0000151f75ee2740 (most recent call first):
  File "/nfs/turbo/lsa-nguyenmn/cosmosis/cosmosis/samplers/polychord/polychord_sampler.py", line 266 in sample
  File "/nfs/turbo/lsa-nguyenmn/cosmosis/cosmosis/samplers/polychord/polychord_sampler.py", line 204 in worker
  File "/nfs/turbo/lsa-nguyenmn/cosmosis/cosmosis/main.py", line 84 in sampler_main_loop
  File "/nfs/turbo/lsa-nguyenmn/cosmosis/cosmosis/main.py", line 417 in run_cosmosis
  File "/nfs/turbo/lsa-nguyenmn/cosmosis/cosmosis/main.py", line 543 in main
  File "/home/nguyenmn/.conda/envs/cosmosis-gnu/bin/cosmosis", line 4 in <module>
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node gl-login2 exited on signal 11 (Segmentation fault).

MinhMPA commented 1 year ago

Closing my own issue: I fixed the problem by reinstalling CosmoSIS with a different MPI compiler.
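
For anyone who lands here with a similar trace: the two MPI frames in the output above come from different directories (libmpi.so.40 inside the conda environment, libmpifort.so.12 under an Intel MPI path), which seems consistent with the fix of rebuilding against a different MPI. Below is a minimal sketch for spotting this pattern in a saved trace. The function name `mpi_libraries_in_trace` and the regular expression are illustrative assumptions, not part of CosmoSIS or PolyChord, and the frame format is assumed to match the glibc-style backtrace lines shown in this issue.

```python
import posixpath
import re
from collections import defaultdict

# One glibc-style backtrace frame: "/path/lib.so(symbol+0xoff)[0xaddress]".
FRAME_RE = re.compile(
    r"^(?P<path>/[^\s(]+)\((?P<symbol>[^)]*)\)\[0x[0-9a-fA-F]+\]$"
)


def mpi_libraries_in_trace(trace_text):
    """Group the libmpi* shared objects seen in a backtrace by directory."""
    found = defaultdict(set)
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue  # not a backtrace frame
        # normpath is purely lexical: it collapses "..", but does not
        # resolve symlinks.
        path = posixpath.normpath(match.group("path"))
        directory, filename = posixpath.split(path)
        if filename.startswith("libmpi"):
            found[directory].add(filename)
    return dict(found)


# Two frames copied from the trace in this issue.
SAMPLE_TRACE = """\
/home/nguyenmn/.conda/envs/cosmosis-gnu/lib/python3.9/site-packages/mpi4py/../../../libmpi.so.40(PMPI_Comm_rank+0x37)[0x151f6767ead7]
/sw/pkgs/arc/intel/2022.1.2/mpi/2021.5.1/lib/libmpifort.so.12(mpi_comm_rank_+0xb)[0x151f431d1d4b]
"""

libraries = mpi_libraries_in_trace(SAMPLE_TRACE)
for directory, names in sorted(libraries.items()):
    print(directory, sorted(names))
if len(libraries) > 1:
    print("MPI libraries were loaded from more than one directory")
```

More than one directory in the result is only a hint, not proof, that two MPI installations are mixed in one process. A single installation can legitimately spread its libraries across directories, so the paths still need to be checked by hand.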