pelahi / VELOCIraptor-STF

Galaxy/(sub)Halo finder for N-body simulations
MIT License

Crash with GSL error in EAGLE 100Mpc box #77

Closed (jchelly closed 4 years ago)

jchelly commented 4 years ago

I'm running VELOCIraptor on 200 snapshots from the EAGLE 100Mpc DMONLY box. I'm using commit e5a701fe032a8c9be36d542a84427ad787dd3787 from master and configuring with:

module purge
module load intel_comp/2018 intel_mpi/2018 fftw/3.3.7
module load parallel_hdf5/1.10.3 gsl/2.4 parmetis/4.0.3
module load gsl/2.4
module load cmake
cmake .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CXX_FLAGS_RELEASE="-O3 -xAVX -ip -DNDEBUG" \
    -DCMAKE_C_FLAGS_RELEASE="-O3 -xAVX -ip -DNDEBUG" \
    -DCMAKE_C_COMPILER=icc \
    -DCMAKE_CXX_COMPILER=icpc \
    -DVR_NO_MASS=ON

The config file I'm using is VELOCIraptor-STF/examples/sample_swiftdm_3dfof_subhalo.cfg but with the HDF5 naming convention changed to Eagle.

It runs on 198 of 200 snapshots but two of them fail with this type of message in the log:

27Done searching substructure to 4 sublevels 
7Done searching substructure to 4 sublevels 
18Done searching substructure to 3 sublevels 
30Done searching substructure to 4 sublevels 
12Done searching substructure to 4 sublevels 
6Done searching substructure to 3 sublevels 
gsl: qrpt.c:345: ERROR: rank must have 0 < rank <= N
Default GSL error handler invoked.

The GSL call that generates this message seems to be at line 215 of Fitting.cxx in NBodylib:

        gsl_multifit_nlinear_driver(maxiit, xtol, gtol, ftol, NULL, NULL, &info_gsl, workspace_gsl);

There's an example of a snapshot where this happens in /cosma7/data/Eagle/ScienceRuns/Planck1/L0100N1504/PE/DMONLY/data/snipshot_356_z000p213// on Cosma.

jchelly commented 4 years ago

Here's the stack trace:

#17 _INTERNAL_26_______src_z_Linux_util_cpp_d7ee2e5e::__kmp_launch_worker (thr=0x2aaaae16a1c0) at /nfs/site/proj/openmp/promo/20180112/tmp/lin_32e-rtl_5_nor_dyn.rel.50.c0.s0.t1..h1.w1-fxe16lin03/../../src/z_Linux_util.cpp:585 (at 0x00002aaaad8949c0)
#16 __kmp_launch_thread (this_thr=0x2aaaae16a1c0) at /nfs/site/proj/openmp/promo/20180112/tmp/lin_32e-rtl_5_nor_dyn.rel.50.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_runtime.cpp:5885 (at 0x00002aaaad85baab)
#15 __kmp_invoke_task_func (gtid=-1374248512) at /nfs/site/proj/openmp/promo/20180112/tmp/lin_32e-rtl_5_nor_dyn.rel.50.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_runtime.cpp:7277 (at 0x00002aaaad85c3ea)
#14 __kmp_invoke_microtask () from /cosma/local/Intel/Parallel_Studio_XE_2018/compilers_and_libraries_2018.2.199/linux/compiler/lib/intel64_lin/libiomp5.so (at 0x00002aaaad894563)
#13 L__Z12SearchSubSubR7OptionsxRSt6vectorIN5NBody8ParticleESaIS3_EERPxRxS9_P8PropData_2944__par_region0_2_121 () at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/src/search.cxx:2959 (at 0x000000000052ea44)
#12 PreCalcSearchSubSet (opt=..., subnumingroup=46912553531888, subPart=<error reading variable: Cannot access memory at address 0x0>, sublevel=2880) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/src/search.cxx:2698 (at 0x000000000054ad6c)
#11 GetOutliersValues (opt=..., nbodies=46924830221952, Part=0x2aad89d61118, sublevel=2880) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/src/localbgcomp.cxx:376 (at 0x00000000004faee3)
#10 DetermineDenVRatioDistribution (opt=..., nbodies=46925612085264, Part=0x3, meanr=<error reading variable: Cannot access memory at address 0xb40>, sdlow=@0x2aad89d64d00: 2.318394655446907e-310, sdhigh=@0x2aaaaddef27d: 2.0885377338534226e-236, sublevel=24814) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/src/localbgcomp.cxx:338 (at 0x00000000004fd0e4)
#9 Math::FitNonLinLS (fitfunc=..., difffuncs=0x2aaaae16a1c0, nparams=-1374242320, params=0x0, covar=..., npoints=-1982444288, x=0x2aaaaddef27d, y=0x4, W=0x2aad89d60fd0, error=5.4110892682213118e-312, cl=9.532824124368238e-130, fixparam=0x9, binned=72, maxiit=-1982459968, iestimateerror=5230820, icallgsl=true) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/NBodylib/src/Math/Fitting.cxx:20 (at 0x000000000047ce33)
#8 Math::FitNonLinLSWithGSL (fitfunc=..., difffuncs=0x2aaaae16a1c0, nparams=-1374242320, params=0x0, covar=..., npoints=-1982444288, x=0x2aaaaddef27d, y=0x2aadb8707010, W=0x2aad89d60f48, error=5.4110892682213118e-312, cl=9.532824124368238e-130, fixparam=0x2aadb87078c0, binned=1, maxiit=20, iestimateerror=0) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/NBodylib/src/Math/Fitting.cxx:215 (at 0x000000000047d135)
#7 gsl_multifit_nlinear_driver (maxiter=20, xtol=0.01, gtol=0.01, ftol=0.01, callback=0x0, callback_params=0x0, info=0x2aad89d60ce0, w=0x2aadb8707b80) at /cosma/local/software/GSL/gsl-2.4/multifit_nlinear/fdf.c:279 (at 0x00002aaaaadd529c)
#6 gsl_multifit_nlinear_iterate (w=w@entry=0x2aadb8707b80) at /cosma/local/software/GSL/gsl-2.4/multifit_nlinear/fdf.c:215 (at 0x00002aaaaadd51ea)
#5 trust_iterate (vstate=0x2aadb8707f70, swts=0x0, fdf=0x2aad89d60c40, x=0x2aadb87223c0, f=0x2aadb8707ed0, J=0x2aadb8707e20, g=0x2aadb8707ca0, dx=0x2aadb8707c30) at /cosma/local/software/GSL/gsl-2.4/multifit_nlinear/trust.c:330 (at 0x00002aaaaadd8318)
#4 lm_step (vtrust_state=0x2aad89d60a60, delta=<optimized out>, dx=0x2aadb8707c30, vstate=0x2aadb87082c0) at /cosma/local/software/GSL/gsl-2.4/multifit_nlinear/lm.c:236 (at 0x00002aaaaadd5f06)
#3 qr_solve (f=0x2aadb8707ed0, x=0x2aadb8708110, vtrust_state=0x2aad89d60a60, vstate=0x2aadb87085e0) at /cosma/local/software/GSL/gsl-2.4/multifit_nlinear/qr.c:232 (at 0x00002aaaaadd67c0)
#2 gsl_linalg_QRPT_lssolve2 (QR=0x2aadb8708650, tau=0x2aadb8708030, p=0x2aadb87080f0, b=b@entry=0x2aadb8707ed0, rank=0, x=x@entry=0x2aadb8708110, residual=0x2aadb8708a80) at /cosma/local/software/GSL/gsl-2.4/linalg/qrpt.c:345 (at 0x00002aaaaad8a25d)
#1 gsl_error (reason=<optimized out>, file=<optimized out>, line=<optimized out>, gsl_errno=<optimized out>) at /cosma/local/software/GSL/gsl-2.4/err/error.c:47 (at 0x00002aaaaad50bad)
#0 abort () from /lib64/libc.so.6 (at 0x00002aaaaddd98e0)

jchelly commented 4 years ago

In the call to GetOutliersValues some parameters have values that look odd to me: sublevel=2880 and subnumingroup=46912553531888. That might just be DDT not handling an optimized build or threads well, though.

pelahi commented 4 years ago

Indeed, that sublevel depth is pretty crazy.

jchelly commented 4 years ago

I'm now having a similar crash in a 300Mpc DMONLY box where I'm running VELOCIraptor on the fly:

...
54 Beginning substructure search 
TIME::55 took 2563.48 to search 457853291 with 14
Searching subset
55 Beginning substructure search 
gsl: qrpt.c:345: ERROR: rank must have 0 < rank <= N
Default GSL error handler invoked.

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 63157 RUNNING AT m7344
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

jchelly commented 4 years ago

(edited because my previous comment seemed to get duplicated somehow)

I think the odd variable values I was seeing in the 100Mpc box might be due to DDT not handling optimization very well. They look more reasonable at -O1. I haven't been able to run it in DDT for long enough to reproduce the crash at -O0 yet.

jchelly commented 4 years ago

Here's the stack trace from a build with -O0. The reported parameter values are probably more reliable in this one.

#19 main (argc=13, argv=0x7ffffffecea8) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/src/main.cxx:387 (at 0x000000000048448f)
#18 SearchSubSub (opt=..., nsubset=231367038, Partsubset=std::vector of length 231367038, capacity 232120637 = {...}, pfof=@0x7ffffffec220: 0x2aafbd029010, ngroup=@0x7ffffffec218: 154287, nhalos=@0x7ffffffec240: 154287, pdata=0x0) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/src/search.cxx:2958 (at 0x00000000005cebbb)
#17 __kmpc_fork_call (loc=0x2aaab03ac1c0, argc=-1338320400, microtask=0x0) at /nfs/site/proj/openmp/promo/20180112/tmp/lin_32e-rtl_5_nor_dyn.rel.50.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_csupport.cpp:349 (at 0x00002aaaaf7f0fb0)
#16 __kmp_fork_call (loc=0x2aaab03ac1c0, gtid=-1338320400, call_context=fork_context_gnu, argc=2880, microtask=0x2aaaaaaec840, invoker=0x2aaab003127d <vfprintf+19661>, ap=0x7ffffffe59c0) at /nfs/site/proj/openmp/promo/20180112/tmp/lin_32e-rtl_5_nor_dyn.rel.50.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_runtime.cpp:2463 (at 0x00002aaaaf831be7)
#15 __kmp_invoke_task_func (gtid=-1338326592) at /nfs/site/proj/openmp/promo/20180112/tmp/lin_32e-rtl_5_nor_dyn.rel.50.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_runtime.cpp:7277 (at 0x00002aaaaf8303ea)
#14 __kmp_invoke_microtask () from /cosma/local/Intel/Parallel_Studio_XE_2018/compilers_and_libraries_2018.2.199/linux/compiler/lib/intel64_lin/libiomp5.so (at 0x00002aaaaf868563)
#13 L__Z12SearchSubSubR7OptionsxRSt6vectorIN5NBody8ParticleESaIS3_EERPxRxS9_P8PropData_2958__par_region0_2_128 () at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/src/search.cxx:2988 (at 0x00000000005d1ea9)
#12 PreCalcSearchSubSet (opt=..., subnumingroup=24814, subPart=@0x7ffffffe22c8: 0xae2cf08, sublevel=1) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/src/search.cxx:2698 (at 0x00000000005dc120)
#11 GetOutliersValues (opt=..., nbodies=24814, Part=0xae2cf08, sublevel=1) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/src/localbgcomp.cxx:376 (at 0x000000000057291f)
#10 DetermineDenVRatioDistribution (opt=..., nbodies=24814, Part=0xae2cf08, meanr=@0x7ffffffe07f8: -0.70936868354562721, sdlow=@0x7ffffffe0800: 4.2008269113685213, sdhigh=@0x7ffffffe0808: 4.2008269113685213, sublevel=1) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/src/localbgcomp.cxx:338 (at 0x0000000000571c87)
#9 Math::FitNonLinLS (fitfunc=..., difffuncs=0xd5a030, nparams=4, params=0xd27840, covar=..., npoints=9, x=0xd8ff60, y=0xb342510, W=0x7ffffffe0050, error=0.01, cl=0.94999999999999996, fixparam=0xdeb730, binned=1, maxiit=20, iestimateerror=0, icallgsl=true) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/NBodylib/src/Math/Fitting.cxx:20 (at 0x00000000004efc02)
#8 Math::FitNonLinLSWithGSL (fitfunc=..., difffuncs=0xd5a030, nparams=4, params=0xd27840, covar=..., npoints=9, x=0xd8ff60, y=0xb342510, W=0x7ffffffe0050, error=0.01, cl=0.94999999999999996, fixparam=0xdeb730, binned=1, maxiit=20, iestimateerror=0) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/NBodylib/src/Math/Fitting.cxx:215 (at 0x00000000004f234e)
#7 gsl_multifit_nlinear_driver (maxiter=20, xtol=0.01, gtol=0.01, ftol=0.01, callback=0x0, callback_params=0x0, info=0x7ffffffdfae0, w=0xc1e0d10) at /cosma/local/software/GSL/gsl-2.5/multifit_nlinear/fdf.c:279 (at 0x00002aaaaadd9fcc)
#6 gsl_multifit_nlinear_iterate (w=w@entry=0xc1e0d10) at /cosma/local/software/GSL/gsl-2.5/multifit_nlinear/fdf.c:215 (at 0x00002aaaaadd9f1a)
#5 trust_iterate (vstate=0xb373a60, swts=0x0, fdf=0x7ffffffdfc08, x=0xd8ea60, f=0xd6b3a0, J=0xd85540, g=0xd83c10, dx=0xd82c90) at /cosma/local/software/GSL/gsl-2.5/multifit_nlinear/trust.c:330 (at 0x00002aaaaaddd048)
#4 lm_step (vtrust_state=0x7ffffffdf930, delta=<optimized out>, dx=0xd82c90, vstate=0xdc9810) at /cosma/local/software/GSL/gsl-2.5/multifit_nlinear/lm.c:236 (at 0x00002aaaaaddac36)
#3 qr_solve (f=0xd6b3a0, x=0xc7a3a0, vtrust_state=0x7ffffffdf930, vstate=0x1062470) at /cosma/local/software/GSL/gsl-2.5/multifit_nlinear/qr.c:232 (at 0x00002aaaaaddb4f0)
#2 gsl_linalg_QRPT_lssolve2 (QR=0xd4d3b0, tau=0xd35f30, p=0xcee900, b=b@entry=0xd6b3a0, rank=0, x=x@entry=0xc7a3a0, residual=0xb373c10) at /cosma/local/software/GSL/gsl-2.5/linalg/qrpt.c:345 (at 0x00002aaaaad8f54d)
#1 gsl_error (reason=<optimized out>, file=<optimized out>, line=<optimized out>, gsl_errno=<optimized out>) at /cosma/local/software/GSL/gsl-2.5/err/error.c:47 (at 0x00002aaaaad5491d)
#0 abort () from /lib64/libc.so.6 (at 0x00002aaab001b8e0)

MatthieuSchaller commented 4 years ago

Is it possible to extract the value of all the parameters passed to the GSL call? This way we could identify whether the input is somehow garbage or whether it's a genuine corner case for which a different GSL call is needed (or a change in tolerance, etc.)
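
One way to separate those two cases, sketched here under stated assumptions: before the arrays reach GSL, check that there are at least as many points as parameters and that every value is finite. The helper name `fit_inputs_look_sane` and its signature are hypothetical; nothing like it is confirmed to exist in VELOCIraptor or NBodylib.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical pre-flight check for a least-squares fit; not part of NBodylib.
// Returns false when the inputs are obviously unusable: mismatched arrays,
// fewer points than parameters, or any NaN/Inf in the data or initial guess.
bool fit_inputs_look_sane(const std::vector<double>& x,
                          const std::vector<double>& y,
                          const std::vector<double>& params)
{
    // x and y must describe the same set of points.
    if (x.size() != y.size()) return false;
    // A fit with no parameters, or fewer points than parameters, is ill-posed.
    if (params.empty() || x.size() < params.size()) return false;

    auto all_finite = [](const std::vector<double>& v) {
        for (double e : v)
            if (!std::isfinite(e)) return false;
        return true;
    };
    return all_finite(x) && all_finite(y) && all_finite(params);
}
```

In the -O0 trace above, npoints=9 and nparams=4, so the size check would pass there; whether the values themselves are finite is what would remain to be inspected.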

jchelly commented 4 years ago

That's a good idea. I have it stopped at the crash now. I'll see if I can extract all the inputs to reproduce the call.

jchelly commented 4 years ago

Here's a C++ program which reproduces the crash using the parameters I extracted from DDT: gsl_crash.cxx.gz. It can be run on cosma with

module purge
module load intel_comp/2018 gsl
icpc ./gsl_crash.cxx -lgsl -lgslcblas
./a.out

jchelly commented 4 years ago

I think the function being fit when it fails is equation 12 in https://arxiv.org/pdf/1902.01010.pdf. The quantities in the params array are params[0]=A, params[1]=R_mean, params[2]=sigma_r, params[3]=s.

pelahi commented 4 years ago

Yes, it does appear that the fit itself simply fails, but it fails by invoking the GSL error handler, which aborts the run. I'm going to see if I can wrap the GSL calls to catch this and just return chi^2 = infinity, which is what my own fitting code did.
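
A minimal sketch of that idea, under stated assumptions: the solver is represented by a hypothetical callable that returns a GSL-style status code (0 on success, like GSL_SUCCESS) and writes chi^2 through a reference. `FitRoutine` and `guarded_chi2` are illustrative names, not the functions in NBodylib. With real GSL, the default handler (which calls abort()) would first have to be replaced, for example via gsl_set_error_handler_off(), so that failures come back as non-zero status codes instead of terminating the process.

```cpp
#include <cmath>
#include <functional>
#include <limits>

// Hypothetical stand-in for a fitting routine: returns 0 on success (the
// value of GSL_SUCCESS) and stores the resulting chi^2 in the output argument.
using FitRoutine = std::function<int(double& chi2)>;

// Run the fit and report chi^2 = +infinity on failure, so callers can treat
// a failed fit as the worst possible result instead of the process aborting.
double guarded_chi2(const FitRoutine& fit)
{
    const double inf = std::numeric_limits<double>::infinity();
    double chi2 = inf;
    const int status = fit(chi2);
    // A non-zero status or a non-finite chi^2 both count as a failed fit.
    if (status != 0 || !std::isfinite(chi2)) return inf;
    return chi2;
}
```

A caller can then test the result with std::isinf and decide how to handle the failed fit.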

pelahi commented 4 years ago

I think this issue has now been addressed. @jchelly, can you confirm?

jchelly commented 4 years ago

This does seem to be fixed now.