jchelly closed 4 years ago
Here's the stack trace:
#17 _INTERNAL_26_______src_z_Linux_util_cpp_d7ee2e5e::__kmp_launch_worker (thr=0x2aaaae16a1c0) at /nfs/site/proj/openmp/promo/20180112/tmp/lin_32e-rtl_5_nor_dyn.rel.50.c0.s0.t1..h1.w1-fxe16lin03/../../src/z_Linux_util.cpp:585 (at 0x00002aaaad8949c0)
#16 __kmp_launch_thread (this_thr=0x2aaaae16a1c0) at /nfs/site/proj/openmp/promo/20180112/tmp/lin_32e-rtl_5_nor_dyn.rel.50.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_runtime.cpp:5885 (at 0x00002aaaad85baab)
#15 __kmp_invoke_task_func (gtid=-1374248512) at /nfs/site/proj/openmp/promo/20180112/tmp/lin_32e-rtl_5_nor_dyn.rel.50.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_runtime.cpp:7277 (at 0x00002aaaad85c3ea)
#14 __kmp_invoke_microtask () from /cosma/local/Intel/Parallel_Studio_XE_2018/compilers_and_libraries_2018.2.199/linux/compiler/lib/intel64_lin/libiomp5.so (at 0x00002aaaad894563)
#13 L__Z12SearchSubSubR7OptionsxRSt6vectorIN5NBody8ParticleESaIS3_EERPxRxS9_P8PropData_2944__par_region0_2_121 () at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/src/search.cxx:2959 (at 0x000000000052ea44)
#12 PreCalcSearchSubSet (opt=..., subnumingroup=46912553531888, subPart=<error reading variable: Cannot access memory at address 0x0>, sublevel=2880) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/src/search.cxx:2698 (at 0x000000000054ad6c)
#11 GetOutliersValues (opt=..., nbodies=46924830221952, Part=0x2aad89d61118, sublevel=2880) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/src/localbgcomp.cxx:376 (at 0x00000000004faee3)
#10 DetermineDenVRatioDistribution (opt=..., nbodies=46925612085264, Part=0x3, meanr=<error reading variable: Cannot access memory at address 0xb40>, sdlow=@0x2aad89d64d00: 2.318394655446907e-310, sdhigh=@0x2aaaaddef27d: 2.0885377338534226e-236, sublevel=24814) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/src/localbgcomp.cxx:338 (at 0x00000000004fd0e4)
#9 Math::FitNonLinLS (fitfunc=..., difffuncs=0x2aaaae16a1c0, nparams=-1374242320, params=0x0, covar=..., npoints=-1982444288, x=0x2aaaaddef27d, y=0x4, W=0x2aad89d60fd0, error=5.4110892682213118e-312, cl=9.532824124368238e-130, fixparam=0x9, binned=72, maxiit=-1982459968, iestimateerror=5230820, icallgsl=true) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/NBodylib/src/Math/Fitting.cxx:20 (at 0x000000000047ce33)
#8 Math::FitNonLinLSWithGSL (fitfunc=..., difffuncs=0x2aaaae16a1c0, nparams=-1374242320, params=0x0, covar=..., npoints=-1982444288, x=0x2aaaaddef27d, y=0x2aadb8707010, W=0x2aad89d60f48, error=5.4110892682213118e-312, cl=9.532824124368238e-130, fixparam=0x2aadb87078c0, binned=1, maxiit=20, iestimateerror=0) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/NBodylib/src/Math/Fitting.cxx:215 (at 0x000000000047d135)
#7 gsl_multifit_nlinear_driver (maxiter=20, xtol=0.01, gtol=0.01, ftol=0.01, callback=0x0, callback_params=0x0, info=0x2aad89d60ce0, w=0x2aadb8707b80) at /cosma/local/software/GSL/gsl-2.4/multifit_nlinear/fdf.c:279 (at 0x00002aaaaadd529c)
#6 gsl_multifit_nlinear_iterate (w=w@entry=0x2aadb8707b80) at /cosma/local/software/GSL/gsl-2.4/multifit_nlinear/fdf.c:215 (at 0x00002aaaaadd51ea)
#5 trust_iterate (vstate=0x2aadb8707f70, swts=0x0, fdf=0x2aad89d60c40, x=0x2aadb87223c0, f=0x2aadb8707ed0, J=0x2aadb8707e20, g=0x2aadb8707ca0, dx=0x2aadb8707c30) at /cosma/local/software/GSL/gsl-2.4/multifit_nlinear/trust.c:330 (at 0x00002aaaaadd8318)
#4 lm_step (vtrust_state=0x2aad89d60a60, delta=<optimized out>, dx=0x2aadb8707c30, vstate=0x2aadb87082c0) at /cosma/local/software/GSL/gsl-2.4/multifit_nlinear/lm.c:236 (at 0x00002aaaaadd5f06)
#3 qr_solve (f=0x2aadb8707ed0, x=0x2aadb8708110, vtrust_state=0x2aad89d60a60, vstate=0x2aadb87085e0) at /cosma/local/software/GSL/gsl-2.4/multifit_nlinear/qr.c:232 (at 0x00002aaaaadd67c0)
#2 gsl_linalg_QRPT_lssolve2 (QR=0x2aadb8708650, tau=0x2aadb8708030, p=0x2aadb87080f0, b=b@entry=0x2aadb8707ed0, rank=0, x=x@entry=0x2aadb8708110, residual=0x2aadb8708a80) at /cosma/local/software/GSL/gsl-2.4/linalg/qrpt.c:345 (at 0x00002aaaaad8a25d)
#1 gsl_error (reason=<optimized out>, file=<optimized out>, line=<optimized out>, gsl_errno=<optimized out>) at /cosma/local/software/GSL/gsl-2.4/err/error.c:47 (at 0x00002aaaaad50bad)
#0 abort () from /lib64/libc.so.6 (at 0x00002aaaaddd98e0)
In the call to GetOutliersValues some parameters have values that look odd to me: sublevel=2880 and subnumingroup=46912553531888. Might just be ddt not handling an optimized build or threads though.
Indeed, that sublevel depth is pretty crazy.
I'm now having a similar crash in a 300Mpc DMONLY box where I'm running velociraptor on the fly:
...
54 Beginning substructure search
TIME::55 took 2563.48 to search 457853291 with 14
Searching subset
55 Beginning substructure search
gsl: qrpt.c:345: ERROR: rank must have 0 < rank <= N
Default GSL error handler invoked.
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 63157 RUNNING AT m7344
= EXIT CODE: 6
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
(edited because my previous comment seemed to get duplicated somehow)
I think the odd variable values I was seeing in the 100Mpc box might be due to ddt not handling optimization very well. They look more reasonable at -O1. I haven't been able to run it in ddt for long enough to reproduce the crash at -O0 yet.
Here's the stack trace from a build with -O0. The reported parameter values are probably more reliable in this one.
#19 main (argc=13, argv=0x7ffffffecea8) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/src/main.cxx:387 (at 0x000000000048448f)
#18 SearchSubSub (opt=..., nsubset=231367038, Partsubset=std::vector of length 231367038, capacity 232120637 = {...}, pfof=@0x7ffffffec220: 0x2aafbd029010, ngroup=@0x7ffffffec218: 154287, nhalos=@0x7ffffffec240: 154287, pdata=0x0) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/src/search.cxx:2958 (at 0x00000000005cebbb)
#17 __kmpc_fork_call (loc=0x2aaab03ac1c0, argc=-1338320400, microtask=0x0) at /nfs/site/proj/openmp/promo/20180112/tmp/lin_32e-rtl_5_nor_dyn.rel.50.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_csupport.cpp:349 (at 0x00002aaaaf7f0fb0)
#16 __kmp_fork_call (loc=0x2aaab03ac1c0, gtid=-1338320400, call_context=fork_context_gnu, argc=2880, microtask=0x2aaaaaaec840, invoker=0x2aaab003127d <vfprintf+19661>, ap=0x7ffffffe59c0) at /nfs/site/proj/openmp/promo/20180112/tmp/lin_32e-rtl_5_nor_dyn.rel.50.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_runtime.cpp:2463 (at 0x00002aaaaf831be7)
#15 __kmp_invoke_task_func (gtid=-1338326592) at /nfs/site/proj/openmp/promo/20180112/tmp/lin_32e-rtl_5_nor_dyn.rel.50.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_runtime.cpp:7277 (at 0x00002aaaaf8303ea)
#14 __kmp_invoke_microtask () from /cosma/local/Intel/Parallel_Studio_XE_2018/compilers_and_libraries_2018.2.199/linux/compiler/lib/intel64_lin/libiomp5.so (at 0x00002aaaaf868563)
#13 L__Z12SearchSubSubR7OptionsxRSt6vectorIN5NBody8ParticleESaIS3_EERPxRxS9_P8PropData_2958__par_region0_2_128 () at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/src/search.cxx:2988 (at 0x00000000005d1ea9)
#12 PreCalcSearchSubSet (opt=..., subnumingroup=24814, subPart=@0x7ffffffe22c8: 0xae2cf08, sublevel=1) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/src/search.cxx:2698 (at 0x00000000005dc120)
#11 GetOutliersValues (opt=..., nbodies=24814, Part=0xae2cf08, sublevel=1) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/src/localbgcomp.cxx:376 (at 0x000000000057291f)
#10 DetermineDenVRatioDistribution (opt=..., nbodies=24814, Part=0xae2cf08, meanr=@0x7ffffffe07f8: -0.70936868354562721, sdlow=@0x7ffffffe0800: 4.2008269113685213, sdhigh=@0x7ffffffe0808: 4.2008269113685213, sublevel=1) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/src/localbgcomp.cxx:338 (at 0x0000000000571c87)
#9 Math::FitNonLinLS (fitfunc=..., difffuncs=0xd5a030, nparams=4, params=0xd27840, covar=..., npoints=9, x=0xd8ff60, y=0xb342510, W=0x7ffffffe0050, error=0.01, cl=0.94999999999999996, fixparam=0xdeb730, binned=1, maxiit=20, iestimateerror=0, icallgsl=true) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/NBodylib/src/Math/Fitting.cxx:20 (at 0x00000000004efc02)
#8 Math::FitNonLinLSWithGSL (fitfunc=..., difffuncs=0xd5a030, nparams=4, params=0xd27840, covar=..., npoints=9, x=0xd8ff60, y=0xb342510, W=0x7ffffffe0050, error=0.01, cl=0.94999999999999996, fixparam=0xdeb730, binned=1, maxiit=20, iestimateerror=0) at /cosma7/data/dp004/jch/Tree_Comparison/Velociraptor/VELOCIraptor-STF/NBodylib/src/Math/Fitting.cxx:215 (at 0x00000000004f234e)
#7 gsl_multifit_nlinear_driver (maxiter=20, xtol=0.01, gtol=0.01, ftol=0.01, callback=0x0, callback_params=0x0, info=0x7ffffffdfae0, w=0xc1e0d10) at /cosma/local/software/GSL/gsl-2.5/multifit_nlinear/fdf.c:279 (at 0x00002aaaaadd9fcc)
#6 gsl_multifit_nlinear_iterate (w=w@entry=0xc1e0d10) at /cosma/local/software/GSL/gsl-2.5/multifit_nlinear/fdf.c:215 (at 0x00002aaaaadd9f1a)
#5 trust_iterate (vstate=0xb373a60, swts=0x0, fdf=0x7ffffffdfc08, x=0xd8ea60, f=0xd6b3a0, J=0xd85540, g=0xd83c10, dx=0xd82c90) at /cosma/local/software/GSL/gsl-2.5/multifit_nlinear/trust.c:330 (at 0x00002aaaaaddd048)
#4 lm_step (vtrust_state=0x7ffffffdf930, delta=<optimized out>, dx=0xd82c90, vstate=0xdc9810) at /cosma/local/software/GSL/gsl-2.5/multifit_nlinear/lm.c:236 (at 0x00002aaaaaddac36)
#3 qr_solve (f=0xd6b3a0, x=0xc7a3a0, vtrust_state=0x7ffffffdf930, vstate=0x1062470) at /cosma/local/software/GSL/gsl-2.5/multifit_nlinear/qr.c:232 (at 0x00002aaaaaddb4f0)
#2 gsl_linalg_QRPT_lssolve2 (QR=0xd4d3b0, tau=0xd35f30, p=0xcee900, b=b@entry=0xd6b3a0, rank=0, x=x@entry=0xc7a3a0, residual=0xb373c10) at /cosma/local/software/GSL/gsl-2.5/linalg/qrpt.c:345 (at 0x00002aaaaad8f54d)
#1 gsl_error (reason=<optimized out>, file=<optimized out>, line=<optimized out>, gsl_errno=<optimized out>) at /cosma/local/software/GSL/gsl-2.5/err/error.c:47 (at 0x00002aaaaad5491d)
#0 abort () from /lib64/libc.so.6 (at 0x00002aaab001b8e0)
Is it possible to extract the value of all the parameters passed to the GSL call? This way we could identify whether the input is somehow garbage or whether it's a genuine corner case for which a different GSL call is needed (or a change in tolerance, etc.)
That's a good idea. I have it stopped at the crash now. I'll see if I can extract all the inputs to reproduce the call.
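For anyone who can't easily pull the arrays out of a debugger, here is a minimal sketch of an alternative: print each input array just before the fit call, as a C++ initializer that can be pasted into a standalone reproducer. The helper name `dump_array` is hypothetical and not part of VELOCIraptor or NBodylib. The arrays to pass would be the `x`, `y`, `W` and `params` arguments seen in the trace, and the sketch assumes each has at least one element.

```cpp
#include <cstddef>
#include <iomanip>
#include <limits>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper (not from NBodylib): writes an array of doubles as a
// C++ initializer, using max_digits10 so the values round-trip exactly.
// Assumes n > 0, since a zero-length array declaration is not valid C++.
inline void dump_array(std::ostream& os, const std::string& name,
                       const double* v, std::size_t n) {
    os << std::setprecision(std::numeric_limits<double>::max_digits10);
    os << "double " << name << "[" << n << "] = {";
    for (std::size_t i = 0; i < n; ++i) {
        if (i != 0) os << ", ";   // separator between elements
        os << v[i];
    }
    os << "};\n";
}
```

Calling something like `dump_array(std::cerr, "x", x, npoints)` right before the fit would leave the inputs of the failing call as the last thing in the log.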
Here's a C++ program which reproduces the crash using the parameters I extracted from DDT: gsl_crash.cxx.gz. It can be run on cosma with
module purge
module load intel_comp/2018 gsl
icpc ./gsl_crash.cxx -lgsl -lgslcblas
./a.out
I think the function being fit when it fails is equation 12 in https://arxiv.org/pdf/1902.01010.pdf . The quantities in the params array are param[0]=A, param[1]=R_mean, param[2]=sigma_r, param[3]=s.
Yes, it does appear to just fail when fitting, but it fails by aborting through the GSL error handler instead of returning an error. I'm going to see if I can wrap the GSL calls to catch this and just return chi^2 = infinity, which is what my own fitting code did.
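A rough sketch of that wrapping idea, not the actual fix that went into the code: treat any non-zero fit status, or a non-finite chi^2, as "fit failed" and return chi^2 = infinity so the caller can reject it. `FitResult` and `chi2_or_infinity` are hypothetical names for illustration. The sketch is deliberately GSL-free so it compiles on its own.

```cpp
#include <cmath>
#include <functional>
#include <limits>

// Hypothetical result type (not from NBodylib).
struct FitResult {
    int status;   // 0 on success, non-zero on failure (GSL_SUCCESS is also 0)
    double chi2;  // only meaningful when status == 0
};

// Runs the supplied fit and maps any failure to chi^2 = +infinity, so a bad
// fit is rejected by the caller instead of terminating the process.
inline double chi2_or_infinity(const std::function<FitResult()>& run_fit) {
    const FitResult r = run_fit();
    if (r.status != 0 || !std::isfinite(r.chi2)) {
        return std::numeric_limits<double>::infinity();
    }
    return r.chi2;
}
```

For this to work with real GSL, the default error handler has to be disabled first, for example by calling `gsl_set_error_handler_off()` once at startup. After that, `gsl_multifit_nlinear_driver` returns a non-zero status instead of calling `abort()`, and that status can be fed into a wrapper like the one above.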
I think this issue has now been addressed. @jchelly, can you confirm?
This does seem to be fixed now.
I'm running Velociraptor on 200 snapshots from the EAGLE 100Mpc DMONLY box. I'm using commit e5a701fe032a8c9be36d542a84427ad787dd3787 from master and configuring with
The config file I'm using is VELOCIraptor-STF/examples/sample_swiftdm_3dfof_subhalo.cfg but with the HDF5 naming convention changed to Eagle.
It runs on 198 of 200 snapshots but two of them fail with this type of message in the log:
The GSL call that generates this message seems to be at line 215 in Fitting.cxx from NBodylib:
There's an example of a snapshot where this happens in /cosma7/data/Eagle/ScienceRuns/Planck1/L0100N1504/PE/DMONLY/data/snipshot_356_z000p213// on Cosma.