Closed HighwayStar closed 7 years ago
Thanks for the info. Please try building with debugging and an optimization flag, i.e. -g -fbounds-check -fbacktrace -O1, and if it still crashes, post the backtrace here. Thanks.
Output produced by a build with -g -fbounds-check -fbacktrace -O1:
python run_nicole.py
Checking syntax in file:LINES
Preparing cycle 1
Checking syntax in file:NICOLE.input
... no errors found
Preparing file with observed profiles...100%
Preparing file with input model...100%
Preparing cycle 2
Checking syntax in file:NICOLE.input_2
... no errors found
Starting code execution
*************** N I C O L E v 15.05 ******************
Lorien version: LORIEN Version 4.2
Forward version: NICOLE Forward v3.6
Compex version: NICOLE Compex v3.5
********************************************************
This is the serial build
WARNING!! Outputfile already exists:inversion.mod
This could result in a mixture of old and new results in that file.
Hopefully you know what you are doing. Proceeding anyway
WARNING!! Outputfile already exists:inversion.mod.err
This could result in a mixture of old and new results in that file.
Hopefully you know what you are doing. Proceeding anyway
WARNING!! Outputfile already exists:inversion.pro
This could result in a mixture of old and new results in that file.
Hopefully you know what you are doing. Proceeding anyway
Warning. Gas pressure near tau=1 is way off typical solar values
Found (cgs): 416.518463 HSRA has: 131000.000
Proceeding anyway (hope you know what you are doing). If the results
are not as expected, this might be the reason
Warning. Density near tau=1 is way off typical solar values
Found (g/cm3): 6.97291391E-10 HSRA has: 3.19000009E-07
Proceeding anyway (hope you know what you are doing). If the results
are not as expected, this might be the reason
Inversion try: 1
At line 120 of file ../compex/compex.f90
Fortran runtime error: Index '0' of dimension 1 of array 'guess_model' below lower bound of 1
Ok, I'll check it tomorrow. If you haven't heard back from me by the end of the week, please give me a nudge. Thanks. Oh, just one more suggestion. To make sure there has not been some file corruption due to an unfinished run, try restoring that directory again from the original ZIP file.
I restored this directory and the full tree, with the same results: segfault with an -O flag, no segfault without it.
Hi, any news about this issue?
Yes, I just figured it out. There was a bug that, for some unknown reason, manifested itself only when using optimization. I just uploaded a new version. It should work now. Thanks for catching that.
Now I get random free(): invalid pointer crashes. Sometimes I can run the test 10-20 times in a row with no crashes, but sometimes it crashes. I am also trying to run it with real data and get either free(): invalid pointer or a dimension error when the -fbounds-check flag is used.
******************* inv1 *********************
This will test a simple LTE inversion
Starting run. Output will be kept in inv1/log.txt
(Starting at 13:11:43)
*** Error in `../../main/nicole': free(): invalid pointer: 0x0000000010805390 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7283f)[0x7fdcf615283f]
/lib64/libc.so.6(+0x780ae)[0x7fdcf61580ae]
/lib64/libc.so.6(+0x78db6)[0x7fdcf6158db6]
../../main/nicole[0x4c1bcf]
../../main/nicole[0x4bd22d]
../../main/nicole[0x4c0f28]
../../main/nicole[0x40e7b9]
../../main/nicole[0x40220d]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fdcf6101b05]
../../main/nicole[0x402236]
======= Memory map: ========
00400000-00502000 r-xp 00000000 fe:09 20581620 /local/tomin/devel/tmp/NICOLE/main/nicole
00701000-00702000 r--p 00101000 fe:09 20581620 /local/tomin/devel/tmp/NICOLE/main/nicole
00702000-0070b000 rw-p 00102000 fe:09 20581620 /local/tomin/devel/tmp/NICOLE/main/nicole
0070b000-1012d000 rw-p 00000000 00:00 0
107c5000-10883000 rw-p 00000000 00:00 0 [heap]
7fdcf5e6e000-7fdcf60e0000 rw-p 00000000 00:00 0
7fdcf60e0000-7fdcf627e000 r-xp 00000000 fe:00 1067444 /lib64/libc-2.19.so
7fdcf627e000-7fdcf647d000 ---p 0019e000 fe:00 1067444 /lib64/libc-2.19.so
7fdcf647d000-7fdcf6481000 r--p 0019d000 fe:00 1067444 /lib64/libc-2.19.so
7fdcf6481000-7fdcf6483000 rw-p 001a1000 fe:00 1067444 /lib64/libc-2.19.so
7fdcf6483000-7fdcf6487000 rw-p 00000000 00:00 0
7fdcf6488000-7fdcf64c3000 r-xp 00000000 fe:00 665166 /usr/lib64/libquadmath.so.0.0.0
7fdcf64c3000-7fdcf66c2000 ---p 0003b000 fe:00 665166 /usr/lib64/libquadmath.so.0.0.0
7fdcf66c2000-7fdcf66c3000 r--p 0003a000 fe:00 665166 /usr/lib64/libquadmath.so.0.0.0
7fdcf66c3000-7fdcf66c4000 rw-p 0003b000 fe:00 665166 /usr/lib64/libquadmath.so.0.0.0
7fdcf66c8000-7fdcf66de000 r-xp 00000000 fe:00 1048645 /lib64/libgcc_s.so.1
7fdcf66de000-7fdcf68dd000 ---p 00016000 fe:00 1048645 /lib64/libgcc_s.so.1
7fdcf68dd000-7fdcf68de000 r--p 00015000 fe:00 1048645 /lib64/libgcc_s.so.1
7fdcf68de000-7fdcf68df000 rw-p 00016000 fe:00 1048645 /lib64/libgcc_s.so.1
7fdcf68e0000-7fdcf69e0000 r-xp 00000000 fe:00 1067447 /lib64/libm-2.19.so
7fdcf69e0000-7fdcf6bdf000 ---p 00100000 fe:00 1067447 /lib64/libm-2.19.so
7fdcf6bdf000-7fdcf6be0000 r--p 000ff000 fe:00 1067447 /lib64/libm-2.19.so
7fdcf6be0000-7fdcf6be1000 rw-p 00100000 fe:00 1067447 /lib64/libm-2.19.so
7fdcf6be8000-7fdcf6d00000 r-xp 00000000 fe:00 664043 /usr/lib64/libgfortran.so.3.0.0
7fdcf6d00000-7fdcf6eff000 ---p 00118000 fe:00 664043 /usr/lib64/libgfortran.so.3.0.0
7fdcf6eff000-7fdcf6f00000 r--p 00117000 fe:00 664043 /usr/lib64/libgfortran.so.3.0.0
7fdcf6f00000-7fdcf6f02000 rw-p 00118000 fe:00 664043 /usr/lib64/libgfortran.so.3.0.0
7fdcf6f08000-7fdcf6f28000 r-xp 00000000 fe:00 1068374 /lib64/ld-2.19.so
7fdcf70f4000-7fdcf70f8000 rw-p 00000000 00:00 0
7fdcf7125000-7fdcf7128000 rw-p 00000000 00:00 0
7fdcf7128000-7fdcf7129000 r--p 00020000 fe:00 1068374 /lib64/ld-2.19.so
7fdcf7129000-7fdcf712a000 rw-p 00021000 fe:00 1068374 /lib64/ld-2.19.so
7fdcf712a000-7fdcf712b000 rw-p 00000000 00:00 0
7fffb93c5000-7fffb93ec000 rw-p 00000000 00:00 0 [stack]
7fffb9400000-7fffb9402000 r-xp 00000000 00:00 0 [vdso]
7fffb9402000-7fffb9404000 r--p 00000000 00:00 0 [vvar]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
** ERROR!! NICOLE crashed during this test
(Finished at 13:11:58)
Built with '-g -fbounds-check -fbacktrace -O3' it produces no backtrace, only a silent "crashed" message:
******************* syn3 *********************
This will test a simple synthesis in NLTE with a magnetic atmosphere
Starting run. Output will be kept in syn3/log.txt
(Starting at 13:21:39)
The run has completed normally
(Finished at 13:21:41)
Checking the results produced...
Results appear to be correct
******************* inv1 *********************
This will test a simple LTE inversion
Starting run. Output will be kept in inv1/log.txt
(Starting at 13:21:41)
** ERROR!! NICOLE crashed during this test
(Finished at 13:21:56)
******************* inv2 *********************
This will test two LTE inversions
Starting run. Output will be kept in inv2/log.txt
(Starting at 13:21:56)
** ERROR!! NICOLE crashed during this test
(Finished at 13:22:5)
==========================================
Unfortunately your build of NICOLE has failed one or more tests :(
Did you make clean before recompiling the new version?
Yes, I ran make clean, git clean -f and git reset --hard before building the new version.
Output of the python run_nicole.py command in the inv1 directory:
python run_nicole.py
Checking syntax in file:LINES
Preparing cycle 1
Checking syntax in file:NICOLE.input
... no errors found
Preparing file with observed profiles...100%
Preparing file with input model...100%
Preparing cycle 2
Checking syntax in file:NICOLE.input_2
... no errors found
Starting code execution
*************** N I C O L E v 15.06 ******************
Lorien version: LORIEN Version 4.2
Forward version: NICOLE Forward v3.6
Compex version: NICOLE Compex v3.5
********************************************************
This is the serial build
WARNING!! Outputfile already exists:inversion.mod
This could result in a mixture of old and new results in that file.
Hopefully you know what you are doing. Proceeding anyway
WARNING!! Outputfile already exists:inversion.mod.err
This could result in a mixture of old and new results in that file.
Hopefully you know what you are doing. Proceeding anyway
Warning. Gas pressure near tau=1 is way off typical solar values
Found (cgs): 416.518463 HSRA has: 131000.000
Proceeding anyway (hope you know what you are doing). If the results
are not as expected, this might be the reason
Warning. Density near tau=1 is way off typical solar values
Found (g/cm3): 6.97291391E-10 HSRA has: 3.19000009E-07
Proceeding anyway (hope you know what you are doing). If the results
are not as expected, this might be the reason
Inversion try: 1
iter= 0 Lambda= 0.100E-02 Regul= 0.250E-02 Chisq= 0.449E+04
iter= 1 Lambda= 0.100E-03 Regul= 0.184E-01 Chisq= 0.392E+04
iter= 2 Lambda= 0.100E-04 Regul= 0.296E-01 Chisq= 0.370E+03
REJECTED: --- iter= 3 Lambda= 0.100E-03 Regul= 0.124E-01 Chisq= 0.661E+03
iter= 4 Lambda= 0.100E-04 Regul= 0.956E-02 Chisq= 0.238E+03
REJECTED: --- iter= 5 Lambda= 0.100E-03 Regul= 0.121E-01 Chisq= 0.585E+03
REJECTED: --- iter= 6 Lambda= 0.100E-02 Regul= 0.120E-01 Chisq= 0.582E+03
REJECTED: --- iter= 7 Lambda= 0.100E-01 Regul= 0.120E-01 Chisq= 0.529E+03
REJECTED: --- iter= 8 Lambda= 0.100E+00 Regul= 0.480E-01 Chisq= 0.253E+03
Chisq= 238.098831 . Best so far= 238.098831
Inversion try: 2
iter= 0 Lambda= 0.100E-02 Regul= 0.845E-02 Chisq= 0.679E+04
iter= 1 Lambda= 0.100E-03 Regul= 0.293E-01 Chisq= 0.397E+02
REJECTED: --- iter= 2 Lambda= 0.100E-02 Regul= 0.333E-01 Chisq= 0.564E+04
REJECTED: --- iter= 3 Lambda= 0.100E-01 Regul= 0.308E-01 Chisq= 0.382E+04
iter= 4 Lambda= 0.100E-02 Regul= 0.866E-02 Chisq= 0.258E+02
REJECTED: --- iter= 5 Lambda= 0.100E-01 Regul= 0.369E-01 Chisq= 0.374E+03
REJECTED: --- iter= 6 Lambda= 0.100E+00 Regul= 0.112E+00 Chisq= 0.459E+02
iter= 7 Lambda= 0.100E-01 Regul= 0.190E-01 Chisq= 0.177E+02
REJECTED: --- iter= 8 Lambda= 0.100E+00 Regul= 0.412E-01 Chisq= 0.490E+02
iter= 9 Lambda= 0.100E-01 Regul= 0.515E-01 Chisq= 0.661E+01
iter= 10 Lambda= 0.100E-02 Regul= 0.489E-01 Chisq= 0.464E+01
REJECTED: --- iter= 11 Lambda= 0.100E-01 Regul= 0.466E-01 Chisq= 0.960E+01
REJECTED: --- iter= 12 Lambda= 0.100E+00 Regul= 0.178E+00 Chisq= 0.579E+02
iter= 13 Lambda= 0.100E-01 Regul= 0.731E-01 Chisq= 0.453E+01
REJECTED: --- iter= 14 Lambda= 0.100E+00 Regul= 0.723E-01 Chisq= 0.534E+01
iter= 15 Lambda= 0.100E-01 Regul= 0.892E-01 Chisq= 0.426E+01
REJECTED: --- iter= 16 Lambda= 0.100E+00 Regul= 0.897E-01 Chisq= 0.604E+01
REJECTED: --- iter= 17 Lambda= 0.100E+01 Regul= 0.102E+00 Chisq= 0.454E+01
iter= 18 Lambda= 0.100E+00 Regul= 0.906E-01 Chisq= 0.423E+01
REJECTED: --- iter= 19 Lambda= 0.100E+01 Regul= 0.105E+00 Chisq= 0.452E+01
REJECTED: --- iter= 20 Lambda= 0.100E+01 Regul= 0.921E-01 Chisq= 0.425E+01
REJECTED: --- iter= 21 Lambda= 0.100E+01 Regul= 0.921E-01 Chisq= 0.425E+01
REJECTED: --- iter= 22 Lambda= 0.100E+01 Regul= 0.921E-01 Chisq= 0.425E+01
Chisq= 4.23459530 . Best so far= 4.23459530
Point 1 of 1 done by process 0
Inversion Cycle: 2
WARNING!! Outputfile already exists:inversion.mod
This could result in a mixture of old and new results in that file.
Hopefully you know what you are doing. Proceeding anyway
WARNING!! Outputfile already exists:inversion.mod.err
This could result in a mixture of old and new results in that file.
Hopefully you know what you are doing. Proceeding anyway
WARNING!! Outputfile already exists:inversion.pro
This could result in a mixture of old and new results in that file.
Hopefully you know what you are doing. Proceeding anyway
Inversion try: 1
iter= 0 Lambda= 0.100E-02 Regul= 0.464E-01 Chisq= 0.424E+01
At line 258 of file ../numerical_recipes/svdcmp.f90
Fortran runtime error: Index '0' of dimension 1 of array 'w' below lower bound of 1
And one of the crash results in inv2:
python run_nicole.py
Checking syntax in file:LINES
Preparing cycle 1
Checking syntax in file:NICOLE.input
... no errors found
Preparing file with input model...100%
Preparing cycle 2
Checking syntax in file:NICOLE.input_2
... no errors found
Starting code execution
*************** N I C O L E v 15.06 ******************
Lorien version: LORIEN Version 4.2
Forward version: NICOLE Forward v3.6
Compex version: NICOLE Compex v3.5
********************************************************
This is the serial build
WARNING!! Outputfile already exists:inversion.model_2
This could result in a mixture of old and new results in that file.
Hopefully you know what you are doing. Proceeding anyway
WARNING!! Outputfile already exists:inversion.model_2.err
This could result in a mixture of old and new results in that file.
Hopefully you know what you are doing. Proceeding anyway
WARNING!! Outputfile already exists:inversion.pro
This could result in a mixture of old and new results in that file.
Hopefully you know what you are doing. Proceeding anyway
Inversion try: 1
iter= 0 Lambda= 0.100E-02 Regul= 0.608E-02 Chisq= 0.270E+02
Clipping microturbulence
iter= 1 Lambda= 0.100E-03 Regul= 0.289E+00 Chisq= 0.591E+01
Clipping microturbulence
REJECTED: --- iter= 2 Lambda= 0.100E-02 Regul= 0.262E+00 Chisq= 0.244E+03
Clipping microturbulence
REJECTED: --- iter= 3 Lambda= 0.100E-01 Regul= 0.309E+00 Chisq= 0.128E+03
Clipping microturbulence
iter= 4 Lambda= 0.100E-02 Regul= 0.291E+00 Chisq= 0.556E+01
Clipping microturbulence
iter= 5 Lambda= 0.100E-03 Regul= 0.302E+00 Chisq= 0.352E+01
Clipping microturbulence
REJECTED: --- iter= 6 Lambda= 0.100E-02 Regul= 0.316E+00 Chisq= 0.436E+01
At line 258 of file ../numerical_recipes/svdcmp.f90
Fortran runtime error: Index '0' of dimension 1 of array 'w' below lower bound of 1
I've tried to build nicole with the flag -fimplicit-none and got a lot of error messages like this: Error: Symbol 'x' at (1) has no IMPLICIT type
I've tried some print debugging and found that when I add a print like this
--- a/lorien/lorien.f90
+++ b/lorien/lorien.f90
@@ -488,6 +488,7 @@ Subroutine Compute_trial_model(Params, Nodes, Guess_model, Lambda, &
!
Do i_param=1,Params%n_free_parameters
Alpha(i_param, i_param)=Alpha(i_param, i_param)*(1.+Lambda)
+ print *,'DEBUG :: compute trial model Alpha(i_param,i_param):',Alpha(i_param, i_param),'i_param:',i_param,'Lambda:',Lambda
End do
Call SVD_solve(Params%n_free_parameters, Params%SVD_threshold, &
Alpha, Beta, DeltaX, Zeroed)
I get NaN in the last Alpha element right before the crash in the inv1 test:
iter= 0 Lambda= 0.100E-02 Regul= 0.464E-01 Chisq= 0.424E+01
DEBUG :: compute trial model Alpha(i_param,i_param): 3.94508386 i_param: 1 Lambda: 1.00000005E-03
DEBUG :: compute trial model Alpha(i_param,i_param): 446.370789 i_param: 2 Lambda: 1.00000005E-03
DEBUG :: compute trial model Alpha(i_param,i_param): 7104.19580 i_param: 3 Lambda: 1.00000005E-03
DEBUG :: compute trial model Alpha(i_param,i_param): 889.868469 i_param: 4 Lambda: 1.00000005E-03
DEBUG :: compute trial model Alpha(i_param,i_param): 3.84648681 i_param: 5 Lambda: 1.00000005E-03
DEBUG :: compute trial model Alpha(i_param,i_param): 129.128143 i_param: 6 Lambda: 1.00000005E-03
DEBUG :: compute trial model Alpha(i_param,i_param): 23.9193592 i_param: 7 Lambda: 1.00000005E-03
DEBUG :: compute trial model Alpha(i_param,i_param): 1.24847507 i_param: 8 Lambda: 1.00000005E-03
DEBUG :: compute trial model Alpha(i_param,i_param): 3.40955758 i_param: 9 Lambda: 1.00000005E-03
DEBUG :: compute trial model Alpha(i_param,i_param): 140.744843 i_param: 10 Lambda: 1.00000005E-03
DEBUG :: compute trial model Alpha(i_param,i_param): 636.315491 i_param: 11 Lambda: 1.00000005E-03
DEBUG :: compute trial model Alpha(i_param,i_param): 12.6341963 i_param: 12 Lambda: 1.00000005E-03
DEBUG :: compute trial model Alpha(i_param,i_param): 7.49877357 i_param: 13 Lambda: 1.00000005E-03
DEBUG :: compute trial model Alpha(i_param,i_param): 0.610701323 i_param: 14 Lambda: 1.00000005E-03
DEBUG :: compute trial model Alpha(i_param,i_param): NaN i_param: 15 Lambda: 1.00000005E-03
At line 258 of file ../numerical_recipes/svdcmp.f90
Fortran runtime error: Index '0' of dimension 1 of array 'w' below lower bound of 1
Hope it helps to find the problem.
The NaN comes from lorien/lorien.f90, line 325:
OldX=X(i_param)
If (X(i_param)+Pertur .lt. X_max(i_param)) then
X(i_param)=X(i_param)+Pertur ! X is dimensionless and ~1
Else
X(i_param)=X(i_param)-Pertur
End if
Due to the lack of precision, (X(i_param)-OldX) becomes zero, and on the following lines a division by zero occurs that produces the NaN.
I caught some values after line 325 that lead to a zero difference:
OldX: -7.67014093E+09 X(i_param) -7.67014093E+09 Pertur 9.99999978E-03
OldX: -6.44019001E+25 X(i_param) -6.44019001E+25 Pertur 9.99999978E-03
OldX: 631611904. X(i_param) 631611904. Pertur 9.99999978E-03
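The absorption described above can be reproduced outside NICOLE. The following is only a minimal sketch in Python, assuming that X, OldX and Pertur are default single-precision reals (which the 9-digit printed values suggest but the thread does not confirm); the helper names to_f32 and perturb_f32 are made up for this illustration and single precision is emulated with the standard struct module.

```python
import struct

def to_f32(value):
    """Round a Python float (double) to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]

def perturb_f32(x, pertur):
    """Emulate X = X + Pertur where all quantities are single precision (assumption)."""
    return to_f32(to_f32(x) + to_f32(pertur))

PERTUR = 0.01  # printed in the log as 9.99999978E-03

# The three OldX values reported in the thread, plus a well-behaved X ~ 1.
for old_x in (-7.67014093e9, -6.44019001e25, 631611904.0, 1.0):
    new_x = perturb_f32(old_x, PERTUR)
    step = new_x - to_f32(old_x)
    # step == 0 means the perturbation was lost entirely, so any
    # finite difference divided by (X - OldX) would divide by zero.
    print(f"OldX={old_x:.8e}  X-OldX={step:.3e}  absorbed={step == 0.0}")
```

For the three reported values the spacing between adjacent single-precision numbers is far larger than 0.01, so the perturbation vanishes; for X of order unity it survives. Subtracting Pertur instead of adding it (the Else branch) is absorbed in the same way.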
Those are bad values. Memory corruption or something like that. X should be of the order of unity, plus or minus a couple of orders of magnitude at the most. I need to debug it.
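A range check of the kind implied here could look like the sketch below. It is purely illustrative and not part of NICOLE: the function name looks_suspicious and the threshold of 1.0e2 (one reading of "a couple of orders of magnitude") are assumptions chosen for the example.

```python
import math

def looks_suspicious(x, max_magnitude=1.0e2):
    """Return True if x is non-finite or far larger in magnitude than order unity.

    max_magnitude is a hypothetical threshold, not a value taken from NICOLE.
    """
    return (not math.isfinite(x)) or abs(x) > max_magnitude

# One plausible value, the three values reported in the thread, and a NaN.
for value in (0.87, -7.67014093e9, -6.44019001e25, 631611904.0, float("nan")):
    print(f"{value!r:>22}  suspicious={looks_suspicious(value)}")
```

Placing such a check right before the perturbation step would flag the corrupted values at the point where they first appear, rather than later when the NaN reaches the SVD routine.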
I had forgotten about this. A bunch of bugs have been fixed in previous versions. Is this still happening with the current version (16.08)?
The current version, 17.02, looks good. I ran the tests a dozen times and had no issues.
Good! Thanks for letting me know. Closing this issue
Test inv1 fails if nicole is built with any optimization flag (I've tried -O1, -O2, -O3).
A build with no optimization flags passes the test without a segfault. Here is the test log with nicole built with the -g -O1 flags:
gfortran --version
GNU Fortran (SUSE Linux) 4.8.1 20130909 [gcc-4_8-branch revision 202388]
Copyright (C) 2013 Free Software Foundation, Inc.