libantioch / antioch

C++ Chemical Kinetics, Thermodynamics, and Transport Library
https://libantioch.github.io/

Test failing: kinetics_vec_unit_air_5sp #235

Closed. dsondak closed this issue 7 years ago.

dsondak commented 7 years ago

The kinetics_vec_unit_air_5sp test fails when I run make check. I have included a list of my currently loaded modules and the error message from the .log file below. Please let me know if any further information is needed.

Here is a list of my currently loaded modules:

Currently Loaded Modules:
  1) c7
  2) gcc/5.2
  3) boost/1.58.0
  4) hdf5/1.8.15-patch1
  5) grvy/0.32.0
  6) python/2.7.11
  7) gsl/2.0
  8) openmpi/1.10.0
  9) vtk/6.3.0
  10) eigen/3.2.6
  11) cmake/3.3.2
  12) mkl/15.3
  13) petsc/3.6.2-cxx-opt
  14) glpk/4.56
  15) texlive/2015-11-05

The log file contains the following message:

Assertion `!has_nan(exppower)' failed.
../src/kinetics/include/antioch/reaction.h, line 792, compiled Dec 20 2016 at 10:29:26
terminate called after throwing an instance of 'Antioch::LogicError'
  what(): Error in Antioch internal logic
./kinetics_vec_unit_air_5sp.sh: line 7: 1482 Aborted (core dumped) $PROG $INPUT

pbauman commented 7 years ago

This is with current master?

dsondak commented 7 years ago

Yes. Roy's helping me track down the problem.

roystgnr commented 7 years ago

"helping" appears to be an overestimate. I've at least determined that it's one of the NASA polynomials which is returning NaNs for h_RT_minus_s_R() on an Eigen input.

The problem disappears with -O0.

The problem reappears with -O1.

The problem disappears again if I use all the options which should be enabled by -O1! (As determined via the diff trick suggested at https://gcc.gnu.org/onlinedocs/gcc/Overall-Options.html#index-target-help-90.)

I stole David's desk for better than an hour with no luck; I'm trying to see if I can reproduce the problem on my own account now.
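
For readers unfamiliar with the quantity involved: h_RT_minus_s_R() presumably evaluates h/(RT) - s/R from a NASA curve fit. The following is only a minimal scalar sketch, assuming the standard 7-coefficient NASA form; the failing test may use a different fit, and Antioch's real implementation is templated over vector types such as Eigen arrays. The names h_RT_minus_s_R_sketch and any_nan_sketch are hypothetical and are not part of Antioch.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper, NOT Antioch's API.
// Assumes the standard 7-coefficient NASA form, with a[0..6] = a1..a7:
//   h/(RT) = a1 + a2*T/2 + a3*T^2/3 + a4*T^3/4 + a5*T^4/5 + a6/T
//   s/R    = a1*ln(T) + a2*T + a3*T^2/2 + a4*T^3/3 + a5*T^4/4 + a7
double h_RT_minus_s_R_sketch(const std::array<double, 7>& a, double T)
{
  assert(T > 0.0); // log(T) and a6/T are only meaningful for positive T

  const double T2 = T * T;
  const double T3 = T2 * T;
  const double T4 = T3 * T;

  const double h_RT = a[0] + a[1] * T / 2 + a[2] * T2 / 3
                    + a[3] * T3 / 4 + a[4] * T4 / 5 + a[5] / T;
  const double s_R  = a[0] * std::log(T) + a[1] * T + a[2] * T2 / 2
                    + a[3] * T3 / 3 + a[4] * T4 / 4 + a[6];
  return h_RT - s_R;
}

// Hypothetical stand-in for the kind of check behind the failed
// `!has_nan(exppower)' assertion: true if any entry is NaN.
bool any_nan_sketch(const std::vector<double>& values)
{
  for (double v : values)
    if (std::isnan(v))
      return true;
  return false;
}
```

Evaluating a scalar version like this at the same temperatures as the Eigen input is one way to tell whether NaNs come from the coefficients themselves or only from the vectorized, optimized evaluation.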

pbauman commented 7 years ago

IIRC, didn't we have some sensitivity to Eigen versions in some of the tests (which I'd assumed were Eigen bugs)? What Eigen version is he using? Travis uses Eigen in the build, so I wonder if he's using a different Eigen version than the one we use on Travis?

pbauman commented 7 years ago

facepalm He posted the module list. Let me double check what version of Eigen Travis is using.

pbauman commented 7 years ago

Here we are: https://github.com/libantioch/antioch/pull/192

Looks like it's 3.2.0-8. So maybe a bug in Eigen?

roystgnr commented 7 years ago

There are definitely regressions in Eigen... I'm having to disable the Eigen::Matrix-of-Eigen::Array tests in kinetics_regression_vec.C

roystgnr commented 7 years ago

But with the one regressed portion-of-a-test commented out, I can't reproduce the failure (in a completely different test) that @dsondak is seeing.

roystgnr commented 7 years ago

Using the module list from @dsondak, with plain "../configure", I still can't compile without the changes in https://github.com/libantioch/antioch/pull/236...

But after those changes are in, I can now replicate the failure in this issue. That's a plus.

pbauman commented 7 years ago

> Using the module list from @dsondak, with plain "../configure", I still can't compile without the changes in #236...

You couldn't even compile?

roystgnr commented 7 years ago

Couldn't even compile.

pbauman commented 7 years ago

WTF? So how did he get as far as he did?

roystgnr commented 7 years ago

Hell if I know. But I think I'm just about done investigating the gcc/5.2 test failure. The bug goes away with -O0, even if I add what should be every single optimization flag in -O1; the bug goes away with gcc 6.1; the bug goes away with gcc 4.9...

I get a few different "make check" failures with intel 17, and one of them is in the same test, so we might be in the situation we were in last year where some dubious code gets interpreted as expected most of the time but goes all nasal demons on us. Let me try intel 17 with -O0 and if I'm still getting any non-Heisenbug failures I'll work on those.

roystgnr commented 7 years ago

I'm done with intel too. Stepping through gdb, with "-g -O0",

(gdb) 
test_values<float> (Cf=@0x7fffffff9a70: 1.39999998, eta=@0x7fffffff99d4: 0, Ea=@0x3ff0000000000000: <error reading variable>, 
    D=@0x150: <error reading variable>, Tref=@0x7fffffff9a80: 1, R=@0x7fffffff9a84: 1, rate_base=...) at ../../test/kinetics_settings_unit.C:56 
56          const Scalar rate_exact = Cf*pow(T/Tref,eta)*exp(-Ea/(R*T) + D * T); 
(gdb) 
std::pow (__x=<optimized out>, __y=<optimized out>) at /usr/include/c++/4.8.5/cmath:408 
408       { return __builtin_powf(__x, __y); } 
(gdb) 
test_values<float> (Cf=@0x7fffffff9a70: 1.39999998, eta=@0x7fffffff99d4: 0, Ea=@0x0: <error reading variable>, D=@0x0: <error reading variable>, 
    Tref=@0x1ff96: <error reading variable>, R=@0x42e8000000000000: <error reading variable>, rate_base=...) at ../../test/kinetics_settings_unit.C:59 

If a compiler can't handle pow(float, float) without smashing the stack then clearly the solution is "never use that compiler". Hopefully it's a bug in the gcc 4.8.5 headers that's just getting triggered by icpc somehow? If I thought it was really an icpc 17 bug then I'd probably still boil it down to a test case to send Intel, but I get failures in the same tests from icpc 16 and 15. So it's probably a system header bug that I shouldn't trouble them with... otherwise they've had a broken compiler for three generations, they're not spending nearly enough on QA, and I'd just feel like a scab trying to help.
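
A boiled-down test case along those lines might look like the following sketch. It reuses the expression from line 56 of kinetics_settings_unit.C and the Cf, eta, Tref and R values visible in the gdb output above. Ea, D and T are unreadable or not shown there, so the values used below are assumptions, and both function names are hypothetical.

```cpp
#include <cmath>

// Same expression as line 56 of kinetics_settings_unit.C, with the
// arguments taken by reference as in the gdb frame above.
template <typename Scalar>
Scalar rate_exact_sketch(const Scalar& Cf, const Scalar& eta,
                         const Scalar& Ea, const Scalar& D,
                         const Scalar& Tref, const Scalar& R,
                         const Scalar& T)
{
  using std::exp;
  using std::pow;
  return Cf * pow(T / Tref, eta) * exp(-Ea / (R * T) + D * T);
}

// Returns true if the result is not NaN and every by-reference input
// still holds its original value after the pow()/exp() calls.
bool inputs_survive_call_sketch()
{
  // Cf, eta, Tref, R as shown by gdb; Ea, D and T are ASSUMED values.
  const float Cf = 1.4f, eta = 0.0f, Tref = 1.0f, R = 1.0f;
  const float Ea = 0.0f, D = 0.0f, T = 1500.0f;

  const float rate = rate_exact_sketch(Cf, eta, Ea, D, Tref, R, T);

  return !std::isnan(rate)
      && Cf == 1.4f && eta == 0.0f && Tref == 1.0f && R == 1.0f
      && Ea == 0.0f && D == 0.0f && T == 1500.0f;
}
```

On a healthy toolchain inputs_survive_call_sketch() should return true. A false result, or a debugger showing corrupted arguments after the std::pow frame returns, would point at the compiler and header combination rather than at Antioch.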

roystgnr commented 7 years ago

I do kind of want to file a "-O0 still optimizes out local variables" bug report, because WTF, but that seems like too big a compiler issue to be an unintentional bug rather than a deliberate design mistake.