Closed lxmota closed 8 months ago
I believe implicit works. I converted the ACE_MiniErosion_Denudation_Tempus_NoErosion_Parallel test to run implicit with BE, and the results are effectively the same. Note also that we have some stand-along Thermal ACE tests that are run both explicitly and implicitly, e.g.,
ACE_StandAloneThermal2D_2Blocks
ACE_StandAloneThermal2D_2Blocks_Explicit
I verified these awhile ago when I wrote the tests.
It's worth noting that when I switch the erosion tests to run with BE, there are some noticeable differences in the temperature solution. I think this is because these tests are very very coarse. I am thinking to not worry about these since for ACI/NH we will not have erosion. Thoughts? I could also try a larger problem from Jenn with BE to see what it does and whether the results are fundamentally different, as a sanity check.
I did more testing and for certain problems, the Jacobian has nans when the thermal problem is run implicitly. The issue is overflow due to some of the logic to define the thermal material parameters. Specifically, the culprit is qebt used here to define icurr and dfdT:
I put the following print statement after these lines of code:
std::cout << "IKIKT A, G, C, qebt, v, pow = " << A << ", " << G << ", " << C << ", " << qebt << ", " << v << ", " << pow(C + qebt, 1.0 / v) << "\n";
and here is what it prints:
IKIKT A, G, C, qebt, v, pow = 0, 1, 1, 4.98988e+30 [ -4.70914e+29 -1.75747e+30 -6.55899e+30 -1.75747e+30 -1.75747e+30 -6.55899e+30 -2.44785e+31 -6.55899e+30 ], 0.1, 9.56968e+306 [ -9.03128e+306 -3.37052e+307 -1.2579e+308 -3.37052e+307 -3.37052e+307 -1.2579e+308 -inf -1.2579e+308 ]
qebt basically overflows, and so taking a power of it is even worse.
I think the issue can be avoided by changing this line: https://github.com/sandialabs/LCM/blob/main/src/LCM/evaluators/ACEThermalParameters_Def.hpp#L294 . Why is tol_qebt = O(10^30)? This is very large and will certainly overflow. I think if it is made slightly smaller, the nans can be avoided. Indeed, if I change tol_qebt = 6.6e+28, the nans are gone.
@lxmota : @jennifer-frederick said that you were the one that put the tol_qebt value to what it is. What was the rationale for choosing the value that it is set to? Can we make it smaller? It seems like it must be quite a few orders of magnitude smaller to avoid nans when being exponentiated.
We discussed this today at the ACE meeting. @lxmota said the tol_qebt was carefully designed to prevent overflow of the values calculated with the code; however, it was not designed for the derivatives to not overflow. @jennifer-frederick : do you have a document describing the expressions in ACE_ThermalParameters_Def.hpp? The best thing to do would be to work out what tol_qebt should be to avoid NaNs in the derivs. In the meantime, we could just reduce tol_qebt to avoid the numerically issues.
The expressions for f and dfdT from @jennifer-frederick are:
To figure out the value of tol_qebt, it was determined what should be the value of qebt = Qexp(BT) such that, when you take the appropriate power of it, you are below the max double represented in C++,
1.79769313486232e+308
This works out to be 6.6e30. To fix the overflow in the Jacobian, I think we need to take d^2f/dT^2, which will have some expression also involving qebt or its power, and figure out how to prevent overflow there (prevent the values to be above the O(1e308) value above).
This has been resolved, including the issue causing NaNs in the Jacobian.
Likely required for multi-year runs planned for NH/ACI.