I've run your experiment (provided separately) with Intel and GNU, debug and repro, but all appear to be running without any problems.
Could this possibly have already been fixed? What is the hash of your MOM6? I am using 7883f633 (from 6 August).
I was finally able to replicate something here, though not exactly what you are seeing. These are from Intel 18 "debug":
```
target time yyyy/mm/dd hh:mm:ss= 1980/ 1/ 1 3:57:30
t1, t2, w1, w2= 1 2 0.989035087719298 1.096491228070175E-002
ibuf= 1 2
i1,i2= 1 2
```
```
forrtl: error (182): floating invalid - possible uninitialized real/complex variable.
Image              PC                Routine            Line     Source
MOM6               0000000005F4B71E  Unknown            Unknown  Unknown
MOM6               0000000005901AC0  Unknown            Unknown  Unknown
MOM6               000000000588F688  icepack_mechred_m  1239     icepack_mechred.F90
MOM6               00000000058804E8  icepack_mechred_m  386      icepack_mechred.F90
MOM6               0000000002C7151D  ice_ridging_mod_m  380      ice_ridge.F90
MOM6               0000000003442050  sis_transport_mp_  265      SIS_transport.F90
MOM6               0000000000C6A31E  sis_dyn_trans_mp_  615      SIS_dyn_trans.F90
MOM6               0000000002165A33  ice_model_mod_mp_  327      ice_model.F90
MOM6               0000000002161E56  ice_model_mod_mp_  156      ice_model.F90
MOM6               0000000000F3E4CC  MAIN__             1034     coupler_main.F90
MOM6               00000000004015EE  Unknown            Unknown  Unknown
MOM6               000000000601E34F  Unknown            Unknown  Unknown
MOM6               00000000004014DA  Unknown            Unknown  Unknown
```
I am using the GFDL branches of the packages, plus whatever flags are in the mkmf templates, so it's probably slightly different from what you are running.
It took quite a few timesteps for a problem to emerge (about 48 timesteps here).
The error here is coming from Icepack, so perhaps this is where the problem is arising. It seemed to happen at the 4-hour mark, although I don't see anything significant about that timestep. (Coupling steps are 1 hr.)
Will keep looking.
Edit: Not exactly sure why, but I now get your error in regrid_edge_values.F90. It could have been an unrelated initialization issue in Icepack, which I can no longer replicate.
Somehow this line has resulted in a division by zero: https://github.com/NOAA-GFDL/MOM6/blob/e88cdaa785b9fab4e24286031bdd409caf673034/src/ALE/regrid_edge_values.F90#L291

The zero comes from `h1 + (h2 + h3)`, even though there is a test here to ensure that `h2 + h3` is nonzero:

https://github.com/NOAA-GFDL/MOM6/blob/e88cdaa785b9fab4e24286031bdd409caf673034/src/ALE/regrid_edge_values.F90#L266-L276

I don't know how it happened, but somehow `h1` is not only negative, but exactly equal to the negative of `h2 + h3`:
```
136: h0          0.000000000000000E+000
136: h1         -5215.31998046875
136: h2          2.20500000000000
136: h3          5213.11498046875
136: h2+h3       5215.31998046875
136: h1+(h2+h3)  0.000000000000000E+000
```
While I could just add an additional check for `h1 + (h2 + h3) == 0`, I am guessing it might be more important to understand how this happened.
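For illustration only, such a guard might look like the following (a sketch, not an adopted fix; `edge_val`, `u_l`, and `u_r` are hypothetical names standing in for the actual edge value and adjacent cell averages):

```fortran
! Hypothetical guard, not necessarily the right fix: if roundoff (or a bad
! input thickness, as seen above) makes the denominator vanish, fall back
! to a simple average instead of dividing by h1 + (h2 + h3).
if (h1 + (h2 + h3) == 0.0) then
  edge_val = 0.5 * (u_l + u_r)   ! placeholder fallback
else
  ! ... original expression that divides by h1 + (h2 + h3) ...
endif
```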
I see that the line that is crashing is using the 2018_answers version of the code. There is a newer option within the code (generally set by selecting `DEFAULT_2018_ANSWERS = False`, or probably more specifically `REMAPPING_2018_ANSWERS = False`, in your MOM_input files) in which many of the regridding solvers are mathematically equivalent, but have been refactored to make them much more robust and less prone to problems arising from roundoff.
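For reference, the selection in MOM_input would look something like this (flag names as above; whether both flags are needed depends on the MOM6 version):

```
DEFAULT_2018_ANSWERS = False    ! default to the refactored, roundoff-robust solvers
REMAPPING_2018_ANSWERS = False  ! the same choice, specifically for remapping
```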
Basically, instead of naively doing Gaussian elimination for matrix solvers, or just the standard Thomas algorithm for solving tridiagonal matrices, the newer versions algebraically expand the expressions to replace differences of numbers that might be very close together with other positive-definite expressions, effectively replacing `1 / (1*h - (1-gamma)*h)` with `1 / (gamma*h)`, where `h` is a positive-definite layer thickness and we can prove that `0 < gamma <= 1`. Because these algorithmic changes do change answers, we have added run-time flags to use these older versions of the solver expressions to reproduce old solutions, but perhaps we need to be more aggressive in discouraging their use.
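To make the roundoff argument concrete, here is a small self-contained Fortran sketch (with a made-up `gamma`) showing how the naive difference can round to exactly zero while the expanded form stays nonzero:

```fortran
program roundoff_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: h, gamma

  h = 5215.31998046875_dp   ! a layer thickness like the one printed above
  gamma = 1.0e-17_dp        ! small enough that (1 - gamma) rounds to 1 in double

  ! Difference of two nearly equal numbers: rounds to exactly 0, so any
  ! 1/(...) of it would trap as a division by zero.
  print *, 'naive: ', 1.0_dp*h - (1.0_dp - gamma)*h
  ! Algebraically identical, but positive-definite: small yet nonzero.
  print *, 'safe:  ', gamma*h
end program roundoff_demo
```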
This raises several questions for me: (1) Was there a deliberate choice to use the older (and less robust) 2018 versions of the expressions in this experiment? (2) If so, why? (3) If not, how is it that this particular line of code is being called? and (4) Is the negative thickness here a symptom of the use of the older and less robust solvers, or is it a symptom of something else that is unrelated to the remapping solvers?
@Hallberg-NOAA I don't think this is happening with `DEFAULT_2018_ANSWERS = True`. The division by zero is happening on L291, which is outside of the 2018-answers block. I was also able to replicate this error using the flags provided by Liz, which have all of the 2018 answer flags set to False.
The error appears to be coming from the ALE sponge, not the ALE code.
The issue seems to be that `mask_z(4,:,:)` has uninitialized values in its halos, which are then propagated to non-halo values of `mask_u(4,:,:)`, which eventually lead to the non-monotonic and unphysical reference thicknesses and failures inside the ALE code.
Although `mask_z` is initialized here:

the `horiz_interp_and_extrap_tracers` function call on the next line contains a deallocation and re-allocation of `mask_z` (among other variables). When that happens, we should assume that the values get wiped out and replaced with random numbers, although the actual behavior is going to depend on the compiler.
In this case, GCC is probably just re-using the old value, which is why it still works, whereas Intel is allocating a new part of the heap, with random model-crashing values.
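This compiler-dependent behavior is easy to demonstrate in isolation; a minimal sketch (unrelated to the actual MOM6 arrays) is:

```fortran
program realloc_demo
  implicit none
  real, allocatable :: mask(:)

  allocate(mask(4))
  mask = 1.0          ! initialized once, as mask_z is before the call
  print *, mask

  deallocate(mask)    ! what effectively happens inside the function call
  allocate(mask(4))   ! contents are now undefined by the Fortran standard
  print *, mask       ! GCC may show the old 1.0s; Intel may show garbage
end program realloc_demo
```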
In any case, we have to assume that the array is uninitialized, and we either need to re-initialize these values or avoid re-allocating the arrays inside of the function.
Having said all that, even if I re-initialize this array, I still see the Icepack error later in the run (at ~hour 12), so we are not yet through this problem.
Following up on this issue:
I have looked further into this error, and it seems that the SIS2 field `trcrn(1,:)` has uninitialized values inside it. These get passed into the Icepack library, where the bad initialization is detected.
I do not believe that this is due to Icepack, but rather to SIS2. I believe something is happening in this code block:
I am not very familiar with SIS2, but I will keep looking and see if anything makes sense here. There is a lot of pointer manipulation going on, so it's possible that something is not being set properly.
I've looked at this more closely. This seems to be a bug in SIS2.
The `TrReg` (tracer registry) always starts with `Tr_ice_enth_ptr`, but for reasons that I cannot understand, it only fills the first category:
```fortran
324 ntrcr = 0
    ...
327 ntrcr=ntrcr+1
328 trcrn(ntrcr,1) = Tr_ice_enth_ptr(i,j,1,1) ! surface temperature? this is taken from the thinnest ice category
    ...
331 do k=1,nL_ice ; ... ; enddo
337 do k=1,nL_snow ; ... ; enddo
343 do k=1,nL_ice ; ... ; enddo
```
If I just add this:

```fortran
trcrn(ntrcr,2:nCat) = 0.
```

then the model appears to run (so far at least, up to day 2), although I am not sure if this is the preferred solution. We should talk more on Monday.
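In context, the workaround amounts to zeroing the categories that the code block above never fills (a sketch of the placement, following the excerpt above; not necessarily the final fix):

```fortran
ntrcr = ntrcr + 1
trcrn(ntrcr,1)      = Tr_ice_enth_ptr(i,j,1,1)  ! only category 1 is filled
trcrn(ntrcr,2:nCat) = 0.                        ! proposed workaround: zero the rest
```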
This commit appears to have fixed the issue: https://github.com/NOAA-GFDL/SIS2/commit/fa3f6b19920684a4bd172a05d1824aca18a9ff56
(This was in PR https://github.com/NOAA-GFDL/SIS2/pull/179)
I feel comfortable that all of the issues here have been resolved, so I will close the issue.
I am running a regional MOM6 configuration that includes sponges (u, v, tracer) and ice. Currently, the simulation fails if the code is compiled with Intel but runs (very slowly!) if compiled with GNU.
Backtrace for Intel compile, REPRO mode:

Backtrace for Intel compile, DEBUG mode: