Tried setting `bfbflag=reprosum` in the 2x1 case. Still not b4b with the 1x1 case.
Tried with `maxits_nonlin=500`. Still not b4b with the 1x1 case.
OK, comparing 1x1 against {2x1, 4x1, 8x1}, there are some differences, larger for 4x1 and 8x1.
With 5000 nonlinear iterations, the differences are in the same range for 2, 4 and 8 procs (vs. 1), i.e. [-1E-6, 1E-6].
Note: this test runs 1 day, and over these 24 time steps only 4 need more than 500 iterations to reach `reltol_nonlin=1E-8` (the default value in the namelist). With `maxits_nonlin=5000`, the maximum needed is 1398 iterations.
With `reltol_nonlin=1E-12` and `maxits_nonlin=5000`, the difference vs. 1 proc is more or less the same for 2, 4 and 8 procs. Interestingly, we do not come close to reaching 5000 iterations with this value of `reltol_nonlin`.
To get the number of iterations to reach the required tolerance:
\grep Arctic -B1 path_to_case_directory/logs/cice.runlog.220606-19* |\grep monitor
(this is with `diagfreq = 1` and `monitor_nonlin = .true.`)

With `reltol_nonlin=1E-12`, we get differences in the range [-1E-16, 1E-17] for `uvel`, `vvel`, which is much more in line with what is expected; `aice` and `hi` are b4b.

-1 : Date Time Level Gridsize Miss : Minimum Mean Maximum : Parameter name
9 : 2005-01-02 00:00:00 0 11600 3594 : -3.7470e-16 8.6176e-20 3.4478e-16 : aice
11 : 2005-01-02 00:00:00 0 11600 3594 : -4.4409e-16 7.4066e-20 4.4409e-16 : hi
13 : 2005-01-02 00:00:00 0 11600 3594 : -5.2082e-17 8.6478e-17 6.9237e-13 : uvel
14 : 2005-01-02 00:00:00 0 11600 3594 : -5.5376e-13 -6.9204e-17 1.0734e-17 : vvel
And now `aice` and `hi` start to not be b4b, but the differences are still small.
With `precond='diag'`, no thermo, no transport:
-1 : Date Time Level Gridsize Miss : Minimum Mean Maximum : Parameter name
13 : 2005-01-02 00:00:00 0 11600 3594 : -5.6379e-18 6.6466e-20 3.9514e-16 : uvel
14 : 2005-01-02 00:00:00 0 11600 3594 : -2.1605e-16 -5.4175e-21 2.0172e-16 : vvel
With `precond='diag'`, with thermo, with transport, we get a model abort:
istep1: 2 idate: 20050101 sec: 7200
(JRA55_data) reading forcing file 1st ts = /home/ords/cmdd/cmde/sice500//CICE_data/forcing/gx3/JRA55/8XDAILY/JRA55_gx3_03hr_forcing_2005.nc
Rank 2 [Mon Jun 13 21:08:20 2022] [c0-0c0s9n1] application called MPI_Abort(MPI_COMM_WORLD, 128) - process 2
(icepack_warnings_setabort) T :file icepack_itd.F90 :line 900
(cleanup_itd) aggregate ice area out of bounds
(cleanup_itd)aice: 1.00245455360081
(cleanup_itd)n, aicen: 1 0.676531837003209
(cleanup_itd)n, aicen: 2 0.224493247425031
(cleanup_itd)n, aicen: 3 4.769818624818129E-002
(cleanup_itd)n, aicen: 4 3.766552529401467E-002
(cleanup_itd)n, aicen: 5 1.606575763037559E-002
(icepack_warnings_aborted) ... (icepack_step_therm2)
Weird, as I did this test two years ago (https://github.com/phil-blain/CICE/issues/33#issuecomment-654247421), although with `reltol_nonlin=1E-8`...
@JFLemieux73 I'm keeping a record of my MPI experiments in this issue, if you want to stay in the loop.
OK, it's because I forgot to also re-enable ridging; the model did not like that (is that expected?...). EDIT: after discussing with JF, yes, it is expected; convergence can cause that.
OK, with ridging, advection and transport, `reltol_nonlin=1E-12`, `precond='diag'`:
-1 : Date Time Level Gridsize Miss : Minimum Mean Maximum : Parameter name
9 : 2005-01-02 00:00:00 0 11600 3594 : -3.9378e-06 -4.8300e-10 4.4782e-06 : aice
11 : 2005-01-02 00:00:00 0 11600 3594 : -1.9175e-06 -7.3383e-11 4.1247e-06 : hi
13 : 2005-01-02 00:00:00 0 11600 3594 : -2.8908e-07 7.0716e-09 3.3297e-05 : uvel
14 : 2005-01-02 00:00:00 0 11600 3594 : -5.4352e-07 6.0686e-09 2.5828e-05 : vvel
Same with `precond='pgmres'`:
phb001@xc4elogin1(daley): [17:06:02] $ cdo infov diff.nc 2>/dev/null | \grep -E 'name|aice|hi|vel'
-1 : Date Time Level Gridsize Miss : Minimum Mean Maximum : Parameter name
9 : 2005-01-02 00:00:00 0 11600 3594 : -3.8860e-06 3.3100e-09 2.7253e-05 : aice
11 : 2005-01-02 00:00:00 0 11600 3594 : -9.5537e-06 -1.0926e-09 2.0109e-06 : hi
13 : 2005-01-02 00:00:00 0 11600 3594 : -4.9737e-07 5.9547e-10 2.8301e-06 : uvel
14 : 2005-01-02 00:00:00 0 11600 3594 : -2.5361e-06 -5.8569e-10 1.2014e-06 : vvel
I don't think it's only the preconditioner, since these results are similar to those with the `'diag'` preconditioner.
So I did side-by-side, step-by-step debugging of 1x1 vs 2x1. The values are the same on both sides until the first normalization in the FGMRES algorithm. Since we do a global sum of different numbers, in a different order (which mathematically sums to the same result on all decompositions), floating point arithmetic gives a different norm of the residual, and that difference then propagates to the whole vector when we normalize.
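To make the mechanism concrete, here is a minimal, self-contained sketch in plain Fortran (not CICE code) showing that summing the same numbers in a different order can change the last bits of a double precision result:

```fortran
! Toy example: the same three numbers, summed in two different orders,
! give two slightly different double precision results.
program sum_order
   implicit none
   integer, parameter :: dbl = selected_real_kind(13)
   real(dbl), parameter :: big = 1.0e16_dbl, small = 1.0_dbl
   real(dbl) :: s1, s2

   s1 = (big + small) + small   ! one summation order: the small terms are rounded away
   s2 = (small + small) + big   ! another order: the small terms survive

   print *, s1, s2, s1 == s2    ! the two sums differ, so the comparison prints F
end program sum_order
```

This is exactly what happens when the per-block partial sums are combined in a different order on different decompositions: the residual norm differs in its last bits, and the normalization then spreads that difference to the whole solution vector.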
So in the end it is not surprising that we get different results. We will run a QC test of different decompositions against each other to ensure we get the same climate.
Mixed results:
80x1 vs 40x1:
INFO:__main__:Running QC test on the following directories:
INFO:__main__: /home/phb001/data/ppp6/cice/runs/ppp6_intel_smoke_gx1_40x1_medium_qc.qc_40/
INFO:__main__: /home/phb001/data/ppp6/cice/runs/ppp6_intel_smoke_gx1_80x1_medium_qc.qc_80/
INFO:__main__:Number of files: 1825
INFO:__main__:2 Stage Test Passed
INFO:__main__:Quadratic Skill Test Passed for Northern Hemisphere
INFO:__main__:Quadratic Skill Test Passed for Southern Hemisphere
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_40x1_medium_qc.qc_40.png)
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_80x1_medium_qc.qc_80.png)
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_40x1_medium_qc.qc_40_minus_ppp6_intel_smoke_gx1_80x1_medium_qc.qc_80.png)
INFO:__main__:
INFO:__main__:Quality Control Test PASSED
40x1 vs 24x1:
INFO:__main__:Running QC test on the following directories:
INFO:__main__: /home/phb001/data/ppp6/cice/runs/ppp6_intel_smoke_gx1_40x1_medium_qc.qc_40/
INFO:__main__: /home/phb001/data/ppp6/cice/runs/ppp6_intel_smoke_gx1_24x1_medium_qc.qc_24/
INFO:__main__:Number of files: 1825
INFO:__main__:2 Stage Test Passed
INFO:__main__:Quadratic Skill Test Passed for Northern Hemisphere
INFO:__main__:Quadratic Skill Test Failed for Southern Hemisphere
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_40x1_medium_qc.qc_40.png)
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_24x1_medium_qc.qc_24.png)
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_40x1_medium_qc.qc_40_minus_ppp6_intel_smoke_gx1_24x1_medium_qc.qc_24.png)
INFO:__main__:
ERROR:__main__:Quality Control Test FAILED
80x1 vs 24x1:
INFO:__main__:Running QC test on the following directories:
INFO:__main__: /home/phb001/data/ppp6/cice/runs/ppp6_intel_smoke_gx1_80x1_medium_qc.qc_80/
INFO:__main__: /home/phb001/data/ppp6/cice/runs/ppp6_intel_smoke_gx1_24x1_medium_qc.qc_24/
INFO:__main__:Number of files: 1825
INFO:__main__:2 Stage Test Passed
INFO:__main__:Quadratic Skill Test Passed for Northern Hemisphere
INFO:__main__:Quadratic Skill Test Failed for Southern Hemisphere
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_80x1_medium_qc.qc_80.png)
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_24x1_medium_qc.qc_24.png)
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_80x1_medium_qc.qc_80_minus_ppp6_intel_smoke_gx1_24x1_medium_qc.qc_24.png)
INFO:__main__:
ERROR:__main__:Quality Control Test FAILED
OK, that was a false alarm: the 24x1 run hit the walltime and was killed, but the `history` folder contained outputs from an older run done with EVP, so the quality control script was comparing results from two different runs (and that, fortunately, failed!). I re-ran the 24x1 case with a longer walltime and both comparisons (with 40x1 and 80x1) now pass.
I put some more thought into the problem of reproducibility for the global sums, after a comment by @dupontf regarding performing the global sum using quadruple precision.
It turns out we already have that capability in CICE, and also even better algorithms: https://cice-consortium-cice.readthedocs.io/en/master/developer_guide/dg_other.html?highlight=reprosum#reproducible-sums
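As a toy illustration of why a higher-precision or compensated local sum helps (the actual algorithms behind `bfbflag`, such as `ddpdd` and `reprosum`, are more elaborate than this), here is a minimal Kahan (compensated) summation sketch in plain Fortran:

```fortran
! Toy example only: a compensated (Kahan) sum recovers low-order bits that
! a naive running sum loses.
program kahan_demo
   implicit none
   integer, parameter :: dbl = selected_real_kind(13)
   real(dbl) :: x(3) = [1.0e16_dbl, 1.0_dbl, 1.0_dbl]
   real(dbl) :: naive, s, c, y, t
   integer :: i

   naive = 0.0_dbl
   do i = 1, 3
      naive = naive + x(i)   ! the two small terms are rounded away
   end do

   s = 0.0_dbl ; c = 0.0_dbl
   do i = 1, 3
      y = x(i) - c           ! apply the compensation carried so far
      t = s + y
      c = (t - s) - y        ! capture what this addition lost
      s = t
   end do

   print *, 'naive =', naive, ' compensated =', s
end program kahan_demo
```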
I looked more closely at the code and realized I could leverage this capability with only slight modifications. With these modifications done, running 1x1 and 2x1 side by side, I can verify that the global sums done in the dynamics solver are the same on both sides, at least for this configuration.
With these settings, the restarts are bit-for-bit!

Note that I had to also add `dump_last = .true.` in the namelist for the code to create a restart at the end of the run; otherwise it defaults to `dump_freq = 1d`, and the scripts would use an older restart from a previous run, from before I changed `npt` to do a single time step.
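For reference, a hypothetical sketch of the namelist entries involved in this single-time-step test (the namelist group names and the exact `bfbflag` value are my assumptions; only the individual flags come from the discussion above):

```fortran
&setup_nml
  npt       = 1         ! run a single time step
  dump_last = .true.    ! write a restart at the end of the run
/
&domain_nml
  bfbflag   = 'reprosum'   ! reproducible global sums
/
```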
Also passes (b4b) after 1 day (24 time steps).

And as expected, with `precond='pgmres'` it still fails, as we skip some halo updates. And it passes with `precond='pgmres'` if we add back those halo updates.
So in preparation of a PR with these changes (https://github.com/CICE-Consortium/CICE/compare/main...phil-blain:CICE:repro-vp), I'm noticing the new code is noticeably slower than the old.
EDIT: original version is https://github.com/phil-blain/CICE/commits/repro-vp@%7B2022-07-14%7D
This is a little bit surprising ...
Note that this is without `bfbflag`... so the differences are: `global_sum_prod` loops through the whole arrays, not just the ice points as `calc_L2norm_squared` was doing (loop on `icellu`).
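As a schematic, self-contained illustration of that difference (toy arrays and plain Fortran, not the actual CICE routines or interfaces): the old approach loops only over the packed ice-point indices, while a masked whole-array reduction, like the one `global_sum_prod` performs, touches every grid point.

```fortran
! Toy comparison (not CICE code): L2 norm over packed ice points only,
! versus a masked sum over the whole array.
program norm_two_ways
   implicit none
   integer, parameter :: dbl = selected_real_kind(13)
   integer, parameter :: nx = 4, ny = 3
   real(dbl) :: u(nx,ny), norm_packed, norm_masked
   logical  :: icemask(nx,ny)
   integer  :: indxi(nx*ny), indxj(nx*ny), icellu, i, j, ij

   call random_number(u)
   icemask = u > 0.5_dbl            ! pretend these are the ice points

   ! pack the ice-point indices, similar in spirit to indxui/indxuj
   icellu = 0
   do j = 1, ny
      do i = 1, nx
         if (icemask(i,j)) then
            icellu = icellu + 1
            indxi(icellu) = i
            indxj(icellu) = j
         end if
      end do
   end do

   ! "old" way: loop over the icellu packed ice points only
   norm_packed = 0.0_dbl
   do ij = 1, icellu
      norm_packed = norm_packed + u(indxi(ij), indxj(ij))**2
   end do

   ! "new" way: sum u*u over the whole array, with a mask
   norm_masked = sum(u*u, mask=icemask)

   print *, norm_packed, norm_masked  ! mathematically the same number
end program norm_two_ways
```

The two results are mathematically identical, but the second form does work on every grid point, which is part of the extra cost measured below.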
OK, so I played with Intel Trace Analyzer and Collector (ITAC) and Intel VTune, for both versions (old and new) of the code, following this tutorial: https://www.intel.com/content/www/us/en/develop/documentation/itac-vtune-mpi-openmp-tutorial-lin/top.html
First, running Application Performance Snapshot reveals both versions are MPI bound, and have very poor vectorization (note that both runs are 40x1):
(screenshots of the APS summary for the old and new code)
This reveals, however, that it's not only the added communications that slow down the new code, since "MPI time" is 31% for the new code vs. 43% for the old code.
I then ran the VTune "HPC Performance Characterization" analysis for both versions and used the "Compare" feature. This is a ranking of the hotspots by time difference between the new and old versions (right column, `CPU Time: Difference`), with the corresponding timings for those functions in the new code (`CPU Time: mod-vtune-g`):
I confirmed by running under GDB (`mpirun -gdb`) that `MPIDI_SHMGR_release_generic` is called (amongst other MPI subroutines) by `MPI_ALLREDUCE`. So in the VP solver, it is only called by `global_sum_prod` to actually perform the MPI reduction. Notice that the time difference for that function is almost half of the number for the new code, which makes sense since the new code has approximately twice the number of calls to `MPI_ALLREDUCE` compared to the old code (the new code does one global sum for the X components and another for the Y components). I write "approximately" because there are also more calls due to the modified CGS loop. My analysis of these timings is that the modification to the CGS algorithm does not play a big part in the additional time, since the new code spends almost twice as much time in this function, but not a lot more than twice.

The new code also spends a lot of time actually computing the local reductions (functions `global_sum_prod_dbl` and `compute_sums_dbl`).
Getting back to the performance regression after finally getting rid of all the bugs (famous last words) in my new code (see https://github.com/phil-blain/CICE/issues/39#issuecomment-1192815691 and following comments).
I re-ran the QC test cases on `main` (007fbff) and the current tip of my `repro-vp` branch (579e19f), both 80x1, so using all cores of a single node (i.e. twice the number in my previous test). The listing shows that at least `uvel` is unrealistically large:
Warning: Departure points out of bounds in remap
my_task, i, j = 43 8 17
dpx, dpy = -45563.7247538909 12271.9759813932
HTN(i,j), HTN(i+1,j) = 33338.1913820475 33168.9296994831
HTE(i,j), HTE(i,j+1) = 47781.5593319368 47977.5239199294
(print_state) bad departure points
(print_state) istep1, my_task, i, j, iblk: 33867 43 8 17 11
(print_state) Global block: 884
(print_state) Global i and j: 31 368
(print_state) Lat, Lon (degrees): 67.5273801992820 -16.4194498698244
aice 9.070213892498866E-006
aice0 0.999990929786107
...
uvel(i,j) 12.6565902094141
vvel(i,j) -3.40888221705366
atm states and fluxes
uatm = -0.848320343386380
vatm = 2.65704819499085
potT = 271.607269287109
Tair = 271.607269287109
Qa = 2.464670687913895E-003
rhoa = 1.30000000000000
swvdr = 0.000000000000000E+000
swvdf = 0.000000000000000E+000
swidr = 0.000000000000000E+000
swidf = 0.000000000000000E+000
flw = 258.931945800781
frain = 0.000000000000000E+000
fsnow = 4.522630479186773E-005
ocn states and fluxes
frzmlt = -1000.00000000000
sst = 1.18273509903225
sss = 34.0000000000000
Tf = -1.90458264992426
uocn = 0.000000000000000E+000
vocn = 0.000000000000000E+000
strtltxU= 0.000000000000000E+000
strtltyU= 0.000000000000000E+000
srf states and fluxes
Tref = 2.460446333168104E-003
Qref = 2.265709251383878E-008
Uref = 1.428371434883044E-005
fsens = 6.097383853908757E-005
flat = -1.995661045829828E-005
evap = -7.027261700667970E-012
flwout = -2.690714878249707E-003
(abort_ice)ABORTED:
(abort_ice) error = (diagnostic_abort)ERROR: bad departure points
Abort(128) on node 43 (rank 43 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 43
EDIT: I re-ran this from the restart of 2008-01-01, tweaking the namelist so restarts are written every month. It failed at the same date in the same way. I re-ran it from the restart of 2008-11-01; it again failed at the same date in the same way. I set `diagfreq` to 1 to check at which time step it aborts: it is at the 3rd time step of the day. I changed `maxits_nonlin` to 6, and this allowed the run to continue without aborting...
I checked back in my case directory for my earlier long run (`ppp6_intel_smoke_gx1_40x1_dynpicard_medium_qc.40_repro`, https://github.com/phil-blain/CICE/issues/40#issuecomment-1184774930) and it turns out I re-ran it with a longer walltime after it hit the walltime the first time. This second time, I also got "bad departure points", at the exact same location (`iglob,jglob = 31, 368`), but on 2006-11-13 instead of 2008-11-13 (!!!). Soooo weird.
Finished writing ./history/iceh_inst.2006-11-13-00000.nc
Warning: Departure points out of bounds in remap
my_task, i, j = 21 16 9
dpx, dpy = -34861.6880018654 -10107.4117000807
HTN(i,j), HTN(i+1,j) = 33338.1913820475 33168.9296994831
HTE(i,j), HTE(i,j+1) = 47781.5593319368 47977.5239199294
istep1, my_task, iblk = 16347 21 8
Global block: 302
Global i and j: 31 368
(abort_ice)ABORTED:
(abort_ice) error = (horizontal_remap)ERROR: bad departure points
Abort(128) on node 21 (rank 21 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 21
Note that the local indices are different since that run was 40x1...

EDIT: this is reassuring in a way, because it means that my recent changes are not the cause, and that rebasing my `repro-vp` branch onto the latest `main` did not introduce the problem either:

> This second time, I also got "bad departure points", at the exact same location (`iglob,jglob = 31, 368`), but on 2006-11-13 instead of 2008-11-13 (!!!). Soooo weird.
Thinking about it, this probably means that it's something in the forcing, as the QC test cycles the 2005 forcing:
$ \grep -E 'fyear_init|ycycle' configuration/scripts/options/set_nml.qc
fyear_init = 2005
ycycle = 1
In DDT, the range of `uvel` and `vvel` (min/max as found by the "Statistics" tab of the Multidimensional Array Viewer) at the start of the nonlinear iterations is very reasonable. But the ranges of `bx` and `by` (the RHS) are more different:
Putting that problem aside for now, I re-ran the HPC Performance Characterization analysis after refactoring the new code to use a single call to `MPI_ALLREDUCE` instead of two each time (i.e. stacking the X and Y components before doing the global sum).
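A minimal sketch of that idea in plain Fortran + MPI (not the actual CICE code; variable names and the toy local sums are made up): pack both partial sums into one buffer and reduce once.

```fortran
! Toy example: one MPI_Allreduce for both the X and Y partial sums,
! instead of one call per component.
program stacked_allreduce
   use mpi
   implicit none
   integer :: ierr, rank
   double precision :: local(2), global(2)

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   ! pretend these are the local sums of the X and Y components
   local(1) = 1.0d0 * (rank + 1)
   local(2) = 2.0d0 * (rank + 1)

   ! a single reduction for both components
   call MPI_Allreduce(local, global, 2, MPI_DOUBLE_PRECISION, MPI_SUM, &
                      MPI_COMM_WORLD, ierr)

   if (rank == 0) print *, 'global X sum =', global(1), ', global Y sum =', global(2)

   call MPI_Finalize(ierr)
end program stacked_allreduce
```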
Unfortunately, `MPIDI_SHMGR_release_generic` still shows up near the top of the hotspot list:
I'm not able to upload a screenshot to GitHub right now for some reason, I'll try again tomorrow.
EDIT: here is what I wanted to show (screenshots): the hotspots sorted by CPU time for the newest code ("repro-pairs", with the difference on the right), and sorted by CPU time for the new code ("repro-no-pairs").
OK, I refactored again following comments in https://github.com/CICE-Consortium/CICE/pull/763. The new timings are very encouraging: no change in non-`bfbflag` mode, and a small slowdown with `bfbflag`.
I re-ran the QC test with this new version of the code (https://github.com/CICE-Consortium/CICE/compare/main...phil-blain:CICE:repro-vp), in the two modes:

- `bfbflag=off` (the default), so the computations are the same as before (`global_sum` of local scalars): `vp-repro-v3/ppp6_intel_smoke_gx1_80x1_dynpicard_medium_qc.test.221006-115019/`
- `bfbflag=lsum8`, so the computations go through the new code path (`global_sum` of an array), but with the default way of computing the local reduction in `compute_sums_dbl`: `vp-repro-v3/ppp6_intel_smoke_gx1_80x1_dynpicard_medium_qc_reprosum.test.lsum8.221006-115329`

"bad departure points" on 2006-04-15:

"bad departure points" on 2008-11-13 (same date as above):
For both cases, bumping `maxits_nonlin` to 5 instead of 4 allows the run to continue, and QC then passes against the `main` simulation (done with `maxits_nonlin=4`) as well as against a new run with `maxits_nonlin=5` (`ppp6_intel_smoke_gx1_80x1_dynpicard_medium_nonlin5_qc.221006-154627`, `ppp6_intel_smoke_gx1_80x1_dynpicard_medium_nonlin5_qc.221006-154716/`).
In both cases, restarting from the time step before the abort and setting `coriolis = 'zero'` allows the run to continue.

In both cases, the cell where it fails is right on the ice edge.

In both cases, bumping `dim_pgmres` (the number of inner iterations of the PGMRES preconditioner) from 5 to 10 allows the run to continue, keeping `maxits_nonlin=4`.

In both cases, loosening the linear tolerance (`reltol_fgmres`) from 1E-2 to 1E-1 allows the run to continue (!). (These tweaks are summarized in the namelist sketch below.)
The change of default parameters was implemented in https://github.com/CICE-Consortium/CICE/pull/774. I'm keeping this open since the underlying robustness issue is not solved.
After discussing with JF, it seems the preconditioner is probably not doing a good enough job, which leads to the FGMRES solver having trouble converging...
Tried a new suite with `dynpicard`, no OpenMP: none of the three MPI cases are bfb with the 1x1 case.