Tried setting `bfbflag=reprosum` in the 2x1 case. Still not b4b with the 1x1 case.
Tried with `maxits_nonlin=500`. Still not b4b with the 1x1 case.
OK, comparing 1x1 against {2x1, 4x1, 8x1}, there are some differences, larger for 4x1 and 8x1.
With 5000 nonlinear iterations, the differences are in the same range for 2, 4 and 8 procs (vs. 1), i.e. [-1E-6, 1E-6].
Note: this test runs 1 day, and over these 24 time steps only 4 need more than 500 iterations to reach `reltol_nonlin=1E-8` (the default value in the namelist). With `maxits_nonlin=5000`, the maximum needed is 1398 iterations.
With `reltol_nonlin=1E-12` and `maxits_nonlin=5000`, the difference vs. 1 proc is more or less the same for 2, 4 and 8 procs. Interestingly, we do not come close to reaching 5000 iterations with this value of `reltol_nonlin`.
To get the number of iterations to reach the required tolerance:
\grep Arctic -B1 path_to_case_directory/logs/cice.runlog.220606-19* |\grep monitor
(this is with `diagfreq = 1` and `monitor_nonlin = .true.`)

With `reltol_nonlin=1E-12`, we get differences in the range [-1E-16, 1E-17] for `uvel`, `vvel`, which is much more in line with what is expected; `aice` and `hi` are b4b.

-1 : Date Time Level Gridsize Miss : Minimum Mean Maximum : Parameter name
9 : 2005-01-02 00:00:00 0 11600 3594 : -3.7470e-16 8.6176e-20 3.4478e-16 : aice
11 : 2005-01-02 00:00:00 0 11600 3594 : -4.4409e-16 7.4066e-20 4.4409e-16 : hi
13 : 2005-01-02 00:00:00 0 11600 3594 : -5.2082e-17 8.6478e-17 6.9237e-13 : uvel
14 : 2005-01-02 00:00:00 0 11600 3594 : -5.5376e-13 -6.9204e-17 1.0734e-17 : vvel
And now `aice` and `hi` start to not be b4b, but the differences are still small.
With `precond='diag'`, no thermo, no transport:
-1 : Date Time Level Gridsize Miss : Minimum Mean Maximum : Parameter name
13 : 2005-01-02 00:00:00 0 11600 3594 : -5.6379e-18 6.6466e-20 3.9514e-16 : uvel
14 : 2005-01-02 00:00:00 0 11600 3594 : -2.1605e-16 -5.4175e-21 2.0172e-16 : vvel
With `precond='diag'`, with thermo, with transport, we get a model abort:
istep1: 2 idate: 20050101 sec: 7200
(JRA55_data) reading forcing file 1st ts = /home/ords/cmdd/cmde/sice500//CICE_data/forcing/gx3/JRA55/8XDAILY/JRA55_gx3_03hr_forcing_2005.nc
Rank 2 [Mon Jun 13 21:08:20 2022] [c0-0c0s9n1] application called MPI_Abort(MPI_COMM_WORLD, 128) - process 2
(icepack_warnings_setabort) T :file icepack_itd.F90 :line 900
(cleanup_itd) aggregate ice area out of bounds
(cleanup_itd)aice: 1.00245455360081
(cleanup_itd)n, aicen: 1 0.676531837003209
(cleanup_itd)n, aicen: 2 0.224493247425031
(cleanup_itd)n, aicen: 3 4.769818624818129E-002
(cleanup_itd)n, aicen: 4 3.766552529401467E-002
(cleanup_itd)n, aicen: 5 1.606575763037559E-002
(icepack_warnings_aborted) ... (icepack_step_therm2)
Weird, as I did this test two years ago (https://github.com/phil-blain/CICE/issues/33#issuecomment-654247421), although with `reltol_nonlin=1E-8`...
@JFLemieux73 I'm keeping a record of my MPI experiments in this issue, if you want to stay in the loop.
OK, it's because I forgot to also re-enable ridging; the model did not like that (is that expected?...). EDIT: after discussing with JF, yes, it is expected; convergence can cause that.
OK, with ridging, advection and transport, `reltol_nonlin=1E-12`, `precond='diag'`:
-1 : Date Time Level Gridsize Miss : Minimum Mean Maximum : Parameter name
9 : 2005-01-02 00:00:00 0 11600 3594 : -3.9378e-06 -4.8300e-10 4.4782e-06 : aice
11 : 2005-01-02 00:00:00 0 11600 3594 : -1.9175e-06 -7.3383e-11 4.1247e-06 : hi
13 : 2005-01-02 00:00:00 0 11600 3594 : -2.8908e-07 7.0716e-09 3.3297e-05 : uvel
14 : 2005-01-02 00:00:00 0 11600 3594 : -5.4352e-07 6.0686e-09 2.5828e-05 : vvel
Same with `precond='pgmres'`:
phb001@xc4elogin1(daley): [17:06:02] $ cdo infov diff.nc 2>/dev/null | \grep -E 'name|aice|hi|vel'
-1 : Date Time Level Gridsize Miss : Minimum Mean Maximum : Parameter name
9 : 2005-01-02 00:00:00 0 11600 3594 : -3.8860e-06 3.3100e-09 2.7253e-05 : aice
11 : 2005-01-02 00:00:00 0 11600 3594 : -9.5537e-06 -1.0926e-09 2.0109e-06 : hi
13 : 2005-01-02 00:00:00 0 11600 3594 : -4.9737e-07 5.9547e-10 2.8301e-06 : uvel
14 : 2005-01-02 00:00:00 0 11600 3594 : -2.5361e-06 -5.8569e-10 1.2014e-06 : vvel
I don't think it's only the preconditioner, since these results are similar to those with the `'diag'` preconditioner.
So I did side-by-side, step-by-step debugging of 1x1 vs 2x1. The values are the same on both sides until the first normalization in the FGMRES algorithm. Since we do a global sum of different numbers, in a different order (which mathematically sums to the same result on all decompositions), floating point arithmetic gives a different norm of the residual, and that difference then propagates to the whole vector when we normalize.
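To make the mechanism concrete, here is a minimal, self-contained sketch in plain Fortran (not CICE code) showing that summing the same numbers in a different order can change the last bits of a double precision result:

```fortran
! Toy example: the same three numbers, summed in two different orders,
! give two slightly different double precision results.
program sum_order
   implicit none
   integer, parameter :: dbl = selected_real_kind(13)
   real(dbl), parameter :: big = 1.0e16_dbl, small = 1.0_dbl
   real(dbl) :: s1, s2

   s1 = (big + small) + small   ! one summation order: the small terms are rounded away
   s2 = (small + small) + big   ! another order: the small terms survive

   print *, s1, s2, s1 == s2    ! the two sums differ, so the comparison prints F
end program sum_order
```

This is exactly what happens when the per-block partial sums are combined in a different order on different decompositions: the residual norm differs in its last bits, and the normalization then spreads that difference to the whole solution vector.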
So in the end it is not surprising that we get different results. We will run a QC test of different decompositions against each other to ensure we get the same climate.
Mixed results:
80x1 vs 40x1:
INFO:__main__:Running QC test on the following directories:
INFO:__main__: /home/phb001/data/ppp6/cice/runs/ppp6_intel_smoke_gx1_40x1_medium_qc.qc_40/
INFO:__main__: /home/phb001/data/ppp6/cice/runs/ppp6_intel_smoke_gx1_80x1_medium_qc.qc_80/
INFO:__main__:Number of files: 1825
INFO:__main__:2 Stage Test Passed
INFO:__main__:Quadratic Skill Test Passed for Northern Hemisphere
INFO:__main__:Quadratic Skill Test Passed for Southern Hemisphere
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_40x1_medium_qc.qc_40.png)
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_80x1_medium_qc.qc_80.png)
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_40x1_medium_qc.qc_40_minus_ppp6_intel_smoke_gx1_80x1_medium_qc.qc_80.png)
INFO:__main__:
INFO:__main__:Quality Control Test PASSED
40x1 vs 24x1:
INFO:__main__:Running QC test on the following directories:
INFO:__main__: /home/phb001/data/ppp6/cice/runs/ppp6_intel_smoke_gx1_40x1_medium_qc.qc_40/
INFO:__main__: /home/phb001/data/ppp6/cice/runs/ppp6_intel_smoke_gx1_24x1_medium_qc.qc_24/
INFO:__main__:Number of files: 1825
INFO:__main__:2 Stage Test Passed
INFO:__main__:Quadratic Skill Test Passed for Northern Hemisphere
INFO:__main__:Quadratic Skill Test Failed for Southern Hemisphere
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_40x1_medium_qc.qc_40.png)
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_24x1_medium_qc.qc_24.png)
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_40x1_medium_qc.qc_40_minus_ppp6_intel_smoke_gx1_24x1_medium_qc.qc_24.png)
INFO:__main__:
ERROR:__main__:Quality Control Test FAILED
80x1 vs 24x1:
INFO:__main__:Running QC test on the following directories:
INFO:__main__: /home/phb001/data/ppp6/cice/runs/ppp6_intel_smoke_gx1_80x1_medium_qc.qc_80/
INFO:__main__: /home/phb001/data/ppp6/cice/runs/ppp6_intel_smoke_gx1_24x1_medium_qc.qc_24/
INFO:__main__:Number of files: 1825
INFO:__main__:2 Stage Test Passed
INFO:__main__:Quadratic Skill Test Passed for Northern Hemisphere
INFO:__main__:Quadratic Skill Test Failed for Southern Hemisphere
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_80x1_medium_qc.qc_80.png)
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_24x1_medium_qc.qc_24.png)
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_80x1_medium_qc.qc_80_minus_ppp6_intel_smoke_gx1_24x1_medium_qc.qc_24.png)
INFO:__main__:
ERROR:__main__:Quality Control Test FAILED
OK, that was a false alarm: the 24x1 run hit the walltime and was killed, but the `history` folder contained outputs from an older run done with EVP, so the quality control script was comparing results from two different runs (and that, fortunately, failed!). I re-ran the 24x1 case with a longer walltime and both comparisons (with 40x1 and 80x1) now pass.
I put some more thought into the problem of reproducibility for the global sums, after a comment by @dupontf regarding performing the global sum using quadruple precision.
It turns out we already have that capability in CICE, and also even better algorithms: https://cice-consortium-cice.readthedocs.io/en/master/developer_guide/dg_other.html?highlight=reprosum#reproducible-sums
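As a toy illustration of why a higher-precision or compensated local sum helps (the actual algorithms behind `bfbflag`, such as `ddpdd` and `reprosum`, are more elaborate than this), here is a minimal Kahan (compensated) summation sketch in plain Fortran:

```fortran
! Toy example only: a compensated (Kahan) sum recovers low-order bits that
! a naive running sum loses.
program kahan_demo
   implicit none
   integer, parameter :: dbl = selected_real_kind(13)
   real(dbl) :: x(3) = [1.0e16_dbl, 1.0_dbl, 1.0_dbl]
   real(dbl) :: naive, s, c, y, t
   integer :: i

   naive = 0.0_dbl
   do i = 1, 3
      naive = naive + x(i)   ! the two small terms are rounded away
   end do

   s = 0.0_dbl ; c = 0.0_dbl
   do i = 1, 3
      y = x(i) - c           ! apply the compensation carried so far
      t = s + y
      c = (t - s) - y        ! capture what this addition lost
      s = t
   end do

   print *, 'naive =', naive, ' compensated =', s
end program kahan_demo
```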
I looked more closely at the code and realized I could leverage this capability with only slight modifications. With these modifications done, running 1x1 and 2x1 side by side, I can verify that the global sums done in the dynamics solver are the same on both sides, at least for this configuration.
With these settings, the restarts are bit-for-bit!

Note that I had to also add `dump_last = .true.` in the namelist for the code to create a restart at the end of the run; otherwise it defaults to `dump_freq = 1d`, and the scripts would use an older restart from a previous run, from before I changed `npt` to do a single time step.
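For reference, a hypothetical sketch of the namelist entries involved in this single-time-step test (the namelist group names and the exact `bfbflag` value are my assumptions; only the individual flags come from the discussion above):

```fortran
&setup_nml
  npt       = 1         ! run a single time step
  dump_last = .true.    ! write a restart at the end of the run
/
&domain_nml
  bfbflag   = 'reprosum'   ! reproducible global sums
/
```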
Also passes (b4b) after 1 day (24 time steps).

And as expected, with `precond='pgmres'` it still fails, as we skip some halo updates. And it passes with `precond='pgmres'` if we add back those halo updates.
So in preparation of a PR with these changes (https://github.com/CICE-Consortium/CICE/compare/main...phil-blain:CICE:repro-vp), I'm noticing the new code is noticeably slower than the old.
EDIT: original version is https://github.com/phil-blain/CICE/commits/repro-vp@%7B2022-07-14%7D
This is a little bit surprising ...
Note that this is without `bfbflag`... so the differences are: `global_sum_prod` loops through the whole arrays, not just the ice points as `calc_L2norm_squared` was doing (loop on `icellu`).
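As a schematic, self-contained illustration of that difference (toy arrays and plain Fortran, not the actual CICE routines or interfaces): the old approach loops only over the packed ice-point indices, while a masked whole-array reduction, like the one `global_sum_prod` performs, touches every grid point.

```fortran
! Toy comparison (not CICE code): L2 norm over packed ice points only,
! versus a masked sum over the whole array.
program norm_two_ways
   implicit none
   integer, parameter :: dbl = selected_real_kind(13)
   integer, parameter :: nx = 4, ny = 3
   real(dbl) :: u(nx,ny), norm_packed, norm_masked
   logical  :: icemask(nx,ny)
   integer  :: indxi(nx*ny), indxj(nx*ny), icellu, i, j, ij

   call random_number(u)
   icemask = u > 0.5_dbl            ! pretend these are the ice points

   ! pack the ice-point indices, similar in spirit to indxui/indxuj
   icellu = 0
   do j = 1, ny
      do i = 1, nx
         if (icemask(i,j)) then
            icellu = icellu + 1
            indxi(icellu) = i
            indxj(icellu) = j
         end if
      end do
   end do

   ! "old" way: loop over the icellu packed ice points only
   norm_packed = 0.0_dbl
   do ij = 1, icellu
      norm_packed = norm_packed + u(indxi(ij), indxj(ij))**2
   end do

   ! "new" way: sum u*u over the whole array, with a mask
   norm_masked = sum(u*u, mask=icemask)

   print *, norm_packed, norm_masked  ! mathematically the same number
end program norm_two_ways
```

The two results are mathematically identical, but the second form does work on every grid point, which is part of the extra cost measured below.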
OK, so I played with Intel Trace Analyzer and Collector (ITAC) and Intel VTune, for both versions (old and new) of the code, following this tutorial: https://www.intel.com/content/www/us/en/develop/documentation/itac-vtune-mpi-openmp-tutorial-lin/top.html
First, running Application Performance Snapshot reveals both versions are MPI bound, and have very poor vectorization (note that both runs are 40x1):
(screenshots of the APS summary for the old and new code)
This reveals, however, that it's not only the added communications that slow down the new code, since "MPI time" is 31% for the new code vs. 43% for the old code.
I then ran the VTune "HPC Performance Characterization" analysis for both versions and used the "Compare" feature. This is a ranking of the hotspots by time difference between the new and old versions (right column, `CPU Time: Difference`), with the corresponding timings for those functions in the new code (`CPU Time: mod-vtune-g`):
I confirmed by running under GDB (`mpirun -gdb`) that `MPIDI_SHMGR_release_generic` is called (amongst other MPI subroutines) by `MPI_ALLREDUCE`. So in the VP solver, it is only called by `global_sum_prod` to actually perform the MPI reduction. Notice that the time difference for that function is almost half of the number for the new code, which makes sense since the new code has approximately twice the number of calls to `MPI_ALLREDUCE` compared to the old code (the new code does one global sum for the X components and another for the Y components). I write "approximately" because there are also more calls due to the modified CGS loop. My analysis of these timings is that the modification to the CGS algorithm does not play a big part in the additional time, since the new code spends almost twice as much time in this function, but not a lot more than twice.

The new code also spends a lot of time actually computing the local reductions (functions `global_sum_prod_dbl` and `compute_sums_dbl`).
Getting back to the performance regression after finally getting rid of all the bugs (famous last words) in my new code (see https://github.com/phil-blain/CICE/issues/39#issuecomment-1192815691 and following comments).
I re-ran the QC test cases on `main` (007fbff) and the current tip of my `repro-vp` branch (579e19f), both 80x1, so using all cores of a single node (i.e. twice the number in my previous test). The listing shows that at least `uvel` is unrealistically large:
Warning: Departure points out of bounds in remap
my_task, i, j = 43 8 17
dpx, dpy = -45563.7247538909 12271.9759813932
HTN(i,j), HTN(i+1,j) = 33338.1913820475 33168.9296994831
HTE(i,j), HTE(i,j+1) = 47781.5593319368 47977.5239199294
(print_state) bad departure points
(print_state) istep1, my_task, i, j, iblk: 33867 43 8 17 11
(print_state) Global block: 884
(print_state) Global i and j: 31 368
(print_state) Lat, Lon (degrees): 67.5273801992820 -16.4194498698244
aice 9.070213892498866E-006
aice0 0.999990929786107
...
uvel(i,j) 12.6565902094141
vvel(i,j) -3.40888221705366
atm states and fluxes
uatm = -0.848320343386380
vatm = 2.65704819499085
potT = 271.607269287109
Tair = 271.607269287109
Qa = 2.464670687913895E-003
rhoa = 1.30000000000000
swvdr = 0.000000000000000E+000
swvdf = 0.000000000000000E+000
swidr = 0.000000000000000E+000
swidf = 0.000000000000000E+000
flw = 258.931945800781
frain = 0.000000000000000E+000
fsnow = 4.522630479186773E-005
ocn states and fluxes
frzmlt = -1000.00000000000
sst = 1.18273509903225
sss = 34.0000000000000
Tf = -1.90458264992426
uocn = 0.000000000000000E+000
vocn = 0.000000000000000E+000
strtltxU= 0.000000000000000E+000
strtltyU= 0.000000000000000E+000
srf states and fluxes
Tref = 2.460446333168104E-003
Qref = 2.265709251383878E-008
Uref = 1.428371434883044E-005
fsens = 6.097383853908757E-005
flat = -1.995661045829828E-005
evap = -7.027261700667970E-012
flwout = -2.690714878249707E-003
(abort_ice)ABORTED:
(abort_ice) error = (diagnostic_abort)ERROR: bad departure points
Abort(128) on node 43 (rank 43 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 43
EDIT: I re-ran this from the restart of 2008-01-01, tweaking the namelist so restarts are written every month. It failed at the same date in the same way. I re-ran it from the restart of 2008-11-01; it again failed at the same date in the same way. I set `diagfreq` to 1 to check at which time step it aborts: it is at the 3rd time step of the day. I changed `maxits_nonlin` to 6, and this allowed the run to continue without aborting...
I checked back in my case directory for my earlier long run (`ppp6_intel_smoke_gx1_40x1_dynpicard_medium_qc.40_repro`, https://github.com/phil-blain/CICE/issues/40#issuecomment-1184774930) and it turns out I re-ran it with a longer walltime after it hit the walltime the first time. This second time, I also got "bad departure points", at the exact same location (`iglob,jglob = 31, 368`), but on 2006-11-13 instead of 2008-11-13 (!!!). Soooo weird.
Finished writing ./history/iceh_inst.2006-11-13-00000.nc
Warning: Departure points out of bounds in remap
my_task, i, j = 21 16 9
dpx, dpy = -34861.6880018654 -10107.4117000807
HTN(i,j), HTN(i+1,j) = 33338.1913820475 33168.9296994831
HTE(i,j), HTE(i,j+1) = 47781.5593319368 47977.5239199294
istep1, my_task, iblk = 16347 21 8
Global block: 302
Global i and j: 31 368
(abort_ice)ABORTED:
(abort_ice) error = (horizontal_remap)ERROR: bad departure points
Abort(128) on node 21 (rank 21 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 21
Note that the local indices are different since that run was 40x1...

EDIT: this is reassuring in a way, because it means that my recent changes are not the cause, and that rebasing my `repro-vp` branch onto the latest `main` did not introduce the problem either:

> This second time, I also got "bad departure points", at the exact same location (`iglob,jglob = 31, 368`), but on 2006-11-13 instead of 2008-11-13 (!!!). Soooo weird.
Thinking about it, this probably means that it's something in the forcing, as the QC test cycles the 2005 forcing:
$ \grep -E 'fyear_init|ycycle' configuration/scripts/options/set_nml.qc
fyear_init = 2005
ycycle = 1
In DDT, the range of `uvel` and `vvel` (min/max as found by the "Statistics" tab of the Multidimensional Array Viewer) at the start of the nonlinear iterations is very reasonable. But the ranges of `bx` and `by` (the RHS) are more different:
Putting that problem aside for now, I re-ran the HPC Performance Characterization analysis after refactoring the new code to use a single call to `MPI_ALLREDUCE` instead of two each time (i.e. stacking the X and Y components before doing the global sum).
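A minimal sketch of that idea in plain Fortran + MPI (not the actual CICE code; variable names and the toy local sums are made up): pack both partial sums into one buffer and reduce once.

```fortran
! Toy example: one MPI_Allreduce for both the X and Y partial sums,
! instead of one call per component.
program stacked_allreduce
   use mpi
   implicit none
   integer :: ierr, rank
   double precision :: local(2), global(2)

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   ! pretend these are the local sums of the X and Y components
   local(1) = 1.0d0 * (rank + 1)
   local(2) = 2.0d0 * (rank + 1)

   ! a single reduction for both components
   call MPI_Allreduce(local, global, 2, MPI_DOUBLE_PRECISION, MPI_SUM, &
                      MPI_COMM_WORLD, ierr)

   if (rank == 0) print *, 'global X sum =', global(1), ', global Y sum =', global(2)

   call MPI_Finalize(ierr)
end program stacked_allreduce
```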
Unfortunately, `MPIDI_SHMGR_release_generic` still shows up near the top of the hotspot list:
I'm not able to upload a screenshot to GitHub right now for some reason, I'll try again tomorrow.
EDIT: here is what I wanted to show (screenshots): the hotspots sorted by CPU time for the newest code ("repro-pairs", with the difference on the right), and sorted by CPU time for the new code ("repro-no-pairs").
OK, I refactored again following comments in https://github.com/CICE-Consortium/CICE/pull/763. The new timings are very encouraging: no change in non-`bfbflag` mode, and a small slowdown with `bfbflag`.
I re-ran the QC test with this new version of the code (https://github.com/CICE-Consortium/CICE/compare/main...phil-blain:CICE:repro-vp), in the two modes:

- `bfbflag=off` (the default), so the computations are the same as before (`global_sum` of local scalars): `vp-repro-v3/ppp6_intel_smoke_gx1_80x1_dynpicard_medium_qc.test.221006-115019/`
- `bfbflag=lsum8`, so the computations go through the new code path (`global_sum` of an array), but with the default way of computing the local reduction in `compute_sums_dbl`: `vp-repro-v3/ppp6_intel_smoke_gx1_80x1_dynpicard_medium_qc_reprosum.test.lsum8.221006-115329`

"bad departure points" on 2006-04-15:

"bad departure points" on 2008-11-13 (same date as above):
For both cases, bumping `maxits_nonlin` to 5 instead of 4 allows the run to continue, and QC then passes against the `main` simulation (done with `maxits_nonlin=4`) as well as against a new run with `maxits_nonlin=5` (`ppp6_intel_smoke_gx1_80x1_dynpicard_medium_nonlin5_qc.221006-154627`, `ppp6_intel_smoke_gx1_80x1_dynpicard_medium_nonlin5_qc.221006-154716/`).
In both cases, restarting from the time step before the abort and setting `coriolis = 'zero'` allows the run to continue.

In both cases, the cell where it fails is right on the ice edge.

In both cases, bumping `dim_pgmres` (the number of inner iterations of the PGMRES preconditioner) from 5 to 10 allows the run to continue, keeping `maxits_nonlin=4`.

In both cases, loosening the linear tolerance (`reltol_fgmres`) from 1E-2 to 1E-1 allows the run to continue (!). (These tweaks are summarized in the namelist sketch below.)
The change of default parameters was implemented in https://github.com/CICE-Consortium/CICE/pull/774. I'm keeping this open since the underlying robustness issue is not solved.
After discussing with JF, it seems the preconditioner is probably not doing a good enough job, which leads to the FGMRES solver having trouble converging...
Tried a new suite with `dynpicard`, no OpenMP: none of the three MPI cases are bfb with the 1x1 case.