phil-blain / CICE

Development repository for the CICE sea-ice model

Investigate decomp_suite failures with dynpicard option #39

Open phil-blain opened 3 years ago

phil-blain commented 3 years ago

Running the decomp_suite with the VP dynamics results in some segfaults (due to NaN initialisation), some errors ("bad departure points") and some non-BFB restarts, see https://github.com/CICE-Consortium/CICE/issues/518.

I'll use this issue to document my findings in investigating those.

phil-blain commented 3 years ago

Cases that are missing data

$ ./results.csh |\grep MISS
MISS daley_intel_smoke_gx3_8x2x8x10x20_debug_droundrobin_dynpicard_maskhalo_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day missing-data
MISS daley_intel_smoke_gx3_1x6x25x29x16_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day missing-data
MISS daley_intel_smoke_gx3_1x8x30x20x32_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day missing-data

These are in fact cases where the model segfaulted or crashed, so it is the data for the case itself that is missing, not the data for the case we are comparing against.

EDIT 2022/05: reported in https://github.com/CICE-Consortium/CICE/issues/608 and subsequently fixed.

daley_intel_smoke_gx3_8x2x8x10x20_debug_droundrobin_dynpicard_maskhalo_run2day

The core file reveals that diag[xy] contain NaNs (probably due to initialization) at cicecore/cicedynB/dynamics/ice_dyn_vp.F90:3517. From a quick look it seems the OpenMP directive is missing i, j, ij in its list of private variables.
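As an aside, here is a minimal sketch (hypothetical names, not the CICE code) of why leaving those inner indices shared corrupts the result: with i and j shared, threads overwrite each other's indices between the assignment and the use, so some cells get another thread's value and some are never written at all, keeping whatever they were initialized to (NaNs in a debug build).

program omp_private_demo
   ! Minimal sketch, not the CICE code: i and j are assigned inside the
   ! threaded block loop but are left shared, so with more than one thread
   ! they race and some elements of 'diag' are never written.
   implicit none
   integer, parameter :: nx = 100, ny = 100, nblocks = 8
   integer :: iblk, ij, i, j
   integer :: indxi(nx*ny), indxj(nx*ny)
   real(kind=8) :: diag(nx, ny, nblocks)

   do ij = 1, nx*ny                     ! simple index lists, one entry per cell
      indxi(ij) = mod(ij-1, nx) + 1
      indxj(ij) = (ij-1)/nx + 1
   end do
   diag = -1.0d0                        ! stand-in for the NaN initialization

   !$OMP PARALLEL DO PRIVATE(iblk)      ! bug: i and j (and ij) should be PRIVATE too
   do iblk = 1, nblocks
      do ij = 1, nx*ny
         i = indxi(ij)                  ! shared i, j: data race across threads
         j = indxj(ij)
         diag(i, j, iblk) = real(ij, kind=8)
      end do
   end do
   !$OMP END PARALLEL DO

   print *, 'cells never written:', count(diag < 0.0d0)   ! usually > 0 with several threads
end program omp_private_demo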

daley_intel_smoke_gx3_1x6x25x29x16_debug_droundrobin_dynpicard_run2day

The core file reveals the same as above (one needs to switch threads in the core with thread $num, see https://sourceware.org/gdb/onlinedocs/gdb/Threads.html. The right thread to use is the one stopped in ../sysdeps/unix/sysv/linux/x86_64/sigaction.c:62).

daley_intel_smoke_gx3_1x8x30x20x32_debug_droundrobin_dynpicard_run2day

The core file reveals the same as above.

Cases that fail to run

$ ./results.csh |\grep FAIL | \grep ' run'
FAIL daley_intel_restart_gx1_64x1x16x16x10_dwghtfile_dynpicard run
FAIL daley_intel_restart_gx3_1x8x30x20x32_droundrobin_dynpicard run
FAIL daley_intel_smoke_gx1_64x1x16x16x10_debug_dwghtfile_dynpicard_run2day run -1 -1 -1
FAIL daley_intel_smoke_gbox180_16x1x6x6x60_debug_debugblocks_dspacecurve_dynpicard_run2day run -1 -1 -1
FAIL daley_intel_smoke_gx3_8x2x8x10x20_debug_droundrobin_dynpicard_maskhalo_run2day run -1 -1 -1
FAIL daley_intel_smoke_gx3_1x6x25x29x16_debug_droundrobin_dynpicard_run2day run -1 -1 -1
FAIL daley_intel_smoke_gx3_1x8x30x20x32_debug_droundrobin_dynpicard_run2day run -1 -1 -1

daley_intel_restart_gx1_64x1x16x16x10_dwghtfile_dynpicard, daley_intel_smoke_gx1_64x1x16x16x10_debug_dwghtfile_dynpicard_run2day

missing an input file, see https://github.com/CICE-Consortium/CICE/pull/602#issuecomment-860818756

daley_intel_restart_gx3_1x8x30x20x32_droundrobin_dynpicard

"ERROR: bad departure points"

daley_intel_smoke_gbox180_16x1x6x6x60_debug_debugblocks_dspacecurve_dynpicard_run2day

NaNs in gridbox_corners... probably not related to VP solver...

#4  0x0000000000de0fbc in ice_grid::gridbox_corners () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/infrastructure/ice_grid.F90:2219
#5  0x0000000000d76302 in ice_grid::init_grid2 () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/infrastructure/ice_grid.F90:570
#6  0x0000000000401b83 in cice_initmod::cice_init () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/drivers/standalone/cice/CICE_InitMod.F90:121
#7  0x00000000004019c7 in cice_initmod::cice_initialize () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/drivers/standalone/cice/CICE_InitMod.F90:52
#8  0x0000000000401671 in icemodel () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/drivers/standalone/cice/CICE.F90:43
#9  0x00000000004015f2 in main ()
#10 0x00000000021d930f in __libc_start_main (main=..., argc=..., argv=..., init=..., fini=..., rtld_fini=..., stack_end=...) at ../csu/libc-start.c:308
#11 0x00000000004014da in _start () at ../sysdeps/x86_64/start.S:120
(gdb) f 4
#4  0x0000000000de0fbc in ice_grid::gridbox_corners () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/infrastructure/ice_grid.F90:2219
2219                work_g2(:,:) = lont_bounds(icorner,:,:,iblk) + c360

array lont_bounds(icorner,:,:,iblk) is all NaN.

EDIT the above was already reported in https://github.com/CICE-Consortium/CICE/issues/599#issue-865459873 ("Problems in ice_grid.F90"), on the gbox128 grid.

The other 3 cases are mentioned above in the MISS section.

phil-blain commented 3 years ago

Cases that fail 'test'

$ ./results.csh |\grep FAIL | \grep ' test'
FAIL daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard test 
FAIL daley_intel_restart_gx1_64x1x16x16x10_dwghtfile_dynpicard test 
FAIL daley_intel_restart_gx3_20x2x5x4x30_dsectrobin_dynpicard_short test 
FAIL daley_intel_restart_gx3_1x4x25x29x16_droundrobin_dynpicard test 
FAIL daley_intel_restart_gx3_1x8x30x20x32_droundrobin_dynpicard test 
FAIL daley_intel_restart_gx3_16x2x3x3x100_droundrobin_dynpicard test 
FAIL daley_intel_smoke_gx1_64x1x16x16x10_debug_dwghtfile_dynpicard_run2day test 
FAIL daley_intel_smoke_gbox180_16x1x6x6x60_debug_debugblocks_dspacecurve_dynpicard_run2day test 
FAIL daley_intel_smoke_gx3_8x2x8x10x20_debug_droundrobin_dynpicard_maskhalo_run2day test 
FAIL daley_intel_smoke_gx3_1x6x25x29x16_debug_droundrobin_dynpicard_run2day test 
FAIL daley_intel_smoke_gx3_1x8x30x20x32_debug_droundrobin_dynpicard_run2day test 

daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard

fails to restart exactly

daley_intel_restart_gx3_20x2x5x4x30_dsectrobin_dynpicard_short

fails to restart exactly

daley_intel_restart_gx3_1x4x25x29x16_droundrobin_dynpicard

fails to restart exactly

daley_intel_restart_gx3_16x2x3x3x100_droundrobin_dynpicard

fails to restart exactly

The other cases also fail to run (see above).

phil-blain commented 3 years ago

The other failing tests fail because they are not BFB.

phil-blain commented 3 years ago

I fixed the buggy OpenMP directive:

diff --git i/cicecore/cicedynB/dynamics/ice_dyn_vp.F90 w/cicecore/cicedynB/dynamics/ice_dyn_vp.F90
index 457a73a..367d29e 100644
--- i/cicecore/cicedynB/dynamics/ice_dyn_vp.F90
+++ w/cicecore/cicedynB/dynamics/ice_dyn_vp.F90
@@ -3507,7 +3507,7 @@ subroutine precondition(zetaD       ,        &
          wx = vx
          wy = vy
       elseif (precond_type == 'diag') then ! Jacobi preconditioner (diagonal)
-         !$OMP PARALLEL DO PRIVATE(iblk)
+         !$OMP PARALLEL DO PRIVATE(iblk, ij, i, j)
          do iblk = 1, nblocks
             do ij = 1, icellu(iblk)
                i = indxui(ij, iblk)

So let's do another round (~/cice-dirs/suites/vp-decomp-openmp-fix)

Cases that are missing data

daley_intel_smoke_gx3_1x6x25x29x16_debug_droundrobin_dynpicard_run2day

run segfaulted.

#8  0x00000000009ca4e4 in ice_dyn_vp::calc_l2norm_squared (nx_block=..., ny_block=..., icellu=..., indxui=..., indxuj=..., tpu=..., tpv=..., l2norm=...)
    at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:2479
#9  0x0000000000a29749 in ice_dyn_vp::L_ice_dyn_vp_mp_pgmres__3290__par_loop4_2_56 ()
    at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:3292
#10 0x0000000001f68883 in __kmp_invoke_microtask ()
#11 0x0000000001f12b2a in __kmp_invoke_task_func ()
#12 0x0000000001f143d6 in __kmp_fork_call ()
#13 0x0000000001edfb25 in __kmpc_fork_call ()
#14 0x0000000000a09894 in ice_dyn_vp::pgmres (zetad=..., cb=..., vrel=..., umassdti=..., bx=..., by=..., diagx=..., diagy=..., tolerance=..., maxinner=..., maxouter=..., 
    solx=..., soly=..., nbiter=...) at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:3290
#15 0x0000000000a35680 in ice_dyn_vp::precondition (zetad=..., cb=..., vrel=..., umassdti=..., vx=..., vy=..., diagx=..., diagy=..., precond_type=..., wx=..., wy=..., 
    .tmp.PRECOND_TYPE.len_V$69f7=...) at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:3528
#16 0x00000000009da5f8 in ice_dyn_vp::fgmres (zetad=..., cb=..., vrel=..., umassdti=..., halo_info_mask=..., bx=..., by=..., diagx=..., diagy=..., tolerance=..., maxinner=..., 
    maxouter=..., solx=..., soly=..., nbiter=...) at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:2858
#17 0x000000000094ab6e in ice_dyn_vp::anderson_solver (icellt=..., icellu=..., indxti=..., indxtj=..., indxui=..., indxuj=..., aiu=..., ntot=..., waterx=..., watery=..., 
    bxfix=..., byfix=..., umassdti=..., sol=..., fpresx=..., fpresy=..., zetad=..., cb=..., halo_info_mask=...)
    at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:932
#18 0x0000000000914ba7 in ice_dyn_vp::implicit_solver (dt=...) at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:483
#19 0x00000000014c41d7 in ice_step_mod::step_dyn_horiz (dt=...) at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/general/ice_step_mod.F90:886
#20 0x000000000040cd41 in cice_runmod::ice_step () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/drivers/standalone/cice/CICE_RunMod.F90:284
#21 0x000000000040b849 in cice_runmod::cice_run () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/drivers/standalone/cice/CICE_RunMod.F90:83
#22 0x0000000000400cd0 in icemodel () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/drivers/standalone/cice/CICE.F90:49
#23 0x0000000000400c32 in main ()
#24 0x00000000020f7edf in __libc_start_main (main=..., argc=..., argv=..., init=..., fini=..., rtld_fini=..., stack_end=...) at ../csu/libc-start.c:308
#25 0x0000000000400b1a in _start () at ../sysdeps/x86_64/start.S:120
(gdb) f 8
#8  0x00000000009ca4e4 in ice_dyn_vp::calc_l2norm_squared (nx_block=..., ny_block=..., icellu=..., indxui=..., indxuj=..., tpu=..., tpv=..., l2norm=...)
    at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:2479

because of an out-of-range value in the computation of the norm:

(gdb) p tpu(i,j)
$3 = -1.4002696118736502e+97
(gdb) p tpv(i,j)
$4 = -2.4873702247927307e+218
(gdb) p tpv(i,j)**2
Cannot perform exponentiation: Numerical result out of range
(gdb) p tpu(i,j)**2
$5 = 1.9607549859367831e+194
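The magnitude alone explains the trap: squaring ~2.5e+218 exceeds the double-precision range (huge(1.0d0) is about 1.8e+308), so calc_l2norm_squared overflows on this value. A trivial stand-alone check (not CICE code):

program sq_overflow_check
   ! The square of this value exceeds the double-precision range, so with
   ! floating-point traps enabled (debug build) the overflow is fatal
   ! instead of quietly producing +Infinity.
   implicit none
   real(kind=8) :: tpv
   tpv = -2.4873702247927307d+218
   print *, 'huge(1.0d0) =', huge(1.0d0)   ! ~1.8e+308
   print *, 'tpv*tpv     =', tpv*tpv       ! prints Infinity when traps are off
end program sq_overflow_check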

daley_intel_smoke_gx3_1x8x30x20x32_debug_droundrobin_dynpicard_run2day

same as above

Cases that failed to run

$ ./results.csh |\grep FAIL | \grep ' run'
FAIL daley_intel_restart_gx1_64x1x16x16x10_dwghtfile_dynpicard run
FAIL daley_intel_restart_gx3_1x4x25x29x16_droundrobin_dynpicard run 
FAIL daley_intel_restart_gx3_1x8x30x20x32_droundrobin_dynpicard run
FAIL daley_intel_smoke_gx1_64x1x16x16x10_debug_dwghtfile_dynpicard_run2day run -1 -1 -1
FAIL daley_intel_smoke_gbox180_16x1x6x6x60_debug_debugblocks_dspacecurve_dynpicard_run2day run -1 -1 -1
FAIL daley_intel_smoke_gx3_1x6x25x29x16_debug_droundrobin_dynpicard_run2day run -1 -1 -1
FAIL daley_intel_smoke_gx3_1x8x30x20x32_debug_droundrobin_dynpicard_run2day run -1 -1 -1

daley_intel_restart_gx1_64x1x16x16x10_dwghtfile_dynpicard, daley_intel_smoke_gx1_64x1x16x16x10_debug_dwghtfile_dynpicard_run2day

same as above (missing file)

daley_intel_restart_gx3_1x4x25x29x16_droundrobin_dynpicard, daley_intel_restart_gx3_1x8x30x20x32_droundrobin_dynpicard

(horizontal_remap)ERROR: bad departure points

The next two are the same as in the first suite above. The last two are the ones that segfaulted (see just above).

phil-blain commented 3 years ago

Note: re-running the same suite a second time leads to different results:

This really suggests that there is some non-reproducibility in the code...

apcraig commented 2 years ago

Just a quick update. I'm playing with the OpenMP in the entire code and tested evp, eap, and vp. I can also confirm that running different thread counts with vp produces different answers. If "re-running the same suite a second time leads to different results", that suggests the code is not bit-for-bit reproducible when rerun? I tried to test that and for my quick tests, the same run does seem to be reproducible. It's a little too bad because that's an easier problem to debug. I also tested a 32x1x16x16x16 and 64x1x16x16x16 case and they are not bit-for-bit. Same decomp, no OpenMP, just different block distribution. If I get a chance, I will try to look into this more. At this point, I will probably defer further OpenMP optimization with vp. I think there are several tasks to do

phil-blain commented 2 years ago

Hi Tony, thanks for these details and tests. This issue is definitely still on my list; I hope I'll have time to go back to the VP solver this winter/early spring.

I'll take a look at the PR when you submit it.

phil-blain commented 2 years ago

I'm finally going back to this. I've re-run the decomp_suite with -s dynpicard on our 2 machines, testing the latest main. I've run the suite twice on each machine, and I get the same results on the same machine and across machines, modulo -init=snan,arrays (which is inactive on daley but active on banting) and modulo the exact failure mode. So at least that's that.

Summary:

$ ./results.csh |tail -5
203 measured results of 203 total results
157 of 203 tests PASSED
0 of 203 tests PENDING
0 of 203 tests MISSING data
46 of 203 tests FAILED

cases that fail "run"

$ ./results.csh |\grep FAIL|\grep ' run'
FAIL daley_intel_restart_gx3_6x2x50x58x1_droundrobin_dynpicard run
FAIL daley_intel_smoke_gx3_6x2x50x58x1_debug_droundrobin_dynpicard_run2day run -1 -1 -1

on banting I also get banting_intel_smoke_gbox180_16x1x6x6x60_debug_debugblocks_dspacecurve_dynpicard_run2day but this is due to https://github.com/CICE-Consortium/CICE/issues/599#issue-865459873 ("Problems in ice_grid.F90")

daley_intel_restart_gx3_6x2x50x58x1_droundrobin_dynpicard

fails at the first run of the test with SIGILL or SIGSEGV (it varies). Usually no core is produced (I got a core once, but it was of limited use since this case is not compiled in debug mode; when I recompiled in debug mode I did not get a core...):

forrtl: severe (168): Program Exception - illegal instruction
Image              PC                Routine            Line        Source
cice               0000000001130484  Unknown               Unknown  Unknown
cice               00000000009C4700  Unknown               Unknown  Unknown
Unknown            00002AAAAFFF5F8B  Unknown               Unknown  Unknown
[NID 00724] 2022-05-12 16:13:50 Apid 16619909: initiated application termination

daley_intel_smoke_gx3_6x2x50x58x1_debug_droundrobin_dynpicard_run2day

fails with SIGSEGV, got a core on one machine but not the other.

 Finished writing ./history/iceh_ic.2005-01-01-03600.nc
*** stack smashing detected ***: <unknown> terminated
forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source
cice               0000000002650764  Unknown               Unknown  Unknown
cice               0000000001EE45C0  Unknown               Unknown  Unknown
cice               0000000001ADBA36  ice_transport_rem        3473  ice_transport_remap.F90
cice               00000000019BF6B6  ice_transport_rem         755  ice_transport_remap.F90
cice               00000000023BF723  Unknown               Unknown  Unknown
cice               00000000023699CA  Unknown               Unknown  Unknown
cice               000000000236B276  Unknown               Unknown  Unknown
cice               00000000023369C5  Unknown               Unknown  Unknown
cice               00000000019A4C79  ice_transport_rem         642  ice_transport_remap.F90
cice               000000000192A728  ice_transport_dri         553  ice_transport_driver.F90
cice               00000000018B5E0C  ice_step_mod_mp_s         959  ice_step_mod.F90
cice               000000000040ED7F  cice_runmod_mp_ic         285  CICE_RunMod.F90
cice               000000000040D7C1  cice_runmod_mp_ci          85  CICE_RunMod.F90
cice               0000000000401690  MAIN__                     49  CICE.F90
cice               00000000004015F2  Unknown               Unknown  Unknown
cice               000000000273228F  Unknown               Unknown  Unknown
cice               00000000004014DA  Unknown               Unknown  Unknown
_pmiu_daemon(SIGCHLD): [NID 01334] [c6-0c2s13n2] [Thu May 12 16:31:38 2022] PE RANK 0 exit signal Aborted

"stack smashing detected" that I've never seen. Here is the backtrace:

(gdb) bt
#0  0x0000000001a05d33 in ice_transport_remap::locate_triangles (nx_block=52, ny_block=60, ilo=2, ihi=51, jlo=2, jhi=59, nghost=1, edge=..., icells=..., indxi=..., indxj=..., dpx=..., dpy=..., dxu=...,
    dyu=..., xp=..., yp=..., iflux=..., jflux=..., triarea=..., l_fixed_area=.FALSE., edgearea=..., .tmp.EDGE.len_V$2530=80)
    at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice3/cicecore/cicedynB/dynamics/ice_transport_remap.F90:1891
#1  0x00000000019bb5d7 in horizontal_remap::L_ice_transport_remap_mp_horizontal_remap__642__par_loop2_2_6 ()
    at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice3/cicecore/cicedynB/dynamics/ice_transport_remap.F90:715
#2  0x00000000023bf723 in __kmp_invoke_microtask ()
#3  0x00000000023699ca in __kmp_invoke_task_func ()
#4  0x000000000236b276 in __kmp_fork_call ()
#5  0x00000000023369c5 in __kmpc_fork_call ()
#6  0x00000000019a4c79 in ice_transport_remap::horizontal_remap (dt=3600, ntrace=26, uvel=..., vvel=..., mm=..., tm=..., l_fixed_area=.FALSE., tracer_type=..., depend=..., has_dependents=...,
    integral_order=3, l_dp_midpt=.TRUE., grid_ice=..., uvele=..., vveln=..., .tmp.GRID_ICE.len_V$6dc=256) at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice3/cicecore/cicedynB/dynamics/ice_transport_remap.F90:642
#7  0x000000000192a728 in ice_transport_driver::transport_remap (dt=3600) at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice3/cicecore/cicedynB/dynamics/ice_transport_driver.F90:553
#8  0x00000000018b5e0c in ice_step_mod::step_dyn_horiz (dt=3600) at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice3/cicecore/cicedynB/general/ice_step_mod.F90:959
#9  0x000000000040ed7f in cice_runmod::ice_step () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice3/cicecore/drivers/standalone/cice/CICE_RunMod.F90:285
#10 0x000000000040d7c1 in cice_runmod::cice_run () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice3/cicecore/drivers/standalone/cice/CICE_RunMod.F90:85
#11 0x0000000000401690 in icemodel () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice3/cicecore/drivers/standalone/cice/CICE.F90:49

cases that fail "test"

$ ./results.csh |\grep FAIL|\grep ' test'
FAIL daley_intel_restart_gx3_6x2x50x58x1_droundrobin_dynpicard test
FAIL daley_intel_smoke_gx3_6x2x50x58x1_debug_droundrobin_dynpicard_run2day test

which are the same as above (if "run" fails, "test" fails)

cases that fail "bfbcomp"

first the decomp test itself:

$ ./results.csh |\grep FAIL | \grep daley_intel_decomp
FAIL daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_slenderX1 bfbcomp daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_squarepop
FAIL daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_roundrobin bfbcomp daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_squarepop
FAIL daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_sectcart bfbcomp daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_squarepop
FAIL daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_sectrobin bfbcomp daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_squarepop
FAIL daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_spacecurve bfbcomp daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_squarepop
FAIL daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_rakeX1 bfbcomp daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_squarepop

and then the rest:

$ ./results.csh |\grep FAIL | \grep different-data
FAIL daley_intel_restart_gx3_1x1x50x58x4_droundrobin_dynpicard_thread bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_4x1x25x116x1_dslenderX1_dynpicard_thread bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_6x2x4x29x18_dspacecurve_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_8x2x8x10x20_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_5x2x33x23x4_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_4x2x19x19x10_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_20x2x5x4x30_dsectrobin_dynpicard_short bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_16x2x5x10x20_drakeX2_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_8x2x8x10x20_droundrobin_dynpicard_maskhalo bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_1x4x25x29x16_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_1x8x30x20x32_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_1x1x120x125x1_droundrobin_dynpicard_thread bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_16x2x1x1x800_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_16x2x2x2x200_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_16x2x3x3x100_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_16x2x8x8x80_dspiralcenter_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_10x1x10x29x4_dsquarepop_dynpicard_thread bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_8x1x25x29x4_drakeX2_dynpicard_thread bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_smoke_gx3_1x1x25x58x8_debug_droundrobin_dynpicard_run2day_thread bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_20x1x5x116x1_debug_dslenderX1_dynpicard_run2day_thread bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_6x2x4x29x18_debug_dspacecurve_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_8x2x10x12x16_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_5x2x33x23x4_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_4x2x19x19x10_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_20x2x5x4x30_debug_dsectrobin_dynpicard_run2day_short bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_16x2x5x10x20_debug_drakeX2_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_8x2x8x10x20_debug_droundrobin_dynpicard_maskhalo_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_1x6x25x29x16_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_1x8x30x20x32_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_1x1x120x125x1_debug_droundrobin_dynpicard_run2day_thread bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_16x2x1x1x800_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_16x2x2x2x200_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_16x2x3x3x100_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_16x2x8x8x80_debug_dspiralcenter_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_10x1x10x29x4_debug_dsquarepop_dynpicard_run2day_thread bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_8x1x25x29x4_debug_drakeX2_dynpicard_run2day_thread bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data

the good news

phil-blain commented 2 years ago

daley_intel_restart_gx3_6x2x50x58x1_droundrobin_dynpicard

Running with the memory debugging library, however, hides the segfault and the code runs correctly ...


On XC-50, all executables use static linking, so it's not possible for DDT to preload its memory debugging library; you have to relink your executable with DDT's memory debugging library. Instructions can be found in the "Static Linking" section of the Arm Forge user guide (section 12.4.1). The exact linking flags to use vary if your program is multithreaded and if it uses C++.

Note: Adding --wrap=dlopen,--wrap=dlclose for threaded programs makes the link fail, at least with the Intel Fortran compiler. This is normal, as ARM explains in the "Compiler notes and known issues" section for the Intel compilers:

If you are compiling static binaries, linking on a Cray XT/XE machine in the Arm DDT memory debugging library is not straightforward for F90 applications. You must manually rerun the last ld command (as seen with ifort -v) to include -L{ddt-path}/lib/64 -ldmalloc in two locations:

  • Include immediately prior to where -lc is located.

  • Include the -zmuldefs option at the start of the ld line.

This is not easy to understand as the wording is weird. A few notes:

In practice, -lc appears twice in the ld invocation, and it suffices to add /opt/forge/20.0.1/lib/64/libdmallocth.a (or one of the other 3 libraries) immediately before the first -lc for the link to succeed.

phil-blain commented 2 years ago

OK, I tested the failing tests above without dynpicard and they failed in the same way. This was due to OMP_STACKSIZE being unset, combined with the OpenMP directive in ice_transport_remap.F90 that is newly active since d1e972a7 (Update OMP (#680), 2022-02-18).

Uncommenting the variable in the machine file (the comment was added in 8c23df8f (- Update version and copyright. (#691), 2022-02-23)) makes both tests pass.
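For the record, a minimal sketch of the failure mode as I understand it (a toy program, not the ice_transport_remap code): private arrays in an OpenMP parallel region live on the worker threads' stacks, whose size is controlled by OMP_STACKSIZE, so a newly threaded loop with large private arrays can start segfaulting simply because that variable is no longer exported.

program omp_stacksize_demo
   ! Toy sketch, not the CICE code: each thread gets its own copy of 'work'
   ! on its thread stack.  With a small default OMP_STACKSIZE (often 4 MB)
   ! this ~8 MB private array can overflow the thread stack and kill the
   ! run with SIGSEGV; exporting a larger OMP_STACKSIZE makes it pass.
   implicit none
   integer :: iblk
   real(kind=8) :: work(1000, 1000)     ! ~8 MB per private copy
   !$OMP PARALLEL DO PRIVATE(iblk, work)
   do iblk = 1, 8
      work = real(iblk, kind=8)
   end do
   !$OMP END PARALLEL DO
   print *, 'done'
end program omp_stacksize_demo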

phil-blain commented 2 years ago

OK so with the little code modifications mentioned in https://github.com/phil-blain/CICE/issues/40#issuecomment-1175467783, which I will push tomorrow, the decomp_suite passes [1] with dynpicard when also adding these settings:

precond = diag # or ident, not yet tested
bfbflag = reprosum # maybe works with ddpdd and lsum16, not yet tested.

This is very encouraging as it shows not only that the OpenMP implementation is OK, but also that we did not "miss" anything MPI-related (like halo updates, etc) in the VP implementation.

EDIT forgot the end note:


[1] I do have one failure, ppp6_intel_restart_gx3_16x2x1x1x800_droundrobin_dynpicard but this test is a known failure even with EVP on our new machines (cf. https://gitlab.science.gc.ca/hpc/hpcr_upgrade_2/issues/244 [internal]). This is most likely a bug in Intel MPI.

phil-blain commented 2 years ago

OK, unsurprisingly it also passes with ident.

phil-blain commented 2 years ago

However, I have 3 failures with precond=pgmres:

$ ./results.csh | \grep FAIL
FAIL ppp6_intel_smoke_gx3_6x2x50x58x1_debug_droundrobin_dynpicard_reprosum_run2day run -1 -1 -1
FAIL ppp6_intel_smoke_gx3_6x2x50x58x1_debug_droundrobin_dynpicard_reprosum_run2day test
FAIL ppp6_intel_smoke_gx3_20x2x5x4x30_debug_dsectrobin_dynpicard_reprosum_run2day_short run -1 -1 -1
FAIL ppp6_intel_smoke_gx3_20x2x5x4x30_debug_dsectrobin_dynpicard_reprosum_run2day_short test
FAIL ppp6_intel_smoke_gx3_10x1x10x29x4_debug_dsquarepop_dynpicard_reprosum_run2day_thread run -1 -1 -1
FAIL ppp6_intel_smoke_gx3_10x1x10x29x4_debug_dsquarepop_dynpicard_reprosum_run2day_thread test

phil-blain commented 2 years ago

ppp6_intel_smoke_gx3_6x2x50x58x1_debug_droundrobin_dynpicard_reprosum_run2day

 Finished writing ./history/iceh_ic.2005-01-01-03600.nc
forrtl: severe (408): fort: (3): Subscript #1 of the array I8_ARR_TLSUM_LEVEL has value -42107522 which is less than the lower bound of 0
Abort(594434) on node 4 (rank 4 in comm 0): Fatal error in PMPI_Allreduce: Invalid count, error stack:
PMPI_Allreduce(433): MPI_Allreduce(sbuf=0x7ffd8720efc0, rbuf=0x7ffd8720f1e0, count=-42107517, datatype=dtype=0x4c000831, op=MPI_SUM, comm=comm=0x84000004) failed
PMPI_Allreduce(375): Negative count, value is -42107517
Abort(403247618) on node 5 (rank 5 in comm 0): Fatal error in PMPI_Allreduce: Invalid count, error stack:
PMPI_Allreduce(433): MPI_Allreduce(sbuf=0x7ffe05282140, rbuf=0x7ffe05282360, count=-42107517, datatype=dtype=0x4c000831, op=MPI_SUM, comm=comm=0x84000004) failed
PMPI_Allreduce(375): Negative count, value is -42107517

No core.

Running in DDT reveals the failure is here, line 963:

https://github.com/phil-blain/CICE/blob/bce31c2f85da934f0778f5adf938800d1977521a/cicecore/cicedynB/infrastructure/comm/mpi/ice_reprosum.F90#L943-L964

ilevel is a large negative value (so probably uninitialized)...

EDIT reading the code, it seems impossible for ilevel to be uninitialized. So it seems the error is somewhere else, and it manages to corrupt things here.

EDIT2 The MPI_ALLREDUCE calls on procs 4 and 5 abort because the count is negative. This count is veclth, calculated here: https://github.com/phil-blain/CICE/blob/bce31c2f85da934f0778f5adf938800d1977521a/cicecore/cicedynB/infrastructure/comm/mpi/ice_reprosum.F90#L904

max_levels(nflds) is negative. It is computed here in ice_reprosum_calc: https://github.com/phil-blain/CICE/blob/bce31c2f85da934f0778f5adf938800d1977521a/cicecore/cicedynB/infrastructure/comm/mpi/ice_reprosum.F90#L659-L680

just before calling ice_reprosum_int. digits(0_i8) is 63 (checked with a simple program), arr_gmax_exp(1) is 2147483647, arr_gmin_exp(1) is -41, so I think this: https://github.com/phil-blain/CICE/blob/bce31c2f85da934f0778f5adf938800d1977521a/cicecore/cicedynB/infrastructure/comm/mpi/ice_reprosum.F90#L667-L669

overflows... I'm not sure though if it's normal for arr_gmax_exp(1) to be that big...

It ultimately comes from here: https://github.com/phil-blain/CICE/blob/bce31c2f85da934f0778f5adf938800d1977521a/cicecore/cicedynB/infrastructure/comm/mpi/ice_reprosum.F90#L589-L594

MINEXPONENT(1._r8) is -1021... to be continued...

OK, so it is arr_exp = exponent(arr(isum,ifld)) which gives 2147483647 (highest value possible for a 32-bit integer) when given -nan as input.
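A small stand-alone check of that chain, in the same spirit as the "simple program" mentioned above (variable names are illustrative; exponent() of a NaN is processor-dependent, and huge(0) is simply what is observed here):

program exponent_nan_check
   ! Illustrative check: exponent() of a NaN returns huge(0) with the
   ! compiler used here (the result is processor-dependent), so the
   ! exponent range no longer fits in default 32-bit integer arithmetic
   ! and the level count computed from it can go negative.
   use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan
   implicit none
   integer, parameter :: i8 = selected_int_kind(13)
   real(kind=8) :: x
   integer :: gmax_exp, gmin_exp
   x = 0.0d0
   x = ieee_value(x, ieee_quiet_nan)
   gmax_exp = exponent(x)               ! observed: 2147483647 = huge(0)
   gmin_exp = -41                       ! value seen in the debugger
   print *, 'digits(0_i8)        =', digits(0_i8)       ! 63
   print *, 'exponent(NaN)       =', gmax_exp
   print *, 'gmax_exp - gmin_exp =', int(gmax_exp, i8) - int(gmin_exp, i8)
   print *, 'fits in 32 bits?    =', int(gmax_exp, i8) - int(gmin_exp, i8) <= int(huge(0), i8)
end program exponent_nan_check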

Note that I can't print arr(isum,ifld) from inside the loop (inside the OpenMP region), I had to go up the stack for DDT and GDB to be able to print it, or else I would get "no such vector element".

phil-blain commented 2 years ago

ppp6_intel_smoke_gx3_20x2x5x4x30_debug_dsectrobin_dynpicard_reprosum_run2day_short

similar to the above:

 Finished writing ./history/iceh_ic.2005-01-01-03600.nc
forrtl: severe (408): fort: (3): Subscript #1 of the array I8_ARR_TLSUM_LEVEL has value -41297762 which is less than the lower bound of 0

forrtl: error (76): Abort trap signal
Abort(269029890) on node 18 (rank 18 in comm 0): Fatal error in PMPI_Allreduce: Invalid count, error stack:
PMPI_Allreduce(433): MPI_Allreduce(sbuf=0x7ffe38904cc0, rbuf=0x7ffe38904ee0, count=-41297756, datatype=dtype=0x4c000831, op=MPI_SUM, comm=comm=0x84000004) failed
PMPI_Allreduce(375): Negative count, value is -41297756

Core is truncated.

phil-blain commented 2 years ago

ppp6_intel_smoke_gx3_10x1x10x29x4_debug_dsquarepop_dynpicard_reprosum_run2day_thread

 (JRA55_data) reading forcing file 1st ts = /space/hall6/sitestore/eccc/cmd/e/sice500//CICE_data/forcing/gx3/JRA55/8XDAILY/JRA55_gx3_03hr_forcing_2005.nc
[compute:33089:0:33089] Caught signal 8 (Floating point exception: floating-point invalid operation)
==== backtrace (tid:  33089) ====
 0 0x0000000000012b20 .annobin_sigaction.c()  sigaction.c:0
 1 0x0000000000f16ec9 ice_global_reductions_mp_global_sum_prod_dbl_()  /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/infrastructure/comm/mpi/ice_global_reductions.F90:895
 2 0x0000000000b3d728 ice_dyn_vp_mp_anderson_solver_()  /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:903
 3 0x0000000000b0b11c ice_dyn_vp_mp_implicit_solver_()  /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:475
 4 0x00000000018744ff ice_step_mod_mp_step_dyn_horiz_()  /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/general/ice_step_mod.F90:950
 5 0x000000000041b610 cice_runmod_mp_ice_step_()  /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/drivers/standalone/cice/CICE_RunMod.F90:285
 6 0x000000000041a055 cice_runmod_mp_cice_run_()  /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/drivers/standalone/cice/CICE_RunMod.F90:85
 7 0x000000000040e040 MAIN__()  /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/drivers/standalone/cice/CICE.F90:49
 8 0x000000000040dfa2 main()  ???:0
 9 0x00000000000237b3 __libc_start_main()  ???:0
10 0x000000000040deae _start()  ???:0
=================================
forrtl: error (75): floating point exception
Image              PC                Routine            Line        Source
cice               0000000001E9129B  Unknown               Unknown  Unknown
libpthread-2.28.s  00001501CE9A3B20  Unknown               Unknown  Unknown
cice               0000000000F16EC9  ice_global_reduct         895  ice_global_reductions.F90
cice               0000000000B3D728  ice_dyn_vp_mp_and         903  ice_dyn_vp.F90
cice               0000000000B0B11C  ice_dyn_vp_mp_imp         475  ice_dyn_vp.F90
cice               00000000018744FF  ice_step_mod_mp_s         950  ice_step_mod.F90
cice               000000000041B610  cice_runmod_mp_ic         285  CICE_RunMod.F90
cice               000000000041A055  cice_runmod_mp_ci          85  CICE_RunMod.F90
cice               000000000040E040  MAIN__                     49  CICE.F90
cice               000000000040DFA2  Unknown               Unknown  Unknown
libc-2.28.so       00001501CE1D07B3  __libc_start_main     Unknown  Unknown
cice               000000000040DEAE  Unknown               Unknown  Unknown

Only a single core file (!); fortunately it is usable:

(gdb) bt
#0  0x00001501ce1e47ff in raise () from /lib64/libc.so.6
#1  0x00001501ce1cecfe in abort () from /lib64/libc.so.6
#2  0x0000000001e8b690 in for.issue_diagnostic ()
#3  0x0000000001e9129b in for.signal_handler ()
#4  <signal handler called>
#5  0x0000000000f16ec9 in ice_global_reductions::global_sum_prod_dbl (array1=..., array2=..., dist=..., field_loc=2, mmask=<error reading variable: Location address is not set.>,
    lmask=<error reading variable: Location address is not set.>) at /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/infrastructure/comm/mpi/ice_global_reductions.F90:895
#6  0x0000000000b3d728 in ice_dyn_vp::anderson_solver (icellt=..., icellu=..., indxti=..., indxtj=..., indxui=..., indxuj=..., aiu=..., ntot=436, uocn=..., vocn=..., waterxu=..., wateryu=..., bxfix=...,
    byfix=..., umassdti=..., sol=..., fpresx=..., fpresy=..., zetax2=..., etax2=..., rep_prs=..., cb=..., halo_info_mask=...)
    at /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:903
#7  0x0000000000b0b11c in ice_dyn_vp::implicit_solver (dt=3600) at /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:475
#8  0x00000000018744ff in ice_step_mod::step_dyn_horiz (dt=3600) at /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/general/ice_step_mod.F90:950
#9  0x000000000041b610 in cice_runmod::ice_step () at /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/drivers/standalone/cice/CICE_RunMod.F90:285
#10 0x000000000041a055 in cice_runmod::cice_run () at /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/drivers/standalone/cice/CICE_RunMod.F90:85
#11 0x000000000040e040 in icemodel () at /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/drivers/standalone/cice/CICE.F90:49
(gdb) p i
$1 = 8
(gdb) p j
$2 = 12
(gdb) p iblock
$3 = 2
(gdb) p  array1(i,j,iblock)
$4 = nan(0x7baddadbaddad)
(gdb) p array2(i,j,iblock)
$5 = nan(0x7baddadbaddad)

This happens when computing the norm of the residual vector (Fx,Fy) just after it has been computed, so it's a bit mysterious... (well, not too much, since the global sum is over all points, whereas before we were summing only ice points (using icellu, indxui, indxuj))...

EDIT what is mysterious is that I should get this failure also with precond=diag and ident...
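To make the "all points vs ice points" distinction concrete, a toy sketch (hypothetical names, not the actual global_sum_prod interface): the indexed loop only ever visits ice points, while a whole-array sum also visits cells that were never given a value, which is exactly where the NaNs live in a debug build.

program masked_vs_full_sum
   ! Toy sketch, not the CICE code: 'work' stands for Fx*Fy, the index
   ! lists for icellu/indxui/indxuj.  The masked sum never touches the
   ! cells that were left uninitialized; the whole-array sum does, so any
   ! NaN (or, here, a huge placeholder) in those cells poisons the result.
   implicit none
   integer, parameter :: nx = 4, ny = 3
   integer :: icell, ij, i, j
   integer :: indxi(nx*ny), indxj(nx*ny)
   real(kind=8) :: work(nx, ny), sum_ice, sum_all

   work = huge(1.0d0)                   ! stand-in for "never initialized"
   icell = 0
   do j = 2, ny                         ! pretend only these cells are ice
      do i = 2, nx
         icell = icell + 1
         indxi(icell) = i
         indxj(icell) = j
         work(i, j) = 1.0d0             ! ice points get real values
      end do
   end do

   sum_ice = 0.0d0
   do ij = 1, icell                     ! sum over ice points only
      sum_ice = sum_ice + work(indxi(ij), indxj(ij))
   end do
   sum_all = sum(work)                  ! whole block, uninitialized cells included

   print *, 'masked sum:', sum_ice      ! 6.0
   print *, 'full sum  :', sum_all      ! garbage (overflows to +Infinity here)
end program masked_vs_full_sum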

phil-blain commented 2 years ago

So I reran with diag and got the same 3 failures. So I'm not sure what I did when I wrote https://github.com/phil-blain/CICE/issues/39#issuecomment-1175553653, but I was mixed up...

EDIT I did rebase though...

$ git logo origin/repro-vp..upstream/main
21bd95b cice.setup: remove 'suite.jobs' at start of 'suite.submit' (#731) Philippe Blain (Fri Jul 15 10:43)  (upstream/main, upstream/HEAD)
1585c31 Add unit test for optional arguments, "optargs" (#730) Tony Craig (Fri Jul 15 07:43)
d088bfb Update some CICE variable names to clarify grid (#729) Tony Craig (Fri Jul 15 07:42)
471c010 add Cgrid-related fixes for nuopc/cmeps  (#728) Denise Worthen (Thu Jun 23 14:47)

EDIT2 I see the same failure when running with diag on the version of my branch before the rebase. So that's not it.

phil-blain commented 1 year ago

OK so in the end all 3 failures are at the same place, where we compute the norm of (Fx,Fy). These are the only variables given to global_sum_prod that were not initialized to zero beforehand in the code.

It's still weird that I would get the failure only in certain decompositions though...

EDIT not that weird, since it was using uninitialized values, so anything can happen...

phil-blain commented 1 year ago

If I initialize (Fx,Fy) to 0, it fixes those errors, but then I ran the suite again from scratch and got some new failures (MPI aborts, bad departure points, etc.)

EDIT here are the failures (suite: decomp-vp-repro-init-fxy):

$ sgrep FAIL results.log
FAIL ppp6_intel_restart_gbox180_16x1x6x6x60_debugblocks_dspacecurve_dynpicard_reprosum run
FAIL ppp6_intel_restart_gbox180_16x1x6x6x60_debugblocks_dspacecurve_dynpicard_reprosum test
FAIL ppp6_intel_decomp_gx3_4x2x25x29x5_dynpicard_reprosum_slenderX1 bfbcomp ppp6_intel_decomp_gx3_4x2x25x29x5_dynpicard_reprosum_squarepop
FAIL ppp6_intel_decomp_gx3_4x2x25x29x5_dynpicard_reprosum_roundrobin bfbcomp ppp6_intel_decomp_gx3_4x2x25x29x5_dynpicard_reprosum_squarepop
FAIL ppp6_intel_restart_gx3_5x2x33x23x4_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_droundrobin_dynpicard_reprosum run
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_droundrobin_dynpicard_reprosum test

ppp6_intel_restart_gbox180_16x1x6x6x60_debugblocks_dspacecurve_dynpicard_reprosum

MPI abort at time step 121:

 Restart read/written          120    20050106
Abort(17) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Waitall: See the MPI_ERROR field in MPI_Status for the error code

I re-ran the test 3 other times and they all passed.

ppp6_intel_restart_gx3_16x2x3x3x100_droundrobin_dynpicard_reprosum

bad departure point at time step 48:

 Warning: Departure points out of bounds in remap
 my_task, i, j =          13           2           3
 dpx, dpy =  -97023.9228738988        214611.660265582
 HTN(i,j), HTN(i+1,j) =   213803.742672313        214293.709816433
 HTE(i,j), HTE(i,j+1) =   144794.922414856        143905.700041556
 istep1, my_task, iblk =          58          13          58
 Global block:        1155
 Global i and j:          97         101

I re-ran the test 4 other times,


I then ran a new decomp suite twice (one to generate baseline, one to compare with the baseline) and got some differences between both. So it points to some non-reproducibility even on the same number of procs...

EDIT I did that (that = run the decomp suite twice, once with bgen and once with bcmp) first with my new code, suites: decomp-vp-repro-rerun-[12].

In decomp-vp-repro-rerun-1 (bgen) I get:

$ sgrep FAIL results.log
FAIL ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_slenderX1 bfbcomp ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_squarepop
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data

So:

Note: I did also get "bad departure points" twice in

In decomp-vp-repro-rerun-2 (bcmp) I get:

$ sgrep FAIL results.log
FAIL ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_slenderX1 bfbcomp ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_squarepop
FAIL ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_roundrobin bfbcomp ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_squarepop
FAIL ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_roundrobin compare decomp-vp-repro-rerun-1 2.77 1.65 0.63 different-data
FAIL ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_rakeX1 bfbcomp ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_squarepop
FAIL ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_rakeX1 compare decomp-vp-repro-rerun-1 3.14 1.71 0.87 different-data
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_reprosum complog decomp-vp-repro-rerun-1 different-data
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_reprosum compare decomp-vp-repro-rerun-1 12.39 8.10 1.75 different-data
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum complog decomp-vp-repro-rerun-1 different-data
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum compare decomp-vp-repro-rerun-1 18.84 11.88 3.79 different-data
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x1x1x800_diag1_droundrobin_dynpicard_reprosum run
FAIL ppp6_intel_restart_gx3_16x2x1x1x800_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum complog decomp-vp-repro-rerun-1 different-data
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum compare decomp-vp-repro-rerun-1 16.61 10.01 1.49 different-data
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data

So:


This prompted me to run 2 decomp suites (bgen/bcmp) on main, i.e. without my changes but with dynpicard, and I still got non-reproducible results, suites: decomp-at-21bd95b-[23].

In addition, in decomp-at-21bd95b-3, I also got an MPI abort at time step 121 for case ppp5_intel_restart_gbox180_16x1x6x6x60_debugblocks_diag1_dspacecurve_dynpicard:

istep1:       121    idate:  20050106    sec:      3600
Abort(17) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Waitall: See the MPI_ERROR field in MPI_Status for the error code

Note that this is the same test as above in decomp-vp-repro-init-fxy.

When I re-submitted the test, both runs (it's a restart test) ran correctly to completion....

phil-blain commented 1 year ago

I get the same results (bgen then bcmp from the same commit on main, with dynpicard giving different results) with the nothread_suite, which does not have any OpenMP test cases (modulo those compiled with the OpenMP flag active but actually using only one thread). This is for the code from main, i.e. even before my changes. EDIT suites: nothread-at-21bd95b-[34].

I then ran the same suite with a (self-compiled) OpenMPI instead of Intel MPI and it seems that I do not get any of these errors (still on main and with dynpicard). [EDIT suites: nothread-at-21bd95b-ompi-tm[-2] ] I'll repeat these tests, but it does point to something with Intel MPI...

phil-blain commented 1 year ago

OK, so I dug a bit into this and found this Intel MPI variable: I_MPI_CBWR=1. It disables "topology-aware collectives" (i.e. MPI_ALLREDUCE and friends) and makes sure that re-running the same code on the same number of procs on the same machine leads to reproducible results. See:

Apparently OpenMPI does that out of the box, and it seems Cray MPT does too, at least under the circumstances in which we were running on daley/banting (exclusive nodes). I did find some references to MPICH_ALLREDUCE_NO_SMP and MPICH_REDUCE_NO_SMP at https://gitlab.science.gc.ca/hpc_migrations/hpcr_upgrade_1/wikis/getting-started/compiler_notes#xc50-daleybanting [internal] and in a very few places on the web and on GitHub; despite the name, this is specific to Cray MPT. It is apparently documented in man mpi or man intro_mpi on the CLE.


I ran 2 nothread suites with intel + intel MPI + I_MPI_CBWR=1 at essentially 21bd95b (main with https://github.com/CICE-Consortium/CICE/pull/745 on top) and both runs passed the baseline compare for all tests [EDIT suites: nothread-at-21bd95b-impi-cbwr[-2]]. So it seems the variable indeed works.

apcraig commented 1 year ago

Thanks @phil-blain, that's some rough debugging. Yuck.

Do we understand why the dynpicard is particularly susceptible? Why don't we see this with some other configurations?

phil-blain commented 1 year ago

Because dynpicard uses global sums (MPI_ALLREDUCE) in its algorithm, whereas the rest of the code only uses them for diagnostics. Also, the base_suite only does cmprest (bit-for-bit comparison of restarts) and not cmplog (bit-for-bit comparison of logs), and I only ever ran the base_suite on our previous Cray machines.

I'll retrace my steps from here; I think I've got to the bottom of it now.

apcraig commented 1 year ago

OK. I hope MPI Reductions are bit-for-bit for the same pe count / decomposition. You are finding that to be true, correct? Just to clarify, are you just seeing different results with different pe counts/decompositions? Is the global reduction in dynpicard using the internal CICE global sum method yet?

phil-blain commented 1 year ago

OK. I hope MPI Reductions are bit-for-bit for the same pe count / decomposition. You are finding that to be true, correct?

Not for Intel MPI, no, unless I set this I_MPI_CBWR=1 variable. With this environment variable, results are reproducible on the same pe count + decomp.

Just to clarify, are you just seeing different results with different pe counts/decompositions?

Yes, with the code in main, running the decomp suite with dynpicard, all bfbcomp tests fail.

Is the global reduction in dynpicard using the internal CICE global sum method yet?

Not with the code on main, no. And that's why the bfbcomp tests fail. I have updated the code to correctly use the CICE global sum implementation, see https://github.com/phil-blain/CICE/issues/40#issuecomment-1175467783. Once I'm sure I get no failures with this code, I'll make a PR. It also leads to a serious performance regression for dynpicard, so I'd like to understand that a bit more also before I open my PR.

apcraig commented 1 year ago

OK. I hope MPI Reductions are bit-for-bit for the same pe count / decomposition. You are finding that to be true, correct?

Not for Intel MPI, no, unless I set this I_MPI_CBWR=1 variable. With this environment variable, results are reproducible on the same pe count + decomp.

Interesting and surprising! What machine is that? In my experience, this is a requirement of MPI in most installations and I've never seen non-reproducibility for POP-based runs, and I check it a lot (in CESM/RASM/etc). POP has a lot of global sums, so it's a good test. I assume this is just a setting on this one particular machine?

Just to clarify, are you just seeing different results with different pe counts/decompositions?

Yes, with the code in main, running the decomp suite with dynpicard, all bfbcomp tests fail.

That's what I'd expect. I think the bfbcomp testing has benefited from the fact that there were no global sums (or similar) in CICE up to now.

Is the global reduction in dynpicard using the internal CICE global sum method yet?

Not with the code on main, no. And that's why the bfbcomp tests fail. I have updated the code to correctly use the CICE global sum implementation, see #40 (comment). Once I'm sure I get no failures with this code, I'll make a PR. It also leads to a serious performance regression for dynpicard, so I'd like to understand that a bit more also before I open my PR.

Let me know if I can help. I think it's perfectly fine to do some "bfbcomp" testing with slower global sums for the dynpicard in particular, but to use the fastest global sums in production and other testing. The separate issue is whether the CICE global sum implementation is slower than it should be. Thanks.

phil-blain commented 1 year ago

OK. I hope MPI Reductions are bit-for-bit for the same pe count / decomposition. You are finding that to be true, correct?

Not for Intel MPI, no, unless I set this I_MPI_CBWR=1 variable. With this environment variable, results are reproducible on the same pe count + decomp.

Interesting and surprising! What machine is that? In my experience, this is a requirement of MPI in most installations and I've never seen non-reproducibility for POP-based runs, and I check it a lot (in CESM/RASM/etc). POP has a lot of global sums, so it's a good test. I assume this is just a setting on this one particular machine?

It's one of our new Lenovo clusters (see https://www.hpcwire.com/off-the-wire/canadian-weather-forecasts-to-run-on-nvidia-powered-system/). I was also surprised, but if you follow the links to stackoverflow/stackexchange which I posted above, it is clearly indicated in the MPI standard that it is only a recommendation that repeated runs yield the same results for collective reductions. Apparently OpenMPI follows that recommendation, but Intel MPI has to be convinced with that variable. It's an environment variable for Intel MPI, so no it's not specific to that machine.

With Intel MPI, the non-reproducibility is (as far as I understand) linked to the pinning of MPI processes to specific CPUs. So if from run to run the ranks are pinned to different CPUs, the reductions might give different results because the reduction algorithms take advantage of the processor topology. If you always run on machines with exclusive node access, it's possible that the pinning is always the same, so you do not notice the difference. That was the case on our previous Crays.
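For reference, the underlying numerical fact is simply that floating-point addition is not associative, so a reduction whose tree shape depends on process placement can legitimately change the last bits of the result from run to run. A trivial stand-alone demonstration (nothing CICE-specific):

program fp_not_associative
   ! The same four numbers summed with two different bracketings give two
   ! different answers; an MPI reduction that changes its internal order
   ! based on process topology is doing exactly this.
   implicit none
   real(kind=8) :: a(4), s1, s2
   a = [1.0d0, 1.0d-16, 1.0d-16, -1.0d0]
   s1 = ((a(1) + a(2)) + a(3)) + a(4)   ! one reduction order
   s2 = (a(1) + (a(2) + a(3))) + a(4)   ! another reduction order
   print *, 's1 =', s1                  ! 0.0
   print *, 's2 =', s2                  ! ~2.2e-16
   print *, 'bit-for-bit identical?', s1 == s2
end program fp_not_associative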

Just to clarify, are you just seeing different results with different pe counts/decompositions?

Yes, with the code in main, running the decomp suite with dynpicard, all bfbcomp tests fail.

That's what I'd expect. I think the bfbcomp testing has benefited from the fact that there were no global sums (or similar) in CICE up to now.

Indeed.

Is the global reduction in dynpicard using the internal CICE global sum method yet?

Not with the code on main, no. And that's why the bfbcomp tests fail. I have updated the code to correctly use the CICE global sum implementation, see #40 (comment). Once I'm sure I get no failures with this code, I'll make a PR. It also leads to a serious performance regression for dynpicard, so I'd like to understand that a bit more also before I open my PR.

Let me know if I can help. I think it's perfectly fine to do some "bfbcomp" testing with slower global sums for the dynpicard in particular, but to use the fastest global sums in production and other testing. The separate issue is whether the CICE global sum implementation is slower than it should be. Thanks.

Yes, that's my plan. But I noticed that even with bfbflag off, the new code is still slower (see https://github.com/phil-blain/CICE/issues/40#issuecomment-1188260610 and later comments). I'll get back to this soon and I'll let you know if / how I could use help. Maybe I'll open a "draft" PR with my changes and we can discuss the performance implications there. Thanks!

phil-blain commented 1 year ago

OK, retracing my steps. I ran 2 decomp suites (bgen/bcmp) with dynpicard and I_MPI_CBWR=1, on main (still technically 21bd95b with https://github.com/CICE-Consortium/CICE/pull/745 on top) [suites: decomp-at-21bd95b-vp-impi-cbwr-[12]].

phil-blain commented 1 year ago

Next step: back to my new code. I ran a decomp suite with dynpicard,reprosum and I_MPI_CBWR=1 [suite: decomp-vp-repro-impi-cbwr-1].

 $ sgrep FAIL results.log
FAIL ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_rakeX1 bfbcomp ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_squarepop
FAIL ppp6_intel_restart_gx3_5x2x33x23x4_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum test
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x1x1x800_diag1_droundrobin_dynpicard_reprosum run
FAIL ppp6_intel_restart_gx3_16x2x1x1x800_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data

This is a bit unfortunate, especially the restart failures. To me this hints at a bug in the code.

phil-blain commented 1 year ago

I next ran the same thing, but adding -s debug [suite: decomp-vp-repro-debug-impi-cbwr-1]:

$ sgrep FAIL results.log
FAIL ppp6_intel_restart_gbox180_16x1x6x6x60_debug_debugblocks_diag1_dspacecurve_dynpicard_reprosum run
FAIL ppp6_intel_restart_gbox180_16x1x6x6x60_debug_debugblocks_diag1_dspacecurve_dynpicard_reprosum test
FAIL ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day bfbcomp ppp6_intel_smoke_gx3_4x2x25x29x4_debug_diag1_dslenderX2_dynpicard_reprosum_run2day different-data
phil-blain commented 1 year ago

and I next re-ran a debug suite, with -init=snan,arrays added in the Macros file [suite: decomp-vp-repro-debug-init-snan]:

$ sgrep FAIL results.log
FAIL ppp6_intel_restart_gbox180_16x1x6x6x60_debug_debugblocks_diag1_dspacecurve_dynpicard_reprosum_short run
FAIL ppp6_intel_restart_gbox180_16x1x6x6x60_debug_debugblocks_diag1_dspacecurve_dynpicard_reprosum_short test
FAIL ppp6_intel_restart_gx3_16x2x1x1x800_debug_diag1_droundrobin_dynpicard_reprosum_short run
FAIL ppp6_intel_restart_gx3_16x2x1x1x800_debug_diag1_droundrobin_dynpicard_reprosum_short test

nothing unexpected here; all bfbcomp tests and all restart tests passed this time.

phil-blain commented 1 year ago

I reran a second identical suite, baseline comparing with the previous one [suite: decomp-vp-repro-debug-impi-cbwr-2]:

$ sgrep FAIL results.log
FAIL ppp6_intel_restart_gbox180_16x1x6x6x60_debug_debugblocks_diag1_dspacecurve_dynpicard_reprosum run
FAIL ppp6_intel_restart_gbox180_16x1x6x6x60_debug_debugblocks_diag1_dspacecurve_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_1x4x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_1x4x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum complog decomp-vp-repro-debug-cbwr-dynpicard different-data
FAIL ppp6_intel_restart_gx3_1x4x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum compare decomp-vp-repro-debug-cbwr-dynpicard 398.80 321.82 43.06 different-data
FAIL ppp6_intel_restart_gx3_1x4x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_debug_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x1x1x800_debug_diag1_droundrobin_dynpicard_reprosum run
FAIL ppp6_intel_restart_gx3_16x2x1x1x800_debug_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day complog decomp-vp-repro-debug-cbwr-dynpicard different-data
FAIL ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day compare decomp-vp-repro-debug-cbwr-dynpicard 109.13 76.86 17.08 different-data

differences start at the second time step, and they do not start at the last decimal at all:

diff --git 1/home/phb001/data/ppp6/cice/baselines//decomp-vp-repro-debug-cbwr-dynpicard/ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day/cice.runlog.220805-174305 2/home/phb001/data/ppp6/cice/runs//ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day.220808-115550/cice.runlog.220808-155837
index 5924f69..d515b0d 100644
--- 1/home/phb001/data/ppp6/cice/baselines//decomp-vp-repro-debug-cbwr-dynpicard/ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day/cice.runlog.220805-174305
+++ 2/home/phb001/data/ppp6/cice/runs//ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day.220808-115550/cice.runlog.220808-155837
@@ -922,47 +922,47 @@ heat used (W/m^2)      =        2.70247926206599542      21.66078047047012589
 istep1:         2    idate:  20050101    sec:      7200
  (JRA55_data) reading forcing file 1st ts = /space/hall6/sitestore/eccc/cmd/e/sice500//CICE_data/forcing/gx3/JRA55/8XDAILY/JRA55_gx3_03hr_forcing_2005.nc
                                              Arctic                 Antarctic
-total ice area  (km^2) =    1.55991254493588358E+07   1.56018697755621299E+07
+total ice area  (km^2) =    1.55989861815027408E+07   1.56018575740428381E+07
 total ice extent(km^2) =    1.57251572666864432E+07   1.93395172319125347E+07
-total ice volume (m^3) =    1.48535756598763418E+13   2.40341818246218164E+13
-total snw volume (m^3) =    1.96453741997257983E+12   5.12234084165053809E+12
-tot kinetic energy (J) =    1.02831514062509187E+14   2.19297132090383406E+14
-rms ice speed    (m/s) =        0.12005519150472969       0.13595187216180987
-average albedo         =        0.96921950449670136       0.80142868450106208
-max ice volume     (m) =        3.77905590440176198       2.86245209411921220
-max ice speed    (m/s) =        0.49255344388362082       0.34786466500096180
+total ice volume (m^3) =    1.48535756598763867E+13   2.40341818246218164E+13
+total snw volume (m^3) =    1.96452855810116943E+12   5.12233980000631152E+12
+tot kinetic energy (J) =    1.10879416990511859E+14   2.30027827620289031E+14
+rms ice speed    (m/s) =        0.12466465545409244       0.13923836296261000
+average albedo         =        0.96921968750838638       0.80142904447165109
+max ice volume     (m) =        3.77907249513972854       2.86247619373129503
+max ice speed    (m/s) =        0.48479403870651594       0.35054796852363712
 max strength    (kN/m) =      129.27453302836647708      58.25651456094256275
  ----------------------------
 arwt rain h2o kg in dt =    1.45524672839061462E+11   5.77214180149894043E+11

This is really hard for me to understand; I would expect any numerical error to accumulate slowly and start in the last decimals...

phil-blain commented 1 year ago

The above suite was mistakenly run without I_MPI_CBWR set. I ran a third suite with the variable set [suite: decomp-vp-repro-debug-impi-cbwr-3], again comparing to the first one:

$ sgrep FAIL results.log
FAIL ppp6_intel_restart_gbox180_16x1x6x6x60_debug_debugblocks_diag1_dspacecurve_dynpicard_reprosum run
FAIL ppp6_intel_restart_gbox180_16x1x6x6x60_debug_debugblocks_diag1_dspacecurve_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_1x4x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum complog decomp-vp-repro-debug-cbwr-dynpicard different-data
FAIL ppp6_intel_restart_gx3_1x4x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum compare decomp-vp-repro-debug-cbwr-dynpicard 398.80 321.82 43.06 different-data
FAIL ppp6_intel_restart_gx3_16x2x1x1x800_debug_diag1_droundrobin_dynpicard_reprosum run
FAIL ppp6_intel_restart_gx3_16x2x1x1x800_debug_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day complog decomp-vp-repro-debug-cbwr-dynpicard different-data
FAIL ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day compare decomp-vp-repro-debug-cbwr-dynpicard 109.13 76.86 17.08 different-data
phil-blain commented 1 year ago

I then ran 2 suites with I_MPI_FABRICS=ofi, which fixes the failures in MPI_WAITALL for some reason [suites: decomp-vp-repro-fabrics-ofi[-2]].

first suite:

FAIL ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_rakeX1 bfbcomp ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_squarepop
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_20x2x5x4x30_diag1_dsectrobin_dynpicard_reprosum_short test
FAIL ppp6_intel_restart_gx3_20x2x5x4x30_diag1_dsectrobin_dynpicard_reprosum_short bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_1x4x25x29x16_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_1x8x30x20x32_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_1x8x30x20x32_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum run
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day bfbcomp ppp6_intel_smoke_gx3_4x2x25x29x4_debug_diag1_dslenderX2_dynpicard_reprosum_run2day different-data

Second suite:

FAIL ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_rakeX1 bfbcomp ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_squarepop
FAIL ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_rakeX1 compare decomp-vp-repro-ofi 3.33 1.88 0.89 different-data
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_reprosum complog decomp-vp-repro-ofi different-data
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_reprosum compare decomp-vp-repro-ofi 13.26 9.05 1.76 different-data
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum complog decomp-vp-repro-ofi different-data
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum compare decomp-vp-repro-ofi 17.92 11.25 3.70 different-data
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_20x2x5x4x30_diag1_dsectrobin_dynpicard_reprosum_short complog decomp-vp-repro-ofi different-data
FAIL ppp6_intel_restart_gx3_20x2x5x4x30_diag1_dsectrobin_dynpicard_reprosum_short compare decomp-vp-repro-ofi 10.86 7.92 0.86 different-data
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum complog decomp-vp-repro-ofi different-data
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum compare decomp-vp-repro-ofi 12.81 8.83 1.71 different-data
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_1x4x25x29x16_diag1_droundrobin_dynpicard_reprosum complog decomp-vp-repro-ofi different-data
FAIL ppp6_intel_restart_gx3_1x4x25x29x16_diag1_droundrobin_dynpicard_reprosum compare decomp-vp-repro-ofi 44.34 32.05 6.93 different-data
FAIL ppp6_intel_restart_gx3_1x8x30x20x32_diag1_droundrobin_dynpicard_reprosum complog decomp-vp-repro-ofi different-data
FAIL ppp6_intel_restart_gx3_1x8x30x20x32_diag1_droundrobin_dynpicard_reprosum compare decomp-vp-repro-ofi 50.06 39.24 4.69 different-data
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum run
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum run
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day complog decomp-vp-repro-ofi different-data
FAIL ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day compare decomp-vp-repro-ofi 131.35 93.69 20.38 different-data

ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum:

istep1:         2    idate:  20050101    sec:      7200
 (JRA55_data) reading forcing file 1st ts = /space/hall6/sitestore/eccc/cmd/e/sice500//CICE_data/forcing/gx3/JRA55/8XDAILY/JRA55_gx3_03hr_forcing_2005.nc

 Warning: Departure points out of bounds in remap
 my_task, i, j =          15           3           2
 dpx, dpy =  -127523.129089799       -25378.2381045416
 HTN(i,j), HTN(i+1,j) =   116473.775950696        114412.135485485
 HTE(i,j), HTE(i,j+1) =   163630.115888484        165329.256508164
 istep1, my_task, iblk =           2          15          62
 Global block:        1228
 Global i and j:          11         109

 (abort_ice)ABORTED:
 (abort_ice) error = (horizontal_remap)ERROR: bad departure points
Abort(128) on node 15 (rank 15 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 15
phil-blain commented 1 year ago

I then took a step back and ran the nothread_suite with my new code and without reprosum [suites: nothread-vp-repro[-2]].

I took the time to fix two bugs:

My initial fix for the first bug (https://github.com/phil-blain/CICE/commit/ef5858ece94a0d4431127182f54fc3639bb37574) was not sufficient as I still had two failures:

This led me to complete the bugfix in 52fd683: (bx,by) were uninitialized on cells with no ice.
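
To make the nature of this second bug concrete, here is a minimal, self-contained Fortran sketch (made-up array and index names, not the actual CICE routine): when an array is only assigned over a packed list of ice points, every other cell keeps whatever was in memory unless the array is zeroed first, and any later whole-array operation (halo update, global sum) then reads garbage.

program packed_index_init
   implicit none
   integer, parameter :: nx = 5, ny = 5, nice = 3
   ! Packed list of "ice" points, mimicking the (indxui, indxuj) lists in CICE
   integer :: indxi(nice) = [2, 3, 4]
   integer :: indxj(nice) = [2, 3, 4]
   real(kind=8) :: bx(nx, ny)
   integer :: ij

   ! Without this line, cells outside the packed list are undefined
   ! (signalling NaNs when compiling with -init=snan,arrays)
   bx = 0.0d0

   do ij = 1, nice
      bx(indxi(ij), indxj(ij)) = 1.0d0   ! only ice points are assigned
   end do

   ! Well-defined only because bx was zeroed above
   print *, 'sum over all cells =', sum(bx)
end program packed_index_init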

First suite (note: compiled at ef5858e, only ppp6_intel_smoke_gx3_24x1_bgcskl_debug_diag1_dynpicard and ppp6_intel_smoke_gx3_32x1_alt05_debug_diag1_dynpicard_short recompiled at 52fd683):

FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squareice bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_slenderX2 bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_slenderX1 bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_roundrobin bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_sectcart bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_sectrobin bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_spacecurve bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_rakeX2 bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_rakeX1 bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_rakepop bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_rakeice bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_restart_gx3_1x1x50x58x4_diag1_droundrobin_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_4x1x25x116x1_diag1_dslenderX1_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_12x1x4x29x9_diag1_dspacecurve_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_16x1x8x10x10_diag1_droundrobin_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_6x1x50x58x1_diag1_droundrobin_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_8x1x19x19x5_diag1_droundrobin_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_20x1x5x29x20_diag1_dsectrobin_dynpicard_short bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_32x1x5x10x12_diag1_drakeX2_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_16x1x8x10x10_diag1_droundrobin_dynpicard_maskhalo bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_4x1x25x29x4_diag1_droundrobin_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data

Second suite:

$ sgrep FAIL results.log
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squareice bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_slenderX2 bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_slenderX1 bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_roundrobin bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_sectcart bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_sectrobin bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_spacecurve bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_rakeX2 bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_rakeX1 bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_rakepop bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_rakeice bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_restart_gx3_1x1x50x58x4_diag1_droundrobin_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_4x1x25x116x1_diag1_dslenderX1_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_12x1x4x29x9_diag1_dspacecurve_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_16x1x8x10x10_diag1_droundrobin_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_6x1x50x58x1_diag1_droundrobin_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_8x1x19x19x5_diag1_droundrobin_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_20x1x5x29x20_diag1_dsectrobin_dynpicard_short bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_32x1x5x10x12_diag1_drakeX2_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_16x1x8x10x10_diag1_droundrobin_dynpicard_maskhalo bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_4x1x25x29x4_diag1_droundrobin_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
phil-blain commented 1 year ago

Then I again ran 2 nothread suites (bgen/bcmp), but with reprosum [suites: nothread-vp-repro-reprosum-[12]].

All passed (bfbcomp, restart, compares).

phil-blain commented 1 year ago

Next, I ran the decomp suite with reprosum [suite: decomp-vp-repro-reprosum-1]

 $ sgrep FAIL results.log
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_reprosum run
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum run
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum test
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data

Both run failures are "bad departure points".

I recompiled the first one with -init=snan,arrays and this uncovered (by accident!) a bug in ice_grid (l_readCenter is not initialized unless we go through popgrid_nc). This led to TLAT being NaN in gridbox_corners. I'll fix that and retry.

EDIT PR for that bugfix is here: https://github.com/CICE-Consortium/CICE/pull/758
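
For reference, a schematic of this bug class (not the actual ice_grid code; names simplified): a module flag that only gets a value along one code path is undefined when tested along another, unless it is given a default.

module grid_flags
   implicit none
   ! Giving the flag a default here is the simple guard; without the
   ! "= .false." its value is undefined unless read_grid_nc() was called
   logical :: l_read_center = .false.
contains
   subroutine read_grid_nc()
      ! Only this reader ever sets the flag explicitly
      l_read_center = .true.
   end subroutine read_grid_nc

   subroutine compute_corners()
      ! Called for every grid type; relies on the flag being defined
      if (.not. l_read_center) then
         print *, 'deriving cell centers instead of reading them'
      end if
   end subroutine compute_corners
end module grid_flags

program demo
   use grid_flags, only: compute_corners
   implicit none
   call compute_corners()   ! safe only because of the default above
end program demo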

phil-blain commented 1 year ago

With that bug fixed (a4cf10e) I ran a second suite (bcmp) [suite: decomp-vp-repro-reprosum-2]

FAIL ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_rakeX1 bfbcomp ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_squarepop
FAIL ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_rakeX1 compare decomp-vp-repro-reprosum 3.35 1.97 0.86 different-data
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum complog decomp-vp-repro-reprosum different-data
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum compare decomp-vp-repro-reprosum 20.32 13.45 3.79 different-data
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum test
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum complog decomp-vp-repro-reprosum different-data
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum compare decomp-vp-repro-reprosum 103.40 77.58 13.17 different-data
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum complog decomp-vp-repro-reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum compare decomp-vp-repro-reprosum 16.51 10.52 1.37 different-data
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum complog decomp-vp-repro-reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum compare decomp-vp-repro-reprosum 12.86 8.47 1.10 different-data
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
phil-blain commented 1 year ago

I ran grep -r -L "min/max TLAT:" */*/logs/cice.runlog* to find all tests in all suites for which the code did not go through Tlonlat because l_readCenter happened to be initialized to .true. This did not reveal anything interesting, as most runs were not in debug mode.

I then ran grep -r -l "bad departure points" */*/logs/cice.runlog* to get a feel for which test cases still experience "bad departure points":

decomp-vp-repro-fabrics-ofi-2/ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum.220809-085602/logs/cice.runlog.220809-125951
decomp-vp-repro-fabrics-ofi-2/ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum.220809-085602/logs/cice.runlog.220809-125952
decomp-vp-repro-fabrics-ofi/ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum.220809-082813/logs/cice.runlog.220809-123209
decomp-vp-repro-init-fxy/ppp6_intel_restart_gx3_16x2x3x3x100_droundrobin_dynpicard_reprosum.20220726-1/logs/cice.runlog.220726-151404
decomp-vp-repro-init-fxy/ppp6_intel_restart_gx3_16x2x3x3x100_droundrobin_dynpicard_reprosum.20220726-1/logs/cice.runlog.220726-174327
decomp-vp-repro-reprosum/ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum.220810-133944/logs/cice.runlog.220810-174337
decomp-vp-repro-reprosum/ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_reprosum.220810-133944/logs/cice.runlog.220810-174336
decomp-vp-repro-rerun-1/ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum.20220726-2/logs/cice.runlog.220726-180226
decomp-vp-repro-rerun-1/ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum.20220726-2/logs/cice.runlog.220728-164400

It seems it is these 4 tests:

A few remarks:

phil-blain commented 1 year ago

So I ran the decomp suite without reprosum [suite: decomp-vp-repro-no-reprosum].

phil-blain commented 1 year ago

OK, let's get to the bottom of the "bad departure points" error.

I cooked up a stress test suite by creating one set_nml option per output field (f_* = 'd') and then creating a suite that runs a smoke_gx3_16x2x3x3x100 test 144 times, once per output field option (this guarantees separate test directory names). Since histfreq is not changed from the default 'm', the additional option should play no role whatsoever: it's just an output field, and anyway we run for less than one month.

I used the smoke test instead of restart just to simplify things (the failure usually happens in the first run of the restart test anyway), and I added run10day so it runs for the same length as the restart test.

OK so this points to some weird OpenMP stuff in the new code.

phil-blain commented 1 year ago

So I scrutinized my commits, and found the error: 693fd29

diff --git a/cicecore/cicedynB/dynamics/ice_dyn_vp.F90 b/cicecore/cicedynB/dynamics/ice_dyn_vp.F90
index d90a2a8..87c87ec 100644
--- a/cicecore/cicedynB/dynamics/ice_dyn_vp.F90
+++ b/cicecore/cicedynB/dynamics/ice_dyn_vp.F90
@@ -878,6 +878,8 @@ subroutine anderson_solver (icellt  , icellu , &
                             vrel     (:,:,iblk))

             ! Compute nonlinear residual norm (PDE residual)
+            Fx = c0
+            Fy = c0
             call matvec (nx_block             , ny_block           , &
                          icellu       (iblk)  , icellt       (iblk), &
                          indxui     (:,iblk)  , indxuj     (:,iblk), &

The problem is that we are inside an OpenMP parallel loop over blocks here, but we initialize the whole F[xy] arrays from every thread, so the threads race to write the same memory (essentially undefined behaviour, I think). What was happening when I had "bad departure points" was that the nlres_norm computed afterwards:

             call residual_vec (nx_block           , ny_block          , &
                                icellu       (iblk),                     &
                                indxui     (:,iblk), indxuj    (:,iblk), &
                                bx       (:,:,iblk), by      (:,:,iblk), &
                                Au       (:,:,iblk), Av      (:,:,iblk), &
                                Fx       (:,:,iblk), Fy      (:,:,iblk))
          enddo
          !$OMP END PARALLEL DO
          nlres_norm = sqrt(global_sum_prod(Fx(:,:,:), Fx(:,:,:), distrb_info, field_loc_NEcorner) + &
                            global_sum_prod(Fy(:,:,:), Fy(:,:,:), distrb_info, field_loc_NEcorner))
          if (my_task == master_task .and. monitor_nonlin) then
             write(nu_diag, '(a,i4,a,d26.16)') "monitor_nonlin: iter_nonlin= ", it_nl, &
                                               " nonlin_res_L2norm= ", nlres_norm
          endif

was identically zero. I verified this by running my stress test suite with monitor_nonlin = .true.: somehow the F[xy] arrays ended up being all zeros, even though the "last" thread to go through the code should at least have written correctly to its own section of the arrays (weird!). And then we would exit the nonlinear iterations too early:

          ! Compute relative tolerance at first iteration
          if (it_nl == 0) then
             tol_nl = reltol_nonlin*nlres_norm
          endif

          ! Check for nonlinear convergence
          if (nlres_norm < tol_nl) then
             exit

In the failing runs, the abort happened after the solver exited after only 1 nonlinear iteration, so I guess the solution was not converged enough, and that led to the "bad departure points" error.
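
For anyone hitting a similar issue, here is a minimal, self-contained Fortran sketch of the same pattern (made-up sizes and names, not the CICE code): zeroing a whole shared array from inside an !$OMP PARALLEL DO over blocks races with threads that have already filled their block, whereas zeroing once before the parallel region (or zeroing only the thread's own block slice) is safe.

program omp_init_race
   implicit none
   integer, parameter :: nx = 4, ny = 4, nblocks = 8
   real(kind=8) :: Fx(nx, ny, nblocks)
   integer :: iblk

   ! Racy pattern (schematic of the bug): every thread zeroes the WHOLE
   ! array, possibly wiping out blocks already filled by other threads
   !$OMP PARALLEL DO PRIVATE(iblk)
   do iblk = 1, nblocks
      Fx = 0.0d0                      ! races with other threads
      Fx(:,:,iblk) = real(iblk, 8)    ! this block's contribution
   end do
   !$OMP END PARALLEL DO
   print *, 'racy sum =', sum(Fx)     ! may change from run to run

   ! Safe pattern: zero once outside the parallel region, then let each
   ! thread write only to its own block
   Fx = 0.0d0
   !$OMP PARALLEL DO PRIVATE(iblk)
   do iblk = 1, nblocks
      Fx(:,:,iblk) = real(iblk, 8)
   end do
   !$OMP END PARALLEL DO
   print *, 'safe sum =', sum(Fx)     ! always nx*ny*nblocks*(nblocks+1)/2
end program omp_init_race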

phil-blain commented 1 year ago

Fixed in be571c5

phil-blain commented 1 year ago

With this fix, the decomp_suite PASSes completely [decomp-vp-repro-reprosum-init-fxy-fix]! I ran it twice (bgen/bcmp) and all compare tests also PASS [decomp-vp-repro-reprosum-[34]]!

So it seems I got to the bottom of everything.

EDIT decomp-vp-repro-reprosum-[34] were in fact run with evp, not vp. I'll redo them.

OK, new suites decomp-vp-repro-reprosum-cbwr-[12], all PASS.

apcraig commented 1 year ago

Excellent @phil-blain, looks like this was a really challenging bug to sort out!

phil-blain commented 1 year ago

Thanks! Yeah, OpenMP is tricky! It definitely did not help that the failures would disappear when compiling in debug mode!

phil-blain commented 1 year ago

A little recap with new suites (these are all with dynpicard and the new code):

phil-blain commented 1 year ago

And here are similar tests with EVP, with cmplog so that bfbcomp tests check log files instead of restarts, and with this change so that baseline compares also check against log files even if the restarts are bit-for-bit:

diff --git a/./configuration/scripts/tests/baseline.script b/./configuration/scripts/tests/baseline.script
index bb8f50a..82a770b 100644
--- a/./configuration/scripts/tests/baseline.script
+++ b/./configuration/scripts/tests/baseline.script
@@ -65,7 +65,7 @@ if (${ICE_BASECOM} != ${ICE_SPVAL}) then
     ${ICE_CASEDIR}/casescripts/comparebfb.csh ${base_dir} ${test_dir}
     set bfbstatus = $status

-    if ( ${bfbstatus} != 0 ) then
+    #if ( ${bfbstatus} != 0 ) then

       set test_file = `ls -1t ${ICE_RUNDIR}/cice.runlog* | head -1`
       set base_file = `ls -1t ${ICE_BASELINE}/${ICE_BASECOM}/${ICE_TESTNAME}/cice.runlog* | head -1`
@@ -97,7 +97,7 @@ if (${ICE_BASECOM} != ${ICE_SPVAL}) then
         endif
       endif

-    endif
+    #endif

   endif

So all in all the same behaviour as the Picard solver with respect to global sums.

phil-blain commented 1 year ago

I ran an EVP decomp suite pair with I_MPI_CBWR=0, bfbflag=off, on robert, on which nodes are exclusive [decomp-evp-no-cbwr-rstlog-robert-[12]].

The 3 complog tests that were failing on ppp6 PASS, which pretty much confirms my theory above about the non-reproducibility being due to job placement on different procs when nodes are shared.

phil-blain commented 1 year ago

I reran the decomp_suite with VP, once with bfbflag=ddpdd and once with bfbflag=lsum16. Both PASS. This is nice because lsum16 is probably a lot less of a performance hit than the other two :)