phil-blain opened this issue 3 years ago
$ ./results.csh |\grep MISS
MISS daley_intel_smoke_gx3_8x2x8x10x20_debug_droundrobin_dynpicard_maskhalo_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day missing-data
MISS daley_intel_smoke_gx3_1x6x25x29x16_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day missing-data
MISS daley_intel_smoke_gx3_1x8x30x20x32_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day missing-data
These are in fact cases where the model segfaulted/crashed, so it is the data for the case itself that is missing, not the data of the case we are comparing against.
EDIT 2022/05: reported in https://github.com/CICE-Consortium/CICE/issues/608 and subsequently fixed.
core file reveals diag[xy] contain NaNs (probably due to initialization) at cicecore/cicedynB/dynamics/ice_dyn_vp.F90:3517. From a quick look it seems the OpenMP directive is missing i, j, ij in its private variables (see the sketch after these notes).
core file reveals same as above (need to switch threads in the core with thread $num, see https://sourceware.org/gdb/onlinedocs/gdb/Threads.html; the right thread to use is the one stopped in ../sysdeps/unix/sysv/linux/x86_64/sigaction.c:62).
core file reveals same as above.
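To illustrate the effect of the missing PRIVATE clause, here is a minimal, self-contained sketch (not CICE code; it only mimics the index-list loop shape of precondition(), and assumes OpenMP is enabled with more than one thread):

program omp_private_demo
  implicit none
  integer, parameter :: n = 100000
  integer :: ij, i
  real(kind=8) :: a(n)
  a(:) = -1.0d0                  ! stand-in for "uninitialized" contents
  ! "i" is deliberately NOT listed as private, as in the buggy directive:
  ! all threads race on the single shared "i", so a(i) may use another
  ! thread's index; some elements get written twice and others never,
  ! keeping whatever value (garbage/NaN) they held before the loop.
  !$omp parallel do private(ij)
  do ij = 1, n
     i = ij
     a(i) = real(ij, kind=8)
  end do
  !$omp end parallel do
  print *, 'elements never written:', count(a < 0.0d0)   ! typically > 0 with several threads
end program omp_private_demo

Adding i (and the other inner work variables) to the PRIVATE clause, as in the diff further down, removes the race.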
$ ./results.csh |\grep FAIL | \grep ' run'
FAIL daley_intel_restart_gx1_64x1x16x16x10_dwghtfile_dynpicard run
FAIL daley_intel_restart_gx3_1x8x30x20x32_droundrobin_dynpicard run
FAIL daley_intel_smoke_gx1_64x1x16x16x10_debug_dwghtfile_dynpicard_run2day run -1 -1 -1
FAIL daley_intel_smoke_gbox180_16x1x6x6x60_debug_debugblocks_dspacecurve_dynpicard_run2day run -1 -1 -1
FAIL daley_intel_smoke_gx3_8x2x8x10x20_debug_droundrobin_dynpicard_maskhalo_run2day run -1 -1 -1
FAIL daley_intel_smoke_gx3_1x6x25x29x16_debug_droundrobin_dynpicard_run2day run -1 -1 -1
FAIL daley_intel_smoke_gx3_1x8x30x20x32_debug_droundrobin_dynpicard_run2day run -1 -1 -1
missing an input file, see https://github.com/CICE-Consortium/CICE/pull/602#issuecomment-860818756
"ERROR: bad departure points"
NaNs in gridbox_corners... probably not related to VP solver...
#4 0x0000000000de0fbc in ice_grid::gridbox_corners () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/infrastructure/ice_grid.F90:2219
#5 0x0000000000d76302 in ice_grid::init_grid2 () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/infrastructure/ice_grid.F90:570
#6 0x0000000000401b83 in cice_initmod::cice_init () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/drivers/standalone/cice/CICE_InitMod.F90:121
#7 0x00000000004019c7 in cice_initmod::cice_initialize () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/drivers/standalone/cice/CICE_InitMod.F90:52
#8 0x0000000000401671 in icemodel () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/drivers/standalone/cice/CICE.F90:43
#9 0x00000000004015f2 in main ()
#10 0x00000000021d930f in __libc_start_main (main=..., argc=..., argv=..., init=..., fini=..., rtld_fini=..., stack_end=...) at ../csu/libc-start.c:308
#11 0x00000000004014da in _start () at ../sysdeps/x86_64/start.S:120
(gdb) f 4
#4 0x0000000000de0fbc in ice_grid::gridbox_corners () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/infrastructure/ice_grid.F90:2219
2219 work_g2(:,:) = lont_bounds(icorner,:,:,iblk) + c360
The array lont_bounds(icorner,:,:,iblk) is all NaN.
EDIT the above was already reported in https://github.com/CICE-Consortium/CICE/issues/599#issue-865459873 ("Problems in ice_grid.F90"), on the gbox128 grid.
The other 3 cases are mentioned above in the MISS section.
$ ./results.csh |\grep FAIL | \grep ' test'
FAIL daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard test
FAIL daley_intel_restart_gx1_64x1x16x16x10_dwghtfile_dynpicard test
FAIL daley_intel_restart_gx3_20x2x5x4x30_dsectrobin_dynpicard_short test
FAIL daley_intel_restart_gx3_1x4x25x29x16_droundrobin_dynpicard test
FAIL daley_intel_restart_gx3_1x8x30x20x32_droundrobin_dynpicard test
FAIL daley_intel_restart_gx3_16x2x3x3x100_droundrobin_dynpicard test
FAIL daley_intel_smoke_gx1_64x1x16x16x10_debug_dwghtfile_dynpicard_run2day test
FAIL daley_intel_smoke_gbox180_16x1x6x6x60_debug_debugblocks_dspacecurve_dynpicard_run2day test
FAIL daley_intel_smoke_gx3_8x2x8x10x20_debug_droundrobin_dynpicard_maskhalo_run2day test
FAIL daley_intel_smoke_gx3_1x6x25x29x16_debug_droundrobin_dynpicard_run2day test
FAIL daley_intel_smoke_gx3_1x8x30x20x32_debug_droundrobin_dynpicard_run2day test
fails to restart exactly
fails to restart exactly
fails to restart exactly
fails to restart exactly
The other cases also fail to run (see above).
The other failing tests fail because they are not BFB.
I fixed the buggy OpenMP directive:
diff --git i/cicecore/cicedynB/dynamics/ice_dyn_vp.F90 w/cicecore/cicedynB/dynamics/ice_dyn_vp.F90
index 457a73a..367d29e 100644
--- i/cicecore/cicedynB/dynamics/ice_dyn_vp.F90
+++ w/cicecore/cicedynB/dynamics/ice_dyn_vp.F90
@@ -3507,7 +3507,7 @@ subroutine precondition(zetaD , &
wx = vx
wy = vy
elseif (precond_type == 'diag') then ! Jacobi preconditioner (diagonal)
- !$OMP PARALLEL DO PRIVATE(iblk)
+ !$OMP PARALLEL DO PRIVATE(iblk, ij, i, j)
do iblk = 1, nblocks
do ij = 1, icellu(iblk)
i = indxui(ij, iblk)
So let's do another round (~/cice-dirs/suites/vp-decomp-openmp-fix).
run segfaulted.
#8 0x00000000009ca4e4 in ice_dyn_vp::calc_l2norm_squared (nx_block=..., ny_block=..., icellu=..., indxui=..., indxuj=..., tpu=..., tpv=..., l2norm=...)
at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:2479
#9 0x0000000000a29749 in ice_dyn_vp::L_ice_dyn_vp_mp_pgmres__3290__par_loop4_2_56 ()
at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:3292
#10 0x0000000001f68883 in __kmp_invoke_microtask ()
#11 0x0000000001f12b2a in __kmp_invoke_task_func ()
#12 0x0000000001f143d6 in __kmp_fork_call ()
#13 0x0000000001edfb25 in __kmpc_fork_call ()
#14 0x0000000000a09894 in ice_dyn_vp::pgmres (zetad=..., cb=..., vrel=..., umassdti=..., bx=..., by=..., diagx=..., diagy=..., tolerance=..., maxinner=..., maxouter=...,
solx=..., soly=..., nbiter=...) at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:3290
#15 0x0000000000a35680 in ice_dyn_vp::precondition (zetad=..., cb=..., vrel=..., umassdti=..., vx=..., vy=..., diagx=..., diagy=..., precond_type=..., wx=..., wy=...,
.tmp.PRECOND_TYPE.len_V$69f7=...) at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:3528
#16 0x00000000009da5f8 in ice_dyn_vp::fgmres (zetad=..., cb=..., vrel=..., umassdti=..., halo_info_mask=..., bx=..., by=..., diagx=..., diagy=..., tolerance=..., maxinner=...,
maxouter=..., solx=..., soly=..., nbiter=...) at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:2858
#17 0x000000000094ab6e in ice_dyn_vp::anderson_solver (icellt=..., icellu=..., indxti=..., indxtj=..., indxui=..., indxuj=..., aiu=..., ntot=..., waterx=..., watery=...,
bxfix=..., byfix=..., umassdti=..., sol=..., fpresx=..., fpresy=..., zetad=..., cb=..., halo_info_mask=...)
at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:932
#18 0x0000000000914ba7 in ice_dyn_vp::implicit_solver (dt=...) at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:483
#19 0x00000000014c41d7 in ice_step_mod::step_dyn_horiz (dt=...) at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/general/ice_step_mod.F90:886
#20 0x000000000040cd41 in cice_runmod::ice_step () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/drivers/standalone/cice/CICE_RunMod.F90:284
#21 0x000000000040b849 in cice_runmod::cice_run () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/drivers/standalone/cice/CICE_RunMod.F90:83
#22 0x0000000000400cd0 in icemodel () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/drivers/standalone/cice/CICE.F90:49
#23 0x0000000000400c32 in main ()
#24 0x00000000020f7edf in __libc_start_main (main=..., argc=..., argv=..., init=..., fini=..., rtld_fini=..., stack_end=...) at ../csu/libc-start.c:308
#25 0x0000000000400b1a in _start () at ../sysdeps/x86_64/start.S:120
(gdb) f 8
#8 0x00000000009ca4e4 in ice_dyn_vp::calc_l2norm_squared (nx_block=..., ny_block=..., icellu=..., indxui=..., indxuj=..., tpu=..., tpv=..., l2norm=...)
at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice2/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:2479
Due to an out-of-range value in the norm computation (tpv is about -2.5e+218, so tpv**2 is about 6e+436, far beyond the largest representable double, about 1.8e+308):
(gdb) p tpu(i,j)
$3 = -1.4002696118736502e+97
(gdb) p tpv(i,j)
$4 = -2.4873702247927307e+218
(gdb) p tpv(i,j)**2
Cannot perform exponentiation: Numerical result out of range
(gdb) p tpu(i,j)**2
$5 = 1.9607549859367831e+194
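For scale, a tiny standalone check (not CICE code; tpv is the value printed by gdb above) of why the squaring goes out of range:

program overflow_demo
  implicit none
  real(kind=8) :: tpv
  tpv = -2.4873702247927307d218
  print *, 'huge(1.0d0) =', huge(1.0d0)   ! about 1.8e+308
  ! tpv**2 is about 6.2e+436, beyond the double-precision range, so the
  ! squaring overflows: without trapping it silently becomes +Infinity,
  ! while a build that traps floating-point exceptions (e.g. -fpe0) aborts.
  print *, 'tpv**2      =', tpv**2
end program overflow_demo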
same as above
./results.csh |\grep FAIL | \grep ' run'
FAIL daley_intel_restart_gx1_64x1x16x16x10_dwghtfile_dynpicard run
FAIL daley_intel_restart_gx3_1x4x25x29x16_droundrobin_dynpicard run
FAIL daley_intel_restart_gx3_1x8x30x20x32_droundrobin_dynpicard run
FAIL daley_intel_smoke_gx1_64x1x16x16x10_debug_dwghtfile_dynpicard_run2day run -1 -1 -1
FAIL daley_intel_smoke_gbox180_16x1x6x6x60_debug_debugblocks_dspacecurve_dynpicard_run2day run -1 -1 -1
FAIL daley_intel_smoke_gx3_1x6x25x29x16_debug_droundrobin_dynpicard_run2day run -1 -1 -1
FAIL daley_intel_smoke_gx3_1x8x30x20x32_debug_droundrobin_dynpicard_run2day run -1 -1 -1
same as above (missing file)
(horizontal_remap)ERROR: bad departure points
The next two are the same as in the first suite above. The last two are those that segfaulted (see just above).
Note: re-running the same suite a second time leads to different results:
This really suggests that there is some non-reproducibility in the code...
Just a quick update. I'm playing with the OpenMP in the entire code and tested evp, eap, and vp. I can also confirm that running different thread counts with vp produces different answers.
If "re-running the same suite a second time leads to different results", that suggests the code is not bit-for-bit reproducible when rerun? I tried to test that and, for my quick tests, the same run does seem to be reproducible. It's a little too bad, because that would be an easier problem to debug.
I also tested a 32x1x16x16x16 and a 64x1x16x16x16 case and they are not bit-for-bit. Same decomp, no OpenMP, just a different block distribution. If I get a chance, I will try to look into this more. At this point, I will probably defer further OpenMP optimization with vp. I think there are several tasks to do
Hi Tony, thanks for these details and tests. This issue is definitely still on my list; I hope I'll have time to go back to the VP solver this winter/early spring.
I'll take a look at the PR when you submit it.
I'm finally going back to this. I've re-run the decomp_suite with -s dynpicard on our 2 machines, testing the latest main. I've run the suite twice on each machine, and get the same results on the same machine, and across machines, modulo -init=snan,arrays (which is inactive on daley but active on banting), and modulo the exact failure mode. So at least that's that.
Summary:
$ ./results.csh |tail -5
203 measured results of 203 total results
157 of 203 tests PASSED
0 of 203 tests PENDING
0 of 203 tests MISSING data
46 of 203 tests FAILED
$ ./results.csh |\grep FAIL|\grep ' run'
FAIL daley_intel_restart_gx3_6x2x50x58x1_droundrobin_dynpicard run
FAIL daley_intel_smoke_gx3_6x2x50x58x1_debug_droundrobin_dynpicard_run2day run -1 -1 -1
On banting I also get banting_intel_smoke_gbox180_16x1x6x6x60_debug_debugblocks_dspacecurve_dynpicard_run2day, but this is due to https://github.com/CICE-Consortium/CICE/issues/599#issue-865459873 ("Problems in ice_grid.F90").
fails at the first run of the test with SIGILL or SIGSEGV (it varies). Usually no core is produced (I got a core once, but it was of limited use since this case is not compiled in debug mode; when I recompiled in debug mode I did not get a core...):
forrtl: severe (168): Program Exception - illegal instruction
Image PC Routine Line Source
cice 0000000001130484 Unknown Unknown Unknown
cice 00000000009C4700 Unknown Unknown Unknown
Unknown 00002AAAAFFF5F8B Unknown Unknown Unknown
[NID 00724] 2022-05-12 16:13:50 Apid 16619909: initiated application termination
fails with SIGSEGV, got a core on one machine but not the other.
Finished writing ./history/iceh_ic.2005-01-01-03600.nc
*** stack smashing detected ***: <unknown> terminated
forrtl: error (65): floating invalid
Image PC Routine Line Source
cice 0000000002650764 Unknown Unknown Unknown
cice 0000000001EE45C0 Unknown Unknown Unknown
cice 0000000001ADBA36 ice_transport_rem 3473 ice_transport_remap.F90
cice 00000000019BF6B6 ice_transport_rem 755 ice_transport_remap.F90
cice 00000000023BF723 Unknown Unknown Unknown
cice 00000000023699CA Unknown Unknown Unknown
cice 000000000236B276 Unknown Unknown Unknown
cice 00000000023369C5 Unknown Unknown Unknown
cice 00000000019A4C79 ice_transport_rem 642 ice_transport_remap.F90
cice 000000000192A728 ice_transport_dri 553 ice_transport_driver.F90
cice 00000000018B5E0C ice_step_mod_mp_s 959 ice_step_mod.F90
cice 000000000040ED7F cice_runmod_mp_ic 285 CICE_RunMod.F90
cice 000000000040D7C1 cice_runmod_mp_ci 85 CICE_RunMod.F90
cice 0000000000401690 MAIN__ 49 CICE.F90
cice 00000000004015F2 Unknown Unknown Unknown
cice 000000000273228F Unknown Unknown Unknown
cice 00000000004014DA Unknown Unknown Unknown
_pmiu_daemon(SIGCHLD): [NID 01334] [c6-0c2s13n2] [Thu May 12 16:31:38 2022] PE RANK 0 exit signal Aborted
"stack smashing detected" that I've never seen. Here is the backtrace:
(gdb) bt
#0 0x0000000001a05d33 in ice_transport_remap::locate_triangles (nx_block=52, ny_block=60, ilo=2, ihi=51, jlo=2, jhi=59, nghost=1, edge=..., icells=..., indxi=..., indxj=..., dpx=..., dpy=..., dxu=...,
dyu=..., xp=..., yp=..., iflux=..., jflux=..., triarea=..., l_fixed_area=.FALSE., edgearea=..., .tmp.EDGE.len_V$2530=80)
at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice3/cicecore/cicedynB/dynamics/ice_transport_remap.F90:1891
#1 0x00000000019bb5d7 in horizontal_remap::L_ice_transport_remap_mp_horizontal_remap__642__par_loop2_2_6 ()
at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice3/cicecore/cicedynB/dynamics/ice_transport_remap.F90:715
#2 0x00000000023bf723 in __kmp_invoke_microtask ()
#3 0x00000000023699ca in __kmp_invoke_task_func ()
#4 0x000000000236b276 in __kmp_fork_call ()
#5 0x00000000023369c5 in __kmpc_fork_call ()
#6 0x00000000019a4c79 in ice_transport_remap::horizontal_remap (dt=3600, ntrace=26, uvel=..., vvel=..., mm=..., tm=..., l_fixed_area=.FALSE., tracer_type=..., depend=..., has_dependents=...,
integral_order=3, l_dp_midpt=.TRUE., grid_ice=..., uvele=..., vveln=..., .tmp.GRID_ICE.len_V$6dc=256) at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice3/cicecore/cicedynB/dynamics/ice_transport_remap.F90:642
#7 0x000000000192a728 in ice_transport_driver::transport_remap (dt=3600) at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice3/cicecore/cicedynB/dynamics/ice_transport_driver.F90:553
#8 0x00000000018b5e0c in ice_step_mod::step_dyn_horiz (dt=3600) at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice3/cicecore/cicedynB/general/ice_step_mod.F90:959
#9 0x000000000040ed7f in cice_runmod::ice_step () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice3/cicecore/drivers/standalone/cice/CICE_RunMod.F90:285
#10 0x000000000040d7c1 in cice_runmod::cice_run () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice3/cicecore/drivers/standalone/cice/CICE_RunMod.F90:85
#11 0x0000000000401690 in icemodel () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice3/cicecore/drivers/standalone/cice/CICE.F90:49
$ ./results.csh |\grep FAIL|\grep ' test'
FAIL daley_intel_restart_gx3_6x2x50x58x1_droundrobin_dynpicard test
FAIL daley_intel_smoke_gx3_6x2x50x58x1_debug_droundrobin_dynpicard_run2day test
which are the same as above (if "run" fails, "test" fails)
First, the decomp tests themselves:
$ ./results.csh |\grep FAIL | \grep daley_intel_decomp
FAIL daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_slenderX1 bfbcomp daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_squarepop
FAIL daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_roundrobin bfbcomp daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_squarepop
FAIL daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_sectcart bfbcomp daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_squarepop
FAIL daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_sectrobin bfbcomp daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_squarepop
FAIL daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_spacecurve bfbcomp daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_squarepop
FAIL daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_rakeX1 bfbcomp daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_squarepop
and then the rest:
$ ./results.csh |\grep FAIL | \grep different-data
FAIL daley_intel_restart_gx3_1x1x50x58x4_droundrobin_dynpicard_thread bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_4x1x25x116x1_dslenderX1_dynpicard_thread bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_6x2x4x29x18_dspacecurve_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_8x2x8x10x20_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_5x2x33x23x4_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_4x2x19x19x10_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_20x2x5x4x30_dsectrobin_dynpicard_short bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_16x2x5x10x20_drakeX2_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_8x2x8x10x20_droundrobin_dynpicard_maskhalo bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_1x4x25x29x16_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_1x8x30x20x32_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_1x1x120x125x1_droundrobin_dynpicard_thread bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_16x2x1x1x800_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_16x2x2x2x200_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_16x2x3x3x100_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_16x2x8x8x80_dspiralcenter_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_10x1x10x29x4_dsquarepop_dynpicard_thread bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_8x1x25x29x4_drakeX2_dynpicard_thread bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_smoke_gx3_1x1x25x58x8_debug_droundrobin_dynpicard_run2day_thread bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_20x1x5x116x1_debug_dslenderX1_dynpicard_run2day_thread bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_6x2x4x29x18_debug_dspacecurve_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_8x2x10x12x16_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_5x2x33x23x4_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_4x2x19x19x10_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_20x2x5x4x30_debug_dsectrobin_dynpicard_run2day_short bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_16x2x5x10x20_debug_drakeX2_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_8x2x8x10x20_debug_droundrobin_dynpicard_maskhalo_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_1x6x25x29x16_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_1x8x30x20x32_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_1x1x120x125x1_debug_droundrobin_dynpicard_run2day_thread bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_16x2x1x1x800_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_16x2x2x2x200_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_16x2x3x3x100_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_16x2x8x8x80_debug_dspiralcenter_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_10x1x10x29x4_debug_dsquarepop_dynpicard_run2day_thread bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_8x1x25x29x4_debug_drakeX2_dynpicard_run2day_thread bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
I tried to run it in DDT after recompiling in debug mode. It's not obvious where it fails as it's inside an OpenMP loop, and I'm not sure how to use DDT with OpenMP... I'm experimenting.
I tried to recompile with the DDT memory debugging library. That was fun. See end of this post.
Running with the memory debugging library, however, hides the segfault and the code runs correctly ...
On the XC-50, all executables use static linking, so it's not possible for DDT to preload its memory debugging library; you have to relink your executable with DDT's memory debugging library. Instructions can be found in the "Static Linking" section of the Arm Forge user guide (section 12.4.1). The exact linking flags to use vary depending on whether your program is multithreaded and whether it uses C++.
For a non-multithreaded, non C++ program : -Wl,-allow-multiple-definition,-undefined=malloc /path/to/ddt/lib/64/libdmalloc.a
For multithreaded (OpenMP), non C++ program : -Wl,-wrap=dlopen, -wrap=dlclose,-allow-multiple-definition,-undefined=malloc /path/to/ddt/lib/64/libdmallocth.a
For a non-multithreaded C++ program : -Wl,-allow-multiple-definition,-undefined=malloc,-undefined=_ZdaPv /path/to/ddt/lib/64/libdmallocxx.a
For a multithreaded C++ program : -Wl,-wrap=dlopen,-wrap=dlclose,-allow-multiple-definition,-undefined=malloc,-undefined=_ZdaPv /path/to/ddt/lib/64/libdmallocthcxx.a
Note: Adding --wrap=dlopen,--wrap=dlclose for threaded programs makes the link fail, at least with the Intel Fortran compiler.
This is normal, as ARM explains in the "Compiler notes and known issues" for the Intel Compilers section:
If you are compiling static binaries, linking on a Cray XT/XE machine in the Arm DDT memory debugging library is not straightforward for F90 applications. You must manually rerun the last ld command (as seen with ifort -v) to include -L{ddt-path}/lib/64 -ldmalloc in two locations:
- Include immediately prior to where -lc is located.
- Include the -zmuldefs option at the start of the ld line.
This is not easy to understand as the wording is weird. A few notes:
- We use the Cray wrapper ftn to compile and link, not the Intel compiler driver ifort. Replace ftn with ftn -v at link time to see the full ld invocation.
- They mention including -L{ddt-path}/lib/64 -ldmalloc, but:
  - In the "Static Linking" section mentioned above, they recommend not using -L along with -l, and instead linking directly to the static library using its full path.
  - In the "Static Linking" section mentioned above, they specify different libraries to use depending on multithreadedness and usage of C++, whereas here they only mention libdmalloc.
- The two bullets are supposed to be two locations where -L{ddt-path}/lib/64 -ldmalloc should be included, but the second bullet is not a location; it is a separate recommendation to also add -zmuldefs as the first argument to ld.
  - In the "Static Linking" section mentioned above, they recommend adding, amongst other flags, -allow-multiple-definition, which is exactly the same thing as -zmuldefs.
  - The location of this argument in the ld invocation does not seem to matter, so we do not need to add it a second time.
So in the end it's not at all clear where the other location is!
In practice, -lc appears twice in the ld invocation, and it suffices to add /opt/forge/20.0.1/lib/64/libdmallocth.a (or one of the other 3 libraries) immediately before the first -lc for the link to succeed.
OK, I tested the failing tests above without dynpicard and they failed in the same way. This was due to OMP_STACKSIZE being unset and the newly active OpenMP directive in ice_transport_remap.F90 since d1e972a7 (Update OMP (#680), 2022-02-18). Uncommenting the variable in the machine file (the comment was added in 8c23df8f (- Update version and copyright. (#691), 2022-02-23)) makes both tests pass.
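As background on the mechanism, a minimal sketch (not the CICE code): large automatic or private arrays used inside an OpenMP parallel region live on each thread's stack, whose size is controlled by OMP_STACKSIZE, so leaving it unset/small can crash otherwise-correct threaded code.

program omp_stacksize_demo
  implicit none
  integer :: k
  !$omp parallel do private(k)
  do k = 1, 8
     call work(k)
  end do
  !$omp end parallel do
contains
  subroutine work(k)
    integer, intent(in) :: k
    ! ~32 MB automatic array, allocated on the calling thread's stack:
    ! worker threads are limited by OMP_STACKSIZE (often only a few MB by
    ! default), the master thread by "ulimit -s", so this typically dies
    ! with SIGSEGV unless both are large enough (e.g. OMP_STACKSIZE=64M).
    real(kind=8) :: scratch(2000, 2000)
    scratch = real(k, kind=8)
    if (scratch(1,1) < 0.0d0) print *, 'unreachable'
  end subroutine work
end program omp_stacksize_demo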
OK, so with the little code modifications mentioned in https://github.com/phil-blain/CICE/issues/40#issuecomment-1175467783, which I will push tomorrow, the decomp_suite passes [1] with dynpicard when also adding these settings:
precond = diag # or ident, not yet tested
bfbflag = reprosum # maybe works with ddpdd and lsum16, not yet tested.
This is very encouraging as it shows not only that the OpenMP implementation is OK, but also that we did not "miss" anything MPI-related (like halo updates, etc) in the VP implementation.
EDIT I forgot the end note:
[1] I do have one failure, ppp6_intel_restart_gx3_16x2x1x1x800_droundrobin_dynpicard, but this test is a known failure even with EVP on our new machines (cf. https://gitlab.science.gc.ca/hpc/hpcr_upgrade_2/issues/244 [internal]). This is most likely a bug in Intel MPI.
OK, unsurprisingly it also passes with ident.
However, I have 3 failures with precond=pgmres:
$ ./results.csh | \grep FAIL
FAIL ppp6_intel_smoke_gx3_6x2x50x58x1_debug_droundrobin_dynpicard_reprosum_run2day run -1 -1 -1
FAIL ppp6_intel_smoke_gx3_6x2x50x58x1_debug_droundrobin_dynpicard_reprosum_run2day test
FAIL ppp6_intel_smoke_gx3_20x2x5x4x30_debug_dsectrobin_dynpicard_reprosum_run2day_short run -1 -1 -1
FAIL ppp6_intel_smoke_gx3_20x2x5x4x30_debug_dsectrobin_dynpicard_reprosum_run2day_short test
FAIL ppp6_intel_smoke_gx3_10x1x10x29x4_debug_dsquarepop_dynpicard_reprosum_run2day_thread run -1 -1 -1
FAIL ppp6_intel_smoke_gx3_10x1x10x29x4_debug_dsquarepop_dynpicard_reprosum_run2day_thread test
Finished writing ./history/iceh_ic.2005-01-01-03600.nc
forrtl: severe (408): fort: (3): Subscript #1 of the array I8_ARR_TLSUM_LEVEL has value -42107522 which is less than the lower bound of 0
Abort(594434) on node 4 (rank 4 in comm 0): Fatal error in PMPI_Allreduce: Invalid count, error stack:
PMPI_Allreduce(433): MPI_Allreduce(sbuf=0x7ffd8720efc0, rbuf=0x7ffd8720f1e0, count=-42107517, datatype=dtype=0x4c000831, op=MPI_SUM, comm=comm=0x84000004) failed
PMPI_Allreduce(375): Negative count, value is -42107517
Abort(403247618) on node 5 (rank 5 in comm 0): Fatal error in PMPI_Allreduce: Invalid count, error stack:
PMPI_Allreduce(433): MPI_Allreduce(sbuf=0x7ffe05282140, rbuf=0x7ffe05282360, count=-42107517, datatype=dtype=0x4c000831, op=MPI_SUM, comm=comm=0x84000004) failed
PMPI_Allreduce(375): Negative count, value is -42107517
No core.
Running in DDT reveals the failure is here, line 963: ilevel is a large negative value (so probably uninitialized)...
EDIT reading the code, it seems impossible for ilevel to be uninitialized. So it seems the error is somewhere else, and it manages to corrupt things here.
EDIT2 The MPI_ALLREDUCE calls on procs 4 and 5 abort because the count is negative. This count is veclth and is calculated here:
https://github.com/phil-blain/CICE/blob/bce31c2f85da934f0778f5adf938800d1977521a/cicecore/cicedynB/infrastructure/comm/mpi/ice_reprosum.F90#L904
max_levels(nflds) is negative. It is computed here in ice_reprosum_calc:
https://github.com/phil-blain/CICE/blob/bce31c2f85da934f0778f5adf938800d1977521a/cicecore/cicedynB/infrastructure/comm/mpi/ice_reprosum.F90#L659-L680
just before calling ice_reprosum_int.
digits(0_i8) is 63 (checked with a simple program), arr_gmax_exp(1) is 2147483647, arr_gmin_exp(1) is -41, so I think this:
https://github.com/phil-blain/CICE/blob/bce31c2f85da934f0778f5adf938800d1977521a/cicecore/cicedynB/infrastructure/comm/mpi/ice_reprosum.F90#L667-L669
overflows... I'm not sure though whether it's normal for arr_gmax_exp(1) to be that big...
It ultimately comes from here: https://github.com/phil-blain/CICE/blob/bce31c2f85da934f0778f5adf938800d1977521a/cicecore/cicedynB/infrastructure/comm/mpi/ice_reprosum.F90#L589-L594
MINEXPONENT(1._r8) is -1021... to be continued...
OK, so it is arr_exp = exponent(arr(isum,ifld)) which gives 2147483647 (the highest possible value for a 32-bit integer) when given -nan as input.
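A minimal, self-contained check of that behaviour (not the ice_reprosum code; the -41 is just the arr_gmin_exp value seen in the debugger above):

program exponent_nan_demo
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan
  implicit none
  real(kind=8) :: x
  integer :: gmax_exp, gmin_exp
  x = ieee_value(0.0d0, ieee_quiet_nan)
  gmax_exp = exponent(x)    ! HUGE(0) = 2147483647 for a NaN (or Inf) argument
  gmin_exp = -41            ! arr_gmin_exp(1) as observed in the debugger
  print *, 'exponent(NaN)       =', gmax_exp
  ! any "gmax_exp - gmin_exp"-style difference then overflows the default
  ! 32-bit integer and wraps to a large negative value, consistent with the
  ! negative level counts / MPI_Allreduce counts seen in the aborts above:
  print *, 'gmax_exp - gmin_exp =', gmax_exp - gmin_exp
end program exponent_nan_demo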
Note that I can't print arr(isum,ifld) from inside the loop (inside the OpenMP region); I had to go up the stack for DDT and GDB to be able to print it, or else I would get "no such vector element".
Similar to the above:
Finished writing ./history/iceh_ic.2005-01-01-03600.nc
forrtl: severe (408): fort: (3): Subscript #1 of the array I8_ARR_TLSUM_LEVEL has value -41297762 which is less than the lower bound of 0
forrtl: error (76): Abort trap signal
Abort(269029890) on node 18 (rank 18 in comm 0): Fatal error in PMPI_Allreduce: Invalid count, error stack:
PMPI_Allreduce(433): MPI_Allreduce(sbuf=0x7ffe38904cc0, rbuf=0x7ffe38904ee0, count=-41297756, datatype=dtype=0x4c000831, op=MPI_SUM, comm=comm=0x84000004) failed
PMPI_Allreduce(375): Negative count, value is -41297756
Core is truncated.
(JRA55_data) reading forcing file 1st ts = /space/hall6/sitestore/eccc/cmd/e/sice500//CICE_data/forcing/gx3/JRA55/8XDAILY/JRA55_gx3_03hr_forcing_2005.nc
[compute:33089:0:33089] Caught signal 8 (Floating point exception: floating-point invalid operation)
==== backtrace (tid: 33089) ====
0 0x0000000000012b20 .annobin_sigaction.c() sigaction.c:0
1 0x0000000000f16ec9 ice_global_reductions_mp_global_sum_prod_dbl_() /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/infrastructure/comm/mpi/ice_global_reductions.F90:895
2 0x0000000000b3d728 ice_dyn_vp_mp_anderson_solver_() /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:903
3 0x0000000000b0b11c ice_dyn_vp_mp_implicit_solver_() /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:475
4 0x00000000018744ff ice_step_mod_mp_step_dyn_horiz_() /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/general/ice_step_mod.F90:950
5 0x000000000041b610 cice_runmod_mp_ice_step_() /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/drivers/standalone/cice/CICE_RunMod.F90:285
6 0x000000000041a055 cice_runmod_mp_cice_run_() /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/drivers/standalone/cice/CICE_RunMod.F90:85
7 0x000000000040e040 MAIN__() /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/drivers/standalone/cice/CICE.F90:49
8 0x000000000040dfa2 main() ???:0
9 0x00000000000237b3 __libc_start_main() ???:0
10 0x000000000040deae _start() ???:0
=================================
forrtl: error (75): floating point exception
Image PC Routine Line Source
cice 0000000001E9129B Unknown Unknown Unknown
libpthread-2.28.s 00001501CE9A3B20 Unknown Unknown Unknown
cice 0000000000F16EC9 ice_global_reduct 895 ice_global_reductions.F90
cice 0000000000B3D728 ice_dyn_vp_mp_and 903 ice_dyn_vp.F90
cice 0000000000B0B11C ice_dyn_vp_mp_imp 475 ice_dyn_vp.F90
cice 00000000018744FF ice_step_mod_mp_s 950 ice_step_mod.F90
cice 000000000041B610 cice_runmod_mp_ic 285 CICE_RunMod.F90
cice 000000000041A055 cice_runmod_mp_ci 85 CICE_RunMod.F90
cice 000000000040E040 MAIN__ 49 CICE.F90
cice 000000000040DFA2 Unknown Unknown Unknown
libc-2.28.so 00001501CE1D07B3 __libc_start_main Unknown Unknown
cice 000000000040DEAE Unknown Unknown Unknown
Only a single core file (!); it is fortunately usable:
(gdb) bt
#0 0x00001501ce1e47ff in raise () from /lib64/libc.so.6
#1 0x00001501ce1cecfe in abort () from /lib64/libc.so.6
#2 0x0000000001e8b690 in for.issue_diagnostic ()
#3 0x0000000001e9129b in for.signal_handler ()
#4 <signal handler called>
#5 0x0000000000f16ec9 in ice_global_reductions::global_sum_prod_dbl (array1=..., array2=..., dist=..., field_loc=2, mmask=<error reading variable: Location address is not set.>,
lmask=<error reading variable: Location address is not set.>) at /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/infrastructure/comm/mpi/ice_global_reductions.F90:895
#6 0x0000000000b3d728 in ice_dyn_vp::anderson_solver (icellt=..., icellu=..., indxti=..., indxtj=..., indxui=..., indxuj=..., aiu=..., ntot=436, uocn=..., vocn=..., waterxu=..., wateryu=..., bxfix=...,
byfix=..., umassdti=..., sol=..., fpresx=..., fpresy=..., zetax2=..., etax2=..., rep_prs=..., cb=..., halo_info_mask=...)
at /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:903
#7 0x0000000000b0b11c in ice_dyn_vp::implicit_solver (dt=3600) at /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/dynamics/ice_dyn_vp.F90:475
#8 0x00000000018744ff in ice_step_mod::step_dyn_horiz (dt=3600) at /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/general/ice_step_mod.F90:950
#9 0x000000000041b610 in cice_runmod::ice_step () at /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/drivers/standalone/cice/CICE_RunMod.F90:285
#10 0x000000000041a055 in cice_runmod::cice_run () at /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/drivers/standalone/cice/CICE_RunMod.F90:85
#11 0x000000000040e040 in icemodel () at /fs/homeu2/eccc/cmd/cmde/phb001/code/cice/cicecore/drivers/standalone/cice/CICE.F90:49
(gdb) p i
$1 = 8
(gdb) p j
$2 = 12
(gdb) p iblock
$3 = 2
(gdb) p array1(i,j,iblock)
$4 = nan(0x7baddadbaddad)
(gdb) p array2(i,j,iblock)
$5 = nan(0x7baddadbaddad)
This is when computing the norm of the residual vector (Fx,Fy) just after it's been computed, so it's a bit mysterious... (well, not too much, since the global sum is over all points, whereas before we were summing only ice points, using icellu, indxui, indxuj)...
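A self-contained illustration of that last point (not CICE code; it just contrasts a sum over a packed index list of "ice" points with a reduction over the whole array when the never-written points hold NaNs):

program masked_vs_full_sum
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan
  implicit none
  integer, parameter :: n = 16
  real(kind=8) :: f(n), s_masked
  integer :: indx(3), ij
  f(:) = ieee_value(0.0d0, ieee_quiet_nan)   ! "never written" points stay NaN ...
  indx = [2, 5, 9]                           ! ... only these "ice" points are set
  f(indx) = 1.0d0
  s_masked = 0.0d0
  do ij = 1, size(indx)
     s_masked = s_masked + f(indx(ij))**2    ! old style: packed index list only
  end do
  print *, 'sum over ice points only:', s_masked   ! 3.0
  print *, 'sum over the whole array:', sum(f*f)   ! NaN
end program masked_vs_full_sum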
EDIT what is mysterious is that I should get this failure also with precond=diag and ident...
So I reran with diag and got the same 3 failures. So I'm not sure what I did when I wrote https://github.com/phil-blain/CICE/issues/39#issuecomment-1175553653, but I was mixed up...
EDIT I did rebase though...
$ git logo origin/repro-vp..upstream/main
21bd95b cice.setup: remove 'suite.jobs' at start of 'suite.submit' (#731) Philippe Blain (Fri Jul 15 10:43) (upstream/main, upstream/HEAD)
1585c31 Add unit test for optional arguments, "optargs" (#730) Tony Craig (Fri Jul 15 07:43)
d088bfb Update some CICE variable names to clarify grid (#729) Tony Craig (Fri Jul 15 07:42)
471c010 add Cgrid-related fixes for nuopc/cmeps (#728) Denise Worthen (Thu Jun 23 14:47)
EDIT2 I see the same failure when running with diag on the version of my branch before the rebase. So that's not it.
OK, so in the end all 3 failures are at the same place, where we compute the norm of (Fx,Fy). These are the only variables given to global_sum_prod that were not initialized to zero beforehand in the code.
It's still weird that I would get the failure only in certain decompositions though...
EDIT not that weird, since it was using uninitialized values, so anything can happen...
If I initialize (Fx,Fy) to 0, it fixes those errors, but then I ran the suite again from scratch and got some new failures (MPI aborts, bad departure points, etc.)
EDIT here are the failures (suite: decomp-vp-repro-init-fxy):
$ sgrep FAIL results.log
FAIL ppp6_intel_restart_gbox180_16x1x6x6x60_debugblocks_dspacecurve_dynpicard_reprosum run
FAIL ppp6_intel_restart_gbox180_16x1x6x6x60_debugblocks_dspacecurve_dynpicard_reprosum test
FAIL ppp6_intel_decomp_gx3_4x2x25x29x5_dynpicard_reprosum_slenderX1 bfbcomp ppp6_intel_decomp_gx3_4x2x25x29x5_dynpicard_reprosum_squarepop
FAIL ppp6_intel_decomp_gx3_4x2x25x29x5_dynpicard_reprosum_roundrobin bfbcomp ppp6_intel_decomp_gx3_4x2x25x29x5_dynpicard_reprosum_squarepop
FAIL ppp6_intel_restart_gx3_5x2x33x23x4_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_droundrobin_dynpicard_reprosum run
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_droundrobin_dynpicard_reprosum test
ppp6_intel_restart_gbox180_16x1x6x6x60_debugblocks_dspacecurve_dynpicard_reprosum: MPI abort at time step 121:
Restart read/written 120 20050106
Abort(17) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Waitall: See the MPI_ERROR field in MPI_Status for the error code
I re-ran the test 3 other times and they all passed.
ppp6_intel_restart_gx3_16x2x3x3x100_droundrobin_dynpicard_reprosum: bad departure point at time step 48:
Warning: Departure points out of bounds in remap
my_task, i, j = 13 2 3
dpx, dpy = -97023.9228738988 214611.660265582
HTN(i,j), HTN(i+1,j) = 213803.742672313 214293.709816433
HTE(i,j), HTE(i,j+1) = 144794.922414856 143905.700041556
istep1, my_task, iblk = 58 13 58
Global block: 1155
Global i and j: 97 101
I re-ran the test 4 other times, and got "bad departure point" a second time, but at time step 2:
istep1: 2 idate: 20050101 sec: 7200
(JRA55_data) reading forcing file 1st ts = /space/hall6/sitestore/eccc/cmd/e/sice500//CICE_data/forcing/gx3/JRA55/8XDAILY/JRA55_gx3_03hr_forcing_2005.nc
Warning: Departure points out of bounds in remap
my_task, i, j = 15 3 2
dpx, dpy = -127523.129089799 -25378.2381045416
HTN(i,j), HTN(i+1,j) = 116473.775950696 114412.135485485
HTE(i,j), HTE(i,j+1) = 163630.115888484 165329.256508164
istep1, my_task, iblk = 2 15 62
Global block: 1228
Global i and j: 11 109
(abort_ice)ABORTED:
(abort_ice) error = (horizontal_remap)ERROR: bad departure points
I then ran a new decomp suite twice (one to generate baseline, one to compare with the baseline) and got some differences between both. So it points to some non-reproducibility even on the same number of procs...
EDIT I did that (that = run the decomp suite twice, once with bgen and once with bcmp) first with my new code, suites: decomp-vp-repro-rerun-[12].
In decomp-vp-repro-rerun-1 (bgen) I get:
$ sgrep FAIL results.log
FAIL ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_slenderX1 bfbcomp ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_squarepop
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
So:
Note: I did also get "bad departure points" twice in ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum, on the first run, at time steps 57 and 107. Then I reran the test and both runs succeeded, although the restart test failed as indicated above.
In decomp-vp-repro-rerun-2 (bcmp) I get:
So:
restart_gx3_16x2x1x1x800, which is known to fail as mentioned in the end note at https://github.com/phil-blain/CICE/issues/39#issuecomment-1175553653).
This prompted me to run 2 decomp suites (bgen/bcmp) on main, i.e. without my changes but with dynpicard, and I still got non-reproducible results [suites: decomp-at-21bd95b-[23]].
In addition, in decomp-at-21bd95b-3, I also got an MPI abort at time step 121 for case ppp5_intel_restart_gbox180_16x1x6x6x60_debugblocks_diag1_dspacecurve_dynpicard:
istep1: 121 idate: 20050106 sec: 3600
Abort(17) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Waitall: See the MPI_ERROR field in MPI_Status for the error code
Note that this is the same test as above in decomp-vp-repro-init-fxy.
When I re-submitted the test, both runs (it's a restart test) ran correctly to completion....
I get the same results (bgen then bcmp from the same commit on main, with dynpicard giving different results) with the nothread_suite, which does not have any OpenMP test cases (modulo those with the OpenMP compile flag active but only one thread actually used for the case). This is for the code from main, i.e. even before my changes. EDIT suites: nothread-at-21bd95b-[34].
I then ran the same suite with a (self-compiled) OpenMPI instead of Intel MPI and it seems that I do not get any of these errors (still on main and with dynpicard) [EDIT suites: nothread-at-21bd95b-ompi-tm[-2]]. I'll repeat these tests, but it does point to something with Intel MPI...
OK, so I dug a bit into this and found this Intel MPI variable: I_MPI_CBWR=1.
This disables "topology-aware collectives" (i.e. MPI_ALLREDUCE and friends) and makes sure that re-running the same code on the same number of procs on the same machine leads to reproducible results. See:
Apparently OpenMPI does that out of the box, and it seems Cray MPT does too, at least under the circumstances under which we were running on daley/banting (exclusive nodes). I did find some references to MPICH_ALLREDUCE_NO_SMP and MPICH_REDUCE_NO_SMP at https://gitlab.science.gc.ca/hpc_migrations/hpcr_upgrade_1/wikis/getting-started/compiler_notes#xc50-daleybanting [internal] and in a very few places on the web and on GitHub; these are (despite the name!) specific to Cray MPT. They are apparently documented in man mpi or man intro_mpi on the CLE.
I ran 2 nothread suites with intel + Intel MPI + I_MPI_CBWR=1 at essentially 21bd95b (main with https://github.com/CICE-Consortium/CICE/pull/745 on top) and both runs passed the baseline compare for all tests [EDIT suites: nothread-at-21bd95b-impi-cbwr[-2]]. So it seems the variable indeed works.
Thanks @phil-blain, that's some rough debugging. Yuck.
Do we understand why the dynpicard is particularly susceptible? Why don't we see this with some other configurations?
Because dynpicard uses global sums (MPI_ALLREDUCE) in its algorithm, whereas the rest of the code only uses them for diagnostics. And the base_suite only does cmprest (bit4bit comparisons of restarts) and not cmplog (b4b comparison of logs), and I only ever ran the base_suite on our previous Cray machines.
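For reference, the underlying reason the summation order matters at all is that floating-point addition is not associative, so grouping the partial sums differently (a different decomposition, or a different MPI_ALLREDUCE reduction tree) can change the low-order bits of the result. A trivial standalone illustration:

program nonassoc_demo
  implicit none
  real(kind=8) :: a, b, c
  a = 1.0d0
  b = 1.0d-16
  c = 1.0d-16
  print *, '(a + b) + c =', (a + b) + c   ! 1.0000000000000000
  print *, 'a + (b + c) =', a + (b + c)   ! 1.0000000000000002
end program nonassoc_demo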
I'll walk my steps backwards from here, I think I got to the bottom of it now.
OK. I hope MPI Reductions are bit-for-bit for the same pe count / decomposition. You are finding that to be true, correct? Just to clarify, are you just seeing different results with different pe counts/decompositions? Is the global reduction in dynpicard using the internal CICE global sum method yet?
OK. I hope MPI Reductions are bit-for-bit for the same pe count / decomposition. You are finding that to be true, correct?
Not for Intel MPI, no, unless I set this I_MPI_CBWR=1 variable. With this environment variable, results are reproducible on the same pe count + decomp.
Just to clarify, are you just seeing different results with different pe counts/decompositions?
Yes, with the code in main, running the decomp suite with dynpicard, all bfbcomp tests fail.
Is the global reduction in dynpicard using the internal CICE global sum method yet?
Not with the code on main, no. And that's why the bfbcomp tests fail. I have updated the code to correctly use the CICE global sum implementation, see https://github.com/phil-blain/CICE/issues/40#issuecomment-1175467783. Once I'm sure I get no failures with this code, I'll make a PR. It also leads to a serious performance regression for dynpicard, so I'd like to understand that a bit more also before I open my PR.
OK. I hope MPI Reductions are bit-for-bit for the same pe count / decomposition. You are finding that to be true, correct?
Not for Intel MPI, no, unless I set this I_MPI_CBWR=1 variable. With this environment variable, results are reproducible on the same pe count + decomp.
Interesting and surprising! What machine is that? In my experience, this is a requirement of MPI in most installations and I've never seen non-reproducibility for POP-based runs, and I check it a lot (in CESM/RASM/etc). POP has a lot of global sums, so it's a good test. I assume this is just a setting on this one particular machine?
Just to clarify, are you just seeing different results with different pe counts/decompositions?
Yes, with the code in main, running the decomp suite with dynpicard, all bfbcomp tests fail.
That's what I'd expect. I think the bfbcomp testing has benefited from the fact that there were no global sums (or similar) in CICE up to now.
Is the global reduction in dynpicard using the internal CICE global sum method yet?
Not with the code on main, no. And that's why the bfbcomp tests fail. I have updated the code to correctly use the CICE global sum implementation, see #40 (comment). Once I'm sure I get no failures with this code, I'll make a PR. It also leads to a serious performance regression for dynpicard, so I'd like to understand that a bit more also before I open my PR.
Let me know if I can help. I think it's perfectly fine to do some "bfbcomp" testing with slower global sums for the dynpicard in particular, but to use the fastest global sums in production and other testing. The separate issue is whether the CICE global sum implementation is slower than it should be. Thanks.
OK. I hope MPI Reductions are bit-for-bit for the same pe count / decomposition. You are finding that to be true, correct?
Not for Intel MPI, no, unless I set this I_MPI_CBWR=1 variable. With this environment variable, results are reproducible on the same pe count + decomp.
Interesting and surprising! What machine is that? In my experience, this is a requirement of MPI in most installations and I've never seen non-reproducibility for POP-based runs, and I check it a lot (in CESM/RASM/etc). POP has a lot of global sums, so it's a good test. I assume this is just a setting on this one particular machine?
It's one of our new Lenovo clusters (see https://www.hpcwire.com/off-the-wire/canadian-weather-forecasts-to-run-on-nvidia-powered-system/). I was also surprised, but if you follow the links to stackoverflow/stackexchange which I posted above, it is clearly indicated in the MPI standard that it is only a recommendation that repeated runs yield the same results for collective reductions. Apparently OpenMPI follows that recommendation, but Intel MPI has to be convinced with that variable. It's an environment variable for Intel MPI, so no it's not specific to that machine.
With Intel MPI, the non-reproducibility is (as far as I understand) linked to the pinning of MPI processes to specific CPUs. So if from run to run the ranks are pinned to different CPUs, then the reductions might give different results because the reduction algorithm takes advantage of the processor topology. If you always run on machines with exclusive node access, then it's possible that the pinning is always the same, so you do not notice the difference. That was the case on our previous Crays.
Just to clarify, are you just seeing different results with different pe counts/decompositions?
Yes, with the code in main, running the decomp suite with dynpicard, all bfbcomp tests fail.
That's what I'd expect. I think the bfbcomp testing has benefited from the fact that there were no global sums (or similar) in CICE up to now.
Indeed.
Is the global reduction in dynpicard using the internal CICE global sum method yet?
Not with the code on main, no. And that's why the bfbcomp tests fail. I have updated the code to correctly use the CICE global sum implementation, see #40 (comment). Once I'm sure I get no failures with this code, I'll make a PR. It also leads to a serious performance regression for dynpicard, so I'd like to understand that a bit more also before I open my PR.
Let me know if I can help. I think it's perfectly fine to do some "bfbcomp" testing with slower global sums for the dynpicard in particular, but to use the fastest global sums in production and other testing. The separate issue is whether the CICE global sum implementation is slower than it should be. Thanks.
Yes, that's my plan. But I noticed that even with bfbflag off, the new code is still slower (see https://github.com/phil-blain/CICE/issues/40#issuecomment-1188260610 and later comments). I'll get back to this soon and I'll let you know if / how I could use help. Maybe I'll open a "draft" PR with my changes and we can discuss there the performance implications. Thanks!
OK, retracing my steps back. I ran 2 decomp suites (bgen/bcmp) with dynpicard and I_MPI_CBWR=1, on main (still technically 21bd95b with https://github.com/CICE-Consortium/CICE/pull/745 on top) [suites: decomp-at-21bd95b-vp-impi-cbwr-[12]].
ppp6_intel_restart_gx3_16x2x1x1x800_droundrobin_diag1_dynpicard fails (same as usual).
Next step: back to my new code. I ran a decomp suite with dynpicard,reprosum and I_MPI_CBWR=1 [suite: decomp-vp-repro-impi-cbwr-1].
$ sgrep FAIL results.log
FAIL ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_rakeX1 bfbcomp ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_squarepop
FAIL ppp6_intel_restart_gx3_5x2x33x23x4_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum test
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x1x1x800_diag1_droundrobin_dynpicard_reprosum run
FAIL ppp6_intel_restart_gx3_16x2x1x1x800_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
This is a bit unfortunate, especially the restart failures. To me this hints at a bug in the code.
I next ran the same thing, but adding -s debug [suite: decomp-vp-repro-debug-impi-cbwr-1]:
$ sgrep FAIL results.log
FAIL ppp6_intel_restart_gbox180_16x1x6x6x60_debug_debugblocks_diag1_dspacecurve_dynpicard_reprosum run
FAIL ppp6_intel_restart_gbox180_16x1x6x6x60_debug_debugblocks_diag1_dspacecurve_dynpicard_reprosum test
FAIL ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day bfbcomp ppp6_intel_smoke_gx3_4x2x25x29x4_debug_diag1_dslenderX2_dynpicard_reprosum_run2day different-data
the gbox180 case is the same as mentioned a few times above (it fails in MPI_WAITALL just after the restart is read in the second run of the test).
the failing bfbcomp test was resubmitted a few times and it passes about half the time...
and I next re-ran a debug suite, with -init=snan,arrays added in the Macros file [suite: decomp-vp-repro-debug-init-snan]:
$ sgrep FAIL results.log
FAIL ppp6_intel_restart_gbox180_16x1x6x6x60_debug_debugblocks_diag1_dspacecurve_dynpicard_reprosum_short run
FAIL ppp6_intel_restart_gbox180_16x1x6x6x60_debug_debugblocks_diag1_dspacecurve_dynpicard_reprosum_short test
FAIL ppp6_intel_restart_gx3_16x2x1x1x800_debug_diag1_droundrobin_dynpicard_reprosum_short run
FAIL ppp6_intel_restart_gx3_16x2x1x1x800_debug_diag1_droundrobin_dynpicard_reprosum_short test
nothing unexpected here; all bfbcomp tests and all restart tests passed this time.
I reran a second identical suite, baseline comparing with the previous one [suite: decomp-vp-repro-debug-impi-cbwr-2]:
$ sgrep FAIL results.log
FAIL ppp6_intel_restart_gbox180_16x1x6x6x60_debug_debugblocks_diag1_dspacecurve_dynpicard_reprosum run
FAIL ppp6_intel_restart_gbox180_16x1x6x6x60_debug_debugblocks_diag1_dspacecurve_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_1x4x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_1x4x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum complog decomp-vp-repro-debug-cbwr-dynpicard different-data
FAIL ppp6_intel_restart_gx3_1x4x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum compare decomp-vp-repro-debug-cbwr-dynpicard 398.80 321.82 43.06 different-data
FAIL ppp6_intel_restart_gx3_1x4x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_debug_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x1x1x800_debug_diag1_droundrobin_dynpicard_reprosum run
FAIL ppp6_intel_restart_gx3_16x2x1x1x800_debug_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day complog decomp-vp-repro-debug-cbwr-dynpicard different-data
FAIL ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day compare decomp-vp-repro-debug-cbwr-dynpicard 109.13 76.86 17.08 different-data
ppp6_intel_restart_gx3_1x4x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum fails restart, baseline compare and bfbcomp.
ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day fails bfbcomp.
The differences start at the second time step, and they do not start at the last decimal at all:
diff --git 1/home/phb001/data/ppp6/cice/baselines//decomp-vp-repro-debug-cbwr-dynpicard/ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day/cice.runlog.220805-174305 2/home/phb001/data/ppp6/cice/runs//ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day.220808-115550/cice.runlog.220808-155837
index 5924f69..d515b0d 100644
--- 1/home/phb001/data/ppp6/cice/baselines//decomp-vp-repro-debug-cbwr-dynpicard/ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day/cice.runlog.220805-174305
+++ 2/home/phb001/data/ppp6/cice/runs//ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day.220808-115550/cice.runlog.220808-155837
@@ -922,47 +922,47 @@ heat used (W/m^2) = 2.70247926206599542 21.66078047047012589
istep1: 2 idate: 20050101 sec: 7200
(JRA55_data) reading forcing file 1st ts = /space/hall6/sitestore/eccc/cmd/e/sice500//CICE_data/forcing/gx3/JRA55/8XDAILY/JRA55_gx3_03hr_forcing_2005.nc
Arctic Antarctic
-total ice area (km^2) = 1.55991254493588358E+07 1.56018697755621299E+07
+total ice area (km^2) = 1.55989861815027408E+07 1.56018575740428381E+07
total ice extent(km^2) = 1.57251572666864432E+07 1.93395172319125347E+07
-total ice volume (m^3) = 1.48535756598763418E+13 2.40341818246218164E+13
-total snw volume (m^3) = 1.96453741997257983E+12 5.12234084165053809E+12
-tot kinetic energy (J) = 1.02831514062509187E+14 2.19297132090383406E+14
-rms ice speed (m/s) = 0.12005519150472969 0.13595187216180987
-average albedo = 0.96921950449670136 0.80142868450106208
-max ice volume (m) = 3.77905590440176198 2.86245209411921220
-max ice speed (m/s) = 0.49255344388362082 0.34786466500096180
+total ice volume (m^3) = 1.48535756598763867E+13 2.40341818246218164E+13
+total snw volume (m^3) = 1.96452855810116943E+12 5.12233980000631152E+12
+tot kinetic energy (J) = 1.10879416990511859E+14 2.30027827620289031E+14
+rms ice speed (m/s) = 0.12466465545409244 0.13923836296261000
+average albedo = 0.96921968750838638 0.80142904447165109
+max ice volume (m) = 3.77907249513972854 2.86247619373129503
+max ice speed (m/s) = 0.48479403870651594 0.35054796852363712
max strength (kN/m) = 129.27453302836647708 58.25651456094256275
----------------------------
arwt rain h2o kg in dt = 1.45524672839061462E+11 5.77214180149894043E+11
This is really hard for me to understand; I would expect any numerical error to accumulate slowly and start in the last decimals.
The above was mistakenly run without `I_MPI_CBWR`. I ran a third suite with the variable set [suite: `decomp-vp-repro-debug-impi-cbwr-3`], again comparing to the first one:
$ sgrep FAIL results.log
FAIL ppp6_intel_restart_gbox180_16x1x6x6x60_debug_debugblocks_diag1_dspacecurve_dynpicard_reprosum run
FAIL ppp6_intel_restart_gbox180_16x1x6x6x60_debug_debugblocks_diag1_dspacecurve_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_1x4x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum complog decomp-vp-repro-debug-cbwr-dynpicard different-data
FAIL ppp6_intel_restart_gx3_1x4x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum compare decomp-vp-repro-debug-cbwr-dynpicard 398.80 321.82 43.06 different-data
FAIL ppp6_intel_restart_gx3_16x2x1x1x800_debug_diag1_droundrobin_dynpicard_reprosum run
FAIL ppp6_intel_restart_gx3_16x2x1x1x800_debug_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day complog decomp-vp-repro-debug-cbwr-dynpicard different-data
FAIL ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day compare decomp-vp-repro-debug-cbwr-dynpicard 109.13 76.86 17.08 different-data
I then ran 2 suites with `I_MPI_FABRICS=ofi`, which fixes the failures in `MPI_WAITALL` for some reason [suites: `decomp-vp-repro-fabrics-ofi[-2]`].
First suite:
FAIL ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_rakeX1 bfbcomp ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_squarepop
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_20x2x5x4x30_diag1_dsectrobin_dynpicard_reprosum_short test
FAIL ppp6_intel_restart_gx3_20x2x5x4x30_diag1_dsectrobin_dynpicard_reprosum_short bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_1x4x25x29x16_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_1x8x30x20x32_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_1x8x30x20x32_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum run
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day bfbcomp ppp6_intel_smoke_gx3_4x2x25x29x4_debug_diag1_dslenderX2_dynpicard_reprosum_run2day different-data
one "bad departure point" in ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum
:
istep1: 58 idate: 20050103 sec: 36000
Warning: Departure points out of bounds in remap
my_task, i, j = 13 2 3
dpx, dpy = -96491.9732979196 213739.021158396
HTN(i,j), HTN(i+1,j) = 213803.742672313 214293.709816433
HTE(i,j), HTE(i,j+1) = 144794.922414856 143905.700041556
istep1, my_task, iblk = 58 13 58
Global block: 1155
Global i and j: 97 101
(abort_ice)ABORTED:
(abort_ice) error = (horizontal_remap)ERROR: bad departure points
Abort(128) on node 13 (rank 13 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 13
Second suite:
FAIL ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_rakeX1 bfbcomp ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_squarepop
FAIL ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_rakeX1 compare decomp-vp-repro-ofi 3.33 1.88 0.89 different-data
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_reprosum complog decomp-vp-repro-ofi different-data
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_reprosum compare decomp-vp-repro-ofi 13.26 9.05 1.76 different-data
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum complog decomp-vp-repro-ofi different-data
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum compare decomp-vp-repro-ofi 17.92 11.25 3.70 different-data
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_20x2x5x4x30_diag1_dsectrobin_dynpicard_reprosum_short complog decomp-vp-repro-ofi different-data
FAIL ppp6_intel_restart_gx3_20x2x5x4x30_diag1_dsectrobin_dynpicard_reprosum_short compare decomp-vp-repro-ofi 10.86 7.92 0.86 different-data
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum complog decomp-vp-repro-ofi different-data
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum compare decomp-vp-repro-ofi 12.81 8.83 1.71 different-data
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_1x4x25x29x16_diag1_droundrobin_dynpicard_reprosum complog decomp-vp-repro-ofi different-data
FAIL ppp6_intel_restart_gx3_1x4x25x29x16_diag1_droundrobin_dynpicard_reprosum compare decomp-vp-repro-ofi 44.34 32.05 6.93 different-data
FAIL ppp6_intel_restart_gx3_1x8x30x20x32_diag1_droundrobin_dynpicard_reprosum complog decomp-vp-repro-ofi different-data
FAIL ppp6_intel_restart_gx3_1x8x30x20x32_diag1_droundrobin_dynpicard_reprosum compare decomp-vp-repro-ofi 50.06 39.24 4.69 different-data
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum run
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum run
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day complog decomp-vp-repro-ofi different-data
FAIL ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day compare decomp-vp-repro-ofi 131.35 93.69 20.38 different-data
two "bad departure points":
ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum
:
istep1: 57 idate: 20050103 sec: 32400
Warning: Departure points out of bounds in remap
my_task, i, j = 12 2 2
dpx, dpy = -288562.564463527 126956.600618419
HTN(i,j), HTN(i+1,j) = 232320.512129915 232740.351201724
HTE(i,j), HTE(i,j+1) = 146934.097683249 145384.079407393
istep1, my_task, iblk = 57 12 120
Global block: 2499
Global i and j: 97 99
(abort_ice)ABORTED:
(abort_ice) error = (horizontal_remap)ERROR: bad departure points
`ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum`:
istep1: 2 idate: 20050101 sec: 7200
(JRA55_data) reading forcing file 1st ts = /space/hall6/sitestore/eccc/cmd/e/sice500//CICE_data/forcing/gx3/JRA55/8XDAILY/JRA55_gx3_03hr_forcing_2005.nc
Warning: Departure points out of bounds in remap
my_task, i, j = 15 3 2
dpx, dpy = -127523.129089799 -25378.2381045416
HTN(i,j), HTN(i+1,j) = 116473.775950696 114412.135485485
HTE(i,j), HTE(i,j+1) = 163630.115888484 165329.256508164
istep1, my_task, iblk = 2 15 62
Global block: 1228
Global i and j: 11 109
(abort_ice)ABORTED:
(abort_ice) error = (horizontal_remap)ERROR: bad departure points
Abort(128) on node 15 (rank 15 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 15
I then took a step back and ran the `nothread_suite` with my new code and without `reprosum` [suites: `nothread-vp-repro[-2]`].
I took the time to fix two bugs:
- one affecting runs with `ice_ic='none'`, in `ppp6_intel_smoke_gx3_16x1_bgcz_debug_diag1_dynpicard` and `ppp6_intel_smoke_gx3_24x1_bgcskl_debug_diag1_dynpicard`
- one in `ppp6_intel_restart_gbox128_16x1_boxnodyn_short_diag1_dynpicard` and `ppp6_intel_restart_gbox128_24x1_boxnodyn_debug_short_diag1_dynpicard`
My initial fix for the first bug (https://github.com/phil-blain/CICE/commit/ef5858ece94a0d4431127182f54fc3639bb37574) was not sufficient, as I still had two failures:
- `ppp6_intel_smoke_gx3_24x1_bgcskl_debug_diag1_dynpicard` (was still failing with the fix)
- `ppp6_intel_smoke_gx3_32x1_alt05_debug_diag1_dynpicard_short` (which has `ice_ic='internal'`)

This led me to complete the bugfix in 52fd683: `(bx,by)` were uninitialized on cells with no ice.
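To illustrate this class of bug, here is a minimal, self-contained sketch (invented names and sizes, not the actual `ice_dyn_vp` code): when a field is only written inside a loop over a packed list of "active" (ice-covered) cells, every cell not in the list keeps whatever garbage was in memory unless the array is given a defined value first.

```fortran
! Minimal sketch (hypothetical names/sizes, not the CICE code): fields
! written only over a packed index list stay undefined elsewhere.
program uninit_inactive_cells
  implicit none
  integer, parameter :: nx = 5, ny = 5, ncells = 3
  integer :: indxi(ncells) = [2, 3, 4]   ! packed list of "ice" cells
  integer :: indxj(ncells) = [2, 2, 3]
  real(kind=8) :: bx(nx, ny)
  integer :: ij

  bx = 0.0d0          ! the fix: give every cell a defined value first

  do ij = 1, ncells   ! the solver-style loop only visits active cells
     bx(indxi(ij), indxj(ij)) = 1.0d0
  end do

  ! Without the initialization above, bx on the other 22 cells would be
  ! undefined, and a later whole-array operation (a global sum, a halo
  ! update, ...) would propagate garbage or NaNs.
  print *, 'sum(bx) =', sum(bx)
end program uninit_inactive_cells
```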
First suite (note: compiled at ef5858e; only `ppp6_intel_smoke_gx3_24x1_bgcskl_debug_diag1_dynpicard` and `ppp6_intel_smoke_gx3_32x1_alt05_debug_diag1_dynpicard_short` recompiled at 52fd683):
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squareice bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_slenderX2 bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_slenderX1 bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_roundrobin bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_sectcart bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_sectrobin bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_spacecurve bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_rakeX2 bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_rakeX1 bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_rakepop bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_rakeice bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_restart_gx3_1x1x50x58x4_diag1_droundrobin_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_4x1x25x116x1_diag1_dslenderX1_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_12x1x4x29x9_diag1_dspacecurve_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_16x1x8x10x10_diag1_droundrobin_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_6x1x50x58x1_diag1_droundrobin_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_8x1x19x19x5_diag1_droundrobin_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_20x1x5x29x20_diag1_dsectrobin_dynpicard_short bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_32x1x5x10x12_diag1_drakeX2_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_16x1x8x10x10_diag1_droundrobin_dynpicard_maskhalo bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_4x1x25x29x4_diag1_droundrobin_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
These are all bfbcomp failures, which are expected since this suite does not use `reprosum`.

Second suite:
$ sgrep FAIL results.log
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squareice bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_slenderX2 bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_slenderX1 bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_roundrobin bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_sectcart bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_sectrobin bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_spacecurve bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_rakeX2 bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_rakeX1 bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_rakepop bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_rakeice bfbcomp ppp6_intel_decomp_gx3_8x1x5x29x20_diag1_dynpicard_squarepop
FAIL ppp6_intel_restart_gx3_1x1x50x58x4_diag1_droundrobin_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_4x1x25x116x1_diag1_dslenderX1_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_12x1x4x29x9_diag1_dspacecurve_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_16x1x8x10x10_diag1_droundrobin_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_6x1x50x58x1_diag1_droundrobin_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_8x1x19x19x5_diag1_droundrobin_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_20x1x5x29x20_diag1_dsectrobin_dynpicard_short bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_32x1x5x10x12_diag1_drakeX2_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_16x1x8x10x10_diag1_droundrobin_dynpicard_maskhalo bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
FAIL ppp6_intel_restart_gx3_4x1x25x29x4_diag1_droundrobin_dynpicard bfbcomp ppp6_intel_restart_gx3_8x1x25x29x2_diag1_dslenderX2_dynpicard different-data
Then I again ran 2 `nothread` suites (bgen/bcmp), but with `reprosum` [suites: `nothread-vp-repro-reprosum-[12]`].
All passed (bfbcomp, restart, compares).
Next, I ran the `decomp` suite with `reprosum` [suite: `decomp-vp-repro-reprosum-1`]:
$ sgrep FAIL results.log
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_reprosum run
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum run
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum test
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
Both run failures are "bad departure points".
I recompiled the first one with `-init=snan,arrays` and this uncovered (by accident!) a bug in `ice_grid` (`l_readCenter` is not initialized unless we go through `popgrid_nc`). This led to `TLAT` being NaN in `gridbox_corners`. I'll fix that and retry.
EDIT: the PR for that bugfix is here: https://github.com/CICE-Consortium/CICE/pull/758
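As an aside, this is the usual shape of such a bug; here is a minimal sketch (invented names, not the actual `ice_grid` code) of a flag that is only assigned on one code path and then consulted on all of them:

```fortran
! Minimal sketch (invented names, not the actual ice_grid code) of a
! flag that is only set on one code path.
module grid_sketch
  implicit none
  logical :: l_have_centers          ! bug: no default value
  real(kind=8) :: tlat_sketch(4)
contains
  subroutine read_grid_nc()
     l_have_centers = .true.         ! the only place the flag is set
     tlat_sketch = 1.0d0
  end subroutine read_grid_nc

  subroutine compute_centers()
     ! On paths that never call read_grid_nc(), l_have_centers holds an
     ! undefined value; if it happens to read as .true., tlat_sketch is
     ! never filled here and keeps its (possibly NaN) initial content.
     if (.not. l_have_centers) tlat_sketch = 2.0d0
  end subroutine compute_centers
end module grid_sketch

program demo
  use grid_sketch
  implicit none
  call compute_centers()             ! read_grid_nc() intentionally not called
  print *, tlat_sketch
end program demo
```

The usual fix is to give such a flag an explicit default (in the sketch, `logical :: l_have_centers = .false.`) or to set it on every path, so the behaviour does not depend on leftover memory contents.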
With that bug fixed (a4cf10e), I ran a second suite (bcmp) [suite: `decomp-vp-repro-reprosum-2`]:
FAIL ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_rakeX1 bfbcomp ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_squarepop
FAIL ppp6_intel_decomp_gx3_4x2x25x29x5_diag1_dynpicard_reprosum_rakeX1 compare decomp-vp-repro-reprosum 3.35 1.97 0.86 different-data
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum complog decomp-vp-repro-reprosum different-data
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum compare decomp-vp-repro-reprosum 20.32 13.45 3.79 different-data
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum test
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum complog decomp-vp-repro-reprosum different-data
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum compare decomp-vp-repro-reprosum 103.40 77.58 13.17 different-data
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum complog decomp-vp-repro-reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum compare decomp-vp-repro-reprosum 16.51 10.52 1.37 different-data
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum test
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum complog decomp-vp-repro-reprosum different-data
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum compare decomp-vp-repro-reprosum 12.86 8.47 1.10 different-data
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum bfbcomp ppp6_intel_restart_gx3_4x2x25x29x4_diag1_dslenderX2_dynpicard_reprosum different-data
I checked `grep -r -L "min/max TLAT:" */*/logs/cice.runlog*` to find all tests in all suites for which the code did not go through `Tlonlat`, because `l_readCenter` happened to be initialized to `.true.`. This did not reveal anything interesting, as most runs were not in debug mode.
I checked `grep -r -l "bad departure points" */*/logs/cice.runlog*` to get a feel for which test cases still experience "bad departure points":
decomp-vp-repro-fabrics-ofi-2/ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum.220809-085602/logs/cice.runlog.220809-125951
decomp-vp-repro-fabrics-ofi-2/ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum.220809-085602/logs/cice.runlog.220809-125952
decomp-vp-repro-fabrics-ofi/ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum.220809-082813/logs/cice.runlog.220809-123209
decomp-vp-repro-init-fxy/ppp6_intel_restart_gx3_16x2x3x3x100_droundrobin_dynpicard_reprosum.20220726-1/logs/cice.runlog.220726-151404
decomp-vp-repro-init-fxy/ppp6_intel_restart_gx3_16x2x3x3x100_droundrobin_dynpicard_reprosum.20220726-1/logs/cice.runlog.220726-174327
decomp-vp-repro-reprosum/ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum.220810-133944/logs/cice.runlog.220810-174337
decomp-vp-repro-reprosum/ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_reprosum.220810-133944/logs/cice.runlog.220810-174336
decomp-vp-repro-rerun-1/ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum.20220726-2/logs/cice.runlog.220726-180226
decomp-vp-repro-rerun-1/ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum.20220726-2/logs/cice.runlog.220728-164400
It seems it is these 4 tests:
ppp6_intel_restart_gx3_16x2x2x2x200_diag1_droundrobin_dynpicard_reprosum
ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard_reprosum
ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo_reprosum
ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_reprosum
A few remarks:
- all four use `reprosum`
- all four use `droundrobin`

So I ran the `decomp` suite without `reprosum` [suite: `decomp-vp-repro-no-reprosum`].
$ sgrep FAIL results.log | sgrep ' test'
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard test
FAIL ppp6_intel_restart_gx3_5x2x33x23x4_diag1_droundrobin_dynpicard test
FAIL ppp6_intel_restart_gx3_4x2x19x19x10_diag1_droundrobin_dynpicard test
FAIL ppp6_intel_restart_gx3_8x2x8x10x20_diag1_droundrobin_dynpicard_maskhalo test
FAIL ppp6_intel_restart_gx3_16x2x3x3x100_diag1_droundrobin_dynpicard test
OK, let's get to the bottom of the "bad departure points" error.
I cooked up a stress test suite by creating one `set_nml` option per output field (`f_* = 'd'`) and then creating a suite that runs a `smoke_gx3_16x2x3x3x100` test 144 times, once per output field option (this guarantees separate test directory names). Since `histfreq` is not changed (from the default `'m'`), the additional option should play no role whatsoever, as it's just an output field and anyway we run for less than one month.
I used the `smoke` test instead of `restart` just to simplify things (it usually fails in the first run of the restart test), and I added `run10day` so it runs for the same length as the restart test.
I ran this suite in several configurations:
- with `dynpicard` (but not `reprosum`), on the new code (`stress-bad-departure-points1`)
- a second identical run (`stress-bad-departure-points2`)
- with `dynpicard` on the new code, compiling in debug mode and with `I_MPI_LIBRARY_KIND=debug` (`stress-bad-departure-points-debug`)
- with `dynpicard` on the old code (`stress-bad-departure-points-old`)
- with `dynpicard` on the new code, but with `16x1x3x3x100` instead (`stress-bad-departure-points-nothr`)
OK so this points to some weird OpenMP stuff in the new code.
So I scrutinized my commits and found the error, introduced in 693fd29:
diff --git a/cicecore/cicedynB/dynamics/ice_dyn_vp.F90 b/cicecore/cicedynB/dynamics/ice_dyn_vp.F90
index d90a2a8..87c87ec 100644
--- a/cicecore/cicedynB/dynamics/ice_dyn_vp.F90
+++ b/cicecore/cicedynB/dynamics/ice_dyn_vp.F90
@@ -878,6 +878,8 @@ subroutine anderson_solver (icellt , icellu , &
vrel (:,:,iblk))
! Compute nonlinear residual norm (PDE residual)
+ Fx = c0
+ Fy = c0
call matvec (nx_block , ny_block , &
icellu (iblk) , icellt (iblk), &
indxui (:,iblk) , indxuj (:,iblk), &
The problem is that we are inside an OpenMP loop here, but we initialize the whole `F[xy]` arrays. This is similar to undefined behaviour, I think. What was happening when I had "bad departure points" was that the `nlres_norm` computed afterwards:
call residual_vec (nx_block , ny_block , &
icellu (iblk), &
indxui (:,iblk), indxuj (:,iblk), &
bx (:,:,iblk), by (:,:,iblk), &
Au (:,:,iblk), Av (:,:,iblk), &
Fx (:,:,iblk), Fy (:,:,iblk))
enddo
!$OMP END PARALLEL DO
nlres_norm = sqrt(global_sum_prod(Fx(:,:,:), Fx(:,:,:), distrb_info, field_loc_NEcorner) + &
global_sum_prod(Fy(:,:,:), Fy(:,:,:), distrb_info, field_loc_NEcorner))
if (my_task == master_task .and. monitor_nonlin) then
write(nu_diag, '(a,i4,a,d26.16)') "monitor_nonlin: iter_nonlin= ", it_nl, &
" nonlin_res_L2norm= ", nlres_norm
endif
was identically zero. I checked that by running my stress test suite with `monitor_nonlin = .true.`. So somehow the `F[xy]` arrays ended up being all zeros, even if the "last" thread to go through the code should have at least written correctly to its section of the arrays (weird!). And then we would exit the nonlinear iterations too early:
! Compute relative tolerance at first iteration
if (it_nl == 0) then
tol_nl = reltol_nonlin*nlres_norm
endif
! Check for nonlinear convergence
if (nlres_norm < tol_nl) then
exit
In the failing runs, the aborts happened after the solver exited after only 1 nonlinear iteration, so I guess the solution was not "solved" enough and that led to the "bad departure points" error.
Fixed in be571c5
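To make the failure mode concrete, here is a minimal, self-contained sketch of this class of bug (invented names and sizes, not the actual `ice_dyn_vp` code): every iteration zeroing the whole shared array inside the parallel loop races with the writes done by other threads, whereas zeroing it once before the loop, or only each iteration's own block, is safe.

```fortran
! Minimal sketch (hypothetical names/sizes, not the CICE code) of the race.
program omp_race_sketch
  implicit none
  integer, parameter :: nx = 4, nblocks = 8
  real(kind=8) :: fx(nx, nblocks)
  integer :: iblk

  ! Buggy pattern: every thread zeroes the WHOLE shared array inside the
  ! parallel loop, wiping out blocks already filled by other threads.
  !$omp parallel do private(iblk)
  do iblk = 1, nblocks
     fx = 0.0d0                       ! data race on the shared array
     fx(:, iblk) = real(iblk, 8)      ! per-block work
  end do
  !$omp end parallel do
  print *, 'buggy sum =', sum(fx)     ! result depends on thread timing

  ! Safe pattern: initialize once before the loop, then let each
  ! iteration touch only its own block.
  fx = 0.0d0
  !$omp parallel do private(iblk)
  do iblk = 1, nblocks
     fx(:, iblk) = real(iblk, 8)
  end do
  !$omp end parallel do
  print *, 'fixed sum =', sum(fx)     ! deterministic: 4*(1+2+...+8) = 144
end program omp_race_sketch
```

Whichever form the actual fix takes, the key point is that inside the parallel region no iteration may write to another block's slice of a shared array.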
With this fix, the `decomp_suite` PASSes completely [`decomp-vp-repro-reprosum-init-fxy-fix`]! I ran it twice (bgen/bcmp) and all compare tests also PASS [`decomp-vp-repro-reprosum-[34]`]!
So it seems I got to the bottom of everything.
EDIT: `decomp-vp-repro-reprosum-[34]` were in fact run with EVP, not VP. I'll redo them.
OK, new suites `decomp-vp-repro-reprosum-cbwr-[12]`, all PASS.
Excellent @phil-blain, looks like this was a real challenging bug to sort out!
Thanks! Yeah, OpenMP is tricky! It definitely did not help that the failures would disappear when compiling in debug mode!
A little recap with new suites (these are all with `dynpicard` and the new code):
- `I_MPI_CBWR=1`, `bfbflag=off` -> bfbcomp FAIL, compare PASS (`decomp-vp-repro-no-reprosum-cbwr-[12]`)
- `I_MPI_CBWR=0`, `bfbflag=off` -> bfbcomp FAIL, compare FAIL (`decomp-vp-repro-impi-rel-[12]`, `nothread-vp-repro-impi-rel-[12]`)
- `I_MPI_CBWR=1`, `bfbflag=reprosum` -> bfbcomp PASS, compare PASS (`decomp-vp-repro-reprosum-cbwr-[12]`)
- `I_MPI_CBWR=0`, `bfbflag=reprosum` -> bfbcomp PASS, compare PASS (`nothread-vp-repro-reprosum-impi-rel-[12]`, `decomp-vp-repro-reprosum-impi-rel[-2]`, `nothread-vp-reprosum-no-cbwr-no-fabrics-[12]`, `nothread-vp-reprosum-no-cbwr[-2]`, `decomp-vp-repro-reprosum-no-icbwr[-[23]`, the last 6 being with `I_MPI_LIBRARY_KIND=debug`)
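Stepping back, the pattern in this recap (bfbcomp only passes with a reproducible global sum) comes down to floating-point addition not being associative: with `bfbflag=off` the partial sums are combined in an order that depends on the decomposition (and, without `I_MPI_CBWR`, on the MPI library's reduction algorithm), so results differ in the last bits. A tiny stand-alone sketch, not CICE code, showing the effect:

```fortran
! Minimal sketch (not CICE code): the same numbers summed in two
! different orders give results that can differ in the last bits,
! which is what a decomposition-dependent global sum does.
program sum_order_sketch
  implicit none
  integer, parameter :: n = 100000
  real(kind=8) :: x(n), s_fwd, s_rev
  integer :: i

  ! values spanning many magnitudes so rounding differences show up
  do i = 1, n
     x(i) = 1.0d0 / real(i, 8)**2
  end do

  s_fwd = 0.0d0
  do i = 1, n              ! "decomposition A" summation order
     s_fwd = s_fwd + x(i)
  end do

  s_rev = 0.0d0
  do i = n, 1, -1          ! "decomposition B" summation order
     s_rev = s_rev + x(i)
  end do

  print *, 's_fwd - s_rev =', s_fwd - s_rev   ! usually nonzero
end program sum_order_sketch
```

Order-insensitive summation algorithms (which is what the reproducible `bfbflag` settings provide) remove this dependence on how the work is split up, which is consistent with the PASSes above.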
And here are similar tests with EVP, with `cmplog` such that `bfbcomp` tests check log files instead of restarts, and with this change such that baseline compares also check against log files even if the restarts are bit4bit:
diff --git a/./configuration/scripts/tests/baseline.script b/./configuration/scripts/tests/baseline.script
index bb8f50a..82a770b 100644
--- a/./configuration/scripts/tests/baseline.script
+++ b/./configuration/scripts/tests/baseline.script
@@ -65,7 +65,7 @@ if (${ICE_BASECOM} != ${ICE_SPVAL}) then
${ICE_CASEDIR}/casescripts/comparebfb.csh ${base_dir} ${test_dir}
set bfbstatus = $status
- if ( ${bfbstatus} != 0 ) then
+ #if ( ${bfbstatus} != 0 ) then
set test_file = `ls -1t ${ICE_RUNDIR}/cice.runlog* | head -1`
set base_file = `ls -1t ${ICE_BASELINE}/${ICE_BASECOM}/${ICE_TESTNAME}/cice.runlog* | head -1`
@@ -97,7 +97,7 @@ if (${ICE_BASECOM} != ${ICE_SPVAL}) then
endif
endif
- endif
+ #endif
endif
- `I_MPI_CBWR=1`, `bfbflag=off` -> bfbcomp FAIL, compare PASS (`decomp-evp-cbwr-cmplog`, `decomp-evp-cbwr-rstlog-[12]`)
- `I_MPI_CBWR=0`, `bfbflag=off` -> bfbcomp FAIL, compare (complog) FAIL (3 failures) (`decomp-evp-no-cbwr-cmplog`, `decomp-evp-no-cbwr-rstlog-[12]`)
- `I_MPI_CBWR=1`, `bfbflag=reprosum` -> bfbcomp PASS, compare PASS (`decomp-evp-cbwr-reprosum-cmplog`, `decomp-evp-cbwr-reprosum-rstlog-[12]`)
- `I_MPI_CBWR=0`, `bfbflag=reprosum` -> bfbcomp PASS, compare PASS (`decomp-evp-no-cbwr-reprosum-cmplog`, `decomp-evp-no-cbwr-reprosum-rstlog-[12]`)

So all in all, the same behaviour as the Picard solver with respect to global sums.
I ran an EVP decomp suite pair with `I_MPI_CBWR=0`, `bfbflag=off`, on robert, on which nodes are exclusive [`decomp-evp-no-cbwr-rstlog-robert-[12]`].
The 3 complog tests that were failing on ppp6 PASS, which pretty much confirms my theory above about the non-reproducibility being due to job placement on different procs when nodes are shared.
I reran the decomp_suite with VP, once with `bfbflag=ddpdd` and once with `bfbflag=lsum16`. Both PASS. This is nice because `lsum16` is probably a lot less of a performance hit than the other two :)