ufs-community / ufs-weather-model

UFS Weather Model

GEFS EP5 does not reproduce with different number of MPI tasks #2203

Open junwang-noaa opened 3 months ago

junwang-noaa commented 3 months ago

Description

George V. found that the GEFS EP5 test case does not reproduce with different numbers of ATM MPI tasks when he was testing GEFS scalability. Further investigation showed that the test reproduces with different ATM task counts for the atm-only and S2S configurations, but not for S2SW when both cplwav and cplwav2atm are set to .true.

To Reproduce:

Run the EP5 test, change the atm layout from (16,16) to (16,24), and compare the atmf or sfcf files.
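The reproducibility check above amounts to a bitwise comparison of history files between the two layouts. Here is a minimal sketch in Python of that comparison logic (the field dicts are placeholders for data that would in practice be read from the atmf/sfcf NetCDF files; field names and values below are synthetic, except the temperatures, which are taken from the log later in this thread):

```python
import numpy as np

def fields_reproduce(run_a, run_b):
    """Compare two runs' output fields bitwise (rtol = atol = 0).

    run_a, run_b: dicts mapping field name -> numpy array, standing in
    for fields read from each layout's atmf/sfcf files.
    Returns the list of field names that differ.
    """
    diffs = []
    for name in sorted(set(run_a) & set(run_b)):
        if not np.array_equal(run_a[name], run_b[name]):
            diffs.append(name)
    return diffs

# Synthetic data standing in for the (16,16) vs (16,24) runs:
a = {"tmpsfc": np.full((4, 4), 258.991302490234)}
b = {"tmpsfc": np.full((4, 4), 258.991302490234)}
b["tmpsfc"][2, 1] = 259.150024414062  # one differing point, as in the report
print(fields_reproduce(a, b))         # -> ['tmpsfc']; a reproducing run prints []
```

A tool such as nccmp does the same job directly on the NetCDF files; the sketch only shows the pass/fail criterion (exact equality, not a tolerance).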

Additional context

Output

junwang-noaa commented 3 months ago

Denise and I looked into the GEFS EP5 case you provided. Since the case fails to reproduce only when cplwav2atm=.true., we tested the ATMW configuration with code from the latest develop branch for debugging. We found that a point (313, 113) on tile 3 has a different bottom temperature after the first integration step.

```
 893:  mype=         893  in setup_export=   258.991302490234      i,j=         313 113
 893:  mype=         893  in setup_export=   259.155151367188      i,j=         313 113
```
vs
```
 589:  mype=         589  in setup_export=   258.991302490234      i,j=         313 113
 589:  mype=         589  in setup_export=   259.150024414062      i,j=         313 113
```

Further testing showed that the z0 from the wave model has a bad value. Because of the decomposition, the 16x16 test has z0 updated in fv3atm, but the 16x24 test does not:

```
 893:  in assign_import,n=          17 found= T
 893:  in assign_import,n=          17 datar8(isc,jsc)=  -101947800.000000
```
vs
```
 589:  in assign_import,n=          17 found= T
 589:  in assign_import,n=          17 datar8(isc,jsc)=  0.000000000000000E+000
 589:  in assign_import,n=          17 T cplwav2atm= T findex=          17
 589:   in assign zorlwav=  -999.000000000000      ix=           1  nb=          13
 589:  zorlw=  0.317000000000000      lon=   137.359066603078      lat=
 589:    55.6649708379619      tbom=   258.991302490234
```

The z0 value of "-101947800.000000" from restart.ww3 is not correct. @bingfu-NOAA, may I ask how the test is set up and where restart.ww3 comes from? Also, would you please provide a test from the latest develop branch? The case failed when we tried to run it with the latest develop branch. Thanks.
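The decomposition-dependent behavior above (z0 updated in the 16x16 run but not the 16x24 run) suggests the import path needs a validity guard so a bad value in restart.ww3 cannot leak through on some ranks. A minimal sketch of such a guard in Python, with numpy standing in for the Fortran arrays (the -999 sentinel is taken from the log above; the plausibility bound is an assumption, not a value from the model):

```python
import numpy as np

ZORL_FILL = -999.0        # unset-value sentinel seen in the log (zorlwav = -999)
MAX_PLAUSIBLE_Z0 = 100.0  # assumed sanity bound; a real bound would come from WW3

def apply_wave_z0(zorlwav, z0_import):
    """Update the atmosphere's wave roughness only where the import is valid.

    Points where the imported z0 is non-finite, non-positive, or implausibly
    large (e.g. the -101947800.0 read from restart.ww3) keep their previous
    value instead of overwriting it.
    """
    z0 = np.asarray(z0_import, dtype=float)
    valid = np.isfinite(z0) & (z0 > 0.0) & (z0 < MAX_PLAUSIBLE_Z0)
    out = np.asarray(zorlwav, dtype=float).copy()
    out[valid] = z0[valid]
    return out

zorl = np.full(4, ZORL_FILL)
imported = np.array([0.317, -101947800.0, 0.0, 0.25])
# Only the two valid points (0.317 and 0.25) replace the -999 fill:
print(apply_wave_z0(zorl, imported))
```

This only illustrates the masking idea; whether the fix belongs in fv3atm's import handling or in the WW3 restart itself is exactly the question raised in this thread.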

bingfu-NOAA commented 3 months ago

@junwang-noaa can you show the location of the file and your rundir?

junwang-noaa commented 3 months ago

On hera: /scratch1/NCEPDEV/stmp2/Jun.Wang/ep5/gefscase.atmw/atmwav.rundir16x24 /scratch1/NCEPDEV/stmp2/Jun.Wang/ep5/gefscase.atmw/atmwav.rundir

I will show you the case on wcoss2 when the switch is done.

junwang-noaa commented 3 months ago

@NeilBarton-NOAA FYI.

bingfu-NOAA commented 3 months ago

@junwang-noaa @NeilBarton-NOAA @JessicaMeixner-NOAA Just an update: I can reproduce 16x16 from 16x24 ATM layout using HR3 tag and replay ICs.

junwang-noaa commented 3 months ago

That's great! So far we have found that in the EP5 case you gave us, after removing restart.ww3, MOM6 produces different results at fh=1hr. It's not clear what causes that. @bingfu-NOAA, would you please share the run directory so that we can continue checking the scalability of EP5? @GeorgeVandenberghe-NOAA FYI.

bingfu-NOAA commented 3 months ago

I saved the rundir on Dogwood here: /lfs/h2/emc/gefstemp/Bing.Fu/ep5rep, but some files inside the rundir are soft links.

DeniseWorthen commented 3 months ago


A second issue that came up in testing is that the elementMask in the mesh file used by the wave model has invalid values everywhere except the first 1440 entries, which correspond to the j=1 row. All other values are large negative integers.
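The elementMask symptom described above is easy to check programmatically. A minimal sketch in Python (numpy only; in practice the mask would be read from the mesh NetCDF file, and the assumption here is that valid mask values are 0 or 1, the usual ESMF mesh convention):

```python
import numpy as np

def check_element_mask(mask, valid_values=(0, 1)):
    """Report how many elementMask entries fall outside the valid set.

    Returns (n_bad, first_bad_index) so the bad region can be located;
    first_bad_index is None when the mask is clean.
    """
    mask = np.asarray(mask)
    bad = ~np.isin(mask, valid_values)
    n_bad = int(bad.sum())
    first = int(np.argmax(bad)) if n_bad else None
    return n_bad, first

# Synthetic mask mimicking the reported symptom: the first 1440 entries
# (the j=1 row) are valid, everything after is a large negative integer.
mask = np.full(1440 * 4, -2147483647, dtype=np.int64)
mask[:1440] = 1
print(check_element_mask(mask))  # -> (4320, 1440)
```

A first-bad-index that lands exactly at 1440 would confirm the "only the j=1 row is valid" pattern, pointing at how the mask array was written to (or read from) the mesh file rather than at the model itself.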