ufs-community / ufs-weather-model

UFS Weather Model

Regional application bitwise reproducibility problem using different MPI layout and/or threads #196

Closed RatkoVasic-NOAA closed 3 years ago

RatkoVasic-NOAA commented 4 years ago

Description

Regional FV3 produces different results when run with different MPI layouts and/or different numbers of threads, so this application cannot pass the regression tests in ufs-weather-model. The current regression tests only exercise restart and quilting capabilities, so the problem has probably existed for some time; an older version (03/2020) shows the same behavior.

To Reproduce:

We are seeing this problem on the WCOSS machines and on Hera. Jim Abeles managed to get bit-identical results on Orion with the old code (03/2020).

To replicate the problem:

1. Go to ufs-weather-model/tests/
2. Run rt.sh -fk, using a short, two-line version of rt.conf:

COMPILE | CCPP=Y SUITES=FV3_GFS_2017_gfdlmp_regional 32BIT=Y REPRO=Y | standard | | fv3 |
RUN     | fv3_ccpp_regional_control                                  | standard | | fv3 |

NOTE: the -k option in rt.sh keeps the run directory.

3. Go to the run directory, save the history files, and submit the job again (using job_card), but this time change only one line in input.nml: from layout = 4,6 to layout = 6,4.
4. Compare the saved and new results (a shell sketch of these steps is below).
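A rough shell sketch of steps 2-4 (the run-directory path, the *.nc glob for the history files, and sbatch as the submit command are assumptions; adjust for your machine and scheduler):

cd ufs-weather-model/tests
./rt.sh -fk                                         # rt.conf reduced to the two lines above; -k keeps the run directory
cd /path/to/run_directory                           # assumption: the path is printed in the rt.sh log
mkdir saved && cp *.nc saved/                       # save the history files from the first run
sed -i 's/layout *= 4,6/layout   = 6,4/' input.nml  # change only the layout line
sbatch job_card                                     # assumption: Slurm; use bsub or your scheduler on WCOSS
for f in *.nc; do cmp -s "$f" "saved/$f" || echo "DIFFERS: $f"; done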

arunchawla-NOAA commented 4 years ago

Is this true for multiple physics suites or the specific ones listed here?

RatkoVasic-NOAA commented 4 years ago

Is this true for multiple physics suites or the specific ones listed here?

Any physics suite (two CCPP suites tested).

yangfanglin commented 4 years ago

What is the C* resolution? Can it be divided by both 4 and 6?

Fanglin

RatkoVasic-NOAA commented 4 years ago

Both C768 and C96. It's one face, but the number of points is:

       npx      = 211
       npy      = 193

I'll try different layouts. That still doesn't explain the differences with threads, but maybe after this change the threading will be fixed too!? UPDATE: Unfortunately, that didn't help. I used a layout with both npx-1 and npy-1 divisible by both 2 and 3:

<        layout   = 2,3
---
>        layout   = 3,2

And results still differ.
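For reference, the decomposition arithmetic behind that layout choice (assuming the compute domain is npx-1 by npy-1 cells, i.e. 210 x 192 here):

# both layouts split 210 x 192 into whole subdomains
for lay in "2 3" "3 2"; do
  set -- $lay
  echo "layout=$1,$2 -> $((210/$1)) x $((192/$2)) cells per MPI task, remainders $((210%$1)),$((192%$2))"
done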

climbfuji commented 4 years ago

We do have threading tests for the global runs, and these pass on all machines every time we merge a commit. So this must be something specific to the regional application of the code. I know that there is quite a bit of code in the dycore (GFDL_atmos_cubed_sphere) that is only executed for regional and/or nested runs.

One thing we should do to further drill down on this is to test a nested config. It would be good to know if the problem exists only for ntiles=1 or also for ntiles=7.

In the past, I fixed some obviously wrong code in the dycore for regional applications (routine exchange_uv) that didn't cause problems on any of the NOAA RDHPC systems but did on Cheyenne (run-to-run differences with exactly the same setup). In that case it was an error in the MPI code in that routine.

To my knowledge, there is no code in the CCPP physics that depends on the number of tiles or on whether it is a global, regional, or nested setup. Thus it seems more likely - but no guarantee, of course - that this is a problem with the dycore or the fv3atm model (not initializing everything properly for cold starts/restarts) rather than with the CCPP physics.

Do you want me to help debugging this issue, or are you going to take care of it?

climbfuji commented 4 years ago

Here is an interesting twist. Not sure if it is related, or whether it has to do with the jet software stack or build config.

When I create a new baseline on jet using ecflow and then verify against it, I get bit-for-bit (b4b) differences for fv3_ccpp_decomp, i.e. when changing the decomposition, and that is a global run. That said, I also get b4b differences for all tests when I don't use ecflow (i.e. when I compile on the login node instead of on a compute node), so there might be something buggy with the jet setup in general.

RatkoVasic-NOAA commented 4 years ago

Do you want me to help debugging this issue, or are you going to take care of it?

Dom, we would really appreciate your help in solving this problem. Maybe this can help: on Hera, I created a small test setup at /scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/wrk/REG_RT with one source directory and two run directories that differ only in layout:

Hera:/scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/wrk/REG_RT>diff  run_*/input.nml
39c39
<        layout   = 2,3
---
>        layout   = 3,2

Job cards point to the same executable:

Hera:/scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/wrk/REG_RT>ll run_*/job*
-rwxr--r-- 1 Ratko.Vasic fv3-cam 624 Sep  2 21:56 run_1/job_card
-rwxr--r-- 1 Ratko.Vasic fv3-cam 624 Sep  2 22:07 run_2/job_card
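Once both runs finish, the history files can be compared bitwise along these lines (assuming both run directories end up with the same set of netCDF outputs):

cd /scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/wrk/REG_RT
for f in run_1/*.nc; do
  cmp "$f" "run_2/${f##*/}"    # cmp prints the first differing byte; silent if the files are identical
done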

climbfuji commented 4 years ago

Do you know if this runs with the release/public-v2 branch (essentially develop, just before the ESMF 8.1.0 bs21 update was made)?

RatkoVasic-NOAA commented 4 years ago

This is from the git log.

commit 1e4edf0ac90d8de714becfa362c36de8758b8281 (HEAD -> develop, origin/develop, origin/HEAD)
Author: Dom Heinzeller <dom.heinzeller@icloud.com>
Date:   Wed Aug 26 09:40:41 2020 -0600

    develop: cleanup, remove legacy code, minor bugfixes (#190)

BTW, we tested older code (March 2020) and got the same results.

climbfuji commented 4 years ago

I'll use release/public-v2 then, since it's the most relevant for this problem. We can also bring bugfixes back to develop if needed.

climbfuji commented 4 years ago

Ok, I could reproduce the problem when compiling the code as

./compile_cmake.sh $PWD/.. hera.intel 'CCPP=Y SUITES=FV3_GFS_v15_thompson_mynn' '' NO NO 2>&1 | tee compile.log

Thus this also happens with double-precision dynamics.
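For comparison, the 32-bit REPRO build that the regression test itself uses would look roughly like this (same compile script as above, options taken from the rt.conf line at the top of this issue; the exact option string is an assumption):

./compile_cmake.sh $PWD/.. hera.intel 'CCPP=Y SUITES=FV3_GFS_2017_gfdlmp_regional 32BIT=Y REPRO=Y' '' NO NO 2>&1 | tee compile_32bit.log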

llpcarson commented 4 years ago

Don't know if this is related, but the HWRF team recently had reproducibility issues on jet due to the heterogeneous nodes and compiler optimizations; they appeared after an Intel compiler version update (jet/HWRF had been using an old compiler version). So it could be a processor/node difference between the specific jet partitions (tjet, ujet, sjet, etc.).

climbfuji commented 4 years ago

Does anyone know what type of nodes the login nodes on jet are? Are they the same as one of the jet partitions?

climbfuji commented 4 years ago

All, this is definitely something in the dycore, and it happens right at the beginning. I am stopping the model in atmos_model.F90 around line 530, right after the call to

   call atmosphere_init (Atmos%Time_init, Atmos%Time, Atmos%Time_step,&
                         Atmos%grid, Atmos%area)

This is before CCPP (or IPD) is even initialized; only a first pass through the dycore has been made for initialization. At this point the tracer array Atm(mygrid)%q is already different. Some of the other diagnostic output is also different:

0:  After adi: W max =    1.16725366723497       min =  -0.438392996489926
0:  na_ini Z500   5754.19968088326        5731.28070372451
0:   0.000000000000000E+000   5868.19062118756

versus

0:  After adi: W max =    1.16727826557961       min =  -0.438352988116583
0:  na_ini Z500   5754.19968090299        5731.28070391164
0:   0.000000000000000E+000   5868.19062037472
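A quick way to pull those diagnostics out of the two runs side by side (run_1/run_2 are the two layout runs from Ratko's setup above; the stdout file name out is an assumption, adjust to whatever the job_card actually writes):

grep -m1 "After adi:"  run_1/out run_2/out
grep -m1 "na_ini Z500" run_1/out run_2/out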

DusanJovic-NOAA commented 4 years ago

This commit (https://github.com/ufs-community/ufs-weather-model/commit/1150bf5a393b3b3eecefa452f8e7ee94dc1b59aa), made on Jun 5, is the last commit before the FV3 dynamical core was updated to GFDL 201912. It gives bit-identical outputs with the 4x6 and 6x4 layouts when configured with do_sat_adj = .F. Tested on Hera using the fv3_ccpp_regional_control test.

climbfuji commented 4 years ago

@DusanJovic-NOAA FYI Chan-Hoo found a bug in the regional code in the dycore: a missing regional boundary update/exchange of the extra (fourth) row for the velocities on the C and D grids. PR to come. This solves the threading and layout b4b differences for the GFDL-MP (and presumably Zhao-Carr) physics runs, but not yet for Thompson+MYNN. That means there is another bug, this time in the physics.

climbfuji commented 4 years ago

The halo boundary update bugfix in the FV3 dycore went in with PR https://github.com/ufs-community/ufs-weather-model/pull/208 (https://github.com/NOAA-EMC/GFDL_atmos_cubed_sphere/pull/40).

Other issues such as the rewrite of the MPI reduce function and the bug in the dynamics-physics update step for Thompson MP still need to be addressed.

arunchawla-NOAA commented 3 years ago

@climbfuji and @RatkoVasic-NOAA was this problem solved? I thought it was. If yes can you close this ticket?

RatkoVasic-NOAA commented 3 years ago

@arunchawla-NOAA Yes. I'll close the ticket.

climbfuji commented 3 years ago

This was solved only for the SRW app public release branch. Following discussion with GFDL, this solution should not be brought over to the main development branch; see issue https://github.com/NOAA-EMC/GFDL_atmos_cubed_sphere/issues/55 for more information. You can keep this issue closed, because we have the issue open in the correct repository.