ufs-community / ufs-weather-model

UFS Weather Model

Decomposition test failure #1103

Closed MinsukJi-NOAA closed 2 years ago

MinsukJi-NOAA commented 2 years ago

Description

cpld_decomp_p8 fails with a different domain decomposition

To Reproduce:

What compilers/machines are you seeing this with? Intel/Hera

Give explicit steps to reproduce the behavior.

  1. Check out the latest ufs weather model (9b6b740)
  2. cd ufs-weather-model/tests; ./rt.sh -n cpld_decomp_p8. This test will PASS
  3. Modify ufs-weather-model/tests/tests/cpld_decomp_p8:

    diff --git a/tests/tests/cpld_decomp_p8 b/tests/tests/cpld_decomp_p8
    index cbf1b68f..bbbde45b 100644
    --- a/tests/tests/cpld_decomp_p8
    +++ b/tests/tests/cpld_decomp_p8
    @@ -66,8 +66,10 @@ export RESTART_INTERVAL="${RESTART_N} -1"
    
     export TASKS=$TASKS_cpl_dcmp
     export TPN=$TPN_cpl_dcmp
    -export INPES=$INPES_cpl_dcmp
    -export JNPES=$JNPES_cpl_dcmp
    +#export INPES=$INPES_cpl_dcmp
    +#export JNPES=$JNPES_cpl_dcmp
    +export INPES=8
    +export JNPES=3
     export THRD=$THRD_cpl_dcmp
     export WRTTASK_PER_GROUP=$WPG_cpl_dcmp
  4. Repeat step 2. This test will FAIL.

For a comparison,

  1. Check out the previous commit 38aa634 of the ufs weather model
  2. Repeat steps 2, 3, and 4 above. Both tests will PASS.
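
The layouts tried in this thread (3x8, 4x6, 6x4, 8x3) all keep INPES x JNPES = 24 while TASKS is left unchanged. As a minimal sketch, assuming that product has to stay fixed when only INPES and JNPES are edited, the candidate layouts can be enumerated as below. The helper name `candidate_layouts` is hypothetical and is not part of rt.sh.

```python
def candidate_layouts(ranks_per_tile):
    """Return every (INPES, JNPES) pair whose product equals ranks_per_tile."""
    return [
        (inpes, ranks_per_tile // inpes)
        for inpes in range(1, ranks_per_tile + 1)
        if ranks_per_tile % inpes == 0
    ]


# 24 is inferred from the layouts quoted in this thread; treating the product
# as fixed is an assumption, not something rt.sh is known to enforce.
layouts = candidate_layouts(24)
for inpes, jnpes in layouts:
    print(f"export INPES={inpes}; export JNPES={jnpes}")
```

Each printed pair could replace the two `export` lines added in step 3.
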
junwang-noaa commented 2 years ago

@MinsukJi-NOAA Is this issue showing up in previous revisions?

MinsukJi-NOAA commented 2 years ago

> @MinsukJi-NOAA Is this issue showing up in previous revisions?

The previous version 38aa634 does not have this issue.

junwang-noaa commented 2 years ago

Since the physics suite is changed and there is no dycore update in 9b6b740, it might be related to physics.

@yangfanglin @JessicaMeixner-NOAA @DeniseWorthen FYI.

yangfanglin commented 2 years ago

Is there any uncoupled atmos decomposition test ? Is it working ?

JessicaMeixner-NOAA commented 2 years ago

Yes, there is an uncoupled decomposition test. I'm running it with the 8,3 combination now.

JessicaMeixner-NOAA commented 2 years ago

The standalone atm decomposition test w/the 8,3 combination passed.

JessicaMeixner-NOAA commented 2 years ago

So 3,8 is what the baseline is created with. 4,6 also gives the same answer, but 8,3 does not. It's my understanding that the only guarantee is for repro mode, so are we compiling in repro mode? I don't see any instructions on that here: https://github.com/ufs-community/ufs-weather-model/wiki/Building-model

I believe @DeniseWorthen's usual suggestion for debugging something is to write out and check the mediator history files, so I believe for that we follow the instructions here: https://github.com/ufs-community/ufs-weather-model/wiki/Advanced-Topics#using-the-cmeps-mediator-to-understand-the-coupling-fields-under-construction correct?

MinsukJi-NOAA commented 2 years ago

@JessicaMeixner-NOAA I believe repro mode compilation can be done with -DREPRO=ON

DeniseWorthen commented 2 years ago

UFS discussion 934 contains the instructions for writing the mediator history files. In this case, you could add the following to nems.configure:

      history_n_atm_inst = 1
      history_option_atm_inst = nsteps
      history_n_ice_inst = 1
      history_option_ice_inst = nsteps
      history_n_ocn_inst = 1
      history_option_ocn_inst = nsteps
      history_tile_atm = 96

This will write the ATM mediator history as a single file containing all 6 tiles on every pass through the coupling loop. ICE and OCN will get their own history files.

JessicaMeixner-NOAA commented 2 years ago

Thanks @MinsukJi-NOAA and @DeniseWorthen

junwang-noaa commented 2 years ago

@JessicaMeixner-NOAA To clarify: 1) the cpld_control_p8 and cpld_decomp_p8 RTs have decompositions 4x6 and 3x8; 2) the ORT test runs cpld_control_p8 with 8x3. Both have run in PROD mode for some time. This was working in previous PRs; the 8x3 setting stopped working as of PR#1071.

DeniseWorthen commented 2 years ago

@JessicaMeixner-NOAA If you can point me to a run directory containing cpld_control_p8, I can start by looking at the mediator history files.

JessicaMeixner-NOAA commented 2 years ago

I just started a run with the extra outputs, they will be here: /scratch1/NCEPDEV/stmp2/Jessica.Meixner/FV3_RT/rt_13823

DeniseWorthen commented 2 years ago

I copied your run directory and made a sandbox. I used it to create 3 run directories:

    /scratch1/NCEPDEV/stmp2/Denise.Worthen/decomp96/decomp38
    /scratch1/NCEPDEV/stmp2/Denise.Worthen/decomp96/decomp83
    /scratch1/NCEPDEV/stmp2/Denise.Worthen/decomp96/decomp46

where each varies only in the input.nml layout variable. I set fhmax=2 in model_configure.

The decomp38 and decomp46 directories are b4b after 2 hours. The decomp38 and decomp83 directories differ on the 2nd coupling step in the coupling fields sent by ATM (rain, snow, shum, tbot, height). The differing values are randomly scattered on each tile---they are not associated w/ land fraction for example, which I've seen before. A diff file at the second timestep is in the run directory (atm.diff.23040.nc).

    60: RMS atmImp_Faxa_rain                 4.2134E-19            NORMALIZED  1.1768E-14
    76: RMS atmImp_Faxa_snow                 1.0935E-20            NORMALIZED  2.7293E-15
    148: RMS atmImp_Sa_shum                   6.0702E-19            NORMALIZED  6.7205E-17
    157: RMS atmImp_Sa_tbot                   2.5005E-15            NORMALIZED  8.7086E-18
    180: RMS atmImp_Sa_z                      8.8095E-17            NORMALIZED  8.4206E-18
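
For readers unfamiliar with the RMS / NORMALIZED columns above, here is a minimal sketch of how such numbers can be computed for two versions of a field. The normalization used here (dividing by the mean absolute value of the reference field) is an assumption made for illustration; the comparison tool that produced the output above may define it differently. The function names are hypothetical.

```python
import math


def rms_diff(reference, candidate):
    """Root-mean-square of the pointwise differences between two fields."""
    squared = [(r - c) ** 2 for r, c in zip(reference, candidate)]
    return math.sqrt(sum(squared) / len(squared))


def normalized_rms_diff(reference, candidate):
    """RMS difference divided by the mean absolute reference value.

    The choice of scale is an assumption, not the documented behavior of
    any particular comparison tool.
    """
    scale = sum(abs(r) for r in reference) / len(reference)
    rms = rms_diff(reference, candidate)
    if scale == 0.0:
        return 0.0 if rms == 0.0 else math.inf
    return rms / scale


# Two tiny made-up fields that differ in one point.
field_a = [2.0, 2.0]
field_b = [2.0, 4.0]
print(rms_diff(field_a, field_b), normalized_rms_diff(field_a, field_b))
```

A bit-for-bit identical pair of fields gives exactly zero for both quantities, which is the criterion the decomposition test relies on.
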
JessicaMeixner-NOAA commented 2 years ago

Repro mode did not help: /scratch1/NCEPDEV/stmp2/Jessica.Meixner/FV3_RT/rt_10951 (comparing the output between the two directories, there are many diffs starting at fhr001).

pjpegion commented 2 years ago

I did some runs saving every time-step and different configurations.
Decompositions 4x6 and 6x4 both pass the regression test; it fails only when switching to 3x8. The difference in meteorology first appears in precipitation 30 minutes into the run on tile 1, but I see differences in the first time-step in several tracers related to aerosols (hydrophobic black carbon mixing ratio, so2 mixing ratio, sulfate mixing ratio, etc.). This is something I notice that is different between the 2 runs:

for 4x6:

    Aspect Ratio : min: 0.10000000000000E+01 max: 0.10654828722430E+01 avg: 0.61158324642517E+01

and 3x8:

    Aspect Ratio : min: 0.10000000000000E+01 max: 0.10654828722430E+01 avg: 0.61158324642518E+01

I then compiled with REPRO=ON, and running with 3x8 and 4x6 gets the same answers, but these answers are different from the regression test, which I expect since the model was optimized differently. And in the REPRO cases, the aspect ratio average is the same as the 3x8 case above.
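
A last-digit difference in an average, like the one in the aspect ratio output, is the kind of change that a different summation order can produce, because floating-point addition is not associative. The sketch below illustrates only that general effect; whether it is what happens inside the model is not established in this thread. The function names are hypothetical.

```python
def running_sum(values):
    """Add values left to right with plain float addition (no compensation)."""
    total = 0.0
    for value in values:
        total += value
    return total


def two_block_sum(values, split):
    """Sum two contiguous blocks separately, then combine the partial sums.

    This mimics a reduction over two subdomains whose boundary is at `split`.
    """
    return running_sum(values[:split]) + running_sum(values[split:])


values = [0.1, 0.2, 0.3]
split_after_one = two_block_sum(values, 1)  # 0.1 + (0.2 + 0.3)
split_after_two = two_block_sum(values, 2)  # (0.1 + 0.2) + 0.3
print(split_after_one, split_after_two, split_after_one == split_after_two)
```

The two results differ only in the last bit, yet a bit-for-bit comparison treats them as different answers.
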

pjpegion commented 2 years ago

I also changed cplchm=F and turned off the aerosols by changing the nems.configure and field_table, and 4x6 and 3x8 give the same results with the original RT executable.

yangfanglin commented 2 years ago

Phil, Thanks. This is interesting. I believe Jessica was also testing the coupled model without gocart aerosols. Even though there is no feedback between gocart aerosols and the met, the met tracers might have been affected.

Is Raffaele's tracer fix (related to Thompson MP) included in these tests ?

pjpegion commented 2 years ago

@yangfanglin I don't know. @rmontuoro?

yangfanglin commented 2 years ago

I am running a test now on WCOSS. I checked out the latest ufs-weather-model and pointed fv3atm to https://github.com/rmontuoro/fv3atm/tree/bugfix/thompson-tracer-index. I am only testing cpld_decomp_p8 in rt.sh. Is this sufficient ? Anyhow, I will report back when the test is done.

yangfanglin commented 2 years ago

My RT returns "+ echo REGRESSION TEST WAS SUCCESSFUL". Does this mean the decomposition bug is also fixed with Raffaele's tracer bug fix ? Should I change the layout manually to test different configurations ?

pjpegion commented 2 years ago

@yangfanglin follow the steps at the top of this thread to try a different processor layout

yangfanglin commented 2 years ago

@pjpegion Got it. Running step 3 now.

JessicaMeixner-NOAA commented 2 years ago

@yangfanglin I have run with @rmontuoro's PR changes and they solve some of the decomp issues, but there's still something going on, as the results do not completely reproduce. To figure out what is going on, I'm going to try to run this test w/aerosols but with the older physics options, to see if it again points to an interaction of physics/aerosols that is not otherwise being seen.

yangfanglin commented 2 years ago

@JessicaMeixner-NOAA Was your coupled model test without gocart aerosol successful ?

JessicaMeixner-NOAA commented 2 years ago

I haven't tried with the very top of develop, but with the 9b6b740 commit, I get results consistent with @pjpegion's: it worked. Trying to figure out why, since the updated code from @rmontuoro seems to resolve some of these issues while leaving aerosols on.

yangfanglin commented 2 years ago

My test with 8x3 decomposition did not reproduce using the latest ufs-weather-model develop branch and Raffaele's fv3-atm.

yangfanglin commented 2 years ago

Please see https://docs.google.com/document/d/11vo2-DyrR2LWbQoqprlVoTxhSEnpvblWPw30WqOl6uA/edit for a record of changes made to the fv3-atm and ccpp repos after March 4 and before March 10, when Minsuk first reported the decomposition failure. Can we reverse the "fix 2phases intermediate restart" update and see if the decomposition RT works ?

JessicaMeixner-NOAA commented 2 years ago

@yangfanglin I will try that and report back.

JessicaMeixner-NOAA commented 2 years ago

I tried the top of develop, but reverted FV3/module_fcst_grid_comp.F90 back to the code without the "fix 2phases intermediate restart" update, and the issue remains.

Code: /scratch2/NCEPDEV/climate/Jessica.Meixner/p8b/ufs-revertrestart
regtest output: /scratch1/NCEPDEV/stmp2/Jessica.Meixner/FV3_RT/rt_23026

junwang-noaa commented 2 years ago

@JessicaMeixner-NOAA Thanks for running the test and confirming the results. The updates of "fix 2phases intermediate restart" is to fix when to write out restart files, it's expected not to change the forecast results but it will allow restart files to be written out at correct forecast time.

JessicaMeixner-NOAA commented 2 years ago

> @JessicaMeixner-NOAA Thanks for running the test and confirming the results. The updates of "fix 2phases intermediate restart" is to fix when to write out restart files, it's expected not to change the forecast results but it will allow restart files to be written out at correct forecast time.

I did not think it would change results either, but always good to confirm. I'm out of ideas of things to try at the moment.

JessicaMeixner-NOAA commented 2 years ago

I made a few tests with different physics options as of late last week:

With the PR 1118 fix, work dir and output (on orion):

    /work/noaa/marine/jmeixner/p8b/ufs-bugfix/
    /work/noaa/stmp/jmeixner/stmp/jmeixner/FV3_RT/rt_161096

Here's the develop branch output:

    /work/noaa/marine/jmeixner/p8b/ufs-weather-model
    /work/noaa/stmp/jmeixner/stmp/jmeixner/FV3_RT/rt_163701

There are "p7" directories and "p8" directories. The "p7" directories have older physics options; the "p8" directories have the p8b settings. I'm seeing some diffs in the p7 runs with different decomps (diffs in mixing ratios?). Not sure what that means. @rmontuoro has looked at the set-up on the aerosol side to make sure it looked okay (which was a concern of mine when I ran the tests).

Additionally, I also just ran the PR1118 fix on orion for the control_atm_aerosols test and made a "control_atm_aerosols_decom" test that switches 3,8 to 8,3, and that also gives differences, so perhaps we can simplify the investigations to atm+aero configurations? Might it be worth making a version of the atm+aero test with p8b settings as well?

atm-aero code: /work/noaa/marine/jmeixner/p8b/ufs-bugfix02
regtest output: /work/noaa/stmp/jmeixner/stmp/jmeixner/FV3_RT/rt_431926

yangfanglin commented 2 years ago

@ChunxiZhang-NOAA @JongilHan66
Could you please help by doing the following ?

  1. Repeat the four steps described at the top of this GitHub issue using the latest ufs-weather-model repo.
  2. Modify the model source code to apply PBL mixing and convective transport only to the physics tracers (excluding gocart aerosol tracers), then repeat the four steps.

ChunxiZhang-NOAA commented 2 years ago

@yangfanglin Ok, I will do the tests ASAP.

SMoorthi-emc commented 2 years ago

I made 3 runs from my branch of UFS, with 4x6, 3x8 and 8x3 combinations. The first two produced identical "atmf" and "sfcf" files at 24 hours. The 8x3 combination produced identical "sfcf" files, but produced differences in "atmf" files, apparently in chemistry variables only. For example:

    nccmp -dgqSfs atmf024.tile1.nc /gpfs/dell2/ptmp/Shrinivas.Moorthi/FV3_RT/rt_15731/cpld_decomp_p8/atmf024.tile1.nc

    Variable  Group  Count   Sum           AbsSum       Min           Max          Range        Mean          StdDev
    bc1       /      279394  4.67817e-05   0.000107683  -7.7486e-07   5.66244e-07  1.3411e-06   1.6744e-10    5.25382e-09
    bc2       /      368465  2.99039e-05   6.60763e-05  -1.93715e-07  4.76837e-07  6.70552e-07  8.1158e-11    3.03227e-09
    nh3       /      311299  -0.00131405   0.0185233    -0.000593299  0.000382695  0.000975994  -4.22119e-09  2.76725e-06
    nh4a      /      393495  0.0012206     0.0175854    -0.000355564  0.000326842  0.000682406  3.10194e-09   1.79046e-06
    no3an1    /      404284  0.00383461    0.0588822    -0.00122151   0.00112325   0.00234476   9.48495e-09   6.03496e-06
    no3an2    /      425826  -1.17921e-05  0.00126032   -4.84008e-05  2.47257e-05  7.31265e-05  -2.76923e-11  1.69026e-07
    no3an3    /      407264  -1.14574e-06  1.64385e-05  -7.15976e-07  2.88986e-07  1.00496e-06  -2.81325e-12  2.11327e-09
    oc1       /      279487  0.00018271    0.000417272  -2.98023e-06  4.05312e-06  7.03335e-06  6.53734e-10   2.57745e-08
    oc2       /      337295  2.37357e-05   0.00121744   -1.07475e-05  5.65127e-06  1.63987e-05  7.03706e-11   6.30217e-08
    pm10      /      340961  0.032615      0.0730636    -0.000365973  0.000364363  0.000730336  9.56561e-08   2.05613e-06
    pm25      /      342344  0.00822349    0.0313517    -0.000371099  0.000365481  0.000736579  2.40211e-08   1.95818e-06
    seas1     /      393562  1.88265e-05   4.05865e-05  -6.35628e-08  4.85452e-08  1.12108e-07  4.78363e-11   8.01938e-10
    seas2     /      393369  0.000590602   0.0010174    -2.38419e-07  4.76837e-07  7.15256e-07  1.50139e-09   9.44201e-09
    seas3     /      403463  0.0062014     0.0115045    -0.000112057  3.98159e-05  0.000151873  1.53704e-08   2.54474e-07
    seas4     /      398696  0.0210636     0.0384857    -1.12057e-05  1.90735e-05  3.02792e-05  5.28312e-08   3.79756e-07
    seas5     /      359134  0.00341279    0.00741849   -1.90735e-06  6.67572e-06  8.58307e-06  9.50283e-09   9.09196e-08
    so2       /      321778  2.95863e-07   5.6456e-07   -1.86265e-09  1.49012e-08  1.67638e-08  9.19463e-13   3.23066e-11
    so4       /      368191  0.000122484   0.000618262  -9.53674e-07  9.53674e-07  1.90735e-06  3.32663e-10   1.06459e-08

Moorthi

rmontuoro commented 2 years ago

It turns out this issue is caused by the CMake-generated compiler flags:

-g -traceback -fpp -fno-alias -auto -safe-cray-ptr -ftz -assume byterecl -nowarn -sox -align array64byte -qno-opt-dynamic-align -real-size 64 -O2 -debug minimal -fp-model source -qoverride-limits -qopt-prefetch=3 -no-prec-div -no-prec-sqrt -march=core-avx2 -O2 -fPIC

The culprits are likely to be options -no-prec-div and -no-prec-sqrt.

Results obtained building UFS-Aerosols (1ff6389) with either one of the Fortran compiler settings below are reproducible across all tested decompositions (8x3, 3x8, 4x6) for the fully coupled regression tests:

These results have been independently verified by @JessicaMeixner-NOAA.

arunchawla-NOAA commented 2 years ago

Does anyone have an idea why -no-prec-div and -no-prec-sqrt are causing this problem ?

climbfuji commented 2 years ago

These flags came from the old NEMSfv3gfs and were meant to speed up the code when using double precision (they are not used when the dycore is built with 32bit, see https://github.com/ufs-community/ufs-weather-model/blob/1bd68cab708af76e4f9479ab6300990861aa24a2/cmake/Intel.cmake#L24). As the name says, they reduce the precision for divisions and square roots. @DusanJovic-NOAA and I had thought about removing them in the past.

One would need to do timing comparisons to see what difference it makes in the runtimes for double precision builds. For single precision, it won't matter.
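
To illustrate what "reduced precision for divisions" can mean in practice: one optimization a compiler may apply when precise division is not required is to replace `x / y` with `x * (1 / y)`. The sketch below shows, in ordinary IEEE double precision, that the two forms do not always agree in the last bit. It is an illustration of the arithmetic only. It does not show what the Intel compiler actually generated for this code, and the thread does not establish why the effect would depend on the decomposition. The function name is hypothetical.

```python
def reciprocal_mismatches(limit):
    """Return integers n in [1, limit) where n * (1/n) differs from n / n."""
    mismatches = []
    for n in range(1, limit):
        x = float(n)
        exact_division = x / x            # always exactly 1.0
        via_reciprocal = x * (1.0 / x)    # may round differently
        if via_reciprocal != exact_division:
            mismatches.append(n)
    return mismatches


mismatches = reciprocal_mismatches(100)
print(mismatches)
```

Any value in the printed list is a case where the "fast" form changes the result, which is enough to break a bit-for-bit comparison even though the numerical error is tiny.
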

SMoorthi-emc commented 2 years ago

Just to be clear, my runs were made with 32 bit dynamics! Moorthi

junwang-noaa commented 2 years ago

@rmontuoro I am a little confused. So we have the following compile options in REPRO mode in ufs-weather-model,

-g -traceback -fpp -fno-alias -auto -safe-cray-ptr -ftz -assume byterecl -nowarn -sox -align array64byte -qno-opt-dynamic-align -O2 -debug minimal -fp-model consistent -qoverride-limits

My understanding is that the GOCART will take those compile options if the model is built in REPRO mode. But here are Jessica's comments on running the test with REPRO mode to build the full model: "Repro mode did not help: /scratch1/NCEPDEV/stmp2/Jessica.Meixner/FV3_RT/rt_10951 (comparing the output between the two directories, there are many diffs starting at fhr001)."

Now with your second option (the last three are the ufs-weather-model PROD options), if we just build the GOCART with the following options and the rest of code still built with PROD mode, the model results now reproduce?

-g -traceback -fpp -fno-alias -auto -safe-cray-ptr -ftz -assume byterecl -nowarn -sox -align array64byte -qno-opt-dynamic-align -O2 -debug minimal -fp-model source -qoverride-limits -qopt-prefetch=3 -fPIC

arunchawla-NOAA commented 2 years ago

Ignoring the -fPIC and the -qopt flags, the difference between the REPRO mode flags and the flags specified by Raffaele is -fp-model consistent vs -fp-model source.

A Google search on that option says that it determines the semantics of floating point calculations. More details here:

https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/compiler-options/compiler-option-details/floating-point-options/fp-model-fp.html

Could this be causing the problem ? Can anyone confirm if changing -fp-model option to source is providing bit identical results ?
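
The comparison of flag lists can be checked mechanically. The sketch below parses the three flag strings quoted in this thread (the CMake-generated PROD flags, the REPRO flags, and the second option attributed to Raffaele) and prints the set differences. The helper `parse_flags` is hypothetical, and its rule for attaching values such as `byterecl` to the preceding flag is an assumption that happens to fit these particular strings.

```python
def parse_flags(flag_string):
    """Split a flag string into options, attaching bare values to the previous flag."""
    options = []
    for token in flag_string.split():
        if token.startswith("-") or not options:
            options.append(token)
        else:
            options[-1] = f"{options[-1]} {token}"
    return set(options)


COMMON = (
    "-g -traceback -fpp -fno-alias -auto -safe-cray-ptr -ftz -assume byterecl "
    "-nowarn -sox -align array64byte -qno-opt-dynamic-align "
)
# Flag strings copied from the comments in this thread.
PROD = parse_flags(
    COMMON + "-real-size 64 -O2 -debug minimal -fp-model source -qoverride-limits "
    "-qopt-prefetch=3 -no-prec-div -no-prec-sqrt -march=core-avx2 -O2 -fPIC"
)
REPRO = parse_flags(
    COMMON + "-O2 -debug minimal -fp-model consistent -qoverride-limits"
)
PROPOSED = parse_flags(
    COMMON + "-O2 -debug minimal -fp-model source -qoverride-limits "
    "-qopt-prefetch=3 -fPIC"
)

print("REPRO only:   ", sorted(REPRO - PROPOSED))
print("PROPOSED only:", sorted(PROPOSED - REPRO))
print("PROD only:    ", sorted(PROD - PROPOSED))
```

Besides the -fp-model difference against REPRO, this also lists which PROD flags are absent from the proposed set.
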

SMoorthi-emc commented 2 years ago

Sorry, My runs were with 64 bit dynamics although my branch uses mixed mode FMS as the runs were made with regression test. Please ignore my last comment.

DusanJovic-NOAA commented 2 years ago

Someone should look at the wall clock time of the full production resolution configuration using the default options (PROD) and the REPRO options. If the difference is on the order of a few percent, I suggest we simply remove those 'prod' options and always use REPRO, making it the default and removing that option. After all, in production we also need full bit-by-bit reproducibility for restart runs, different decompositions, etc. In which scenario are we willing to sacrifice full reproducibility for one or two percent faster execution? If the difference is on the order of, I don't know, 7-10%, then that's a different story.

arunchawla-NOAA commented 2 years ago

Can we confirm that the ORT passes using the REPRO mode flags ? Or confirm what the flags should be for REPRO mode to get bit reproducibility ?

JessicaMeixner-NOAA commented 2 years ago

Should the (8,3) decomposition test also be added as a normal regression test?

arunchawla-NOAA commented 2 years ago

If it is already in the ORT, can we just run that ?

arunchawla-NOAA commented 2 years ago

@JessicaMeixner-NOAA is the ORT passing using the flags Raffaele mentioned ?

JessicaMeixner-NOAA commented 2 years ago

I have not run the ORT yet, but will try to do that now. I personally find it easier to create new regression tests over running the ORT.

arunchawla-NOAA commented 2 years ago

The ORT is a requirement for PRs, so I would like confirmation that this works.