ufs-community / ufs-weather-model

UFS Weather Model
Other
139 stars 247 forks source link

Decomposition test failure #1103

Closed MinsukJi-NOAA closed 2 years ago

MinsukJi-NOAA commented 2 years ago

Description

cpld_decomp_p8 fails with a different domain decomposition

To Reproduce:

What compilers/machines are you seeing this with? Intel/Hera Give explicit steps to reproduce the behavior.

  1. Check out the latest ufs weather model (9b6b740)
  2. cd ufs-weather-model/tests; ./rt.sh -n cpld_decomp_p8. This test will PASS
  3. Modify ufs-weather-model/tests/tests/cpld_decomp_p8:

    diff --git a/tests/tests/cpld_decomp_p8 b/tests/tests/cpld_decomp_p8
    index cbf1b68f..bbbde45b 100644
    --- a/tests/tests/cpld_decomp_p8
    +++ b/tests/tests/cpld_decomp_p8
    @@ -66,8 +66,10 @@ export RESTART_INTERVAL="${RESTART_N} -1"
    
    export TASKS=$TASKS_cpl_dcmp
    export TPN=$TPN_cpl_dcmp
    -export INPES=$INPES_cpl_dcmp
    -export JNPES=$JNPES_cpl_dcmp
    +#export INPES=$INPES_cpl_dcmp
    +#export JNPES=$JNPES_cpl_dcmp
    +export INPES=8
    +export JNPES=3
    export THRD=$THRD_cpl_dcmp
    export WRTTASK_PER_GROUP=$WPG_cpl_dcmp
  4. Repeat step 2. This test will FAIL.

For a comparison,

  1. Check out the previous commit 38aa634 of the ufs weather model
  2. Repeat steps 2, 3, and 4 above. Both tests will PASS.
JessicaMeixner-NOAA commented 2 years ago

It's running now I will post as soon as I have the answer to your question. I'm using the compiler flags @rmontuoro gave me, I'm not sure if that's will be the final decision for the flags or not. I will run the ORT and regression tests again once I know what the final flags are.

JessicaMeixner-NOAA commented 2 years ago

The ORT failed:

/scratch2/NCEPDEV/climate/Jessica.Meixner/p8b/ufs-decomp-03/tests$ ./opnReqTest -n cpld_control_c96_p8 
hecflow01
ECF_HOST = hecflow01, ECF_PORT = 26510
test name: cpld_control_c96_p8
cases to compile: std bit dbg
compiling std with compile option -DAPP=S2SWA -DCCPP_SUITES=FV3_GFS_v16_coupled_nsstNoahmpUGWPv1,FV3_GFS_v16_coupled_p7_rrtmgp,FV3_GFS_v17_coupled_p8
done compiling std
compiling bit with compile option -DAPP=S2SWA -DCCPP_SUITES=FV3_GFS_v16_coupled_nsstNoahmpUGWPv1,FV3_GFS_v16_coupled_p7_rrtmgp,FV3_GFS_v17_coupled_p8 -D32BIT=ON
Died with error code 2

will look after the meeting as to why.

JessicaMeixner-NOAA commented 2 years ago

Okay so this failed because we shouldn't be trying to compile the coupled model with -D32BIT=ON (yet).

I'll resubmit specifying which tests and report back how the tests go.

JessicaMeixner-NOAA commented 2 years ago

Next error:

compiling dbg with compile option -DAPP=S2SWA -DCCPP_SUITES=FV3_GFS_v16_coupled_nsstNoahmpUGWPv1,FV3_GFS_v16_coupled_p7_rrtmgp,FV3_GFS_v17_coupled_p8 -DDEBUG=ON
Died with error code 2

Debug build fails because of waves I'm assuming. I thought we could at least build though, so I'll look into debugging that issue, in the meantime:

Running now with the following options: ./opnReqTest -n cpld_control_p8 -c thr,mpi,dcp,rst

Please let me know if I should be testing/compiling with different compiler flags or using different options in the ORT test.

arunchawla-NOAA commented 2 years ago

Can we run the opnreqtest with the flags that @rmontuoro identified ?

DeniseWorthen commented 2 years ago

I think your debug failure for S2SWA requires the fix that Kyle provided here.

JessicaMeixner-NOAA commented 2 years ago

Thanks @DeniseWorthen I agree.
@arunchawla-NOAA I am running with the flags that @rmontuoro identified. It's unclear to me if everyone agrees that's the best/final option.

Next ORT failure:

./opnReqTest -n cpld_control_p8 -c thr,mpi,dcp,rst
hecflow01
ECF_HOST = hecflow01, ECF_PORT = 26510
test name: cpld_control_p8
cases to compile: std
compiling std with compile option -DAPP=S2SWA -DCCPP_SUITES=FV3_GFS_v16_coupled_nsstNoahmpUGWPv1,FV3_GFS_v16_coupled_p7_rrtmgp,FV3_GFS_v17_coupled_p8
done compiling std
cases to run: std_base thr mpi dcp rst
Running test for std_base with
    THRD: 1; INPES: 3; JNPES: 8; TASKS: 200; TPN: 40
Running test for thr with
    THRD: 2; INPES: 3; JNPES: 4; TASKS: 120; TPN: 20
Coupled application not yet implemented for mpi
Died with error code 1

Where can I find when test and what options are expected to work for the ORT tests? I don't see anything here: https://ufs-weather-model.readthedocs.io/en/latest/BuildingAndRunning.html#using-the-operational-requirement-test-script

arunchawla-NOAA commented 2 years ago

For now I would like to know if these flags do work. We have reached out to GDIT about appropriate flags. @MinsukJi-NOAA can you address the questions that @JessicaMeixner-NOAA has raised ?

JessicaMeixner-NOAA commented 2 years ago

@arunchawla-NOAA in terms of knowing if the flags worked, I did the following: I created an extra "cpld_decomp_p8_2" test which uses the 8,3 decomposition. I created a baseline (for the non-debug S2SAW tests) and then ran the regression tests against that baseline and the tests were successful. I did not test the debug test, because in previous tests, it gave the same answers whereas the non-debug tests gave different ones. This test was on hera. On orion, I ran the cpld_decomp_p8 set up with the 3,8 8,3 and 4,6 decompositions and did diffs of the directories and also got reproducibility. In the meantime, I'll keep working on the ORT tests, to provide further evidence that the tests are working.

JessicaMeixner-NOAA commented 2 years ago

The following test passed: ./opnReqTest -n cpld_control_c96_p8 -c thr,dcp,rst with the following compiler flags

GOCART$ git diff CMakeLists.txt
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 12d5a32..4371571 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -33,7 +33,9 @@ if (UFS_GOCART)
 # Ensure we build as 32-bit
   message ("Force 32-bit build for GOCART")
   if (CMAKE_Fortran_COMPILER_ID MATCHES "Intel")
-    string (REPLACE "-real-size 64" "" CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS}")
+#   string (REPLACE "-real-size 64" "" CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS}")
+    set (CMAKE_Fortran_FLAGS "-g -traceback -fp-model source -ftz -align array64byte -qno-opt-dynamic-align")
+#   set (CMAKE_Fortran_FLAGS "-g -traceback -fpp -fno-alias -auto -safe-cray-ptr -ftz -assume byterecl -nowarn -sox -align array64byte -qno-opt-dynamic-align -O2 -debug minimal -fp-model source -qoverride-limits -qopt-prefetch=3 -prec-div -prec-sqrt -mp1 -O2 -fPIC")
   elseif (CMAKE_Fortran_COMPILER_ID MATCHES "GNU")
     string (REPLACE "-fdefault-real-8" "" CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS}")
   endif()

I'll work on making the debug test pass as well as any other desired combination/test.

arunchawla-NOAA commented 2 years ago

so if we were to replace fp-model flag with source instead of consistent in repro mode the ORT test would pass ? (instead of having to create these seperate flags just for UFS GOCART

JessicaMeixner-NOAA commented 2 years ago

This is the original code I tried: https://github.com/JessicaMeixner-NOAA/ufs-weather-model/tree/bug/decompissue which I have also now confirmed with the fix @DeniseWorthen pointed me to dbg also works.

I'm now trying the suggested update @arunchawla-NOAA is curious about with repro mode and replacing fp-model flag with source instead of consistent. The code is here: https://github.com/JessicaMeixner-NOAA/ufs-weather-model/tree/bug/decompissuerepro and I'm trying the following test: ./opnReqTest -n cpld_control_c96_p8 -c std,thr,dcp,rst,dbg,fhz

I'll report back when I have more information.

JessicaMeixner-NOAA commented 2 years ago

Here's my output of the ORT with turning on repro mode:

$ ./opnReqTest -n cpld_control_c96_p8 -c std,thr,dcp,rst,dbg,fhz
hfe10
ECF_HOST = hecflow01, ECF_PORT = 26510
test name: cpld_control_c96_p8
cases to compile: std dbg
compiling std with compile option -DAPP=S2SWA -DREPRO=ON -DCCPP_SUITES=FV3_GFS_v16_coupled_nsstNoahmpUGWPv1,FV3_GFS_v16_coupled_p7_rrtmgp,FV3_GFS_v17_coupled_p8
done compiling std
compiling dbg with compile option -DAPP=S2SWA -DREPRO=ON -DCCPP_SUITES=FV3_GFS_v16_coupled_nsstNoahmpUGWPv1,FV3_GFS_v16_coupled_p7_rrtmgp,FV3_GFS_v17_coupled_p8 -DDEBUG=ON
done compiling dbg
cases to run: std_base thr dcp rst dbg_base fhz
Running test for std_base with
    THRD: 1; INPES: 3; JNPES: 8; TASKS: 192; TPN: 40
Running test for thr with
    THRD: 2; INPES: 3; JNPES: 4; TASKS: 120; TPN: 20
Running test for dcp with
    THRD: 1; INPES: 8; JNPES: 3; TASKS: 192; TPN: 40
Running test for rst with
    THRD: 1; INPES: 3; JNPES: 8; TASKS: 192; TPN: 40
Running test for dbg_base with
    THRD: 1; INPES: 3; JNPES: 8; TASKS: 192; TPN: 40
Died with error code 1

So the debug test failed, but perhaps I should be building with both -DREPRO=ON and -DDEBUG=ON, but it looks to me like all the other tests passed although the log file didn't get created.

rmontuoro commented 2 years ago

I was able to successfully run the operational requirement test for cpld_control_c96_p8 on Hera following the steps below:

  1. Checkout ufs-weather-model rev. 1bd68cab
  2. Add @kgerheiser's fix to WW3/CMakeLists.txt
  3. Append -DREPRO=ON to all options in the ORT script:

    diff --git a/tests/opnReqTest b/tests/opnReqTest
    index e6aa93e0..802e98e7 100755
    --- a/tests/opnReqTest
    +++ b/tests/opnReqTest
    @@ -102,6 +102,7 @@ build_opnReqTests() {
         ;;
     esac
     MAKE_OPT=$(echo $MAKE_OPT | sed -e 's/^ *//' -e 's/ *$//')
    +    MAKE_OPT="${MAKE_OPT} -DREPRO=ON"
     export COMPILE_NR=${name}
    
       cat <<-EOF > ${RUNDIR_ROOT}/compile_${name}.env
  4. Run:
    ./opnReqTest -n cpld_control_c96_p8 -c std,thr,dcp,rst,dbg,fhz

    The original REPRO flags were used in this test. These include -fp-model consistent: https://github.com/ufs-community/ufs-weather-model/blob/1bd68cab708af76e4f9479ab6300990861aa24a2/cmake/Intel.cmake#L19-L23

junwang-noaa commented 2 years ago

@rmontuoro Thanks for confirming the ORT does work with REPRO mode. Can you build GOCART only with REPRO compile option and the rest of model in PROD mode to see if it still passes?

JessicaMeixner-NOAA commented 2 years ago

By removing -no-prec-div and -no-prec-sqrt in the prod option for non 32 bit, (see: https://github.com/JessicaMeixner-NOAA/ufs-weather-model/tree/bug/twoflags) the ORT test: ./opnReqTest -n cpld_control_c96_p8 -c thr,dcp,rst,dbg,fhz passes. If this does not affect timing, would this be an acceptable solution which would not require REPRO mode for any or all of the components? I saw in other PRs that effect on timing seemed to be judged by timing of regression tests. Would that be sufficient in this case or will other timing tests be required? (FYI: @arunchawla-NOAA @rmontuoro )

arunchawla-NOAA commented 2 years ago

I would think this would be a better solution than having conditional builds with different flags for different components. @junwang-noaa and @DusanJovic-NOAA what is the best way to assess the impact of this change on the timing of the runs. Since this change keeps the AVX options for GOCART which seems to be critical for speed based on what @climbfuji has stated, this should work right ?

junwang-noaa commented 2 years ago

I'd suggest running RT and maybe one 35 day benchmark test to confirm the timing is not impacted much.

rmontuoro commented 2 years ago

Preliminary tests performed on Hera using cpld_bmark_p8 returned the following timings based on 3 identical runs per case:

Case Run # Total wall time
current flags 01 1044.692037
current flags 02 1049.168395
current flags 03 1034.183660
Average 1042.68
Std. Dev. 7.69
remove -no-prec flags 01 1039.495619
remove -no-prec flags 02 1030.581493
remove -no-prec flags 03 1041.315700
Average 1037.13
Std. Dev. 5.74

Note that -no-prec flags were removed as:

diff --git a/cmake/Intel.cmake b/cmake/Intel.cmake
index a91bc30c..b71b172d 100644
--- a/cmake/Intel.cmake
+++ b/cmake/Intel.cmake
@@ -24,7 +24,7 @@ else()
     if(32BIT)
       set(CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS} -O2 -debug minimal -fp-model source -qoverride-limits -qopt-prefetch=3")
     else()
-      set(CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS} -O2 -debug minimal -fp-model source -qoverride-limits -qopt-prefetch=3 -no-prec-div -no-prec-sqrt")
+      set(CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS} -O2 -debug minimal -fp-model source -qoverride-limits -qopt-prefetch=3")
     endif()
     set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -O2 -debug minimal")
     set(FAST "-fast-transcendentals")

These flags are still used in GFDL_atmos_cubed_sphere: https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/blob/43f7ed39fcd302e8404a152011f7a02c5d76ddc9/cmake/compiler_flags_Intel_Fortran.cmake#L4

JessicaMeixner-NOAA commented 2 years ago

Okay, so I some how majorly goofed. So many apologies for this, but I misread or read what I wanted to see, but the removing just the two-flags at the top level does in fact not work for the decomposition tests. I kept re-running tests last night because they all of a sudden were not working and I kept thinking I had done something wrong, but then this morning I rechecked in the original log file which does in fact say the test failed: https://github.com/JessicaMeixner-NOAA/ufs-weather-model/blob/bug/twoflags/tests/OpnReqTests_cpld_control_c96_p8_hera.intel.log#L390-L394

I still think it's valuable information for what @rmontuoro ran and discovered, because if it holds that other configurations also do not have a boost from using these flags then perhaps they should be removed. However, it still appears we need to find a clean solution for the decomposition issue.

rmontuoro commented 2 years ago

Below are results from my latest ORT on Hera:

Test Case Result
dbg_base cpld_control_c96_p8 PASS
dcp cpld_control_c96_p8 PASS
fhz cpld_control_c96_p8 PASS
rst cpld_control_c96_p8 PASS
std_base cpld_control_c96_p8 PASS
thr cpld_control_c96_p8 PASS
MinsukJi-NOAA commented 2 years ago

Using the latest weather model (846f0e4, #1171), the decomposition test is successful: the cpld_decomp_p8 regression test works not only as it is (INPES=4 and JNPES=6), but also with INPES=8 and JNPES=3.

junwang-noaa commented 2 years ago

@MinsukJi-NOAA Thanks for testing. Since the problem is resolved, I will close the issue.

junwang-noaa commented 2 years ago

Sure, I will do some testing. But my understanding is that the changes are independent to decomposition, it is to fix the writing time for restart files in the two phases run.

On Sun, Mar 20, 2022 at 12:05 AM Fanglin Yang @.***> wrote:

Please see https://docs.google.com/document/d/11vo2-DyrR2LWbQoqprlVoTxhSEnpvblWPw30WqOl6uA/edit for a track of changes made to fv3-atm and ccpp repos after March 4 and before March 10 when Minsuk first reported the decomposition failure. Can we reverse the " fix 2phases intermediate restart" update and see if the decomposition RT works ?

— Reply to this email directly, view it on GitHub https://github.com/ufs-community/ufs-weather-model/issues/1103#issuecomment-1073162038, or unsubscribe https://github.com/notifications/unsubscribe-auth/AI7D6TMTFLFDU3VT6U3WDJDVA2PWXANCNFSM5QNZ5QIQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you commented.Message ID: @.***>