Closed MinsukJi-NOAA closed 2 years ago
It's running now I will post as soon as I have the answer to your question. I'm using the compiler flags @rmontuoro gave me, I'm not sure if that's will be the final decision for the flags or not. I will run the ORT and regression tests again once I know what the final flags are.
The ORT failed:
/scratch2/NCEPDEV/climate/Jessica.Meixner/p8b/ufs-decomp-03/tests$ ./opnReqTest -n cpld_control_c96_p8
hecflow01
ECF_HOST = hecflow01, ECF_PORT = 26510
test name: cpld_control_c96_p8
cases to compile: std bit dbg
compiling std with compile option -DAPP=S2SWA -DCCPP_SUITES=FV3_GFS_v16_coupled_nsstNoahmpUGWPv1,FV3_GFS_v16_coupled_p7_rrtmgp,FV3_GFS_v17_coupled_p8
done compiling std
compiling bit with compile option -DAPP=S2SWA -DCCPP_SUITES=FV3_GFS_v16_coupled_nsstNoahmpUGWPv1,FV3_GFS_v16_coupled_p7_rrtmgp,FV3_GFS_v17_coupled_p8 -D32BIT=ON
Died with error code 2
will look after the meeting as to why.
Okay so this failed because we shouldn't be trying to compile the coupled model with -D32BIT=ON (yet).
I'll resubmit specifying which tests and report back how the tests go.
Next error:
compiling dbg with compile option -DAPP=S2SWA -DCCPP_SUITES=FV3_GFS_v16_coupled_nsstNoahmpUGWPv1,FV3_GFS_v16_coupled_p7_rrtmgp,FV3_GFS_v17_coupled_p8 -DDEBUG=ON
Died with error code 2
Debug build fails because of waves I'm assuming. I thought we could at least build though, so I'll look into debugging that issue, in the meantime:
Running now with the following options:
./opnReqTest -n cpld_control_p8 -c thr,mpi,dcp,rst
Please let me know if I should be testing/compiling with different compiler flags or using different options in the ORT test.
Can we run the opnreqtest with the flags that @rmontuoro identified ?
I think your debug failure for S2SWA requires the fix that Kyle provided here.
Thanks @DeniseWorthen I agree.
@arunchawla-NOAA I am running with the flags that @rmontuoro identified. It's unclear to me if everyone agrees that's the best/final option.
Next ORT failure:
./opnReqTest -n cpld_control_p8 -c thr,mpi,dcp,rst
hecflow01
ECF_HOST = hecflow01, ECF_PORT = 26510
test name: cpld_control_p8
cases to compile: std
compiling std with compile option -DAPP=S2SWA -DCCPP_SUITES=FV3_GFS_v16_coupled_nsstNoahmpUGWPv1,FV3_GFS_v16_coupled_p7_rrtmgp,FV3_GFS_v17_coupled_p8
done compiling std
cases to run: std_base thr mpi dcp rst
Running test for std_base with
THRD: 1; INPES: 3; JNPES: 8; TASKS: 200; TPN: 40
Running test for thr with
THRD: 2; INPES: 3; JNPES: 4; TASKS: 120; TPN: 20
Coupled application not yet implemented for mpi
Died with error code 1
Where can I find when test and what options are expected to work for the ORT tests? I don't see anything here: https://ufs-weather-model.readthedocs.io/en/latest/BuildingAndRunning.html#using-the-operational-requirement-test-script
For now I would like to know if these flags do work. We have reached out to GDIT about appropriate flags. @MinsukJi-NOAA can you address the questions that @JessicaMeixner-NOAA has raised ?
@arunchawla-NOAA in terms of knowing if the flags worked, I did the following: I created an extra "cpld_decomp_p8_2" test which uses the 8,3 decomposition. I created a baseline (for the non-debug S2SAW tests) and then ran the regression tests against that baseline and the tests were successful. I did not test the debug test, because in previous tests, it gave the same answers whereas the non-debug tests gave different ones. This test was on hera. On orion, I ran the cpld_decomp_p8 set up with the 3,8 8,3 and 4,6 decompositions and did diffs of the directories and also got reproducibility. In the meantime, I'll keep working on the ORT tests, to provide further evidence that the tests are working.
The following test passed:
./opnReqTest -n cpld_control_c96_p8 -c thr,dcp,rst
with the following compiler flags
GOCART$ git diff CMakeLists.txt
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 12d5a32..4371571 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -33,7 +33,9 @@ if (UFS_GOCART)
# Ensure we build as 32-bit
message ("Force 32-bit build for GOCART")
if (CMAKE_Fortran_COMPILER_ID MATCHES "Intel")
- string (REPLACE "-real-size 64" "" CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS}")
+# string (REPLACE "-real-size 64" "" CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS}")
+ set (CMAKE_Fortran_FLAGS "-g -traceback -fp-model source -ftz -align array64byte -qno-opt-dynamic-align")
+# set (CMAKE_Fortran_FLAGS "-g -traceback -fpp -fno-alias -auto -safe-cray-ptr -ftz -assume byterecl -nowarn -sox -align array64byte -qno-opt-dynamic-align -O2 -debug minimal -fp-model source -qoverride-limits -qopt-prefetch=3 -prec-div -prec-sqrt -mp1 -O2 -fPIC")
elseif (CMAKE_Fortran_COMPILER_ID MATCHES "GNU")
string (REPLACE "-fdefault-real-8" "" CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS}")
endif()
I'll work on making the debug test pass as well as any other desired combination/test.
so if we were to replace fp-model flag with source instead of consistent in repro mode the ORT test would pass ? (instead of having to create these seperate flags just for UFS GOCART
This is the original code I tried: https://github.com/JessicaMeixner-NOAA/ufs-weather-model/tree/bug/decompissue which I have also now confirmed with the fix @DeniseWorthen pointed me to dbg also works.
I'm now trying the suggested update @arunchawla-NOAA is curious about with repro mode and replacing fp-model flag with source instead of consistent. The code is here: https://github.com/JessicaMeixner-NOAA/ufs-weather-model/tree/bug/decompissuerepro and I'm trying the following test:
./opnReqTest -n cpld_control_c96_p8 -c std,thr,dcp,rst,dbg,fhz
I'll report back when I have more information.
Here's my output of the ORT with turning on repro mode:
$ ./opnReqTest -n cpld_control_c96_p8 -c std,thr,dcp,rst,dbg,fhz
hfe10
ECF_HOST = hecflow01, ECF_PORT = 26510
test name: cpld_control_c96_p8
cases to compile: std dbg
compiling std with compile option -DAPP=S2SWA -DREPRO=ON -DCCPP_SUITES=FV3_GFS_v16_coupled_nsstNoahmpUGWPv1,FV3_GFS_v16_coupled_p7_rrtmgp,FV3_GFS_v17_coupled_p8
done compiling std
compiling dbg with compile option -DAPP=S2SWA -DREPRO=ON -DCCPP_SUITES=FV3_GFS_v16_coupled_nsstNoahmpUGWPv1,FV3_GFS_v16_coupled_p7_rrtmgp,FV3_GFS_v17_coupled_p8 -DDEBUG=ON
done compiling dbg
cases to run: std_base thr dcp rst dbg_base fhz
Running test for std_base with
THRD: 1; INPES: 3; JNPES: 8; TASKS: 192; TPN: 40
Running test for thr with
THRD: 2; INPES: 3; JNPES: 4; TASKS: 120; TPN: 20
Running test for dcp with
THRD: 1; INPES: 8; JNPES: 3; TASKS: 192; TPN: 40
Running test for rst with
THRD: 1; INPES: 3; JNPES: 8; TASKS: 192; TPN: 40
Running test for dbg_base with
THRD: 1; INPES: 3; JNPES: 8; TASKS: 192; TPN: 40
Died with error code 1
So the debug test failed, but perhaps I should be building with both -DREPRO=ON and -DDEBUG=ON, but it looks to me like all the other tests passed although the log file didn't get created.
I was able to successfully run the operational requirement test for cpld_control_c96_p8
on Hera following the steps below:
WW3/CMakeLists.txt
Append -DREPRO=ON
to all options in the ORT script:
diff --git a/tests/opnReqTest b/tests/opnReqTest
index e6aa93e0..802e98e7 100755
--- a/tests/opnReqTest
+++ b/tests/opnReqTest
@@ -102,6 +102,7 @@ build_opnReqTests() {
;;
esac
MAKE_OPT=$(echo $MAKE_OPT | sed -e 's/^ *//' -e 's/ *$//')
+ MAKE_OPT="${MAKE_OPT} -DREPRO=ON"
export COMPILE_NR=${name}
cat <<-EOF > ${RUNDIR_ROOT}/compile_${name}.env
./opnReqTest -n cpld_control_c96_p8 -c std,thr,dcp,rst,dbg,fhz
The original REPRO
flags were used in this test. These include -fp-model consistent
:
https://github.com/ufs-community/ufs-weather-model/blob/1bd68cab708af76e4f9479ab6300990861aa24a2/cmake/Intel.cmake#L19-L23
@rmontuoro Thanks for confirming the ORT does work with REPRO mode. Can you build GOCART only with REPRO compile option and the rest of model in PROD mode to see if it still passes?
By removing -no-prec-div and -no-prec-sqrt in the prod option for non 32 bit, (see: https://github.com/JessicaMeixner-NOAA/ufs-weather-model/tree/bug/twoflags) the ORT test:
./opnReqTest -n cpld_control_c96_p8 -c thr,dcp,rst,dbg,fhz
passes. If this does not affect timing, would this be an acceptable solution which would not require REPRO mode for any or all of the components? I saw in other PRs that effect on timing seemed to be judged by timing of regression tests. Would that be sufficient in this case or will other timing tests be required? (FYI: @arunchawla-NOAA @rmontuoro )
I would think this would be a better solution than having conditional builds with different flags for different components. @junwang-noaa and @DusanJovic-NOAA what is the best way to assess the impact of this change on the timing of the runs. Since this change keeps the AVX options for GOCART which seems to be critical for speed based on what @climbfuji has stated, this should work right ?
I'd suggest running RT and maybe one 35 day benchmark test to confirm the timing is not impacted much.
Preliminary tests performed on Hera using cpld_bmark_p8
returned the following timings based on 3 identical runs per case:
Case | Run # | Total wall time |
---|---|---|
current flags | 01 | 1044.692037 |
current flags | 02 | 1049.168395 |
current flags | 03 | 1034.183660 |
Average | 1042.68 | |
Std. Dev. | 7.69 | |
remove -no-prec flags | 01 | 1039.495619 |
remove -no-prec flags | 02 | 1030.581493 |
remove -no-prec flags | 03 | 1041.315700 |
Average | 1037.13 | |
Std. Dev. | 5.74 |
Note that -no-prec
flags were removed as:
diff --git a/cmake/Intel.cmake b/cmake/Intel.cmake
index a91bc30c..b71b172d 100644
--- a/cmake/Intel.cmake
+++ b/cmake/Intel.cmake
@@ -24,7 +24,7 @@ else()
if(32BIT)
set(CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS} -O2 -debug minimal -fp-model source -qoverride-limits -qopt-prefetch=3")
else()
- set(CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS} -O2 -debug minimal -fp-model source -qoverride-limits -qopt-prefetch=3 -no-prec-div -no-prec-sqrt")
+ set(CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS} -O2 -debug minimal -fp-model source -qoverride-limits -qopt-prefetch=3")
endif()
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -O2 -debug minimal")
set(FAST "-fast-transcendentals")
These flags are still used in GFDL_atmos_cubed_sphere: https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/blob/43f7ed39fcd302e8404a152011f7a02c5d76ddc9/cmake/compiler_flags_Intel_Fortran.cmake#L4
Okay, so I some how majorly goofed. So many apologies for this, but I misread or read what I wanted to see, but the removing just the two-flags at the top level does in fact not work for the decomposition tests. I kept re-running tests last night because they all of a sudden were not working and I kept thinking I had done something wrong, but then this morning I rechecked in the original log file which does in fact say the test failed: https://github.com/JessicaMeixner-NOAA/ufs-weather-model/blob/bug/twoflags/tests/OpnReqTests_cpld_control_c96_p8_hera.intel.log#L390-L394
I still think it's valuable information for what @rmontuoro ran and discovered, because if it holds that other configurations also do not have a boost from using these flags then perhaps they should be removed. However, it still appears we need to find a clean solution for the decomposition issue.
Below are results from my latest ORT on Hera:
ufs-weather-model
revision a46c7edcWW3/CMakeLists.txt
fix included
diff --git a/cmake/Intel.cmake b/cmake/Intel.cmake
index a91bc30c..d9d5f767 100644
--- a/cmake/Intel.cmake
+++ b/cmake/Intel.cmake
@@ -24,7 +24,7 @@ else()
if(32BIT)
set(CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS} -O2 -debug minimal -fp-model source -qoverride-limits -qopt-prefetch=3")
else()
Test | Case | Result |
---|---|---|
dbg_base | cpld_control_c96_p8 | PASS |
dcp | cpld_control_c96_p8 | PASS |
fhz | cpld_control_c96_p8 | PASS |
rst | cpld_control_c96_p8 | PASS |
std_base | cpld_control_c96_p8 | PASS |
thr | cpld_control_c96_p8 | PASS |
Using the latest weather model (846f0e4, #1171), the decomposition test is successful: the cpld_decomp_p8
regression test works not only as it is (INPES=4
and JNPES=6
), but also with INPES=8
and JNPES=3
.
@MinsukJi-NOAA Thanks for testing. Since the problem is resolved, I will close the issue.
Sure, I will do some testing. But my understanding is that the changes are independent to decomposition, it is to fix the writing time for restart files in the two phases run.
On Sun, Mar 20, 2022 at 12:05 AM Fanglin Yang @.***> wrote:
Please see https://docs.google.com/document/d/11vo2-DyrR2LWbQoqprlVoTxhSEnpvblWPw30WqOl6uA/edit for a track of changes made to fv3-atm and ccpp repos after March 4 and before March 10 when Minsuk first reported the decomposition failure. Can we reverse the " fix 2phases intermediate restart" update and see if the decomposition RT works ?
— Reply to this email directly, view it on GitHub https://github.com/ufs-community/ufs-weather-model/issues/1103#issuecomment-1073162038, or unsubscribe https://github.com/notifications/unsubscribe-auth/AI7D6TMTFLFDU3VT6U3WDJDVA2PWXANCNFSM5QNZ5QIQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
You are receiving this because you commented.Message ID: @.***>
Description
cpld_decomp_p8
fails with a different domain decompositionTo Reproduce:
What compilers/machines are you seeing this with? Intel/Hera Give explicit steps to reproduce the behavior.
9b6b740
)cd ufs-weather-model/tests; ./rt.sh -n cpld_decomp_p8
. This test will PASSModify
ufs-weather-model/tests/tests/cpld_decomp_p8
:For a comparison,
38aa634
of the ufs weather model