ufs-community / ufs-weather-model

UFS Weather Model

failure of regression test control_c48 on wcoss_dell_p3 #811

Closed JamesAbeles-NOAA closed 3 years ago

JamesAbeles-NOAA commented 3 years ago

I cloned the latest ufs weather model

git clone -q --recursive https://github.com/ufs-community/ufs-weather-model
cd ufs-weather-model/tests
./rt.sh -l rt.conf -k -n control_c48 >&rt.test&

It says the test failed. The directory is here: /gpfs/dell2/emc/modeling/noscrub/James.A.Abeles/ufs-weather-model/tests/

The job ran to completion but did not validate.

climbfuji commented 3 years ago

Hmm. This is from the latest commit:

https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/RegressionTests_wcoss_dell_p3.log

baseline dir = /gpfs/dell2/emc/modeling/noscrub/emc.nemspara/RT/NEMSfv3gfs/develop-20210907/control_c48
working dir  = /gpfs/dell2/ptmp/Dom.Heinzeller/FV3_RT/rt_10554/control_c48
Checking test 024 control_c48 results ....
 Comparing sfcf000.nc .........OK
 Comparing sfcf024.nc .........OK
 Comparing atmf000.nc .........OK
 Comparing atmf024.nc .........OK
 Comparing RESTART/coupler.res .........OK
 Comparing RESTART/fv_core.res.nc .........OK
 Comparing RESTART/fv_core.res.tile1.nc .........OK
 Comparing RESTART/fv_core.res.tile2.nc .........OK
 Comparing RESTART/fv_core.res.tile3.nc .........OK
 Comparing RESTART/fv_core.res.tile4.nc .........OK
 Comparing RESTART/fv_core.res.tile5.nc .........OK
 Comparing RESTART/fv_core.res.tile6.nc .........OK
 Comparing RESTART/fv_srf_wnd.res.tile1.nc .........OK
 Comparing RESTART/fv_srf_wnd.res.tile2.nc .........OK
 Comparing RESTART/fv_srf_wnd.res.tile3.nc .........OK
 Comparing RESTART/fv_srf_wnd.res.tile4.nc .........OK
 Comparing RESTART/fv_srf_wnd.res.tile5.nc .........OK
 Comparing RESTART/fv_srf_wnd.res.tile6.nc .........OK
 Comparing RESTART/fv_tracer.res.tile1.nc .........OK
 Comparing RESTART/fv_tracer.res.tile2.nc .........OK
 Comparing RESTART/fv_tracer.res.tile3.nc .........OK
 Comparing RESTART/fv_tracer.res.tile4.nc .........OK
 Comparing RESTART/fv_tracer.res.tile5.nc .........OK
 Comparing RESTART/fv_tracer.res.tile6.nc .........OK
 Comparing RESTART/phy_data.tile1.nc .........OK
 Comparing RESTART/phy_data.tile2.nc .........OK
 Comparing RESTART/phy_data.tile3.nc .........OK
 Comparing RESTART/phy_data.tile4.nc .........OK
 Comparing RESTART/phy_data.tile5.nc .........OK
 Comparing RESTART/phy_data.tile6.nc .........OK
 Comparing RESTART/sfc_data.tile1.nc .........OK
 Comparing RESTART/sfc_data.tile2.nc .........OK
 Comparing RESTART/sfc_data.tile3.nc .........OK
 Comparing RESTART/sfc_data.tile4.nc .........OK
 Comparing RESTART/sfc_data.tile5.nc .........OK
 Comparing RESTART/sfc_data.tile6.nc .........OK

[0] The total amount of wall time                        = 434.260632

Test 024 control_c48 PASS

JamesAbeles-NOAA commented 3 years ago

If I could get back on Mars, I would post what message I got.

climbfuji commented 3 years ago

Mars is WCOSS Cray?

On Sep 16, 2021, at 12:32 PM, JamesAbeles-NOAA @.***> wrote:

If I could get back on Mars, I would post what message I got.

JamesAbeles-NOAA commented 3 years ago

Yes, I remembered I had a window. Here is what I see:

cat fail_test
control_c48 001 failed in check_result

baseline dir = /gpfs/dell2/emc/modeling/noscrub/emc.nemspara/RT/NEMSfv3gfs/develop-20210907/control_c48
working dir = /gpfs/dell2/ptmp/James.A.Abeles/FV3_RT/rt_5253/control_c48
Checking test 001 control_c48 results ....
 Comparing sfcf000.nc ............ALT CHECK......NOT OK
 Comparing sfcf024.nc ............ALT CHECK......NOT OK
 Comparing atmf000.nc ............ALT CHECK......NOT OK
 Comparing atmf024.nc ............ALT CHECK......NOT OK
 Comparing RESTART/coupler.res .........OK
 Comparing RESTART/fv_core.res.nc .........OK
 Comparing RESTART/fv_core.res.tile1.nc ............ALT CHECK......NOT OK
 Comparing RESTART/fv_core.res.tile2.nc ............ALT CHECK......NOT OK
 Comparing RESTART/fv_core.res.tile3.nc ............ALT CHECK......NOT OK
 Comparing RESTART/fv_core.res.tile4.nc ............ALT CHECK......NOT OK
 Comparing RESTART/fv_core.res.tile5.nc ............ALT CHECK......NOT OK
 Comparing RESTART/fv_core.res.tile6.nc ............ALT CHECK......NOT OK
 Comparing RESTART/fv_srf_wnd.res.tile1.nc ............ALT CHECK......NOT OK
 Comparing RESTART/fv_srf_wnd.res.tile2.nc ............ALT CHECK......NOT OK
 Comparing RESTART/fv_srf_wnd.res.tile3.nc ............ALT CHECK......NOT OK
 Comparing RESTART/fv_srf_wnd.res.tile4.nc ............ALT CHECK......NOT OK
 Comparing RESTART/fv_srf_wnd.res.tile5.nc ............ALT CHECK......NOT OK
 Comparing RESTART/fv_srf_wnd.res.tile6.nc ............ALT CHECK......NOT OK
 Comparing RESTART/fv_tracer.res.tile1.nc ............ALT CHECK......NOT OK
 Comparing RESTART/fv_tracer.res.tile2.nc ............ALT CHECK......NOT OK
 Comparing RESTART/fv_tracer.res.tile3.nc ............ALT CHECK......NOT OK
 Comparing RESTART/fv_tracer.res.tile4.nc ............ALT CHECK......NOT OK
 Comparing RESTART/fv_tracer.res.tile5.nc ............ALT CHECK......NOT OK
 Comparing RESTART/fv_tracer.res.tile6.nc ............ALT CHECK......NOT OK
 Comparing RESTART/phy_data.tile1.nc ............ALT CHECK......NOT OK
 Comparing RESTART/phy_data.tile2.nc ............ALT CHECK......NOT OK
 Comparing RESTART/phy_data.tile3.nc ............ALT CHECK......NOT OK
 Comparing RESTART/phy_data.tile4.nc ............ALT CHECK......NOT OK
 Comparing RESTART/phy_data.tile5.nc ............ALT CHECK......NOT OK
 Comparing RESTART/phy_data.tile6.nc ............ALT CHECK......NOT OK
 Comparing RESTART/sfc_data.tile1.nc ............ALT CHECK......NOT OK
 Comparing RESTART/sfc_data.tile2.nc ............ALT CHECK......NOT OK
 Comparing RESTART/sfc_data.tile3.nc ............ALT CHECK......NOT OK
 Comparing RESTART/sfc_data.tile4.nc ............ALT CHECK......NOT OK
 Comparing RESTART/sfc_data.tile5.nc ............ALT CHECK......NOT OK
 Comparing RESTART/sfc_data.tile6.nc ............ALT CHECK......NOT OK

[0] The total amount of wall time = 433.593927

Test 001 control_c48 FAIL
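
For long logs like this one, a quick tally of passed and failed comparisons can make the pattern easier to see. The following is only a sketch: summarize_rt_log is a hypothetical helper name, not part of rt.sh, and it assumes each comparison ends in ".OK" on success or "NOT OK" on failure, as in the logs quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of rt.sh): tally comparison results in a log.
# Assumption: a passing comparison ends in ".OK", a failing one in "NOT OK".
summarize_rt_log() {
    log_file="$1"
    # grep -o counts occurrences, so this also works if the log was pasted on one line.
    passed=$(grep -o '\.OK' "$log_file" | wc -l | tr -d ' ')
    failed=$(grep -o 'NOT OK' "$log_file" | wc -l | tr -d ' ')
    echo "passed=$passed failed=$failed"
}
```

It could be called as summarize_rt_log fail_test from the tests directory.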

MinsukJi-NOAA commented 3 years ago

@JamesAbeles-NOAA, can you try ./rt.sh -k -n control_c48 >&rt.test& ? The -n option has not been tried together with the -l option. The -n option will use rt.conf as default. If needed, using both -n and -l at the same time can be easily implemented in the future.

MinsukJi-NOAA commented 3 years ago

This may have to do with not using the ecflow for these single jobs (-n option). I am running a test now, and will report back.

JamesAbeles-NOAA commented 3 years ago

I did this: ./rt.sh -k -n control_c48 >&rt.test.1 and it failed again. Same location: /gpfs/dell2/emc/modeling/noscrub/James.A.Abeles/ufs-weather-model/tests/

MinsukJi-NOAA commented 3 years ago

@JamesAbeles-NOAA I am not able to verify the failure. I git cloned the latest develop (hash 9007b8) and invoked ./rt.sh -k -n control_c48 >out 2>&1 & and it passed: see /gpfs/dell2/emc/modeling/noscrub/Minsuk.Ji/ISS811/tests/RegressionTests_wcoss_dell_p3.log

climbfuji commented 3 years ago

@JamesAbeles-NOAA please check for any automatically loaded modules in your environment (.bashrc, .bash_profile, etc.).
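
One way to do that check is to search the shell startup files for module commands. This is a minimal sketch: find_module_commands is a hypothetical helper name, and the list of startup files to pass in depends on your shell and site setup.

```shell
#!/bin/sh
# Hypothetical helper: print lines in the given files that run module commands.
# Commented-out lines (starting with '#') are not matched.
find_module_commands() {
    for f in "$@"; do
        # Skip files that do not exist or are not readable.
        [ -r "$f" ] || continue
        grep -nH -E '^[[:space:]]*module[[:space:]]+(load|use|swap|unload|purge)' "$f"
    done
    return 0
}
```

For example, find_module_commands "$HOME/.bashrc" "$HOME/.bash_profile" "$HOME/.profile" prints each match with its file name and line number. Running module list in a fresh login shell shows what is actually loaded.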

JamesAbeles-NOAA commented 3 years ago

Do I need to run module purge before running the regression test?

MinsukJi-NOAA commented 3 years ago

Usually, no. That should be taken care of by NEMS/src/conf/module-setup.sh.inc.

MinsukJi-NOAA commented 3 years ago

I notice that you have modified cmake/Intel.cmake in your source directory.

--- a/cmake/Intel.cmake
+++ b/cmake/Intel.cmake
@@ -29,9 +29,9 @@ else()
     set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -O2 -debug minimal")
     set(FAST "-fast-transcendentals")
     if(AVX2)
-        set(CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS} -march=core-avx2")
-        set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -march=core-avx2")
-        set(CMAKE_Fortran_FLAGS_OPT "-no-prec-div -no-prec-sqrt -xCORE-AVX2")
+        set(CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS} -march=core-avx2 -mtune=core-avx2")
+        set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -march=core-avx2 -mtune=core-avx2" )
+        set(CMAKE_Fortran_FLAGS_OPT "-no-prec-div -no-prec-sqrt -march=core-avx2 -mtune=core-avx2")
     elseif(SIMDMULTIARCH)
         set(CMAKE_Fortran_FLAGS "${CMAKE_Fortran_FLAGS} -axSSE4.2,CORE-AVX2")
         set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -axSSE4.2,CORE-AVX2")

JamesAbeles-NOAA commented 3 years ago

Oh, yes, you are correct. I added -mtune since that is what we are advised to use for performance on wcoss2. Sorry, I forgot about that.

MinsukJi-NOAA commented 3 years ago

Let me know how your test goes with the cmake file reverted.
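
Before rerunning, it may help to confirm that the checkout has no other local modifications that could change the results. A minimal sketch, assuming a git version that supports the -C option; list_local_changes is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: list tracked files with uncommitted changes in a checkout.
# Prints nothing when the tracked files match the last commit.
list_local_changes() {
    repo_dir="${1:-.}"
    git -C "$repo_dir" status --porcelain --untracked-files=no
}
```

If cmake/Intel.cmake shows up in the output, git checkout -- cmake/Intel.cmake discards the local edit, so save a copy first if the -mtune change is still needed elsewhere.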